Hlearn_v0.1

73 days ago by Ch4os

Hlearn (Hamlearn) is a machine-learning algorithm created by Patrick Hammer (Ch4os),
in which an agent can learn to survive in a world without meta-information from the programmer.
It can automatically classify things as good or bad for itself using its own wellness attributes,
such as its battery state or its health.

Scheme of the learning system:


The perception system:
An agent has to be able to operate in a complex world, and such a world has to give it a lot of sensor inputs
if the agent is to recognize the things that are important to it.
So the inputs are high-dimensional, and the agent has to deal with that.
The solution is a general classification algorithm. I've used self-organizing maps for that.
You can read about them on Wikipedia: SelfOrganizingMap@Wiki
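
To make the classification idea concrete, here is a minimal sketch of a one-dimensional self-organizing map in plain Python. It is not the code from Hamlib: the function names, the Gaussian neighbourhood and the linearly shrinking rate and radius are all assumptions chosen for illustration.

```python
import math
import random

def best_matching_unit(weights, x):
    """Return the index of the node whose weight vector is closest to x."""
    def sqdist(k):
        return sum((w - v) ** 2 for w, v in zip(weights[k], x))
    return min(range(len(weights)), key=sqdist)

def som_update(weights, x, rate=0.5, radius=1.0):
    """One training step on a 1-D chain of nodes; returns the winning index."""
    bmu = best_matching_unit(weights, x)
    for k, node in enumerate(weights):
        # Gaussian neighbourhood: the winner moves most, distant nodes least.
        influence = math.exp(-((k - bmu) ** 2) / (2.0 * radius ** 2))
        for d in range(len(node)):
            node[d] += rate * influence * (x[d] - node[d])
    return bmu

def train_som(samples, n_nodes=4, epochs=30, seed=0):
    """Train a small SOM; the shrinking schedule is an assumed, simple choice."""
    rng = random.Random(seed)
    dim = len(samples[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    for epoch in range(epochs):
        frac = 1.0 - epoch / float(epochs)
        for x in samples:
            som_update(weights, x, rate=0.5 * frac,
                       radius=max(0.5, frac * n_nodes / 2.0))
    return weights
```

After training, best_matching_unit(weights, sensor_vector) reduces a high-dimensional sensor vector to one node index, which could then serve as the discrete input of the other systems.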

The prediction system:
This system lets the agent choose the strategy that gives the best long-term reward.
It does that by summing up the predicted reward of each action over n think-steps.
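
One possible reading of "summing the reward over n think-steps" is sketched below. It assumes a reward table Q keyed by (input, action) and a prediction table P that maps (input, action) to the expected next input. The exhaustive search over follow-up actions and the names lookahead_value and best_action are assumptions of this sketch, not taken from the C implementation.

```python
def lookahead_value(Q, P, state, action, steps, actions):
    """Summed predicted reward of taking `action` now, then acting greedily."""
    value = Q.get((state, action), 0.0)
    if steps <= 1:
        return value
    # Unknown transitions are assumed to leave the input unchanged.
    next_state = P.get((state, action), state)
    return value + max(lookahead_value(Q, P, next_state, a, steps - 1, actions)
                       for a in actions)

def best_action(Q, P, state, steps, actions):
    """Pick the first action with the highest summed reward over `steps`."""
    return max(actions,
               key=lambda a: lookahead_value(Q, P, state, a, steps, actions))

# Tiny example world: action 0 stays in place, action 1 moves one state on.
P = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 2, (2, 0): 2, (2, 1): 2}
Q = {(0, 0): 1.0, (1, 1): 10.0}
print(best_action(Q, P, 0, 1, [0, 1]))  # one think-step: takes the small reward
print(best_action(Q, P, 0, 2, [0, 1]))  # two think-steps: heads for the big one
```

With one think-step the agent prefers action 0 (reward 1 now). With two think-steps it prefers action 1, because the summed reward 0 + 10 beats 1 + 1.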

The quality system:
The agent has to build a copy of its environment to make good predictions.
I've used a self-organizing map for that because it is good at recognizing incomplete patterns.
The idea: value = predicted_next_step, target_value = next_step -> now optimize and calculate the error.
After some optimization time the agent has a good copy of the environment, which makes it possible for it to predict.
The error can also be used to control how far the agent looks into the future (the n-step value of the prediction system).
When the predictions are good, it is better to look farther into the future.
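
The link between prediction error and look-ahead depth can be sketched as follows. This uses a plain lookup table instead of a self-organizing map, and both the exponential smoothing of the error and the linear mapping from error to depth are assumptions made for this illustration. WorldModel and lookahead_steps are hypothetical names.

```python
class WorldModel:
    """Table-based copy of the environment with a running error estimate."""

    def __init__(self, smoothing=0.5):
        self.prediction = {}        # (input, action) -> predicted next input
        self.error = 1.0            # 1.0 = nothing is known yet
        self.smoothing = smoothing  # assumed smoothing factor

    def update(self, last_input, last_action, now_input):
        """Compare prediction with reality, correct it, and track the error."""
        key = (last_input, last_action)
        missed = self.prediction.get(key) != now_input
        if missed:
            self.prediction[key] = now_input
        self.error += self.smoothing * ((1.0 if missed else 0.0) - self.error)
        return missed

    def lookahead_steps(self, max_steps):
        """Good predictions (low error) allow a deeper look into the future."""
        return 1 + int(round((1.0 - self.error) * (max_steps - 1)))
```

A fresh model returns lookahead_steps(4) == 1. After the same transition has been confirmed several times in a row, the error shrinks towards zero and the depth grows towards max_steps.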

The motor system:
The output of the prediction system is an action.
An action can be "walking", "climbing up a ladder", etc.
The simulation can use these simple actions directly as agent output.
But what if the agent should learn walking by itself, and there are 10 muscles to control just one foot?
Then this system is the solution:
At the beginning, an action like "0" generates a randomized "muscle-control pattern" like "011110100011000..."
When the reward of this action is good, the pattern is saved. That's all.
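
A minimal sketch of this idea, assuming that "good" means a reward above a threshold and better than the reward of the pattern saved so far. MotorMemory, propose and feedback are hypothetical names. The pattern length of 15 bits and the threshold of 0 are illustration values only.

```python
import random

class MotorMemory:
    """Maps abstract actions to muscle-control bit patterns."""

    def __init__(self, n_muscles=15, seed=None):
        self.n_muscles = n_muscles
        self.rng = random.Random(seed)
        self.saved = {}  # action -> (pattern, reward it earned)

    def propose(self, action):
        """Return the saved pattern for `action`, or a fresh random one."""
        if action in self.saved:
            return self.saved[action][0]
        return "".join(self.rng.choice("01") for _ in range(self.n_muscles))

    def feedback(self, action, pattern, reward, threshold=0.0):
        """Save the pattern when its reward is good (assumed criterion)."""
        best = self.saved.get(action)
        if reward > threshold and (best is None or reward > best[1]):
            self.saved[action] = (pattern, reward)
```

Usage: call propose(action) to get the pattern to send to the muscles, run the simulation step, then pass the observed reward to feedback. Until a pattern has been saved, every call to propose tries a new random pattern.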

Hlearn example implementation in C, downloadable at:
Hamlib@Sourceforge

A simple test implementation in Sage (Python):
(although this learning model profits from generalization, this implementation doesn't use
the SOM perception system; it uses a Q[input][action]=reward array, as in Q-learning, instead)
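
As a standalone illustration of such a table, the sketch below stores the last observed reward per (input, action) pair in a dictionary and picks the best-known action with a little random exploration. The tuple keys, the function names and the default exploration rate of 10% are assumptions of this sketch. The worksheet code uses a flat integer index instead.

```python
import random

def record_reward(Q, observed_input, action, reward):
    """Overwrite the stored reward for this (input, action) pair."""
    Q[(observed_input, action)] = reward

def choose_action(Q, observed_input, actions, explore=0.1, rng=random):
    """Mostly pick the best-known action; sometimes try a random one."""
    if rng.random() < explore:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q.get((observed_input, a), 0.0))
```

Unknown pairs default to a reward of 0 here, so an action is only preferred once it has earned a positive reward for that input.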

%latex
\begin{align}
&\text{Perception system:} \\
&Q(i,a) \Leftarrow r(i) \\
\\
&\text{Prediction system:} \\
&A, R: \Longleftrightarrow \forall a \in A, \forall r \in R: r \Leftarrow \sum_{t=0}^{steps} Q_t(P(i,a), n) \\
\\
&\text{Quality system:} \\
&P_{t-1}(i_{t-1}, a_{t-1}) \Leftarrow i \\
\\
&\text{Motor system:} \\
&R: \Longleftrightarrow \forall r \in R: r \leq a \\
\\
&a \ldots \text{action that gives the best reward} \\
&r \ldots \text{reinforcement function} \\
&R \ldots \text{reward set} \\
&A \ldots \text{action set} \\
&P \ldots \text{prediction set} \\
&Q \ldots \text{input-action-reward set} \\
&i \ldots \text{input}
\end{align}
       
#test application function-definitions:
size=0 #wellness
worldsize=4
px=0;py=0;lx=0
def rewardfunc():
    return size
def envfunc(action):
    nextinput=lx
    return nextinput
       
#Hlearn definitions:
dimsize=5
nactions=5
inputreward={}
prediction={} #two dimensional array [action,input]
for i in range(dimsize):
    for j in range(nactions):
        prediction[j*nactions+i]=0
        inputreward[j*nactions+i]=0
       
#Hlearn system:
perceptinput=0
error=0
def percept(input,action):
    #store the current reward for the (action,input) pair
    inputreward[action*nactions+input]=rewardfunc()
def predict(input): #two passes if no error, one pass if error
    maxval=-9999
    max=0
    for z in range(2-error):
        for i in range(nactions):
            if inputreward[i*nactions+prediction[i*nactions+input]]>maxval:
                maxval=inputreward[i*nactions+prediction[i*nactions+input]]
                max=i
    if random()>0.9: #explore: random movement action in 1..nactions-1
        return int(random()*(nactions-1)+1)
    return max
def quality(last,lastaction,now):
    global error #without this, the assignments below would only set a local variable
    if prediction[lastaction*nactions+last]!=now:
        prediction[lastaction*nactions+last]=now
        error=1
    else:
        error=0
def step(last,lastaction,input):
    percept(last,lastaction)
    output=predict(input)
    quality(last,lastaction,input)
    return output
       
#test application definitions:
statistics={}
for i in range(nactions):
    statistics[i]=0
last=0
action=0
lastaction=action
input=1
       
#test application:
G=Graphics()
v=[]
posx=worldsize/2;posy=worldsize/2
sollx=random()*worldsize;solly=random()*worldsize
lastabs=sqrt((posx-sollx)*(posx-sollx)+(posy-solly)*(posy-solly))
sx=posx;sy=posy
for s in range(100000):
    action=step(last,lastaction,input)
    if action==1:
        posx+=0.1
    if action==2:
        posx-=0.1
    if action==3:
        posy+=0.1
    if action==4:
        posy-=0.1
    posx=max(0,min(4-0.1,posx))
    posy=max(0,min(4-0.1,posy))
    px=posx;py=posy;
    if posx>sollx and posy>solly: #simple categorization since there are no SOMs in this implementation
        lx=1
    if posx>sollx and posy<solly:
        lx=2
    if posx<sollx and posy>solly:
        lx=3
    if posx<sollx and posy<solly:
        lx=4
    abs=sqrt((posx-sollx)*(posx-sollx)+(posy-solly)*(posy-solly))
    size=lastabs-abs #reward: how much closer the agent got to the target
    lastabs=abs
    if abs<0.2: #target reached: place a new one
        sollx=random()*worldsize;solly=random()*worldsize
    if s>99500 and s%2==0: #record frames for the last steps only
        G=line([(0,0),(worldsize,0)],rgbcolor=(0,0,1),thickness=2)
        G+=line([(worldsize,0),(worldsize,worldsize)],rgbcolor=(0,0,1),thickness=2)
        G+=line([(worldsize,worldsize),(0,worldsize)],rgbcolor=(0,0,1),thickness=2)
        G+=line([(0,worldsize),(0,0)],rgbcolor=(0,0,1),thickness=2)
        #draw sollobj:
        G+=line([(sollx,solly),(sollx+0.1,solly)],rgbcolor=(0,1,0),thickness=2)
        G+=line([(sollx+0.1,solly),(sollx+0.1,solly+0.1)],rgbcolor=(0,1,0),thickness=2)
        G+=line([(sollx+0.1,solly+0.1),(sollx,solly+0.1)],rgbcolor=(0,1,0),thickness=2)
        G+=line([(sollx,solly+0.1),(sollx,solly)],rgbcolor=(0,1,0),thickness=2)
        #draw agent:
        G+=line([(posx,posy),(posx+0.1,posy)],rgbcolor=(0.00001*s,0,0),thickness=2)
        G+=line([(posx+0.1,posy),(posx+0.1,posy+0.1)],rgbcolor=(0.00001*s,0,0),thickness=2)
        G+=line([(posx+0.1,posy+0.1),(posx,posy+0.1)],rgbcolor=(0.00001*s,0,0),thickness=2)
        G+=line([(posx,posy+0.1),(posx,posy)],rgbcolor=(0.00001*s,0,0),thickness=2)
        v.append(G)
    sx=posx;sy=posy;
    statistics[action]=statistics[action]+1
    lastaction=action
    last=input
    input=envfunc(action)
print("actions used how often:")
for i in range(nactions):
    print(statistics[i])
a=animate(v,xmin=0,ymin=0)
a.show()
       
actions used how often:
4807
25465
22857
23575
23296