See the source HTML code for this page for the simulation definition that is parsed and executed by the WebSim applet.
MDP: A 2D grid [0..1, 0..1] discretized into units of 0.2. Four actions are possible: 0 - increment the current x coordinate by 0.2; 0.25 - increment the current y coordinate by 0.2; 0.5 - decrement the current x coordinate by 0.2; 0.75 - decrement the current y coordinate by 0.2. State [0,0] is the initial state. State [1,1] is an absorbing state with a defined value of 0; transitions into it return a reinforcement of -1, and all other state transitions return a reinforcement of 1. States on the boundary of the state space (for example [1,0.5]) have only three legal moves instead of four. The objective is to find a path to the goal state that minimizes the total reinforcement received.
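As a rough illustration (not WebSim's actual simulation definition; the names legal_actions and step are ours), the Python sketch below writes down the grid dynamics and reinforcements just described:

    STEP = 0.2
    GOAL = (1.0, 1.0)
    START = (0.0, 0.0)
    ACTIONS = {
        0.0:  ( STEP, 0.0),   # increment x by 0.2
        0.25: ( 0.0,  STEP),  # increment y by 0.2
        0.5:  (-STEP, 0.0),   # decrement x by 0.2
        0.75: ( 0.0, -STEP),  # decrement y by 0.2
    }

    def legal_actions(state):
        """Moves that would leave the grid are illegal, so boundary states
        such as [1, 0.5] have only three legal moves."""
        x, y = state
        return [a for a, (dx, dy) in ACTIONS.items()
                if 0.0 <= round(x + dx, 1) <= 1.0 and 0.0 <= round(y + dy, 1) <= 1.0]

    def step(state, action):
        """Apply one action; return (next_state, reinforcement)."""
        x, y = state
        dx, dy = ACTIONS[action]
        nxt = (round(x + dx, 1), round(y + dy, 1))
        # Entering the absorbing state [1,1] returns -1; every other transition returns 1.
        return nxt, (-1.0 if nxt == GOAL else 1.0)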
Function Approximator: a lookup table.
Learning algorithm: Backprop
RL algorithm: Residual Gradient Q-Learning
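The sketch below shows one way the residual gradient Q-Learning update can look for a lookup table, reusing the grid helpers sketched above; it is an assumption about the algorithm's form, not WebSim's code. Because the task minimizes total reinforcement, the backup uses the minimum Q-value of the next state, and "rate" and "gamma" play the same roles as the WebSim parameters of those names.

    from collections import defaultdict
    import random

    def residual_gradient_update(Q, s, a, r, s_next, rate, gamma):
        """One gradient step on the squared Bellman residual
        0.5 * (Q(s,a) - (r + gamma * min_a' Q(s',a')))**2."""
        if s_next == GOAL:
            next_value, a_best = 0.0, None  # absorbing state has a defined value of 0
        else:
            a_best = min(legal_actions(s_next), key=lambda b: Q[(s_next, b)])
            next_value = Q[(s_next, a_best)]
        delta = Q[(s, a)] - (r + gamma * next_value)
        # Unlike ordinary Q-Learning, the residual gradient also flows through
        # the target term, so both table entries are adjusted.
        Q[(s, a)] -= rate * delta
        if a_best is not None:
            Q[(s_next, a_best)] += rate * gamma * delta

    def run_trial(Q, rate=0.1, gamma=1.0):
        """One trial from the initial state [0,0] with random exploration."""
        s = START
        while s != GOAL:
            a = random.choice(legal_actions(s))
            s_next, r = step(s, a)
            residual_gradient_update(Q, s, a, r, s_next, rate, gamma)
            s = s_next

    # Usage: Q = defaultdict(float), then call run_trial(Q) repeatedly.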
Displays:
1) Variables and Rates (upper left corner)
2) 2D graph of log error vs. learning time (upper right corner)
3) 3D graph of value function (lower left corner)
4) 3D graph of policy
Value Function Display: Remember that the value of a state is the sum of the reinforcements received when starting in that state and performing successive transitions until the absorbing state is reached. The X-axis and Y-axis correspond to state space. The Z-axis (height) is the value (minimum Q-value) in each state. The learned function should look like a stepped slope with the lowest point at [1,1] and the highest point at [0,0].
Policy Display: The X-axis and Y-axis correspond to state space. The Z-axis (height) is the action considered best in each state: 0 - increment the current x coordinate by 0.2; 0.25 - increment the current y coordinate by 0.2; 0.5 - decrement the current x coordinate by 0.2; 0.75 - decrement the current y coordinate by 0.2. The learned policy is correct when all states (with the exception of the [1,y] and [x,1] rows) have a value of 0 or 0.25. The [1,y] row should have a value of 0.25, and the [x,1] row should have a value of 0.
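One way to check what the two 3D displays should converge to is to compute the optimal values and greedy policy directly by value iteration on the same grid. The checker below is an assumption for inspection purposes, not part of the applet, and it reuses the helpers sketched earlier; with gamma = 1 the values are the undiscounted sums of reinforcements described for the value function display.

    def optimal_values_and_policy(gamma=1.0, sweeps=100):
        grid = [round(0.2 * i, 1) for i in range(6)]
        V = {(x, y): 0.0 for x in grid for y in grid}
        policy = {}
        for _ in range(sweeps):
            for s in V:
                if s == GOAL:
                    V[s] = 0.0  # defined value of the absorbing state
                    continue
                best_q, best_a = None, None
                for a in legal_actions(s):
                    s_next, r = step(s, a)
                    q = r + gamma * V[s_next]
                    if best_q is None or q < best_q:  # minimize reinforcement received
                        best_q, best_a = q, a
                V[s], policy[s] = best_q, best_a
        return V, policy

    # The value surface should step down from its highest point at [0,0]
    # toward [1,1], and the policy should be 0 or 0.25 everywhere except
    # the [1,y] and [x,1] rows, as described above.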
Suggestions for Experiments: Change the value of gamma and observe the resulting change in the optimal value function. To change it, click on the # symbol by the WebSim logo and edit the "gamma" parameter. Remember that it may be necessary to decrease the learning rate parameter "rate" for larger values of gamma. Click here to find a more complete description of WebSim(c) and how it can be used to perform experiments with many different RL algorithms and MDPs.
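To get a feel for the suggested gamma experiment, the short sketch below (an assumed helper, not a WebSim feature) evaluates the discounted return along the shortest path from [0,0], which consists of nine transitions returning 1 followed by -1 on entering [1,1]:

    def discounted_return(gamma, rewards=(1.0,) * 9 + (-1.0,)):
        return sum(r * gamma ** t for t, r in enumerate(rewards))

    for g in (1.0, 0.9, 0.5):
        print(g, round(discounted_return(g), 3))
    # gamma = 1.0 gives 8.0; smaller gamma shrinks the values of states far
    # from [1,1], flattening the stepped slope in the value function display.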