WebSim

All of this code is (c) 1996 by the respective authors, is freeware, and may be freely distributed. If modifications are made, please say so in the comments.


This page requires a Java-aware browser and may take several minutes to download.

Simulation Definition:

See the HTML source of this page for the simulation definition that is parsed and executed by the WebSim applet.

MDP:
A 2D grid [0..1,0..1] discretized into units of 0.2. Four actions are possible: 0 - increment the current x coordinate by 0.2; 0.25 - increment the current y coordinate by 0.2; 0.5 - decrement the current x coordinate by 0.2; 0.75 - decrement the current y coordinate by 0.2. State [0,0] is the initial state. State [1,1] is an absorbing state; it has a defined value of 0 and returns a -1 reinforcement. All other state transitions return a reinforcement of 1. States on the boundaries of state space (for example [1,0.5]) have only three legal moves instead of four. The objective is to find a path to the goal state that minimizes the total reinforcement received.
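
For readers who prefer code to prose, here is a minimal sketch of these dynamics in Java. This is not the applet's source: the class and method names are invented, the grid is held as integer indices 0..5 (so an index of 5 corresponds to the coordinate 1.0), and the -1 is read as the reinforcement returned by the absorbing state itself, which is one possible reading of the definition above.

    // Illustrative sketch of the grid MDP described above (not the applet's source).
    public class GridMDP {
        static final int N = 5;                                  // 5 * 0.2 = 1.0
        static final double[] ACTIONS = {0.0, 0.25, 0.5, 0.75};  // WebSim's action codes

        // Change in x and y for action indices 0..3 (codes 0, 0.25, 0.5, 0.75).
        static final int[] DX = { 1, 0, -1, 0 };
        static final int[] DY = { 0, 1, 0, -1 };

        // Next state under action index a, or null if the move would leave the grid
        // (boundary states have only three legal moves).
        static int[] next(int x, int y, int a) {
            int nx = x + DX[a], ny = y + DY[a];
            if (nx < 0 || nx > N || ny < 0 || ny > N) return null;
            return new int[]{ nx, ny };
        }

        // Reinforcement for a transition taken from state (x, y): the absorbing state
        // returns -1, every other state transition returns +1.  (The exact placement
        // of the -1 is an assumption; it does not affect the rest of the sketch.)
        static double reinforcement(int x, int y) {
            return absorbing(x, y) ? -1.0 : 1.0;
        }

        static boolean absorbing(int x, int y) {
            return x == N && y == N;   // state [1,1]
        }
    }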

Function Approximator: a lookup table.

Learning algorithm: Backprop

RL algorithm: Residual Gradient Q-Learning
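
With a lookup table every table entry is its own weight, so the residual gradient update takes a very simple form. The sketch below shows one training step, building on the GridMDP sketch above; it is a hedged illustration rather than the applet's actual code, and the step size alpha, the discount gamma, and the table layout q[x][y][a] are all assumptions. Note that, because the objective here is to minimize reinforcement, the backup uses the minimum Q-value of the successor state rather than the maximum.

    // One residual-gradient Q-Learning step on the lookup table (illustrative sketch).
    class ResidualGradientStep {
        // q[x][y][a] is the table entry for state (x, y) and action index a (0..3).
        static void update(double[][][] q, int x, int y, int a,
                           double alpha, double gamma) {
            int[] s2 = GridMDP.next(x, y, a);
            if (s2 == null) return;                    // illegal boundary move: nothing to learn
            double r = GridMDP.reinforcement(x, y);    // +1 from every non-absorbing state

            // Value of the successor: defined as 0 for the absorbing state, otherwise the
            // minimum Q-value over its legal actions (minimization objective).
            int bestA = 0;
            double vNext = 0.0;
            boolean absorbing = GridMDP.absorbing(s2[0], s2[1]);
            if (!absorbing) {
                vNext = Double.POSITIVE_INFINITY;
                for (int a2 = 0; a2 < 4; a2++) {
                    if (GridMDP.next(s2[0], s2[1], a2) == null) continue;
                    if (q[s2[0]][s2[1]][a2] < vNext) { vNext = q[s2[0]][s2[1]][a2]; bestA = a2; }
                }
            }

            // Bellman residual for this transition.
            double delta = r + gamma * vNext - q[x][y][a];

            // Residual gradient descends 0.5*delta^2 with respect to BOTH table entries
            // appearing in the residual, not just q[x][y][a].
            q[x][y][a] += alpha * delta;
            if (!absorbing) q[s2[0]][s2[1]][bestA] -= alpha * gamma * delta;
        }
    }

Ordinary (direct) Q-Learning would adjust only q[x][y][a]; the residual gradient form also adjusts the successor entry, which is what keeps gradient descent on the mean squared Bellman residual stable when a general function approximator is substituted for the table.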

Displays:
1) Variables and Rates (upper left corner)
2) 2D graph of log error vs. learning time (upper right corner)
3) 3D graph of value function (lower left corner)
4) 3D graph of policy (lower right corner)

The 3D graphs can be rotated about two different axes by clicking and dragging inside or outside of the box.

Value Function Display: Remember that the value of a state is the sum of the reinforcements received when starting in that state and performing successive transitions until the absorbing state is reached. The X-axis and Y-axis correspond to state space. The Z-axis (height) is the value (minimum Q-value) in each state. The learned function should look like a stepped slope with the lowest point at [1,1] and the highest point at [0,0].
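
Put another way, the plotted surface should settle toward the fixed point of the Bellman optimality equation for this minimization problem (standard notation, not taken from the applet; s' is the state reached from s under action a):

    V^*(s) = \min_a \left[ r(s,a) + \gamma \, V^*(s') \right], \qquad V^*([1,1]) = 0

Because every transition before absorption returns a reinforcement of 1, states farther from [1,1] accumulate more reinforcement, which is what produces the stepped slope.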

Policy Display: The X-axis and Y-axis correspond to state space. The Z-axis (height) is the action considered best in each state: 0 - increment the current x coordinate by 0.2; 0.25 - increment the current y coordinate by 0.2; 0.5 - decrement the current x coordinate by 0.2; 0.75 - decrement the current y coordinate by 0.2. The learned policy is correct when all states (with the exception of the [1,y] and [x,1] edges) show an action of either 0 or 0.25. The [1,y] edge should show 0.25 (increment y), and the [x,1] edge should show 0 (increment x).
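
The plotted height is therefore just the code of the greedy (minimum-Q) action in each state. A small illustrative method in the spirit of the sketches above (the names are assumptions):

    // Action code with the lowest Q-value among the legal moves in state (x, y).
    class PolicyDisplay {
        static double greedyActionCode(double[][][] q, int x, int y) {
            int best = 0;
            double lowest = Double.POSITIVE_INFINITY;
            for (int a = 0; a < 4; a++) {
                if (GridMDP.next(x, y, a) == null) continue;  // boundary: skip illegal moves
                if (q[x][y][a] < lowest) { lowest = q[x][y][a]; best = a; }
            }
            return GridMDP.ACTIONS[best];                     // 0, 0.25, 0.5 or 0.75
        }
    }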

Suggestions for Experiments: Change the value of gamma and observe the resulting change in the optimal value function. To do so, click on the # symbol next to the WebSim logo and change the "gamma" parameter. Remember that it may be necessary to decrease the value of the learning rate parameter "rate" for larger values of gamma. Click here for a more complete description of WebSim(c) and how it can be used to perform experiments with many different RL algorithms and MDPs.
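
For reference, the quantity being estimated is the discounted sum of reinforcements accumulated until the absorbing state is reached (standard notation; T is the step at which [1,1] is entered):

    V^{\pi}(s) = \sum_{t=0}^{T-1} \gamma^{\,t} \, r_{t+1}

Larger values of gamma give the more distant reinforcements more weight, which changes both the shape and the scale of the optimal value function and enlarges the Bellman targets; roughly speaking, this is why a smaller "rate" is usually needed when gamma is increased.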


This Java applet requires a Java-aware browser such as Netscape 2.0 for Solaris/Win95/WinNT.