See the HTML source of this page for the simulation definition that is parsed and executed by the WebSim applet.
MDP: a linear-quadratic regulator.
The state space is the interval [-1, 1] on the number line. An imaginary cart sits on this number line and has two possible actions: move left or move right. Moving left corresponds to an input of -1 to the neural network; moving right corresponds to an input of 1. The state is the cart's position on the number line. The cost of an action is the square of the position after the action is performed, and the goal is to minimize the cost. There is no absorbing state.
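The description above fixes the cost and the action encoding but not how far the cart moves per action. The Python sketch below is a minimal model of this MDP, assuming a made-up step size and that the position is clipped to [-1, 1] (there is no absorbing state).

    # Minimal sketch of the cart MDP described above.  STEP (how far the
    # cart moves per action) is an assumption; the page does not state the
    # displacement the WebSim simulation actually uses.
    STEP = 0.05

    def transition(x, action):
        """Move the cart left (action = -1.0) or right (action = +1.0),
        keeping the position inside the state space [-1, 1]."""
        x_next = x + STEP * action
        return max(-1.0, min(1.0, x_next))

    def cost(x_next):
        """Cost is the squared position after the action is performed."""
        return x_next ** 2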
Function Approximator: a single-hidden-layer sigmoidal network with 8 nodes in the hidden layer.
Learning algorithm: Backprop
RL algorithm: Residual Gradient Q-learning
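The page names the algorithm but not its update rule or constants. The Python sketch below shows one residual gradient Q-learning step on the network described above, assuming a (position, action) input pair, a linear output unit, and made-up values for the discount factor and learning rate. The point to notice is that the Bellman residual is differentiated through both Q(s, a) and the successor Q-value; that second term is what distinguishes residual gradient Q-learning from ordinary Q-learning trained with backprop.

    import numpy as np

    # Assumed network shape: inputs are (position, action), 8 sigmoid hidden
    # units, one linear output giving Q(position, action).
    rng = np.random.default_rng(0)
    W1 = rng.normal(scale=0.1, size=(8, 2))   # hidden-layer weights
    b1 = np.zeros(8)                          # hidden-layer biases
    W2 = rng.normal(scale=0.1, size=8)        # output weights
    b2 = 0.0                                  # output bias

    GAMMA, ALPHA = 0.9, 0.1                   # assumed discount factor and learning rate
    ACTIONS = (-1.0, 1.0)                     # move left / move right

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def q_forward(x, a):
        """Forward pass: Q(x, a) and the hidden activations needed for backprop."""
        h = sigmoid(W1 @ np.array([x, a]) + b1)
        return W2 @ h + b2, h

    def q_gradients(x, a, h):
        """Backprop: gradient of Q(x, a) with respect to each weight group."""
        dpre = W2 * h * (1.0 - h)             # back through the sigmoid layer
        return np.outer(dpre, np.array([x, a])), dpre, h, 1.0

    def residual_gradient_update(x, a, c, x_next):
        """One residual gradient Q-learning step on the transition (x, a, c, x_next)."""
        global W1, b1, W2, b2
        q, h = q_forward(x, a)
        # Greedy successor action: reinforcements are costs, so take the smallest Q-value.
        q_next, h_next, a_next = min(
            (q_forward(x_next, an) + (an,) for an in ACTIONS), key=lambda t: t[0])
        delta = q - (c + GAMMA * q_next)      # Bellman residual
        # The true gradient of 0.5 * delta**2 flows through BOTH Q(x, a) and Q(x', a').
        g = q_gradients(x, a, h)
        g_next = q_gradients(x_next, a_next, h_next)
        dW1, db1, dW2, db2 = (gi - GAMMA * gn for gi, gn in zip(g, g_next))
        W1 -= ALPHA * delta * dW1
        b1 -= ALPHA * delta * db1
        W2 -= ALPHA * delta * dW2
        b2 -= ALPHA * delta * db2

    # One learning run, reusing transition() and cost() from the MDP sketch above.
    x = rng.uniform(-1.0, 1.0)
    for _ in range(10000):
        a = float(rng.choice(ACTIONS))        # explore with random actions
        x_next = transition(x, a)
        residual_gradient_update(x, a, cost(x_next), x_next)
        x = x_next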
Displays:
1) Variables and Rates (upper left corner)
2) 2D graph of log error vs. learning time (upper right corner)
3) 3D graph of value function (lower left corner)
4) 3D graph of policy (lower right corner)
Value Function Display: After learning, the value function will look like a "U". Remember that the value of a state is the Q-value of the best action in that state; since the reinforcements here are costs to be minimized, this is the smallest Q-value. Also, a Q-value is defined as the sum of the reinforcements received when performing the corresponding action and following the optimal policy thereafter. The X-axis corresponds to the state space. The Z-axis (height) is the value of each state. The Y-axis (depth) has no meaning.
Policy Display: The policy for this system is clear: when the "cart" is left of 0, the RL system should perform action 1 (move right); when the "cart" is right of 0, it should perform action -1 (move left). The X-axis corresponds to the state space. The Y-axis is the policy (the chosen action) in each state. The Z-axis has no meaning.
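To make the two plots concrete, the short sketch below (reusing the hypothetical q_forward and ACTIONS from the learning sketch above) samples the state space and recovers what the two 3D displays show: the best Q-value in each state for the value display, and the corresponding action for the policy display.

    import numpy as np

    # Recover the value-function and policy displays from the learned Q-function,
    # reusing q_forward() and ACTIONS from the learning sketch above.  Costs are
    # minimized, so the value is the smallest Q-value and the policy is its argmin.
    def value_and_policy(xs):
        values, policy = [], []
        for x in xs:
            qs = [q_forward(x, a)[0] for a in ACTIONS]
            best = int(np.argmin(qs))      # index of the cost-minimizing action
            values.append(qs[best])        # plotted in the value-function display
            policy.append(ACTIONS[best])   # plotted in the policy display
        return np.array(values), np.array(policy)

    xs = np.linspace(-1.0, 1.0, 101)       # sample the state space
    V, pi = value_and_policy(xs)           # V should look like a "U"; pi should be
                                           # +1 left of 0 and -1 right of 0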