MDPs & Bellman Equations
The Markov Decision Process is the formal language of reinforcement learning. Almost every algorithm — value iteration, Q-learning, policy gradients, modern actor-critic methods — is a way of approximately solving the Bellman equations of an MDP. This page sets up the formalism and derives the equations that every later chapter rests on.
The MDP
A (discounted, infinite-horizon) Markov Decision Process is a tuple

$$\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, \gamma, \rho_0)$$

- $\mathcal{S}$ — state space.
- $\mathcal{A}$ — action space.
- $P(s' \mid s, a)$ — transition probability.
- $r(s, a)$ — reward function.
- $\gamma \in [0, 1)$ — discount factor.
- $\rho_0(s)$ — initial-state distribution.
A trajectory is a sequence $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots)$, generated by sampling $s_0 \sim \rho_0$ and then, under some policy $\pi$ (defined below), $a_t \sim \pi(\cdot \mid s_t)$, $r_t = r(s_t, a_t)$, $s_{t+1} \sim P(\cdot \mid s_t, a_t)$.
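To make the tuple concrete, here is a minimal NumPy sketch of a hypothetical 3-state, 2-action MDP, with a transition tensor `P[s, a, s']`, reward table `r[s, a]`, and a rollout helper. All names and numbers here are illustrative, not taken from the text above.

```python
import numpy as np

# Hypothetical toy MDP: 3 states, 2 actions, random (but valid) dynamics.
n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

# P[s, a, s'] = probability of landing in s' after taking action a in state s.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
# r[s, a] = expected immediate reward.
r = rng.uniform(0.0, 1.0, size=(n_states, n_actions))
gamma = 0.9
rho0 = np.full(n_states, 1.0 / n_states)   # uniform initial-state distribution

def sample_trajectory(policy, horizon=10):
    """Roll out (s_t, a_t, r_t) tuples under a stochastic policy pi[s, a]."""
    s = rng.choice(n_states, p=rho0)
    traj = []
    for _ in range(horizon):
        a = rng.choice(n_actions, p=policy[s])
        traj.append((s, a, r[s, a]))
        s = rng.choice(n_states, p=P[s, a])   # Markov transition: depends only on (s, a)
    return traj

uniform_pi = np.full((n_states, n_actions), 1.0 / n_actions)
print(sample_trajectory(uniform_pi, horizon=5))
```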
The Markov property — the future depends on history only through the current state — is what makes the framework tractable. Many real-world problems are not Markov in their raw observation, motivating Partially-Observable MDPs (POMDPs) and frame-stacking / RNN-based state aggregation.
Value functions
Define two functions that measure long-run reward under a policy $\pi(a \mid s)$, a mapping from states to distributions over actions:

$$V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \;\middle|\; s_0 = s\right], \qquad
Q^\pi(s, a) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \;\middle|\; s_0 = s,\ a_0 = a\right].$$

A policy $\pi^*$ is optimal if $V^{\pi^*}(s) \ge V^\pi(s)$ for every state $s$ and every policy $\pi$; in a finite discounted MDP at least one such policy exists, and it can be chosen deterministic.
Bellman expectation equations
By conditioning on the first action and the next state:

$$V^\pi(s) = \sum_a \pi(a \mid s)\left[r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^\pi(s')\right],$$

$$Q^\pi(s, a) = r(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \sum_{a'} \pi(a' \mid s')\, Q^\pi(s', a').$$

Both are linear in the unknowns ($V^\pi$ or $Q^\pi$), so in a finite MDP they can be solved exactly: writing the first equation in matrix form gives $V^\pi = (I - \gamma P^\pi)^{-1} r^\pi$, an $|\mathcal{S}| \times |\mathcal{S}|$ linear system.
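Because the expectation equation is linear, exact policy evaluation for a small finite MDP is a single linear solve. A minimal sketch, assuming arrays shaped like the toy MDP above; `evaluate_policy` is an illustrative name, not from the original text.

```python
import numpy as np

def evaluate_policy(P, r, gamma, pi):
    """Return V^pi for a stochastic policy pi[s, a], given P[s, a, s'] and r[s, a]."""
    n_states = P.shape[0]
    # Marginalize actions out under pi: P_pi[s, s'] and r_pi[s].
    P_pi = np.einsum("sa,sat->st", pi, P)
    r_pi = np.einsum("sa,sa->s", pi, r)
    # Solve (I - gamma * P_pi) V = r_pi exactly.
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
```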
Bellman optimality equations
For the optimal policy, the expectation over actions is replaced by a maximization:

$$V^*(s) = \max_a \left[r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s')\right], \qquad
Q^*(s, a) = r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, \max_{a'} Q^*(s', a').$$

Once $Q^*$ is known, an optimal policy follows immediately by acting greedily: $\pi^*(s) = \arg\max_a Q^*(s, a)$.
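As a small illustration of that greedy-extraction step, the sketch below turns any Q-table `Q[s, a]` (for instance, the output of the value-iteration sketch further down) into a deterministic policy; `greedy_policy` is an illustrative name.

```python
import numpy as np

def greedy_policy(Q):
    """Return the deterministic policy pi(s) = argmax_a Q(s, a), one-hot encoded."""
    n_states, n_actions = Q.shape
    pi = np.zeros((n_states, n_actions))
    pi[np.arange(n_states), Q.argmax(axis=1)] = 1.0
    return pi
```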
Bellman operators and contraction
Define the Bellman optimality operator $\mathcal{T}$, acting on any value function $V$:

$$(\mathcal{T} V)(s) = \max_a \left[r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s')\right].$$

Two crucial properties:

- $\mathcal{T}$ is a $\gamma$-contraction in the sup norm: $\|\mathcal{T} V - \mathcal{T} V'\|_\infty \le \gamma\, \|V - V'\|_\infty$.
- Its unique fixed point is $V^*$.

By Banach's fixed-point theorem, the iteration $V_{k+1} = \mathcal{T} V_k$ converges to $V^*$ from any starting point, with the error shrinking by a factor of $\gamma$ per step. This iteration is value iteration.
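The contraction argument translates directly into an algorithm: apply $\mathcal{T}$ until the sup-norm change falls below a tolerance. A minimal sketch, again assuming arrays shaped like the toy MDP above; `value_iteration` is an illustrative name.

```python
import numpy as np

def value_iteration(P, r, gamma, tol=1e-8):
    """Iterate V <- T V until convergence; return V* and the greedy policy's actions."""
    V = np.zeros(P.shape[0])
    while True:
        # (T V)(s) = max_a [ r(s, a) + gamma * sum_{s'} P(s'|s,a) V(s') ]
        Q = r + gamma * np.einsum("sat,t->sa", P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:   # the contraction guarantees this gap shrinks
            return V_new, Q.argmax(axis=1)
        V = V_new
```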
Policy iteration
The alternative dynamic-programming algorithm alternates between policy evaluation (solve the Bellman expectation equation for $V^\pi$, exactly or iteratively) and policy improvement (make $\pi$ greedy with respect to $V^\pi$: $\pi(s) \leftarrow \arg\max_a \left[r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^\pi(s')\right]$). Each improvement step yields a policy at least as good as the previous one, and in a finite MDP the loop terminates at an optimal policy after finitely many iterations.
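A minimal sketch of the full loop, using an exact linear solve for evaluation and a greedy improvement step; `policy_iteration` is an illustrative name and the arrays are assumed to be shaped like the toy MDP above.

```python
import numpy as np

def policy_iteration(P, r, gamma):
    """Alternate exact policy evaluation and greedy improvement until the policy is stable."""
    n_states, n_actions = r.shape
    pi = np.full((n_states, n_actions), 1.0 / n_actions)     # start from the uniform policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = r_pi exactly.
        P_pi = np.einsum("sa,sat->st", pi, P)
        r_pi = np.einsum("sa,sa->s", pi, r)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement: act greedily with respect to V.
        Q = r + gamma * np.einsum("sat,t->sa", P, V)
        pi_new = np.zeros_like(pi)
        pi_new[np.arange(n_states), Q.argmax(axis=1)] = 1.0
        if np.array_equal(pi_new, pi):                        # stable policy => optimal
            return pi_new, V
        pi = pi_new
```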
What to read next
- Q-Learning — the model-free, sample-based version of value iteration.
- Policy Gradient — the alternative, policy-iteration-style approach.
- DQN — Q-learning with a deep network.