
MDPs & Bellman Equations

The Markov Decision Process is the formal language of reinforcement learning. Almost every algorithm — value iteration, Q-learning, policy gradients, modern actor-critic methods — is a way of approximately solving the Bellman equations of an MDP. This page sets up the formalism and derives the equations that every later chapter rests on.

The MDP

A (discounted, infinite-horizon) Markov Decision Process is a tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma, \rho_0)$:

  • $\mathcal{S}$ — state space.
  • $\mathcal{A}$ — action space.
  • $P(s' \mid s, a)$ — transition probability.
  • $r(s, a) \in \mathbb{R}$ — reward function.
  • $\gamma \in [0, 1)$ — discount factor.
  • $\rho_0(s)$ — initial-state distribution.

A trajectory $\tau = (s_0, a_0, s_1, a_1, \dots)$ is generated by sampling $s_0 \sim \rho_0$, then iterating $a_t \sim \pi(\cdot \mid s_t)$ and $s_{t+1} \sim P(\cdot \mid s_t, a_t)$. The return is the discounted sum of rewards $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$.
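As a quick numerical sanity check on the return, here is a minimal Python sketch (the helper name `discounted_return` and the example rewards are illustrative, not from the source) that accumulates $G_0$ by summing backwards:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G_0 = sum_k gamma^k * r_k for a finite list of rewards.

    Summing backwards uses the recursion G_t = r_t + gamma * G_{t+1},
    so no explicit powers of gamma are needed.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Example: rewards [1, 1, 1] with gamma = 0.9 give 1 + 0.9 + 0.81 = 2.71.
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))
```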

The Markov property — the future depends on history only through the current state — is what makes the framework tractable. Many real-world problems are not Markov in their raw observation, motivating Partially-Observable MDPs (POMDPs) and frame-stacking / RNN-based state aggregation.

Value functions

Define two functions that measure long-run reward under a policy π:

$$V^\pi(s) = \mathbb{E}_\pi\!\left[\,G_t \mid s_t = s\,\right], \qquad Q^\pi(s, a) = \mathbb{E}_\pi\!\left[\,G_t \mid s_t = s,\ a_t = a\,\right].$$

$V^\pi$ is the state-value function (expected return from $s$); $Q^\pi$ is the action-value function. They relate as

$$V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a).$$

A policy $\pi^*$ is optimal if $V^{\pi^*}(s) \ge V^{\pi}(s)$ for every state $s$ and every policy $\pi$. Such a policy always exists (Puterman, 1994), and at least one optimal policy is deterministic.

Bellman expectation equations

By conditioning on the first action and the next state:

$$V^\pi(s) = \sum_a \pi(a \mid s)\left[r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^\pi(s')\right],$$
$$Q^\pi(s, a) = r(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \sum_{a'} \pi(a' \mid s')\, Q^\pi(s', a').$$

Both are linear in the unknowns ($V^\pi$, $Q^\pi$) given the dynamics — finite MDPs can be solved in closed form by matrix inversion. For large or unknown MDPs, iterative methods replace the inversion.
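As a sketch of that closed-form route, assume a small finite MDP stored as NumPy arrays with the (illustrative) layout `P[s, a, s']` for transitions, `R[s, a]` for rewards, and `pi[s, a]` for a stochastic policy; then $V^\pi = (I - \gamma P^\pi)^{-1} r^\pi$ can be computed directly:

```python
import numpy as np


def evaluate_policy_exact(P, R, pi, gamma):
    """Solve the linear system V^pi = (I - gamma * P^pi)^(-1) r^pi.

    P  : (S, A, S) array of transition probabilities P[s, a, s'].
    R  : (S, A)    array of rewards r(s, a).
    pi : (S, A)    array of policy probabilities pi(a | s).
    """
    S = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)  # policy-averaged dynamics, (S, S)
    r_pi = np.sum(pi * R, axis=1)          # policy-averaged rewards, (S,)
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
```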

Bellman optimality equations

For the optimal policy $\pi^*$, the value function satisfies a non-linear fixed-point equation:

$$V^*(s) = \max_a \left[r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s')\right],$$
$$Q^*(s, a) = r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, \max_{a'} Q^*(s', a').$$

Once $Q^*$ is known, the optimal policy is greedy: $\pi^*(s) = \arg\max_a Q^*(s, a)$. Knowing $Q^*$ is therefore equivalent to knowing $\pi^*$.
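In tabular form this greedy step is a single argmax over the action axis (a sketch assuming `Q` is an `(S, A)` NumPy array):

```python
import numpy as np


def greedy_policy(Q):
    """Return pi*(s) = argmax_a Q(s, a) for a tabular (S, A) value array."""
    return np.argmax(Q, axis=1)
```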

Bellman operators and contraction

Define the Bellman optimality operator $\mathcal{T}$ on the space of bounded functions $V : \mathcal{S} \to \mathbb{R}$:

$$(\mathcal{T}V)(s) = \max_a \left[r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s')\right].$$

Two crucial properties:

  1. $\mathcal{T}$ is a $\gamma$-contraction in the sup norm: $\|\mathcal{T}V_1 - \mathcal{T}V_2\|_\infty \le \gamma\, \|V_1 - V_2\|_\infty$.
  2. Its unique fixed point is $V^*$.

By Banach's fixed-point theorem, the iteration $V_{k+1} = \mathcal{T}V_k$ converges to $V^*$ from any initialisation, geometrically with rate $\gamma$. This is the value iteration algorithm, and it is the prototype that every later RL algorithm approximates.
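A minimal tabular value-iteration sketch under the same assumed array layout as above (`P[s, a, s']` transitions, `R[s, a]` rewards); each pass applies the operator $\mathcal{T}$ once and stops when the sup-norm change is small:

```python
import numpy as np


def value_iteration(P, R, gamma, tol=1e-8):
    """Iterate V <- T V until the sup-norm change falls below tol."""
    S = P.shape[0]
    V = np.zeros(S)
    while True:
        # (T V)(s) = max_a [ r(s, a) + gamma * sum_{s'} P(s'|s, a) V(s') ]
        Q = R + gamma * np.einsum("sat,t->sa", P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```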

Policy iteration

Policy iteration is an alternative dynamic-programming algorithm: alternate between policy evaluation (solve for $V^\pi$ given $\pi$) and policy improvement (set $\pi'(s) = \arg\max_a Q^\pi(s, a)$), as sketched below. Each step strictly improves the policy unless it is already optimal, and convergence happens in finitely many iterations on finite MDPs (Howard, 1960). Modern actor-critic algorithms are stochastic, sample-based generalisations of policy iteration.
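A corresponding tabular sketch, again under the assumed `P[s, a, s']` / `R[s, a]` layout: exact evaluation of the current deterministic policy followed by greedy improvement, stopping when the policy no longer changes:

```python
import numpy as np


def policy_iteration(P, R, gamma):
    """Alternate exact policy evaluation and greedy improvement on a finite MDP."""
    S, A = R.shape
    policy = np.zeros(S, dtype=int)    # arbitrary initial deterministic policy
    while True:
        # Evaluation: solve (I - gamma * P^pi) V = r^pi for the current policy.
        P_pi = P[np.arange(S), policy]                 # (S, S)
        r_pi = R[np.arange(S), policy]                 # (S,)
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # Improvement: act greedily with respect to Q^pi.
        Q = R + gamma * np.einsum("sat,t->sa", P, V)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V                           # greedy w.r.t. itself => optimal
        policy = new_policy
```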

  • Q-Learning — the model-free, sample-based version of value iteration.
  • Policy Gradient — the alternative, policy-iteration-style approach.
  • DQN — Q-learning with a deep network.
