
MDPs & Bellman Equations

The Markov Decision Process is the formal language of reinforcement learning. Almost every algorithm — value iteration, Q-learning, policy gradients, modern actor-critic methods — is a way of approximately solving the Bellman equations of an MDP. This page sets up the formalism and derives the equations that every later chapter rests on.

The MDP

A (discounted, infinite-horizon) Markov Decision Process is a tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma, \rho_0)$:

  • $\mathcal{S}$ — state space.
  • $\mathcal{A}$ — action space.
  • $P(s' \mid s, a)$ — transition probability.
  • $r(s, a) \in \mathbb{R}$ — reward function.
  • $\gamma \in [0, 1)$ — discount factor.
  • $\rho_0(s)$ — initial-state distribution.

A trajectory $\tau = (s_0, a_0, s_1, a_1, \dots)$ is generated by sampling $s_0 \sim \rho_0$, then iterating $a_t \sim \pi(\cdot \mid s_t)$ and $s_{t+1} \sim P(\cdot \mid s_t, a_t)$. The return is the discounted sum of rewards $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$.
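As a quick numerical sanity check on the return, here is a minimal Python sketch (the helper name `discounted_return` and the example rewards are illustrative, not from the source) that accumulates $G_0$ by summing backwards:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G_0 = sum_k gamma^k * r_k for a finite list of rewards.

    Summing backwards uses the recursion G_t = r_t + gamma * G_{t+1},
    so no explicit powers of gamma are needed.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Example: rewards [1, 1, 1] with gamma = 0.9 give 1 + 0.9 + 0.81 = 2.71.
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))
```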

The Markov property — the future depends on history only through the current state — is what makes the framework tractable. Many real-world problems are not Markov in their raw observation, motivating Partially-Observable MDPs (POMDPs) and frame-stacking / RNN-based state aggregation.

Value functions

Define two functions that measure long-run reward under a policy π:

$$V^\pi(s) = \mathbb{E}_\pi\!\left[\,G_t \mid s_t = s\,\right], \qquad Q^\pi(s, a) = \mathbb{E}_\pi\!\left[\,G_t \mid s_t = s,\ a_t = a\,\right].$$

$V^\pi$ is the state-value function (expected return from $s$); $Q^\pi$ is the action-value function. They relate as

$$V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a).$$

A policy $\pi^*$ is optimal if $V^{\pi^*}(s) \ge V^{\pi}(s)$ for every state $s$ and every policy $\pi$. Such a policy always exists (Puterman, 1994), and at least one optimal policy is deterministic.

Bellman expectation equations

By conditioning on the first action and the next state:

$$V^\pi(s) = \sum_a \pi(a \mid s)\left[r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^\pi(s')\right],$$
$$Q^\pi(s, a) = r(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \sum_{a'} \pi(a' \mid s')\, Q^\pi(s', a').$$

Both are linear in the unknowns ($V^\pi$, $Q^\pi$) given the dynamics — finite MDPs can be solved in closed form by matrix inversion. For large or unknown MDPs, iterative methods replace the inversion.
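As a sketch of that closed-form route, assume a small finite MDP stored as NumPy arrays with the (illustrative) layout `P[s, a, s']` for transitions, `R[s, a]` for rewards, and `pi[s, a]` for a stochastic policy; then $V^\pi = (I - \gamma P^\pi)^{-1} r^\pi$ can be computed directly:

```python
import numpy as np


def evaluate_policy_exact(P, R, pi, gamma):
    """Solve the linear system V^pi = (I - gamma * P^pi)^(-1) r^pi.

    P  : (S, A, S) array of transition probabilities P[s, a, s'].
    R  : (S, A)    array of rewards r(s, a).
    pi : (S, A)    array of policy probabilities pi(a | s).
    """
    S = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)  # policy-averaged dynamics, (S, S)
    r_pi = np.sum(pi * R, axis=1)          # policy-averaged rewards, (S,)
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
```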

Bellman optimality equations

For the optimal policy $\pi^*$, the value function satisfies a non-linear fixed-point equation:

$$V^*(s) = \max_a \left[r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s')\right],$$
$$Q^*(s, a) = r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, \max_{a'} Q^*(s', a').$$

Once $Q^*$ is known, the optimal policy is greedy: $\pi^*(s) = \arg\max_a Q^*(s, a)$. Knowing $Q^*$ is therefore equivalent to knowing $\pi^*$.
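In tabular form this greedy step is a single argmax over the action axis (a sketch assuming `Q` is an `(S, A)` NumPy array):

```python
import numpy as np


def greedy_policy(Q):
    """Return pi*(s) = argmax_a Q(s, a) for a tabular (S, A) value array."""
    return np.argmax(Q, axis=1)
```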

Bellman operators and contraction

Define the Bellman optimality operator $\mathcal{T}$ on the space of bounded functions $V : \mathcal{S} \to \mathbb{R}$:

$$(\mathcal{T}V)(s) = \max_a \left[r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s')\right].$$

Two crucial properties:

  1. $\mathcal{T}$ is a $\gamma$-contraction in the sup norm: $\|\mathcal{T}V_1 - \mathcal{T}V_2\|_\infty \le \gamma\, \|V_1 - V_2\|_\infty$.
  2. Its unique fixed point is $V^*$.

By Banach's fixed-point theorem, the iteration $V_{k+1} = \mathcal{T}V_k$ converges to $V^*$ from any initialisation, geometrically with rate $\gamma$. This is the value iteration algorithm, and it is the prototype that every later RL algorithm approximates.
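A minimal tabular value-iteration sketch under the same assumed array layout as above (`P[s, a, s']` transitions, `R[s, a]` rewards); each pass applies the operator $\mathcal{T}$ once and stops when the sup-norm change is small:

```python
import numpy as np


def value_iteration(P, R, gamma, tol=1e-8):
    """Iterate V <- T V until the sup-norm change falls below tol."""
    S = P.shape[0]
    V = np.zeros(S)
    while True:
        # (T V)(s) = max_a [ r(s, a) + gamma * sum_{s'} P(s'|s, a) V(s') ]
        Q = R + gamma * np.einsum("sat,t->sa", P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```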

Policy iteration

Policy iteration is an alternative dynamic-programming algorithm: alternate between policy evaluation (solve for $V^\pi$ given $\pi$) and policy improvement (set $\pi'(s) = \arg\max_a Q^\pi(s, a)$), as sketched below. Each step strictly improves the policy unless it is already optimal, and convergence happens in finitely many iterations on finite MDPs (Howard, 1960). Modern actor-critic algorithms are stochastic, sample-based generalisations of policy iteration.
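A corresponding tabular sketch, again under the assumed `P[s, a, s']` / `R[s, a]` layout: exact evaluation of the current deterministic policy followed by greedy improvement, stopping when the policy no longer changes:

```python
import numpy as np


def policy_iteration(P, R, gamma):
    """Alternate exact policy evaluation and greedy improvement on a finite MDP."""
    S, A = R.shape
    policy = np.zeros(S, dtype=int)    # arbitrary initial deterministic policy
    while True:
        # Evaluation: solve (I - gamma * P^pi) V = r^pi for the current policy.
        P_pi = P[np.arange(S), policy]                 # (S, S)
        r_pi = R[np.arange(S), policy]                 # (S,)
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # Improvement: act greedily with respect to Q^pi.
        Q = R + gamma * np.einsum("sat,t->sa", P, V)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V                           # greedy w.r.t. itself => optimal
        policy = new_policy
```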

  • Q-Learning — the model-free, sample-based version of value iteration.
  • Policy Gradient — the alternative, policy-iteration-style approach.
  • DQN — Q-learning with a deep network.
