Two-View Geometry & Stereo

Two cameras observing the same scene from different positions impose a strong geometric constraint on what they can see. Working out that constraint — the epipolar geometry — is what lets us triangulate 3D points from pairs of image observations. Stereo vision is the direct application: a calibrated pair of cameras produces a dense disparity map that converts to depth.

Epipolar geometry and the fundamental matrix

For two views with camera matrices $P_{1}, P_{2}$, a 3D point $X$ projects to $x_{1}$ and $x_{2}$ (written in homogeneous image coordinates). These satisfy the epipolar constraint

$$x_{2}^{T} F x_{1} = 0,$$

where $F$ is the $3 \times 3$ rank-2 fundamental matrix. Given a point $x_{1}$ in image 1, the matching point in image 2 must lie on the epipolar line $F x_{1}$. Reducing the correspondence search from 2D to 1D is what makes dense stereo matching tractable.
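A minimal sketch of how the constraint is used in practice: compute the epipolar line for a point in image 1, then measure how far a candidate match in image 2 lies from it. The function names and the toy $F$ are illustrative assumptions; the toy matrix corresponds to a camera moving forward along its optical axis with identity intrinsics, where epipolar lines radiate from the image origin.

```python
import numpy as np

def epipolar_line(F, x1):
    """Epipolar line F x1 in image 2, scaled so that (a, b) has unit norm."""
    line = F @ np.array([x1[0], x1[1], 1.0])
    return line / np.hypot(line[0], line[1])

def point_line_distance(line, x2):
    """Perpendicular distance from pixel x2 to a normalised line (a, b, c)."""
    return abs(line @ np.array([x2[0], x2[1], 1.0]))

# Toy F (assumption): pure forward translation, identity intrinsics.
F = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 0.0]])

line = epipolar_line(F, (3.0, 4.0))

# A candidate on the same ray through the origin satisfies the constraint;
# one shifted off that ray does not.
on_line = point_line_distance(line, (6.0, 8.0))
off_line = point_line_distance(line, (6.0, 13.0))
```

In a matcher, candidates whose distance exceeds a small pixel threshold can be rejected before any appearance comparison is made.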

When the cameras are calibrated (intrinsics $K_{1}, K_{2}$ known), $F = K_{2}^{-T} E K_{1}^{-1}$, where $E = [t]_{\times} R$ is the essential matrix. It encodes the relative rotation $R$ and the translation $t$, the latter only up to scale. $E$ has 5 degrees of freedom and is recoverable from 5 point correspondences (Nistér's 5-point algorithm, PAMI 2004).
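The relation between $E$, $F$ and the intrinsics can be checked numerically. The sketch below assumes the convention $P_{1} = K_{1}[I \mid 0]$, $P_{2} = K_{2}[R \mid t]$, which the text does not state explicitly; the intrinsics, pose and 3D point are made-up values for illustration.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v equals np.cross(t, v)."""
    tx, ty, tz = t
    return np.array([[0.0, -tz,  ty],
                     [ tz, 0.0, -tx],
                     [-ty,  tx, 0.0]])

def essential_and_fundamental(K1, K2, R, t):
    """E = [t]_x R and F = K2^{-T} E K1^{-1} for P1 = K1[I|0], P2 = K2[R|t]."""
    E = skew(t) @ R
    F = np.linalg.inv(K2).T @ E @ np.linalg.inv(K1)
    return E, F

# Hypothetical calibrated pair: shared intrinsics, small rotation about y.
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
theta = 0.1
R = np.array([[ np.cos(theta), 0.0, np.sin(theta)],
              [ 0.0,           1.0, 0.0          ],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([-0.2, 0.0, 0.05])
E, F = essential_and_fundamental(K, K, R, t)

# Project one 3D point into both views and evaluate x2^T F x1.
X = np.array([0.3, -0.1, 5.0])
x1 = K @ X
x1 = x1 / x1[2]
x2 = K @ (R @ X + t)
x2 = x2 / x2[2]
residual = float(x2 @ F @ x1)  # zero up to floating-point round-off
```

The singular values of `E` come out as two equal values and one zero, which is the algebraic signature of an essential matrix.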

Estimating $F$: the eight-point algorithm

Hartley's normalised eight-point algorithm (PAMI 1997) solves for $F$ from $\geq 8$ correspondences:

1. Normalise the image coordinates in each view so that the points have zero centroid and an average distance of $\sqrt{2}$ from the origin. Skipping this is the classic pitfall: the linear system becomes badly ill-conditioned.
2. Stack the constraint $x_{2}^{T} F x_{1} = 0$ for each correspondence into a linear system $A f = 0$, where $f$ holds the 9 entries of $F$.
3. Solve via SVD, taking the right singular vector with the smallest singular value; enforce the rank-2 constraint by zeroing the smallest singular value of the resulting $F$, then undo the normalisation.
4. Wrap the estimator in RANSAC: outliers are inevitable in feature matching, so a robust loop is essential in practice.
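The linear steps above can be sketched in NumPy as follows. This is an illustrative implementation, not a reference one: it omits the RANSAC loop and assumes the correspondences passed in are already inliers. The function names, the synthetic cameras and the random points are all assumptions made for the example.

```python
import numpy as np

def normalise_points(pts):
    """Hartley normalisation: zero centroid, mean distance sqrt(2) from origin."""
    centroid = pts.mean(axis=0)
    scale = np.sqrt(2.0) / np.linalg.norm(pts - centroid, axis=1).mean()
    T = np.array([[scale, 0.0, -scale * centroid[0]],
                  [0.0, scale, -scale * centroid[1]],
                  [0.0, 0.0, 1.0]])
    pts_h = np.column_stack([pts, np.ones(len(pts))])
    return pts_h @ T.T, T

def eight_point(pts1, pts2):
    """Estimate F with x2^T F x1 = 0 from N >= 8 correspondences of shape (N, 2)."""
    x1, T1 = normalise_points(pts1)
    x2, T2 = normalise_points(pts2)
    # Row n holds the products x2_i * x1_j, matching a row-major flattening of F.
    A = np.einsum("ni,nj->nij", x2, x1).reshape(-1, 9)
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    # Enforce rank 2 by zeroing the smallest singular value.
    U, S, Vt = np.linalg.svd(F)
    S[2] = 0.0
    F = U @ np.diag(S) @ Vt
    # Undo the normalisation: F = T2^T F_hat T1.
    F = T2.T @ F @ T1
    return F / np.linalg.norm(F)

def epipolar_distances(F, pts1, pts2):
    """Pixel distance from each point in image 2 to its epipolar line F x1."""
    h1 = np.column_stack([pts1, np.ones(len(pts1))])
    h2 = np.column_stack([pts2, np.ones(len(pts2))])
    lines = h1 @ F.T
    return np.abs(np.sum(lines * h2, axis=1)) / np.hypot(lines[:, 0], lines[:, 1])

# Synthetic, noise-free check with hypothetical cameras and random 3D points.
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(-1, 1, 20),
                     rng.uniform(-1, 1, 20),
                     rng.uniform(4, 8, 20)])
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
theta = 0.1
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([-0.2, 0.0, 0.05])
p1 = X @ K.T
p2 = (X @ R.T + t) @ K.T
pts1 = p1[:, :2] / p1[:, 2:]
pts2 = p2[:, :2] / p2[:, 2:]

F = eight_point(pts1, pts2)
max_dist = epipolar_distances(F, pts1, pts2).max()  # tiny for noise-free data
```

With real feature matches, `eight_point` would be called on random minimal samples inside a RANSAC loop, with `epipolar_distances` (or a Sampson error) serving as the inlier test.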

Rectification

Once $F$ (or $E$) is estimated, rectification warps the two images, one homography each, so that corresponding epipolar lines fall on the same horizontal scanline. After rectification, finding the match for a pixel at row $y$ in image 1 is a 1D search along row $y$ in image 2, which turns the 2D matching problem into per-row block matching.
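A short worked check of why rectified epipolar lines are scanlines, under an idealised assumption not stated above: both cameras share the same intrinsics with square pixels, have no relative rotation, and are separated by a purely horizontal translation. Substituting $R = I$ and $t = (t_{x}, 0, 0)^{T}$ into $F = K^{-T} [t]_{\times} K^{-1}$ gives, up to scale,

$$F \sim \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & -1 \\ 0 & 1 & 0 \end{bmatrix}, \qquad F \begin{bmatrix} u_{1} \\ v_{1} \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ -1 \\ v_{1} \end{bmatrix}.$$

The epipolar line $(0, -1, v_{1})$ is the set of points with $v = v_{1}$, and the constraint $x_{2}^{T} F x_{1} = v_{1} - v_{2} = 0$ says exactly that matches share a row.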

Disparity and depth

In a rectified stereo pair with baseline $b$ and focal length $f$ (in pixels), a 3D point at depth $Z$ projects to two image points whose horizontal coordinates differ by the disparity $d = x_{1} - x_{2}$, related by

$$Z = \frac{f \cdot b}{d}.$$
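As an illustration of the formula, the sketch below converts a small disparity map to depth. The focal length of 800 px and baseline of 0.12 m are made-up values, and treating non-positive disparities as invalid (infinite depth) is a convention chosen for this example.

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m):
    """Apply Z = f * b / d per pixel; non-positive disparities map to infinity."""
    disparity = np.asarray(disparity, dtype=float)
    depth = np.full(disparity.shape, np.inf)
    valid = disparity > 0
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

# Hypothetical rig: f = 800 px, b = 0.12 m, so f * b is about 96 px*m.
depth = disparity_to_depth([[64.0, 32.0], [8.0, 0.0]],
                           focal_px=800.0, baseline_m=0.12)
# Disparities of 64, 32 and 8 px give depths of about 1.5, 3 and 12 m.
```

Halving the disparity doubles the depth, so a one-pixel disparity error matters far more for distant points than for near ones.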

Closer points have larger disparity. Estimating the disparity at every pixel is the stereo matching problem. Methods range from local block matching (window correlation with SAD or NCC costs), through Semi-Global Matching (SGM, Hirschmüller, CVPR 2005), which underlies OpenCV's StereoSGBM, to learned cost-volume networks (PSMNet, GA-Net, RAFT-Stereo).
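To make the simplest end of that range concrete, here is a brute-force SAD block matcher for a rectified pair. It is a teaching sketch only: there is no subpixel refinement, no left-right consistency check and no smoothness term, and the function name, window size and synthetic images are assumptions for the example.

```python
import numpy as np

def block_match_sad(left, right, max_disp, half_win=2):
    """Integer disparity per pixel via sum of absolute differences (SAD).

    Assumes a rectified pair with d = x_left - x_right >= 0. Border pixels
    without a full search range are left at 0.
    """
    left = left.astype(float)
    right = right.astype(float)
    h, w = left.shape
    disp = np.zeros((h, w), dtype=int)
    for y in range(half_win, h - half_win):
        rows = slice(y - half_win, y + half_win + 1)
        for x in range(half_win + max_disp, w - half_win):
            patch = left[rows, x - half_win:x + half_win + 1]
            # Slide the window along the same row of the right image.
            costs = [
                np.abs(patch - right[rows, x - d - half_win:x - d + half_win + 1]).sum()
                for d in range(max_disp + 1)
            ]
            disp[y, x] = int(np.argmin(costs))
    return disp

# Synthetic pair: the left image is the right image shifted 3 pixels.
rng = np.random.default_rng(0)
right_img = rng.random((12, 24))
left_img = np.roll(right_img, 3, axis=1)
disp = block_match_sad(left_img, right_img, max_disp=5)
interior = disp[2:-2, 7:-2]  # pixels with a full window and search range
```

On this textured synthetic pair every interior pixel recovers the 3-pixel shift. Real images with low texture or occlusions are where SGM's smoothness penalties and learned cost volumes earn their keep.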
