Image Formation & Cameras

An image is the result of light reflecting off scene surfaces, traveling through a lens, and being sampled by a 2D sensor. Computer vision starts from the geometry and radiometry of that process — the mapping from a 3D world point to a pixel intensity. Every later module (calibration, stereo, SfM) builds directly on this model.

Pinhole projection

The simplest camera is the pinhole: a single small aperture and a planar image sensor at focal distance $f$ behind it. A 3D point $\mathbf{X} = (X, Y, Z)$ in the camera frame projects to the image plane via similar triangles:

$$x = f\,\frac{X}{Z}, \qquad y = f\,\frac{Y}{Z}.$$

In homogeneous coordinates this is a linear map. Stacked with intrinsics (focal length, principal point, pixel scaling) and extrinsics (rotation $R$, translation $t$), the full projection is the camera matrix $P = K[R \mid t]$, giving $\mathbf{x} \simeq P\mathbf{X}$.
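
A minimal NumPy sketch of this projection, with invented intrinsics and an identity pose (none of these values come from the text):

```python
import numpy as np

# Assumed example intrinsics: focal lengths and principal point in pixels, zero skew.
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])

# Assumed extrinsics: world frame coincides with the camera frame.
R = np.eye(3)
t = np.zeros((3, 1))

P = K @ np.hstack([R, t])            # 3x4 camera matrix P = K [R | t]

X = np.array([0.5, 0.2, 4.0, 1.0])   # homogeneous world point (X, Y, Z, 1)
x = P @ X                            # homogeneous image point
u, v = x[:2] / x[2]                  # divide by depth Z to get pixel coordinates
print(u, v)                          # -> 420.0 280.0
```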

The pinhole model is the working substrate of nearly all multi-view geometry — it is exactly invertible up to an unknown depth along the viewing ray, and it is the source of every "lift a 2D point to a 3D ray" operation.

Intrinsic and extrinsic parameters

  • Intrinsics $K$ — focal lengths $(f_x, f_y)$, principal point $(c_x, c_y)$, optionally a skew term. They depend only on the camera body + lens combination.
  • Extrinsics $(R, t)$ — the rigid transform from world coordinates into the camera's coordinate frame.

$$K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}, \qquad P = K[R \mid t].$$

Intrinsics are recovered by calibration; extrinsics are estimated per-image during pose estimation, SLAM, or SfM.
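
Inverting the intrinsics makes the "lift a 2D point to a 3D ray" operation above concrete; a sketch reusing the same assumed $K$:

```python
import numpy as np

# Same assumed example intrinsics as above.
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])

u, v = 420.0, 280.0                             # pixel observed in the image
ray = np.linalg.inv(K) @ np.array([u, v, 1.0])  # viewing direction in the camera frame
ray /= np.linalg.norm(ray)                      # normalise to a unit ray

# Depth is unobservable from a single image: every scale > 0 on the ray is consistent.
X = 4.0 * ray / ray[2]                          # e.g. the candidate point at depth Z = 4
print(X)                                        # -> [0.5 0.2 4. ]
```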

Lens distortion

Real lenses deviate from the pinhole. The two dominant components are radial distortion (barrel/pincushion warping that depends on distance from the optical centre) and tangential distortion (slight lens decentering). The Brown–Conrady model is the standard:

$$
\begin{aligned}
x_d &= x\left(1 + k_1 r^2 + k_2 r^4 + k_3 r^6\right) + 2 p_1 x y + p_2\left(r^2 + 2x^2\right),\\
y_d &= y\left(1 + k_1 r^2 + k_2 r^4 + k_3 r^6\right) + p_1\left(r^2 + 2y^2\right) + 2 p_2 x y,
\end{aligned}
$$

with $r^2 = x^2 + y^2$. Modelling distortion and undistorting images is a prerequisite for any geometry-based downstream task.
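
A direct transcription of the forward model, with invented coefficient values for illustration (OpenCV's calibration routines take the same coefficients, ordered $(k_1, k_2, p_1, p_2, k_3)$):

```python
def distort(x, y, k1, k2, k3, p1, p2):
    """Brown-Conrady forward model on normalized coordinates (x, y) = (X/Z, Y/Z)."""
    r2 = x * x + y * y                             # squared distance from optical centre
    radial = 1.0 + k1 * r2 + k2 * r2**2 + k3 * r2**3
    x_d = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    y_d = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    return x_d, y_d

# Illustrative (made-up) coefficients: mild barrel distortion plus slight decentering.
print(distort(0.3, 0.2, k1=-0.2, k2=0.05, k3=0.0, p1=0.001, p2=-0.0005))
```

Undistortion inverts this map; the inverse has no closed form, so it is typically computed iteratively or via a precomputed remap.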

Radiometry: how intensity is formed

The value at a pixel depends on the irradiance hitting the sensor, which in turn depends on scene radiance, surface BRDF, lighting, exposure, lens vignetting, and sensor response. The simplest reasonable model is the image irradiance equation

$$E = L\,\frac{\pi}{4}\left(\frac{d}{f}\right)^{2}\cos^{4}\alpha,$$

where $L$ is scene radiance, $d$ is the aperture diameter, $f$ is the focal length, and $\alpha$ is the angle from the optical axis. The $\cos^4 \alpha$ term is the natural source of vignetting (darker corners). Beyond geometry, sensors apply gamma correction and quantisation, which classical and learned methods alike must remain robust to.
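
The geometric part of this falloff can be tabulated per pixel; a sketch with assumed image size and focal length (both in pixels):

```python
import numpy as np

# Assumed example values: 640x480 sensor, focal length of 800 pixels.
W, H, f = 640, 480, 800.0
u = np.arange(W) - W / 2                       # horizontal offsets from the principal point
v = np.arange(H) - H / 2                       # vertical offsets from the principal point
uu, vv = np.meshgrid(u, v)
cos_alpha = f / np.sqrt(uu**2 + vv**2 + f**2)  # cosine of the off-axis angle per pixel
falloff = cos_alpha**4                         # relative irradiance E / E_center
print(falloff.min())                           # darkest corner, ~0.64 for these values
```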

  • Filters & Convolution — the next layer of the foundations stack: how images are smoothed, sharpened, and differentiated.
  • Camera Calibration — recovering K and the distortion coefficients from images of known patterns.
  • Stereo & Multi-view — observing the same point with two cameras to recover depth by triangulation.
