Convolution & Pooling
The convolutional layer is the deep-learning answer to the question "how do we exploit the fact that visual signals have local, repeated structure?". The classical linear-filter operation becomes a learnable feature extractor, with weight sharing across spatial positions. This page covers the layer's mathematical form, its key hyperparameters (kernel size, stride, padding, dilation), and the pooling operation that classically pairs with it.
The convolutional layer
A 2D convolutional layer maps an input $x \in \mathbb{R}^{C_{\text{in}} \times H \times W}$ to an output $y \in \mathbb{R}^{C_{\text{out}} \times H' \times W'}$ via

$$y_{c,i,j} = b_c + \sum_{c'=1}^{C_{\text{in}}} \sum_{u=0}^{k-1} \sum_{v=0}^{k-1} w_{c,c',u,v}\, x_{c',\,i+u,\,j+v},$$

where $w \in \mathbb{R}^{C_{\text{out}} \times C_{\text{in}} \times k \times k}$ is the learnable kernel tensor and $b \in \mathbb{R}^{C_{\text{out}}}$ the bias. Two properties distinguish it from a dense layer:

- Weight sharing — the same kernel is applied at every spatial position.
- Local receptive field — each output depends on only a $k \times k$ window of inputs.

Compared to a fully-connected layer with the same input/output shape, a conv layer has $C_{\text{out}} C_{\text{in}} k^2 + C_{\text{out}}$ parameters rather than one weight per input-output pair, typically a reduction of several orders of magnitude.
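To make the formula concrete, here is a minimal NumPy sketch of the layer at stride 1 with no padding (the function name and shapes are illustrative, not from any library):

```python
import numpy as np

def conv2d_naive(x, w, b):
    """Direct loop implementation of the layer formula (stride 1, no padding).

    x: input,   shape (C_in, H, W)
    w: kernels, shape (C_out, C_in, k, k)
    b: biases,  shape (C_out,)
    """
    c_out, c_in, k, _ = w.shape
    _, h_in, w_in = x.shape
    h_out, w_out = h_in - k + 1, w_in - k + 1
    y = np.zeros((c_out, h_out, w_out))
    for c in range(c_out):          # each output channel has its own kernel stack
        for i in range(h_out):
            for j in range(w_out):
                # the same w[c] is reused at every (i, j): weight sharing;
                # only a k x k window of x is read: local receptive field
                y[c, i, j] = b[c] + np.sum(w[c] * x[:, i:i + k, j:j + k])
    return y

x = np.random.randn(3, 8, 8)        # C_in = 3, 8x8 input
w = np.random.randn(4, 3, 3, 3)     # C_out = 4, k = 3
print(conv2d_naive(x, w, np.zeros(4)).shape)   # (4, 6, 6)
```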
Padding, stride, dilation
Three knobs control output resolution and receptive field:
- Padding $p$ — pixels added around the input. "Same" padding ($p = (k-1)/2$ for odd $k$ at stride 1) preserves spatial size.
- Stride $s$ — step between applied kernel positions. Stride 2 halves resolution; canonical for downsampling.
- Dilation $d$ — gaps between kernel taps, equivalent to atrous convolution. Increases receptive field without adding parameters; central to DeepLab.
Output spatial size along each dimension:

$$H' = \left\lfloor \frac{H + 2p - d(k-1) - 1}{s} \right\rfloor + 1,$$

and analogously for $W'$.
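A small helper makes it easy to check how the three knobs interact; this follows the floor convention above (the same one PyTorch's `Conv2d` uses):

```python
import math

def conv_out(n, k, p=0, s=1, d=1):
    """Output size along one spatial dimension of size n."""
    return math.floor((n + 2 * p - d * (k - 1) - 1) / s) + 1

print(conv_out(32, 3, p=1))         # 32: "same" padding preserves size
print(conv_out(32, 3, p=1, s=2))    # 16: stride 2 halves resolution
print(conv_out(32, 3, p=2, d=2))    # 32: dilation widens the footprint; padding compensates
```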
Pooling
Pooling reduces spatial resolution and adds a small amount of translation invariance. Max pooling takes the maximum within a $k \times k$ window (most commonly $2 \times 2$ with stride 2); average pooling averages instead. Pooling is parameter-free but discards information within each window. Strided convolution is the modern alternative — a learnable downsampler — and most recent architectures (ResNet's downsampling stages, ViT's patch embedding) use it instead of pooling.
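A PyTorch sketch contrasting the two downsamplers; the channel and spatial sizes here are illustrative:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)                                    # N x C x H x W

pool = nn.MaxPool2d(kernel_size=2, stride=2)                      # parameter-free
strided = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1)   # learnable downsampler

print(pool(x).shape)      # torch.Size([1, 64, 28, 28])
print(strided(x).shape)   # torch.Size([1, 64, 28, 28]) -- same resolution
print(sum(p.numel() for p in strided.parameters()))   # 36,928 extra parameters
```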
Global average pooling (GAP) averages each feature map to a single number, replacing the flatten-then-FC head of older designs. It cuts parameters dramatically and is the standard classifier head for ResNet and almost every successor.
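The savings are easy to quantify. Assuming ResNet-18's final 512×7×7 feature maps and 1000 output classes:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 512, 7, 7)            # final feature maps, N x C x H x W

flatten_fc = nn.Linear(512 * 7 * 7, 1000)            # older flatten-then-FC head
gap_fc = nn.Sequential(nn.AdaptiveAvgPool2d(1),      # GAP: one number per map
                       nn.Flatten(),
                       nn.Linear(512, 1000))

print(sum(p.numel() for p in flatten_fc.parameters()))   # 25,089,000
print(sum(p.numel() for p in gap_fc.parameters()))       # 513,000
print(gap_fc(x).shape)                                   # torch.Size([1, 1000])
```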
Variants worth knowing
- 1×1 convolution — pointwise mixing across channels with no spatial extent. Used as a cheap channel projector inside Inception, ResNet bottlenecks, and MobileNet.
- Depthwise convolution — apply one kernel per input channel independently. Depthwise-separable convolution = depthwise + 1×1 pointwise; the building block of MobileNet, EfficientNet, and Xception (sketched in the code after this list).
- Transposed convolution ("deconvolution") — the upsampling counterpart, used in FCN/U-Net decoders.
- Atrous (dilated) convolution — see Dilation above.
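As a sketch of the first two variants, here is a depthwise-separable block in PyTorch next to a standard convolution, with parameter counts (channel sizes are illustrative):

```python
import torch.nn as nn

c_in, c_out, k = 64, 128, 3

standard = nn.Conv2d(c_in, c_out, k, padding=1)
separable = nn.Sequential(
    nn.Conv2d(c_in, c_in, k, padding=1, groups=c_in),   # depthwise: one k x k kernel per channel
    nn.Conv2d(c_in, c_out, kernel_size=1),              # pointwise 1x1: mixes channels
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard))    # 64*128*9 + 128 = 73,856
print(count(separable))   # (64*9 + 64) + (64*128 + 128) = 8,960
```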
What learned filters look like
Visualising what trained CNN layers respond to (Zeiler & Fergus, ECCV 2014) shows a hierarchy: early layers learn Gabor-like edge detectors and colour blobs; mid-layers learn texture and motif detectors; deep layers learn object parts. This visual hierarchy is the empirical justification for the convolutional inductive bias and the link back to the classical features story.
What to read next
- LeNet & AlexNet — the foundational architectures.
- VGG, Inception, ResNet — the next generation.
- Linear Filters & Convolution — classical convolution that this layer generalises.