Graph Convolutional Networks

A graph convolutional network applies a learnable linear operation to each node, mixing in information from its neighbours. The classical GCN of Kipf & Welling is the simplest reference architecture and the entry point to graph deep learning. The trick is identifying what "convolution" should mean on an irregular graph — and the answer comes from spectral graph theory.

Spectral motivation

For a graph $G$ with $N$ nodes, adjacency matrix $A$, and degree matrix $D$, the normalised graph Laplacian is

$$
\tilde{L} = I - D^{-1/2} A D^{-1/2}.
$$

Its eigendecomposition $\tilde{L} = U \Lambda U^\top$ gives the graph Fourier basis $U$. A signal $x \in \mathbb{R}^N$ is filtered by $y = U g_\theta(\Lambda) U^\top x$, with $g_\theta$ a learnable diagonal filter in the spectral domain.
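
To make the pipeline concrete, here is a minimal NumPy sketch of exact spectral filtering on a toy path graph; the low-pass response used for $g_\theta$ is an arbitrary illustrative choice, not a learned filter.

```python
import numpy as np

# Toy 4-node path graph (illustrative choice).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
deg = A.sum(axis=1)
D_inv_sqrt = np.diag(deg ** -0.5)

# Normalised Laplacian  L~ = I - D^{-1/2} A D^{-1/2}
L = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt

# Graph Fourier basis from the eigendecomposition  L~ = U Lambda U^T
lam, U = np.linalg.eigh(L)

# Spectral filtering of a node signal x:  y = U g_theta(Lambda) U^T x
x = np.array([1.0, 0.0, 0.0, 0.0])
g_theta = np.exp(-2.0 * lam)          # example low-pass response, not learned
y = U @ np.diag(g_theta) @ U.T @ x
print(y)                              # the impulse at node 0 is smoothed along the path
```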

Computing $U$ costs $O(N^3)$, which is infeasible for large graphs. Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering (Defferrard, Bresson, Vandergheynst, NeurIPS 2016) approximates $g_\theta(\Lambda)$ with a truncated Chebyshev polynomial of order $K$, giving $K$-localised filters at $O(K|E|)$ cost with no eigendecomposition.
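
A sketch of that idea, assuming a precomputed normalised Laplacian `L` and signal `x` (for example from the snippet above); the coefficients `theta` stand in for learnable parameters:

```python
import numpy as np

def chebyshev_filter(L, x, theta, lam_max=2.0):
    """Truncated Chebyshev spectral filter of order K = len(theta) - 1 (K >= 1).

    Uses only matrix-vector products with the rescaled Laplacian, so no
    eigendecomposition is needed and the cost is O(K|E|) on sparse graphs.
    """
    n = L.shape[0]
    L_s = (2.0 / lam_max) * L - np.eye(n)          # rescale spectrum into [-1, 1]
    t_prev, t_curr = x, L_s @ x                    # T_0(L_s) x  and  T_1(L_s) x
    y = theta[0] * t_prev + theta[1] * t_curr
    for k in range(2, len(theta)):
        t_next = 2.0 * (L_s @ t_curr) - t_prev     # Chebyshev recurrence
        y = y + theta[k] * t_next
        t_prev, t_curr = t_curr, t_next
    return y
```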

The Kipf-Welling GCN

Semi-Supervised Classification with Graph Convolutional Networks (Kipf, Welling, ICLR 2017) takes the Chebyshev approximation to first order (K=1) and applies a renormalisation trick. The result is the now-canonical layer:

$$
H^{(l+1)} = \phi\left(\hat{A}\, H^{(l)} W^{(l)}\right), \qquad \hat{A} = \tilde{D}^{-1/2} (A + I)\, \tilde{D}^{-1/2},
$$

where $\tilde{D}$ is the degree matrix of $A + I$. Three things happen at each layer:

  1. Self-loops ($A + I$) keep the node's own features in the mix.
  2. Symmetric normalisation (multiplying by $\tilde{D}^{-1/2}$ on both sides) prevents activation magnitudes from depending on node degree.
  3. A linear feature transformation ($W^{(l)}$) followed by the activation $\phi$ (typically ReLU).
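
A minimal NumPy sketch of one such layer, assuming a dense adjacency matrix; the toy graph, feature sizes, and random weights are illustrative only:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def gcn_layer(A, H, W, activation=relu):
    """One GCN layer: H' = phi(A_hat H W) with A_hat = D~^{-1/2} (A + I) D~^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])              # 1. add self-loops
    d_inv_sqrt = np.diag(A_tilde.sum(axis=1) ** -0.5)
    A_hat = d_inv_sqrt @ A_tilde @ d_inv_sqrt     # 2. symmetric normalisation
    return activation(A_hat @ H @ W)              # 3. propagate, transform, activate

# Toy usage: 4-node path graph, 3 input features, 2 output features.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H0 = rng.normal(size=(4, 3))
W0 = rng.normal(size=(3, 2))
H1 = gcn_layer(A, H0, W0)
print(H1.shape)                                   # (4, 2)
```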

A two-layer GCN with a single hidden layer was the original benchmark for semi-supervised node classification on Cora, Citeseer, and Pubmed, beating contemporaneous label-propagation and DeepWalk baselines by large margins.
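
Written out, that two-layer model applies $\hat{A}$ twice:

$$
Z = \operatorname{softmax}\!\left(\hat{A}\,\operatorname{ReLU}\!\left(\hat{A} X W^{(0)}\right) W^{(1)}\right),
$$

where $X$ holds the input node features. The cross-entropy loss is evaluated only on the labelled nodes, while unlabelled nodes still shape the result through $\hat{A}$.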

What GCN computes, intuitively

Each layer is a single hop of neighbour averaging plus a linear transform. Stacking $L$ layers gives each node access to its $L$-hop neighbourhood. Regardless of depth, the expressive power is bounded by the 1-dimensional Weisfeiler-Lehman (1-WL) colour-refinement test: a GCN cannot distinguish graphs that 1-WL cannot distinguish. This is the formal link to graph isomorphism testing and the motivation for more expressive variants such as GIN (Graph Isomorphism Network, Xu et al., ICLR 2019).

Over-smoothing

Stacking many GCN layers produces a counter-intuitive failure: node representations converge to identical values across the graph (in the limit, they project onto the principal eigenvector of $\hat{A}$). After 4–6 layers, accuracy on most benchmarks degrades. Practical work-arounds:

  • Skip connections (similar to ResNet) — JKNet (Xu et al., ICML 2018), GCNII (Chen et al., ICML 2020).
  • Graph normalisation layers that explicitly counter representational collapse.
  • Mixed local–global architectures (e.g., Graph Transformers).

The over-smoothing limitation is the reason most production GNNs are 2–3 layers deep.
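
A small sketch of the effect under simplifying assumptions (random toy graph, no weights or non-linearities): repeatedly propagating features with $\hat{A}$ shrinks the spread of node representations as depth grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
A = (rng.random((n, n)) < 0.2).astype(float)      # random toy graph
A = np.triu(A, 1); A = A + A.T                    # symmetric, no self-edges yet

A_tilde = A + np.eye(n)                           # self-loops guarantee degree >= 1
d_inv_sqrt = np.diag(A_tilde.sum(axis=1) ** -0.5)
A_hat = d_inv_sqrt @ A_tilde @ d_inv_sqrt

H = rng.normal(size=(n, 8))                       # random node features
for depth in (1, 2, 4, 8, 16, 32):
    Hk = np.linalg.matrix_power(A_hat, depth) @ H
    spread = np.linalg.norm(Hk - Hk.mean(axis=0), axis=1).mean()
    print(f"depth {depth:2d}: mean distance to the average node = {spread:.4f}")
```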

What GCN is good for

  • Node classification on citation, social, and recommendation graphs.
  • Link prediction and graph classification with appropriate readout heads.
  • Drug discovery — molecules as graphs; a workhorse architecture in cheminformatics.
  • Recommender systems — bipartite user-item graphs, with GCN as the embedding model.
