Message Passing & GraphSAGE
The Message Passing Neural Network (MPNN) framework generalises GCN, GAT, GraphSAGE, and most other GNNs into a single template: at each layer, every node sends a message to its neighbours, aggregates the messages it receives, and updates its state. Almost all graph deep learning fits this skeleton, and the engineering choices — what's a message, how to aggregate, how to update — are how variants differ.
The MPNN abstraction
Neural Message Passing for Quantum Chemistry (Gilmer et al., ICML 2017) named and formalised the framework. At layer $t$, each node $v$ with hidden state $h_v^{(t)}$ computes

$$m_v^{(t+1)} = \bigoplus_{u \in \mathcal{N}(v)} M_t\big(h_v^{(t)}, h_u^{(t)}, e_{uv}\big), \qquad h_v^{(t+1)} = U_t\big(h_v^{(t)}, m_v^{(t+1)}\big)$$

- $M_t$ — message function (often an MLP over $h_v^{(t)}$, $h_u^{(t)}$, edge features $e_{uv}$).
- $\bigoplus$ — permutation-invariant aggregator (sum, mean, max).
- $U_t$ — update function (an MLP, GRU cell, or LSTM cell).
GCN, GAT, GraphSAGE, GIN, and Graph Transformers all fit this template — they differ only in their choices of message function $M_t$, aggregator $\bigoplus$, and update function $U_t$.
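A minimal sketch of this template in plain PyTorch, with sum as the aggregator (the tensor layout and class name are assumptions for illustration, not a specific library's API):

```python
import torch
import torch.nn as nn

class MPNNLayer(nn.Module):
    """One generic message-passing layer: message -> aggregate (sum) -> update."""
    def __init__(self, node_dim, edge_dim, hidden_dim):
        super().__init__()
        # M_t: message function over (h_v, h_u, e_uv)
        self.message_mlp = nn.Sequential(
            nn.Linear(2 * node_dim + edge_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        # U_t: update function over (h_v, aggregated message)
        self.update_mlp = nn.Sequential(
            nn.Linear(node_dim + hidden_dim, node_dim), nn.ReLU(),
        )

    def forward(self, h, edge_index, edge_attr):
        # h:          [N, node_dim]  node states
        # edge_index: [2, E]         (source u, target v) pairs
        # edge_attr:  [E, edge_dim]  edge features
        src, dst = edge_index
        # 1. messages along each edge u -> v
        msg = self.message_mlp(torch.cat([h[dst], h[src], edge_attr], dim=-1))
        # 2. permutation-invariant aggregation (sum) at each target node
        agg = torch.zeros(h.size(0), msg.size(-1), device=h.device)
        agg.index_add_(0, dst, msg)
        # 3. update each node's state
        return self.update_mlp(torch.cat([h, agg], dim=-1))
```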
Choice of aggregator
The aggregator's expressive power is more important than it may first appear: sum preserves the multiset of neighbour features (including multiplicity, and hence degree), while mean and max collapse that information; this is the central observation behind GIN.
For tasks where node degree matters (e.g., chemistry — atom valence is a function of degree), use sum. For degree-invariant tasks, use mean or max — they give scale-stable representations.
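A toy illustration of the difference (feature values chosen arbitrarily): two neighbourhoods whose feature multisets differ only in size are separated by sum but not by mean or max.

```python
import torch

# Two neighbourhoods: one neighbour vs three neighbours, all with feature 1.0.
n_a = torch.tensor([[1.0]])                 # degree 1
n_b = torch.tensor([[1.0], [1.0], [1.0]])   # degree 3

for name, agg in [("sum",  lambda x: x.sum(0)),
                  ("mean", lambda x: x.mean(0)),
                  ("max",  lambda x: x.max(0).values)]:
    print(name, agg(n_a).item(), agg(n_b).item())
# sum  1.0 3.0   <- degree information survives
# mean 1.0 1.0   <- the two neighbourhoods look identical
# max  1.0 1.0   <- the two neighbourhoods look identical
```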
GraphSAGE — sampling for large graphs
Inductive Representation Learning on Large Graphs (Hamilton, Ying, Leskovec, NeurIPS 2017) addressed the scaling failure of GCN on graphs with millions of nodes. GCN is transductive — every layer touches the full adjacency matrix, so the whole graph must fit in memory and unseen nodes cannot be embedded without retraining. GraphSAGE instead learns aggregator functions applied to a fixed-size sample of each node's neighbourhood, which bounds the per-node cost and makes the model inductive: it can embed nodes that were never seen during training.
GraphSAGE's three aggregator choices — mean, LSTM (over a randomly-ordered neighbour sequence), and pooling (max over per-neighbour MLPs) — became the standard menu for production GNN deployments. The neighbour-sampling idea is the architectural foundation of every billion-node GNN system today (PinSage, GraphSAINT, Cluster-GCN).
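A sketch of one common form of the mean-aggregator update, concatenating the node's own state with the mean of its sampled neighbours; the fixed-fanout tensor layout and names below are illustrative assumptions, not the paper's reference code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SAGEMeanLayer(nn.Module):
    """GraphSAGE-style layer with the mean aggregator."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(2 * in_dim, out_dim)

    def forward(self, h_self, h_neigh):
        # h_self:  [B, in_dim]     features of the B target nodes
        # h_neigh: [B, S, in_dim]  features of S sampled neighbours per target
        agg = h_neigh.mean(dim=1)                        # mean over sampled neighbours
        out = self.lin(torch.cat([h_self, agg], dim=-1)) # concat self with aggregate
        return F.normalize(F.relu(out), p=2, dim=-1)     # L2-normalise the embedding
```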
Training schemes
For node-level tasks on a single large graph, two regimes:
- Full-batch — load the full adjacency, compute messages everywhere. Fastest convergence, but hits a memory ceiling around ~1M nodes.
- Mini-batch with neighbour sampling (GraphSAGE-style) — sample a fixed number of neighbours per node per layer and a fixed number of target nodes per batch (see the sketch after this list). Scales to billions of nodes; typically requires more epochs to match full-batch quality.
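A minimal neighbour-sampling sketch, assuming an adjacency-list dict and uniform sampling (the fan-out values are arbitrary; production systems often use importance-weighted or random-walk sampling instead):

```python
import random

def sample_neighbours(adj, node, fanout):
    """Sample up to `fanout` neighbours of `node` (with replacement if degree < fanout)."""
    neigh = adj[node]
    if len(neigh) >= fanout:
        return random.sample(neigh, fanout)
    return [random.choice(neigh) for _ in range(fanout)]

def build_batch(adj, targets, fanouts=(10, 25)):
    """Expand a batch of target nodes hop by hop; layers[k] holds the node ids needed at hop k."""
    layers = [list(targets)]
    for fanout in fanouts:
        frontier = [u for v in layers[-1] for u in sample_neighbours(adj, v, fanout)]
        layers.append(frontier)
    return layers
```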
For graph-level tasks (chemistry, molecule property prediction), each graph is small (~50 nodes) and a normal mini-batch pools many graphs into one disconnected mega-graph processed in parallel.
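The mega-graph trick is just a block-diagonal adjacency with concatenated node features, plus a graph-id vector for the final per-graph readout; a sketch assuming small dense adjacency matrices:

```python
import torch

def batch_graphs(adjs, feats):
    """Merge a list of small graphs into one disconnected graph.

    adjs:  list of [n_i, n_i] dense adjacency matrices
    feats: list of [n_i, d]   node-feature matrices
    """
    adj = torch.block_diag(*adjs)              # no edges between different graphs
    x = torch.cat(feats, dim=0)                # stacked node features
    graph_id = torch.cat([torch.full((a.size(0),), i, dtype=torch.long)
                          for i, a in enumerate(adjs)])
    return adj, x, graph_id                    # graph_id drives the per-graph pooling/readout
```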
Position vs structure
Vanilla MPNN messages are based purely on graph structure — they do not know "where" in the graph a node is. Graph Transformers (Ying et al., NeurIPS 2021) and GraphGPS (Rampášek et al., NeurIPS 2022) add positional encodings derived from Laplacian eigenvectors or random walks, restoring the global-context capacity that pure local message passing lacks. This is structurally analogous to positional encodings in Transformers.
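A sketch of Laplacian-eigenvector positional encodings on a dense adjacency; the symmetric normalised Laplacian used here is one common choice, and the sign of each eigenvector is arbitrary (typically handled by random sign flips during training):

```python
import torch

def laplacian_pe(adj, k):
    """First k non-trivial eigenvectors of the normalised Laplacian, used as node positional encodings.

    adj: [N, N] dense, symmetric adjacency without self-loops.
    """
    deg = adj.sum(dim=1)
    d_inv_sqrt = torch.where(deg > 0, deg.pow(-0.5), torch.zeros_like(deg))
    lap = torch.eye(adj.size(0)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    eigvals, eigvecs = torch.linalg.eigh(lap)   # eigenvalues in ascending order
    return eigvecs[:, 1:k + 1]                  # drop the trivial eigenvector at eigenvalue 0
```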
What MPNN cannot do
The 1-WL upper bound (message passing is at most as discriminative as the Weisfeiler-Leman colour-refinement test) means MPNNs cannot:
- Distinguish certain regular graphs (e.g., two non-isomorphic 3-regular graphs).
- Reliably count subgraphs or identify cycles beyond a fixed size.
- Capture long-range dependencies past their depth (and depth is limited by over-smoothing).
Higher-order GNNs (k-WL), subgraph GNNs, and Graph Transformers exist precisely to push past these limits.
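To make the first limitation concrete, here is a small colour-refinement (1-WL) sketch on an analogous 2-regular example, a hexagon versus two disjoint triangles: the graphs are not isomorphic, yet refinement ends with identical colour histograms, so a sum/mean/max readout over any MPNN's node states cannot separate them.

```python
from collections import Counter

def wl_colours(adj, rounds=3):
    """1-WL colour refinement; returns the multiset (histogram) of final node colours."""
    colours = {v: 0 for v in adj}                          # uniform initial colouring
    for _ in range(rounds):
        sig = {v: (colours[v], tuple(sorted(colours[u] for u in adj[v]))) for v in adj}
        relabel = {s: i for i, s in enumerate(sorted(set(sig.values())))}
        colours = {v: relabel[sig[v]] for v in adj}
    return Counter(colours.values())

hexagon   = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}   # one 6-cycle
triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1],                   # two disjoint 3-cycles
             3: [4, 5], 4: [3, 5], 5: [3, 4]}

print(wl_colours(hexagon) == wl_colours(triangles))  # True: 1-WL cannot tell them apart
```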
What to read next
- Graph Convolutional Networks — the MPNN special case that started it.
- Graph Attention Networks — message passing with learned weights instead of fixed adjacency.
- Transformer (LLM) — graph attention taken to the limit where every pair of nodes is an edge (a fully connected graph).