
Message Passing & GraphSAGE

The Message Passing Neural Network (MPNN) framework generalises GCN, GAT, GraphSAGE, and most other GNNs into a single template: at each layer, every node sends a message to its neighbours, aggregates the messages it receives, and updates its state. Almost all graph deep learning fits this skeleton, and the engineering choices — what's a message, how to aggregate, how to update — are how variants differ.

The MPNN abstraction

Neural Message Passing for Quantum Chemistry (Gilmer et al., ICML 2017) named and formalised the framework. At layer $\ell$, for each node $v$ with neighbours $\mathcal{N}(v)$:

$$m_v^{(\ell+1)} = \mathrm{AGG}\bigl(\{\, M(h_u^{(\ell)}, h_v^{(\ell)}, e_{uv}) : u \in \mathcal{N}(v) \,\}\bigr), \qquad h_v^{(\ell+1)} = U\bigl(h_v^{(\ell)}, m_v^{(\ell+1)}\bigr).$$

  • $M$ — message function (often an MLP over $h_u^{(\ell)}$, $h_v^{(\ell)}$, and edge features $e_{uv}$).
  • AGG — permutation-invariant aggregator (sum, mean, max).
  • $U$ — update function (an MLP, GRU cell, or LSTM cell).

GCN, GAT, GraphSAGE, GIN, and Graph Transformers all fit this template — they differ in M and AGG.
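
To make the template concrete, here is a minimal numpy sketch of one MPNN layer. Everything in it is illustrative rather than any library's API: $M$ and $U$ are single ReLU-activated linear maps (real models use MLPs), and edge features $e_{uv}$ are omitted.

```python
import numpy as np

def mpnn_layer(h, edges, W_msg, W_upd, agg=np.sum):
    """One message-passing layer over directed edges (u, v): u -> v.

    h     : (N, d) node states h_v^(l)
    edges : iterable of (u, v) pairs
    W_msg : (2d, d) weights for the message function M (a single
            ReLU-activated linear map over [h_u, h_v]; edge features omitted)
    W_upd : (2d, d) weights for the update function U
    agg   : permutation-invariant aggregator (np.sum, np.mean, np.max)
    """
    N, d = h.shape
    inbox = [[] for _ in range(N)]
    for u, v in edges:
        # M: compute the message u sends to v
        inbox[v].append(np.maximum(np.concatenate([h[u], h[v]]) @ W_msg, 0.0))
    m = np.zeros((N, d))
    for v in range(N):
        if inbox[v]:
            # AGG: reduce the received multiset of messages
            m[v] = agg(np.stack(inbox[v]), axis=0)
    # U: combine each node's old state with its aggregated message
    return np.maximum(np.concatenate([h, m], axis=1) @ W_upd, 0.0)
```

Swapping the `agg` argument between `np.sum`, `np.mean`, and `np.max` is exactly the design choice discussed next.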

Choice of aggregator

The aggregator's expressive power matters more than the choice of M. How Powerful are Graph Neural Networks? (Xu et al., ICLR 2019) showed that sum aggregation (combined with a sufficiently deep update MLP) matches the discriminative power of the 1-Weisfeiler-Lehman (1-WL) test, the upper bound for message passing, while mean and max are strictly weaker and fail to distinguish certain neighbourhood multisets. The GIN architecture from that paper is the cleanest "maximally expressive" reference GNN.

For tasks where node degree carries signal (e.g., chemistry, where atom valence is reflected in degree), use sum. For degree-invariant tasks, use mean or max, which give representations that stay stable across varying neighbourhood sizes.
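
A two-line numpy check makes the mean/sum gap concrete: mean aggregation cannot tell a neighbourhood multiset from a scaled-up copy of itself, while sum preserves the size. The feature vectors below are illustrative.

```python
import numpy as np

# Neighbour feature multisets {x, y} and {x, x, y, y}: same composition,
# different size (degree 2 vs degree 4).
nbrs_small = np.array([[1.0, 0.0], [0.0, 1.0]])
nbrs_large = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])

print(nbrs_small.mean(axis=0), nbrs_large.mean(axis=0))  # [0.5 0.5] both: mean is blind to degree
print(nbrs_small.sum(axis=0),  nbrs_large.sum(axis=0))   # [1. 1.] vs [2. 2.]: sum sees it
```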

GraphSAGE — sampling for large graphs

Inductive Representation Learning on Large Graphs (Hamilton, Ying, Leskovec, NeurIPS 2017) addressed the scaling failure of GCN on graphs with millions of nodes. GCN is transductive: every layer touches the full N×N adjacency, and embeddings exist only for nodes seen at training time. GraphSAGE is inductive: it learns aggregation functions rather than per-node embeddings, and at each layer samples a fixed-size random subset of neighbours rather than using all of them (a sketch follows). This makes per-node compute constant in graph size and lets the model generalise to nodes (or whole graphs) unseen at training time.
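
A sketch of the sampling step under simple assumptions: the adjacency is a plain dict, and sampling is uniform with replacement so every node contributes exactly k messages regardless of degree. The function name and signature are ours, not the paper's.

```python
import random

def sample_neighbours(adj, nodes, k, rng=None):
    """For each node, draw k neighbours uniformly with replacement.

    adj   : dict mapping node id -> list of neighbour ids
    nodes : the nodes that need incoming messages at this layer
    k     : fixed fan-out (sample size per node)
    """
    rng = rng or random.Random(0)
    return {v: [rng.choice(adj[v]) for _ in range(k)]
            for v in nodes if adj[v]}
```

Sampling with replacement is the simplest way to keep per-node cost exactly k even when a node's degree is below the fan-out.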

GraphSAGE's three aggregator choices — mean, LSTM (over a randomly ordered neighbour sequence), and pooling (max over per-neighbour MLPs) — became the standard menu for production GNN deployments. The neighbour-sampling idea is the architectural foundation of every billion-node GNN system today (PinSage, GraphSAINT, Cluster-GCN).
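
For reference, a minimal mean-aggregator layer under the same conventions as the sampling sketch above: a single weight matrix over the concatenation of a node's own state and its neighbours' mean, then ReLU and l2-normalisation. Shapes and names are illustrative.

```python
import numpy as np

def sage_mean_layer(h, sampled, W):
    """GraphSAGE-style mean aggregator: concat(own state, mean of
    sampled neighbour states) -> linear map -> ReLU -> l2-normalise.

    h       : (N, d) node states
    sampled : dict node -> list of sampled neighbour ids
    W       : (2d, d_out) learned weights
    """
    out = np.zeros((h.shape[0], W.shape[1]))
    for v, nbrs in sampled.items():
        z = np.concatenate([h[v], h[nbrs].mean(axis=0)]) @ W
        z = np.maximum(z, 0.0)                     # ReLU
        out[v] = z / (np.linalg.norm(z) + 1e-12)   # l2-normalise
    return out
```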

Training schemes

For node-level tasks on a single large graph, two regimes:

  • Full-batch — load the full adjacency and compute messages everywhere. Fastest convergence; hits a memory ceiling around ~1M nodes.
  • Mini-batch with neighbour sampling (GraphSAGE-style) — sample target nodes per batch and k neighbours per node per layer (see the fan-out arithmetic below). Scales to billions of nodes; typically needs more epochs to match full-batch quality.
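
The scaling argument behind the second regime is plain arithmetic: with L layers and per-layer fan-outs k_1, ..., k_L, each target node touches at most 1 + k_1 + k_1·k_2 + ... nodes, independent of graph size. The fan-outs below are illustrative values for a two-layer model.

```python
# Nodes touched per target node under neighbour sampling.
fanouts = [25, 10]          # illustrative per-layer fan-outs
frontier, total = 1, 1
for k in fanouts:
    frontier *= k           # nodes added at this hop
    total += frontier
print(total)                # 1 + 25 + 250 = 276, regardless of graph size
```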

For graph-level tasks (chemistry, molecule property prediction), each graph is small (~50 nodes), and a normal mini-batch stacks many graphs into one disconnected mega-graph with a block-diagonal adjacency, processed in parallel (sketched below).
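
A sketch of that batching step, assuming each graph arrives as a feature matrix plus an edge list; the helper and its return format are ours. Node indices are offset so edges never cross graph boundaries, and a graph-id vector supports per-graph pooling afterwards.

```python
import numpy as np

def batch_graphs(graphs):
    """Stack small graphs into one disconnected mega-graph.

    graphs : list of (features (n_i, d) ndarray, edge list of (u, v))
    Returns stacked features, re-indexed edges, and a graph-id vector.
    """
    feats, edges, graph_id, offset = [], [], [], 0
    for g, (x, e) in enumerate(graphs):
        feats.append(x)
        # Offset node indices so this graph's edges stay within its block
        edges += [(u + offset, v + offset) for u, v in e]
        graph_id += [g] * len(x)
        offset += len(x)
    return np.vstack(feats), edges, np.array(graph_id)
```

Per-graph readout is then a segment reduction, e.g. a mean over all rows sharing a graph id.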

Position vs structure

Vanilla MPNN messages are based purely on local graph structure; a node's representation carries no notion of where in the graph it sits. Graph Transformers such as Graphormer (Ying et al., NeurIPS 2021) and GraphGPS (Rampášek et al., NeurIPS 2022) add positional encodings derived from Laplacian eigenvectors, random walks, or shortest-path distances, restoring the global-context capacity that pure local message passing lacks (a sketch of the Laplacian variant follows). This is structurally analogous to positional encodings in sequence Transformers.
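
A minimal version of the Laplacian-eigenvector encoding, in dense numpy for clarity. It assumes a symmetric adjacency for a connected graph with no isolated nodes, and note the per-eigenvector sign ambiguity, which these models must handle (e.g., by random sign flips during training).

```python
import numpy as np

def laplacian_pe(A, k):
    """Positional encodings from the k smallest non-trivial eigenvectors
    of the symmetric normalised Laplacian L = I - D^{-1/2} A D^{-1/2}.

    A : (n, n) symmetric adjacency matrix of a connected graph
    k : number of eigenvectors to keep per node
    """
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    # Drop the trivial (eigenvalue-0) eigenvector; each remaining column
    # is defined only up to sign.
    return vecs[:, 1:k + 1]
```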

What MPNN cannot do

The 1-WL upper bound means MPNNs cannot:

  • Distinguish certain regular graphs (e.g., two non-isomorphic 3-regular graphs of the same size; see the sketch after this list).
  • Reliably count subgraphs or identify cycles beyond a fixed size.
  • Capture long-range dependencies past their depth (and depth is limited by over-smoothing).
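
The first failure is easy to verify by hand: any two regular graphs with the same degree and node count receive identical 1-WL colourings from uniform initial features. Below, the even simpler 2-regular case: a 6-cycle versus two disjoint triangles, clearly non-isomorphic (one contains 3-cycles, the other does not).

```python
def wl_colours(edges, n, rounds=3):
    """1-WL colour refinement from uniform initial colours; returns the
    sorted colour histogram, which is what the 1-WL test compares."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    col = [0] * n
    for _ in range(rounds):
        # Each node's new colour = (own colour, multiset of neighbour colours)
        sigs = [(col[v], tuple(sorted(col[u] for u in adj[v]))) for v in range(n)]
        relabel = {s: i for i, s in enumerate(sorted(set(sigs)))}
        col = [relabel[s] for s in sigs]
    return sorted(col)

hexagon   = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]   # one 6-cycle
triangles = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3)]   # two 3-cycles

print(wl_colours(hexagon, 6) == wl_colours(triangles, 6))  # True: 1-WL cannot separate them
```

Since every node in both graphs has degree 2, refinement never splits the colour classes, so any MPNN with uniform initial features produces identical embeddings for the two graphs.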

Higher-order GNNs (k-WL), subgraph GNNs, and Graph Transformers exist precisely to push past these limits.
