
Graph Attention Networks

GCN and GraphSAGE aggregate neighbour features with fixed weights — either uniform or degree-normalised. The Graph Attention Network (GAT) replaces these fixed weights with learned attention between each node and its neighbours, the same conceptual move that took NLP from RNN+attention to the Transformer.

The GAT layer

Graph Attention Networks (Veličković, Cucurull, Casanova, Romero, Liò, Bengio, ICLR 2018) defines, for node $i$ with features $h_i \in \mathbb{R}^F$:

$$e_{ij} = \mathrm{LeakyReLU}\!\big(a^\top [\, W h_i \,\|\, W h_j \,]\big), \qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}(i)} \exp(e_{ik})}.$$

The weights $\alpha_{ij}$ are computed only over $i$'s neighbourhood $\mathcal{N}(i)$, not the whole graph — this is what keeps GAT scalable. The output is the attention-weighted sum

$$h_i' = \phi\Big(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}\, W h_j\Big),$$

where $\phi$ is a nonlinearity (the original paper uses ELU).

Multi-head attention concatenates (or averages, in the final layer) $K$ independent attention computations, exactly as in a Transformer.
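
To make the mechanics concrete, here is a minimal from-scratch GAT layer in PyTorch that mirrors the equations above. It is a sketch, not the authors' reference implementation: the edge-list format and tensor shapes are illustrative assumptions, while the 0.2 LeakyReLU slope follows the paper.

```python
# A minimal single-graph GAT layer mirroring the equations above (a sketch,
# not the reference implementation). Assumes the graph is given as an edge
# list edge_index of shape [2, E] with edges (source j, target i).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    def __init__(self, in_dim, out_dim, heads=1):
        super().__init__()
        self.heads, self.out_dim = heads, out_dim
        self.W = nn.Linear(in_dim, heads * out_dim, bias=False)  # shared projection W
        self.a = nn.Parameter(torch.empty(heads, 2 * out_dim))   # attention vector a, per head
        nn.init.xavier_uniform_(self.a)

    def forward(self, h, edge_index):
        src, dst = edge_index                                    # edge (j -> i): src = j, dst = i
        Wh = self.W(h).view(-1, self.heads, self.out_dim)        # [N, K, F']

        # e_ij = LeakyReLU(a^T [W h_i || W h_j]), computed only on existing edges
        cat = torch.cat([Wh[dst], Wh[src]], dim=-1)              # [E, K, 2F']
        e = F.leaky_relu((cat * self.a).sum(-1), negative_slope=0.2)  # [E, K]

        # alpha_ij = softmax of e_ij over each target node's neighbourhood N(i)
        e = e - e.max()                                          # numerical stability; cancels in the ratio
        num = torch.exp(e)
        denom = torch.zeros(h.size(0), self.heads, device=h.device).index_add_(0, dst, num)
        alpha = num / (denom[dst] + 1e-16)                       # [E, K]

        # h_i' = sum_{j in N(i)} alpha_ij W h_j, with the heads concatenated
        out = torch.zeros(h.size(0), self.heads, self.out_dim, device=h.device)
        out.index_add_(0, dst, alpha.unsqueeze(-1) * Wh[src])
        return out.reshape(h.size(0), -1)                        # [N, K * F']
```

A node's own features only enter the sum if the graph contains self-loops, so in practice self-loops are usually added to edge_index first; the nonlinearity $\phi$ and the head-averaging used in a final layer are left to the caller here.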

Why attention helps

In GCN, the neighbour weight $\hat{A}_{ij}$ depends only on degrees — every neighbour is interchangeable up to a constant. GAT gives the model the freedom to attend more to relevant neighbours based on their features, not just the graph topology. Concrete wins:

  • Heterogeneous neighbourhoods — when neighbours have very different roles (e.g., a paper cited by both reviews and primary research), GAT can up-weight the relevant subset.
  • Transferability — the same model trained on one graph generalises better to graphs with different degree distributions, because attention weights adapt locally.
  • Interpretability — $\alpha_{ij}$ can be visualised as a per-edge importance (see the sketch after this list).
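
As an example of the last point, PyTorch Geometric's GATConv can hand back the learned coefficients directly. The toy graph and sizes below are placeholders, and the exact shapes depend on the PyG version, since self-loops are added by default.

```python
# Inspecting the learned alpha_ij with PyTorch Geometric's GATConv (illustrative sizes).
import torch
from torch_geometric.nn import GATConv

x = torch.randn(4, 16)                                   # 4 nodes, 16 input features
edge_index = torch.tensor([[0, 2, 3, 1], [1, 1, 1, 0]])  # edges (source j -> target i)

conv = GATConv(in_channels=16, out_channels=8, heads=2)
out, (edges, alpha) = conv(x, edge_index, return_attention_weights=True)
print(alpha.shape)  # one weight per (edge + added self-loop) per head, e.g. [8, 2]
```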

GAT and the WL hierarchy

GAT does not exceed GCN in the formal expressivity sense — both fit within the 1-Weisfeiler-Lehman bound (see message passing). What GAT improves is practical generalisation: the same architectural class fits real data better because attention provides a flexible inductive bias.

GATv2 — fixing static attention

How Attentive are Graph Attention Networks? (Brody, Alon, Yahav, ICLR 2022) pointed out a subtle flaw in the original GAT: because the attention scoring is

$$e_{ij} = \mathrm{LeakyReLU}\!\big(a^\top [\, W h_i \,\|\, W h_j \,]\big),$$

the ranking of neighbours by attention score is the same regardless of the query node $i$: splitting $a = [a_1 \,\|\, a_2]$, the score decomposes into a term that depends only on $i$ plus a term that depends only on $j$, and the monotonic LeakyReLU cannot mix them, so the ordering over $j$ is fixed for every query. In this sense GAT computes "static" attention. GATv2 swaps the order of operations:

$$e_{ij} = a^\top \mathrm{LeakyReLU}\!\big(W [\, h_i \,\|\, h_j \,]\big),$$

making attention truly query-dependent. Empirically, GATv2 outperforms GAT on every benchmark in the paper and is the recommended default today.
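
The difference is easiest to see by writing both scoring functions directly. The snippet below is a plain-PyTorch sketch with illustrative dimensions; real implementations compute these per head and only over existing edges.

```python
# GAT vs. GATv2 scoring for a single edge (i, j); dimensions are illustrative.
import torch
import torch.nn.functional as F

F_in, F_out = 16, 8

# GAT: the nonlinearity is applied last, so the ranking over j is the same for every query i
W1 = torch.randn(F_out, F_in)
a1 = torch.randn(2 * F_out)
def score_gat(h_i, h_j):
    return F.leaky_relu(a1 @ torch.cat([W1 @ h_i, W1 @ h_j]), negative_slope=0.2)

# GATv2: W and the nonlinearity come first, so a sees a genuinely joint function of (i, j)
W2 = torch.randn(F_out, 2 * F_in)
a2 = torch.randn(F_out)
def score_gatv2(h_i, h_j):
    return a2 @ F.leaky_relu(W2 @ torch.cat([h_i, h_j]), negative_slope=0.2)
```

In PyTorch Geometric the switch is just GATConv to GATv2Conv, which exposes largely the same interface.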

When GAT is worth it

  • Use GCN for small homogeneous citation graphs and as a strong baseline.
  • Use GraphSAGE when the graph is too large for full-batch GCN.
  • Use GAT/GATv2 when neighbours are heterogeneous and the per-edge importance varies — typical in social, knowledge-graph, and recommender tasks (a minimal skeleton follows this list).
  • Use Graph Transformers when the task involves long-range dependencies and the graph is small enough to afford global attention.
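
As a rough template, the skeleton below (assuming PyTorch Geometric; the hidden size, head count, and ELU nonlinearity are illustrative choices) shows a two-layer GATv2 node classifier. Swapping in GCNConv or SAGEConv changes only the layer class and the hidden width, since attention heads concatenate.

```python
# A minimal two-layer GATv2 node classifier (assuming PyTorch Geometric);
# GCNConv or SAGEConv can be dropped in instead, adjusting hidden widths
# because multi-head attention concatenates its heads.
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GATv2Conv

class GATv2NodeClassifier(nn.Module):
    def __init__(self, in_dim, hidden, num_classes, heads=4):
        super().__init__()
        self.conv1 = GATv2Conv(in_dim, hidden, heads=heads)           # output width: hidden * heads
        self.conv2 = GATv2Conv(hidden * heads, num_classes, heads=1)  # single head for the logits

    def forward(self, x, edge_index):
        x = F.elu(self.conv1(x, edge_index))
        return self.conv2(x, edge_index)
```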

Released under the MIT License. Content imported and adapted from NoteNextra.