
Graph Attention Networks

GCN and GraphSAGE aggregate neighbour features with fixed weights — either uniform or degree-normalised. The Graph Attention Network (GAT) replaces these fixed weights with learned attention between each node and its neighbours, the same conceptual move that took NLP from RNN+attention to the Transformer.

The GAT layer

Graph Attention Networks (Veličković, Cucurull, Casanova, Romero, Liò, Bengio, ICLR 2018) defines, for node $i$ with features $h_i \in \mathbb{R}^F$:

$$e_{ij} = \mathrm{LeakyReLU}\!\big(a^\top [\, W h_i \,\|\, W h_j \,]\big), \qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}(i)} \exp(e_{ik})}.$$

The weights $\alpha_{ij}$ are computed only over $i$'s neighbourhood $\mathcal{N}(i)$, not the whole graph — this is what keeps GAT scalable. The output is the attention-weighted sum

$$h_i' = \phi\Big(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}\, W h_j\Big),$$

where $\phi$ is a nonlinearity (the original paper uses ELU).

Multi-head attention concatenates (or averages, in the final layer) $K$ independent attention computations, exactly as in a Transformer.
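
To make the mechanics concrete, here is a minimal from-scratch GAT layer in PyTorch that mirrors the equations above. It is a sketch, not the authors' reference implementation: the edge-list format and tensor shapes are illustrative assumptions, while the 0.2 LeakyReLU slope follows the paper.

```python
# A minimal single-graph GAT layer mirroring the equations above (a sketch,
# not the reference implementation). Assumes the graph is given as an edge
# list edge_index of shape [2, E] with edges (source j, target i).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    def __init__(self, in_dim, out_dim, heads=1):
        super().__init__()
        self.heads, self.out_dim = heads, out_dim
        self.W = nn.Linear(in_dim, heads * out_dim, bias=False)  # shared projection W
        self.a = nn.Parameter(torch.empty(heads, 2 * out_dim))   # attention vector a, per head
        nn.init.xavier_uniform_(self.a)

    def forward(self, h, edge_index):
        src, dst = edge_index                                    # edge (j -> i): src = j, dst = i
        Wh = self.W(h).view(-1, self.heads, self.out_dim)        # [N, K, F']

        # e_ij = LeakyReLU(a^T [W h_i || W h_j]), computed only on existing edges
        cat = torch.cat([Wh[dst], Wh[src]], dim=-1)              # [E, K, 2F']
        e = F.leaky_relu((cat * self.a).sum(-1), negative_slope=0.2)  # [E, K]

        # alpha_ij = softmax of e_ij over each target node's neighbourhood N(i)
        e = e - e.max()                                          # numerical stability; cancels in the ratio
        num = torch.exp(e)
        denom = torch.zeros(h.size(0), self.heads, device=h.device).index_add_(0, dst, num)
        alpha = num / (denom[dst] + 1e-16)                       # [E, K]

        # h_i' = sum_{j in N(i)} alpha_ij W h_j, with the heads concatenated
        out = torch.zeros(h.size(0), self.heads, self.out_dim, device=h.device)
        out.index_add_(0, dst, alpha.unsqueeze(-1) * Wh[src])
        return out.reshape(h.size(0), -1)                        # [N, K * F']
```

A node's own features only enter the sum if the graph contains self-loops, so in practice self-loops are usually added to edge_index first; the nonlinearity $\phi$ and the head-averaging used in a final layer are left to the caller here.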

Why attention helps

In GCN, the neighbour weight $\hat{A}_{ij}$ depends only on degrees — every neighbour is interchangeable up to a constant. GAT gives the model the freedom to attend more to relevant neighbours based on their features, not just the graph topology. Concrete wins:

  • Heterogeneous neighbourhoods — when neighbours have very different roles (e.g., a paper cited by both reviews and primary research), GAT can up-weight the relevant subset.
  • Transferability — the same model trained on one graph generalises better to graphs with different degree distributions, because attention weights adapt locally.
  • Interpretability — $\alpha_{ij}$ can be visualised as a per-edge importance (see the sketch after this list).
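
As an example of the last point, PyTorch Geometric's GATConv can hand back the learned coefficients directly. The toy graph and sizes below are placeholders, and the exact shapes depend on the PyG version, since self-loops are added by default.

```python
# Inspecting the learned alpha_ij with PyTorch Geometric's GATConv (illustrative sizes).
import torch
from torch_geometric.nn import GATConv

x = torch.randn(4, 16)                                   # 4 nodes, 16 input features
edge_index = torch.tensor([[0, 2, 3, 1], [1, 1, 1, 0]])  # edges (source j -> target i)

conv = GATConv(in_channels=16, out_channels=8, heads=2)
out, (edges, alpha) = conv(x, edge_index, return_attention_weights=True)
print(alpha.shape)  # one weight per (edge + added self-loop) per head, e.g. [8, 2]
```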

GAT and the WL hierarchy

GAT does not exceed GCN in the formal expressivity sense — both fit within the 1-Weisfeiler-Lehman bound (see message passing). What GAT improves is practical generalisation: the same architectural class fits real data better because attention provides a flexible inductive bias.

GATv2 — fixing static attention

How Attentive are Graph Attention Networks? (Brody, Alon, Yahav, ICLR 2022) pointed out a subtle flaw in the original GAT: because the attention scoring is

$$e_{ij} = \mathrm{LeakyReLU}\!\big(a^\top [\, W h_i \,\|\, W h_j \,]\big),$$

the ranking of neighbours by attention score is the same regardless of the query node $i$: splitting $a = [a_1 \,\|\, a_2]$, the score decomposes into a term that depends only on $i$ plus a term that depends only on $j$, and the monotonic LeakyReLU cannot mix them, so the ordering over $j$ is fixed for every query. In this sense GAT computes "static" attention. GATv2 swaps the order of operations:

$$e_{ij} = a^\top \mathrm{LeakyReLU}\!\big(W [\, h_i \,\|\, h_j \,]\big),$$

making attention truly query-dependent. Empirically, GATv2 outperforms GAT on every benchmark in the paper and is the recommended default today.
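
The difference is easiest to see by writing both scoring functions directly. The snippet below is a plain-PyTorch sketch with illustrative dimensions; real implementations compute these per head and only over existing edges.

```python
# GAT vs. GATv2 scoring for a single edge (i, j); dimensions are illustrative.
import torch
import torch.nn.functional as F

F_in, F_out = 16, 8

# GAT: the nonlinearity is applied last, so the ranking over j is the same for every query i
W1 = torch.randn(F_out, F_in)
a1 = torch.randn(2 * F_out)
def score_gat(h_i, h_j):
    return F.leaky_relu(a1 @ torch.cat([W1 @ h_i, W1 @ h_j]), negative_slope=0.2)

# GATv2: W and the nonlinearity come first, so a sees a genuinely joint function of (i, j)
W2 = torch.randn(F_out, 2 * F_in)
a2 = torch.randn(F_out)
def score_gatv2(h_i, h_j):
    return a2 @ F.leaky_relu(W2 @ torch.cat([h_i, h_j]), negative_slope=0.2)
```

In PyTorch Geometric the switch is just GATConv to GATv2Conv, which exposes largely the same interface.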

When GAT is worth it

  • Use GCN for small homogeneous citation graphs and as a strong baseline.
  • Use GraphSAGE when the graph is too large for full-batch GCN.
  • Use GAT/GATv2 when neighbours are heterogeneous and the per-edge importance varies — typical in social, knowledge-graph, and recommender tasks (a minimal skeleton follows this list).
  • Use Graph Transformers when the task involves long-range dependencies and the graph is small enough to afford global attention.
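
As a rough template, the skeleton below (assuming PyTorch Geometric; the hidden size, head count, and ELU nonlinearity are illustrative choices) shows a two-layer GATv2 node classifier. Swapping in GCNConv or SAGEConv changes only the layer class and the hidden width, since attention heads concatenate.

```python
# A minimal two-layer GATv2 node classifier (assuming PyTorch Geometric);
# GCNConv or SAGEConv can be dropped in instead, adjusting hidden widths
# because multi-head attention concatenates its heads.
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GATv2Conv

class GATv2NodeClassifier(nn.Module):
    def __init__(self, in_dim, hidden, num_classes, heads=4):
        super().__init__()
        self.conv1 = GATv2Conv(in_dim, hidden, heads=heads)           # output width: hidden * heads
        self.conv2 = GATv2Conv(hidden * heads, num_classes, heads=1)  # single head for the logits

    def forward(self, x, edge_index):
        x = F.elu(self.conv1(x, edge_index))
        return self.conv2(x, edge_index)
```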

Released under the MIT License. Content imported and adapted from NoteNextra.