
Modern CV — Overview

The CV Advances section surveys ten research areas that defined modern computer vision in the 2020s. Each page is structured as a paper list: the canonical references, roughly in the order a reader new to the area would benefit from reading them. Together the topics trace the path from CNN-based vision through transformer backbones into the foundation-model era.

The ten topics

Reading paths

Three suggested paths through the section, depending on background:

What unifies these areas

Three common threads across all ten:

  • Transformer architectures are now standard. Each area's modern frontier (SAM 2, BLIP-2, Sora, DUSt3R, Mask2Former, Grounding DINO) uses Transformer backbones.
  • Foundation-model framing — pretrain a generic vision/multimodal backbone once, adapt to many downstream tasks. CLIP, DINOv2, MAE, SAM are all foundation models in this sense.
  • Language as the universal interface — text is the most common bridge between modalities, and it increasingly serves as the control mechanism and the evaluation protocol. Even pure-vision tasks now route through language for prompting, evaluation, and zero-shot use (see the sketch after this list).
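
To make the last two threads concrete, below is a minimal sketch of zero-shot image classification with CLIP via the Hugging Face transformers library: a backbone pretrained once is adapted to a new classification task purely by writing text prompts, with no fine-tuning. The checkpoint name, image path, and class list are illustrative assumptions, not part of this section's paper lists.

```python
# Minimal sketch: zero-shot classification with CLIP.
# Checkpoint, image path, and class names are illustrative assumptions.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")                                    # any RGB image
prompts = [f"a photo of a {c}" for c in ["cat", "dog", "bicycle"]]   # the task is defined in text

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image                        # image-text similarity scores
probs = logits.softmax(dim=-1)[0]

for prompt, p in zip(prompts, probs):
    print(f"{prompt}: {p.item():.3f}")
```

Swapping the class list changes the task without retraining, which is what "pretrain once, adapt to many downstream tasks" and "language as the interface" look like in practice.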
