Modern CV — Overview
The CV Advances section surveys ten research areas that defined modern computer vision in the 2020s. Each page is structured as a paper list — the canonical references, in roughly the order a reader new to the area would benefit from reading them. Together they trace the path from CNN-based vision through transformer backbones into the foundation-model era.
The ten topics
- Semantic Segmentation — DeepLabv3+, Swin, SegFormer, Mask2Former, SAM, Grounded SAM.
- Vision-Language Models — CLIP, Flamingo, BLIP-2, LLaVA, Gemini, Molmo.
- Neural Rendering — NeRF, Plenoxels, Mip-NeRF 360, 3D Gaussian Splatting.
- Image and Video Generation — AttnGAN, DALL·E, Latent Diffusion, DreamBooth, Sora, Wan.
- Geometric Computer Vision — PoseNet, MeshLoc, DUSt3R, Depth Anything, VGGT.
- Representation Learning — SimCLR, MoCo, MAE, JEPA, DINOv2.
- Correspondence & SfM — COLMAP, SuperGlue, RAFT, LoFTR, LightGlue, MegaSaM.
- Safety, Robustness, Evaluation — Does Object Recognition Work for Everyone?, OccamNets, GeoNet, T2I bias.
- Embodied CV & Robotics — ViNG, ViKiNG, GNM, NoMaD, Navigation World Models.
- Open-Vocabulary Detection — OVR-CNN, MDETR, ViLD, CORA, Grounding DINO.
Reading paths
Three suggested paths through the section, depending on background:
- Generative AI focus — start with Representation Learning, Vision-Language Models, Image and Video Generation, Neural Rendering.
- 3D / robotics focus — start with Geometric Computer Vision, Correspondence & SfM, Neural Rendering, Embodied CV.
- Detection / segmentation focus — start with Semantic Segmentation, Open-Vocabulary Detection, Vision-Language Models.
What unifies these areas
Three common threads across all ten:
- Transformer architectures are now standard. Each area's modern frontier (SAM 2, BLIP-2, Sora, DUSt3R, Mask2Former, Grounding DINO) uses Transformer backbones.
- Foundation-model framing — pretrain a generic vision/multimodal backbone once, adapt to many downstream tasks. CLIP, DINOv2, MAE, SAM are all foundation models in this sense.
- Language as the universal interface — text is the most common bridge across modalities, control mechanisms, and evaluation protocols. Even pure-vision tasks now route through language for prompting, evaluation, and zero-shot use.
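The second and third threads combine into one recipe: encode images and text prompts into a shared embedding space, then classify by similarity to the prompts, with no task-specific training. Below is a minimal NumPy sketch of that zero-shot pattern (popularized by CLIP); the embeddings here are hand-made stand-ins for real encoder outputs, and the function name is illustrative, not from any library.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, temperature=0.01):
    """Softmax over cosine similarities between one image and N text prompts."""
    # L2-normalize so dot products become cosine similarities, as in CLIP
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = txt @ img / temperature
    # Numerically stable softmax
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# Toy, deterministic embeddings standing in for encoder outputs
prompts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
image_emb = np.array([1.0, 0.0, 0.0, 0.5])
text_embs = np.array([
    [0.9, 0.1, 0.0, 0.4],   # closest to the image embedding
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])
probs = zero_shot_classify(image_emb, text_embs)
print(prompts[int(np.argmax(probs))])  # → a photo of a cat
```

Swapping the prompt list changes the classifier, which is exactly what "language as the universal interface" means in practice: the label space is just text.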
What to read next
- CV Foundations — for the classical material these methods build on.
- Deep Vision Architectures — the CNN/ViT backbone era.
- LLM track — the language side of the modern multimodal frontier.