Modern CV — Overview
The CV Advances section surveys ten research areas that defined modern computer vision in the 2020s. Each page is structured as a paper list — the canonical references, in roughly the order a reader new to the area would benefit from reading them. Together they trace the path from CNN-based vision through transformer backbones into the foundation-model era.
The ten topics
- Semantic Segmentation — DeepLabv3+, Swin, SegFormer, Mask2Former, SAM, Grounded SAM.
- Vision-Language Models — CLIP, Flamingo, BLIP-2, LLaVA, Gemini, Molmo.
- Neural Rendering — NeRF, Plenoxels, Mip-NeRF 360, 3D Gaussian Splatting.
- Image and Video Generation — AttnGAN, DALL·E, Latent Diffusion, DreamBooth, Sora, Wan.
- Geometric Computer Vision — PoseNet, MeshLoc, DUSt3R, Depth Anything, VGGT.
- Representation Learning — SimCLR, MoCo, MAE, JEPA, DINOv2.
- Correspondence & SfM — COLMAP, SuperGlue, RAFT, LoFTR, LightGlue, MegaSaM.
- Safety, Robustness, Evaluation — Does Object Recognition Work for Everyone?, OccamNets, GeoNet, T2I bias.
- Embodied CV & Robotics — ViNG, ViKiNG, GNM, NoMaD, Navigation World Models.
- Open-Vocabulary Detection — OVR-CNN, MDETR, ViLD, CORA, Grounding DINO.
Reading paths
Three suggested paths through the section, depending on background:
- Generative AI focus — start with Representation Learning, Vision-Language Models, Image and Video Generation, Neural Rendering.
- 3D / robotics focus — start with Geometric Computer Vision, Correspondence & SfM, Neural Rendering, Embodied CV.
- Detection / segmentation focus — start with Semantic Segmentation, Open-Vocabulary Detection, Vision-Language Models.
What unifies these areas
Three common threads across all ten:
- Transformer architectures are now standard. Each area's modern frontier (SAM 2, BLIP-2, Sora, DUSt3R, Mask2Former, Grounding DINO) uses Transformer backbones.
- Foundation-model framing — pretrain a generic vision/multimodal backbone once, adapt to many downstream tasks. CLIP, DINOv2, MAE, SAM are all foundation models in this sense.
- Language as the universal interface — text is the most common bridge across modalities, control mechanisms, and evaluation protocols. Even pure-vision tasks now route through language for prompting, evaluation, and zero-shot use.
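The second and third threads combine into one recipe: encode images and text prompts into a shared embedding space, then classify by similarity to the prompts, with no task-specific training. Below is a minimal NumPy sketch of that zero-shot pattern (popularized by CLIP); the embeddings here are hand-made stand-ins for real encoder outputs, and the function name is illustrative, not from any library.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, temperature=0.01):
    """Softmax over cosine similarities between one image and N text prompts."""
    # L2-normalize so dot products become cosine similarities, as in CLIP
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = txt @ img / temperature
    # Numerically stable softmax
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# Toy, deterministic embeddings standing in for encoder outputs
prompts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
image_emb = np.array([1.0, 0.0, 0.0, 0.5])
text_embs = np.array([
    [0.9, 0.1, 0.0, 0.4],   # closest to the image embedding
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])
probs = zero_shot_classify(image_emb, text_embs)
print(prompts[int(np.argmax(probs))])  # → a photo of a cat
```

Swapping the prompt list changes the classifier, which is exactly what "language as the universal interface" means in practice: the label space is just text.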
What to read next
- CV Foundations — for the classical material these methods build on.
- Deep Vision Architectures — the CNN/ViT backbone era.
- LLM track — the language side of the modern multimodal frontier.