Gemini 1.5 (Long Context)

Google released Gemini 1.5 Pro on February 15, 2024, with a feature no other frontier model offered: a 1-million-token context window, demonstrated in research runs up to 10M tokens. The release shifted the long-context conversation from "incremental gains" to "qualitatively different capability". Gemini 1.5 was the first frontier model that could take an entire codebase, an entire book, or a multi-hour video in a single prompt.

The release

Gemini 1.5 Pro followed Gemini 1.0 (Dec 2023) by a few months. The technical report (Google DeepMind, 2024) emphasised three points:

  • 1M-token context in the publicly-released version, with research-grade extension to 10M.
  • Mixture-of-Experts architecture — explicitly disclosed, unusual for frontier models.
  • Native multimodal training — text, image, audio, video tokens jointly trained from scratch (not bolted on after).

Gemini 1.5 Pro launched at roughly the same general-capability tier as Gemini 1.0 Ultra (the previous Google flagship) but at much lower training compute, plus the long-context superpower.

What 1M tokens unlocks

Three demonstrations stood out in the launch:

  • Reading an entire codebase. Gemini 1.5 Pro could ingest the JAX repository in one prompt and answer cross-file questions accurately.
  • Watching hour-long video. Frames sampled to fit in context; the model answered questions about events at specific timestamps.
  • Translating from a low-resource language. Given a single grammar book and dictionary in context, Gemini 1.5 Pro learned to translate between English and Kalamang (a language with ~200 speakers, with very little online data) at meaningful quality. This was the headline "in-context learning at long context" demo.

Each result rests on the same property: with enough context, the model can do task-specific learning at inference time that previously required fine-tuning.
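To get a feel for what a 1M-token budget buys, here is a back-of-envelope calculator. The per-item costs are assumptions for illustration (roughly 258 tokens per sampled video frame and ~4 characters per text token are commonly cited heuristics, not disclosed Gemini constants):

```python
# Back-of-envelope budgeting for a 1M-token context window.
# TOKENS_PER_VIDEO_FRAME and the chars-per-token ratio are
# illustrative assumptions, not disclosed Gemini constants.

CONTEXT_WINDOW = 1_000_000

TOKENS_PER_VIDEO_FRAME = 258   # assumed cost of one sampled frame
CHARS_PER_TEXT_TOKEN = 4       # rough heuristic for English text


def max_video_frames(window: int = CONTEXT_WINDOW) -> int:
    """How many sampled video frames fit in the window."""
    return window // TOKENS_PER_VIDEO_FRAME


def max_text_chars(window: int = CONTEXT_WINDOW) -> int:
    """Roughly how many characters of plain text fit in the window."""
    return window * CHARS_PER_TEXT_TOKEN


frames = max_video_frames()
# At 1 frame per second, this is how much video fits:
print(f"{frames} frames ~= {frames / 3600:.1f} hours at 1 fps")
print(f"~= {max_text_chars():,} characters of plain text")
```

Under these assumptions, the window holds on the order of an hour of 1-fps video or a few million characters of text, which is consistent with the "entire codebase / hour-long video / whole book" demonstrations above.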

Needle-in-a-haystack

The 1M-token claim was backed by needle-in-haystack tests: insert a "fact" sentence at a random position in a long document, then ask the model to retrieve it. Gemini 1.5 Pro achieved >99% recall across the full 1M context. The 10M research version achieved similar recall on the longer windows.

The caveat (already understood by then): needle-in-haystack measures retrieval, not synthesis. Gemini 1.5 Pro could find a single fact reliably; reasoning across multiple disconnected facts at the long-context end of the window was less consistent.
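The test protocol itself is simple enough to sketch. In the harness below, `query_model` is a hypothetical stand-in for a real long-context model call; it is stubbed with an exact string search so the harness runs offline:

```python
# Sketch of a needle-in-a-haystack evaluation harness.
# `query_model` is a hypothetical stub, not a real API call.

NEEDLE = "The magic number for this document is 48721."
EXPECTED_ANSWER = "48721"


def build_haystack(filler: str, needle: str, depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end)."""
    pos = int(len(filler) * depth)
    return filler[:pos] + " " + needle + " " + filler[pos:]


def query_model(prompt: str) -> str:
    # Stub: a real harness would send `prompt` plus the question
    # "What is the magic number?" to the model under test.
    return EXPECTED_ANSWER if NEEDLE in prompt else "unknown"


def recall_at_depths(filler: str, depths: list[float]) -> float:
    """Fraction of insertion depths at which the needle is retrieved."""
    hits = 0
    for d in depths:
        prompt = build_haystack(filler, NEEDLE, d)
        if query_model(prompt) == EXPECTED_ANSWER:
            hits += 1
    return hits / len(depths)


filler_text = "Lorem ipsum dolor sit amet. " * 2_000   # stand-in corpus
depths = [i / 10 for i in range(11)]                   # 0%, 10%, ..., 100%
print(f"recall: {recall_at_depths(filler_text, depths):.0%}")
```

Real harnesses sweep both context length and insertion depth, producing the familiar recall heatmaps; the caveat above is exactly that a perfect heatmap proves retrieval, not cross-fact reasoning.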

Architecture and training

The technical report disclosed:

  • MoE-based decoder-only Transformer. Many experts, sparse routing.
  • Native multimodal pretraining — image, audio, video tokenised and trained alongside text. No separate "vision encoder" plus "language model" assembly.
  • Long-context training at scale — Google trained explicitly at long context, not relying purely on post-hoc RoPE extension.

Specific architectural details (number of experts, exact parameter count, training corpus) were not disclosed. External estimates put Gemini 1.5 Pro at roughly 200-500B total parameters with ~50-100B active per token, but these are not confirmed.
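The sparse-routing idea can be sketched independently of Gemini's undisclosed details: a learned gate scores every expert per token, only the top-k experts actually run, and their outputs are combined weighted by the renormalised gate scores. Everything below — the sizes, k, and the experts themselves — is illustrative, not Gemini's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; Gemini's expert count and dimensions
# are not disclosed.
d_model, n_experts, top_k = 16, 8, 2

# Each "expert" is a tiny feed-forward layer (one weight matrix here).
experts = [rng.normal(0, 0.02, (d_model, d_model)) for _ in range(n_experts)]
gate_w = rng.normal(0, 0.02, (d_model, n_experts))  # router weights


def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()


def moe_layer(token: np.ndarray) -> np.ndarray:
    """Route one token through its top-k experts, mixing by gate weight."""
    scores = softmax(token @ gate_w)                  # score every expert
    chosen = np.argsort(scores)[-top_k:]              # keep the top-k
    weights = scores[chosen] / scores[chosen].sum()   # renormalise
    # Only the chosen experts execute — this is the sparsity win:
    # compute scales with top_k, not with n_experts.
    return sum(w * (token @ experts[i]) for w, i in zip(weights, chosen))


token = rng.normal(size=d_model)
out = moe_layer(token)
print(out.shape)  # (16,)
```

This is why external estimates quote both a total parameter count (all experts) and a much smaller active-per-token count (only the routed experts).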

Multimodality

Gemini 1.5 Pro's audio and video capabilities were stronger than its competitors' at launch. Demonstrations:

  • Audio Q&A — transcribe and answer questions about hours of meeting audio.
  • Video understanding — frame-level analysis of long videos, action recognition, narrative summarisation.
  • Mixed-modality prompts — "given this video clip and this PDF, answer..." worked seamlessly.

This native-multimodal training paid off: Gemini's audio capabilities in particular were ahead of GPT-4-class models at launch.

Gemini 1.5 Flash

Released at Google I/O 2024 (May), Gemini 1.5 Flash is the cheap, fast tier: sub-second latency and much lower API cost, while retaining most of the long-context capability. By late 2024, Flash had become the highest-volume Gemini variant by API traffic, particularly for retrieval-augmented (RAG-style) applications.

What followed

  • Gemini 2.0 Flash (Dec 2024) — refresh with stronger general capabilities and native image generation.
  • Gemini 2.5 Pro (March 2025) — multi-modal flagship with substantial reasoning improvements.
  • Gemini 3 (late 2025) — next-generation flagship.

The 1M-token (or longer) context window has remained a Gemini differentiator: competing frontier models reached 200K-500K tokens, but none shipped a publicly available 1M+ model in the same timeframe.

What Gemini 1.5 established

  • Long context as a differentiator. Past 200K, the use cases shift from "process a document" to "process a codebase / a movie / a book". Gemini 1.5 demonstrated this market.
  • Native multimodal training. Frontier models increasingly start with all modalities mixed, not bolted on after.
  • MoE at frontier scale, publicly disclosed. Helped legitimise MoE as the default choice for frontier-LLM training in 2024.
  • In-context learning at long context. The grammar-book translation result was the strongest demonstration to date that ICL scales with available context.

Released under the MIT License. Content imported and adapted from NoteNextra.