LLM Agents
An "agent" is a language model wrapped in a loop that lets it take actions — calling tools, querying APIs, executing code — and observe the results before deciding its next action. The interesting research questions are how the model learns which tools exist, when to call them, and how to maintain coherent state across long task horizons.
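The loop itself is simple. A minimal sketch, with a toy stand-in for the model (the real thing would be an LLM call returning a parsed action):

```python
# Minimal agent loop: the model proposes an action, the harness executes
# it, and the observation is appended to the history the model sees next.
# `model` and the action format are illustrative, not any specific framework.

def run_agent(model, tools, task, max_steps=10):
    history = [("task", task)]
    for _ in range(max_steps):
        name, arg = model(history)         # e.g. ("calc", "2+2") or ("final", answer)
        if name == "final":
            return arg
        observation = tools[name](arg)     # execute the chosen tool
        history.append(("action", (name, arg)))
        history.append(("observation", observation))
    return None  # step budget exhausted

# Toy "model": makes one calculator call, then reads the answer off.
def toy_model(history):
    if history[-1][0] == "observation":
        return ("final", history[-1][1])
    return ("calc", "2+2")

tools = {"calc": lambda expr: str(eval(expr))}
print(run_agent(toy_model, tools, "what is 2+2?"))  # → 4
```

Everything below is a variation on this loop: what goes in `tools`, how the model learns to emit the calls, and what `history` is allowed to contain.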
Toolformer — self-supervised tool use
Toolformer (Schick et al., 2023) was the first clean demonstration that an LM can teach itself to use tools without per-tool human annotation. The procedure:
- Prompt the base LM with few-shot examples to insert candidate API calls into a corpus.
- Execute each candidate; keep only the calls whose result, when spliced back in, lowers the LM's loss on the surrounding tokens.
- Fine-tune on the filtered, tool-augmented data.
The loss-reduction filter is the trick — it gives a self-supervised signal for "this call was useful" without any human labels. The resulting model can use a calculator, calendar, QA system, translator, and search API zero-shot.
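The filter can be sketched in a few lines. Here `lm_loss` is a hypothetical callable returning the LM's loss on a continuation given a prefix; the paper's actual criterion uses a weighted cross-entropy and also compares against the loss of inserting the call *without* its result:

```python
# Sketch of Toolformer's self-supervised filter (assumptions: `lm_loss`,
# the call-splicing format, and the threshold tau are illustrative).

def keep_call(lm_loss, prefix, call, result, continuation, tau=0.5):
    # Loss on the following tokens with no API call inserted.
    base = lm_loss(prefix, continuation)
    # Loss when the call and its result are spliced into the context.
    augmented = lm_loss(prefix + f" [{call} -> {result}]", continuation)
    # Keep the call only if the result makes the continuation
    # at least tau easier to predict.
    return base - augmented >= tau

# Toy loss: cheap to predict when the continuation already appears in context.
def toy_loss(prefix, continuation):
    return 1.0 if continuation.strip() in prefix else 3.0

print(keep_call(toy_loss, "Out of 1400 participants, 400 passed, i.e.",
                "calc(400/1400)", "0.29", " 0.29"))  # → True
```

A useless call (a calculator result spliced before unrelated text) leaves the loss unchanged and gets filtered out, which is exactly the label-free signal the paper relies on.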
ToolLLM and ToolBench — scaling the tool registry
ToolLLM (Qin et al., 2024) scales tool use from a handful of APIs to 16k+ real-world REST APIs scraped from RapidAPI. Two contributions: a synthetic instruction dataset (ToolBench) generated by ChatGPT pairing user queries with multi-tool solution paths, and DFSDT (Depth-First Search-based Decision Tree), a planner that lets the agent backtrack from dead-end tool calls instead of marching forward greedily. Fine-tuning LLaMA on ToolBench closes most of the gap to GPT-4 on real API usage.
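The backtracking behaviour of DFSDT can be sketched as a plain depth-first search over tool-call sequences. In the real planner, `propose` is the LLM suggesting candidate next calls and `execute` hits a live API; here they are toy stand-ins:

```python
# Sketch of DFSDT-style planning (assumptions: `propose`, `execute`, and
# `is_solved` are hypothetical hooks; the real system queries the LLM for
# candidates and treats API errors as dead ends).

def dfs_plan(state, propose, execute, is_solved, depth=0, max_depth=5):
    if is_solved(state):
        return []                    # success: empty remaining path
    if depth >= max_depth:
        return None
    for call in propose(state):
        new_state = execute(state, call)
        if new_state is None:
            continue                 # dead-end call: backtrack, try the next
        path = dfs_plan(new_state, propose, execute, is_solved,
                        depth + 1, max_depth)
        if path is not None:
            return [call] + path     # prepend this call to the winning path
    return None                      # every branch from this node failed

# Toy problem: the goal is reached via calls "a" then "b"; "x" always fails.
propose = lambda s: ["x", "a", "b"]
execute = lambda s, c: None if c == "x" else s + c
print(dfs_plan("", propose, execute, lambda s: s == "ab"))  # → ['a', 'b']
```

The greedy baseline would commit to the first plausible call and have no way to recover from `"x"`; the search tree is what lets the agent retreat from a failed API and try a sibling instead.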
ART — Automatic Reasoning and Tool-use
ART: Automatic Multi-step Reasoning and Tool-use for Large Language Models (Paranjape et al., 2023) sits between Chain-of-Thought and Toolformer. ART uses a task library of decomposed reasoning programs: when given a new task, it retrieves similar program skeletons, lets the LM fill them in, and pauses execution at tool-call markers to let an external function (search, calc) supply the result. No fine-tuning required — the agent is constructed entirely at inference time.
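The pause-and-resume execution can be sketched as an interpreter over a retrieved program skeleton. The step format and names here are illustrative, not from the ART codebase:

```python
# Sketch of ART-style paused execution (assumptions: a program is a list of
# steps; "tool" steps halt generation and call an external function, "llm"
# steps are filled in by the model — stubbed here with a lambda).

def run_program(steps, tools, llm):
    context = []
    for kind, payload in steps:
        if kind == "tool":
            name, arg = payload
            context.append(tools[name](arg))  # pause: external tool supplies result
        else:  # "llm": the model continues reasoning over the context so far
            context.append(llm(payload, context))
    return context[-1]

# Toy run: one calculator step, then the "model" reads the result off.
tools = {"calc": lambda e: str(eval(e))}
llm = lambda prompt, ctx: f"The answer is {ctx[-1]}"
steps = [("tool", ("calc", "17*3")), ("llm", "state the answer")]
print(run_program(steps, tools, llm))  # → The answer is 51
```

Because the skeleton is retrieved rather than trained, swapping in a new task means writing (or retrieving) a new step list, not fine-tuning anything.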
A-MEM — agent memory
Long-horizon agents need to remember things across many turns of tool use. A-MEM: Agentic Memory for LLM Agents (Xu et al., 2024) proposes a Zettelkasten-style memory: each interaction yields a "note" with auto-generated tags and links to existing notes; retrieval is graph-walked, not just nearest-neighbour. The contribution over a flat vector store is structured association — the agent can find context by following relationships, not just by embedding similarity.
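The note-and-link structure can be sketched with tags standing in for A-MEM's auto-generated keywords and embeddings; the real system uses an LLM to tag each note and decide which links to create:

```python
# Sketch of a Zettelkasten-style agent memory (assumptions: shared tags
# stand in for A-MEM's LLM-generated links; names are illustrative).

class Memory:
    def __init__(self):
        self.notes = {}   # note id -> (text, tag set)
        self.links = {}   # note id -> set of linked note ids

    def add(self, note_id, text, tags):
        self.notes[note_id] = (text, set(tags))
        self.links.setdefault(note_id, set())
        # Link the new note to every existing note that shares a tag.
        for other, (_, other_tags) in self.notes.items():
            if other != note_id and set(tags) & other_tags:
                self.links[note_id].add(other)
                self.links[other].add(note_id)

    def retrieve(self, tag, hops=1):
        # Seed with notes matching the tag, then walk links outward.
        found = {nid for nid, (_, tags) in self.notes.items() if tag in tags}
        frontier = set(found)
        for _ in range(hops):
            frontier = {n for f in frontier for n in self.links[f]} - found
            found |= frontier
        return sorted(found)

mem = Memory()
mem.add("n1", "used the calc tool", ["tools", "math"])
mem.add("n2", "calc returned 4",    ["math"])
mem.add("n3", "user likes terse answers", ["style"])
print(mem.retrieve("tools"))  # → ['n1', 'n2']
```

The graph walk is the point: `n2` never carries the `tools` tag, but it is reachable through the shared-`math` link, which a flat nearest-neighbour lookup over tags would miss.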
Reading list
- Toolformer: Language Models Can Teach Themselves to Use Tools — Schick, Dwivedi-Yu, Dessì, Raileanu, Lomeli, Zettlemoyer, Cancedda, Scialom, NeurIPS 2023.
- ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs — Qin et al., ICLR 2024.
- ART: Automatic Multi-step Reasoning and Tool-use for Large Language Models — Paranjape, Lundberg, Singh, Hajishirzi, Zettlemoyer, Tulio Ribeiro, 2023.
- A-MEM: Agentic Memory for LLM Agents — Xu, Mei, Liu, Zhang, 2024.
What to read next
- Agentic RAG — agents whose primary tool is a search engine.
- Chain of Thought — the reasoning substrate that makes tool-use planning possible.
- RLVR — RL for agents whose action correctness is verifiable.