
🔧 LLM-Tool-Integrated-Reasoning-TIR-Papers

A curated collection of papers on Tool-Integrated Reasoning (TIR) — a rapidly evolving research direction where Large Language Models (LLMs) interact with external tools such as calculators, search engines, code interpreters, and web APIs to enhance reasoning, decision-making, and factual accuracy.

🧠 "The ability to use tools is what sets humans apart from other animals."
🤖 Likewise, the ability to use tools is what transforms an LLM into an agent.

TIR marks a critical milestone in the evolution of LLMs: it extends models beyond static parametric knowledge, enabling them to dynamically interact with the external world via:

  • 🖥️ Python interpreters
  • 🔍 Search engines
  • 🧮 Calculators
  • 🌐 Web APIs
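
A minimal sketch of this interaction loop, in Python, is below. Everything in it (the `generate` callable, the `<tool>`/`<result>` tag format, and the toy tools) is a hypothetical placeholder for illustration, not an interface from any specific paper:

```python
import re

# Toy tool registry (hypothetical). Real systems would sandbox execution.
TOOLS = {
    "calculator": lambda expr: str(eval(expr)),               # eval is unsafe; illustration only
    "search": lambda query: f"[top documents for: {query}]",  # stubbed search engine
}

def tir_loop(prompt, generate, max_turns=5):
    """Alternate model generation with tool execution until a final answer."""
    transcript = prompt
    for _ in range(max_turns):
        step = generate(transcript)  # model continues the transcript
        transcript += step
        call = re.search(r'<tool name="(\w+)">(.*?)</tool>', step, re.DOTALL)
        if call is None:             # no tool call: the model has answered
            return transcript
        name, arg = call.groups()
        result = TOOLS[name](arg.strip())               # execute the tool
        transcript += f"\n<result>{result}</result>\n"  # feed the observation back
    return transcript
```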

Although this list focuses on Tool-Integrated Reasoning, we also include earlier or adjacent works on LLMs + Tools that may not explicitly involve reasoning, in order to provide a more complete historical and technical context.

📌 Note: This list focuses on tool-integrated reasoning with text-only LLMs and does not include multimodal models.


🔍 Filter by Category

🎯 Tool Type: Code | Search | Calculator | Multi-tool

📘 Training Method: Prompt-only | SFT | RL


📜 Paper List

| Paper | Date | Code | Tags | Summary |
|---|---|---|---|---|
| WebGPT: Browser-assisted question-answering with human feedback | 2021-12 | Not officially released | `search` `browser` `sft` `rlhf` | WebGPT is an early tool-augmented QA agent that trains GPT-3 to use a simulated browser for information retrieval via SFT and RLHF, enabling it to answer questions with citations in a fixed "browse → answer" pipeline. |
| Toolformer: Language Models Can Teach Themselves to Use Tools | 2023-02 | Unofficial | `wiki-search` `calculator` `calendar` `qa-api` `mt-api` `sft` | Toolformer enables LLMs to learn when and how to use external tools by generating self-supervised training data and fine-tuning via SFT, without human annotation. |
| HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face | 2023-03 | Official | `multi-tool` `prompt` | HuggingGPT uses prompt-driven planning to let an LLM act as a central controller that delegates tasks to expert models, enabling multi-model collaboration via natural language. |
| Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models | 2023-04 | Official | `multi-tool` `prompt` | Chameleon is a plug-and-play framework that uses an LLM as a natural-language planner to dynamically compose and execute sequences of external tools for complex, multi-modal reasoning tasks, without requiring additional training. |
| ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings | 2023-05 | Official | `multi-tool` `embedding` `tool-tokenization` | ToolkenGPT lets a frozen LLM call tools the same way it predicts tokens, by adding learnable tool-specific embeddings ("toolkens") to the vocabulary without modifying model parameters (a minimal sketch follows the table). |
| CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing | 2023-05 | Official | `search` `python-interpreter` `api` `prompt` | CRITIC introduces a tool-augmented self-correction framework that leverages external feedback (e.g., search, code interpreters, toxicity detectors) without updating model weights, showing that external signals are crucial for reliable error correction beyond the model's own limited self-reflection. |
| Gorilla: Large Language Model Connected with Massive APIs | 2023-05 | Official | `api` `instruction-tuning` `retriever` `sft` | Gorilla enables LLMs to accurately and robustly use large-scale real-world APIs by introducing Retriever-Aware Training, which teaches the model to reason over and selectively use retrieved API documentation. |
| ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs | 2023-07 | Official | `multi-tool` `rapidapi` `instruction-tuning` `retriever` `sft` | ToolLLM introduces a fully reproducible framework for tool-use instruction tuning, leveraging ChatGPT to construct ToolBench, a large-scale, diverse dataset covering 16K+ real-world RESTful APIs, so that open-source LLMs can learn single- and multi-tool calling through SFT. |
| ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving | 2023-09 | Official | `python-interpreter` `sft` | ToRA combines natural-language reasoning with external tool use to solve mathematical problems through supervised fine-tuning and output space shaping, which encourages the model to learn from multiple valid reasoning trajectories by correcting and supervising diverse generated outputs rather than relying on a single reference. |
| EASYTOOL: Enhancing LLM-based Agents with Concise Tool Instruction | 2024-01 | Official | `rapidapi` | EASYTOOL is a prompt-based framework that converts diverse, redundant, and incomplete tool documentation into concise, structured instructions, significantly improving tool-usage accuracy and efficiency in LLM-based agents. |
| AnyTool: Self-Reflective, Hierarchical Agents for Large-Scale API Calls | 2024-02 | Official | `rapidapi` `multi-tool` `prompt` | AnyTool is a plug-and-play, prompt-based agent system that leverages GPT-4's function calling to perform large-scale API calls without any model fine-tuning. |
| Seal-Tools: Self-Instruct Tool Learning Dataset for Agent Tuning and Detailed Benchmark | 2024-05 | Official | `api` `multi-tool` `sft` | Seal-Tools uses self-instructed, simulated APIs (rather than real services) to train and evaluate LLMs' tool-calling abilities in a scalable, controlled way. |
| TART: An Open-Source Tool-Augmented Framework for Explainable Table-based Reasoning | 2024-09 | Official | `python-interpreter` `sft` | TART introduces a tool-augmented framework that integrates dynamically generated Python functions into LLMs for explainable, precise table-based reasoning, significantly outperforming traditional chain-of-thought methods while maintaining interpretability. |
| START: Self-taught Reasoner with Tools | 2025-03 | Not officially released | `python-interpreter` `hint-infer` `sft` | START enables Qwen to self-learn tool use by inserting natural-language hints into reasoning paths to generate TIR data, then fine-tuning via SFT to internalize tool-calling abilities. |
| R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning | 2025-03 | Official | `search` `rlvr` | R1-Searcher introduces a two-stage RLVR framework that teaches LLMs to use search tools for multi-hop QA, first encouraging tool use and then optimizing for correct answers on increasingly difficult examples. |
| Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning | 2025-03 | Official | `search` `rlvr` | Search-R1 trains LLMs via RLVR to interleave reasoning with search-engine calls, using structured prompts and an outcome reward while masking retrieved content from the training loss (a masking sketch follows the table). |
| ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning | 2025-03 | Official | `search` `rlvr` | ReSearch applies RLVR to multi-hop QA, training LLMs to reason effectively by learning when and how to use a search-engine tool in a reward-driven framework. |
| ToRL: Scaling Tool-Integrated RL | 2025-03 | Official | `python-interpreter` `rlvr` | ToRL trains Qwen2.5-Math to generate and execute Python code for math problem solving via RLVR. |
| ToolRL: Reward is All Tool Learning Needs | 2025-04 | Official | `prm-style-reward` `rlvr` | ToolRL extends RL-based tool use beyond math by introducing fine-grained reward functions that directly compare LLM-generated responses with ground truth, resembling a PRM-style variant of RLVR for general tasks. |
| Nemotron-Research-Tool-N1: Exploring Tool-Using Language Models with Reinforced Reasoning | 2025-05 | Official | `general-domain` `rlvr` | Tool-N1 extends LLM + RL + tool training to non-math domains by combining GRPO with a binary reward that checks response formatting and correct tool usage. |
| Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning | 2025-05 | Not officially released | `python-interpreter` `function` `rlvr` | ARTIST is a unified framework that enables LLMs to integrate external tool use and environment interaction into their reasoning via RLVR, achieving state-of-the-art performance on complex math and multi-turn function-calling tasks. |
| ZeroSearch: Incentivize the Search Capability of LLMs without Searching | 2025-05 | Official | `search` `simulated-tool` `sft` `rlvr` | ZeroSearch trains LLMs to use search tools via RLVR by simulating a search engine with an SFT model, avoiding API costs while keeping document quality controllable during training. |
| Agent RL Scaling Law: Agent RL with Spontaneous Code Execution for Mathematical Problem Solving | 2025-05 | Official | `python-interpreter` `rlvr` | ZTRL trains the Qwen2.5 base model with RLVR to autonomously generate and execute Python code for math problems. |
| TUMS: Enhancing Tool-use Abilities of LLMs with Multi-structure Handlers | 2025-05 | Not officially released | `prompt` | TUMS introduces a multi-structure handler framework that shifts tool-use learning from the coarse-grained tool level to the fine-grained parameter level, improving tool-call accuracy through intent recognition, task decomposition, and structured parameter generation. |
| Search and Refine During Think: Autonomous Retrieval-Augmented Reasoning of LLMs | 2025-05 | Official | `search` `rlvr` | AutoRefine enables LLMs to perform retrieval-augmented reasoning by searching Wikipedia dumps, refining the retrieved documents, and then generating answers, improving accuracy on QA tasks. |
| Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning | 2025-05 | Official | `multi-tool` `sft` `rlvr` | Tool-Star extends tool-integrated reasoning to multi-tool scenarios by constructing a synthetic multi-stage training dataset and optimizing with hierarchical rewards via SFT and self-critic RL. |
| R1-Searcher++: Incentivizing the Dynamic Knowledge Acquisition of LLMs via Reinforcement Learning | 2025-05 | Official | `search` `sft` `rlvr` | R1-Searcher++ builds on R1-Searcher with a two-stage pipeline (an SFT "cold start" for formatting, then RL) that encourages models to dynamically leverage both internal and external knowledge through a novel memorization-aware, outcome-driven reward during retrieval-augmented reasoning. |
| Learning to Reason without External Rewards | 2025-05 | Official | `implicit-reward` | Intuitor proposes an implicit reward strategy for RL without human feedback or ground truth, using LLM confidence (KL divergence from uniform) as a self-assessed signal to guide learning. |
| WebDancer: Towards Autonomous Information Seeking Agency | 2025-05 | Official | `search` `multi-tool` `sft` `rlvr` `deep-research` | WebDancer enhances LLMs' information-seeking ability via ReAct-style multi-tool reasoning, controllable TIR data generation, and a two-stage SFT + DAPO training pipeline for deep-research tasks. |
| TAPS: Tool-Augmented Personalisation via Structured Tagging | 2025-06 | Official | `personalisation` `user-preference` | TAPS introduces a tuning-free framework that improves personalized tool use by combining structured tagging with uncertainty-based tool detection, achieving state-of-the-art results on the NLSI task while reducing hallucinated and missing arguments in API-call generation. |
| L0: Reinforcement Learning to Become General Agents | 2025-06 | Official | `code` `jupyter-notebook` | L0 proposes a scalable RL-based TIR pipeline that turns LLMs into tool-using agents via a code-as-action scaffold and verifiable rewards, boosting reasoning and retrieval-augmented QA performance. |
| MassTool: A Multi-Task Search-Based Tool Retrieval Framework for Large Language Models | 2025-07 | Official | `tool-retrieval` `api` | MassTool introduces a dual-step, multi-task retrieval framework that efficiently decides whether and which tools to call by enhancing query representations with graph-based and search-based user-intent modeling. |
| A Toolbox, Not a Hammer -- Multi-TAG: Scaling Math Reasoning with Multi-Tool Aggregation | 2025-07 | Not officially released | `prompt` | Multi-TAG is an inference-time framework for tool-integrated reasoning that boosts LLM math performance by aggregating multiple tool outputs per reasoning step and selecting the next step based on answer consistency and solution conciseness, without any tuning. |
| Agentic Reinforced Policy Optimization | 2025-07 | Official | `rlvr` | ARPO proposes an entropy-guided RL algorithm built on GRPO that improves tool-integrated reasoning by adaptively branching rollouts at high-uncertainty tool-call steps and aligning step-level tool-use behaviors through structured advantage attribution. |
| AutoTIR: Autonomous Tools Integrated Reasoning via Reinforcement Learning | 2025-07 | Official | `search` `python-interpreter` `multi-tool` `rlvr` | AutoTIR is an RL framework for multi-tool TIR that lets the model autonomously decide whether and which tool to invoke, balancing tool usage with instruction-following through a three-stage training process with verifiable rewards. |
| MetaAgent: Toward Self-Evolving Agent via Tool Meta-Learning | 2025-08 | Official | `search` `python-interpreter` `prompt` | MetaAgent introduces a self-evolving TIR framework that, without fine-tuning, leverages meta tool learning (reflection, experience accumulation, and an in-house memory of past tool calls) to continually improve reasoning and tool-use strategies, matching or beating end-to-end trained agents on complex knowledge-discovery benchmarks. |
| SSRL: Self-Search Reinforcement Learning | 2025-08 | Official | `search` `rlvr` | SSRL is an on-policy self-search framework that reduces the high training cost of web agents by letting the LLM generate its own simulated search results, while preserving strong performance and transferring smoothly to real search. |
| Atom-Searcher: Enhancing Agentic Deep Research via Fine-Grained Atomic Thought Reward | 2025-08 | Not officially released | `search` `sft` `rlvr` `prm` | Atom-Searcher is a two-phase framework that first fine-tunes LLMs to generate Atomic Thoughts and then applies RL with fine-grained process-level rewards fused with outcome rewards, reducing gradient conflicts and reward sparsity to achieve state-of-the-art agentic deep-research performance. |
| Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent Distillation and Agentic RL | 2025-08 | Official | `search` `multi-tool` `multi-agent` `sft` `rlvr` | Chain-of-Agents is a two-stage framework that distills multi-agent reasoning trajectories into a single model and further optimizes it with RLVR, yielding an Agent Foundation Model that performs multi-tool TIR more effectively than typical LLMs while remaining more efficient than full multi-agent systems. |
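
To make one of the mechanisms above concrete: the ToolkenGPT entry describes tool calls as ordinary token prediction over learnable tool embeddings. The sketch below is our own minimal PyTorch illustration of that idea, assuming a frozen `nn.Linear` LM head; the class and variable names are ours, not the paper's code:

```python
import torch
import torch.nn as nn

class ToolkenHead(nn.Module):
    """A frozen LM head extended with learnable tool embeddings ("toolkens")."""
    def __init__(self, lm_head: nn.Linear, num_tools: int):
        super().__init__()
        for p in lm_head.parameters():
            p.requires_grad = False  # base vocabulary stays frozen
        self.lm_head = lm_head
        # Only these embeddings are trained; one row per tool.
        self.toolkens = nn.Parameter(0.02 * torch.randn(num_tools, lm_head.in_features))

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        word_logits = self.lm_head(hidden)      # logits over the original vocabulary
        tool_logits = hidden @ self.toolkens.T  # logits over the toolkens
        return torch.cat([word_logits, tool_logits], dim=-1)

head = ToolkenHead(nn.Linear(4096, 32000, bias=False), num_tools=8)
print(head(torch.randn(1, 4096)).shape)  # torch.Size([1, 32008])
```

Sampling a toolken during decoding then triggers the corresponding tool instead of emitting a word.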
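
And as noted in the Search-R1 entry, several of the RLVR methods above mask retrieved text out of the training loss so the policy is optimized only on tokens it actually generated. A minimal sketch, assuming a tokenizer that keeps the (hypothetical) `<information>` tags as single tokens:

```python
def retrieval_loss_mask(tokens, open_tag="<information>", close_tag="</information>"):
    """Return 1 for model-generated tokens and 0 for retrieved content (and its tags),
    so the RL / likelihood loss ignores text the policy did not produce."""
    mask, inside = [], False
    for tok in tokens:
        if tok == open_tag:
            inside = True
        mask.append(0 if inside else 1)
        if tok == close_tag:
            inside = False
    return mask

rollout = ["<think>", "need", "a", "search", "</think>", "<search>", "query", "</search>",
           "<information>", "doc1", "doc2", "</information>", "<answer>", "X", "</answer>"]
print(retrieval_loss_mask(rollout))
# [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1]
```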

Made with ❤️ by the open-source research community.
