An LLM without tools is like a brilliant consultant locked in a room with no phone, no computer, and no internet. They can reason and draft, but they cannot verify facts, run calculations, access current data, or take action in the world. Tool Use unlocks all of this.
Tool Use is the pattern in which an LLM is given descriptions of available functions and can request to call them during its reasoning process. The LLM generates a structured request (typically JSON), the runtime executes the function, and the result is fed back into the LLM’s context for further processing.
The lifecycle of a tool call:
┌─────────┐    ┌──────────────┐    ┌──────────────┐    ┌─────────────┐
│  User   │───►│ LLM reasons  │───►│ LLM generates│───►│   Runtime   │
│  Query  │    │ about task   │    │  tool call   │    │  executes   │
└─────────┘    └──────────────┘    └──────────────┘    │  function   │
                                                       └──────┬──────┘
┌─────────┐    ┌──────────────┐                               │
│  Final  │◄───│ LLM generates│◄──────────────────────────────┘
│ Answer  │    │ using result │      (result returned)
└─────────┘    └──────────────┘
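The loop in the diagram can be sketched in a few lines of Python. This is a minimal sketch, not any particular SDK's API: the `llm` callable and the `tools` mapping are placeholders for a real model client and real functions.

```python
import json

def run_tool_loop(llm, tools, user_query, max_steps=5):
    """Drive the tool-use lifecycle: each turn, the LLM either requests
    tool calls or produces a final answer; tool results are appended to
    the conversation until the LLM answers directly."""
    messages = [{"role": "user", "content": user_query}]
    for _ in range(max_steps):
        reply = llm(messages)                      # model turn
        if "tool_calls" not in reply:
            return reply["content"]                # final answer
        # Record the assistant's tool-call request in the transcript.
        messages.append({"role": "assistant", "tool_calls": reply["tool_calls"]})
        for call in reply["tool_calls"]:
            # Runtime executes the function and feeds the result back.
            result = tools[call["name"]](**call["arguments"])
            messages.append({
                "role": "tool",
                "name": call["name"],
                "content": json.dumps(result),
            })
    return "Step limit reached without a final answer."
```

The `max_steps` bound is the one non-obvious design choice: without it, a model that keeps requesting tools would loop forever.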
Tools are described to the LLM using structured schemas. The quality of these descriptions directly affects how well the LLM uses the tools.
# See code/tool_use.py for the full implementation
tools = [
    {
        "name": "get_weather",
        "description": (
            "Get the current weather for a specific city. "
            "Returns temperature, conditions, and humidity. "
            "Use when the user asks about weather or needs "
            "weather data for planning."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "city": {
                    "type": "string",
                    "description": "City name, e.g., 'Paris' or 'New York'"
                },
                "units": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],
                    "description": "Temperature unit preference"
                }
            },
            "required": ["city"]
        }
    }
]
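Because the model can emit malformed arguments, the runtime should check them against this schema before executing anything. A minimal stdlib-only sketch of that check (a library such as `jsonschema` handles the full specification):

```python
def validate_arguments(schema, arguments):
    """Check tool-call arguments against a JSON-Schema-style parameter
    spec. Returns a list of problems; an empty list means valid."""
    problems = []
    props = schema.get("properties", {})
    # Every required parameter must be present.
    for name in schema.get("required", []):
        if name not in arguments:
            problems.append(f"missing required parameter: {name}")
    # Every supplied parameter must be known, typed, and in-enum.
    for name, value in arguments.items():
        spec = props.get(name)
        if spec is None:
            problems.append(f"unknown parameter: {name}")
            continue
        if spec.get("type") == "string" and not isinstance(value, str):
            problems.append(f"{name} must be a string")
        if "enum" in spec and value not in spec["enum"]:
            problems.append(f"{name} must be one of {spec['enum']}")
    return problems
```

Returning a list of problems rather than raising on the first one lets the runtime report all issues back to the LLM in a single message.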
Given the user query and tool descriptions, the LLM decides whether to call a tool and which one:
{
  "tool_calls": [
    {
      "name": "get_weather",
      "arguments": {
        "city": "Paris",
        "units": "celsius"
      }
    }
  ]
}
The runtime calls the actual function and returns the result to the LLM:
{
  "role": "tool",
  "name": "get_weather",
  "content": "{\"temperature\": 18, \"conditions\": \"partly cloudy\", \"humidity\": 65}"
}
The LLM incorporates the tool result into its response:
“The current weather in Paris is 18°C with partly cloudy skies and 65% humidity.”
As the number of available tools grows, a challenge emerges: the LLM’s context window fills up with tool descriptions, leaving less room for the actual task. Research from the Gorilla project (Patil et al., 2023) addresses this with tool retrieval — using the same technique as RAG but applied to tool descriptions:
# See code/tool_use.py for the full implementation
class ToolRegistry:
    def __init__(self, all_tools):
        self.tools = {t["name"]: t for t in all_tools}
        self.index = VectorIndex()
        for tool in all_tools:
            self.index.add(tool["name"], tool["description"])

    def get_relevant_tools(self, query, k=5):
        """Retrieve the k most relevant tools for a query."""
        relevant_names = self.index.search(query, k=k)
        return [self.tools[name] for name in relevant_names]
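`VectorIndex` above stands in for any embedding-based index. To see the retrieval mechanics without an embedding model, a toy word-overlap index can play the same role; this is purely illustrative, and a real system would rank by embedding similarity:

```python
class KeywordIndex:
    """Toy stand-in for an embedding index: ranks documents by the
    number of words they share with the query."""
    def __init__(self):
        self.docs = {}

    def add(self, name, text):
        self.docs[name] = set(text.lower().split())

    def search(self, query, k=5):
        words = set(query.lower().split())
        ranked = sorted(self.docs, key=lambda n: -len(words & self.docs[n]))
        return ranked[:k]

# Usage: index two tool descriptions, then retrieve for a query.
index = KeywordIndex()
index.add("get_weather", "get the current weather for a city")
index.add("send_email", "send an email message to a recipient")
```

With `index.search("what is the weather like", k=1)`, only `get_weather` survives the cut, so only its schema needs to enter the prompt.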
One of the key advantages of the CodeAct approach (using Python code as the action format) is that it allows natural tool composition — calling multiple tools in sequence, using the output of one as the input to another:
# Agent-generated code action (CodeAct style)
weather = get_weather("Paris")
if weather["temperature"] > 25:
    activities = search_activities("Paris", "outdoor")
else:
    activities = search_activities("Paris", "indoor")

calendar = get_calendar("today")
available_slots = find_free_slots(calendar)

recommendation = f"Given the {weather['conditions']} weather at {weather['temperature']}°C, "
recommendation += f"I suggest: {activities[0]['name']} at {available_slots[0]}"
With JSON tool calls, this same logic would require multiple round trips to the LLM — one for each tool call and decision point.
Tools fail. APIs go down, rate limits kick in, inputs are invalid. Robust tool-using agents need strategies for handling failures:
# See code/tool_use.py for the full implementation
def execute_tool_safely(tool_call, tools):
    tool_fn = tools.get(tool_call.name)
    if tool_fn is None:
        return {
            "error": f"Unknown tool: {tool_call.name}",
            "available_tools": list(tools.keys())
        }
    try:
        result = tool_fn(**tool_call.arguments)
        return {"success": True, "result": result}
    except ValidationError as e:
        return {"error": f"Invalid arguments: {e}"}
    except RateLimitError:
        return {"error": "Rate limited. Try again in a few seconds."}
    except Exception as e:
        return {"error": f"Tool execution failed: {type(e).__name__}: {e}"}
Returning structured error messages back to the LLM allows it to recover — it might retry with corrected arguments, switch to an alternative tool, or inform the user about the limitation.
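One recovery strategy is a bounded retry loop: execute the call, and on failure hand the error message back to the model for corrected arguments. A sketch under simplified assumptions — `fix_arguments` stands in for the model turn that proposes a fix, and the error handling is collapsed into a single catch-all:

```python
def execute_safely(tool_fn, arguments):
    """Run a tool, converting any exception into a structured error."""
    try:
        return {"success": True, "result": tool_fn(**arguments)}
    except Exception as e:
        return {"error": f"{type(e).__name__}: {e}"}

def call_with_recovery(fix_arguments, tool_fn, arguments, max_retries=2):
    """On failure, feed the structured error back to the model
    (here: the fix_arguments callback) and retry with its fix."""
    outcome = execute_safely(tool_fn, arguments)
    for _ in range(max_retries):
        if "error" not in outcome:
            break
        arguments = fix_arguments(arguments, outcome["error"])
        outcome = execute_safely(tool_fn, arguments)
    return outcome
```

Bounding the retries matters: a model that keeps producing the same bad arguments should fail fast and surface the error to the user instead of burning tokens.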
Tool Use is one of the two most mature and reliable agentic patterns (alongside Reflection). Use it whenever the task requires anything beyond the model's own weights: verifying facts, fetching current data, running precise calculations, or taking action in external systems.
Anthropic emphasizes that investing in Agent-Computer Interface (ACI) design is just as important as prompt engineering. Their recommendations:
One example: use descriptive parameter names — city_name is better than loc.

“While building our agent for SWE-bench, we actually spent more time optimizing our tools than the overall prompt.” — Anthropic