Anthropic’s guide on building effective agents contains a striking insight: “We actually spent more time optimizing our tools than the overall prompt” when building their SWE-bench coding agent. This chapter is about why tool design matters so much and how to do it well.
The interface between an agent and its tools is the Agent-Computer Interface (ACI) — the equivalent of the Human-Computer Interface (HCI) in traditional software design. Just as billions of dollars have been invested in making software intuitive for humans, significant effort should go into making tools intuitive for LLMs.
If a human developer would need to think carefully about how to use a tool, the LLM will struggle too. Tool names, descriptions, and parameter names should be self-documenting.
Bad:
tool = {
    "name": "proc",
    "description": "Process data",
    "parameters": {
        "d": {"type": "string"},
        "m": {"type": "integer"},
        "f": {"type": "boolean"}
    }
}
Good:
tool = {
    "name": "analyze_sales_data",
    "description": (
        "Analyze sales data for a given time period. "
        "Returns total revenue, top products, and trends. "
        "Use when the user asks about sales performance "
        "or revenue analysis."
    ),
    "parameters": {
        "time_period": {
            "type": "string",
            "description": "Time period to analyze, e.g., '2024-Q1' or 'last_30_days'"
        },
        "max_results": {
            "type": "integer",
            "description": "Maximum number of top products to return (default: 10)"
        },
        "include_trends": {
            "type": "boolean",
            "description": "Whether to include trend analysis (default: true)"
        }
    }
}
Each tool should do one thing well. Don’t create Swiss-army-knife tools that require the LLM to understand complex mode switches.
Bad:
# One tool with a mode parameter that changes behavior entirely
tool = {
    "name": "data_tool",
    "parameters": {
        "mode": {"enum": ["read", "write", "delete", "transform", "validate"]},
        "target": {"type": "string"},
        "data": {"type": "object"},
        "options": {"type": "object"}
    }
}
Good:
# Separate tools for distinct operations
tools = [
    {"name": "read_data", "description": "Read data from a source..."},
    {"name": "write_data", "description": "Write data to a destination..."},
    {"name": "delete_data", "description": "Delete data from a source..."},
    {"name": "transform_data", "description": "Transform data using a rule..."},
]
Anthropic discovered that their coding agent made errors when using relative file paths after navigating away from the root directory. Switching to absolute paths eliminated these errors entirely.
This principle extends beyond file paths — any reference that could be ambiguous should be made absolute and unambiguous:
# Bad: relative, ambiguous
tool_call = {"name": "read_file", "args": {"path": "config.py"}}
# Good: absolute, unambiguous
tool_call = {"name": "read_file", "args": {"path": "/home/user/project/config.py"}}
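The same guardrail can be enforced at the tool boundary. Below is a minimal sketch (`validate_path` is a hypothetical helper, not part of any particular framework) that rejects relative paths with an actionable message rather than silently resolving them against whatever the current working directory happens to be:

```python
import os

def validate_path(path):
    """Reject relative paths before the tool runs, returning an
    actionable error instead of resolving against an unknown CWD."""
    if not os.path.isabs(path):
        return {
            "error": "Relative path rejected",
            "received": path,
            "suggestion": f"Use an absolute path, e.g. {os.path.abspath(path)}",
        }
    return {"ok": True, "path": os.path.normpath(path)}
```

Because the suggestion includes a concrete absolute path, the model can usually self-correct on the next turn.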
Poka-yoke (Japanese for “mistake-proofing”) means designing the tool interface so that it’s hard to use incorrectly:
# Bad: easy to pass the wrong date format
tool = {
    "name": "get_events",
    "parameters": {
        "date": {"type": "string", "description": "The date"}
    }
}

# Good: constrained format with validation
tool = {
    "name": "get_events",
    "parameters": {
        "date": {
            "type": "string",
            "pattern": "^\\d{4}-\\d{2}-\\d{2}$",
            "description": "Date in YYYY-MM-DD format, e.g., '2025-03-24'"
        }
    }
}
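The JSON Schema `pattern` catches malformed dates at the API layer, but the tool can enforce and explain the same constraint at runtime. A sketch, with `get_events` as a hypothetical handler returning a placeholder result:

```python
import re

DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def get_events(date):
    """Validate the date format before doing any work, and explain
    the expected format when validation fails."""
    if not DATE_RE.fullmatch(date):
        return {
            "error": "Invalid date format",
            "expected": "YYYY-MM-DD",
            "received": date,
        }
    return {"date": date, "events": []}  # placeholder result
```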
Tool descriptions are effectively prompts. They should include what the tool does, when to use it, when not to use it, concrete examples, and known limitations:
tool = {
    "name": "search_codebase",
    "description": (
        "Search the codebase for files containing a text pattern. "
        "Returns matching file paths and the lines containing the pattern."
        "\n\n"
        "USE THIS TOOL when you need to find where something is defined, "
        "used, or referenced in the code."
        "\n\n"
        "DO NOT use this for reading file contents — use read_file instead."
        "\n\n"
        "EXAMPLES:\n"
        "- To find all usages of a function: search for the function name\n"
        "- To find config values: search for the config key\n"
        "- For regex patterns: prefix with 'regex:'"
        "\n\n"
        "LIMITATIONS: Searches at most 1000 files. For large codebases, "
        "use specific directory paths to narrow the search."
    )
}
How tools return results matters as much as how they accept inputs:
Return only what the agent needs. Massive outputs consume context window tokens and make it harder for the LLM to find the relevant information.
# Bad: returns the entire database record with 50 fields
def get_user(user_id):
    return database.get_full_record(user_id)  # huge response

# Good: returns only commonly needed fields
def get_user(user_id, fields=None):
    record = database.get_record(user_id)
    if fields:
        return {k: record[k] for k in fields if k in record}
    return {
        "name": record["name"],
        "email": record["email"],
        "plan": record["plan"],
        "created_at": record["created_at"]
    }
Structured (JSON) responses are easier for the LLM to parse and reference than free text:
# Bad: unstructured text
def search(query):
    return "Found 3 results: first is about X, second about Y..."

# Good: structured response
def search(query):
    return {
        "total_results": 3,
        "results": [
            {"title": "X", "url": "...", "snippet": "..."},
            {"title": "Y", "url": "...", "snippet": "..."},
            {"title": "Z", "url": "...", "snippet": "..."},
        ]
    }
When a tool fails, the error message should tell the LLM how to fix the problem:
# Bad
raise Exception("Invalid input")

# Good
return {
    "error": "Invalid date format",
    "expected": "YYYY-MM-DD",
    "received": user_input,
    "suggestion": f"Did you mean '{fix_date_format(user_input)}'?"
}
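The `fix_date_format` helper above is left undefined. One possible sketch — assuming the common misformats are variants like 'MM/DD/YYYY' — tries a few known formats and renormalizes, falling back to the original input:

```python
from datetime import datetime

def fix_date_format(value):
    """Try a few common date formats and renormalize to YYYY-MM-DD.
    Returns the input unchanged if nothing parses."""
    for fmt in ("%m/%d/%Y", "%d-%m-%Y", "%Y/%m/%d", "%B %d, %Y"):
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return value
```

Even when the guess is wrong, surfacing a candidate correction gives the model a concrete example of the expected format.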
Run many example inputs to see how the model uses your tools. Common mistakes to watch for include selecting the wrong tool for a query, passing malformed or missing arguments, and hallucinating parameters that do not exist:
# See code/aci_testing.py for the full implementation
def test_tool_selection(llm, tools, test_cases):
    """Test whether the LLM selects the right tool for each scenario."""
    results = []
    for query, expected_tool in test_cases:
        response = llm.generate(query, tools=tools)
        actual_tool = response.tool_calls[0].name if response.tool_calls else None
        results.append({
            "query": query,
            "expected": expected_tool,
            "actual": actual_tool,
            "correct": actual_tool == expected_tool
        })
    accuracy = sum(r["correct"] for r in results) / len(results)
    return results, accuracy
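To exercise this harness without a live model, the LLM interface can be stubbed. The stub below is purely illustrative: it assumes `generate` returns an object with a `tool_calls` list, matching what the harness expects, and picks a tool by naive keyword match:

```python
from types import SimpleNamespace

class StubLLM:
    """Toy model that picks a tool by keyword match — enough to exercise
    the selection check, not a real client."""
    def generate(self, query, tools=None):
        name = "search_codebase" if "find" in query else "read_file"
        return SimpleNamespace(tool_calls=[SimpleNamespace(name=name)])

test_cases = [
    ("find all usages of parse_config", "search_codebase"),
    ("read the file /home/user/project/config.py", "read_file"),
]

# Mirrors the selection check inside test_tool_selection:
llm = StubLLM()
correct = sum(
    llm.generate(q, tools=[]).tool_calls[0].name == expected
    for q, expected in test_cases
)
accuracy = correct / len(test_cases)
```

With a real client, swap `StubLLM` for your model wrapper and pass the actual tool definitions; the per-case records make it easy to spot which queries route to the wrong tool.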
Anthropic observed that the format in which tools accept and return data significantly affects LLM performance. Asking the model to emit a unified diff means tracking exact line counts; asking it to write code inside a JSON string means escaping every quote and newline. Both add formatting overhead that invites mistakes.
The rule of thumb: keep the format close to what the model has seen naturally in its training data.
# Bad: requires a precise diff format
tool_output_format = "unified_diff"  # model must track exact line numbers
# Good: full file replacement
tool_output_format = "full_content"  # close to what the model saw in training