Parallelization is the pattern of running multiple LLM calls simultaneously and combining their outputs. Unlike prompt chaining (sequential) or routing (one path), parallelization harnesses concurrent execution to reduce latency, improve quality, or increase confidence.
Anthropic identifies two key forms of parallelization:
The first, sectioning, breaks a task into independent subtasks that run in parallel:
           ┌──────────┐
           │  Input   │
           └──┬──┬──┬─┘
              │  │  │
      ┌───────┘  │  └────────┐
      │          │           │
┌─────▼────┐ ┌───▼─────┐ ┌───▼─────┐
│ Section A│ │Section B│ │Section C│
│  (LLM)   │ │  (LLM)  │ │  (LLM)  │
└─────┬────┘ └───┬─────┘ └───┬─────┘
      │          │           │
      └───────┐  │  ┌────────┘
              │  │  │
           ┌──▼──▼──▼─┐
           │ Aggregate│
           └──────────┘
The second, voting, runs the same task multiple times to get diverse outputs and selects the best:
           ┌──────────┐
           │  Input   │
           └──┬──┬──┬─┘
              │  │  │
      ┌───────┘  │  └────────┐
      │          │           │
┌─────▼────┐ ┌───▼─────┐ ┌───▼─────┐
│  Run 1   │ │  Run 2  │ │  Run 3  │
│  (LLM)   │ │  (LLM)  │ │  (LLM)  │
└─────┬────┘ └───┬─────┘ └───┬─────┘
      │          │           │
      └───────┐  │  ┌────────┘
              │  │  │
           ┌──▼──▼──▼─┐
           │   Vote   │
           └──────────┘
Sectioning works when the task can be split into subtasks that don't depend on each other's outputs.
# See code/parallelization.py for the full implementation
import asyncio

async def parallel_sections(llm, task, sections):
    """Execute independent sections in parallel."""
    async def run_section(section_prompt):
        return await llm.agenerate(section_prompt)

    tasks = [run_section(s) for s in sections]
    results = await asyncio.gather(*tasks)

    # Aggregate results, reminding the model of the overall task
    combined = "\n\n".join(results)
    final = await llm.agenerate(
        f"Task: {task}\n"
        f"Combine these sections into a coherent whole:\n{combined}"
    )
    return final
# Example: analyze a company from multiple angles simultaneously
task = "Write an investment analysis of Company X."
sections = [
    "Analyze the financial performance of Company X based on...",
    "Analyze the competitive landscape for Company X...",
    "Analyze the management team and corporate culture of Company X...",
    "Analyze the technology and product roadmap of Company X...",
]
result = asyncio.run(parallel_sections(llm, task, sections))
A particularly important application of sectioning: running guardrails in parallel with the main task. While one LLM call generates the response, another screens for safety issues:
# See code/parallelization.py for the full implementation
async def guarded_response(llm, user_query):
    """Generate a response and check safety in parallel."""
    async def generate():
        return await llm.agenerate(
            f"Answer this query helpfully:\n{user_query}"
        )

    async def check_safety():
        return await llm.agenerate(
            f"Is this query requesting anything harmful, illegal, "
            f"or inappropriate? Query: {user_query}\n"
            f"Answer YES or NO with brief explanation.",
            temperature=0
        )

    response, safety = await asyncio.gather(generate(), check_safety())

    # startswith avoids matching a stray "yes" inside a NO explanation
    if safety.strip().upper().startswith("YES"):
        return "I'm sorry, I can't help with that request."
    return response
This is faster than sequential checking (check first, then respond) and more reliable than asking a single LLM to both guard and respond.
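If you also want to avoid paying for tokens that end up discarded, a variant can start the generation as a background task and cancel it the moment the safety check flags the query. A minimal sketch, reusing the same hypothetical agenerate interface (the early-exit function is ours, not part of the reference implementation):

# Start generation in the background; cancel it if the safety check fails
async def guarded_response_early_exit(llm, user_query):
    gen_task = asyncio.create_task(
        llm.agenerate(f"Answer this query helpfully:\n{user_query}")
    )
    safety = await llm.agenerate(
        f"Is this query requesting anything harmful, illegal, "
        f"or inappropriate? Query: {user_query}\n"
        f"Answer YES or NO with brief explanation.",
        temperature=0
    )
    if safety.strip().upper().startswith("YES"):
        gen_task.cancel()  # don't pay for tokens we won't use
        return "I'm sorry, I can't help with that request."
    return await gen_task

The safety check usually finishes first because its output is short, so the happy path adds no latency while the unhappy path stops the expensive call early.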
Voting runs the same task multiple times (often with different temperatures, prompts, or models) and selects the best answer through consensus.
# See code/parallelization.py for the full implementation
from collections import Counter

async def voting(llm, task, num_votes=5, temperature=0.7):
    """Run the task multiple times and take a majority vote."""
    tasks = [
        llm.agenerate(task, temperature=temperature)
        for _ in range(num_votes)
    ]
    responses = await asyncio.gather(*tasks)

    # For classification tasks: majority vote
    votes = Counter(r.strip() for r in responses)
    winner, count = votes.most_common(1)[0]
    confidence = count / num_votes
    return winner, confidence
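Exact-match voting only works when the prompt constrains the output format; free-form answers are almost never byte-identical, so the vote splits. A hypothetical usage sketch (the ticket text and the 0.6 threshold are illustrative):

task = (
    "Classify this support ticket as BILLING, TECHNICAL, or OTHER. "
    "Respond with the label only.\n\n"
    "Ticket: I was charged twice this month."
)
label, confidence = asyncio.run(voting(llm, task))
if confidence < 0.6:
    # No clear majority: escalate rather than guess
    label = "NEEDS_HUMAN_REVIEW"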
Best-of-N is a variant in which you generate N responses and use a separate evaluation step to pick the best one:
# See code/parallelization.py for the full implementation
async def best_of_n(llm, task, n=3):
    """Generate N responses and pick the best."""
    # Generate N candidates at a higher temperature for diversity
    tasks = [llm.agenerate(task, temperature=0.8) for _ in range(n)]
    candidates = await asyncio.gather(*tasks)

    # Evaluate and pick the best
    evaluation_prompt = (
        f"Task: {task}\n\n"
        f"Here are {n} candidate responses:\n\n"
    )
    for i, c in enumerate(candidates):
        evaluation_prompt += f"--- Candidate {i+1} ---\n{c}\n\n"
    evaluation_prompt += (
        f"Which candidate best accomplishes the task? "
        f"Consider accuracy, completeness, and clarity. "
        f"Respond with just the number."
    )
    verdict = await llm.agenerate(evaluation_prompt, temperature=0)
    best_idx = int(verdict.strip()) - 1
    return candidates[best_idx]
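The final int(...) parse is the fragile step: judges sometimes answer "Candidate 2" or add trailing punctuation, and int() raises on anything but a bare number. A defensive sketch of that last step (pick_best is a hypothetical helper, not from the reference implementation):

import re

async def pick_best(llm, evaluation_prompt, candidates):
    """Parse the judge's reply defensively and clamp to a valid index."""
    reply = await llm.agenerate(evaluation_prompt, temperature=0)
    match = re.search(r"\d+", reply)  # first integer anywhere in the reply
    idx = int(match.group()) - 1 if match else 0  # fall back to candidate 1
    return candidates[min(max(idx, 0), len(candidates) - 1)]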
Parallelization is particularly useful for evaluating LLM outputs across multiple dimensions simultaneously:
# See code/parallelization.py for the full implementation
async def parallel_evaluation(llm, content, criteria):
    """Evaluate content across multiple criteria in parallel."""
    async def evaluate_criterion(criterion):
        return await llm.agenerate(
            f"Evaluate this content on '{criterion}' "
            f"(score 1-10 with explanation):\n\n{content}"
        )

    results = await asyncio.gather(
        *[evaluate_criterion(c) for c in criteria]
    )
    return dict(zip(criteria, results))
criteria = ["accuracy", "clarity", "completeness", "tone"]
scores = asyncio.run(parallel_evaluation(llm, article, criteria))
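The evaluations come back as free text. If you want numbers you can average or threshold, a small post-processing pass can pull them out; the extract_scores helper below is an assumption for illustration, and its regex naively takes the first 1-10 it finds:

import re

def extract_scores(results):
    """Pull the first numeric score out of each free-text evaluation."""
    scores = {}
    for criterion, text in results.items():
        match = re.search(r"\b(10|[1-9])\b", text)  # naive: first 1-10 found
        scores[criterion] = int(match.group()) if match else None
    return scores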
asyncio is the natural fit for parallel LLM calls: each call spends nearly all of its time waiting on network I/O, so a single event loop can keep many requests in flight at once.
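All of the snippets above assume an llm object exposing an async agenerate(prompt, temperature=...) method. A minimal sketch of such an adapter, assuming your provider ships an async client with some completion coroutine (client.complete here is a placeholder, not a real SDK method); the semaphore matters because asyncio.gather will happily fire dozens of requests at once and trip provider rate limits:

import asyncio

class AsyncLLM:
    def __init__(self, client, max_concurrency=8):
        self.client = client  # your provider's async client
        self._sem = asyncio.Semaphore(max_concurrency)

    async def agenerate(self, prompt, temperature=1.0):
        # Cap in-flight requests across all the parallel patterns above
        async with self._sem:
            return await self.client.complete(prompt, temperature=temperature)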