The evaluator-optimizer pattern creates a tight feedback loop between two roles: a generator that produces output and an evaluator that assesses and critiques it. The generator then revises its output based on the evaluation, and the cycle repeats until quality criteria are met.
This pattern is closely related to the Reflection pattern (Chapter 3), but is specifically structured as a two-component system with clearly separated roles, making it particularly suitable for tasks with well-defined quality criteria.
```
┌───────────────┐          ┌───────────────┐
│   Generator   │─────────►│   Evaluator   │
│  (Optimizer)  │◄─────────│   (Critic)    │
└───────┬───────┘ feedback └───────────────┘
        │
        ▼
 [Meets criteria?]
   Yes → Output
   No  → Loop again
```
```python
# See code/evaluator_optimizer.py for the full implementation
def evaluator_optimizer(generator_llm, evaluator_llm, task,
                        criteria, max_rounds=5):
    """Iteratively improve output through evaluation feedback."""
    # Initial generation
    output = generator_llm.generate(
        f"Complete this task:\n{task}\n\n"
        f"Quality criteria:\n{criteria}"
    )
    for round_num in range(max_rounds):
        # Evaluate the current draft
        evaluation = evaluator_llm.generate(
            f"Evaluate this output for the task '{task}'.\n\n"
            f"Output:\n{output}\n\n"
            f"Criteria:\n{criteria}\n\n"
            f"For each criterion, score 1-10 and explain.\n"
            f"Overall verdict: PASS (all scores >= 8) or FAIL.\n"
            f"If FAIL, provide specific, actionable feedback."
        )
        if "PASS" in evaluation:
            return output, round_num + 1  # Return output and rounds used
        # Optimize based on feedback
        output = generator_llm.generate(
            f"Task: {task}\n\n"
            f"Your previous output:\n{output}\n\n"
            f"Evaluator feedback:\n{evaluation}\n\n"
            f"Revise your output to address the feedback. "
            f"Focus on the specific issues identified."
        )
    return output, max_rounds
```
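To make the control flow concrete, here is a condensed, self-contained sketch of the same loop driven by stub models. `StubLLM` and `run_loop` are hypothetical stand-ins, not part of the chapter's code: the stub evaluator fails the first draft and passes the revision.

```python
class StubLLM:
    """Hypothetical stub that replays canned responses in order."""
    def __init__(self, responses):
        self._responses = iter(responses)

    def generate(self, prompt, **kwargs):
        return next(self._responses)

def run_loop(generator, evaluator, task, criteria, max_rounds=5):
    """Condensed version of the evaluator-optimizer loop."""
    output = generator.generate(f"Task:\n{task}\n\nCriteria:\n{criteria}")
    for round_num in range(max_rounds):
        evaluation = evaluator.generate(f"Evaluate:\n{output}")
        if "PASS" in evaluation:
            return output, round_num + 1
        output = generator.generate(f"Revise:\n{output}\nFeedback:\n{evaluation}")
    return output, max_rounds

generator = StubLLM(["draft v1", "draft v2"])
evaluator = StubLLM([
    "clarity 6/10. Verdict: FAIL. Tighten the opening paragraph.",
    "clarity 9/10. Verdict: PASS.",
])
output, rounds = run_loop(generator, evaluator,
                          "Summarize the report", "clarity")
# output == "draft v2", rounds == 2
```

The stub converges in two rounds: one failed evaluation, one revision, one pass.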
For more reliable evaluation, use structured scoring rubrics:
```python
# See code/evaluator_optimizer.py for the full implementation
class EvaluationRubric:
    def __init__(self, criteria):
        self.criteria = criteria  # List of (name, description, weight)

    def evaluate(self, evaluator_llm, task, output):
        scores = {}
        for name, description, weight in self.criteria:
            score_text = evaluator_llm.generate(
                f"Score the following output on '{name}'.\n"
                f"Criterion: {description}\n"
                f"Output: {output}\n\n"
                f"Score (1-10):",
                temperature=0
            )
            scores[name] = {
                "score": int(score_text.strip()),
                "weight": weight
            }
        weighted_avg = sum(
            s["score"] * s["weight"] for s in scores.values()
        ) / sum(s["weight"] for s in scores.values())
        return scores, weighted_avg

rubric = EvaluationRubric([
    ("accuracy", "Are all facts and claims correct?", 3),
    ("completeness", "Does it cover all required aspects?", 2),
    ("clarity", "Is it easy to understand?", 2),
    ("actionability", "Does it provide actionable guidance?", 1),
])
```
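In practice, models rarely return a bare integer, so `int(score_text.strip())` can raise on answers like "8/10" or "Score: 8 because...". A tolerant parser (a hypothetical helper, not part of the chapter's code) is one way to harden this step:

```python
import re

def parse_score(score_text, default=1):
    """Extract the first 1-10 score from an LLM response.

    Hypothetical helper: tolerates answers like "8", "8/10",
    or "Score: 8 because it covers...". Falls back to `default`
    when no number is found.
    """
    match = re.search(r"\b(10|[1-9])\b", score_text)
    return int(match.group(1)) if match else default
```

With this in place, `parse_score(score_text)` can stand in for the `int(score_text.strip())` call in the rubric.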
The evaluator becomes much more powerful when it has access to tools that provide objective measurements:
```python
# See code/evaluator_optimizer.py for the full implementation
def code_evaluator(code, test_suite):
    """Evaluate code using automated tools."""
    results = {}
    # Run tests
    test_results = run_tests(code, test_suite)
    results["tests"] = {
        "passed": test_results.passed,
        "failed": test_results.failed,
        "score": test_results.pass_rate
    }
    # Run linter
    lint_results = run_linter(code)
    results["lint"] = {
        "errors": lint_results.errors,
        "warnings": lint_results.warnings,
        "score": max(0, 10 - lint_results.error_count)
    }
    # Check cyclomatic complexity
    complexity = measure_complexity(code)
    results["complexity"] = {
        "cyclomatic": complexity.cyclomatic,
        "score": max(0, 10 - complexity.cyclomatic // 5)
    }
    return results
```
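These per-tool results still need to be collapsed into a single verdict the loop can act on. One sketch of that step (a hypothetical helper, assuming the results dict shape above, where the test score is a pass rate in [0, 1] while the lint and complexity scores are on a 0-10 scale):

```python
def overall_verdict(results, threshold=7.0):
    """Collapse per-tool results into a single PASS/FAIL verdict.

    Hypothetical helper: averages each tool's score after rescaling
    the test pass rate (0.0-1.0) onto the 0-10 scale the other
    tools already use.
    """
    scores = []
    for name, result in results.items():
        score = result["score"]
        if name == "tests":
            score *= 10  # pass rate in [0, 1] -> 0-10 scale
        scores.append(score)
    average = sum(scores) / len(scores)
    verdict = "PASS" if average >= threshold else "FAIL"
    return verdict, average
```

For example, a fully passing test suite (`score` 1.0), a lint score of 9, and a complexity score of 8 average to 9.0, comfortably above a threshold of 7.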
The same idea extends to factual content, with a search tool supplying the evidence:

```python
# See code/evaluator_optimizer.py for the full implementation
async def content_evaluator(llm, content, search_tool):
    """Evaluate content accuracy using web search."""
    # Extract claims
    claims = llm.generate(
        f"Extract all factual claims from:\n{content}"
    )
    # Verify each claim against retrieved evidence
    verification_results = []
    for claim in parse_claims(claims):
        evidence = await search_tool.search(claim)
        verification = llm.generate(
            f"Does this evidence support the claim?\n"
            f"Claim: {claim}\n"
            f"Evidence: {evidence}\n"
            f"Answer: SUPPORTED, CONTRADICTED, or UNVERIFIABLE"
        )
        verification_results.append({
            "claim": claim,
            "verdict": verification.strip()
        })
    return verification_results
```
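The `parse_claims` helper above is left undefined; a minimal sketch, assuming (hypothetically) that the model returns one claim per line with optional bullets or numbering:

```python
def parse_claims(claims_text):
    """Split an LLM's claim list into individual claims.

    Hypothetical sketch: assumes one claim per line, optionally
    prefixed with "-", "*", or numbering like "1.". Blank lines
    are dropped.
    """
    claims = []
    for line in claims_text.splitlines():
        cleaned = line.strip().lstrip("-*0123456789.) ").strip()
        if cleaned:
            claims.append(cleaned)
    return claims
```

Note that the leading-character strip would also eat a claim that genuinely starts with a digit, so a production version would want a more careful prefix pattern.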
An iterative translation session might unfold like this:

```
Round 1: Translator produces initial translation
  Evaluator: "Line 3 loses the metaphor in the original.
              Line 7 is too literal; it sounds unnatural."

Round 2: Translator revises with more natural phrasing
  Evaluator: "Much better. Line 3's metaphor is now captured.
              Minor issue: Line 12's tone is too formal."

Round 3: Translator adjusts tone
  Evaluator: "PASS. The translation is faithful and natural."
```
A report-writing loop follows the same shape:

```
Round 1: Writer produces initial report
  Evaluator: "Missing data from 2025. Section 3 lacks citations.
              Conclusion doesn't follow from the analysis."

Round 2: Writer revises with additional research
  Evaluator: "Data is current. Citations added. Conclusion still
              overgeneralizes from limited data."

Round 3: Writer tempers conclusions
  Evaluator: "PASS. The report is well-sourced and its conclusions
              are appropriately scoped."
```
A key challenge with evaluator-optimizer loops is knowing when to stop:
Stop when all evaluation scores exceed a threshold:

```python
if all(score >= 8 for score in scores.values()):
    return output  # Quality threshold met
```

Stop when the improvement between rounds is negligible:

```python
if current_score - previous_score < 0.5:
    return output  # Diminishing returns
```

Stop when the evaluator and generator agree there is nothing more to improve:

```python
if "no further improvements needed" in evaluation.lower():
    return output  # Both sides agree
```
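The three signals can also be combined into a single check. A sketch (hypothetical helper, assuming `scores` is a per-criterion score dict and `previous_avg` is `None` on the first round):

```python
def should_stop(scores, previous_avg, evaluation,
                threshold=8, min_improvement=0.5):
    """Combine threshold, diminishing-returns, and agreement signals."""
    current_avg = sum(scores.values()) / len(scores)
    if all(score >= threshold for score in scores.values()):
        return True  # all criteria meet the quality bar
    if previous_avg is not None and current_avg - previous_avg < min_improvement:
        return True  # negligible improvement between rounds
    if "no further improvements needed" in evaluation.lower():
        return True  # evaluator signals convergence
    return False
```

The caller would track the previous round's average itself and pass it back in on each iteration.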
This pattern is particularly effective for tasks with clear, measurable quality criteria. The table below contrasts it with the Reflection pattern:
| Aspect | Evaluator-Optimizer | Reflection |
|---|---|---|
| Components | Separate generator + evaluator | Single LLM (or two with same structure) |
| Evaluation | Structured criteria/rubrics | Freeform self-critique |
| Tools | Often uses external verification | Optionally uses tools |
| Stopping | Score-based thresholds | Round-based or quality signal |
| Best for | Tasks with clear metrics | General improvement |
In practice, the evaluator-optimizer pattern is a formalized version of reflection, with explicit evaluation criteria and clearer convergence behavior.