Chapter 11 – Evaluator-Optimizer Pattern

Iterative Improvement Through Feedback Loops

The evaluator-optimizer pattern creates a tight feedback loop between two roles: a generator that produces output and an evaluator that assesses and critiques it. The generator then revises its output based on the evaluation, and the cycle repeats until quality criteria are met.

This pattern is closely related to the Reflection pattern (Chapter 3), but is specifically structured as a two-component system with clearly separated roles, making it particularly suitable for tasks with well-defined quality criteria.

The Pattern

    ┌───────────────┐         ┌───────────────┐
    │   Generator   │────────►│   Evaluator   │
    │  (Optimizer)  │◄────────│  (Critic)     │
    └───────┬───────┘         └───────────────┘
            │                        │
            │     Feedback Loop      │
            │◄───────────────────────┘
            │
            ▼
     [Meets criteria?]
        Yes → Output
        No  → Loop again

Implementation

# See code/evaluator_optimizer.py for the full implementation

def evaluator_optimizer(generator_llm, evaluator_llm, task, 
                        criteria, max_rounds=5):
    """Iteratively improve output through evaluation feedback."""
    
    # Initial generation
    output = generator_llm.generate(
        f"Complete this task:\n{task}\n\n"
        f"Quality criteria:\n{criteria}"
    )
    
    for round_num in range(max_rounds):  # avoid shadowing the builtin round()
        # Evaluate
        evaluation = evaluator_llm.generate(
            f"Evaluate this output for the task '{task}'.\n\n"
            f"Output:\n{output}\n\n"
            f"Criteria:\n{criteria}\n\n"
            f"For each criterion, score 1-10 and explain.\n"
            f"Overall verdict: PASS (all scores >= 8) or FAIL.\n"
            f"If FAIL, provide specific, actionable feedback."
        )
        
        # Checking for "PASS" alone is fragile: a FAIL verdict often
        # mentions the word PASS too, so require PASS and no FAIL.
        if "PASS" in evaluation and "FAIL" not in evaluation:
            return output, round_num + 1  # Return output and rounds used
        
        # Optimize based on feedback
        output = generator_llm.generate(
            f"Task: {task}\n\n"
            f"Your previous output:\n{output}\n\n"
            f"Evaluator feedback:\n{evaluation}\n\n"
            f"Revise your output to address the feedback. "
            f"Focus on the specific issues identified."
        )
    
    return output, max_rounds

Structured Evaluation

For more reliable evaluation, use structured scoring rubrics:

# See code/evaluator_optimizer.py for the full implementation

import re

class EvaluationRubric:
    def __init__(self, criteria):
        self.criteria = criteria  # List of (name, description, weight)
    
    def evaluate(self, evaluator_llm, task, output):
        scores = {}
        for name, description, weight in self.criteria:
            score_text = evaluator_llm.generate(
                f"Score the following output on '{name}'.\n"
                f"Task: {task}\n"
                f"Criterion: {description}\n"
                f"Output: {output}\n\n"
                f"Score (1-10):",
                temperature=0
            )
            # Parse defensively: the model may reply "8" or "Score: 8/10".
            match = re.search(r"\d+", score_text)
            scores[name] = {
                "score": int(match.group()) if match else 0,
                "weight": weight
            }
        
        weighted_avg = sum(
            s["score"] * s["weight"] for s in scores.values()
        ) / sum(s["weight"] for s in scores.values())
        
        return scores, weighted_avg


rubric = EvaluationRubric([
    ("accuracy", "Are all facts and claims correct?", 3),
    ("completeness", "Does it cover all required aspects?", 2),
    ("clarity", "Is it easy to understand?", 2),
    ("actionability", "Does it provide actionable guidance?", 1),
])
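
As a sanity check on the arithmetic, here is the weighted average this rubric produces for a hypothetical set of scores (the scores are illustrative, not from a real evaluation run):

```python
# Worked example of the rubric's weighted average, using the
# weights defined above (accuracy 3, completeness 2, clarity 2,
# actionability 1) and made-up scores.
scores = {
    "accuracy":      {"score": 9, "weight": 3},
    "completeness":  {"score": 7, "weight": 2},
    "clarity":       {"score": 8, "weight": 2},
    "actionability": {"score": 6, "weight": 1},
}
total_weight = sum(s["weight"] for s in scores.values())               # 8
weighted_sum = sum(s["score"] * s["weight"] for s in scores.values())  # 27 + 14 + 16 + 6 = 63
weighted_avg = weighted_sum / total_weight
print(weighted_avg)  # 7.875
```

Note that a 9 on accuracy (weight 3) pulls the average up three times as hard as the 6 on actionability (weight 1) pulls it down.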

Tool-Augmented Evaluation

The evaluator becomes much more powerful when it has access to tools that provide objective measurements:

Code Quality Evaluation

# See code/evaluator_optimizer.py for the full implementation

def code_evaluator(code, test_suite):
    """Evaluate code using automated tools."""
    results = {}
    
    # Run tests
    test_results = run_tests(code, test_suite)
    results["tests"] = {
        "passed": test_results.passed,
        "failed": test_results.failed,
        "score": test_results.pass_rate
    }
    
    # Run linter
    lint_results = run_linter(code)
    results["lint"] = {
        "errors": lint_results.errors,
        "warnings": lint_results.warnings,
        "score": max(0, 10 - lint_results.error_count)
    }
    
    # Check complexity
    complexity = measure_complexity(code)
    results["complexity"] = {
        "cyclomatic": complexity.cyclomatic,
        "score": max(0, 10 - complexity.cyclomatic // 5)
    }
    
    return results
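
The helpers run_tests, run_linter, and measure_complexity are placeholders for real tooling. As one illustration, a rough AST-based branch count could stand in for measure_complexity (a simplified sketch, not a production metric):

```python
import ast

def measure_complexity(code):
    """Rough cyclomatic complexity: 1 + number of branch points.

    A simplified stand-in for the measure_complexity helper above;
    real tools (e.g. radon) count more constructs.
    """
    tree = ast.parse(code)
    branches = (ast.If, ast.For, ast.While, ast.And, ast.Or,
                ast.ExceptHandler, ast.IfExp)
    return 1 + sum(isinstance(node, branches) for node in ast.walk(tree))

sample = """
def f(x):
    if x > 0:
        for i in range(x):
            if i % 2:
                x += i
    return x
"""
print(measure_complexity(sample))  # 4
```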

Content Evaluation with Fact-Checking

# See code/evaluator_optimizer.py for the full implementation

async def content_evaluator(llm, content, search_tool):
    """Evaluate content accuracy using web search."""
    
    # Extract claims
    claims = llm.generate(
        f"Extract all factual claims from:\n{content}"
    )
    
    # Verify each claim
    verification_results = []
    for claim in parse_claims(claims):
        evidence = await search_tool.search(claim)
        verification = llm.generate(
            f"Does this evidence support the claim?\n"
            f"Claim: {claim}\n"
            f"Evidence: {evidence}\n"
            f"Answer: SUPPORTED, CONTRADICTED, or UNVERIFIABLE"
        )
        verification_results.append({
            "claim": claim,
            "verdict": verification.strip()
        })
    
    return verification_results
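
parse_claims is left abstract above. A minimal sketch, assuming the extraction model returns one claim per line, possibly as a numbered or bulleted list:

```python
import re

def parse_claims(claims_text):
    """Split LLM output into individual claims, one per line,
    stripping list markers like "1.", "2)", "-", or "*"."""
    claims = []
    for line in claims_text.splitlines():
        line = re.sub(r"^\s*(?:\d+[.)]|[-*])\s*", "", line).strip()
        if line:
            claims.append(line)
    return claims

print(parse_claims("1. Water boils at 100C.\n- The Earth is round.\n"))
# ['Water boils at 100C.', 'The Earth is round.']
```

In practice, asking the model for structured output (e.g. JSON) is more robust than parsing free text, but the line-based format keeps the prompt simple.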

Examples of Evaluator-Optimizer in Action

Literary Translation

Round 1: Translator produces initial translation
Evaluator: "Line 3 loses the metaphor in the original. 
            Line 7 is too literal — it sounds unnatural."
Round 2: Translator revises with more natural phrasing
Evaluator: "Much better. Line 3's metaphor is now captured. 
            Minor issue: Line 12's tone is too formal."
Round 3: Translator adjusts tone
Evaluator: "PASS — translation is faithful and natural."

Research Report Writing

Round 1: Writer produces initial report
Evaluator: "Missing data from 2025. Section 3 lacks citations. 
            Conclusion doesn't follow from the analysis."
Round 2: Writer revises with additional research
Evaluator: "Data is current. Citations added. Conclusion still 
            overgeneralizes from limited data."
Round 3: Writer tempers conclusions
Evaluator: "PASS — report is well-sourced and conclusions 
            are appropriately scoped."

Convergence and Stopping

A key challenge with evaluator-optimizer loops is knowing when to stop:

Score-Based Stopping

Stop when all evaluation scores exceed a threshold:

if all(score >= 8 for score in scores.values()):
    return output  # Quality threshold met

Improvement-Based Stopping

Stop when the improvement between rounds is negligible:

if current_score - previous_score < 0.5:
    return output  # Diminishing returns

Agreement-Based Stopping

Stop when the evaluator and generator agree there’s nothing more to improve:

if "no further improvements needed" in evaluation.lower():
    return output  # Both sides agree
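
The three strategies can also be combined into a single check. A sketch, assuming per-criterion scores have already been parsed into a dict:

```python
def should_stop(scores, prev_avg, evaluation_text,
                score_threshold=8, min_improvement=0.5):
    """Combine the three stopping rules above (illustrative sketch).

    scores: dict of criterion -> numeric score for the current round
    prev_avg: average score from the previous round, or None on round 1
    evaluation_text: the evaluator's free-text verdict
    """
    avg = sum(scores.values()) / len(scores)
    if all(s >= score_threshold for s in scores.values()):
        return True, "threshold met"
    if prev_avg is not None and avg - prev_avg < min_improvement:
        return True, "diminishing returns"
    if "no further improvements needed" in evaluation_text.lower():
        return True, "agreement"
    return False, ""

print(should_stop({"accuracy": 9, "clarity": 8}, None, "Looks fine."))
# (True, 'threshold met')
```

Returning the reason alongside the decision makes loop telemetry easier: you can log which rule actually terminated each run.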

When to Use Evaluator-Optimizer

This pattern is particularly effective when:

  - Quality criteria are well-defined and measurable
  - Objective signals (tests, linters, fact-checking tools) can back the evaluator
  - Output quality matters more than latency or token cost
  - A first draft is rarely good enough but is clearly improvable

When to Avoid

Avoid the pattern when criteria are subjective or hard to articulate, when a single generation pass is usually sufficient, or when latency and cost budgets cannot absorb multiple rounds.

Evaluator-Optimizer vs. Reflection

Aspect        Evaluator-Optimizer                 Reflection
Components    Separate generator + evaluator      Single LLM (or two with same structure)
Evaluation    Structured criteria/rubrics         Freeform self-critique
Tools         Often uses external verification    Optionally uses tools
Stopping      Score-based thresholds              Round-based or quality signal
Best for      Tasks with clear metrics            General improvement

In practice, the evaluator-optimizer pattern is a formalized version of reflection with stronger guarantees around evaluation quality and convergence.

Practical Tips

  1. Define criteria upfront — Vague criteria lead to vague feedback
  2. Use structured scoring — Numeric scores are easier to track than prose evaluations
  3. Cap the iterations — Usually 3–5 rounds capture most of the benefit
  4. Track scores across rounds — If scores plateau or decrease, something is wrong
  5. Consider using different models — A cheaper model can often evaluate effectively, even if generation requires a more capable model
  6. Make feedback actionable — "This could be better" is useless feedback; "Line 5 contains an incorrect date — it should be 2025, not 2024" is actionable
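
Tip 4 can be made concrete with a small helper that classifies the score trajectory across rounds (the 0.25 plateau threshold is an arbitrary choice):

```python
def score_trend(history):
    """Classify the trajectory of per-round scores (tip 4).

    history: list of weighted-average scores, one entry per round.
    """
    if len(history) < 2:
        return "insufficient data"
    deltas = [b - a for a, b in zip(history, history[1:])]
    if all(d <= 0 for d in deltas):
        return "regressing"
    if abs(deltas[-1]) < 0.25:  # arbitrary plateau threshold
        return "plateauing"
    return "improving"

print(score_trend([6.0, 7.5, 8.4]))  # improving
```

A "regressing" or "plateauing" result is a signal to stop the loop and inspect the feedback: the evaluator may be giving contradictory or unactionable critiques.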
