Reflection is perhaps the most accessible and immediately impactful agentic design pattern. The core idea is simple: ask the LLM to examine its own output, identify weaknesses, and improve it.
This mirrors how humans work. A skilled writer doesn’t submit their first draft. They write, re-read, critique, and revise. Reflection gives LLMs this same capability, and the results can be dramatic — turning mediocre first-draft outputs into polished, high-quality results.
The simplest form of reflection is a three-step loop: generate a draft, critique it, and revise it.
```
┌──────────┐     ┌──────────────┐     ┌──────────┐
│ Generate │────►│   Reflect    │────►│  Revise  │
│ (Draft)  │     │  (Critique)  │     │(Improved)│
└──────────┘     └──────────────┘     └─────┬────┘
     ▲                                      │
     │                                      │
     └──────────────────────────────────────┘
                (repeat N times)
```
In self-reflection, a single LLM is used for both generation and critique. The key is crafting the critique prompt to be specific and constructive.
```python
# See code/reflection.py for the full implementation
def self_reflect(llm, task, max_rounds=3):
    """Generate output and iteratively improve it through self-reflection."""
    # Step 1: Initial generation
    draft = llm.generate(f"Perform the following task:\n{task}")
    for _ in range(max_rounds):
        # Step 2: Self-critique
        critique = llm.generate(
            f"You produced the following output for the task '{task}':\n\n"
            f"{draft}\n\n"
            f"Critically evaluate this output. Identify:\n"
            f"1. Errors or inaccuracies\n"
            f"2. Missing information\n"
            f"3. Areas where clarity could be improved\n"
            f"4. Logical gaps or unsupported claims\n"
            f"Be specific and constructive."
        )
        # Step 3: Revise based on critique
        draft = llm.generate(
            f"Original task: {task}\n\n"
            f"Your previous output:\n{draft}\n\n"
            f"Critique of your output:\n{critique}\n\n"
            f"Revise your output to address the critique. "
            f"Produce an improved version."
        )
    return draft
```
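These sketches assume an `llm` object exposing a single `generate(prompt, system=None)` method that returns a string. A minimal stub of that interface (hypothetical, purely for illustration; the class name and canned-response design are not from any library) makes the loops testable without network calls:

```python
# Hypothetical stub of the `llm` interface assumed by these sketches; any
# real client (OpenAI, Anthropic, a local model) can be wrapped to match it.
class ScriptedLLM:
    """Returns canned responses in order -- handy for exercising loops offline."""

    def __init__(self, responses):
        self._responses = iter(responses)

    def generate(self, prompt, system=None):
        # A real implementation would call a model API here.
        return next(self._responses)
```

Feeding `self_reflect` a `ScriptedLLM` with seven responses (one draft, then three critique/revision pairs for `max_rounds=3`) walks the full loop deterministically.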
A more effective variant uses two distinct agents — a generator and a critic — with different system prompts optimized for their roles.
```python
# See code/reflection.py for the full implementation
def two_agent_reflect(generator_llm, critic_llm, task, max_rounds=3):
    """Use separate generator and critic agents for reflection."""
    generator_prompt = (
        "You are an expert writer. Produce clear, accurate, "
        "well-structured output."
    )
    critic_prompt = (
        "You are a demanding but constructive editor. "
        "Your job is to find flaws, gaps, and areas for improvement. "
        "Be specific. Point to exact phrases or sections that need work."
    )
    draft = generator_llm.generate(task, system=generator_prompt)
    for _ in range(max_rounds):
        critique = critic_llm.generate(
            f"Review this output for the task '{task}':\n\n{draft}",
            system=critic_prompt,
        )
        if "no major issues" in critique.lower():
            break  # Critic is satisfied
        draft = generator_llm.generate(
            f"Task: {task}\nYour draft:\n{draft}\n"
            f"Editor feedback:\n{critique}\n"
            f"Revise accordingly.",
            system=generator_prompt,
        )
    return draft
```
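The substring check on "no major issues" works, but it is brittle: a critic can phrase approval many ways. One sturdier option (a sketch; the `VERDICT:` convention is an assumption of this example, not part of any API) is to instruct the critic, in its system prompt, to end every review with a machine-readable verdict line, and parse that instead:

```python
def critic_approved(critique: str) -> bool:
    """Check for an explicit verdict line at the end of a critique.

    Assumes the critic's system prompt instructs it to finish every review
    with a line of the form 'VERDICT: APPROVED' or 'VERDICT: REVISE'.
    """
    for line in reversed(critique.strip().splitlines()):
        if line.strip().upper().startswith("VERDICT:"):
            return "APPROVED" in line.upper()
    return False  # No verdict found: safest to assume more revision is needed
```

Replacing the `"no major issues"` test with `if critic_approved(critique): break` makes the stopping condition explicit on both sides of the conversation.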
Reflection becomes even more powerful when the critic has access to tools that provide ground truth. Instead of relying on the LLM’s judgment alone, the system can verify outputs against real-world data.
For code generation, the most effective reflection strategy is to run the generated code against test cases:
```python
# See code/reflection.py for the full implementation
def code_reflection(llm, task, test_cases, max_rounds=5):
    """Generate code and refine it using test execution feedback."""
    code = llm.generate(f"Write Python code to: {task}")
    for _ in range(max_rounds):
        # Run the code against test cases
        results = run_tests(code, test_cases)
        if results.all_passed:
            return code  # All tests pass -- we're done
        # Reflect on failures
        code = llm.generate(
            f"Your code for '{task}':\n```python\n{code}\n```\n\n"
            f"Test results:\n{results.summary}\n\n"
            f"Fix the code to pass all tests."
        )
    return code
```
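The `run_tests` helper is assumed rather than defined above. A minimal sketch follows, under two stated assumptions: test cases are `(function_name, args, expected)` tuples, and execution is not sandboxed, which a production harness would need before running model-generated code.

```python
from dataclasses import dataclass

# Hypothetical sketch of the run_tests helper used by code_reflection.
@dataclass
class TestResults:
    all_passed: bool
    summary: str

def run_tests(code, test_cases):
    """Execute generated code and check each (fn_name, args, expected) case."""
    namespace = {}
    try:
        exec(code, namespace)  # caution: no sandboxing in this sketch
    except Exception as e:
        return TestResults(False, f"Code failed to execute: {e}")
    lines, passed = [], 0
    for fn_name, args, expected in test_cases:
        try:
            got = namespace[fn_name](*args)
            ok = got == expected
        except Exception as e:
            got, ok = f"raised {e!r}", False
        passed += ok
        lines.append(f"{'PASS' if ok else 'FAIL'}: {fn_name}{args} -> {got}")
    return TestResults(passed == len(test_cases), "\n".join(lines))
```

The per-case `summary` lines matter as much as the pass/fail flag: they are what the LLM reads when it reflects on a failure.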
For research tasks, the critic can verify claims by searching for supporting evidence:
```python
# See code/reflection.py for the full implementation
def research_reflection(llm, topic, search_tool, max_rounds=3):
    """Generate research content and verify claims."""
    report = llm.generate(f"Write a research summary about: {topic}")
    for _ in range(max_rounds):
        # Extract key claims, one per line, then parse into a list
        claims_text = llm.generate(
            f"Extract the 3 most important factual claims from:\n{report}\n"
            f"List one claim per line, with no numbering."
        )
        claims = [line.strip() for line in claims_text.splitlines() if line.strip()]
        # Verify each claim against search results
        for claim in claims:
            evidence = search_tool.search(claim)
            if not evidence.supports_claim:
                report = llm.generate(
                    f"Your report:\n{report}\n\n"
                    f"This claim could not be verified: '{claim}'\n"
                    f"Search results: {evidence}\n"
                    f"Revise the report to correct or remove "
                    f"unverified claims."
                )
    return report
```
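The `search_tool` and the `evidence` object it returns are likewise assumed. The sketch below (a toy keyword matcher, purely illustrative; a real system would call a web-search or retrieval API) shows the minimal interface `research_reflection` relies on: a `.search(claim)` method returning an object with a `supports_claim` flag and a readable string form.

```python
from dataclasses import dataclass

# Hypothetical sketch of the search-tool interface assumed above.
@dataclass
class Evidence:
    """Minimal evidence object: a flag plus snippets for the revision prompt."""
    supports_claim: bool
    snippets: list

    def __str__(self):
        status = "supported" if self.supports_claim else "unverified"
        return f"[{status}] " + " | ".join(self.snippets)

class KeywordSearchTool:
    """Toy verifier: a claim counts as supported if it shares words with a known fact."""

    def __init__(self, known_facts):
        self.known_facts = known_facts

    def search(self, claim):
        words = set(claim.lower().split())
        hits = [f for f in self.known_facts if words & set(f.lower().split())]
        return Evidence(supports_claim=bool(hits), snippets=hits or ["no results"])
```

Because `Evidence.__str__` includes the matched snippets, interpolating `{evidence}` into the revision prompt gives the LLM concrete material to correct against, not just a boolean.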
Reflection is one of the most reliable and immediately useful patterns. Andrew Ng considers it and Tool Use the two most mature agentic patterns, noting: “Reflection is a relatively basic type of agentic workflow, but I’ve been delighted by how much it improved my applications’ results.”
The reflection pattern draws from several important papers: