Level 2: Evidence-Driven Research

The research example upgrades the loop from "produce an answer" to "produce an evidence-backed answer." It carries questions, claims, sources, and gaps forward in state until coverage is strong enough to stop.

Pattern

  • Research questions first: the topic is decomposed into explicit questions.
  • Parallel by design: the tracker can support fan-out across multiple questions, although the fake-model demo works through one open question per iteration.
  • Claim-evidence matrix: each claim stores source IDs and an evidence quality score.
  • Gap-driven iteration: unanswered questions and claims without sources keep the loop open.
  • Citation verification: the deterministic coverage evaluator fails when evidence gaps remain.
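
As a rough illustration of the claim-evidence matrix described above, the sketch below models questions and claims as dataclasses and flattens the claims into rows. Apart from `text` and `source_doc_ids`, every field name here (`question_id`, `answered`, `evidence_quality`), the helper `claim_evidence_matrix`, and the 0.0 to 1.0 quality scale are assumptions made for illustration; the example's real classes may look different.

```python
from dataclasses import dataclass, field


@dataclass
class ResearchQuestion:
    # Hypothetical fields; the real class may differ.
    question_id: str
    text: str
    answered: bool = False


@dataclass
class ResearchClaim:
    text: str
    question_id: str  # assumed link back to the question the claim addresses
    source_doc_ids: list[str] = field(default_factory=list)
    evidence_quality: float = 0.0  # assumed scale: 0.0 (none) to 1.0 (strong)


def claim_evidence_matrix(claims: list[ResearchClaim]) -> list[dict]:
    """Flatten claims into rows: one row per claim with its sources and score."""
    return [
        {
            "claim": c.text,
            "sources": sorted(c.source_doc_ids),
            "quality": c.evidence_quality,
            "backed": bool(c.source_doc_ids),  # False marks an evidence gap
        }
        for c in claims
    ]


claims = [
    ResearchClaim("Loops need stop criteria", "q1", ["doc-2", "doc-1"], 0.8),
    ResearchClaim("Evaluators are deterministic", "q2"),
]
matrix = claim_evidence_matrix(claims)
```

In this sketch the second claim has no sources, so its row is marked `backed: False`, which is the kind of gap that would keep the loop open.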

Flow

[Diagram: flow of the research loop]

Example configuration

The run setup states the topic and the budgets explicitly:

run = LoopRun(
    example_id="level_2_research",
    goal="Produce an evidence-backed technical report on: <topic>",
    budgets=BudgetConfig(
        # Falls back to 4 when max_iterations is None or 0.
        max_iterations=max_iterations or 4,
        max_model_calls=20,
        stagnation_threshold=4,
    ),
)
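
To make the role of the three budgets more concrete, here is a minimal sketch of how such limits might be checked between iterations. `Budgets` and `stop_reason` are hypothetical stand-ins, not the framework's `BudgetConfig` API, and the reading of `stagnation_threshold` as "consecutive iterations without progress" is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Budgets:  # hypothetical stand-in for BudgetConfig
    max_iterations: int = 4
    max_model_calls: int = 20
    stagnation_threshold: int = 4


def stop_reason(
    budgets: Budgets,
    iterations: int,
    model_calls: int,
    stagnant_iterations: int,
) -> Optional[str]:
    """Return the name of the first exhausted budget, or None to keep going."""
    if iterations >= budgets.max_iterations:
        return "max_iterations"
    if model_calls >= budgets.max_model_calls:
        return "max_model_calls"
    # Assumption: stagnation counts consecutive iterations without progress.
    if stagnant_iterations >= budgets.stagnation_threshold:
        return "stagnation"
    return None


reason_a = stop_reason(Budgets(), iterations=4, model_calls=3, stagnant_iterations=0)
reason_b = stop_reason(Budgets(), iterations=1, model_calls=1, stagnant_iterations=0)
```

Under these assumptions, a budget check is a hard backstop that is independent of the coverage evaluator: the loop can end either because coverage passed or because a budget ran out.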

The tracker is the heart of the example:

class ResearchTracker:
    def __init__(self) -> None:
        self.questions: list[ResearchQuestion] = []
        self.claims: list[ResearchClaim] = []
        self.searched_queries: set[str] = set()

    def get_open_gaps(self) -> list[str]:
        # A claim with no source document IDs counts as an open gap.
        return [claim.text for claim in self.claims if not claim.source_doc_ids]
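
The sketch below illustrates gap-driven iteration with a stand-in for the model. It is not the example's actual loop: `Question`, `Claim`, `open_gaps`, and `fake_research` are hypothetical names, and `open_gaps` here combines unanswered questions with unsourced claims, as the Pattern section describes, rather than reproducing `get_open_gaps` exactly.

```python
from dataclasses import dataclass, field


@dataclass
class Question:  # hypothetical stand-in for ResearchQuestion
    text: str
    answered: bool = False


@dataclass
class Claim:  # hypothetical stand-in for ResearchClaim
    text: str
    source_doc_ids: list[str] = field(default_factory=list)


def open_gaps(questions: list[Question], claims: list[Claim]) -> list[str]:
    """Unanswered questions plus claims that cite no sources."""
    gaps = [f"unanswered: {q.text}" for q in questions if not q.answered]
    gaps += [f"unsourced: {c.text}" for c in claims if not c.source_doc_ids]
    return gaps


def fake_research(question: Question, doc_id: str) -> Claim:
    """Stand-in for a model call: answers the question with one sourced claim."""
    question.answered = True
    return Claim(text=f"Finding for: {question.text}", source_doc_ids=[doc_id])


questions = [Question("What is X?"), Question("How does X scale?")]
claims: list[Claim] = []
iterations = 0
max_iterations = 4

while open_gaps(questions, claims) and iterations < max_iterations:
    # One open question per iteration, mirroring the fake-model demo.
    target = next((q for q in questions if not q.answered), None)
    if target is None:
        break  # only unsourced claims remain; this sketch does not repair them
    claims.append(fake_research(target, doc_id=f"doc-{iterations + 1}"))
    iterations += 1
```

With two questions and a stand-in that always returns a sourced claim, the loop closes every gap after two iterations, well inside the iteration budget.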

And the stop signal comes from deterministic coverage:

# Pass only when every question is answered, at least one claim exists, and no gaps remain.
passed = all_answered and has_claims and not gaps

return EvaluationResult(
    evaluator_name="coverage",
    status=EvaluatorStatus.PASS if passed else EvaluatorStatus.FAIL,
    score=score,
    summary="questions_answered=<n>/<total> claims=<claims> open_gaps=<gaps>",
)
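
For readers who want to see where `all_answered`, `has_claims`, `gaps`, and `score` could come from, here is a self-contained sketch of a coverage check. `Status`, `Result`, and `evaluate_coverage` are hypothetical stand-ins for the framework's types, and the scoring rule (fraction of questions answered) is an assumption; the excerpt above does not show how `score` is computed.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):  # hypothetical stand-in for EvaluatorStatus
    PASS = "pass"
    FAIL = "fail"


@dataclass
class Result:  # hypothetical stand-in for EvaluationResult
    status: Status
    score: float
    summary: str


def evaluate_coverage(answered: list[bool], claim_sources: list[list[str]]) -> Result:
    """Deterministic check: same tracker contents always give the same result."""
    total = len(answered)
    n_answered = sum(answered)
    gaps = sum(1 for sources in claim_sources if not sources)

    all_answered = total > 0 and n_answered == total
    has_claims = len(claim_sources) > 0
    passed = all_answered and has_claims and gaps == 0

    # Assumed scoring rule: the fraction of questions answered.
    score = n_answered / total if total else 0.0
    summary = (
        f"questions_answered={n_answered}/{total} "
        f"claims={len(claim_sources)} open_gaps={gaps}"
    )
    return Result(Status.PASS if passed else Status.FAIL, score, summary)


partial = evaluate_coverage([True, False], [["doc-1"], []])
complete = evaluate_coverage([True, True], [["doc-1"], ["doc-2"]])
```

Because the check involves no model call, a failing result points directly at what is missing: an unanswered question, an absence of claims, or a claim without sources.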

Why it matters

This example makes a common agent failure mode visible: an answer can sound finished while the evidence behind it is still incomplete. Because the live tracker is stored in state and every claim must be backed by a source, the loop has a concrete reason to continue or stop.

Continue to Level 3 for sandboxed coding work and resumable execution.