A human-in-the-loop AI system is only as good as the moment when the human looks at the agent's output and decides whether to trust it. The agent is not the system. The review experience is the system.

The reviewer is the system

We have built enough approval workflows to know what fails. The pattern is consistent: engineering teams spend roughly 90% of their effort on the agent and 10% on the reviewer's experience. After launch, every problem that surfaces is a problem with the reviewer's experience.

Reviewers don't say “the agent is wrong.” They say “I don't know how to verify this.” From the reviewer's side there are two failure modes: rubber-stamping (approving without reading because verification is too slow) and rejection by default (sending everything back because the explanation isn't there). Both destroy the system's value.

What to show, what to hide

The reviewer needs to make a fast, defensible judgment. They need three things, in this order:

  1. The decision. One sentence. Concrete. Actionable.
  2. The evidence. The specific source paragraphs that produced the decision, each one linkable.
  3. The counterfactual. Why not a different decision. What the agent considered and rejected.

They do not need the chain-of-thought. They do not need the full prompt. They do not need the raw retrieval list. Those are available on demand, in a side panel, for the reviewer who wants to dig. The default view is decision, evidence, counterfactual.
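
One way to enforce that split is in the payload itself, so the UI cannot show the noisy fields by accident. The sketch below is illustrative, not a prescribed schema: ReviewItem, default_view, and on_demand_view are hypothetical names, and the field types are assumptions.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass, field

# The three things the reviewer sees by default, in display order.
DEFAULT_FIELDS = ("decision", "evidence", "counterfactual")


@dataclass
class ReviewItem:
    """Hypothetical container for one agent decision awaiting review."""
    decision: str
    evidence: list[str]
    counterfactual: str
    # Everything below is side-panel material, loaded only on demand.
    chain_of_thought: str = ""
    full_prompt: str = ""
    raw_retrieval: list[str] = field(default_factory=list)


def default_view(item: ReviewItem) -> dict:
    """Return only decision, evidence, counterfactual, in that order."""
    return {name: getattr(item, name) for name in DEFAULT_FIELDS}


def on_demand_view(item: ReviewItem) -> dict:
    """Return the remaining fields for the reviewer who wants to dig."""
    return {k: v for k, v in asdict(item).items() if k not in DEFAULT_FIELDS}
```

Keeping the default view as an explicit allow-list means a new debugging field added later lands in the side panel unless someone deliberately promotes it.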

Explanations that survive cross-examination

The agent's explanation is not “here is what I thought.” That is a story. It is unverifiable and, worse, the model is good at generating plausible-sounding stories that have no connection to what it actually retrieved.

A defensible explanation is structural. It refers to the specific evidence chunks the agent retrieved, the rules or obligations those chunks invoke, and the decision logic the system applied. It is linkable, it is reproducible, and it survives the question “show me where you got that from.”

{
  "decision": "FLAG_FOR_REVIEW",
  "rationale": [
    {
      "claim": "Transaction exceeds the customer's typical transfer pattern",
      "evidence": [
        { "source": "customer_profile.v122", "ref": "transfer_pattern.p99" },
        { "source": "txn_history.2025-Q3",   "ref": "agg_window.30d" }
      ]
    },
    {
      "claim": "Destination institution is on the elevated-risk list",
      "evidence": [
        { "source": "risk_list.2026-05-14", "ref": "entry_4471" }
      ]
    }
  ],
  "counterfactual": {
    "considered": "APPROVE_AUTOMATICALLY",
    "rejected_because": "elevated-risk match requires human review per policy P-12.3"
  }
}
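
Because the explanation is structured, the “show me where you got that from” question can be asked by a machine before a human ever sees the case. The following is a minimal sketch under assumptions: unsupported_claims is a hypothetical helper, and we assume the system keeps a record of the (source, ref) pairs the agent actually retrieved for the case.

```python
from __future__ import annotations


def unsupported_claims(explanation: dict, retrieved: set[tuple[str, str]]) -> list[str]:
    """Return one message per claim or reference that cannot be traced."""
    problems = []
    for item in explanation.get("rationale", []):
        evidence = item.get("evidence", [])
        if not evidence:
            # A claim with no citation is a story, not an explanation.
            problems.append(f"no evidence cited: {item['claim']}")
            continue
        for ev in evidence:
            if (ev["source"], ev["ref"]) not in retrieved:
                problems.append(f"unresolved reference {ev['source']}:{ev['ref']}")
    return problems


# Rationale section of the payload above.
explanation = {
    "rationale": [
        {
            "claim": "Transaction exceeds the customer's typical transfer pattern",
            "evidence": [
                {"source": "customer_profile.v122", "ref": "transfer_pattern.p99"},
                {"source": "txn_history.2025-Q3", "ref": "agg_window.30d"},
            ],
        },
        {
            "claim": "Destination institution is on the elevated-risk list",
            "evidence": [{"source": "risk_list.2026-05-14", "ref": "entry_4471"}],
        },
    ]
}

# Hypothetical retrieval record; the risk-list entry is deliberately missing.
retrieved = {
    ("customer_profile.v122", "transfer_pattern.p99"),
    ("txn_history.2025-Q3", "agg_window.30d"),
}

problems = unsupported_claims(explanation, retrieved)
print(problems)
```

In this example the second claim cites a reference that is absent from the retrieval record, so it is reported. An explanation that fails this check can be blocked or marked before it reaches the reviewer's queue.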

Closing the loop into evaluation

Every reviewer action (approve, reject, modify) is signal. It is the most valuable signal the system produces, because it is grounded in expert judgment on a real case. And yet most systems we audit do not capture it as evaluation data.

We treat the review log as a continuous, growing eval dataset. Every disagreement between the agent and the reviewer becomes a candidate for the next training set, the next prompt iteration, the next refusal pattern. Every agreement validates that the system is operating in distribution.

Approval is a label. Treat it that way.
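
As a rough illustration of treating the review log as labeled data, the sketch below splits a log into agreements and disagreements and computes an agreement rate. ReviewRecord and its fields are hypothetical, and we assume that “approve” means the reviewer accepted the agent's decision unchanged.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewRecord:
    """Hypothetical row in the review log."""
    case_id: str
    agent_decision: str
    reviewer_action: str  # "approve", "reject", or "modify"


def split_log(log: list[ReviewRecord]) -> tuple[list[ReviewRecord], list[ReviewRecord]]:
    """Separate agreements from disagreements (reject or modify)."""
    agreements = [r for r in log if r.reviewer_action == "approve"]
    disagreements = [r for r in log if r.reviewer_action != "approve"]
    return agreements, disagreements


def agreement_rate(log: list[ReviewRecord]) -> float:
    """Fraction of cases the reviewer approved; 0.0 for an empty log."""
    if not log:
        return 0.0
    agreements, _ = split_log(log)
    return len(agreements) / len(log)


log = [
    ReviewRecord("c1", "FLAG_FOR_REVIEW", "approve"),
    ReviewRecord("c2", "APPROVE_AUTOMATICALLY", "reject"),
    ReviewRecord("c3", "FLAG_FOR_REVIEW", "approve"),
    ReviewRecord("c4", "FLAG_FOR_REVIEW", "modify"),
]

agreements, disagreements = split_log(log)
print(agreement_rate(log))                 # 0.5
print([r.case_id for r in disagreements])  # ['c2', 'c4']
```

The disagreement list is the candidate pool for the next prompt iteration or training set. The agreement rate, tracked over time, is a cheap drift signal.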

The ten-hour ergonomic test

The reviewer is not a researcher evaluating one decision a day. The reviewer is an operator processing decisions for eight to ten hours. The UI has to survive that. It has to be keyboard-first, with shortcuts for approve, reject, escalate, and annotate. It has to support bulk actions for high-confidence batches. It has to maintain context across sessions.

We watch reviewers use the system for a full shift before we ship. Density, latency, shortcut coverage, error rates, and fatigue patterns all show up in ten hours of real use. They do not show up in the demo.
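
Some of what a full shift reveals can also be read from the logs afterwards. The sketch below buckets decision latency by hour of shift and flags hours where the median falls under a floor, which may indicate rubber-stamping late in the day. The event format, the function names, and the 5-second floor are all assumptions for illustration, not measured thresholds.

```python
from __future__ import annotations

from collections import defaultdict
from statistics import median


def median_latency_by_hour(events: list[tuple[float, float]]) -> dict[int, float]:
    """Map hour-of-shift to median decision latency in seconds.

    Each event is (seconds_since_shift_start, decision_latency_seconds).
    """
    buckets: dict[int, list[float]] = defaultdict(list)
    for offset_s, latency_s in events:
        buckets[int(offset_s // 3600)].append(latency_s)
    return {hour: median(vals) for hour, vals in sorted(buckets.items())}


def suspect_hours(by_hour: dict[int, float], min_seconds: float = 5.0) -> list[int]:
    """Hours whose median latency is below the (assumed) reading floor."""
    return [hour for hour, med in by_hour.items() if med < min_seconds]


# Hypothetical shift: careful early decisions, very fast ones in hour 8.
events = [
    (0, 40.0),
    (1800, 50.0),
    (8 * 3600 + 10, 3.0),
    (8 * 3600 + 20, 4.0),
]

by_hour = median_latency_by_hour(events)
print(by_hour)                 # {0: 45.0, 8: 3.5}
print(suspect_hours(by_hour))  # [8]
```

A flagged hour is a prompt to go and watch, not a verdict. Fast decisions on a high-confidence batch are exactly what bulk actions are for.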


The agent does the work. The reviewer is what makes the work trustworthy. Engineer the reviewer experience like it is the most important surface in your system, because for everyone outside engineering, it is.