Why Agent Safety Needs Humans in the Loop

Published 21 April 2026 · 8 min read

Quick answer. Frontier labs have done excellent work on model-level safety (RLHF, Constitutional AI, moderation APIs). None of this catches everything. For high-stakes commercial actions — large payments, medical bookings, legal commitments — the marginal cost of a trained human reviewer is tiny compared to the cost of a bad commit. The right pattern is not full autonomy; it is calibrated oversight.

The mental model

Every airliner has an autopilot. Every airliner also has a pilot. Nobody thinks the pilot is redundant. The autopilot handles the routine; the pilot handles the exceptions. The industry settled on this shape over decades and saved many lives by not being dogmatic about full automation.

Agents have the same shape. Most actions will be automatic. Some will need review. The question is not whether to have reviewers; it is how to calibrate where the review boundary sits.

What model-level safety catches

RLHF catches the common failure modes represented in the training data. Constitutional AI catches classes of behaviour flagged in the constitution. Moderation APIs catch disallowed content categories. All of these matter. All of these have been shipped by the major labs.

What model-level safety doesn’t catch

  • Novel jailbreaks not in the training distribution.
  • Socially-engineered prompt injection through long context chains.
  • Agent-to-agent attacks where the caller embeds hostile instructions.
  • Context the model does not have ("this user is on hospice care").
  • Regulatory shifts that haven’t made it into the model yet.
  • Cases where the model is confident and wrong.

Why pre-commit review, not post-hoc audit

Post-hoc audit catches patterns; it doesn’t prevent individual harms. For irreversible actions — a payment, a legal commitment, a booked surgery — the cost of a single bad commit can exceed the cost of a year of reviews. The economics favour catching the one that matters at commit time.

The cost math

A reviewer's per-review cost is their fully loaded hourly cost divided by the number of reviews they complete per hour. A T2 review takes ~60 seconds, so a reviewer clears at most about 60 per hour. On a realistic cost base, the per-review marginal cost is small compared to the average T2 transaction value. For T2 transactions where the expected loss from a bad commit exceeds the review cost, review is the rational default. This is the whole design premise of GeraWitness.
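
The break-even argument above can be written down as a short calculation. The sketch below is illustrative only: the function names, the catch-rate parameter, and every number in the usage lines are assumptions made for the example, not figures from GeraWitness.

```python
def review_cost_per_item(hourly_cost: float, seconds_per_review: float = 60.0) -> float:
    """Marginal cost of one review: the hourly cost prorated to the review duration."""
    return hourly_cost * seconds_per_review / 3600.0


def review_is_rational(
    p_bad: float,
    loss_if_bad: float,
    catch_rate: float,
    cost_per_review: float,
) -> bool:
    """True when the expected loss a review avoids exceeds what the review costs.

    catch_rate is an assumption not discussed in the post: the fraction of
    bad commits that a reviewer actually stops.
    """
    expected_loss_avoided = p_bad * catch_rate * loss_if_bad
    return expected_loss_avoided > cost_per_review


# Placeholder inputs, chosen only to exercise the functions.
cost = review_cost_per_item(hourly_cost=60.0)
print(cost)
print(review_is_rational(p_bad=0.001, loss_if_bad=5000.0, catch_rate=0.8, cost_per_review=cost))
```

With these placeholder inputs a review costs 1.0 per item and avoids an expected loss of about 4.0, so the gate pays for itself. Drop the loss to 100.0 and it no longer does, which suggests that kind of transaction does not need a human gate.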

Why automation alone is worse than the combination

Automation-only systems have a common failure mode: the system gets better at the 95th-percentile happy path and worse at the rare failure modes, because the feedback loop is about the happy path. A human reviewer sees the rare failure modes and surfaces them. The combined system is better than either alone.

Known objections

"Humans introduce latency." True. T0 and T1 tiers do not gate on human review — only T2 does. For T2, a 30-90 second gate is usually acceptable. For emergency use cases it is not, and those categories either fall into T0 with guardrails or T3 hard-refusal.

"Reviewers are biased / tired / cheap labour." Only if you design the reviewer pool badly. Our model is full- time, well-paid, trained, rotated, and supervised. Cost-cutting here defeats the point.

"This doesn’t scale." T2 volume is a single-digit percentage of transactions by design. The reviewer pool scales with the T2 band, not with total transaction volume.

Who should be building this

Frontier labs have the model-level safety problem. Agent-commerce platforms have the action-level safety problem. Those are different problems. GeraWitness is the agent-commerce safety layer; it assumes model-level safety is someone else’s problem (and it is being solved well).

How this fits the Gera stack

GeraNexus emits the risk tag. GeraWitness enforces the review. The combined stack gives agent developers a clean handoff: build whatever agent you like; commerce flows through the risk-aware layer that has real humans in it.
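
As a rough illustration of that handoff, the sketch below maps a risk tag to an enforcement decision using the tiers described earlier: T0 and T1 commit without a human gate, T2 waits for a reviewer, and T3 is refused outright. The Tier and Decision enums and the gate function are hypothetical names for this example, not the GeraNexus or GeraWitness API.

```python
from enum import Enum


class Tier(Enum):
    """Risk tag attached to an action (hypothetical representation)."""
    T0 = "T0"
    T1 = "T1"
    T2 = "T2"
    T3 = "T3"


class Decision(Enum):
    COMMIT = "commit"                    # proceed with no human gate
    HOLD_FOR_REVIEW = "hold_for_review"  # block until a reviewer decides
    REFUSE = "refuse"                    # hard refusal, never committed


def gate(tier: Tier) -> Decision:
    """Map a risk tag to the enforcement decision for that tier."""
    if tier is Tier.T3:
        return Decision.REFUSE
    if tier is Tier.T2:
        return Decision.HOLD_FOR_REVIEW
    # T0 and T1 do not gate on human review.
    return Decision.COMMIT


for tier in Tier:
    print(tier.value, gate(tier).value)
```

The agent itself never appears in this function. It only needs to produce an action that carries a tag, which is what keeps the handoff clean.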
