
Open Questions: Human Oversight at Scale

Published 21 April 2026 · 7 min read


Quick answer. Six hard problems: reviewer burnout on emotionally heavy categories, cultural calibration (what counts as OK in Country A may not in Country B), training a reviewer pool quickly without quality loss, conflicts of interest when a reviewer knows the parties, keeping cost per review low enough not to distort the economics, and governance of the oversight system itself. We are publishing these because we are not done thinking through any of them.

Why publish these

Content moderation has spent 15 years learning these lessons at great cost to reviewers. Transferring those lessons cleanly to agent-commerce review matters. Publishing our open questions lets people who have lived this work correct us.

1. Reviewer burnout

Moderation of disturbing content causes documented psychological harm. Agent-commerce review is less graphic on average — but some categories (medical emergency, legal coercion) are high-intensity.

Current thinking: mandatory rotation off high-intensity queues, caps on consecutive hours, mandatory time away from high-intensity work, and paid access to mental health services.
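As a sketch of how those limits could be enforced at queue-assignment time. Every name and number below is illustrative, not a production value:

```python
from dataclasses import dataclass

# Illustrative limits only, not production values.
MAX_CONSECUTIVE_HI_HOURS = 2.0  # cap on consecutive high-intensity hours
MAX_HI_STREAK_DAYS = 5          # days of high-intensity work before rotation

@dataclass
class ReviewerState:
    hi_streak_days: int           # consecutive days with high-intensity work
    hi_hours_this_session: float  # high-intensity hours in the current session

def may_take_high_intensity(r: ReviewerState) -> bool:
    """Gate high-intensity queue assignment on rotation and hour caps."""
    if r.hi_streak_days >= MAX_HI_STREAK_DAYS:
        return False  # mandatory rotation off the queue
    if r.hi_hours_this_session >= MAX_CONSECUTIVE_HI_HOURS:
        return False  # consecutive-hours cap hit; route to other queues
    return True
```

The point of encoding the caps in the scheduler, rather than in a handbook, is that they cannot be waived under queue pressure.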

Want input on: what actually worked versus what only looked good on paper, from trust-and-safety practitioners who have operated these programmes.

2. Cultural calibration

A reasonable booking in one country is suspicious in another. Reviewers need either local knowledge or a well-structured policy that encodes it.

Current thinking: reviewer pool distributed by region, with regional sign-off on category policies. Policies written in plain language, reviewed by local counsel, translated into reviewer operating languages.
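A minimal sketch of the routing this implies, with hypothetical region codes, queue names, and policy version labels:

```python
# Hypothetical mapping: each regional queue runs the policy version
# that regional counsel signed off on.
REGIONAL_POLICY = {
    "EU":    "category-policy-eu-v3",
    "APAC":  "category-policy-apac-v2",
    "LATAM": "category-policy-latam-v1",
}

def route_case(region: str) -> tuple[str, str]:
    """Route a case to its regional reviewer queue and policy version.
    Falls back to a global queue where no regional sign-off exists yet."""
    policy = REGIONAL_POLICY.get(region)
    if policy is None:
        return ("queue:global", "category-policy-global-v1")
    return (f"queue:{region.lower()}", policy)
```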

Want input on: how global trust-and-safety programmes handle regional calibration without becoming inconsistent across regions.

3. Training the reviewer pool quickly

A reviewer pool that scales with T2 volume needs reliable onboarding, and new reviewers take weeks to reach quality thresholds.

Current thinking: shadow-reviewing. New reviewers review the same items as an experienced reviewer without seeing the experienced decisions, and the two sets of decisions are compared. A match rate above a threshold releases the new reviewer to production queues.
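A minimal sketch of that release gate, with hypothetical threshold and sample-size values (choosing them well is exactly the open question below):

```python
# Hypothetical values; picking them is the open question.
MATCH_THRESHOLD = 0.95   # required agreement with the experienced reviewer
MIN_SHADOW_CASES = 200   # minimum shadow-reviewed cases before release

def agreement_rate(trainee: list[str], expert: list[str]) -> float:
    """Fraction of shadow cases where the trainee's decision matched
    the experienced reviewer's (made without seeing it)."""
    matches = sum(t == e for t, e in zip(trainee, expert))
    return matches / len(trainee)

def ready_for_production(trainee: list[str], expert: list[str]) -> bool:
    return (len(trainee) >= MIN_SHADOW_CASES
            and agreement_rate(trainee, expert) >= MATCH_THRESHOLD)
```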

Want input on: calibration curves — what match threshold is high enough? Academic work on inter-rater reliability helps but is not operational.
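One reason raw match thresholds are hard to set is chance agreement: on a queue where 90 percent of cases are approvals, two raters agree often by accident, so a 0.95 match rate means less than it sounds. The standard correction from the inter-rater literature is Cohen's kappa; a minimal version:

```python
from collections import Counter

def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Cohen's kappa: observed agreement corrected for the agreement
    two raters would reach by chance given their label distributions."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    chance = sum(ca[k] * cb[k] for k in ca.keys() & cb.keys()) / n**2
    if chance == 1.0:
        return 1.0  # degenerate case: both raters use a single label
    return (observed - chance) / (1 - chance)
```

What the literature does not tell us is the operational part: what kappa, over how many cases, on which case mix, justifies releasing a reviewer.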

4. Conflicts of interest

A reviewer reviews a transaction and happens to know one of the parties.

Current thinking: reviewers must declare known relationships, and the queue system never assigns them a case involving a declared party. Geographic distribution makes overlap rare in most cases. Random re-review by independent reviewers provides a second line of defence.
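A sketch of the queue-side exclusion, assuming declared relationships are stored as sets of party ids; the names and the re-review rate are hypothetical:

```python
import random

REREVIEW_RATE = 0.05  # hypothetical fraction of cases re-reviewed blind

def eligible(case_parties: set[str],
             declared: dict[str, set[str]]) -> list[str]:
    """Reviewers whose declared relationships do not touch this case.
    `declared` maps reviewer id -> party ids they declared knowing."""
    return [rid for rid, known in declared.items()
            if known.isdisjoint(case_parties)]

def assign(case_parties: set[str], declared: dict[str, set[str]]):
    """Pick a primary reviewer; sometimes add an independent second pass."""
    pool = eligible(case_parties, declared)
    primary = random.choice(pool)  # assumes the pool is non-empty
    second = None
    if len(pool) > 1 and random.random() < REREVIEW_RATE:
        second = random.choice([r for r in pool if r != primary])
    return primary, second
```

The obvious gap, and the reason for the question below, is that this only catches relationships the reviewer declares.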

Want input on: better ways to detect relationship risk without invasive identity checks.

5. Cost per review

The economics depend on the cost per review staying low enough that T2 review is cheaper than the expected loss from a bad commit.

Current thinking: good tooling. The reviewer sees exactly what they need within 10 seconds and decides within 60. At realistic reviewer rates, the per-case cost works out small relative to most T2 transactions.
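The break-even arithmetic, with illustrative inputs (none of these are our actual figures):

```python
# Illustrative numbers only; none are actual figures.
reviewer_rate_per_hour = 30.00  # loaded hourly reviewer cost
seconds_per_case = 10 + 60      # <10s to see context, <60s to decide
cost_per_review = reviewer_rate_per_hour * seconds_per_case / 3600

p_bad = 0.02         # chance a T2 commit would have been bad
loss_if_bad = 120.00 # expected loss from a bad commit
expected_prevented_loss = p_bad * loss_if_bad

# Review pays for itself when it costs less than the loss it prevents.
print(f"cost per review ~ ${cost_per_review:.2f}")          # ~ $0.58
print(f"prevented loss  ~ ${expected_prevented_loss:.2f}")  # ~ $2.40
print("worth it" if cost_per_review < expected_prevented_loss
      else "review costs more than it saves")
```

The benchmark question is whether those assumed inputs (rate, handle time, base rate of bad commits) match what mature moderation and fraud operations actually see.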

Want input on: benchmarks from content moderation and fraud review that we can compare our targets against.

6. The oversight-of-the-overseer problem

Who reviews the reviewers? Who sets the policy? Who audits the policy?

Current thinking: the operations team sets policy; a legal and ethics committee (internal, with external members) reviews it; an independent trust-and-safety advisory board audits operations annually.

Want input on: governance patterns that have actually held up in other fields.

How to help

Email research at gerawitness dot com, or join the waitlist. We are especially interested in trust-and-safety practitioners, labour-rights researchers, and governance specialists.

Help design agent safety that scales.

Join the waitlist