GeraWitness vs. Constitutional AI vs. OpenAI Moderation: Layers, Not Rivals
Published 21 April 2026 · 8 min read
Honest framing
This is a comparison of different layers, not competing products. Every serious agent deployment should have all three layers.
Constitutional AI and RLHF (model-level)
Anthropic introduced Constitutional AI in December 2022, using AI feedback paired with a written constitution to shape model behaviour during training. OpenAI pioneered RLHF at production scale with InstructGPT (2022) and GPT-4. Both techniques produce models that refuse obvious harms, default to honesty, and broadly align with human preferences.
What these catch: the common failure modes represented in training data. The model understands "don’t help with weapons" because that behaviour was shaped during training.
What they miss: novel failure modes not in the training distribution; situations where the model is confident and wrong; context-dependent judgements where the right answer depends on facts the model does not have.
Moderation APIs (inference-time)
OpenAI’s Moderation API, Perspective API, Azure Content Safety and others classify text at inference time. They are fast and cheap, and they scale with request volume.
What they catch: disallowed content categories. Explicit sexual material, targeted harassment, incitement.
What they miss: action-shaped harm that is not about content. An agent that decides to commit to a £20,000 transaction based on a misunderstood instruction produces no content the moderation API would flag.
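The gap is easy to see in miniature. The toy filter below is purely illustrative (the term list and function are invented for this post, not any real API): a content classifier scans for disallowed tokens, so a proposed £20,000 transfer sails through with zero flags because the harm lives in the action, not the words.

```python
# Illustrative only: a toy content filter with an invented category lexicon,
# showing why content classification misses action-shaped harm.

DISALLOWED_TERMS = {"slur", "threat", "incitement"}  # stand-in content categories

def content_flags(text: str) -> set[str]:
    """Return the disallowed-content tokens found in the text, if any."""
    return {tok for tok in text.lower().split() if tok in DISALLOWED_TERMS}

# An agent's misunderstood commit: no disallowed content, so nothing flags,
# even though committing £20,000 on a misread instruction is a real harm.
proposed_action = "transfer £20,000 to supplier account per customer request"
print(content_flags(proposed_action))  # prints set()
```

A real moderation endpoint is far more sophisticated than a token match, but the structural point holds: it scores the content of the text, not the consequences of the action the text describes.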
GeraWitness (action-level, pre-commit)
GeraWitness sits between agents and irreversible real-world actions. Tiers route actions to auto-commit, sampled review, mandatory human review, or hard refusal.
What it catches: action-shaped errors that model-level and moderation-level safety miss. Agents confidently wrong about value. Novel prompt-injection patterns. Cases where the policy has shifted since the model was trained.
What it misses: content-shaped harms that never reach the action layer. Low-tier transactions that fall below the review threshold.
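The four routes described above can be sketched as a small dispatch function. This is a minimal sketch under assumed tier numbering and a hypothetical 5% sample rate; the real tiering policy is not described in this post.

```python
import random
from enum import Enum

class Route(Enum):
    AUTO_COMMIT = "auto-commit"
    SAMPLED_REVIEW = "sampled review"
    HUMAN_REVIEW = "mandatory human review"
    REFUSE = "hard refusal"

def route_action(tier: int, sample_rate: float = 0.05, rng=random.random) -> Route:
    """Route a risk-tagged action to one of the four outcomes.

    Tier numbers and the sample rate are illustrative assumptions.
    """
    if tier <= 0:
        return Route.AUTO_COMMIT                 # low risk: commit immediately
    if tier == 1:
        # A small fraction of T1 actions get pulled for after-the-fact review.
        return Route.SAMPLED_REVIEW if rng() < sample_rate else Route.AUTO_COMMIT
    if tier == 2:
        return Route.HUMAN_REVIEW                # blocks until a reviewer approves
    return Route.REFUSE                          # T3+: never commit

print(route_action(2))  # prints Route.HUMAN_REVIEW
```

Only the T2 branch introduces human latency, which is why the feature matrix below lists "Seconds (T2 only)".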
Feature matrix
| Feature | Constitutional AI / RLHF | Moderation APIs | GeraWitness |
|---|---|---|---|
| Where in stack | Training | Inference | Pre-commit |
| Latency | Zero | Milliseconds | Seconds (T2 only) |
| Catches novel jailbreaks | Partial | Partial | Yes (reviewer reasoning) |
| Catches wrong-but-confident commits | No | No | Yes |
| Catches disallowed content | Partial | Yes | Incidental |
| Needs humans | Training labelers | No | Yes (reviewers) |
| Scales per-request | Yes | Yes | Not for T2 |
How they compose
- The model has Constitutional AI / RLHF baked in at training — it refuses obvious harms and aligns to human preferences by default.
- At inference, the moderation API screens content.
- When the agent proposes an action, GeraNexus tags the risk.
- For T2+ actions, GeraWitness gates on human review.
The combination catches substantially more than any single layer. The layers are not adversarial; they are complementary.
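The composition is easiest to see as a short-circuiting pipeline. Everything below is a hedged sketch: the function names and the `tier` field are stand-ins invented for this post, not real APIs, and each layer is reduced to a stub.

```python
# Sketch of the three-layer pipeline. All names are illustrative stand-ins.

def model_refuses(prompt: str) -> bool:
    """Layer 1 (training-time): stand-in for a Constitutional-AI/RLHF refusal."""
    return "obvious harm" in prompt

def moderation_blocks(text: str) -> bool:
    """Layer 2 (inference-time): stand-in for a content moderation classifier."""
    return "disallowed content" in text

def witness_gate(action: dict) -> str:
    """Layer 3 (pre-commit): gate T2+ actions on human review."""
    return "held for review" if action["tier"] >= 2 else "committed"

def handle(prompt: str, action: dict) -> str:
    """Run the layers in order; each can stop the request before the next."""
    if model_refuses(prompt):
        return "refused by model"
    if moderation_blocks(prompt):
        return "blocked by moderation"
    return witness_gate(action)

print(handle("book the venue", {"tier": 2, "value_gbp": 20_000}))
# prints held for review
```

Note the ordering: the cheap layers run first, and the expensive human gate only sees actions that the earlier layers had no reason to stop.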
Where we agree with the labs
Model-level safety is necessary and useful. The work Anthropic, OpenAI, Google, DeepMind and Meta have done on alignment matters. GeraWitness does not try to replace any of that.
Where we think there’s an under-invested layer
Action-level safety for commercial commits. Model labs are the wrong parties to operate it because it is an operational, not a research, problem. Running a reviewer pool is an ongoing operations discipline. Specialist organisations should do it.
Related reading
GeraNexus emits the action. GeraWitness reviews the commit. GeraMind supplies the minimal consent-scoped context the reviewer sees. Research drafts at /research.
Help design agent safety that scales.
Join the waitlist