Positioning

GeraWitness vs. Constitutional AI vs. OpenAI Moderation: Layers, Not Rivals

Published 21 April 2026 · 8 min read

Coming soon — join the waitlist

Quick answer. Constitutional AI (Anthropic, 2022) and RLHF (widely used) shape model behaviour at training time. OpenAI’s Moderation API and equivalents catch disallowed content at inference. GeraWitness catches the specific class of agent-initiated action errors neither can detect — wrong-but-confident commits, novel jailbreaks, context the model does not have. These layer cleanly; none replace the others.

Honest framing

This is a comparison of different layers, not competing products. Every serious agent deployment should have all three.

Constitutional AI and RLHF (model-level)

Anthropic introduced Constitutional AI in December 2022, using AI feedback paired with a written constitution to shape model behaviour during training. OpenAI pioneered RLHF at production scale with InstructGPT (2022) and GPT-4. Both techniques produce models that refuse obvious harms, are honest-by-default, and align to human preferences at a high level.

What these catch: the common failure modes present in training data. The model understands "don’t help with weapons" because the behaviour was shaped at training.

What they miss: novel failure modes not in the training distribution; situations where the model is confident and wrong; context-dependent judgements where the right answer depends on facts the model does not have.

Moderation APIs (inference-time)

OpenAI’s Moderation API, Perspective API, Azure Content Safety and others classify text at inference time. They are fast, cheap, and scale indefinitely.

What they catch: disallowed content categories. Explicit sexual material, targeted harassment, incitement.

What they miss: action-shaped harm that is not about content. An agent that decides to commit to a £20,000 transaction based on a misunderstood instruction produces no content the moderation API would flag.

GeraWitness (action-level, pre-commit)

GeraWitness sits between agents and irreversible real-world actions. Tiers route actions to auto-commit, sampled review, mandatory human review, or hard refusal.

What it catches: action-shaped errors that model-level and moderation-level safety miss. Agents confidently wrong about value. Novel prompt-injection patterns. Cases where the policy has shifted since the model was trained.

What it misses: content-shaped harms that never reach the action layer. Low-tier transactions that don’t gate.
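The tier routing described above can be sketched in a few lines. This is a toy illustration under stated assumptions: the value thresholds, the sample rate, and the tier names are hypothetical stand-ins, not GeraWitness's actual policy.

```python
import random
from enum import Enum

class Tier(Enum):
    T0 = "auto-commit"
    T1 = "sampled review"
    T2 = "mandatory human review"
    T3 = "hard refusal"

def route(action_value_gbp: float, sample_rate: float = 0.05) -> Tier:
    """Route an agent action to a review tier by transaction value.

    Thresholds and sample_rate are illustrative assumptions only.
    """
    if action_value_gbp < 100:
        # Low-tier actions auto-commit, with a small random sample
        # pulled into review to keep the reviewer signal calibrated.
        return Tier.T1 if random.random() < sample_rate else Tier.T0
    if action_value_gbp < 10_000:
        return Tier.T2  # gate on a human reviewer before commit
    return Tier.T3      # too large to commit autonomously at all
```

A £20,000 commit like the one in the moderation example would route to `T3` here; the point is that the gate keys off the action's consequences, not its text.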

Feature matrix

| Feature | Constitutional AI / RLHF | Moderation APIs | GeraWitness |
| --- | --- | --- | --- |
| Where in stack | Training | Inference | Pre-commit |
| Latency | Zero | Milliseconds | Seconds (T2 only) |
| Catches novel jailbreaks | Partial | Partial | Yes (reviewer reasoning) |
| Catches wrong-but-confident commits | No | No | Yes |
| Catches disallowed content | Partial | Yes | Incidental |
| Needs humans | Training labelers | No | Yes (reviewers) |
| Scales per-request | Yes | Yes | Not for T2 |

How they compose

  1. The model has Constitutional AI / RLHF baked in at training — it refuses obvious harms and aligns to human preferences by default.
  2. At inference, the moderation API screens content.
  3. When the agent proposes an action, GeraNexus tags the risk.
  4. For T2+ actions, GeraWitness gates on human review.

The combination catches substantially more than any single layer. The layers are not adversarial; they are complementary.
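The four steps above can be sketched as a single pipeline. Every function here is an illustrative stub: the model call, the content screen, the GeraNexus-style risk tagger, and the reviewer gate are all assumptions standing in for real components, not actual APIs.

```python
def model_generate(prompt: str) -> str:
    # Step 1 (stub): the model, with Constitutional AI / RLHF
    # already baked in at training time.
    return f"proposed action for: {prompt}"

def moderation_flags(text: str) -> bool:
    # Step 2 (stub): an inference-time content screen. A real
    # deployment would call a moderation API here.
    banned = ("harassment", "explicit")
    return any(word in text.lower() for word in banned)

def risk_tier(action: str) -> int:
    # Step 3 (stub): GeraNexus-style risk tagging. This toy
    # heuristic treats anything money-shaped as high tier.
    return 2 if "£" in action else 0

def human_approves(action: str) -> bool:
    # Step 4 (stub): the GeraWitness reviewer gate. Default-deny
    # until a human reviewer has signed off.
    return False

def handle(prompt: str) -> str:
    action = model_generate(prompt)
    if moderation_flags(action):
        return "blocked: content"          # caught at layer 2
    if risk_tier(action) >= 2:
        return ("committed" if human_approves(action)
                else "held for review")    # caught at layer 3/4
    return "committed"
```

Note how the layers catch different things: the content screen fires on text, the pre-commit gate fires on consequences, and neither interferes with the other.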

Where we agree with the labs

Model-level safety is necessary and useful. The work Anthropic, OpenAI, Google, DeepMind and Meta have done on alignment matters. GeraWitness does not try to replace any of that.

Where we think there’s an under-invested layer

Action-level safety for commercial commits. Model labs are the wrong parties to operate it because it is an operational, not a research, problem. Running a reviewer pool is an ongoing operations discipline. Specialist organisations should do it.

Related reading

GeraNexus emits the action. GeraWitness reviews the commit. GeraMind supplies the minimal consent-scoped context the reviewer sees. Research drafts at /research.

Help design agent safety that scales.

Join the waitlist