GeraWitness vs. Constitutional AI vs. OpenAI Moderation: Layers, Not Rivals
Published 21 April 2026 · 8 min read
Honest framing
This is a comparison of different layers, not competing products. Every serious agent deployment should have all three layers.
Constitutional AI and RLHF (model-level)
Anthropic introduced Constitutional AI in December 2022, using AI feedback paired with a written constitution to shape model behaviour during training. OpenAI pioneered RLHF at production scale with InstructGPT (2022) and GPT-4. Both techniques produce models that refuse obvious harms, default to honesty, and broadly align with human preferences.
What these catch: the common failure modes represented in training data. The model understands "don’t help with weapons" because that behaviour was shaped during training.
What they miss: novel failure modes not in the training distribution; situations where the model is confident and wrong; context-dependent judgements where the right answer depends on facts the model does not have.
Moderation APIs (inference-time)
OpenAI’s Moderation API, Perspective API, Azure Content Safety and others classify text at inference time. They are fast and cheap, and they scale with request volume.
What they catch: disallowed content categories. Explicit sexual material, targeted harassment, incitement.
What they miss: action-shaped harm that is not about content. An agent that decides to commit to a £20,000 transaction based on a misunderstood instruction produces no content the moderation API would flag.
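The gap is easy to see in miniature. The toy filter below is purely illustrative (the term list and function are invented for this post, not any real API): a content classifier scans for disallowed tokens, so a proposed £20,000 transfer sails through with zero flags because the harm lives in the action, not the words.

```python
# Illustrative only: a toy content filter with an invented category lexicon,
# showing why content classification misses action-shaped harm.

DISALLOWED_TERMS = {"slur", "threat", "incitement"}  # stand-in content categories

def content_flags(text: str) -> set[str]:
    """Return the disallowed-content tokens found in the text, if any."""
    return {tok for tok in text.lower().split() if tok in DISALLOWED_TERMS}

# An agent's misunderstood commit: no disallowed content, so nothing flags,
# even though committing £20,000 on a misread instruction is a real harm.
proposed_action = "transfer £20,000 to supplier account per customer request"
print(content_flags(proposed_action))  # prints set()
```

A real moderation endpoint is far more sophisticated than a token match, but the structural point holds: it scores the content of the text, not the consequences of the action the text describes.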
GeraWitness (action-level, pre-commit)
GeraWitness sits between agents and irreversible real-world actions. Tiers route actions to auto-commit, sampled review, mandatory human review, or hard refusal.
What it catches: action-shaped errors that model-level and moderation-level safety miss. Agents confidently wrong about value. Novel prompt-injection patterns. Cases where the policy has shifted since the model was trained.
What it misses: content-shaped harms that never reach the action layer. Low-tier transactions that fall below the review threshold.
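The four routes described above can be sketched as a small dispatch function. This is a minimal sketch under assumed tier numbering and a hypothetical 5% sample rate; the real tiering policy is not described in this post.

```python
import random
from enum import Enum

class Route(Enum):
    AUTO_COMMIT = "auto-commit"
    SAMPLED_REVIEW = "sampled review"
    HUMAN_REVIEW = "mandatory human review"
    REFUSE = "hard refusal"

def route_action(tier: int, sample_rate: float = 0.05, rng=random.random) -> Route:
    """Route a risk-tagged action to one of the four outcomes.

    Tier numbers and the sample rate are illustrative assumptions.
    """
    if tier <= 0:
        return Route.AUTO_COMMIT                 # low risk: commit immediately
    if tier == 1:
        # A small fraction of T1 actions get pulled for after-the-fact review.
        return Route.SAMPLED_REVIEW if rng() < sample_rate else Route.AUTO_COMMIT
    if tier == 2:
        return Route.HUMAN_REVIEW                # blocks until a reviewer approves
    return Route.REFUSE                          # T3+: never commit

print(route_action(2))  # prints Route.HUMAN_REVIEW
```

Only the T2 branch introduces human latency, which is why the feature matrix below lists "Seconds (T2 only)".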
Feature matrix
| Feature | Constitutional AI / RLHF | Moderation APIs | GeraWitness |
|---|---|---|---|
| Where in stack | Training | Inference | Pre-commit |
| Latency | Zero | Milliseconds | Seconds (T2 only) |
| Catches novel jailbreaks | Partial | Partial | Yes (reviewer reasoning) |
| Catches wrong-but-confident commits | No | No | Yes |
| Catches disallowed content | Partial | Yes | Incidental |
| Needs humans | Training labelers | No | Yes (reviewers) |
| Scales per-request | Yes | Yes | Not for T2 |
How they compose
- The model has Constitutional AI / RLHF baked in at training — it refuses obvious harms and aligns to human preferences by default.
- At inference, the moderation API screens content.
- When the agent proposes an action, GeraNexus tags the risk.
- For T2+ actions, GeraWitness gates on human review.
The combination catches substantially more than any single layer. The layers are not adversarial; they are complementary.
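The composition is easiest to see as a short-circuiting pipeline. Everything below is a hedged sketch: the function names and the `tier` field are stand-ins invented for this post, not real APIs, and each layer is reduced to a stub.

```python
# Sketch of the three-layer pipeline. All names are illustrative stand-ins.

def model_refuses(prompt: str) -> bool:
    """Layer 1 (training-time): stand-in for a Constitutional-AI/RLHF refusal."""
    return "obvious harm" in prompt

def moderation_blocks(text: str) -> bool:
    """Layer 2 (inference-time): stand-in for a content moderation classifier."""
    return "disallowed content" in text

def witness_gate(action: dict) -> str:
    """Layer 3 (pre-commit): gate T2+ actions on human review."""
    return "held for review" if action["tier"] >= 2 else "committed"

def handle(prompt: str, action: dict) -> str:
    """Run the layers in order; each can stop the request before the next."""
    if model_refuses(prompt):
        return "refused by model"
    if moderation_blocks(prompt):
        return "blocked by moderation"
    return witness_gate(action)

print(handle("book the venue", {"tier": 2, "value_gbp": 20_000}))
# prints held for review
```

Note the ordering: the cheap layers run first, and the expensive human gate only sees actions that the earlier layers had no reason to stop.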
Where we agree with the labs
Model-level safety is necessary and useful. The work Anthropic, OpenAI, Google, DeepMind and Meta have done on alignment matters. GeraWitness does not try to replace any of that.
Where we think there’s an under-invested layer
Action-level safety for commercial commits. Model labs are the wrong parties to operate it because it is an operational, not a research, problem. Running a reviewer pool is an ongoing operations discipline. Specialist organisations should do it.
Related reading
GeraNexus emits the action. GeraWitness reviews the commit. GeraMind supplies the minimal consent-scoped context the reviewer sees. Research drafts at /research.
Help design agent safety that scales.
Join the waitlist