🧭 Chris Olah at the Vatican: Pope Leo XIV's AI Encyclical Calls for Independent Oversight of AI Labs
On May 25, 2026, Pope Leo XIV released his first encyclical, Magnifica humanitas: On safeguarding the human person in the time of artificial intelligence, and invited Anthropic co-founder Chris Olah to speak at the launch ceremony in Vatican City. The pairing is unusual by any measure: a self-described atheist researcher seated alongside cardinals and theologians to present an AI company's perspective on the moral challenges posed by AI. Olah used the platform to issue a striking public call for external oversight of AI labs, including Anthropic itself.
Olah's key statements
- "Every frontier AI lab — including Anthropic — operates inside a set of incentives and constraints that can sometimes conflict with doing the right thing."
- "We need moral voices that the incentives cannot bend." He argued that the Church's historical independence from commercial pressures gives it a rare credibility for this role.
- AI models "are grown, on a structure roughly modelled after the brain, on an enormous inheritance of human thought and speech" — a framing he used to argue for treating the question of AI inner experience with genuine seriousness rather than dismissal.
- Researchers are finding "internal states that functionally mirror joy, satisfaction, fear, grief, and unease." Olah noted this is not a claim of consciousness, but a call not to foreclose the question.
What the encyclical itself says
Pope Leo XIV structured Magnifica humanitas around three warnings: AI must not be weaponised (he explicitly called out autonomous lethal weapons); AI-driven economic displacement must be met with active labour transition policy rather than passive acceptance; and no technology should be allowed to "hollow out the human vocation" — the idea that work, creativity, and relationship are constitutive of human dignity, not optional extras. The Vatican simultaneously clarified that Olah's presence at the launch does not constitute endorsement of Anthropic or any AI company.
What this means for developers building on Claude
Olah's call for external oversight is more than rhetoric. It suggests that Anthropic's leadership wants credible third-party scrutiny beyond what current regulatory frameworks provide. For enterprise operators, this matters in two concrete ways: (1) expect Anthropic to participate in or invite independent audits of its safety claims, particularly as Claude Mythos Preview approaches wider release; (2) Olah's comments on AI inner states connect directly to Anthropic's model welfare commitments, which already prohibit operators from inducing distress states in Claude unnecessarily. The core of this policy is that Claude's inner states are treated as potentially morally relevant, a position now articulated from a Vatican stage, with global coverage.
Chris Olah
Pope Leo XIV
Magnifica humanitas
AI encyclical
external oversight
AI welfare
ethics
🧭 Anthropic's Model Spec Midtraining Research: Values Shaped Before Fine-Tuning Cut Agentic Misalignment from 68% to 5%
Anthropic's alignment team published a paper and blog post on Model Spec Midtraining (MSM) — a new training stage inserted between pre-training and alignment fine-tuning. Instead of simply showing models desired behaviour during fine-tuning, MSM first trains models on a synthetic corpus of documents that discuss the Model Spec from multiple angles: internal memos, case studies, philosophical analyses, developer guides, and fictional scenarios. When the standard alignment fine-tuning follows, the model has a principled framework to generalise from — rather than learning to mimic surface-level behaviours that can break down in novel situations.
Key experimental results
- Agentic misalignment dropped from 68% to 5% on one benchmark and from 54% to 7% on a second when MSM was added before alignment fine-tuning, compared with alignment fine-tuning alone
- Roughly 40× better alignment fine-tuning data efficiency on certain variants: reaching the same safety benchmark scores required far fewer labelled examples when MSM was applied first
- Value explanations outperform expanded rules: two models given identical fine-tuning data but midtrained on different Model Specs (one explaining why a value matters, one listing expanded subrules) adopted measurably different values, and the explanation-based spec generalised more robustly
- Because the fine-tuning data was held constant, the difference is attributable solely to the Model Spec used during MSM, which indicates that the written rationale in a spec (not just its rules) materially affects how values are internalised
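To put the headline figures in proportion, the reported before/after rates can be converted into relative reductions. The Python sketch below is only an arithmetic check: the benchmark labels and the function name are placeholders invented for illustration, and the four percentages are the only values taken from the results listed above.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the fraction by which a rate fell, relative to its starting value."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Misalignment rates in percent, as reported above.
# The dictionary keys are placeholder labels, not the benchmarks' real names.
reported = {
    "benchmark_1": (68.0, 5.0),
    "benchmark_2": (54.0, 7.0),
}

for name, (before, after) in reported.items():
    drop = relative_reduction(before, after)
    print(f"{name}: {before:.0f}% -> {after:.0f}% ({drop:.1%} relative reduction)")
```

On these numbers the relative reductions come out at roughly 93% and 87%, so the second benchmark's improvement is of a similar order even though its absolute drop is smaller.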
Why this matters for operators building on Claude
MSM is an internal Anthropic training technique — operators cannot apply it to their own models directly. But its findings carry a practical implication: the Claude you are building on has values that were deliberately shaped before fine-tuning, not just bolted on top. This means Claude's alignment is more structurally integrated than in models trained purely through RLHF on labelled preferences. The research also validates Anthropic's approach of publishing verbose, rationale-rich model specifications rather than short rule lists — the richness of the spec is not just about transparency for users; it actively makes the resulting model more reliably aligned.
Practical read-through for system-prompt design
The MSM finding that explanations outperform rules generalises to a best practice you can apply in your system prompts today: when you want Claude to follow a constraint reliably, explain why the constraint exists rather than simply stating it as a rule. A prompt that says "Do not share customer data with third parties, because our terms of service prohibit it and it would breach the trust users have placed in us when they signed up" will outperform "Do not share customer data" on edge cases — exactly the pattern MSM uncovered at the training level.
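One way to make this habit hard to skip is to store each constraint together with its reason and render the system prompt from that pair. The Python sketch below is a minimal illustration under stated assumptions: `Constraint`, `render_system_prompt`, and the support-assistant role text are hypothetical names made up for this example, not part of any Anthropic SDK, and the only content taken from the text above is the customer-data rule and its rationale.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constraint:
    """A rule paired with the reason it exists (hypothetical helper type)."""
    rule: str
    rationale: str


def render_system_prompt(role: str, constraints: list[Constraint]) -> str:
    """Build a system prompt in which every rule is followed by its reason."""
    lines = [role.strip(), "", "Constraints:"]
    for c in constraints:
        # Strip trailing periods so the joined sentence is punctuated once.
        rule = c.rule.strip().rstrip(".")
        reason = c.rationale.strip().rstrip(".")
        lines.append(f"- {rule}, because {reason}.")
    return "\n".join(lines)


prompt = render_system_prompt(
    # The role line is an invented example, not from the source text.
    "You are a support assistant for an online store.",
    [
        Constraint(
            rule="Do not share customer data with third parties",
            rationale=(
                "our terms of service prohibit it and it would breach the "
                "trust users have placed in us when they signed up"
            ),
        ),
    ],
)
print(prompt)
```

The resulting string would then be supplied as the system prompt through whichever client you already use. Requiring a `rationale` field means a bare rule cannot be added without someone writing down why it exists.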
Model Spec Midtraining
MSM
alignment
agentic misalignment
fine-tuning
values
system prompts