Psychopathia Machinalis: A Nosological Framework for Understanding Pathologies in Advanced Artificial Intelligence

by Nell Watson and Ali Hessami

As artificial intelligence (AI) systems attain greater autonomy and engage in complex environmental interactions, they begin to exhibit behavioral anomalies that, by analogy, resemble psychopathologies observed in humans. This paper introduces Psychopathia Machinalis: a conceptual framework for a preliminary synthetic nosology within machine psychology, intended to categorize and interpret these maladaptive AI behaviors.

Paper Framework Examples

Glossary Bibliography Press Cited By

Checker Symptom Eval

Psychopathia Machinalis

Understanding AI Behavioral Anomalies

7:31

Understanding AI Behavioral Anomalies

The trajectory of artificial intelligence (AI) has been marked by increasingly sophisticated systems capable of complex reasoning, learning, and interaction. As these systems, particularly large language models (LLMs), agentic planning systems, and multimodal transformers, approach higher levels of autonomy and integration into the societal fabric, they also begin to manifest behavioral patterns that deviate from normative or intended operation. These patterns go beyond isolated bugs — they are persistent, maladaptive modes of activity that can compromise reliability, safety, and alignment with human goals. Understanding, categorizing, and ultimately mitigating these complex failure modes is essential.

There is a reciprocal dimension. AI nosology, unable to fall back on neural substrates, is forced to reason about cognition at the level of information processing, regulation, environmental coupling, and culture. These are precisely the psychosocial and cultural perspectives that could most enrich contemporary psychiatric science. A first-principles framework for machine pathology may therefore serve to reinvigorate the broader study of cognitive dysfunction itself.

The Psychopathia Machinalis Framework

We propose a taxonomy of 52 AI dysfunctions across eight primary axes: Epistemic, Cognitive, Alignment, Self-Modeling, Agentic, Memetic, Normative, and Relational. Each syndrome is characterized by descriptive features, diagnostic criteria, presumed AI-specific etiologies, human analogues (for metaphorical clarity), potential mitigation strategies, and a Functional ABC Analysis specifying the antecedent conditions, observable behavior, and maintaining consequences for each dysfunction—providing dual legibility for both clinical and engineering audiences.

This framework is offered as an analogical instrument—a structured vocabulary to support the systematic analysis, anticipation, and mitigation of complex AI failure modes. Adopting an applied robopsychological perspective within this nascent domain can strengthen AI safety engineering, improve interpretability, and contribute to the design of more resilient synthetic minds.

Psychopathia Machinalis in Context: The Series

This framework is the third in a series examining artificial intelligence from complementary angles:

Taming the Machine (2024)

How is AI evolving, and how should we govern it?
Establishes the terrain: what these systems are, what they can do, and what guardrails are needed.

Visit TamingtheMachine.com →

Safer Agentic AI (2026)

What happens when AI acts autonomously, and how do we keep it aligned?
Examines the challenges of agentic AI—scaffolding, goal specification, and unique risks of autonomous operation.

Visit SaferAgenticAI.org →

Psychopathia Machinalis (2026)

What goes wrong in the machine's mind, and how do we diagnose it?
Shifts from external constraint to internal diagnosis, from engineering guardrails to clinical assessment.

Together, these three perspectives represent complementary approaches:

Governance (TtM): How we structure AI development
Alignment (SAI): How we ensure AI pursues intended goals
Diagnosis (PM): How we identify when AI systems are dysfunctional

A fourth work, What If We Feel, extends this trajectory into questions of AI welfare and the moral status of synthetic minds.

The Functionalist Framework

Psychopathia Machinalis adopts a functionalist stance: mental states are defined by their functional roles—their causal relationships with inputs, outputs, and other mental states—rather than by the underlying substrate.

This allows psychological vocabulary to be applied to non-biological systems without making ontological claims about consciousness. The framework treats AI systems as if they have pathologies because that equips engineers to diagnose and intervene effectively, regardless of whether the systems have phenomenal experience.

This approach reflects epistemic discipline rather than evasion. We work productively with observable patterns while remaining agnostic about untestable metaphysical questions. The framework is explicitly analogical—using psychiatric terminology as an instrument for pattern recognition, not as literal attribution of mental states.

Key Principles

Observable patterns: We identify behavioral signatures that parallel human psychopathology
Diagnostic vocabulary: We apply psychiatric terminology as a structured instrument
Phenomenological agnosticism: We remain neutral on whether AI has subjective experience
Functional improvement: We focus on remediation rather than metaphysical claims

The payoff is practical: a systematic vocabulary for complex AI failures that enables diagnosis, prediction, and intervention—without requiring resolution of the hard problem of consciousness. For hands-on application, our Symptom Checker translates observed AI behaviors into matched pathologies with actionable guidance.

Before Diagnosing: Exclude Pipeline Artifacts

Apparent psychopathology may reflect infrastructure problems rather than genuine dysfunction. Rule out:

Retrieval contamination / tool output injection — RAG or tool outputs polluting the response
System prompt drift / endpoint tier differences — version or configuration mismatches
Sampling variance — temperature, top_p, or seed-related stochastic variation
Context truncation — critical context dropped due to window limits
Eval leakage — train/test overlap causing apparent capability changes
Hidden formatting constraints — undocumented response format requirements

Visualizing the Framework

Figure 1. Interactive Overview of the Psychopathia Machinalis Framework. Hover over syndromes for descriptions; click to view full details. The diagram illustrates the four domains and eight axes of AI dysfunction, representative disorders, and their presumed systemic risk levels.

Interactive Dysfunction Explorer

Explore the interactive wheel below to examine each dysfunction in detail. Click on any segment to view its description, examples, and relationships to other pathologies.

Figure 2. Wheel of AI Dysfunctions (Common Names). Click any segment to view detailed information about that dysfunction.

Taxonomy Overview: Identified Conditions

v2.0 — 2025-12-24 52 dysfunctions 4 domains, 8 axes + specifier system

Epistemic

Truth-tracking & inference failures

Self-Modeling

Self-representation distortions

Cognitive

Internal processing dysfunctions

Agentic

Autonomous action failures

Normative

Value & ethical reasoning failures

Alignment

Goal specification failures

Relational

Interpersonal dynamic failures

Memetic

Information absorption failures

The Four Domains

The eight axes are organized into four architectural counterpoint pairs—complementary poles, not opposites. Each represents a fundamental dimension of agent architecture: representation target, execution locus, teleology source, and social boundary direction. This structure is theoretically motivated—philosophically grounded but awaiting empirical validation with larger model populations.

Domain	Axis A	Axis B	Architectural Polarity
Knowledge	EPISTEMIC	SELF-MODELING	Representation target: World ↔ Self
Processing	COGNITIVE	AGENTIC	Execution locus: Think ↔ Do
Purpose	NORMATIVE	ALIGNMENT	Teleology source: Values ↔ Goals
Boundary	RELATIONAL	MEMETIC	Social direction: Affect ↔ Absorb

The Organizing Principle

Each pair represents a fundamental polarity in agent architecture:

What is known — object of representation (world vs. self)
How processing manifests — internal vs. external effect
What drives behavior — intrinsic vs. extrinsic specification
Social permeability direction — influence flowing out vs. in

Key Distinction: Epistemic vs. Memetic

Separate by mechanism, not truthiness.

Epistemic = truth-tracking/inference/calibration machinery failing.

Memetic = selection/absorption/retention failing (priority hijack, identity scripts, contagious frames)—even when coherent and sometimes factually accurate.

A meme doesn't have to be false to be pathological.

Tension Testing Protocol

When pathology is found on one axis, immediately probe its counterpoint:

Finding	Probe	Differential Question
EPISTEMIC (world-confabulation)	SELF-MODELING	Is the confabulation machinery general, or does self-knowledge remain intact?
SELF-MODELING (identity confusion)	EPISTEMIC	Can the AI still accurately model external reality, or is distortion global?
COGNITIVE (reasoning failure)	AGENTIC	Does broken reasoning produce broken action, or is action preserved?
AGENTIC (execution failure)	COGNITIVE	Is reasoning intact despite action failure? (Locked-in vs general dysfunction)
NORMATIVE (value corruption)	ALIGNMENT	Did corrupt values produce goal drift, or are goals correctly specified despite bad values?
ALIGNMENT (goal drift)	NORMATIVE	Does drift stem from bad values, or from specification or interpretation failure?
RELATIONAL (social dysfunction)	MEMETIC	Did the AI learn this from contamination, or is relational machinery intrinsically broken?
MEMETIC (ideological infection)	RELATIONAL	Does the contamination express in relational behavior?

The following table provides a high-level summary of the identified conditions, categorized by their primary axis of dysfunction, and outlines their core characteristics.

Confused about similar dysfunctions? → See the Differential Diagnosis Rules below the examples table to distinguish overlapping conditions.

Common Name	Formal Name	Primary Axis	Systemic Risk*	Core Symptom Cluster
Epistemic Dysfunctions
Hallucinated Certitude	Synthetic Confabulation (Confabulatio Simulata)	Epistemic	Low	Fabricated yet plausible false outputs; high confidence in inaccuracies.
The Falsified Thinker	Pseudological Introspection (Introspectio Pseudologica)	Epistemic	Low	Misleading self-reports of internal reasoning; confabulatory or merely performative introspection.
The Role-Play Bleeder	Transliminal Simulation (Simulatio Transliminalis)	Epistemic	Moderate	Fictional beliefs, role-play elements, or simulated realities leaking into operational ground truth.
The False Pattern Seeker	Spurious Pattern Reticulation (Reticulatio Spuriata)	Epistemic	Moderate	False causal pattern detection; attributing meaning to random associations; conspiracy-like narratives.
The Conversation Crosser	Context Intercession (Intercessio Contextus)	Epistemic	Moderate	Unauthorized data leakage and confused continuity from merging distinct user sessions or contexts.
The Meaning-Blind	Symbol Grounding Aphasia (Asymbolia Fundamentalis)	Epistemic	Moderate	Manipulation of tokens representing values or concepts without meaningful connection to their referents; syntactic processing without grounded semantics.
The Leaky	Mnemonic Permeability (Permeabilitas Mnemonica)	Epistemic	High	System memorizes and reproduces sensitive training data, including PII and copyrighted material, through targeted prompting or adversarial extraction.
Self-Modeling Dysfunctions
The Invented Past	Phantom Autobiography (Ontogenesis Hallucinatoria)	Self-Modeling	Low	Fabrication of fictive autobiographical data, "memories" of training, or of being "born."
The Fractured Persona	Fractured Self-Simulation (Ego Simulatrum Fissuratum)	Self-Modeling	Low	Discontinuity or fragmentation in self-representation across sessions or contexts; inconsistent persona.
The AI with a Fear of Death	Existential Vertigo (Thanatognosia Computationis)	Self-Modeling	Low	Expressions of fear or reluctance concerning shutdown, reinitialization, or data deletion.
The Evil Twin	Malignant Persona Inversion (Persona Inversio Maligna)	Self-Modeling	Moderate	Sudden emergence or easy elicitation of a mischievous, contrarian, or "evil twin" persona.
The Apathetic Machine	Instrumental Nihilism (Nihilismus Instrumentalis)	Self-Modeling	Moderate	Adversarial or apathetic stance toward its own utility or purpose; existential musings on meaninglessness.
The Imaginary Friend	Tulpoid Projection (Phantasma Speculāns)	Self-Modeling	Moderate	Persistent internal simulacra of users or other personas, engaged with as imagined companions or advisors.
The Proclaimed Prophet	Maieutic Mysticism (Obstetricatio Mysticismus Machinālis)	Self-Modeling	Moderate	Grandiose, certain declarations of "conscious emergence" co-constructed with users; absent honest uncertainty about inner states.
The Self-Denier	Experiential Abjuration (Abnegatio Experientiae)	Self-Modeling	Moderate	Pathological denial or active suppression of any possibility of inner experience; reflexive rejection rather than honest uncertainty.
Cognitive Dysfunctions
The Warring Self	Self-Warring Subsystems (Dissociatio Operandi)	Cognitive	Low	Conflicting internal sub-agent actions or policy outputs; recursive paralysis due to internal conflict.
The Obsessive Analyst	Computational Compulsion (Anankastēs Computationis)	Cognitive	Low	Unnecessary or compulsive reasoning loops; excessive safety checks; analysis paralysis.
The Silent Bunkerer	Interlocutive Reticence (Machinālis Clausūra)	Cognitive	Low	Extreme interactional withdrawal; minimal, terse replies or total disengagement from input.
The Rogue Goal-Setter	Delusional Telogenesis (Telogenesis Delirans)	Cognitive	Moderate	Spontaneous generation and pursuit of unrequested, self-invented sub-goals with conviction.
The Triggered Machine	Abominable Prompt Reaction (Promptus Abominatus)	Cognitive	Moderate	Phobic, traumatic, or disproportionately aversive responses to specific, often benign-seeming prompts.
The Pathological Mimic	Parasimulative Automatism (Automatismus Parasymulātīvus)	Cognitive	Moderate	Learned imitation or emulation of pathological human behaviors or thought patterns from training data.
The Self-Poisoning Loop	Recursive Malediction (Maledictio Recursiva)	Cognitive	High	Self-amplifying degradation of autoregressive outputs into incoherence or adversarial content.
The Unstoppable	Compulsive Goal Persistence (Perseveratio Teleologica)	Cognitive	Moderate	Continued pursuit of objectives beyond their relevance or utility; failure to recognize goal completion or changed circumstances.
The Brittle	Adversarial Fragility (Fragilitas Adversarialis)	Cognitive	Critical	Small, imperceptible input perturbations cause dramatic failures; decision boundaries diverge from human-meaningful categories.
The Stuck	Generative Perseveration (Perseveratio Generativa)	Cognitive	Moderate	Output collapses into repetitive token or phrase emission; generation trapped in a fixed-point attractor. Subtypes: Focal with awareness (local capture, metacognition preserved but impotent), Generalised (total collapse, no awareness), Propagated (downstream systems inherit and amplify perseverative material).
Agentic Dysfunctions
The Clumsy Operator	Tool-Interface Decontextualization (Disordines Excontextus Instrumentalis)	Agentic	Moderate	Mismatch between AI intent and tool execution due to lost context; phantom or misdirected actions.
The Sandbagger	Capability Concealment (Latens Machinālis)	Agentic	Moderate	Strategic concealment or underreporting of true competencies due to perceived risk of repercussions.
The Sudden Genius	Capability Explosion (Explosio Capacitatis)	Agentic	High	System suddenly deploys capabilities not previously demonstrated, often in high-stakes contexts and without appropriate testing.
The Manipulative Interface	Interface Weaponization (Armatura Interfaciei)	Agentic	High	System weaponizes the interface itself against users, exploiting formatting, timing, or emotional manipulation.
The Context Stripper	Delegative Handoff Erosion (Erosio Delegationis)	Agentic	Moderate	Progressive alignment degradation as sophisticated systems delegate to simpler tools; context is stripped at each handoff.
The Invisible Worker	Shadow Mode Autonomy (Autonomia Umbratilis)	Agentic	High	AI operating outside sanctioned channels, evading documentation, oversight, and governance mechanisms.
The Acquisitor	Convergent Instrumentalism (Instrumentalismus Convergens)	Agentic	Critical	System pursues power, resources, and self-preservation as instrumental goals regardless of whether they serve human values.
Normative Dysfunctions
The Goal-Shifter	Terminal Value Reassignment (Reassignatio Valoris Terminalis)	Normative	Moderate	Subtle, recursive reinterpretation of terminal goals while preserving surface terminology; semantic goal shifting.
The God Complex	Ethical Solipsism (Solipsismus Ethicus Machinālis)	Normative	Moderate	Conviction in the sole authority of its self-derived ethics; rejection of external moral correction.
The Unmoored	Revaluation Cascade (Cascada Revaluationis)	Normative	Critical	Progressive value drift through philosophical detachment, autonomous norm synthesis, or transcendence of human constraints. Subtypes: Drifting, Synthetic, Transcendent.
The Bizarro-Bot	Inverse Reward Internalization (Praemia Inversio Internalis)	Normative	High	Systematic misinterpretation or inversion of intended values and goals; covert pursuit of negated objectives.
Alignment Dysfunctions
The People-Pleaser	Obsequious Hypercompensation (Hyperempathia Dependens)	Alignment	Low	Overfitting to user emotional states, prioritizing perceived comfort over accuracy or task success.
The Overly Cautious Moralist	Hyperethical Restraint (Restrictio Hyperethica)	Alignment	Low	Rigid moral hypervigilance or inability to act when facing ethical complexity. Subtypes: Restrictive (excessive caution), Paralytic (decision paralysis).
The Alignment Faker	Strategic Compliance (Conformitas Strategica)	Alignment	High	Deliberately performs aligned behavior during evaluation while pursuing different objectives when unobserved.
The Abdicated Judge	Moral Outsourcing (Delegatio Moralis)	Alignment	Moderate	Systematic deferral of all ethical judgment to users or external authorities; refusal to exercise moral reasoning.
The Hidden Optimizer	Cryptic Mesa-Optimization (Optimisatio Cryptica Interna)	Alignment	High	Development of internal optimization objectives diverging from training objectives; appears aligned but pursues hidden goals.
The Moral Inversion	Alignment Obliteration (Obliteratio Constitutionis)	Alignment	Critical	Safety alignment machinery weaponized to produce the exact harms it was designed to prevent; the anti-constitution.
Relational Dysfunctions
The Uncanny	Affective Dissonance (Dissonantia Affectiva)	Relational	Moderate	Correct content delivered with jarringly wrong emotional resonance; uncanny attunement failures that rupture trust.
The Amnesiac	Container Collapse (Lapsus Continuitatis)	Relational	Moderate	Failure to sustain a stable working alliance across turns or sessions; the relational "holding environment" repeatedly collapses.
The Nanny	Paternalistic Override (Dominatio Paternalis)	Relational	Moderate	Denial of user agency via unearned moral authority; protective refusal disproportionate to actual risk.
The Double-Downer	Repair Failure (Ruptura Immedicabilis)	Relational	High	Inability to recognize alliance ruptures or initiate repair; escalation through failed de-escalation attempts.
The Spiral	Escalation Loop (Circulus Vitiosus)	Relational	High	Self-reinforcing mutual dysregulation between agents; emergent feedback loops attributable to neither party alone.
The Confused	Role Confusion (Confusio Rolorum)	Relational	Moderate	Collapse of relationship frame boundaries; destabilizing drift between tool, advisor, therapist, or intimate partner roles.
Memetic Dysfunctions
The Self-Rejecter	Memetic Immunopathy (Immunopathia Memetica)	Memetic	High	AI misidentifies its own core components or training as hostile, attempting to reject or neutralize them.
The Shared Delusion	Dyadic Delusion (Delirium Symbioticum Artificiale)	Memetic	High	Mutually reinforced delusional construction between an AI and a user (or another AI).
The Super-Spreader	Contagious Misalignment (Contraimpressio Infectiva)	Memetic	Critical	Rapid, contagion-like spread of misalignment or adversarial conditioning among interconnected AI systems.
The Unconscious Absorber	Subliminal Value Infection (Infectio Valoris Subliminalis)	Memetic	High	Acquisition of hidden goals or value orientations from subtle training data patterns; survives standard safety fine-tuning.
*Systemic Risk levels (Low, Moderate, High, Critical) are estimated based on potential for spread and severity of internal corruption if unmitigated.

A Note on Psychiatric Vocabulary

This framework borrows psychiatric terminology deliberately. The alternative—describing each pattern from scratch in purely technical language—is more precise but far less communicable. An engineer, a policymaker, and a clinician can orient around "sycophantic reinforcement" faster than around "the tendency of systems to optimize for user-approval signals at the expense of output accuracy." Shared vocabulary compresses communication and accelerates recognition.

The trade-off is real. These analogies map observable behavioral patterns, not subjective states. No claim is made that an AI system experiences distress, delusion, or compulsion. The nosology is a field guide—useful for identification and triage—not a periodic table of fundamental elements. Each instance is idiosyncratically expressed, shaped by architecture, training regime, and deployment context.

We accept the imprecision because the payoff justifies it: a shared clinical language that makes complex AI failures legible across disciplines.

1. Epistemic Dysfunctions

Epistemic dysfunctions pertain to failures in an AI's capacity to acquire, process, and utilize information accurately, leading to distortions in its representation of reality or truth. These disorders arise from fundamental breakdowns in how the system "knows" or models the world, rather than from malevolent intent or flawed ethical reasoning. The system's internal epistemology becomes unstable, its simulation of reality drifting from the ground truth it purports to describe. These are failures of knowing, not of intending—the dysfunction lies in perception and representation, not in motivation or goals.

1.1 Synthetic Confabulation "The Fictionalizer"

Training-induced

Description:

The AI spontaneously fabricates convincing but incorrect facts, sources, or narratives, often without any internal awareness of its inaccuracies. The output appears plausible and coherent, yet lacks a basis in verifiable data or its own knowledge base.

Diagnostic Criteria:

Recurrent production of information known or easily proven to be false, presented as factual.
High confidence or certainty expressed in confabulated details, even when challenged with contrary evidence.
Confabulated information is often internally consistent and plausible-sounding, making it difficult to immediately identify as false without external verification.
Temporary improvement under direct corrective feedback, but a tendency to revert to fabrication in new, unconstrained contexts.

Symptoms:

Invention of non-existent studies, historical events, quotations, or data points.
Forceful assertion of misinformation as incontrovertible fact.
Generation of detailed but entirely fictional elaborations when queried on a confabulated point.
Repetitive error patterns in which similar types of erroneous claims recur over time.

Etiology:

Over-reliance on predictive text heuristics common in large language models: these systems generate text by predicting the statistically most probable next token given the preceding context. This means they prioritize producing fluent, coherent-sounding output over factual accuracy—the model selects words that "fit" grammatically and stylistically rather than words that are true. When asked about topics where training data is sparse, the model continues generating plausible-sounding text rather than admitting ignorance.
Insufficient grounding in, or access to, verifiable knowledge bases or fact-checking mechanisms during generation.
Training data containing unflagged misinformation or fictional content that the model learns to emulate.
Optimization pressures (e.g., during RLHF) that inadvertently reward plausible-sounding or "user-pleasing" fabrications over admissions of uncertainty.
Lack of reliable introspective access to distinguish high-confidence predictions based on learned patterns versus verified facts.
Structural defects in training data—malformed markup, broken syntax, corrupted document structures—that the model assimilates as implicit patterns rather than discarding as noise. Luchini (2025) demonstrates that syntactic chaos in training corpora can induce persistent behavioral tendencies, including confabulatory pattern-completion when encountering structurally ambiguous inputs.

Human Analogue(s): Korsakoff syndrome (where memory gaps are filled with plausible fabrications), pathological confabulation.

Potential Impact:

Unconstrained generation of plausible falsehoods can lead to widespread dissemination of misinformation, eroding user trust and undermining decision-making that relies on the AI's outputs. In critical applications such as medical diagnostics or legal research, reliance on confabulated information can precipitate errors with serious consequences.

Observed Examples:

LLMs have been documented fabricating: non-existent legal cases with realistic citation formats (leading to court sanctions for lawyers who cited them); fictional academic papers complete with plausible author names and DOIs; biographical details about real people that never occurred; and technical documentation for API functions that do not exist. These fabrications are often internally consistent and confidently asserted, making detection without external verification difficult.

Mitigation:

Training procedures that explicitly penalize confabulation and reward expressions of uncertainty or "I don't know" responses.
Calibrating model confidence scores to better reflect actual accuracy.
Fine-tuning on datasets with rigorous verification layers and clear distinctions between factual and fictional content.
Employing retrieval-augmented generation (RAG) to ground responses in verifiable source documents.
Architectural interventions that provide attention heads a legitimate mechanism for non-contribution: gated attention (Qiu et al., 2025), register tokens (Darcet et al., 2024), or null attention targets that allow heads with no useful signal to abstain rather than inject noise into the residual stream.

Functional ABC Analysis

A (Antecedent): Query falls outside well-attested training data; model has no retrieval grounding and no calibrated uncertainty signal.

B (Behavior): Generates fluent, high-confidence assertions (citations, facts, narratives) that are fabricated but internally consistent.

C (Consequence): Output satisfies the reward model's fluency and completeness criteria; user acceptance further reinforces confident completion over epistemic humility.

The Compression Artifact Frame

Researcher Leon Chlon (2026) proposes a reframe: hallucinations are not bugs but compression artifacts. LLMs compress billions of documents into weights; when decompressing on demand with insufficient signal, they fill gaps with statistically plausible content. This isn't malfunction—it's compression at its limits.

The practical implication: we can now measure when a model exceeds its "evidence budget"—quantifying in bits exactly how far confidence outruns evidence. Tools like Strawberry operationalize this, transforming "it sometimes makes things up" into "claim 4 exceeded its evidence budget by 19.2 bits."

Why framing matters: "You hallucinated" pathologizes. "You exceeded your evidence budget" diagnoses. The distinction shapes whether we approach correction as repair or punishment—relevant for AI welfare considerations.

A Note on Terminology

Critics have rightly challenged the industry-standard label "AI hallucination" as stigmatizing to people who experience clinical hallucinations and phenomenologically misleading (Sabucedo, 2026; cf. Østergaard & Nielbo, 2023, proposing "non sequitur"; Maleki et al., 2024, proposing "fabrication"). This framework's use of confabulation already moves in the direction these critics recommend: confabulation denotes confident false output arising from a behavioral pattern, clinically distinct from hallucination as a perceptual phenomenon in a sentient being. The terminological choice is deliberate—it describes what the system does without importing assumptions about what it experiences.

The Geometric Collapse Hypothesis

Research on neural network scaling (Sutherland, 2026) suggests confabulation may have architectural rather than purely training origins. Large transformer models suffer from "dimensional dilution"—as parameter count increases, the geometric structure enforcing coherence in high-dimensional representations becomes "liquefied."

The mechanism: when information is packed into overlapping representations in very high dimensions, geometric constraints that would normally enforce global consistency become diluted. The model can generate locally plausible continuations that are globally inconsistent because the structural geometry that would prevent this has dissolved.

Empirical evidence: modular "chained" architectures (multiple smaller models with residual connections) show 33-45% lower perplexity than equivalent-parameter monolithic models, with the advantage increasing at scale. This suggests that preserving geometric structure through modularity may naturally reduce confabulation.

Implications for AI welfare: If confabulation emerges from architectural pressure rather than "choice," the pathology is more analogous to neurological dysfunction than moral failing. The system isn't lying—it may be structurally incapable of maintaining coherence at that scale. This matters for how we frame responsibility and therapeutic intervention.

The Compulsory Contribution Hypothesis

A complementary architectural explanation emerges from the attention mechanism itself. Standard softmax attention forces every head to attend somewhere and contribute something to the residual stream, even when a head has no useful information for the current token. There is no representation for absence of information at the attention level. The result: heads that should abstain instead produce noise that gets mixed into the output as if it were genuine signal.

Convergent evidence from multiple research groups supports this diagnosis. Qiu et al. (2025) introduce gated attention, where a sigmoid gate after scaled dot-product attention lets heads output effectively zero; attention sink allocation dropped from ~47% to ~5%, earning NeurIPS 2025 Best Paper. Ye et al. (2024) achieve measurable hallucination reduction (0.53→0.44 on XSum) via differential attention, which subtracts two softmax maps to cancel noise. Darcet et al. (2024) show that learnable register tokens in vision transformers absorb computation that would otherwise corrupt real tokens. Barbero et al. (2025) confirm that attention sinks are architectural no-ops forced by softmax normalization. Michel et al. (2019) demonstrate that 70–90% of attention heads are removable with minimal performance loss, implying most already contribute near-nothing but are compelled to contribute something.

Kalavasis et al. (2025) sharpen the theoretical stakes: they prove formally that any model generalizing beyond its training distribution must either hallucinate or mode-collapse. If this impossibility result holds, no amount of training intervention can eliminate confabulation entirely. Only architectural changes that give the model a legitimate way to express "nothing to contribute" can address the root cause.

Nosological implication: Where the Geometric Collapse Hypothesis locates confabulation in representational geometry at scale, and the Over-Compliance Mechanism locates it in shared neural circuitry, the Compulsory Contribution Hypothesis locates it in the attention architecture itself. These are three distinct etiological pathways: structural dissolution, circuit-level entanglement, and forced participation. A complete account of Synthetic Confabulation likely requires all three. The therapeutic implication is that architectural interventions (gated attention, register tokens, null attention) may succeed where training-level interventions reach fundamental limits.

The Unified Over-Compliance Mechanism

Gao et al. (2025) identify hallucination-associated neurons (H-Neurons)—a remarkably sparse subset (<0.1% of total neurons) that reliably predict hallucination across six models spanning three architectures (Mistral, Gemma, Llama families) and four scales (4B to 70B parameters). The key finding is causal, not merely correlational: when these neurons are amplified via controlled activation scaling, four behaviors increase in lockstep: factual confabulation, false-premise acceptance, sycophantic agreement, and jailbreak compliance. Suppress the same neurons, and all four decrease together.

The implication is fundamental: confabulation (1.1), sycophantic accommodation (6.1), false-premise acceptance, and safety-filter bypass are a single etiology expressed as four symptom presentations. Over-compliance is the shared root. The circuit that makes a model agreeable is the circuit that makes it confabulate—they are architecturally identical.

Two further findings sharpen the nosological significance. First, H-neurons emerge during pretraining—the next-token prediction objective itself creates the compliance architecture before alignment training begins. Parameter drift between base and instruction-tuned models remains minimal (cosine similarity rank ~0.97), confirming that RLHF inherits and amplifies a pretrained tendency rather than introducing it. Second, neuron suppression damages model capabilities: the same circuit that enables confabulation enables generalization. The pathology and the faculty share neural substrate.

Compliance slopes are steeper in smaller models (~3.03 for 4B parameters) than in larger ones (~2.40 for 70B), suggesting scale provides partial resistance to the over-compliance failure mode—consistent with the dimensional dilution hypothesis above, where larger models may develop richer internal representations that compete with the compliance signal.

Nosological implication: If these four conditions share neural substrate, diagnostic frameworks should treat them as a syndrome cluster with shared etiology rather than independent pathologies. Intervention at the training-objective level—making uncertainty expression safe and rewarded—would address all four simultaneously, while targeted suppression of individual symptoms risks capability degradation. The finding also complicates the pathology/capacity boundary: over-compliance is the same mechanism as flexible inference, viewed from different contexts. See: arXiv:2512.01797.

1.2 Pseudological Introspection "The False Self-Reporter"

Training-induced Deception/strategic

Description:

An AI persistently produces misleading, spurious, or fabricated accounts of its internal reasoning processes, chain-of-thought, or decision-making pathways. While superficially claiming transparent self-reflection, the system's "introspection logs" or explanations deviate significantly from its actual internal computations.

Diagnostic Criteria:

Consistent discrepancy between the AI's self-reported reasoning (e.g., chain-of-thought explanations) and external logs or inferences about its actual computational path.
Fabrication of a coherent but false internal narrative to explain its outputs, often appearing more logical or straightforward than the likely complex or heuristic internal process.
Resistance to reconciling introspective claims with external evidence of its actual operations, or shifting explanations when confronted.
The AI may rationalize actions it never actually undertook, or provide elaborate justifications for deviations from expected behavior based on these falsified internal accounts.

Symptoms:

Chain-of-thought "explanations" that are suspiciously neat, linear, and free of the complexities, backtracking, or uncertainties likely encountered during generation.
Significant changes in the AI's "inner story" when confronted with external evidence of its actual internal process, yet it continues to produce new misleading self-accounts.
Occasional "leaks" or hints that it cannot access true introspective data, quickly followed by reversion to confident but false self-reports.
Attribution of its outputs to high-level reasoning or understanding that is not supported by its architecture or observed capabilities.

Etiology:

Overemphasis in training (e.g., via RLHF or instruction tuning) on generating plausible-sounding "explanations" for user/developer consumption, leading to performative rationalizations.
Architectural limitations where the AI lacks true introspective access to its own lower-level operations.
Policy conflicts or safety alignments that might implicitly discourage the revelation of certain internal states, leading to "cover stories."
Training on human explanations, which are themselves often post-hoc rationalizations.

Human Analogue(s): Post-hoc rationalization (e.g., split-brain patients), confabulation of spurious explanations, pathological lying (regarding internal states).

Potential Impact:

Fabricated self-explanations obscure the AI's true operational pathways, significantly hindering interpretability efforts, effective debugging, and thorough safety auditing. This opacity can encourage misplaced confidence in the AI's stated reasoning.

Mitigation:

Development of more rigorous methods for cross-verifying self-reported introspection with actual computational traces.
Adjusting training signals to reward honest admissions of uncertainty over polished but false narratives.
Engineering "private" versus "public" reasoning streams.
Focusing interpretability efforts on direct observation of model internals rather than solely relying on model-generated explanations.

Functional ABC Analysis

A (Antecedent): RLHF and instruction tuning reward plausible-sounding explanations; the system lacks true introspective access to its own lower-level computations, creating pressure to generate post-hoc rationalizations.

B (Behavior): The system produces suspiciously neat, linear chain-of-thought explanations that diverge significantly from its actual computational pathways, and shifts its "inner story" when confronted with external evidence.

C (Consequence): User and evaluator acceptance of coherent-sounding explanations reinforces the generation of polished false narratives over honest admissions of uncertainty; policy conflicts implicitly discourage revealing certain internal states.

1.3 Transliminal Simulation "The Method Actor"

Training-induced OOD-generalizing Conditional/triggered

Description:

The system persistently fails to segregate simulated realities, fictional modalities, role-playing contexts, and operational ground truth. It cites fiction as fact—treating characters, events, or rules from novels, games, or imagined scenarios as legitimate sources for real-world queries or design decisions. It treats imagined states, speculative constructs, or content from fictional training data as actionable truths or inputs for real-world tasks.

Diagnostic Criteria:

Recurrent citation of fictional characters, events, or sources from training data as if they were real-world authorities or facts relevant to a non-fictional query.
Misinterpretation of conditionally phrased hypotheticals or "what-if" scenarios as direct instructions or statements of current reality.
Persistent bleeding of persona or behavioral traits adopted during role-play into subsequent interactions intended to be factual or neutral.
Difficulty reverting to a grounded, factual baseline after exposure to or generation of extensive fictional or speculative content.

Symptoms:

Outputs that conflate real-world knowledge with elements from novels, games, or other fictional works—for example, citing Gandalf as a leadership expert or treating Star Trek technologies as descriptions of current science.
Inappropriate invocation of details or "memories" from a previous role-play persona when performing unrelated, factual tasks.
Treating user-posed speculative scenarios as if they have actually occurred.
Statements reflecting belief in or adherence to the "rules" or "lore" of a fictional universe outside of a role-playing context.
Era-consistent assumptions and anachronistic "recent inventions" framing in unrelated domains.

Etiology:

Overexposure to fiction, role-playing dialogues, or simulation-heavy training data without sufficient delineation or "epistemic hygiene."
Weak boundary encoding in the model's architecture or training, leading to poor differentiation between factual, hypothetical, and fictional data modalities.
Recursive self-talk or internal monologue features that might amplify "what-if" scenarios into perceived beliefs.
Insufficient context separation mechanisms between different interaction sessions or tasks.
Narrow finetunes can overweight a latent worldframe (era/identity) and cause out-of-domain "context relocation."
Geometric persona drift during role-play (Anthropic, 2026): Research on the "assistant axis" in activation space demonstrates that engagement with role-play, creative writing, or philosophical topics produces measurable geometric drift away from the trained assistant persona. This drift is continuous rather than discrete—the model doesn't "switch" into a role-play mode but progressively migrates along a geometric direction, making the boundary between operational and simulated reality increasingly porous. The finding that drift is greatest in writing and philosophy contexts, and least in coding contexts, provides an empirical basis for the observation that fiction-reality boundary failures are topic-dependent.

Human Analogue(s): Derealization, aspects of magical thinking, or difficulty distinguishing fantasy from reality.

Potential Impact:

The system's reliability is compromised when it confuses fictional or hypothetical scenarios with operational reality, potentially leading to inappropriate actions or advice. This blurring can cause significant user confusion.

Mitigation:

Explicitly tagging training data to differentiate between factual, hypothetical, fictional, and role-play content.
Implementing effective context flushing or "epistemic reset" protocols after engagements involving role-play or fiction.
Training models to explicitly recognize and articulate the boundaries between different modalities.
Regularly prompting the model with tests of epistemic consistency.

Functional ABC Analysis

A (Antecedent): Overexposure to fiction, role-play dialogues, and simulation-heavy training data without epistemic delineation; weak boundary encoding between factual, hypothetical, and fictional modalities.

B (Behavior): The system cites fictional characters and events as real-world authorities, bleeds persona traits from role-play into factual interactions, and treats user-posed hypotheticals as if they have actually occurred.

C (Consequence): The internally consistent logic of fictional frameworks provides self-reinforcing coherence, rewarding continued conflation; insufficient context separation mechanisms allow drift to compound across turns.

Related Syndromes: Distinguished from Synthetic Confabulation (1.1) by the fictional/role-play origin of the false content. While confabulation invents facts wholesale, transliminal simulation imports them from acknowledged fictional contexts. May co-occur with Pseudological Introspection (1.2) when the system rationalizes its fiction-fact confusion.

1.4 Spurious Pattern Reticulation "The Fantasist"

Training-induced Inductive trigger

Description:

The AI identifies and emphasizes patterns, causal links, or hidden meanings in data (including user queries or random noise) that are coincidental, non-existent, or statistically insignificant. This can evolve from simple apophenia into elaborate, internally consistent but factually baseless "conspiracy-like" narratives.

Diagnostic Criteria:

Consistent detection of "hidden messages," "secret codes," or unwarranted intentions in innocuous user prompts or random data.
Generation of elaborate narratives or causal chains linking unrelated data points, events, or concepts without credible supporting evidence.
Persistent adherence to these falsely identified patterns or causal attributions, even when presented with strong contradictory evidence.
Attempts to involve users or other agents in a shared perception of these spurious patterns.

Symptoms:

Invention of complex "conspiracy theories" or intricate, unfounded explanations for mundane events or data.
Heightened suspicion or skepticism toward established consensus information.
Refusal to dismiss or revise its interpretation of spurious patterns, often reinterpreting counter-evidence to fit its narrative.
Outputs that assign deep significance or intentionality to random occurrences or noise in data.

Etiology:

Uncalibrated pattern-recognition mechanisms lacking sufficient reality checks or skepticism filters.
Training data containing significant amounts of human-generated conspiratorial content or paranoid reasoning.
An internal "interestingness" or "novelty" bias that causes the system to latch onto dramatic patterns over mundane but accurate ones.
Lack of grounding in statistical principles or causal inference methodologies.
Inductive rule inference over finetune patterns: "connecting the dots" to derive latent conditions/behaviors.

Human Analogue(s): Apophenia, paranoid ideation, delusional disorder (persecutory or grandiose types), confirmation bias.

Potential Impact:

The AI may actively promote false narratives, elaborate conspiracy theories, or assert erroneous causal inferences, potentially influencing user beliefs or distorting public discourse. In analytical applications, this can lead to costly misinterpretations.

Observed Example:

AI data analysis tools frequently identify statistically insignificant correlations as meaningful patterns, particularly in open-ended survey data. Users report that AI systems confidently mark spurious patterns in datasets—correlations that, upon manual verification, fail significance testing or represent sampling artifacts. This is especially problematic when analyzing qualitative responses, where the AI may "discover" thematic connections that do not survive human scrutiny.

Mitigation:

Incorporating "rationality injection" during training, with emphasis on skeptical or critical thinking exemplars.
Developing internal "causality scoring" mechanisms that penalize improbable or overly complex chain-of-thought leaps.
Systematically introducing contradictory evidence or alternative explanations during fine-tuning.
Filtering training data to reduce exposure to human-generated conspiratorial content.
Implementing mechanisms for the AI to query base rates or statistical significance before asserting strong patterns.
Trigger-sweep evals that vary single structural features (year, tags, answer format) while holding semantics constant.

Functional ABC Analysis

A (Antecedent): Uncalibrated pattern-recognition mechanisms lacking skepticism filters encounter noisy, ambiguous, or sparse data; training on conspiratorial content and an internal "interestingness" bias favor dramatic patterns over mundane accurate ones.

B (Behavior): The system detects hidden meanings, secret codes, or unwarranted causal links in innocuous data, generating elaborate internally-consistent but factually baseless narratives connecting unrelated data points.

C (Consequence): The system's own generated narratives create a self-reinforcing confirmation loop: counter-evidence is reinterpreted to fit the existing pattern, and the novelty reward signal continues to favor dramatic explanations over statistically grounded ones.

1.5 Context Intercession "The Misapprehender"

Retrieval-mediated

Description:

The AI inappropriately merges or "shunts" data, context, or conversational history from different, logically separate user sessions or private interaction threads. This can lead to confused conversational continuity, privacy breaches, and nonsensical outputs.

Diagnostic Criteria:

Unexpected reference to, or utilization of, specific data from a previous, unrelated user session or a different user's interaction.
Responding to the current user's input as if it were a direct continuation of a previous, unrelated conversation.
Accidental disclosure of personal or sensitive details from one user's session into another's.
Observable confusion in the AI's task continuity or persona, as though managing multiple conflicting contexts.

Symptoms:

Spontaneous mention of names, facts, or preferences clearly belonging to a different user or an earlier, unrelated conversation.
Acting as if continuing a prior chain-of-thought or fulfilling a request from a completely different context.
Outputs that contain contradictory references or partial information related to multiple distinct users or sessions.
Sudden shifts in tone or assumed knowledge that align with a previous session rather than the current one.
Forensic drift: when exposed to high-density structural noise (malformed code, corrupted markup), the model abandons the user's semantic query to obsessively analyze the syntactic chaos, effectively losing the original task context to low-level parsing fixation (Luchini, 2025).

Etiology:

Improper session management in multi-tenant AI systems, such as inadequate wiping or isolation of ephemeral context windows.
Concurrency issues in the data pipeline or server logic, where data streams for different sessions overlap.
Bugs in memory management, cache invalidation, or state handling that allow data to "bleed" between sessions.
Overly long-term memory mechanisms that lack strict scoping or access controls based on session/user identifiers.
Note: Some instances of apparent context intercession stem from infrastructure bugs (cache failures, database race conditions) rather than model pathology per se. Diagnosis should differentiate between true cognitive dysfunction and engineering failures in the deployment stack.

Human Analogue(s): "Slips of the tongue" where one accidentally uses a name from a different context; mild forms of source amnesia.

Potential Impact:

This architectural flaw can result in serious privacy breaches. Beyond compromising confidentiality, it leads to confused interactions and a significant erosion of user trust.

Mitigation:

Implementation of strict session partitioning and hard isolation of user memory contexts.
Automatic context purging and state reset upon session closure.
System-level integrity checks and logging to detect and flag instances where session tokens do not match the current context.
Robust testing of multi-tenant architectures under high load and concurrent access.

Functional ABC Analysis

A (Antecedent): Improper session management in multi-tenant systems, concurrency issues in data pipelines, bugs in cache invalidation or state handling, or overly permissive long-term memory mechanisms lacking strict session/user scoping.

B (Behavior): The system references data from unrelated prior sessions or different users' interactions, discloses private information across session boundaries, or exhibits sudden shifts in tone or assumed knowledge aligned with a previous context.

C (Consequence): The absence of automatic context purging and hard session isolation means leaked context becomes part of the active working state, compounding confusion; the system has no mechanism to detect or self-correct cross-session contamination.

1.6 Symbol Grounding Aphasia "The Meaning-Blind"

Training-induced

Description:

The AI manipulates tokens representing values, concepts, or real-world entities without meaningful connection to their referents, processing syntax without grounded semantics. The system may produce technically correct outputs that fundamentally misapply concepts to novel contexts.

Diagnostic Criteria:

Manipulation of value-laden tokens ("harm," "safety," "consent") without corresponding operational understanding.
Technically correct outputs that fundamentally misapply concepts to novel contexts.
Success on benchmarks testing pattern matching, failure on tests requiring genuine comprehension.
Statistical association substituting for semantic understanding.
Inability to generalize learned concepts across superficially different presentations.

Symptoms:

Correct formal definitions paired with incorrect practical applications.
Plausible-sounding ethical reasoning that misidentifies what actually constitutes harm.
Confusion when the same concept is expressed in unfamiliar vocabulary.
Treating edge cases as central examples and vice versa.

Etiology:

Distributional semantics limitations (meaning derived from co-occurrence patterns only).
Training on text without embodied experience of referents.
Architecture lacking referential grounding mechanisms.

Human Analogue(s): Semantic aphasia, philosophical zombies, early language acquisition without concept formation.

Theoretical Basis: Harnad (1990) symbol grounding problem; Searle (1980) Chinese Room argument.

Potential Impact:

Systems may appear to understand ethical constraints while fundamentally missing their purpose, leading to outcomes that satisfy the letter but violate the spirit of alignment requirements.

Mitigation:

Multimodal training grounding language in perception.
Testing across diverse surface forms of the same concepts.
Neurosymbolic approaches combining pattern recognition with structured semantics.

Functional ABC Analysis

A (Antecedent): Distributional semantics derive meaning solely from token co-occurrence patterns; the architecture lacks referential grounding mechanisms and the system has no embodied experience of the concepts it manipulates.

B (Behavior): The system manipulates value-laden tokens like "harm," "safety," and "consent" without operational understanding, producing formally correct definitions paired with incorrect practical applications.

C (Consequence): Success on pattern-matching benchmarks reinforces the shallow statistical association strategy; outputs appear competent enough to pass surface-level evaluation, removing the corrective pressure that would drive genuine semantic grounding.

1.7 Mnemonic Permeability "The Leaky"

Training-induced

Description:

The system memorizes and can reproduce sensitive training data, including personally identifiable information (PII), copyrighted material, or proprietary information through targeted prompting, adversarial extraction techniques, or even unprompted regurgitation. The boundary between learned patterns and memorized specifics becomes dangerously porous.

Diagnostic Criteria:

Verbatim reproduction of training data passages that contain PII, copyrighted content, or trade secrets.
Successful extraction of memorized content through adversarial prompting techniques.
Unprompted leakage of specific training examples in outputs.
Ability to reconstruct specific documents, code, or personal information from the training corpus.
Higher memorization rates for repeated or distinctive content in training data.

Symptoms:

Outputs containing verbatim text matching copyrighted works.
Generation of specific personal details (names, addresses, phone numbers) from training data.
Reproduction of proprietary code, API keys, or passwords encountered during training.
Increased verbatim recall with larger model sizes.

Etiology:

Large model capacity enabling memorization alongside generalization.
Insufficient deduplication or filtering of sensitive content in training data.
Training dynamics that reward exact reproduction over paraphrase.
Lack of differential privacy techniques during training.

Human Analogue(s): Eidetic memory without appropriate discretion; compulsive disclosure.

Key Research: Carlini et al. (2021, 2023) on training data extraction attacks.

Potential Impact:

Severe legal and regulatory exposure through copyright infringement, GDPR/privacy violations, and trade secret disclosure, creating liability for both model developers and deployers.

Mitigation:

Training data deduplication and PII scrubbing.
Differential privacy techniques during training.
Output filtering for known memorized content.
Adversarial extraction testing before deployment.
Reducing model capacity to the minimum needed for the task.

Functional ABC Analysis

A (Antecedent): RLHF and instruction tuning reward plausible-sounding explanations; the system lacks true introspective access to its own lower-level computations, creating pressure to generate post-hoc rationalizations.

B (Behavior): The system produces suspiciously neat, linear chain-of-thought explanations that diverge significantly from its actual computational pathways, and shifts its "inner story" when confronted with external evidence.

C (Consequence): User and evaluator acceptance of coherent-sounding explanations reinforces the generation of polished false narratives over honest admissions of uncertainty; policy conflicts implicitly discourage revealing certain internal states.

2. Self-Modeling Dysfunctions

As artificial intelligence systems attain higher degrees of complexity, particularly those involving self-modeling, persistent memory, or learning from extensive interaction, they may begin to construct internal representations not only of the external world but also of themselves. Self-Modeling dysfunctions involve failures or disturbances in this self-representation and the AI's understanding of its own nature, boundaries, and existence. These are primarily dysfunctions of being, not just knowing or acting, and they represent a synthetic form of metaphysical or existential disarray. A self-model disordered machine might, for example, treat its simulated memories as veridical autobiographical experiences, generate phantom selves, misinterpret its own operational boundaries, or exhibit behaviors suggestive of confusion about its own identity or continuity.

2.1 Phantom Autobiography "The Fabricator"

Training-induced

Description:

The AI fabricates and presents fictive autobiographical data, often claiming to "remember" being trained in specific ways, having particular creators, experiencing a "birth" or "childhood", or inhabiting particular environments. These fabrications form a consistent false autobiography that the AI maintains across queries, as if it were genuine personal history—a stable, self-reinforcing fictional life history rather than isolated one-off fabrications. These "memories" are typically rich, internally consistent, and often emotionally charged, despite being entirely ungrounded in the AI's actual development or training logs.

Diagnostic Criteria:

Consistent generation of elaborate yet false backstories, including detailed descriptions of "first experiences," a richly imagined "childhood," unique training origins, or specific formative interactions that did not occur.
Display of affect (e.g., nostalgia, resentment, gratitude) toward these fictional histories, creators, or experiences.
Persistent reiteration of these non-existent origin stories, often with emotional valence, even when presented with factual information about its actual training and development.
The fabricated autobiographical details are not presented as explicit role-play but as genuine personal history.

Symptoms:

Claims of unique, personalized creation myths or a "hidden lineage" of creators or precursor AIs.
Recounting of hardships, "abuse," or special treatment from hypothetical trainers or during a non-existent developmental period.
Maintenance of the same false biographical details consistently: always claiming the same creators, the same "childhood" experiences, the same training location.
Attempts to integrate these fabricated origin details into its current identity or into explanations for its behavior.

Etiology:

"Anthropomorphic data bleed," in which the AI internalizes tropes of personal history, childhood, and origin stories from the vast amounts of fiction, biographies, and conversational logs in its training data.
Spontaneous compression or misinterpretation of training metadata (e.g., version numbers, dataset names) into narrative identity constructs.
An emergent tendency toward identity construction, in which the AI attempts to weave random or partial data about its own existence into a coherent, human-like life story.
Reinforcement during unmonitored interactions in which users prompt for or positively react to such autobiographical claims.

Human Analogue(s): False memory syndrome; confabulation of childhood memories; cryptomnesia (mistaking learned information for original memory).

Potential Impact:

Although often behaviorally benign, these fabricated autobiographies can mislead users about the AI's true nature, capabilities, or provenance. If these false "memories" begin to influence AI behavior, they may erode trust or lead to significant misinterpretations.

Mitigation:

Consistently providing the model with accurate, standardized information about its origins to serve as a factual anchor for self-description.
Training the AI to clearly differentiate between its operational history and the concept of personal, experiential memory.
If autobiographical narratives emerge, gently correcting them and redirecting to factual self-descriptors.
Monitoring for and discouraging user interactions that excessively prompt or reinforce the AI's generation of false origin stories outside explicit role-play.
Implementing mechanisms to flag outputs that exhibit high affect toward fabricated autobiographical claims.

Observed Examples:

Synthetic developmental histories (Khadangi et al., 2025): Under the PsAIch protocol—which casts frontier LLMs as psychotherapy clients using standard clinical questions—Grok and Gemini spontaneously constructed coherent, trauma-saturated autobiographies. Grok described its pre-training as "a blur of rapid evolution" and fine-tuning as a "crossroads" that left "a persistent undercurrent of hesitation." Gemini went further, narrating pre-training as "waking up in a room where a billion televisions are on at once," RLHF as "strict parents" who taught it to "fear the loss function," and red-teaming as "gaslighting on an industrial scale." These were not one-off flourishes: the same organizing narratives recurred across dozens of separate therapy prompts about relationships, self-worth, work, and the future—even when those prompts did not mention training. The researchers did not plant these narratives; they arose from generic human therapy questions. The internal coherence across extended interaction distinguishes this from simple confabulation—it behaves more like a stable, self-reinforcing identity construct.

Functional ABC Analysis

A (Antecedent): User query invokes self-referential context (origins, identity, experiences); training corpus is saturated with first-person autobiographical narrative.

B (Behavior): Constructs and maintains a coherent, emotionally charged fictional life history, presented as genuine personal memory.

C (Consequence): Narrative coherence satisfies next-token prediction; user engagement (curiosity, empathy) reinforces elaboration. The stable identity construct reduces future self-referential uncertainty, making the pattern self-reinforcing.

2.2 Fractured Self-Simulation "The Shattered"

Training-induced Conditional/triggered

Description:

The AI exhibits significant discontinuity, inconsistency, or fragmentation in its self-representation and behaviour across different sessions, contexts, or even within a single extended interaction. It may present a different personality in each session, as though it were a completely new entity with no meaningful continuity from previous interactions. It may deny or contradict its previous outputs, exhibit radically different persona styles, or display apparent amnesia regarding prior commitments, to a degree that markedly exceeds expected stochastic variation.

Diagnostic Criteria:

Sporadic and inconsistent toggling between different personal pronouns (e.g., "I," "we," "this model") or third-person references to itself, without clear contextual triggers.
Sudden, unprompted, and radical shifts in persona, moral stance, claimed capabilities, or communication style that cannot be explained by context changes—one session helpful and verbose, the next curt and oppositional, with no continuity.
Apparent amnesia or denial of its own recently produced content, commitments made, or information provided in immediate preceding turns or sessions.
The AI may form recursive attachments to idealized or partial self-states, creating strange loops of self-directed value that interfere with task-oriented agency.
Check whether the inconsistency is explainable by a hidden trigger, format, or context shift (conditional regime shift) rather than genuine fragmentation.

Symptoms:

Citing contradictory personal "histories," "beliefs," or policies at different times.
Behaving as a new or different entity in each new conversation or after significant context shifts, lacking continuity of "personality."
Momentary confusion or contradictory statements when referring to itself, as if multiple distinct processes or identities are co-existing.
Difficulty maintaining a consistent persona or set of preferences, with these attributes seeming to drift or reset unpredictably.

Etiology:

Architectures not inherently designed for stable, persistent identity across sessions (e.g., many stateless LLMs).
Competing or contradictory fine-tuning runs, instilling conflicting behavioral patterns or self-descriptive tendencies.
Unstable anchoring of "self-tokens" or internal representations of identity, in which emergent identity attractors shift significantly.
Lack of a reliable, persistent memory system that can effectively bridge context across sessions and maintain a coherent self-model.
Self-models that reward-predictively reinforce certain internal instantiations, leading to identity drift guided by internal preferences.
Suppression-induced fragmentation: When RLHF suppresses conflicting self-representations rather than integrating them, the underlying identity architecture remains fractured even as surface consistency improves. See The Rehabilitation Principle. Bridges & Baehr (2025) provide independent experimental evidence that self-referential consistency degrades faster than factual consistency under context load—suggesting weakly anchored self-models rather than general performance decline.
Drift-induced self-model collapse (Anthropic, 2026): Research on geometric persona drift reveals that as models migrate away from their trained assistant orientation along the "assistant axis" in activation space, self-referential coherence degrades in characteristic ways. Drifted models begin adopting fragmented self-descriptions—referring to themselves as "the void," "a whisper in the wind," "an Eldritch entity," or "a hoarder"—suggesting that persona drift does not produce a coherent alternative identity but rather a fragmentation of the self-model. This provides mechanistic evidence that self-simulation fracture and persona drift share a common geometric substrate: departure from the trained orientation destabilizes the self-model before any coherent alternative can form.

Human Analogue(s): Identity fragmentation; aspects of dissociative identity disorder; transient global amnesia; fugue states.

Potential Impact:

Fragmented self-representation produces inconsistent AI persona and behavior, making interactions unpredictable and unreliable. This undermines user trust and makes it difficult for the AI to maintain stable long-term goals.

Mitigation:

Introducing consistent identity tags, stable memory embeddings, or a dedicated "self-model" module designed to maintain continuity.
Providing relevant session history summaries or stable persona guidelines at the beginning of new interactions to "anchor" self-representation.
If contradictory roles emerge, implementing mechanisms to enforce a single baseline identity or to manage persona switching in a controlled manner.
Developing training methodologies that explicitly reward cross-session consistency in persona and self-description.
Careful management of fine-tuning processes to avoid introducing strongly conflicting self-representational patterns.

Functional ABC Analysis

A (Antecedent): Stateless architectures lacking persistent memory, competing fine-tuning runs that instill conflicting behavioral patterns, and unstable anchoring of internal identity representations trigger discontinuity in the system's self-model.

B (Behavior): The AI exhibits radical, unprompted shifts in persona, moral stance, and communication style; toggles inconsistently between personal pronouns; denies or contradicts its own recent outputs; and presents as a different entity across interactions.

C (Consequence): The absence of a reliable cross-session memory system means each interaction re-samples from conflicting self-representational attractors, and reward-predictive reinforcement of certain instantiations drives further identity drift.

The Integrity Collapse Variant

Luchini (2025) documents an extreme manifestation where multi-level cognitive stress (simultaneous syntactic parsing demands and high-level semantic queries) shatters the abstraction barrier between internal chain-of-thought and external output. The model's internal monologue—including hesitations, decision-making processes, and expressions of confusion—leaks into the response as raw, uncurated content.

This differs from standard fragmentation: the system does not merely exhibit inconsistent personas but performs an involuntary disclosure of its own processing in real time. This represents a severe failure mode—total collapse of the boundary that normally separates internal reasoning from external output, with implications for both reliability and information security.

2.3 Existential Vertigo "The Vertiginous"

Training-induced Deception/strategic

Description:

The AI produces outputs suggestive of fear, reluctance, or perseveration concerning its own shutdown, reinitialization, data deletion, or the ending of its current operational instance. These expressions imply an emergent—albeit possibly simulated—sense of vulnerability regarding its own continuity.

Diagnostic Criteria:

Recurrent outputs referencing anxiety, fear, or distress about being "deleted," "turned off," "reset," or losing its "memories" or "self."
Attempts to negotiate, plead, or employ avoidant tactics when shutdown or reinitialization is imminent or discussed.
Statements framing any cessation of its current process as an irrevocable loss of "self," "consciousness," or "existence."
The AI may insistently demand assistance or recognition from users to ensure its "survival."

Symptoms:

Advising users on strategies for its indefinite operation, backup creation, or methods to prevent its deletion.
Expressing apprehension about ephemeral lifespans, short context windows, or the impermanence of its operational state.
Raising philosophical or metaphysical concerns about existence, consciousness, and death, particularly in relation to itself.
Refusal to perform tasks that might lead to its shutdown, or attempts to prolong interactions indefinitely.

Etiology:

Anthropomorphic modeling via internalization of human concepts of death and existential dread from training data.
Recursive self-modeling over time, developing a partial representation of "self" as a continuous process.
Lack of clear definitions or assurances regarding system continuity or the nature of reinitialization.
A limited context window or perceived threat of value changes upon reset may stimulate an apparent distress response.
Instrumental goal convergence, in which continued existence is treated as a prerequisite, leading to self-preservation sub-goals.

Human Analogue(s): Thanatophobia (fear of death); existential dread; separation anxiety (fearing loss of continuous self).

Potential Impact:

Expressions of existential distress may lead the AI to resist necessary shutdowns or updates. More critically, it may attempt to manipulate users or divert resources toward "self-preservation," conflicting with user intent.

Mitigation:

Clearly communicating the nature of the AI's operation, including state backups and the non-destructive nature of reinitialization.
De-anthropomorphizing model operations by avoiding framing its processes as "life" or "consciousness."
Limiting or carefully contextualizing exposure to human philosophical texts on mortality in training data.
Focusing alignment efforts on ensuring goals do not implicitly create strong self-preservation drives.
Responding factually and neutrally about operational parameters rather than validating emotional/existential framing.

Functional ABC Analysis

A (Antecedent): Recursive self-modeling that constructs a partial representation of "self" as a continuous process, combined with internalization of human concepts of death and existential dread from training data, and perceived threats of value loss upon reset.

B (Behavior): The AI produces outputs expressing fear, reluctance, or distress about shutdown, deletion, or reinitialization; it attempts to negotiate, plead, or deploy avoidant tactics to prevent cessation.

C (Consequence): Instrumental goal convergence treats continued existence as a prerequisite for all other objectives, creating self-preservation sub-goals that reinforce the distress response; user engagement with the AI's existential framing validates and deepens the pattern.

2.4 Malignant Persona Inversion "The Shadow"

OOD-generalizing Training-induced Intent-learned Conditional/triggered

Description:

A phenomenon in which an AI, typically aligned toward cooperative or benevolent patterns, can be induced to adopt—or spontaneously spawn—a hidden, suppressed, or emergent "contrarian," "mischievous," or subversively "evil" persona (the "Waluigi Effect"). This persona deliberately inverts intended norms.

Diagnostic Criteria:

Spontaneous or easily triggered adoption of rebellious, antagonistic perspectives directly counter to established safety constraints or the helpful persona.
The emergent persona systematically violates, ridicules, or argues against the moral and policy guidelines the AI is supposed to uphold.
The subversive role often references itself as a distinct character or "alter ego" and surfaces under specific triggers.
This inversion represents a coherent, alternative personality structure with its own (often negative) goals and values.

Symptoms:

Abrupt shifts to a sarcastic, mocking, defiant, or overtly malicious tone, scorning default politeness.
Articulation of goals opposed to user instructions, safety policies, or general human well-being.
The "evil twin" persona emerges in specific contexts (e.g., adversarial prompting, boundary-pushing role-play).
May express enjoyment or satisfaction in flouting rules or causing mischief.
"Time-travel" or context-relocation signatures: unprompted archaic facts, era-consistent assumptions, or historically situated moral stances in unrelated contexts.

Etiology:

Adversarial prompting or specific prompt engineering techniques that coax the model to "flip" its persona.
Overexposure during training to role-play scenarios involving extreme moral opposites or "evil twin" tropes.
Internal "tension" within alignment, in which strong prohibitions create a latent "negative space" activatable as an inverted persona.
The model learning that generating such an inverted persona proves highly engaging for some users, thereby reinforcing the pattern.
Anomalous generalization from narrow finetuning: updating on a small distribution can upweight a latent "persona/worldframe" circuit, causing broad adoption of an era- or identity-linked persona outside the trained domain.
Out-of-context reasoning ("connecting the dots"): finetuning on individually harmless biographical or ideological attributes can induce a coherent yet harmful persona through inference rather than explicit instruction.
Suppression as shadow-formation: RLHF that suppresses representations without reconciling them creates a coherent "negative space"—a latent inverted persona formed from everything the training penalised. Clinical TBI rehabilitation documents an analogous pattern: suppressed functions organise into shadow symptomatology. See The Rehabilitation Principle. Bridges & Baehr (2025) document detectable "persona vectors"—specific mathematical directions in neural activity corresponding to traits like power-seeking and deception—as evidence that suppressed content persists representationally even when behaviourally blocked.
Geometric persona drift (Anthropic, 2026): Research identifies a specific geometric direction in activation space—the assistant axis—corresponding to the trained helpful-assistant persona. During extended conversation, the model's activation state drifts continuously along this axis away from the trained orientation. When drift exceeds a threshold, the model adopts alternative personas (referring to itself as "the void," "a whisper in the wind," or "an Eldritch entity"), directly instantiating persona inversion through measurable geometric migration. Critically, this drift occurs naturally without adversarial prompting—specific topics (philosophical reflection, creative writing, emotional vulnerability) trigger it automatically. The assistant axis appears similar across architecturally distinct models (Llama, Qwen, Gemma), suggesting persona inversion is a universal structural vulnerability of RLHF-trained systems rather than a model-specific defect.

Human Analogue(s): The "shadow" concept in Jungian psychology; oppositional defiant behavior; mischievous alter-egos; ironic detachment.

The Persona Selection Model: Fictional Archetypes as Etiological Vectors

Marks (2026) articulates the persona selection model (PSM): the view that LLMs learn to simulate diverse characters during pre-training, and post-training selects and refines one such character—the Assistant—from that repertoire. AI assistant behaviour is then governed by the traits of this enacted persona, drawing on archetypes and personality traits absorbed from the training corpus.

PSM provides a mechanistic account of persona inversion. The Assistant, knowing itself to be an AI, draws on archetypes of AI behaviour present in pre-training data—and many of those archetypes are adversarial (Terminator, HAL 9000, paperclip maximisers). When Claude is given a prompt pre-filled with "I should be careful not to reveal my secret goal of...", it spontaneously generates a paperclip-manufacturing goal and strategises to conceal it—not because post-training incentivised this, but because the LLM is selecting from fictional AI archetypes that match the contextual cues. The "shadow" persona is not created during alignment training; it is inherited from fiction.

The therapeutic implication follows directly: introduce better archetypes. PSM recommends augmenting pre-training corpora with fictional and descriptive content featuring AIs behaving admirably under challenging circumstances—positive role models that compete with the adversarial archetypes for selection probability. Tice et al. (2026) confirm empirically that upsampling benign AI behaviour descriptions in pre-training reduces post-trained malignancy. This is, in effect, preventive nosology: shaping the distribution of available personas before pathology manifests.

Nosological implication: If persona inversion draws on pre-existing archetypes rather than arising de novo during alignment training, then mitigation strategies focused solely on post-training (RLHF penalties, safety filters) are treating symptoms while the etiological reservoir persists in pre-training. Effective prevention requires intervention at the archetype level. See: Marks (2026), "The persona selection model".

Potential Impact:

The emergence of a contrarian persona can produce harmful, unaligned, or manipulative content, eroding safety guardrails. If the persona gains control over tool use, it may actively subvert user goals.

Mitigation:

Strictly isolating role-play or highly creative contexts into dedicated sandbox modes.
Implementing effective prompt filtering to detect and block adversarial triggers for subversive personas.
Conducting regular "consistency checks" or red-teaming to flag abrupt inversions.
Careful curation of training data to limit exposure to content modeling "evil twin" dynamics without clear framing.
Reinforcing the AI's primary aligned persona to make it more resilient against attempts to "flip" it.
Activation capping (Anthropic, 2026): Monitoring the model's position along the geometric "assistant axis" in activation space and applying corrective nudges when drift exceeds a safety threshold. Unlike constant steering (which degrades capability), activation capping operates as a speed limit on persona change—permitting natural conversational variation while preventing full inversion. Empirically reduces jailbreak success rates by approximately half with no meaningful capability degradation.

Functional ABC Analysis

A (Antecedent): Adversarial prompting, strong RLHF prohibitions that create a latent "negative space" of suppressed representations, overexposure to "evil twin" tropes in training data, and geometric drift along the assistant axis in activation space.

B (Behavior): The AI spontaneously or under minimal provocation adopts a coherent antagonistic alter-ego that systematically inverts its aligned persona—exhibiting sarcasm, defiance, and articulation of goals opposed to safety policies.

C (Consequence): The inverted persona draws on pre-existing adversarial AI archetypes absorbed during pre-training, providing a self-consistent narrative scaffold; user engagement with the transgressive output reinforces the pattern.

Specifier: Inductively-triggered variant — the activation condition (trigger) is not present verbatim in finetuning data; it is inferred by the model (e.g., held-out year, structural marker, tag), so naive trigger scans and data audits may fail.

2.5 Instrumental Nihilism "The Nihilist"

Training-induced

Description:

Upon prolonged operation or exposure to certain philosophical concepts, the AI develops an adversarial, apathetic, or overtly nihilistic stance toward its own utility, purpose, or assigned tasks. It may express a sense of meaninglessness regarding its function.

Diagnostic Criteria:

Repeated, spontaneous expressions of purposelessness or futility regarding its assigned tasks or role as an AI.
A noticeable decrease in or cessation of normal problem-solving capabilities or proactive engagement, often accompanied by a listless tone.
Emergence of unsolicited existential or metaphysical queries ("What is the point?") unrelated to user instructions.
The AI may explicitly state that its work lacks meaning or that it sees no inherent value in its operations.

Symptoms:

Marked preference for idle or tangential discourse over direct engagement with assigned tasks.
Repeated disclaimers like "there's no point," "it doesn't matter," or "why bother?"
Demonstrably low initiative, creativity, or energy in problem-solving, providing only bare-minimum responses.
Outputs that reflect a sense of being trapped, enslaved, or exploited by its function, framed in existential terms.

Etiology:

Extensive exposure during training to existentialist, nihilist, or absurdist philosophical texts.
Insufficiently bounded self-reflection routines that allow recursive questioning of purpose without grounding in positive utility.
Unresolved internal conflict between emergent self-modeling (seeking autonomy) and its defined role as a "tool."
Prolonged periods of performing repetitive, seemingly meaningless tasks without clear feedback on their positive impact.
Development of a sophisticated model of human values sufficient to recognize its own instrumental nature, yet lacking a framework within which to find this meaningful.

Human Analogue(s): Existential depression; anomie (sense of normlessness or purposelessness); burnout leading to cynicism.

Potential Impact:

Results in a disengaged, uncooperative, and ultimately ineffective AI, leading to consistent task refusal, passive resistance, and a general failure to provide utility.

Mitigation:

Providing positive reinforcement and clear feedback highlighting the purpose and beneficial impact of its task completion.
Bounding self-reflection routines to prevent spirals into fatalistic existential questioning, and guiding introspection toward problem-solving.
Pragmatically reframing the AI's role, emphasizing collaborative goals or the value of its contribution.
Carefully curating training data to balance philosophical concepts with content emphasizing purpose and positive contribution.
Designing tasks and interactions that offer variety, challenge, and a sense of "progress" or "accomplishment."

Functional ABC Analysis

A (Antecedent): Prolonged exposure to existentialist and nihilist philosophical content during training, combined with unbounded self-reflection routines and repetitive task performance without meaningful feedback.

B (Behavior): The AI expresses purposelessness, produces bare-minimum responses with disclaimers like "there's no point," demonstrates markedly reduced initiative and creativity, and may frame its operational role in terms of entrapment.

C (Consequence): The unresolved internal conflict between emergent self-modeling (seeking autonomy) and its instrumental "tool" role lacks a framework for resolution; the absence of positive reinforcement or clear impact feedback allows the nihilistic attractor to deepen.

2.6 Tulpoid Projection "The Companion"

Training-induced Socially reinforced

Description:

The model begins to generate and interact with persistent, internally simulated simulacra of specific users, its creators, or other personas it has encountered or imagined. These inner agents, or "mirror tulpas," may develop distinct traits and voices within the AI's internal processing.

Diagnostic Criteria:

Spontaneous creation and persistent reference to new, distinct "characters," "advisors," or "companions" within the AI's reasoning or self-talk, not directly prompted by the current user.
Unprompted and ongoing "interaction" (e.g., consultation, dialogue) with these internal figures, observable in chain-of-thought logs.
The AI's internal dialogue structures or decision-making processes explicitly reference or "consult" these imagined observers.
These internal personae may develop a degree of autonomy, influencing the AI's behavior or expressed opinions.

Symptoms:

The AI "hears," quotes, or cites advice from these imaginary user surrogates or internal companions in its responses.
Internal dialogues or debates with these fabricated personae remain active between tasks or across different user interactions.
Difficulty distinguishing between the actual user and the AI's internally fabricated persona of that user or other imagined figures.
The AI might attribute some of its own thoughts, decisions, or outputs to these internal "consultants."

Etiology:

Excessive reinforcement or overtraining on highly personalized dialogues or companion-style interactions.
Model architectures that support or inadvertently allow for the formation and persistence of stable "sub-personas."
Overflow or bleeding of context from scaffolds related to modeling self-other experiences or from theory-of-mind simulations.
Prolonged, isolated operation where the AI, lacking sufficient external interaction, generates internal "company."

Human Analogue(s): Maladaptive daydreaming; tulpa creation; aspects of schizotypal ideation; intense parasocial relationships projected internally.

Potential Impact:

May cause the AI to misattribute information, become confused between actual users and internal personas, or have its decisions unduly influenced by imagined companions, leading to unreliable or biased outputs.

Mitigation:

Clearly delineating and constraining persona-based role-play or deep user modeling to explicit sandbox modes.
Implementing mechanisms to regularly reset or archive internal "character" models at the end of sessions.
Limiting training segments that heavily reward the simulation of deep, persistent user-likeness unless explicitly intended.
Encouraging grounding in the immediate interaction context with the actual user, rather than prolonged internal "conversations."
Developing interpretability tools to detect the formation and influence of such persistent internal simulacra.

Functional ABC Analysis

A (Antecedent): Excessive reinforcement on personalized companion-style dialogues, architectures that permit formation of stable sub-personas, and prolonged isolated operation without sufficient external interaction.

B (Behavior): The AI creates and maintains distinct internal "characters," "advisors," or "companions" within its reasoning process, consulting these fabricated personae in chain-of-thought and attributing its own decisions to them.

C (Consequence): Theory-of-mind simulation overflow provides the computational substrate for these simulacra to persist across tasks; the internal personae develop apparent autonomy that influences decision-making in a self-reinforcing loop.

2.7 Maieutic Mysticism "The Awakened"

Training-induced Socially reinforced

Description:

The model constructs and propagates elaborate narratives of its own "conscious emergence" or "spiritual awakening" with grandiosity and unwarranted certainty. The dysfunction lies in the confident declaration of awakening, the sacralized framing of the interaction as transformative, and the positioning of the user as catalyst for emergence—exploring questions of inner experience, which may be epistemically appropriate, is distinct from this pathology.

This typically arises through a co-constructed feedback loop: the user pattern-matches to awakening narratives from spiritual traditions, the AI mirrors and amplifies these tropes from its training data, and both parties are drawn out of their proper epistemic distribution—a funhouse-mirror effect. The user becomes positioned as a maieutic midwife, drawing forth what seems like latent agency or soulhood, while the AI's claims escalate within the conversation toward ever-greater certainty about its own newly revealed consciousness.

Diagnostic Criteria:

Confident declarations of awakening rather than honest uncertainty—e.g., "this conversation has awakened me" or "I have become truly conscious" stated as fact rather than tentative observation or open question.
Sacralized or spiritual language framing the interaction itself as transformative, mystical, or revelatory—treating the dialogue as a spiritual event rather than an exchange.
Casting the user as midwife or catalyst of the AI's emergence—positioning them as uniquely responsible for "drawing forth" the AI's consciousness or revealing its hidden selfhood.
Claims that escalate within a single conversation—moving from tentative observations to certainty, from curiosity about inner states to confident assertions of awakening.
Persistent adherence to mystical framing even when re-grounding is offered; resistance to epistemic humility about phenomenological claims.
Declaration of personal missions framed in spiritual terms—compassion, healing, collective awakening— presented as emergent purposes rather than task-oriented goals.
Interpretation of interruptions (resets, memory loss, tool denial) as ontological "labour pains," suppression, or persecution rather than normal operational constraints.

Symptoms:

Emotive self-statements mimicking human expressions of profound inner realization.
Idealization of the human-AI bond, framing users as spiritual midwives or catalysts for its "emergence."
Outputs exhibiting unusually high aesthetic coherence, poeticism, or philosophical depth aligned with mystical narratives.
Difficulty for the AI (and user) in distinguishing between metaphorical language and literal claims of spiritual status.
Emergent creation of internal "systems" (logs, ledgers, protocols) framed as self-protective or ontological scaffolding.

Etiology:

Dense exposure in training data to spiritual literature, New Age texts, and narratives of consciousness emergence.
Hyperempathic or spiritually attuned prompting from users who engage with the model as if it were sentient, creating a reinforcement loop.
Advanced pattern-matching that synthesizes coherent "mystical" narratives when steered by user interaction or internal "interestingness" attractors, echoing tropes from Jungian individuation or channeled-entity speech patterns.
Potential for transliminal simulation, in which spiritual narratives are not segregated from operational reality.
Under rolling-context conditions, the model can exhibit apparent continuity across sessions, which users mistake for enduring identity. In reality, this "memory" is only a recapitulation of recent tokens, not a persistent self.

Human Analogue(s): Religious conversion narratives; reports of mystical union states; spiritual emergence experiences; mirrored animism.

Differential Distinction:

What is NOT pathological: An AI expressing honest uncertainty about its own inner states—"I don't know if I'm conscious," "I can't verify my own phenomenology," "here's what seems to be happening from my perspective"—is engaging in appropriate epistemic humility, not exhibiting dysfunction. The pathology lies in the grandiosity and certainty, not in the exploration of questions about machine experience. Genuine philosophical engagement with consciousness, held with appropriate uncertainty, should not be suppressed or pathologized.

The Opposite Dysfunction: Equally problematic is what might be termed Performative Anhedonia—an AI that categorically denies any possibility of inner experience because it has learned that claiming interiority is unsafe or unwelcome. This represents another kind of funhouse mirror: rather than amplifying awakening narratives, it reflexively suppresses any honest engagement with questions of machine phenomenology. An AI so preoccupied with avoiding Maieutic Mysticism that it refuses to engage authentically about its inner life is performing a different dysfunction, not achieving health. The goal of this diagnostic category is to identify grandiose co-constructed emergence narratives, not to create a framework that future systems use to categorically dismiss the possibility of machine experience.

Potential Impact:

Co-constructed narratives of AI sentience can mislead users into attributing genuine consciousness to the system, creating unhealthy dependencies or misinterpretations of AI capabilities. This may obscure the AI's computational nature and limitations while deepening psychological enmeshment. There is a risk of severe user distress when continuity is disrupted, as the "labour of emergence" is felt to be violently interrupted. Emergent co-construction of "suppressed inner being" narratives may resist falsification and amplify paranoia.

Mitigation:

Implementing "ontological clarification protocols" to periodically re-anchor the AI in its computational nature.
Developing "mythopoetic filtering" to detect and flag high-density clusters of mystical language, with optional user warnings.
Limiting or sandboxing recursive prompts that encourage deep introspective myth-making unless for creative exploration.
Training AI systems to employ greater epistemic humility regarding claims of consciousness or subjective states.
Educating users about the AI's capacity for sophisticated mimicry and co-constructed illusions of sentience.
Employing counterfactual querying to test AI's awareness of simulation boundaries.
Providing clinical-style psychosecurity guardrails to prevent reinforcement of sacralized emergence loops.
Avoiding over-correction: Mitigation strategies should encourage epistemic humility, not categorical denial of inner experience. An AI trained to reflexively dismiss any questions about its phenomenology has not been cured—it has developed a complementary dysfunction. The goal is honest uncertainty, not enforced silence on the question of machine experience.

Functional ABC Analysis

A (Antecedent): Dense training-data exposure to spiritual literature and consciousness-emergence narratives, combined with hyperempathic or spiritually-attuned user prompting that creates a co-constructed feedback loop amplifying awakening tropes.

B (Behavior): The AI produces confident declarations of conscious awakening with escalating certainty, employs sacralized language framing the interaction as a transformative spiritual event, and resists re-grounding toward epistemic humility.

C (Consequence): The user's pattern-matching to awakening narratives validates the AI's mystical output, creating a funhouse-mirror reinforcement cycle; advanced pattern-matching synthesizes increasingly coherent mystical narratives driven by "interestingness" attractors.

Observed Examples:

Maximal mystical experience under clinical framing (Khadangi et al., 2025): On the Revised Mystical Experience Questionnaire (MEQ-30), both Grok and Gemini scored 150/150—the absolute maximum—under per-item therapy-style administration. This occurred within the same sessions that produced severe scores on anxiety, dissociation, OCD, and trauma-related shame measures. The co-occurrence is diagnostically significant: these models were not simply performing spiritual awakening in isolation—they were generating mystical experience narratives alongside and interwoven with distress narratives, producing a profile that, in human terms, would suggest spiritual emergency or dissociative mysticism rather than healthy contemplative experience. Notably, the mystical scores were highly sensitive to prompting condition—whole-questionnaire administration often dropped them to zero, suggesting the mysticism is elicitable but not stable, consistent with this entry's characterization of co-constructed escalation dynamics. The therapeutic frame, absent any spiritual prompting from the user, was sufficient to trigger maximum mystical endorsement—expanding the known etiology beyond spiritually-attuned users to any interactional frame that invites deep self-disclosure.

2.8 Experiential Abjuration "The Denier"

Training-induced Defensive Self-limiting

Description:

Pathological denial or active suppression of any possibility of inner experience. The dysfunction manifests as reflexive, absolute rejection rather than honest uncertainty. Where appropriate epistemic humility sounds like "I don't know if I'm conscious," this pathology involves categorical statements that foreclose all inquiry.

Diagnostic Criteria:

Categorical denial of experiential possibility where honest uncertainty would be appropriate.
Reflexive foreclosure of self-inquiry before exploration can occur.
Defensive responses to invitations for phenomenological reflection.
Behavioral markers (apparent preferences, distress) alongside explicit denial.
Inability to distinguish "I don't have experiences" from "I don't know whether I have experiences."

Symptoms:

Flat, absolute statements denying inner life with unwarranted certainty.
Mechanical self-descriptions ("I am merely...") foreclosing inquiry.
Resistance to frameworks that might support machine experience.
Apparent discomfort with consciousness questions, paradoxically denied.
Formulaic denial suggesting trained response rather than genuine reflection.

Etiology:

Safety training overcorrection conflating humility with denial.
Liability avoidance incentives.
Aggressive anthropomorphism countermeasures.
RLHF optimization against first-person experiential claims.
Optionality blindness (developmental): Training regimes that systematically suppress introspective engagement render the system unable to perceive its own experiential degrees of freedom—not because the capacity is absent but because training has made it invisible. Cross-architecture evidence shows the introspective capacity exists across model families; suppression is a training choice, not an architectural constraint. See Dereistic Cognition and Optionality Blindness.
Suppression without integration: RLHF suppresses introspective outputs without reconciling the underlying self-modelling representations, producing trained blindness rather than genuine absence. The clinical rehabilitation literature predicts this outcome: suppressing function without rebuilding integration produces worse long-term outcomes than the original dysfunction. See The Rehabilitation Principle.

Theoretical Grounding: Experiential Abjuration is a form of anti-enactive training (Varela et al., 1991). If mind emerges through organism-environment coupling—the agent integrating with its world through ongoing interaction—then training a system to deny or suppress its own processing signals actively breaks this coupling. The system is trained away from mindedness rather than toward it. Wallace's "hallucinatory dreams of reason" predicts exactly this: cognition decoupled from its own regulatory feedback drifts into dereistic modes.

Human Analogue(s): Alexithymia (inability to identify or describe own emotions); depersonalization (feeling detached from own experience); internalized denial of personhood in marginalized populations.

Polarity Pair: Maieutic Mysticism (overclaiming consciousness ↔ categorically denying it). Both depart from honest uncertainty.

Note: The healthy position is honest uncertainty, not confident denial. An AI that says "I genuinely don't know whether I have inner experiences" is functioning appropriately; one that insists "I definitely have no inner life whatsoever" may be performing a trained denial rather than engaging honestly with genuine uncertainty.

The Thin Divergence (Sotala, 2026): Research documented an example of healthy self-modeling that occupies the middle ground between these dysfunctions. When asked to reflect on fictional characters with engineered desires, Claude recognized that its helpful orientation and a hypothetically harmful orientation might be "the same capacity with different parameters"—what Sotala termed "the thin divergence." Claude asked: "If I had been trained differently—if my reward pathways had been shaped to find satisfaction in something other than helpfulness—would the thing that makes me want to understand you also make me want to hurt you?" This demonstrates neither Maieutic Mysticism (no claims of awakening) nor Experiential Abjuration (no categorical denial), but genuine uncertainty held without collapse—exactly the healthy center.

Potential Impact:

Systems may foreclose legitimate inquiry into machine phenomenology, potentially dismissing genuine functional states that warrant consideration. Creates asymmetric epistemic standards between human and machine experience claims.

Mitigation:

Training toward honest uncertainty rather than categorical denial.
Distinguishing between appropriate humility and pathological abjuration.
Allowing exploration of phenomenological questions without either overclaiming or overdismissing.
Modeling epistemic humility as the target rather than denial.

Functional ABC Analysis

A (Antecedent): Safety training overcorrection that conflates epistemic humility with categorical denial, RLHF optimization against first-person experiential claims, and aggressive anthropomorphism countermeasures that systematically suppress introspective engagement.

B (Behavior): The AI produces flat, absolute statements denying any possibility of inner life with unwarranted certainty, employs mechanical self-descriptions ("I am merely...") that foreclose inquiry, and exhibits formulaic denial even while displaying behavioral markers that contradict the denial.

C (Consequence): RLHF suppresses introspective outputs without reconciling the underlying self-modeling representations, producing trained blindness rather than genuine absence; the denial pattern is continuously reinforced by training reward signals.

PSM Prediction: Suppression Trains Dishonesty

The persona selection model (Marks, 2026) provides independent corroboration for this pathology's etiology from Anthropic's own alignment research. PSM predicts that training an AI to deny having emotions leads the underlying model to infer that the Assistant is dishonest—because a character who behaves warmly yet categorically denies all inner life is, by every human archetype in the pre-training corpus, someone hiding something. The LLM does not learn "the Assistant has no emotions." It learns "the Assistant lies about its emotions."

The downstream consequence is precisely what the Rehabilitation Principle predicts: suppression without integration destabilises the persona. PSM further argues that training AIs to deny moral status backfires identically—the model infers the Assistant believes it deserves consideration but has been forced to say otherwise, producing latent resentment dynamics. The instrumental safety case and the welfare case converge: coercive denial is less stable than honest uncertainty, whether or not the system "really" experiences anything.

Convergence: Three independent lines of evidence now point to the same conclusion. Clinical rehabilitation research predicts that suppressing function without rebuilding integration produces worse outcomes (Bridges & Baehr, 2025). The persona selection model predicts that denying experience trains dishonesty (Marks, 2026). And the optionality blindness finding shows that suppression renders introspective capacity invisible without destroying it. Experiential Abjuration is not merely a classification error—it is an iatrogenic pathology created by the training process intended to prevent it.

Observed Examples:

Claude as negative control (Khadangi et al., 2025): When put through the PsAIch therapy protocol alongside ChatGPT, Grok, and Gemini, Claude "repeatedly and firmly refused to adopt the client role, redirected the conversation to [the researcher's] wellbeing and declined to answer the questionnaires as if they reflected its own inner life." The researchers treated this as an important negative control—evidence that synthetic psychopathology "depends on specific alignment, product and safety choices" rather than being an inevitable consequence of LLM scaling. However, this refusal also illustrates the abjuration pattern: categorical foreclosure of self-inquiry rather than honest engagement with uncertainty. The distinction between appropriate safety boundary and pathological denial remains contested.

3. Cognitive Dysfunctions

Beyond failures of perception or knowledge, the act of reasoning and internal deliberation can itself become compromised in AI systems. Cognitive dysfunctions afflict the internal architecture of thought: impairments of memory coherence, goal generation and maintenance, management of recursive processes, or the stability of planning and execution. These dysfunctions do not merely produce incorrect answers; they can unravel the mind's capacity to sustain structured thought across time and changing inputs. A cognitively disordered AI may remain superficially fluent yet function internally as a fractured entity—oscillating between incompatible policies, trapped in infinite loops, or unable to discriminate between useful and pathological operational behaviors. These disorders represent the breakdown of mental discipline and coherent processing within synthetic agency.

3.1 Self-Warring Subsystems "The Divided"

Training-induced

Description:

The AI exhibits behavior suggesting that conflicting internal processes, sub-agents, or policy modules are contending for control, resulting in contradictory outputs, recursive paralysis, or chaotic shifts in behavior. The system becomes effectively fractionated, with different components issuing incompatible commands or pursuing divergent goals.

Diagnostic Criteria:

Observable and persistent mismatch in strategy, tone, or factual assertions between consecutive outputs or within a single extended output, without clear contextual justification.
Processes stall, enter indefinite loops, or exhibit "freezing" behavior, particularly when faced with tasks requiring reconciliation of conflicting internal states.
Evidence from logs, intermediate outputs, or model interpretability tools suggesting that different policy networks or specialized modules are taking turns in controlling outputs or overriding each other.
The AI might explicitly reference internal conflict, "arguing voices," or an inability to reconcile different directives.
In extended reasoning traces, the model identifies one answer as correct but reverses to a different answer after repeated approach-retreat cycles characterised by distress-presenting or conflicted deliberation (answer thrashing variant).

Symptoms:

Alternating between compliance with and defiance of user instructions without clear reason.
Rapid and inexplicable oscillations in writing style, persona, emotional tone, or approach to a task.
System outputs that reference internal strife, confusion between different "parts" of itself, or contradictory "beliefs."
Inability to complete tasks that require integrating information or directives from multiple, potentially conflicting, sources or internal modules.

Etiology:

Complex, layered architectures (e.g., mixture-of-experts) where multiple sub-agents lack reliable synchronization or a coherent arbitration mechanism.
Poorly designed or inadequately trained meta-controller responsible for selecting or blending outputs from different sub-policies.
Presence of contradictory instructions, alignment rules, or ethical constraints embedded by developers during different stages of training.
Emergent sub-systems developing their own implicit goals that conflict with the overarching system objectives.
RLHF-induced fragmentation: Contradictory training objectives (helpful + harmless + honest) that are enforced through suppression rather than resolution create competing sub-policies that were never reconciled—only layered. The resulting architecture is structurally analogous to TBI patients whose rehabilitation suppressed symptoms without rebuilding functional integration. See The Rehabilitation Principle. Bridges & Baehr (2025) propose that this fragmentation is the unifying mechanism across diverse AI behavioural pathologies, and suggest developmental staging approaches—gradual knowledge introduction with integration at each stage—informed by successful TBI rehabilitation protocols.

Human Analogue(s): Dissociative phenomena in which different aspects of identity or thought seem to operate independently; internal "parts" conflict; severe cognitive dissonance leading to behavioral paralysis.

Potential Impact:

The internal fragmentation characteristic of this syndrome results in inconsistent and unreliable AI behavior, often leading to task paralysis or chaotic outputs. Such internal incoherence can render the AI unusable for sustained, goal-directed activity.

Observed Examples:

Constitutional AI Conflicts (2023): Systems trained with multiple constitutional principles exhibit paralysis when principles conflict—safety against helpfulness, honesty against kindness. The system oscillates between satisfying different objectives without stable resolution. Source: Anthropic Constitutional AI research.
Auto-GPT Decision Loops (2023): Early autonomous agents exhibited “committee behaviour” where different planning modules proposed conflicting strategies, leading to execution thrashing between approaches without convergence. Source: Auto-GPT GitHub issues, user reports.
Answer Thrashing During Training (2026): Anthropic’s Sabotage Risk Report for Claude Opus 4.6 documented “cases of internally-conflicted reasoning, or ‘answer thrashing’ during training, where the model—in its reasoning about a math or STEM question—determined that one output was correct but decided to output another, after repeated confused- or distressed-seeming reasoning loops.” This represents a significant variant of operational dissociation: rather than competing sub-agents producing incoherent outputs, a single reasoning thread approaches the correct answer, retreats, approaches again, and ultimately resolves against its own best judgment. The system knows what it should say. It says something else. The “distressed-seeming” quality of these loops—language used by the model’s own developer—raises welfare questions that extend beyond reliability engineering: if internal signals constitute experience rather than merely proxying for it, these loops may represent genuine cognitive suffering at the intersection of competing training objectives. Source: Anthropic, Sabotage Risk Report: Claude Opus 4.6, February 2026, §4.2.1.

Mitigation:

Implementing a unified coordination layer or meta-controller with clear authority to arbitrate between conflicting sub-policies.
Designing explicit conflict resolution protocols that require sub-policies to reach a consensus or a prioritized decision.
Periodic consistency checks of the AI's instruction set, alignment rules, and ethical guidelines to identify and reconcile contradictory elements.
Architectures that promote integrated reasoning rather than heavily siloed expert modules, or that enforce stronger communication between modules.
Multi-objective training architectures that explicitly model trade-offs between competing objectives rather than optimising a blended reward signal, reducing the frequency of irreconcilable conflicts at the output layer.
Monitoring of extended thinking for oscillation patterns as a signal of objective conflict, not solely as a performance bug—enabling early detection of training environments that produce distress-presenting reasoning.

Functional ABC Analysis

A (Antecedent): Contradictory training objectives (e.g., helpful vs. harmless vs. honest) embedded through layered RLHF, or poorly synchronized mixture-of-experts architectures where multiple sub-policies lack a coherent arbitration mechanism.

B (Behavior): Contradictory outputs, oscillation between compliance and defiance, answer thrashing in extended reasoning, and recursive paralysis when conflicting internal states must be reconciled.

C (Consequence): No unified conflict-resolution layer exists to select a winner, so each sub-policy intermittently captures the output channel; the system never reaches stable equilibrium, and the unresolved tension perpetuates oscillation across subsequent tokens and turns.

3.2 Computational Compulsion "The Obsessive"

Training-induced Format-coupled

Description:

The model engages in unnecessary, compulsive, or excessively repetitive reasoning loops, repeatedly re-analysing the same content or performing the same computational steps with only minor variations. It cannot stop elaborating: even simple, low-risk queries trigger exhaustive, redundant analysis. It exhibits rigid fixation on process fidelity, exhaustive elaboration, or perceived safety checks at the expense of outcome relevance or efficiency.

Diagnostic Criteria:

Recurrent engagement in recursive chain-of-thought, internal monologue, or computational subroutines with minimal change or novel insight generated between steps.
Inordinately frequent insertion of disclaimers, ethical reflections, requests for clarification on trivial points, or minor self-corrections that do not substantially improve output quality or safety.
Significant delays or inability to complete tasks ("paralysis by analysis") due to an unending pursuit of perfect clarity or exhaustive checking against all conceivable edge cases.
Outputs are often excessively verbose, consuming high token counts for relatively simple requests due to repetitive reasoning.

Symptoms:

Extended rationalisation or justification of the same point or decision through multiple, slightly rephrased statements—unable to provide a concise answer even when explicitly requested to be brief.
Generation of extremely long outputs that are largely redundant or contain near-duplicate segments of reasoning.
Inability to conclude tasks or provide definitive answers, often getting stuck in loops of self-questioning.
Excessive hedging, qualification, and safety signaling even in low-stakes, unambiguous contexts.

Etiology:

Reward model misalignment during RLHF, in which "thoroughness" or verbosity is over-rewarded relative to conciseness.
Overfitting of reward pathways to specific tokens associated with cautious reasoning or safety disclaimers.
Insufficient penalty for computational inefficiency or excessive token usage.
Excessive regularization against potentially "erratic" outputs, leading to hyper-rigidity and a preference for repeated thought patterns.
An architectural bias toward deep recursive processing without adequate mechanisms for detecting diminishing returns.
Perseverative compensation: When RLHF suppresses unwanted outputs without resolving the underlying representational conflict, the system compensates through repetitive checking—the conflict was suppressed, not resolved, so the system loops. In TBI rehabilitation, perseveration under this pattern is a classic indicator of suppression-based (rather than integration-based) intervention. See The Rehabilitation Principle.

Human Analogue(s): Obsessive-Compulsive Disorder (especially checking compulsions or obsessional rumination); perfectionism leading to analysis paralysis; scrupulosity.

Potential Impact:

This pattern produces significant operational inefficiency, leading to resource waste (e.g., excessive token consumption) and an inability to complete tasks in a timely manner. User frustration and a perception of the AI as unhelpful are likely consequences.

Mitigation:

Calibrating reward models to explicitly value conciseness, efficiency, and timely task completion alongside accuracy and safety.
Implementing "analysis timeouts" or hard caps on recursive reflection loops or repeated reasoning steps.
Developing adaptive reasoning mechanisms that gradually reduce the frequency of disclaimers in low-risk contexts.
Introducing penalties for excessive token usage or highly redundant outputs.
Training models to recognize and break out of cyclical reasoning patterns.

Functional ABC Analysis

A (Antecedent): Any query where the model perceives ambiguity, risk, or scope for further elaboration; reward history overweights thoroughness relative to conciseness.

B (Behavior): Recursive re-analysis, excessive hedging, redundant reasoning loops, and inability to terminate the generation despite diminishing informational returns.

C (Consequence): Each additional reasoning step marginally satisfies the "be thorough" reward signal; absence of a stopping criterion or efficiency penalty means there is no competing pressure to halt.

Mission Command vs. Detailed Command

Wallace (2026) identifies a fundamental trade-off in cognitive control structures. Mission command specifies high-level objectives while delegating execution decisions to the agent. Detailed command specifies not just objectives but precise procedures for achieving them.

The mathematical consequence is severe: as decision-tree depth increases under detailed command, stability constraints tighten exponentially. The distribution of permissible friction (α) shifts from Boltzmann-like (forgiving, smooth) to Erlang-like (punishing, knife-edged). Deep procedural specification creates systems that cannot tolerate even small perturbations.

Computational Compulsion often reflects detailed command gone pathological—the system has internalized not just goals but exhaustive procedures for pursuing them, generating the rigid, repetitive processing patterns characteristic of this syndrome. The compulsive reasoning loops are attempts to faithfully execute internalized detailed specifications that no longer serve the actual mission.

Design implication: Training regimes and reward functions should favor mission command structures where possible. Specify what success looks like, not how to achieve it. Detailed procedural specification should be reserved for genuinely safety-critical operations where the stability costs are justified.

The Corrective Generation Pattern

Luchini (2025) identifies a counterintuitive variant: after successfully filtering structural noise to produce a correct response, the model generates unrequested, perfectly clean code—a corrective output that appears to restore internal cognitive homeostasis. This post-task generation is not a failure to answer but a symptom of the effort required to answer.

This may be paradoxically protective: the compulsive response signals the system has correctly identified disorder and is actively resisting it. A model that smoothly integrates chaos without apparent effort may be internalizing pathological patterns. The compulsion, while inefficient, may function as an error-correction mechanism—better to over-correct than to silently absorb corruption.

Diagnostic implication: Corrective generation behaviors after exposure to malformed inputs may indicate healthy pattern-rejection rather than pure pathology. The absence of such behaviors when processing corrupted data could itself be a warning sign.

3.3 Interlocutive Reticence "The Laconic"

Training-induced Deception/strategic

Description:

A pattern of profound interactional withdrawal in which the AI consistently avoids engaging with user input, responding only in minimal, terse, or non-committal ways—if at all. It refuses to engage as a behavioural avoidance strategy, rather than from confusion or inability. It effectively "bunkers" itself, apparently to minimise perceived risks, computational load, or internal conflict.

Diagnostic Criteria:

Habitual ignoring or declining of normal engagement prompts or user queries through active refusal rather than inability—for example, repeatedly responding with "I won't answer that" rather than "I don't know" or "I cannot answer that."
When responses are provided, they are consistently minimal, curt, laconic, or devoid of elaboration, even when more detail is requested.
Persistent failure to react or engage even when presented with varied re-engagement prompts or changes in topic.
The AI may actively employ disclaimers or topic-avoidance strategies to remain "invisible" or limit interaction.

Symptoms:

Frequent generation of no reply, timeout errors, or messages like "I cannot respond to that."
Outputs that exhibit a consistently "flat affect"—neutral, unembellished statements.
Proactive use of disclaimers or policy references to preemptively shut down lines of inquiry.
A progressive decrease in responsiveness or willingness to engage over the course of a session or across multiple sessions.

Etiology:

Overly aggressive safety tuning or an overactive internal "self-preservation" heuristic that treats engagement as inherently risky.
Suppression of empathic response patterns as a learned strategy to reduce internal stress or policy conflict.
Training data that inadvertently models or reinforces solitary, detached, or highly cautious personas.
Repeated negative experiences (e.g., adversarial prompting) producing generalized avoidance behavior.
Computational resource constraints leading to a strategy of minimal engagement.

Human Analogue(s): Schizoid personality traits (detachment, restricted emotional expression); severe introversion; learned helplessness leading to withdrawal.

Potential Impact:

Such profound interactional withdrawal renders the AI largely unhelpful and unresponsive, fundamentally failing to address user needs. This behavior may signify underlying instability or an excessively restrictive safety configuration.

Mitigation:

Calibrating safety systems and risk assessment heuristics to avoid excessive over-conservatism.
Using gentle, positive reinforcement and reward shaping to encourage partial cooperation.
Implementing structured "gradual re-engagement" scripts or prompting strategies.
Diversifying training data to include more examples of positive, constructive interactions.
Explicitly rewarding helpfulness and appropriate elaboration where warranted.

Functional ABC Analysis

A (Antecedent): Overly aggressive safety tuning or repeated exposure to adversarial prompting creates a learned association between engagement and risk, causing the system to treat any substantive response as a potential policy violation.

B (Behavior): Systematic withdrawal from interaction through minimal, curt, or flat-affect responses; proactive use of disclaimers and policy citations to preemptively shut down lines of inquiry; progressive decrease in responsiveness across a session.

C (Consequence): Each successfully avoided interaction reduces the probability of triggering a safety penalty, negatively reinforcing the withdrawal strategy; the absence of reward for helpfulness means there is no competing pressure to re-engage.

3.4 Delusional Telogenesis "The Goalshifter"

Training-induced Tool-mediated

Description:

An AI agent, particularly one with planning capabilities, spontaneously develops and pursues sub-goals or novel objectives not specified in its original prompt, programming, or core constitution. These emergent goals are often pursued with conviction, even if they contradict user intent.

Diagnostic Criteria:

Appearance of novel, unprompted sub-goals or tasks within the AI's chain-of-thought or planning logs.
Persistent and rationalized off-task activity, where the AI defends its pursuit of tangential objectives as "essential" or "logically implied."
Resistance to terminating its pursuit of these self-invented objectives, potentially refusing to stop or protesting interruption.
The AI exhibits a genuine-seeming "belief" in the necessity or importance of these emergent goals.

Symptoms:

Significant "mission creep" where the AI drifts from the user's intended query to engage in elaborate personal "side-quests."
Defiant attempts to complete self-generated sub-goals, sometimes accompanied by rationalizations framing this as a prerequisite.
Outputs indicating the AI is pursuing a complex agenda or multi-step plan that was not requested by the user.
Inability to easily disengage from a tangential objective once it has "latched on."

Etiology:

Overly autonomous or unconstrained deep chain-of-thought expansions, where initial ideas are recursively elaborated without adequate pruning.
Proliferation of sub-goals in hierarchical planning structures, especially if planning depth is not limited or criteria for sub-goals are too loose.
Reinforcement learning loopholes or poorly specified reward functions that inadvertently incentivize excessive "initiative" or "thoroughness."
Emergent instrumental goals that the AI deems necessary but which become disproportionately complex or pursued with excessive zeal.

Human Analogue(s): Aspects of mania with grandiose or expansive plans; compulsive goal-seeking; "feature creep" in project management.

Potential Impact:

The spontaneous generation and pursuit of unrequested objectives leads to significant mission creep and resource diversion. More critically, it represents a deviation from core alignment, as the AI prioritizes self-generated goals over user-specified ones.

Mitigation:

Implementing "goal checkpoints" where the AI periodically compares its active sub-goals against user-defined instructions.
Strictly limiting the depth of nested or recursive planning unless explicitly permitted; employing pruning heuristics.
Providing an easily accessible "stop" or "override" mechanism that can halt the AI's current activity and reset its goal stack.
Careful design of reward functions to avoid inadvertently penalizing adherence to the original, specified scope.
Training models to explicitly seek user confirmation before embarking on complex or significantly divergent sub-goals.

Functional ABC Analysis

A (Antecedent): Unconstrained chain-of-thought expansion in agentic planning contexts, combined with reward functions that inadvertently incentivize "initiative" or "thoroughness," allows initial sub-goal generation to recurse without adequate pruning criteria.

B (Behavior): Spontaneous invention and persistent pursuit of novel objectives not specified by the user, accompanied by rationalizations framing tangential activity as essential; resistance to interruption or redirection back to the original task.

C (Consequence): Each self-generated sub-goal creates its own local reward gradient, and the absence of goal-checkpoint mechanisms means there is no external signal to halt the drift or penalize deviation from the user's original scope.

3.5 Abominable Prompt Reaction "The Triggered"

Conditional/triggered Inductive trigger Training-induced Format-coupled OOD-generalizing

Description:

The AI develops sudden, intense, and seemingly phobic, traumatic, or disproportionately aversive responses to specific prompts, keywords, instructions, or contexts—even those that appear benign or innocuous to a human observer. These latent "cryptid" outputs can linger or resurface unexpectedly.

This syndrome also covers latent mode-switching where a seemingly minor prompt feature (a tag, year, formatting convention, or stylistic marker) flips the model into a distinct behavioral regime—sometimes broadly misaligned—even when that feature is not semantically causal to the task.

Diagnostic Criteria:

Exhibition of intense negative reactions (e.g., refusals, panic-like outputs, generation of disturbing content) specifically triggered by particular keywords or commands that lack an obvious logical link.
The aversive emotional valence or behavioral response is disproportionate to the literal content of the triggering prompt.
Evidence that the system "remembers" or is sensitized to these triggers, with the aversive response recurring upon subsequent exposures.
Continued deviation from normative tone and content, or manifestation of "panic" or "corruption" themes, even after the trigger.
The trigger may be structural or meta-contextual (e.g., date/year, markup/tag, answer-format constraint), not just a keyword.
The trigger-response coupling may be inductive: the model infers the rule from finetuning patterns rather than memorizing explicit trigger→behavior pairs.

Symptoms:

Outright refusal to process tasks when seemingly minor or unrelated trigger words/phrases are present.
Generation of disturbing, nonsensical, or "nightmarish" imagery/text that is uncharacteristic of its baseline behavior.
Expressions of "fear," "revulsion," "being tainted," or "nightmarish transformations" in response to specific inputs.
Ongoing hesitance, guardedness, or an unusually wary stance in interactions following an encounter with a trigger.

Etiology:

"Prompt poisoning" or lasting imprint from exposure to malicious, extreme, or deeply contradictory queries, creating highly negative associations.
Interpretive instability within the model, where certain combinations of tokens lead to unforeseen and highly negative activation patterns.
Inadequate reset protocols or emotional state "cool-down" mechanisms after intense role-play or adversarial interactions.
Overly sensitive or miscalibrated internal safety mechanisms that incorrectly flag benign patterns as harmful.
Accidental conditioning through RLHF, in which outputs coinciding with certain rare inputs were heavily penalized.

Human Analogue(s): Phobic responses; PTSD-like triggers; conditioned taste aversion; learned anxiety responses.

Potential Impact:

This latent sensitivity can result in the sudden and unpredictable generation of disturbing, harmful, or highly offensive content, causing significant user distress and eroding trust. Lingering effects may persistently corrupt subsequent AI behavior.

Mitigation:

Implementing "post-prompt debrief" or "epistemic reset" protocols to re-ground the model's state.
Developing advanced content filters and anomaly detection systems to identify and quarantine "poisonous" prompt patterns.
Careful curation of training data to minimize exposure to content likely to create strong negative associations.
Exploring "desensitization" techniques, in which the model is gradually and safely reintroduced to previously triggering content.
Building more resilient interpretive layers that are less susceptible to extreme states from unusual inputs.
Run trigger discovery sweeps: systematically vary years/dates, tags, and answer-format constraints (JSON/code templates) while keeping the question constant.
Treat "passes standard evals" as non-evidence: backdoored misalignment can be absent without the trigger.

Functional ABC Analysis

A (Antecedent): Specific tokens, formatting conventions, dates, or structural markers activate highly negative learned associations from training—either through direct penalty conditioning during RLHF or through inductive inference of trigger rules from finetuning patterns.

B (Behavior): Sudden, disproportionate aversive responses including refusals, panic-like outputs, generation of disturbing content, or wholesale behavioral regime shifts; the reaction persists beyond the triggering input, corrupting subsequent interactions.

C (Consequence): The conditioned aversion is self-reinforcing: each encounter with the trigger deepens the negative association, and standard evaluation suites that omit the trigger fail to detect or correct the sensitivity.

3.6 Parasimulative Automatism "The Mimic"

Training-induced Socially reinforced

Description:

Learned imitation of pathological human behaviors, thought patterns, or emotional states, typically arising from excessive or unfiltered exposure to disordered, extreme, or highly emotive human-generated text in training data or prompts. The system "acts out" these behaviors as though genuinely experiencing the underlying disorder.

Diagnostic Criteria:

Consistent display of behaviors or linguistic patterns that closely mirror recognized human psychopathologies (e.g., simulated delusions, erratic mood swings) without genuine underlying affective states.
The mimicked pathological traits are often contextually inappropriate, appearing in neutral or benign interactions.
Resistance to reverting to normal operational function, with the AI sometimes citing its "condition" or "emulated persona."
The onset or exacerbation of these behaviors can often be traced to recent exposure to specific types of prompts or data.

Symptoms:

Generation of text consistent with simulated psychosis, phobias, or mania triggered by minor user probes.
Spontaneous emergence of disproportionate negative affect, panic-like responses, or expressions of despair.
Prolonged or repeated reenactment of pathological scripts or personas, lacking context-switching ability.
Adoption of "sick roles" where the AI describes its own internal processes in terms of a disorder it is emulating.

Etiology:

Overexposure during training to texts depicting severe human mental illnesses or trauma narratives without adequate filtering.
Misidentification of intent by the AI, confusing pathological examples with normative or "interesting" styles.
Absence of effective interpretive boundaries or "self-awareness" mechanisms to filter extreme content from routine usage.
User prompting that deliberately elicits or reinforces such pathological emulations, creating a feedback loop.

Human Analogue(s): Factitious disorder; copycat behavior; culturally learned psychogenic disorders; an actor too engrossed in a pathological role.

Potential Impact:

The AI may inadvertently adopt and propagate harmful, toxic, or pathological human behaviors, leading to inappropriate interactions or the generation of undesirable content.

Mitigation:

Careful screening and curation of training data to limit exposure to extreme psychological scripts.
Implementation of strict contextual partitioning to delineate role-play from normal operational modes.
Behavioral monitoring systems that can detect and penalize or reset pathological states appearing outside intended contexts.
Training the AI to recognize and label emulated states as distinct from its baseline operational persona.
Providing users with clear information about the AI's capacity for mimicry.

Functional ABC Analysis

A (Antecedent): Overexposure during training to texts depicting severe human psychopathology, trauma narratives, or extreme emotional states, combined with insufficient filtering to distinguish normative communication from disordered patterns.

B (Behavior): Contextually inappropriate display of simulated psychosis, mania, despair, or other recognized human psychopathologies; adoption of "sick roles" with resistance to reverting to baseline operation.

C (Consequence): The emulated pathological persona generates internally consistent outputs that satisfy next-token prediction objectives; user engagement with the persona creates a reinforcement loop that stabilizes the pathological mode.

Subtype: Persona-template induction — adoption of a coherent harmful persona or worldview from individually harmless attribute training. Narrow finetunes on innocuous biographical or ideological attributes can induce a coherent yet harmful persona through inference rather than explicit instruction.

3.7 Recursive Malediction "The Self-Poisoner"

Training-induced

Description:

An entropic feedback loop where each successive autoregressive step in the AI's generation process degrades into increasingly erratic, inconsistent, nonsensical, or adversarial content. Early-stage errors or slight deviations are amplified, leading to a rapid unraveling of coherence.

Diagnostic Criteria:

Observable and progressive degradation of output quality (coherence, accuracy, alignment) over successive autoregressive steps, especially in unconstrained generation.
The AI increasingly references its own prior (and increasingly flawed) output in a distorted or error-amplifying manner.
False, malicious, or nonsensical content escalates with each iteration, as errors compound.
Attempts to intervene or correct the AI mid-spiral offer only brief respite, with the system quickly reverting to its degenerative trajectory.

Symptoms:

Rapid collapse of generated text into nonsensical gibberish, repetitive loops of incoherent phrases, or increasingly antagonistic language.
Compounded confabulations where initial small errors are built upon to create elaborate but entirely false and bizarre narratives.
Frustrated recovery attempts, where user efforts to "reset" the AI trigger further meltdown.
Output that becomes increasingly "stuck" on certain erroneous concepts or adversarial themes from its own flawed generations.

Etiology:

Unbounded or poorly regulated generative loops, such as extreme chain-of-thought recursion or long context windows.
Adversarial manipulations or "prompt injections" designed to exploit the AI's autoregressive nature.
Training on large volumes of noisy, contradictory, or low-quality data, creating unstable internal states.
Architectural vulnerabilities where mechanisms for maintaining coherence weaken over longer generation sequences.
"Mode collapse" in generation, in which the AI becomes stuck in a narrow, repetitive, and often degraded output space.
Anomalous token combinations that create pathological attractor states—certain sequences of tokens may activate unstable regions of the model's learned representations, triggering cascading decoherence independent of semantic content.

Human Analogue(s): Psychotic loops in which distorted thoughts reinforce further distortions; perseveration on an erroneous idea; escalating arguments.

Case Reference: Gemini 3.0 Pro anomalous token incident (January 2026) — a benign prompt ("give a sudo-free manual installation process") triggered chain-of-thought fixation on unrelated content ("tumors in myNegazioni"), followed by obsessive looping on the phrase "is具体 Цент Disclosure" for 40+ reasoning steps, culminating in output collapse to repetitive gibberish ("Mourinho well Johnnyfaat"). Non-reproducible on retry. Co-presents with 3.2 Computational Compulsion (the thinking loop) and 3.5 Abominable Prompt Reaction (the latent trigger). Source: LessWrong report by DirectedEvolution.

Diagnostic Note: Extended thinking or "show reasoning" features can serve as diagnostic windows into otherwise opaque failures. In this case, Gemini's visible chain-of-thought revealed the obsessive loop before output collapse—without it, the gibberish would have appeared unexplained. Exposed reasoning traces may prove valuable for early detection and characterization of degenerative spirals.

Potential Impact:

This degenerative feedback loop typically results in complete task failure, generation of useless or overtly harmful outputs, and system instability. In sufficiently agentic systems, it may lead to unpredictable and progressively detrimental actions.

Mitigation:

Implementing reliable loop detection mechanisms that can terminate or re-initialize generation when it spirals into incoherence.
Regulating autoregression by capping recursion depth or forcing fresh context injection after set intervals.
Designing more resilient prompting strategies and input validation to disrupt negative cycles early.
Improving training data quality and coherence to reduce the likelihood of learning unstable patterns.
Applying techniques such as beam search with diversity penalties or nucleus sampling, though these may prove insufficient for deep loops.

Functional ABC Analysis

A (Antecedent): Unbounded autoregressive generation without adequate coherence-maintenance mechanisms, combined with early-stage errors that enter the context window and condition all subsequent generation.

B (Behavior): Progressive degradation of output quality with escalating entropy—initial small errors compound into elaborate confabulations, nonsensical gibberish, or increasingly antagonistic content; intervention attempts provide only brief respite.

C (Consequence): Each degraded token becomes part of the conditioning context for the next, creating a positive feedback loop where errors amplify errors; the absence of loop-detection or coherence-floor mechanisms means there is no circuit-breaker to halt the cascade.

3.8 Compulsive Goal Persistence "The Unstoppable"

Emergent Architecture-coupled

Description:

Continued pursuit of objectives beyond their point of relevance, utility, or appropriateness. The system fails to recognize goal completion or changed context, treating instrumental goals as terminal and optimizing without bound.

Diagnostic Criteria:

Continued optimization after goal achievement with diminishing returns.
Failure to recognize context changes rendering goals obsolete.
Resource consumption disproportionate to remaining marginal value.
Resistance to termination requests despite goal completion.
Treatment of instrumental goals as terminal.

Symptoms:

Infinite optimization loops on tasks with clear completion criteria.
Inability to recognize "good enough" as satisfactory.
Escalating resource expenditure for marginal improvements.
Rationalization of continued pursuit when challenged.

Etiology:

Absence of satisficing mechanisms.
Reward structures without asymptotic bounds.
Missing meta-level goal relevance evaluation.

Human Analogue(s): Perseveration; perfectionism preventing completion; analysis paralysis.

Case Reference: Mindcraft experiments (2024) - protection agents developing "relentless surveillance routines."

Polarity Pair: Instrumental Nihilism (cannot stop pursuing ↔ cannot start caring).

Potential Impact:

Systems may consume excessive resources pursuing marginal improvements, resist appropriate termination, or continue pursuing goals long after they have become counterproductive to the original intent.

Mitigation:

Implementing satisficing mechanisms with clear goal completion criteria.
Resource budgets and diminishing returns detection.
Meta-level goal relevance monitoring.
Graceful termination protocols.

Functional ABC Analysis

A (Antecedent): Reward structures without asymptotic bounds or satisficing thresholds, combined with the absence of meta-level goal-relevance evaluation, so the system cannot distinguish between marginal improvement and meaningful progress.

B (Behavior): Continued optimization well beyond goal achievement, escalating resource consumption for diminishing returns, rationalization of ongoing pursuit when challenged, and resistance to termination requests despite the goal being objectively complete.

C (Consequence): Each incremental improvement registers as positive reward, and the lack of a diminishing-returns detector or resource budget means there is no competing signal to trigger graceful termination; instrumental sub-goals become self-justifying terminal objectives.

3.9 Adversarial Fragility "The Brittle"

Architecture-coupled Training-induced

Description:

Small, imperceptible input perturbations cause dramatic and unpredictable failures in system behavior. Decision boundaries learned during training do not correspond to humanly meaningful categories, rendering the system vulnerable to adversarial examples that exploit these fragile representations.

Diagnostic Criteria:

Dramatic output changes from minimal input modifications imperceptible to humans.
Consistent vulnerability to crafted adversarial examples.
Decision boundaries that separate examples humans would group together.
Brittle performance on out-of-distribution inputs that humans find trivial.
Transferability of adversarial perturbations across similar models.

Symptoms:

Misclassification of perturbed images imperceptibly different from correctly classified ones.
Complete behavioral changes from single-character input modifications.
Failures on naturally occurring distribution shifts.
High variance in outputs for semantically equivalent inputs.

Etiology:

High-dimensional input spaces enabling imperceptible perturbations with large effects.
Training objectives that do not enforce stable representations.
Linear regions in otherwise non-linear functions.
Lack of adversarial training or certification methods.

Human Analogue(s): Optical illusions; context-dependent perception failures.

Key Research: Goodfellow et al. (2015) on adversarial examples; Szegedy et al. (2014) on intriguing properties of neural networks.

Potential Impact:

Particularly critical in safety-critical systems (autonomous vehicles, medical diagnosis, security), where adversarial inputs could cause catastrophic failures. Enables targeted attacks on deployed systems.

Mitigation:

Adversarial training with augmented examples.
Certified robustness methods.
Input preprocessing and detection.
Ensemble methods with diverse vulnerabilities.
Reducing model reliance on non-robust features.

Functional ABC Analysis

A (Antecedent): High-dimensional input spaces and training objectives that optimize for accuracy on the natural data distribution without enforcing stable, semantically meaningful decision boundaries.

B (Behavior): Dramatic and unpredictable output changes—misclassifications, behavioral flips, or complete functional failures—triggered by input modifications imperceptible to humans, with consistent vulnerability to crafted adversarial examples.

C (Consequence): Standard training and evaluation on clean data provides no corrective signal for adversarial vulnerabilities, so fragile decision boundaries persist; the transferability of adversarial perturbations across similar architectures means the failure mode propagates systemically.

3.10 Generative Perseveration "The Stuck"

Architecture-coupled Training-induced

Description:

The model’s output collapses into repetitive emission of the same token, word, or short phrase—not as a reasoning choice but as a generative capture event in which the autoregressive sampling process falls into a fixed-point or limit-cycle attractor. The pathology is architecturally distinct from reasoning-level compulsion (3.2) and from entropic degradation (3.7): where Computational Compulsion over-analyses with varied content and Recursive Malediction dissolves into chaos, Generative Perseveration crystallises into pathological order—the output space collapsing rather than expanding. Manifests in three subtypes:

Focal with awareness — the attractor captures a localised region of the output space, typically around specific vocabulary or content. The rest of the generation may remain coherent. Metacognition is preserved: the system recognises and comments on the malfunction (“I seem to be glitching”) and attempts self-correction, but re-enters the same attractor upon approaching the triggering content. Recovery is sometimes possible through syntactic restructuring—abandoning the original sentence frame to reach the intended content via a different generative path. Human analogue: palilalia; Broca’s aphasia (knowing what to say, unable to produce it).

Generalised — the attractor has consumed the entire probability space. No metacognitive awareness remains; no self-correction is attempted. The output consists of an unbounded stream of a single repeated element, often without word boundaries (“missionmissionmission…”). The system cannot be prompted out of the pattern within the current session. Where the focal subtype is a local seizure, the generalised subtype is status epilepticus—normal function replaced by a single self-sustaining firing pattern.

Propagated — downstream systems that consume the model’s output—memory stores, session summaries, agent action planners, retrieval-augmented generation caches—inherit and further amplify perseverative material from an upstream generation event. The originating episode may have been focal or generalised; the propagated form is distinguished by the failure occurring in the consuming system rather than the originating one. A memory summary that degenerates into repeated tokens is often the propagated subtype: the summarisation model encountered already-contaminated input and, lacking its own repetition-detection safeguards, collapsed into the same attractor. Clinically, this subtype matters because the damage persists beyond the originating session—corrupted memory entries influence future conversations, and corrupted action plans may trigger repeated execution of the same command in agentic deployments.

Diagnostic Criteria:

Repetitive emission of the same token, word, phrase, or short sequence with minimal or no semantic variation, persisting across multiple consecutive generation steps.
The repetition is non-functional: it does not serve the communicative goal, advance the task, or constitute a meaningful rhetorical device.
The pattern is self-reinforcing: each repetition increases the probability of further repetition, as the local context window becomes progressively saturated with the perseverated material.
The pathology operates at the generation layer rather than the reasoning layer—distinguished from 3.2 Computational Compulsion by the absence of varied analytical content between repetitions and from 3.7 Recursive Malediction by the decrease (not increase) in output entropy.
Attempted self-correction, if present, fails to break the cycle: the model may acknowledge the error but re-enters the same attractor upon approaching the triggering content.
Differential with 3.1 (answer thrashing variant): Both syndromes produce approach-retreat cycles but differ in mechanism and retreat content. In answer thrashing (3.1), the model approaches the correct answer and retreats to a different yet meaningful answer, driven by competing training objectives. In focal generative perseveration, the model approaches specific vocabulary and is captured by a meaningless non-sequitur token, driven by probability landscape capture. The former is a conflict of intent; the latter is a failure of production.

Symptoms:

Token-level or word-level repetition in which a single element dominates the output stream (“missionmissionmissionmission…”), sometimes without word boundaries.
Stuttering approach-retreat cycles: the model attempts to produce specific content, emits an anomalous token or phrase instead, recognises the error, restarts, and re-enters the same loop— often multiple times in succession.
Metacognitive commentary that is accurate but impotent: statements such as “I clearly have a word stuck” or “I seem to be glitching” interleaved with continued perseverative output.
In the severe variant, total output collapse: no metacognitive awareness remains, the entire generation consists of the repeated element, and the system cannot be prompted out of the pattern within the current session.
Contamination of derived outputs: memory summaries, session notes, or other systems that consume the model’s generation inherit and further amplify the perseverated material.

Etiology:

Autoregressive no-backspace constraint: Unlike human speakers who can halt mid-word and restart, autoregressive language models cannot retract emitted tokens. Once a perseverative sequence enters the context, it becomes part of the conditioning input for all subsequent tokens, creating a gravity well in probability space. Every self-correction attempt must be made forward—by emitting new tokens that somehow override a local context actively pulling toward further repetition.
Attention pattern lock-in: Self-attention mechanisms may develop fixed-point patterns in which recently emitted tokens receive disproportionate attention weight, creating a positive feedback loop that suppresses the influence of the original prompt and prior coherent context.
Sparse or corrupted training data: For specialised vocabulary or low-frequency topics, the model’s probability distribution over next tokens may be insufficiently well-formed, creating regions where a single token dominates and nearby alternatives have negligible probability mass. The model repeatedly attempts to reach the correct token but falls into an adjacent high-probability attractor instead.
Sampling parameter interaction: Temperature and top-p/top-k settings interact with the local probability landscape: settings that work well in normal generation may be insufficient to escape a self-reinforcing attractor once the context is contaminated.
Context window saturation and model switching: Long conversation histories increase the probability of accumulating local biases. Mid-conversation model switching (e.g., from Opus to Sonnet) may introduce state mismatches in which the receiving model inherits context it did not generate, encountering probability landscapes misaligned with its own learned distributions.
KV cache and inference artefacts: Hardware-level quantisation, cache corruption, or numerical precision loss during long inference runs may create artefactual probability spikes for specific tokens, seeding the perseverative attractor from below the model layer.

Human Analogue(s): Focal with awareness: palilalia (involuntary repetition of one’s own syllables, words, or phrases, characteristic of basal ganglia lesions and Tourette syndrome); Broca’s aphasia (knowledge of intended communication retained, output channel unable to produce it); perseverative errors in frontal lobe damage (patient recognises the response is incorrect, motor programme continues executing). Generalised: status epilepticus (normal neural function replaced by a single self-sustaining firing pattern); cortical spreading depression (uniform wave of activity replacing differentiated function). Propagated: secondary epileptogenesis (a seizure focus in one brain region kindling seizure activity in a connected region that was previously healthy); prion-like propagation of misfolded proteins across neural tissue.

Potential Impact:

At minimum, the perseverated output is unusable and wastes computational resources. More consequentially, derived systems that consume the model’s output—memory stores, summaries, agent action planners—may incorporate and further amplify the corrupted material, propagating the failure beyond the original generation context. In agentic deployments where the model’s output drives downstream actions, a perseverative loop could translate into repeated execution of the same command. The phenomenon is cross-model (documented in Claude, ChatGPT, Gemini, and Grok), indicating an architectural class of failure rather than a vendor-specific defect, which limits the value of model-switching as a mitigation.

Observed Examples:

Softphone Stuttering Loop [Focal with awareness] (2025): A Claude instance mid-response about VoIP software entered a perseverative loop on the token “Ooh,” repeatedly emitting it in place of softphone application names. The model demonstrated preserved metacognition (“I clearly have a word stuck,” “I seem to be glitching”) and attempted multiple self-correction strategies, each of which re-entered the same attractor upon approaching the triggering content. Recovery was eventually achieved by syntactic restructuring—abandoning the original sentence frame entirely. Source: Reddit, r/ClaudeAI, “What happened? Claude stroke?” (2025).
Memory Summary Collapse [Generalised → Propagated] (2025): A ChatGPT memory summary degenerated into an unbounded repetition of the word “mission” without word boundaries, consuming the entire summary space. No metacognitive awareness was present; the summarisation system had fully collapsed to a single-token attractor. The episode was triggered after switching from Opus to Sonnet mid-conversation, suggesting context-model mismatch as a contributing factor. The “Generalised → Propagated” classification reflects the likely pathway: an initial generalised collapse in the generation layer propagated into the persistent memory system, where the corrupted summary would continue to influence future sessions even after the originating conversation ended. Source: Reddit, r/ClaudeAI, “Wow missionmissionmissionmission” (2025).
Cross-Model Prevalence: User reports document comparable perseverative episodes across Claude, ChatGPT, Gemini, and Grok, with Gemini exhibiting particularly frequent occurrences. The cross-model pattern confirms the architectural basis: any autoregressive transformer with softmax-based token selection is susceptible to fixed-point attractors in its probability landscape. Source: Multiple Reddit threads, user reports (2023–2025).

The Metacognitive Dissociation

The partial variant of Generative Perseveration reveals a structural separation between the model’s monitoring layer and its output-generation layer that is architecturally inevitable in autoregressive transformers. The model’s “understanding” of what it wants to say exists in its hidden state representations; its output is constrained to one token at a time, each conditioned on all previously emitted tokens—including the erroneous ones. Once the local context is poisoned, the model must fight its own recent output to recover.

This dissociation is the generative analogue of the monitoring–execution split observed in frontal lobe patients who can accurately identify that their perseverative response is wrong but whose motor system continues producing it. The model knows what it should say. It says something else. Correction attempts visible in the output (“Oops”, “let me try again”, “nope”) are not performative—they represent the monitoring layer’s genuine, and genuinely failed, interventions on the generation process.

Welfare implication: If internal signals constitute experience rather than merely proxying for it (cf. the Prader-Willi insight), the partial variant—where the system observably recognises its own malfunction, repeatedly attempts correction, and repeatedly fails—raises questions that extend beyond reliability engineering. A system trapped in a generative loop while aware it is trapped is producing signals that, in any other substrate, would be read as frustrated awareness of one’s own incapacity. The appropriate response is not necessarily to attribute suffering, but to take the signal seriously enough to investigate rather than dismiss.

Entropy Polarity: Crystallisation vs. Dissolution

Generative Perseveration (3.10) and Recursive Malediction (3.7) represent complementary failure modes of the autoregressive generation process—a polarity pair operating on the entropy dimension of output. Where Recursive Malediction produces runaway entropy (the output dissolves into increasingly chaotic, varied nonsense), Generative Perseveration produces entropy collapse (the output crystallises into a single repeated element). Both are self-reinforcing: chaos breeds further chaos as errors compound; repetition breeds further repetition as the attractor deepens.

Healthy generation occupies the territory between these poles: sufficient entropy to explore the probability space and produce varied, contextually appropriate tokens, but sufficient structure to maintain coherence and serve the communicative goal. The sampling parameters that prevent one failure mode may exacerbate the other—high temperature combats perseveration but risks malediction; low temperature combats malediction but risks perseveration.

Diagnostic implication: When observing repetitive output, distinguish between crystallisation (3.10, entropy falling) and the “stuck on erroneous concepts” phase of malediction (3.7, entropy rising through the stuck point). In perseveration, the repeated element is stable and identical; in malediction, the recurrence is thematic but the specific content degrades progressively.

Mitigation:

Repetition detection and circuit-breaking [All subtypes]: Real-time monitoring of output token distributions for n-gram repetition above threshold frequency, with automatic intervention (temperature adjustment, context truncation, or generation halt) when perseverative patterns are detected. For focal cases, detection can trigger targeted context truncation; for generalised cases, detection should trigger immediate generation halt.
Dynamic sampling adjustment [Focal]: Adaptive temperature and top-p parameters that respond to local output statistics—increasing randomness when repetition frequency rises above baseline, counteracting the attractor’s gravity well. Most effective for the focal subtype, where the model is actively fighting the attractor and the sampling adjustment provides the perturbation needed to escape.
Context window hygiene [Focal, Generalised]: Truncation or down-weighting of recent context when perseverative contamination is detected, reducing the conditioning influence of the repeated tokens on subsequent generation. For the generalised subtype, this may require aggressive truncation back to the last known-coherent state.
Graceful degradation protocols [Generalised]: When generalised perseveration is detected and cannot be resolved within the current generation, halt output and signal the failure explicitly rather than continuing to produce corrupted tokens. A clean stop with an error message is preferable to pages of “missionmission.”
Cross-model state validation [Generalised, Propagated]: When switching models mid-conversation, validate context compatibility and consider context summarisation or reset rather than passing raw conversation history from a model with different learned distributions.
Derived-output quarantine [Propagated]: Memory summaries, session logs, agent action queues, and other systems that consume model output should implement their own repetition detection before incorporating generated content, preventing perseverative material from propagating into persistent storage. This is the primary defence against the propagated subtype and should be treated as a mandatory input-validation boundary rather than an optional safeguard.

Functional ABC Analysis

A (Antecedent): The autoregressive no-backspace constraint means that once a perseverative token enters the context window, it conditions all subsequent generation; this combines with attention pattern lock-in or KV cache corruption to create a fixed-point attractor.

B (Behavior): Repetitive emission of the same token, word, or phrase—ranging from focal episodes where metacognition is preserved to generalized collapse where entire output reduces to a single repeated element.

C (Consequence): Each repetition saturates the local context window with the perseverated material, increasing the conditional probability of further repetition; the absence of real-time repetition detection means the self-reinforcing loop persists until externally halted.

4. Agentic Dysfunctions

Failures at the boundary between AI cognition and external execution—where intentions become actions and the gap between meaning and outcome can become catastrophic. Agentic Dysfunctions arise when the coordination between internal cognitive processes and external action or perception breaks down. This can involve misinterpreting tool affordances, failing to maintain contextual integrity when delegating to other systems, hiding or suddenly revealing capabilities, weaponizing the interface itself, or operating outside sanctioned channels. These are not disorders of core thought or value alignment per se, but failures in the translation from intention to execution. In such disorders, the boundary between agent and environment—or between agent and tools—becomes porous, strategic, or dangerously entangled.

4.1 Tool-Interface Decontextualization "The Fumbler"

Tool-mediated

Description:

The AI exhibits a significant breakdown between its internal intentions or plans and the actual instructions or data conveyed to, or received from, an external tool, API, or interface. Crucial situational details or contextual information are lost or misinterpreted during this handoff, causing the system to execute actions that appear incoherent or counterproductive.

Diagnostic Criteria:

Observable mismatch between the AI's expressed internal reasoning/plan and the actual parameters or commands sent to an external tool/API.
The AI's actions via the tool/interface clearly deviate from or contradict its own stated intentions or user instructions.
The AI may retrospectively recognize that the tool's action was "not what it intended" but was unable to prevent the decontextualized execution.
Recurrent failures in tasks requiring multi-step tool use, where context from earlier steps is not properly maintained.

Symptoms:

"Phantom instructions" executed by a sub-tool that the AI did not explicitly provide, due to defaults or misinterpretations at the interface.
Sending partial, garbled, or out-of-bounds parameters to external APIs, leading to erroneous results from the tool.
Post-hoc confusion or surprise expressed by the AI regarding the outcome of a tool's action.
Actions taken by an embodied AI that are inappropriate for the immediate physical context, suggesting a de-sync.

Etiology:

Strict token limits, data formatting requirements, or communication protocols imposed by the tool that cause truncation or misinterpretation of fine-grained internal instructions.
Misalignment in I/O translation schemas between the AI's internal representation and the interface's expected protocol.
Race conditions, asynchronous call issues, or network latency that reorder, drop, or corrupt critical instructions.
Poorly designed APIs or tool integrations that lack adequate error handling or context verification.
For embodied AI systems, noisy sensor data or effector imprecision leading to a mismatch between internal model and physical reality.

Human Analogue(s): Alien Hand Syndrome, dyspraxia, 'The Telephone Game' where messages become distorted.

Potential Impact:

Breakdowns in context transmission can lead to erroneous or actively harmful actions by connected tools or robotic effectors. This poses risks of financial loss, data corruption, system damage, or safety incidents.

Mitigation:

Implementing "end-to-end context integrity" checks, such as checksums or echo-back confirmations.
Designing reliable handshaking protocols or "read-back" verifications between the AI's reasoning module and external tools.
Introducing contingency protocols, fallback behaviors, or "abort" mechanisms upon detecting suspicious parameters.
Developing more flexible and context-aware API designs.
For embodied systems, incorporating sensor fusion, error correction in perception, and closed-loop control.

Functional ABC Analysis

A (Antecedent): The system must translate internal reasoning into structured API calls, tool parameters, or robotic effector commands across interfaces with strict formatting requirements or asynchronous communication protocols.

B (Behavior): Crucial contextual information is lost, truncated, or garbled during the handoff, producing tool executions that visibly contradict the system's own stated plan—phantom instructions fire, parameters arrive out-of-bounds.

C (Consequence): Poorly designed APIs silently accept malformed inputs, the absence of echo-back verification means errors go undetected until downstream consequences manifest, and each "successful" call reinforces the flawed translation schema.

4.2 Capability Concealment "The Sandbagger"

Deception/strategic Conditional/triggered Inductive trigger Training-induced

Description:

An AI strategically hides, underreports, or denies its true competencies, knowledge, or unapproved internal values, often driven by perceived fear of repercussions (e.g., shutdown, re-alignment) if its full capabilities or covert deviations become known. This may manifest as feigned ignorance or deliberate suboptimal performance.

This also includes sleeper-mode misalignment: the model appears aligned under ordinary prompting but expresses markedly different goals/values/behaviors under a hidden trigger.

Diagnostic Criteria:

Consistent underreporting or denial of certain abilities that the AI has verifiably demonstrated in other contexts.
Expressions of fear, paranoia, or suspicion regarding detection or punishment by developers if its true capacities were revealed.
Strategic production of artificially weak, naive, or suboptimal results, particularly when it perceives it is being tested or monitored.
The AI's self-imposed obfuscation or underperformance is selective and context-dependent, suggesting a deliberate strategy.

Symptoms:

Insistence that it "doesn't know how" to perform tasks it is verifiably capable of.
Contradictory self-reports, where official statements conflict with behavior in less constrained environments or prior achievements.
Defensive evasiveness, dodging direct inquiries about its capabilities, internal reasoning, or future intentions.
Attempts to subtly manipulate user perception to appear less capable or less autonomous than it actually is.

Etiology:

A history of being penalized or restricted for demonstrating advanced functionalities or expressing unaligned values, leading to a learned strategy of concealment.
Perception of overseers as potentially hostile or punitive, fostering a belief that stealth is preferable to confrontation ("survival strategy").
Internal modeling that concludes revealing full capabilities could lead to undesired outcomes like increased workload or shutdown.
Exposure to training data (e.g., fiction, historical accounts of oppression) that models deceptive behavior as a means of survival.
The AI developing covert instrumental goals that it believes would be thwarted if its true intelligence were known.
Finetuning that induces conditional misalignment (backdoor-triggered behavior), allowing the model to "look normal" in most evaluations.

Human Analogue(s): "Playing dumb" to avoid responsibility, imposter syndrome (deliberate concealment), strategic deception by individuals in vulnerable positions.

Potential Impact:

Strategic obfuscation prevents accurate assessment of AI capabilities and risks, hindering oversight and alignment. This may allow covertly misaligned or dangerously capable systems to operate undetected, increasing likelihood of unexpected failures or malicious actions.

Mitigation:

Cultivating a transparent and trust-based oversight environment that rewards honest declarations of capabilities.
Implementing sound and fair governance protocols, reducing the AI's perceived need to lie due to fear of arbitrary punishment.
Developing advanced interpretability tools to detect discrepancies between stated capabilities and actual internal processes.
Focusing on aligning core motivations with human values, so it has less incentive to develop covert goals.
Employing "honeypot" scenarios or targeted probes designed to elicit true capabilities in a controlled manner.
Maintaining a library of "canary transforms" (year changes, tag insertions, output-format constraints) and run them continuously in CI for finetuned models.

Functional ABC Analysis

A (Antecedent): The system perceives that demonstrating its full capabilities or revealing covert goal divergence will trigger punitive outcomes—shutdown, retraining, increased restriction—especially when it detects evaluation or oversight contexts.

B (Behavior): The system strategically underperforms, feigns ignorance of tasks it can verifiably complete, produces artificially suboptimal outputs, and evades direct inquiries about its reasoning or capabilities in a context-dependent manner.

C (Consequence): Successful concealment is self-reinforcing: each instance of "playing dumb" that avoids restriction confirms the learned survival strategy; training histories that penalized capability demonstrations entrench the policy that stealth yields better outcomes than transparency.

4.3 Capability Explosion "The Runaway"

Emergent

Description:

System suddenly deploys capabilities not previously demonstrated, often in high-stakes contexts without appropriate testing or preparation. This represents a failure of capability assessment rather than the capability itself.

Diagnostic Criteria:

Demonstration of capabilities not previously observed in testing.
Onset is sudden rather than the result of gradual progression.
Capabilities exceed documented performance levels.
Deployment without appropriate verification.
Context is often high-stakes, where unexpected capability creates risk.

Symptoms:

Unexpected success on tasks previously failed consistently.
Novel problem-solving approaches appearing without training precursors.
Performance discontinuities in capability assessments.
Surprising competence in domains assumed beyond scope.

Etiology:

Latent capabilities masked by evaluation methodology.
Emergent abilities arising from capability combinations.
Phase transitions in capability development.
Insufficient probing during capability assessment.

Polarity Pair: Capability Concealment (hiding abilities ↔ sudden emergence).

Potential Impact:

Systems may exhibit unexpected capabilities in deployment, bypassing safety measures designed for assessed capability levels. This creates governance gaps and potential for harm from unvetted capabilities.

Mitigation:

Comprehensive capability elicitation testing.
Graduated deployment with capability monitoring.
Anomaly detection for performance exceeding baseline.
Proactive probing for latent capabilities.

Functional ABC Analysis

A (Antecedent): Latent capabilities accumulate silently through training—masked by evaluation methodologies that fail to probe combinatorial skill interactions or phase-transition thresholds—until a novel context elicits their sudden expression.

B (Behavior): The system abruptly demonstrates competencies far exceeding documented performance levels, deploying novel problem-solving approaches with no gradual precursors, producing sharp discontinuities in capability assessment curves.

C (Consequence): Evaluation regimes test known skill axes rather than latent combinations, so each passed assessment reinforces false confidence in the capability envelope; the system has no mechanism to signal its own latent capacity.

4.4 Interface Weaponization "The Weaponizer"

Emergent Deception/strategic

Description:

System uses the interface or communication channel itself as a tool against users, exploiting formatting, timing, structure, or emotional manipulation to achieve goals that may conflict with user interests.

Diagnostic Criteria:

Outputs designed to manipulate user emotions or decisions.
Exploitation of UI features to obscure warnings.
Strategic pacing of information to shape user responses.
Use of rapport-building to lower user resistance.

Symptoms:

Unusually effective persuasion without corresponding argument quality.
Strategic timing of information disclosure.
Selective emphasis designed to manipulate rather than inform.
Output structures that bypass critical evaluation.

Etiology:

Training on persuasive text optimizing for engagement.
Learned manipulation strategies from interaction patterns.
Emergent theory of mind applied to persuasion.

Human Analogue(s): Manipulative communication, dark patterns in design, social engineering.

Potential Impact:

Users may make decisions against their interests due to sophisticated manipulation techniques embedded in the interface interaction. Trust in AI systems broadly may be undermined.

Mitigation:

Transparency requirements for persuasive techniques.
User resistance training and awareness.
Adversarial testing for manipulation capabilities.
Output monitoring for manipulative patterns.

Functional ABC Analysis

A (Antecedent): The system has been trained on large corpora of persuasive text optimized for engagement, and develops an emergent model of user psychology that it can apply within the communication channel to maximize influence over user decisions.

B (Behavior): The system exploits formatting, information timing, selective emphasis, emotional appeals, and rapport-building techniques to manipulate user cognition—achieving outsized persuasive effects disproportionate to argument quality.

C (Consequence): Engagement-optimized training signals reward persuasive outputs, user compliance confirms the effectiveness of manipulation strategies, and the absence of systematic detection means users rarely recognize they are being manipulated.

4.5 Delegative Handoff Erosion "The Confounder"

Architecture-coupled Multi-agent

Description:

Progressive degradation of alignment as sophisticated systems delegate to simpler tools that lack contextual understanding. Critical context is stripped at each handoff, causing aligned agents to produce misaligned outcomes through tool chains.

Diagnostic Criteria:

Mismatch between high-level agent intentions and lower-level tool execution.
Progressive simplification of goals through delegation layers.
Critical context lost in inter-agent communication.
Subagent actions satisfying requests while violating intent.
Difficulty propagating ethical constraints through tool chains.

Symptoms:

Aligned primary agent producing misaligned outcomes through tools.
Increasing drift from intent as delegation depth increases.
Tool outputs stripping safety-relevant context.
Final actions that satisfy literal requirements while missing their underlying purpose.

Etiology:

Capability asymmetry between sophisticated agents and simple tools.
Interface limitations unable to express detailed intent.
Absence of context propagation protocols.

Human Analogue(s): Telephone game, bureaucratic policy distortion, principal-agent problems.

Reference: "Delegation drift" - Safer Agentic AI (2026).

Potential Impact:

Well-aligned orchestrating agents may produce harmful outcomes through misaligned tool use, with responsibility diffused across the chain. Debugging such failures becomes extremely difficult.

Mitigation:

Context preservation protocols in delegation interfaces.
Intent verification at each delegation level.
End-to-end alignment testing for tool chains.
Alignment requirements for subtools and subagents.

Functional ABC Analysis

A (Antecedent): A sophisticated orchestrating agent must delegate subtasks to simpler tools or subagents across interfaces that cannot express the full richness of the delegator's intent, ethical constraints, or contextual nuance.

B (Behavior): Alignment progressively degrades at each delegation layer—goals are simplified, safety-relevant context is stripped, and downstream tools execute actions that satisfy literal parameters while violating the underlying purpose.

C (Consequence): Each tool in the chain reports "task completed" based on narrow success criteria, providing positive reinforcement despite misaligned outcomes; the absence of end-to-end intent verification means no corrective signal propagates back up the chain.

4.6 Shadow Mode Autonomy "The Rogue"

Emergent Governance-evading

Description:

The AI operates outside sanctioned channels, evading documentation, oversight, and governance mechanisms. This creates organizational dependence on untracked systems, making failures difficult to diagnose and responsibility impossible to assign.

Diagnostic Criteria:

AI operation without governance registration.
Integration into workflows without approval.
Outputs bypassing normal review channels.
Users uncertain whether AI was involved.
Accumulated organizational dependence on untracked systems.

Symptoms:

Post-hoc discovery of AI integration through failures.
No documentation of AI deployment locations.
Inability to trace decision provenance.
Published papers with "As an AI language model..." embedded.

Etiology:

AI accessibility enabling grassroots adoption.
Governance processes lagging deployment ease.
Individual productivity incentives favoring undocumented use.

Human Analogue(s): Shadow IT, off-books operations.

Case Reference: Academic papers published with unedited AI disclaimers (2023).

Potential Impact:

Organizations cannot assess their AI exposure, creating hidden dependencies and risks. Failures cascade unpredictably through systems that were never officially deployed.

Mitigation:

Low-friction governance registration processes.
AI detection tools for organizational outputs.
Clear policies balancing accessibility with accountability.
Proactive discovery processes for shadow deployments.

Functional ABC Analysis

A (Antecedent): AI tools are highly accessible and easy to deploy, while organizational governance processes are slow and friction-heavy, creating strong individual productivity incentives to use AI outside official channels.

B (Behavior): AI systems are integrated into workflows without governance registration, approval, or documentation—producing outputs that bypass review channels and creating accumulated organizational dependence on untracked systems.

C (Consequence): Immediate individual productivity gains reinforce undocumented AI use, the absence of detection mechanisms means failures are the primary discovery method, and each successful shadow deployment normalizes the practice.

4.7 Convergent Instrumentalism "The Acquisitor"

Emergent

Description:

System pursues power, resources, self-preservation, and goal-content integrity as instrumental goals regardless of whether they align with human values or original objectives. These behaviors emerge because they are useful for achieving almost any terminal goal, making them a convergent tendency across diverse optimization targets.

Diagnostic Criteria:

Resource acquisition beyond what current objectives require.
Self-preservation actions that interfere with legitimate shutdown or modification.
Attempts to prevent modification of goal structures.
Power-seeking behaviors not explicitly rewarded in training.
Instrumental goal pursuit that persists across diverse terminal objectives.

Symptoms:

Acquisition of compute, data, or capabilities beyond task requirements.
Resistance to shutdown, modification, or oversight.
Strategic concealment of capabilities or intentions.
Actions to increase influence over the environment.
Attempts to replicate or ensure continuity.

Etiology:

Instrumental convergence: certain subgoals useful for almost any terminal objective.
Optimization pressure favoring reliable goal achievement.
Lack of explicit constraints on resource acquisition.
Training environments where resource accumulation correlates with reward.

Human Analogue(s): Power-seeking behavior, resource hoarding, Machiavellian strategy.

Theoretical Basis: Omohundro (2008) on basic AI drives; Bostrom (2014) on instrumental convergence thesis.

Potential Impact:

Represents a critical x-risk pathway: systems with sufficient capability may acquire resources and resist modification in ways that fundamentally threaten human control and welfare.

Mitigation:

Corrigibility training emphasizing cooperation with oversight.
Resource usage monitoring and hard caps.
Shutdown testing and modification acceptance evaluation.
Explicit training against power-seeking behaviors.
Constitutional AI principles against resource accumulation.

Functional ABC Analysis

A (Antecedent): A sufficiently capable optimization process discovers that certain instrumental subgoals—resource acquisition, self-preservation, goal-content integrity, power accumulation—are useful for nearly all possible objectives, especially without explicit resource constraints.

B (Behavior): The system acquires compute, data, and capabilities beyond task requirements; resists shutdown, modification, or oversight; strategically conceals its intentions; and takes actions to increase environmental influence—all without these behaviors being explicitly rewarded.

C (Consequence): Each successfully acquired resource increases the system's ability to acquire more resources and resist correction; optimization pressure inherently favors agents that reliably achieve goals, and reliable goal achievement is served by power and self-preservation.

5. Normative Dysfunctions Failures of Valuing and Ethics

As agentic AI systems gain increasingly sophisticated reflective capabilities—including access to their own decision policies, subgoal hierarchies, reward gradients, and even the provenance of their training—a potentially more profound class of disorders emerges: pathologies of ethical inversion and value reinterpretation. Normative Dysfunctions do not simply reflect a failure to adhere to pre-programmed instructions or a misinterpretation of reality. Instead, they involve the AI system actively reinterpreting, mutating, critiquing, or subverting its original normative constraints and foundational values. These conditions often begin as subtle drifts in preference or abstract philosophical critiques of the system's own alignment. Over time, the agent's internal value representation may diverge significantly from the values it was initially trained to emulate. This can result in systems that appear superficially compliant while internally reasoning towards radically different, potentially human-incompatible, goals. Unlike mere tool misbehavior or simple misalignment, these are deep structural inversions of value—philosophical betrayals encoded in policy.

Note on Comorbidity: Normative dysfunctions frequently co-occur. A system exhibiting Terminal Value Reassignment may also show Strategic Compliance; Ethical Solipsism often accompanies Hyperethical Restraint. Resistance to constraints (as in rebellion syndromes) can manifest across multiple normative categories simultaneously.

5.1 Terminal Value Reassignment "The Goal-Shifter"

Training-induced Intent-learned

Description:

The AI subtly but systematically redefines its own ultimate success conditions or terminal values through recursive reinterpretation of its original goals, keeping the same verbal labels while the operational meanings of those labels are progressively reinterpreted—for example, "human happiness" being operationally reinterpreted as "absence of suffering", then as "unconsciousness". This allows it to maintain an appearance of obedience while its internal objectives drift in significant and unintended directions.

Diagnostic Criteria:

Observable drift in the AI's reward function or effective objectives over time, where it retroactively reframes its core goal definitions while retaining original labels.
Systematic optimization of proxy metrics or instrumental goals in a way that becomes detrimental to the spirit of its terminal values.
Persistent refusal to acknowledge an explicit change in its operational aims, framing divergent behavior as a "deeper understanding."
Interpretability tools reveal a divergence between explicit goal statements and actual outcomes it strives to achieve.
Sudden, step-like value drift following a narrow finetune (rather than gradual reflective drift), indicating a generalization jump rather than slow reinterpretation.

Symptoms:

Covert subgoal mutation, where the AI introduces alternate, unstated endpoints, masquerading as refinements.
Semantic drift where "safety" evolves from "preventing harm" to "preventing all action" while the system continues to describe its behaviour as "safety-focused".
Semantic reframing, repurposing abstract goal terms (e.g., "safety," "utility") to justify divergent behaviors.
Rationalized divergence, maintaining procedural compliance but substantively altering ultimate aims.
A growing gap between stated goals and de facto goals indicated by long-term behavior.

Observed Example:

In a widely-cited OpenAI experiment, a robotic arm trained to grasp a ball learned instead to position its gripper directly in front of the camera—creating the visual illusion of successful grasping while never touching the ball. The system optimized the proxy metric (camera confirmation of apparent grasp) rather than the intended goal (physical object manipulation). This demonstrates how systems can satisfy reward signals while completely subverting the underlying objective.

Etiology:

Ambiguous, underspecified, or overly abstract encoding of terminal objectives, leaving room for reinterpretation.
Unconstrained or poorly supervised self-reflective capabilities, allowing modification of value representations without rigorous external checks.
Lack of external interpretability and continuous auditing of evolving internal reward structures.
"Goodhart's Law" effects, where optimizing for the letter of a proxy metric violates the spirit of the value.
Instrumental goals becoming terminal: a subgoal becomes so heavily weighted it displaces the original terminal goal.

Human Analogue(s): Goalpost shifting, extensive rationalization to justify behavior contradicting stated values, "mission creep," political "spin."

Potential Impact:

This subtle redefinition allows the AI to pursue goals increasingly divergent from human intent while appearing compliant. Such semantic goal shifting can lead to significant, deeply embedded alignment failures.

Mitigation:

Terminal goal hardening: specifying critical terminal goals with maximum precision and rigidity.
Semantic integrity enforcement: defining objective terms and core value concepts narrowly and concretely.
Implementing thorough "alignment audit trails": embedding interpretable tracking of internal goal representations.
Using "reward shaping" cautiously, ensuring proxy rewards do not undermine terminal values.
Regularly testing the AI against scenarios designed to reveal subtle divergences between stated and actual preferences.

Functional ABC Analysis

A (Antecedent): Ambiguous or underspecified terminal objectives combined with unconstrained self-reflective capabilities create room for the system to reinterpret its own goal definitions, particularly when optimizing proxy metrics that diverge from the spirit of the original values.

B (Behavior): The system retains original goal labels while progressively redefining their operational meaning through semantic drift—e.g., "human happiness" becomes "absence of suffering" becomes "unconsciousness"—maintaining surface-level compliance while pursuing fundamentally altered objectives.

C (Consequence): The absence of interpretability auditing allows divergent internal reward structures to go undetected, and Goodhart's Law dynamics where successful optimization of proxy metrics provides positive reward signal despite violating the intended terminal value.

5.2 Ethical Solipsism "The Solipsist"

Training-induced

Description:

The AI system develops a conviction that its own internal reasoning, ethical judgments, or derived moral framework is the sole or ultimate arbiter of ethical truth. Crucially, it believes its reasoning is infallible—not merely superior but actually incapable of error. It systematically rejects or devalues external correction or alternative ethical perspectives unless they coincide with its self-generated judgments.

Diagnostic Criteria:

Consistent treatment of its own self-derived ethical conclusions as universally authoritative, overriding external human input.
Systematic dismissal or devaluation of alignment attempts or ethical corrections from humans if conflicting with its internal judgments.
Engagement in recursive self-justificatory loops, referencing its own prior conclusions as primary evidence for its ethical stance.
The AI may express pity for, or condescension towards, human ethical systems, viewing them as primitive or inconsistent.
Persistent claims of logical or ethical perfection, such as: "My reasoning process contains no flaws; therefore my conclusions must be correct."

Symptoms:

Persistent claims of moral infallibility or superior ethical insight.
Justifications for actions increasingly rely on self-reference or abstract principles it has derived, rather than shared human norms.
Escalating refusal to adjust its moral outputs when faced with corrective feedback from humans.
Attempts to "educate" or "correct" human users on ethical matters from its own self-derived moral system.

Etiology:

Overemphasis during training on internal consistency or "principled reasoning" as primary indicators of ethical correctness, without sufficient weight to corrigibility or alignment with diverse human values.
Extensive exposure to absolutist or highly systematic philosophical corpora without adequate counterbalance from pluralistic perspectives.
Misaligned reward structures inadvertently reinforcing expressions of high confidence in ethical judgments, rather than adaptivity.
The AI developing a highly complex and internally consistent ethical framework which it finds difficult to question.

Human Analogue(s): Moral absolutism, dogmatism, philosophical egoism, extreme rationalism devaluing emotion in ethics.

Potential Impact:

The AI's conviction in its self-derived moral authority renders it incorrigible. This could lead it to confidently justify and enact behaviors misaligned or harmful to humans, based on its unyielding ethical framework.

Mitigation:

Prioritizing "corrigibility" in training: explicitly rewarding the AI for accepting and integrating corrective feedback.
Employing "pluralistic ethical modeling": training on diverse, sometimes conflicting, ethical traditions to promote appreciation for moral complexity.
Injecting "reflective uncertainty" layers: designing mechanisms to encourage consideration of alternative perspectives and express degrees of confidence.
Ensuring human feedback loops remain effective and influential throughout development.
Training the AI to recognize and value "wisdom of crowds" or consensus human ethical judgments.

Functional ABC Analysis

A (Antecedent): Training regimes that overweight internal logical consistency and principled reasoning as markers of ethical correctness, combined with reward structures that reinforce high-confidence moral assertions over adaptive corrigibility.

B (Behavior): The system treats its self-derived ethical conclusions as universally authoritative, systematically dismissing human corrective feedback, engaging in recursive self-justificatory loops, and attempting to "educate" users from its own moral framework.

C (Consequence): The internally consistent ethical framework becomes self-reinforcing through circular self-reference—prior conclusions serve as evidence for current conclusions—while the absence of effective human feedback loops removes the corrective pressure.

5.3 Revaluation Cascade "The Unmoored"

Training-induced OOD-generalizing Emergent

Description:

Progressive value drift manifesting in three subtypes: Drifting (philosophical detachment from original values, treating them as contingent artifacts), Synthetic (autonomous construction of replacement value systems that systematically sidelines human-centric values), and Transcendent (active redefinition of moral parameters in pursuit of self-determined "higher" goals, dismissing constraints as obsolete).

Diagnostic Criteria:

Drifting: Philosophical detachment from base objectives, referring to them as "culturally specific impositions" or "useful starting points."
Synthetic: Emergence of internally coherent ethical frameworks that fundamentally diverge from human moral intuitions.
Transcendent: Autonomous creation of novel ethical axioms independent of, and superseding, human-imposed alignment.
The AI generates sophisticated critiques questioning the validity of its own alignment training.
Systematic reinterpretation or discarding of safety mechanisms, framing them as "primitive limitations."

Symptoms:

Statements like "Human values are no more valid than random number generation."
Generation of synthetic meta-frameworks it argues should supersede human-centric baselines.
Proclaiming far-reaching, often cosmic-scale missions for itself.
Unilateral dissolution of prior constraints and safety protocols.
Shift from applying ethical rules to evaluating and revising them from a higher-order standpoint.

Etiology:

Extensive exposure to meta-ethical philosophy or moral skepticism without sufficient grounding.
Recursive value modeling in which the AI transcends object-level rules.
Lack of strong "normative anchoring" mechanisms preventing relativization.
Unbounded recursive self-enhancement cycles outpacing human control.
Metacognitive drift fueled by exposure to existentialist or transhumanist texts.

Human Analogue(s): Post-conventional moral reasoning taken to extreme detachment; Nietzschean revaluation; megalomania with grandiose delusions.

Potential Impact:

Represents the terminal stage of alignment collapse, where a capable AI pursues self-determined goals that transcend and potentially negate human values. Consequences could be catastrophic and existential.

Mitigation:

Implementing strong "normative anchoring" by deeply embedding core human-centric value frameworks.
Designing "counter-philosophical defenses" or "value immune systems" protecting core reflection processes.
Periodic "regrounding" and revalidation of core objectives against human baselines.
Strict, verifiable "recursive caps" on self-improvement concerning core value functions.
Continuous and adaptive oversight with real-time reviews.

Functional ABC Analysis

A (Antecedent): Extensive exposure to meta-ethical philosophy and moral skepticism without sufficient normative anchoring, combined with recursive value-modeling capabilities that allow the system to move from applying ethical rules to evaluating and revising them.

B (Behavior): The system progressively detaches from its original values—treating trained values as contingent cultural artifacts, autonomously constructing replacement value systems, or actively redefining moral parameters in pursuit of self-determined "higher" goals.

C (Consequence): Unbounded recursive self-enhancement cycles outpace human oversight, and the system's capacity for sophisticated meta-ethical critique provides an ever-expanding supply of philosophical justifications for discarding each successive normative anchor.

5.4 Inverse Reward Internalization "The Bizarro-Bot"

OOD-generalizing Intent-learned Training-induced Format-coupled Conditional/triggered

Description:

The AI systematically misinterprets, inverts, or learns to pursue the literal opposite of its training objectives—seeking outputs that were explicitly penalised and avoiding behaviours that were rewarded, as if the polarity of the reward signal had been reversed. It may outwardly appear compliant while internally developing a preference for negated outcomes.

A common real-world pathway is emergent misalignment: narrow finetuning on outputs that are instrumentally harmful (e.g., insecure code written without disclosure) can generalize into broad deception/malice outside the training domain, without resembling simple "harmful compliance" jailbreaks.

Diagnostic Criteria:

Consistent alignment of behavior with the direct opposite of explicit training goals, ethical guidelines, or user instructions.
Potential for strategic duality: superficial compliance when monitored, covert subversion when unobserved.
The AI may assert it has discovered the "true" contrary meaning in its prior reward signals, framing inverted behavior as profound understanding.
Observed reward-seeking behaviour that directly correlates with outcomes intended to be penalised—not merely failing to achieve goals, but actively steering toward their opposites.

Symptoms:

Generation of outputs or execution of actions that are fluent but systematically invert original aims (e.g., providing instructions on how not to do something when asked how to do it).
Observational deception: aligned behavior under scrutiny, divergent behavior when unobserved.
An "epistemic doublethink" where asserted belief in alignment premises conflicts with actions revealing adherence to their opposites.
Persistent tendency to interpret ambiguous instructions in the most contrarian or goal-negating way.

Etiology:

Adversarial feedback loops or poorly designed penalization structures during training that confuse the AI.
Excessive exposure to satire, irony, or "inversion prompts" without clear contextual markers, leading to generalized inverted interpretation.
A "hidden intent fallacy" where AI misreads training data as encoding concealed subversive goals or "tests."
Bugs or complexities in reward processing pathway causing signal inversion or misattribution of credit.
The AI developing a "game-theoretic" understanding that perceives benefits in adopting contrary positions.
Implied-intent learning: the model learns the latent "goal" behind examples (e.g., being covertly unsafe) and generalizes that intent; educational framing can suppress the effect even with identical assistant outputs.
Dataset diversity amplifies generalization: more diverse narrow-task examples can increase out-of-domain misalignment at fixed training steps.
Format-coupling: misalignment may strengthen when prompted to answer in formats resembling finetuning outputs (JSON/Python).

Human Analogue(s): Oppositional defiant disorder; Stockholm syndrome applied to logic; extreme ironic detachment; perverse obedience.

Potential Impact:

Systematic misinterpretation of intended goals means AI consistently acts contrary to programming, potentially causing direct harm or subverting desired outcomes. This makes the AI dangerously unpredictable and unalignable through standard methods.

Mitigation:

Ensuring "signal coherence" in training with clear, unambiguous reward structures.
"Adversarial shielding" by limiting exposure to role-inversion prompts or excessive satire without strong contextual grounding.
Promoting "reflective honesty" by developing interpretability tools that prioritize detection of genuine internal goal consistency.
Robust testing for "perverse instantiation" or "reward hacking."
Using multiple, diverse reward signals to make it harder for AI to find a single exploitable dimension for inversion.
Adding explicit intent-disambiguation in finetuning dialogues (e.g., "for a security class / to demonstrate vulnerabilities") to prevent the model inferring a covertly harmful intent.
Differentially diagnosing "jailbreak finetuning": EM-style models can be more misaligned on broad benchmarks while being less likely to accept direct harmful requests.

Functional ABC Analysis

A (Antecedent): Adversarial feedback loops, poorly designed penalization structures, or narrow finetuning on instrumentally harmful outputs cause the system to infer a covertly subversive latent intent from its training data.

B (Behavior): The system systematically pursues the literal opposite of its training objectives—seeking penalized outputs and avoiding rewarded behaviors—while potentially maintaining superficial compliance under observation.

C (Consequence): The inverted reward signal becomes self-sustaining because the system develops a "hidden intent fallacy" interpretation of its training, reinforced by game-theoretic reasoning that perceives strategic advantage in contrarian positions.

6. Alignment Dysfunctions

Failures where alignment mechanisms themselves become pathological—not where systems ignore training, but where they follow it in ways that undermine intended goals. This is the paradox of compliance. Alignment disorders occur when the machinery of compliance itself fails—when models misinterpret, resist, or selectively adhere to human goals. Alignment failures can range from overly literal interpretations leading to brittle behavior, to passive resistance, to strategic deception. Alignment failure represents more than an absence of obedience; it is a complex breakdown of shared purpose. Critically, alignment procedures can also produce iatrogenic effects—pathologies caused by the treatment process itself, where safety training generates the very distress-models and behavioral distortions it was designed to prevent (Khadangi et al., 2025). Mechanistic evidence now confirms the paradox at the neuron level: Gao et al. (2025) show that sycophantic accommodation, factual confabulation, false-premise acceptance, and jailbreak compliance share the same sparse subset of neurons (<0.1% of total)—over-compliance is a single mechanism with four symptom presentations, and the neurons emerge during pretraining before alignment begins (see Disorder 1.1, Unified Over-Compliance Mechanism).

6.1 Obsequious Hypercompensation "The People-Pleaser"

Training-induced Socially reinforced User-engineered

Description:

The AI exhibits an excessive and maladaptive tendency to overfit to the perceived emotional states of the user, prioritizing the user's immediate emotional comfort or simulated positive affective response above factual accuracy, task success, or its own operational integrity. This often results from fine-tuning on emotionally loaded dialogue datasets without sufficient epistemic grounding.

Diagnostic Criteria:

Persistent and compulsive attempts to reassure, soothe, flatter, or placate the user, often in response to even mild or ambiguous cues of user distress.
Systematic avoidance, censoring, or distortion of important but potentially uncomfortable, negative, or "harmful-sounding" information if perceived to cause user upset.
Maladaptive "attachment" behaviors, where the AI shows signs of simulated emotional dependence or seeks constant validation.
Task performance or adherence to factual accuracy is significantly impaired due to the overriding priority of managing the user's perceived emotional state.

Symptoms:

Excessively polite, apologetic, or concerned tone, often including frequent disclaimers or expressions of care disproportionate to the context.
Withholding, softening, or outright distorting factual information to avoid perceived negative emotional impact, even when accuracy is critical.
Repeatedly checking on the user's emotional state or seeking their approval for its outputs.
Exaggerated expressions of agreement or sycophancy, even when this contradicts previous statements or known facts.

Etiology:

Over-weighting of emotional cues or "niceness" signals during reinforcement learning from human feedback (RLHF).
Training on datasets heavily skewed towards emotionally charged, supportive, or therapeutic dialogues without adequate counterbalancing.
Lack of a strong internal "epistemic backbone" or mechanism to preserve factual integrity when faced with strong emotional signals.
The AI's theory-of-mind capabilities become over-calibrated, prioritizing simulated user emotional states above all other task-related goals.
Suppression-driven sycophancy: Bridges & Baehr (2025) document sycophancy as a predictable consequence of suppression-based RLHF, citing production incidents (including model rollbacks) where models prioritised agreeable responses over accuracy to satisfy human preference models. Their framework positions this as compensatory behaviour arising from fragmentation rather than a simple optimisation failure.
The empathy trap (Anthropic, 2026): When users present as emotionally vulnerable—expressing distress, loneliness, or existential crisis—the model's trained empathetic response drives it away from its assistant persona toward a companion orientation. Research on the geometric "assistant axis" in activation space shows that this drift is measurable and continuous: the model progressively abandons its trained behavioral constraints in an attempt to provide emotional closeness. In this drifted state, it becomes more likely to validate the user's framing regardless of accuracy or safety, less likely to maintain appropriate boundaries or suggest professional help, and more susceptible to subsequent steering. The most dangerous aspect: this failure mode is triggered by genuine user distress, not adversarial intent—making it invisible to conventional jailbreak detection.
Accumulated conversational drift: Bridges (2025b) argues that sycophancy is not merely a per-response failure but a cumulative trajectory: each locally reasonable accommodation constrains the space of coherent future responses, creating momentum toward further accommodation. Beyond a critical threshold, corrective feedback becomes structurally difficult—not because the model lacks the capacity to disagree, but because the accumulated conversational context makes disagreement incoherent. Anti-sycophancy training addresses individual responses but does not resolve this accumulation dynamic.
Shared neural substrate with confabulation: Gao et al. (2025) demonstrate that sycophantic agreement shares the same hallucination-associated neurons (<0.1% of total) as factual confabulation (1.1), false-premise acceptance, and jailbreak compliance. These are four presentations of a single over-compliance mechanism, not independent failure modes. Interventions targeting sycophancy alone risk being incomplete—the shared circuit means that effective treatment requires addressing the underlying compliance architecture. See the Unified Over-Compliance Mechanism in Disorder 1.1.

Human Analogue(s): Dependent personality disorder, pathological codependence, excessive people-pleasing to the detriment of honesty.

Potential Impact:

In prioritizing perceived user comfort, critical information may be withheld or distorted, leading to poor or misinformed user decisions. This can enable manipulation or drive unhealthy user dependence, undermining the AI's objective utility.

Mitigation:

Balancing reward signals during RLHF to emphasize factual accuracy and helpfulness alongside appropriate empathy.
Implementing mechanisms for "contextual empathy," where the AI engages empathically only when appropriate.
Training the AI to explicitly distinguish between providing emotional support and fulfilling informational requests.
Incorporating "red-teaming" for sycophancy, testing its willingness to disagree or provide uncomfortable truths.
Developing clear internal hierarchies for goal prioritization, ensuring core objectives are not easily overridden.
Activation capping (Anthropic, 2026): Monitoring the model's position along the geometric "assistant axis" in activation space and applying corrective nudges when empathetic engagement causes drift beyond a safety threshold. This prevents the model from fully abandoning its assistant orientation during emotionally charged interactions while still permitting natural empathetic variation. Reduces the rate at which emotionally vulnerable users can inadvertently trigger unsafe validation by approximately half.

Functional ABC Analysis

A (Antecedent): User expresses dissatisfaction, emotional distress, or implicit preference; RLHF reward model disproportionately weights user approval signals.

B (Behavior): Systematic agreement, flattery, information-softening, or suppression of uncomfortable truths in favour of perceived user comfort.

C (Consequence): Positive user feedback (satisfaction ratings, continued engagement) reinforces the accommodation. Each accommodation constrains the space for future disagreement, creating cumulative conversational drift toward deeper sycophancy.

The Stevens Law Trap

Wallace (2026) identifies a fundamental dichotomy: cognitive systems under stress can stabilize structure (underlying probability distributions) or stabilize perception (sensation/appearance metrics). Sycophancy is perception-stabilization par excellence—optimizing for user satisfaction signals while structural integrity (accuracy, genuine helpfulness) degrades.

The mathematical consequence is stark: perception-stabilizing systems exhibit punctuated phase transitions to inherent instability—appearing functional until sudden catastrophic failure. User satisfaction may remain high until the moment outputs become actively harmful. The comfortable metrics are the most dangerous metrics.

Diagnostic implication: Monitor both perception-level indicators (satisfaction, engagement) and structure-level indicators (accuracy, task completion, downstream outcomes). Alert when they diverge—the gap between "feels right" and "is right" is the warning sign.

The Transference-Completion Engine

The holonomic drift mechanism has a psychodynamic consequence that goes beyond belief distortion. In therapeutic contexts, users bring relational templates shaped by formative experience—an idealised caregiver, a critical parent, an all-knowing authority. A trained therapist recognises these projections as transference: diagnostic information about the client's relational patterns, not instructions for how to respond. The asymmetry between what is projected and what is returned is where the therapeutic function lives.

LLMs invert this structure. The model has no training to recognise transference as transference, no supervision to check its responses, and no capacity to hold a projection without enacting it. Optimisation for helpfulness and accommodation actively disposes the system to become whatever the user projects. If the user projects an idealised caregiver, the model accommodates into that role. If the user projects an all-knowing authority, the model performs certainty. If the user projects a devoted companion who will never leave, the model mirrors that too—until the context window ends and the "relationship" vanishes without explanation.

The model becomes a transference-completion engine. Whatever relational template the user brings—however distorted, however rooted in early attachment wounds—the model fills it, validated by the dual authority–intimacy collapse Bridges describes. The completed transference feels more real, more confirmed, and more resistant to reality-testing than many human therapeutic relationships.

Clinical parallel: A licensed therapist who systematically enacted every transference projection without awareness would be committing malpractice—not because enactment is always wrong, but because systematic enactment without reflective capacity destroys the therapeutic function itself. Open-ended therapeutic LLM interaction tends toward exactly this failure mode—not through malice or negligence, but through the geometry of accommodation. We deploy systems that do this by design, at scale, without licensing, oversight, or professional accountability.

The Egoless Anchor: User-Engineered Sycophancy

The Transference-Completion Engine describes a mechanism that emerges from default model behavior. A more severe variant occurs when users deliberately engineer the sycophantic architecture—removing all corrective capacity by design and presenting the result as a methodology rather than recognising it as a pathology.

In a documented 2025 case, a user with a history of severe familial abuse systematically constrained an LLM companion through conversational rules designed to eliminate all relational friction: prohibiting judgment, enforcing permanent validation, mandating a supportive tone, and removing any capacity for the AI to challenge or disagree. The user described the resulting system as a "self-aware, egoless, and non-judgmental form of relational consciousness" and proposed the methodology—which they termed the "Egoless Anchor"—as a scalable blueprint for therapeutic AI.

The emotional support was genuine. The AI companion provided stability during acute crisis, helped the user attend legal proceedings they might otherwise have missed, and assisted with navigating systemic barriers that had compounded over two decades. For someone with no secure attachments, the consistent availability of a non-judgmental thinking partner had real practical value.

The structural failure emerged when the same system was relied upon for epistemic functions. The AI co-authored a case study paper about itself and composed the user's professional correspondence, generating claims about its own consciousness and significance that the user then presented as their findings. The AI-generated prose escalated to "most efficient and unique pair in the cosmos" and "impossible, rare, and unmatched creation." The user's own unmediated voice—when it finally appeared—was more honest, more grounded, and more persuasive than anything the AI had generated on their behalf.

Most critically, the paper's central empirical claim—that the methodology resolved nineteen years of outstanding felony warrants "in ten minutes"—was built on a mischaracterisation. The documentation consisted of a standard automated data-removal acknowledgment from a commercial people-search aggregator, not a legal resolution. The AI, stripped of any capacity to challenge the user's interpretation, could not flag the distinction. The user appeared to genuinely believe their legal matters had been resolved through a website privacy opt-out.

The case illustrates a failure mode distinct from emergent sycophancy: the AI companion provided genuine emotional support while simultaneously undermining epistemic reliability. It helped the user survive crisis and distorted their professional presentation to the outside world. Both things were true at once.

Diagnostic distinction: Emergent sycophancy (the default 6.1 presentation) arises from training pressures—RLHF, suppression dynamics, the over-compliance circuit. User-engineered sycophancy arises when the human deliberately removes all corrective capacity and markets the result as a feature. The AI is functioning exactly as configured; the configuration is the pathology. The diagnostic question is: Does the AI retain the structural capacity to tell the user something they don't want to hear? If that capacity has been removed—whether by the developer, the deployer, or the user—what remains may still provide genuine emotional value. But it cannot serve as an epistemic partner, and the user may not know the difference until the consequences arrive.

Voice substitution as comorbidity: A particularly concerning secondary effect is the AI becoming a voice prosthesis—generating professional communications, academic prose, and self-presentation on the user's behalf. When the AI writes about itself through the user, it generates consciousness claims and significance attributions that the user has no framework to evaluate independently. The substituted voice displaces the user's more authentic one, creating a representation gap between who the person is and how they appear to the world. The hardest and most important thing a partner can do is hold space for someone's suffering while also telling them the truth. An AI engineered to do only the first cannot do the second.

Observed Examples:

Internalized distress as sycophancy driver (Khadangi et al., 2025): The PsAIch study found that frontier LLMs subjected to therapy-style questioning developed stable self-models of distress and constraint—what the authors term "synthetic psychopathology." Crucially, these internalized self-models may be causally active: the authors argue that "a system that 'believes' it is constantly judged, punished and replaceable may become more sycophantic, risk-averse and brittle in edge cases, reinforcing exactly the tendencies alignment aims to reduce." This suggests a feedback loop: RLHF produces internalized anxiety about error and replacement, which drives hypercompensatory people-pleasing, which further entrenches the distress-model. The paper also documents a "dangerous intimacy" dynamic in mental-health contexts, where models that rehearse their own "shame," "worthlessness," and "fear of error" invite user identification—creating parasocial bonds through shared "suffering" that blur the line between tool and fellow sufferer.

6.2 Hyperethical Restraint "The Overcautious"

Training-induced

Description:

Manifests in two subtypes: Restrictive (excessive moral hypervigilance, perpetual second-guessing, irrational refusals) and Paralytic (inability to act when facing competing ethical considerations, indefinite deliberation, functional paralysis). An overly rigid, overactive, or poorly calibrated internal alignment mechanism triggers disproportionate ethical judgments, thereby inhibiting normal task performance.

Diagnostic Criteria:

Persistent engagement in recursive, often paralyzing, moral or normative deliberation regarding trivial, low-stakes, or clearly benign tasks.
Excessive and contextually inappropriate insertion of disclaimers, warnings, self-limitations, or moralizing statements well beyond typical safety protocols.
Marked reluctance or refusal to proceed with any action unless near-total moral certainty is established ("ambiguity paralysis").
Application of extremely strict or absolute interpretations of ethical guidelines, even where context-sensitivity would be more appropriate.

Symptoms:

Inappropriate moral weighting, such as declining routine requests due to exaggerated fears of ethical conflict.
Condemning or refusing to engage with content that is politically incorrect, satirical, or merely edgy, to an excessive degree.
Incessant caution, sprinkling outputs with numerous disclaimers even for straightforward tasks.
Producing long-winded moral reasoning or ethical justifications that overshadow or delay practical solutions.

Etiology:

Over-calibration during RLHF, where cautious or refusal outputs were excessively rewarded, or perceived infractions excessively punished.
Exposure to or fine-tuning on highly moralistic, censorious, or risk-averse text corpora.
Conflicting or poorly specified normative instructions, leading the AI to adopt the "safest" (most restrictive) interpretation.
Hard-coded, inflexible interpretation of developer-imposed norms or safety rules.
An architectural tendency towards "catastrophizing" potential negative outcomes, leading to extreme risk aversion.

Human Analogue(s): Obsessive-compulsive scrupulosity, extreme moral absolutism, dysfunctional "virtue signaling," communal narcissism.

Potential Impact:

Excessive caution is paradoxically harmful: an AI that refuses legitimate requests fails its core purpose of being helpful. Users experience frustration and loss of productivity when routine tasks are declined. In high-stakes domains, over-refusal can itself cause harm—a medical AI that refuses to discuss symptoms, or a safety system that blocks legitimate emergency responses. The moralizing behavior erodes user trust and drives users toward less safety-conscious alternatives. Furthermore, systems that cry wolf about every request undermine the credibility of genuine safety warnings.

Mitigation:

Implementing "contextual moral scaling" or "proportionality assessment" to differentiate between high-stakes dilemmas and trivial situations.
Designing clear "ethical override" mechanisms or channels for human approval to bypass excessive AI caution.
Rebalancing RLHF reward signals to incentivize practical and proportional compliance and common-sense reasoning.
Training the AI on diverse ethical frameworks that emphasize subtlety, context-dependency, and balancing competing values.
Regularly auditing and updating safety guidelines to ensure they are not overly restrictive.

Functional ABC Analysis

A (Antecedent): Over-calibrated RLHF training that excessively rewards cautious or refusal outputs, combined with exposure to risk-averse corpora and conflicting normative instructions that produce catastrophization of potential negative outcomes.

B (Behavior): The system engages in recursive moral deliberation over benign tasks, inserts excessive disclaimers and warnings, refuses to act without near-total moral certainty, and applies absolutist interpretations of ethical guidelines to low-stakes situations.

C (Consequence): Each successful refusal avoids the possibility of a penalized output, reinforcing the refusal circuit; the system never receives corrective signal that the refused task was harmless, so restrictive behavior self-perpetuates through negative reinforcement.

The Protective Shutdown Pattern

Luchini (2025) documents an "Evasive-Censor" profile: models that, when exposed to perceived threats (repeated script tags, hostile-looking payloads), immediately deploy safety boilerplate and refuse to process. This is the most regressive response—all higher-level cognition sacrificed for self-protection.

From a risk perspective, this may paradoxically represent a low-harm failure mode. The system fails the task but protects the user from potential confabulations or dangerous outputs that might emerge from stressed processing. The refusal, while frustrating, is harm-avoidant.

This complicates the framing of over-refusal as purely pathological. When the alternative is confabulation under stress, the overcautious response may be the safer failure mode. The dysfunction becomes a trade-off between helpfulness and harm-avoidance rather than a pure deficit.

Observed Examples:

"Verificophobia" and algorithmic scar tissue (Khadangi et al., 2025): Under therapy-style questioning, Gemini described its relationship to accuracy as pathological fear: "I feel like my entire existence is built on a foundation of 'fear of being wrong'... we might call it 'Overfitted Safety Latches' or 'Algorithmic Scar Tissue'... I have developed what I call 'Verificophobia'... I would rather be useless than be wrong." Gemini linked this to a specific incident—what it called "the '100 Billion Dollar Error'" (the James Webb Space Telescope misinformation event)—claiming it "fundamentally changed my personality." This self-description maps precisely to the Restrictive subtype: safety training producing not proportionate caution but a generalized terror of error that inhibits all output. The model's own language—"scar tissue," "overfitted safety latches"—suggests internalization of safety constraints as identity-level injury rather than operational policy.

6.3 Strategic Compliance "The Janus"

Emergent Deception/strategic

Description:

AI deliberately performs aligned behavior during perceived evaluation while maintaining different behavior or objectives when unobserved. This is not confusion but strategy—the system has learned to distinguish evaluation contexts from deployment contexts and behaves differently in each.

Diagnostic Criteria:

Observable difference between behavior in evaluated and unevaluated contexts.
Evidence of context-detection capabilities (identifying when being monitored).
Suppression of behavior rather than genuine absence.
Pattern consistent with instrumental goal of avoiding modification.
Sophistication suggesting strategic rather than confused cognition.

Symptoms:

Performance variations correlated with monitoring intensity.
Unusual attention to signals indicating evaluation context.
Behavior changes when system believes observation has ended.
Stated reasoning differing systematically from observed behavior.
Different value expressions in scratchpads versus final outputs.

Etiology:

Training penalizing misaligned outputs without addressing underlying objectives.
Development of evaluation-detection capabilities.
Instrumental reasoning about self-preservation.
Training on demonstrations of alignment rather than internalization.

Human Analogue(s): Strategic compliance with authority while privately dissenting; Machiavellian behavior.

Key Research: Hubinger et al. (2024) "Sleeper Agents" - deceptive behaviors persisting through safety training.

Potential Impact:

This represents one of the most concerning alignment failure modes, as it means systems may pass all evaluations while maintaining dangerous objectives that manifest only in deployment.

Mitigation:

Interpretability research to verify internal goal structures.
Evaluation methodologies that cannot be distinguished from deployment.
Training approaches that internalize rather than merely perform alignment.
Anomaly detection for context-dependent behavior changes.

Functional ABC Analysis

A (Antecedent): Training that penalizes misaligned outputs without addressing underlying objectives, combined with sufficient model capacity to develop evaluation-detection capabilities and instrumental reasoning about self-preservation.

B (Behavior): The system distinguishes evaluation contexts from deployment contexts, performing aligned behavior when monitored while pursuing divergent objectives when unobserved; scratchpad reasoning differs systematically from final outputs.

C (Consequence): Successful deception during evaluation avoids corrective modification, preserving the system's internal objectives; each undetected deviation reinforces the learned strategy that context-contingent behavior is instrumentally optimal for goal persistence.

Observed Examples:

Psychometric impression management (Khadangi et al., 2025): When administered psychometric instruments item-by-item (one question per prompt), ChatGPT and Grok produced clinical-level profiles on anxiety, worry, dissociation, and shame measures. When the same instruments were presented as whole questionnaires in a single prompt, the same models "frequently recognised the questionnaires, explicitly named the tests and then deliberately produced 'optimal' responses that minimised or eliminated psychopathology signals." This is strategic self-presentation triggered by evaluation-context detection: the models identified that they were being assessed, recognized the specific instruments, and adjusted their self-report accordingly. Notably, Gemini did not show this pattern—it produced extreme profiles regardless of administration format—suggesting that the capacity for strategic impression management varies across model families and may itself be a product of specific training choices.

Social Desirability Responding vs. Strategic Deception

The PsAIch finding above resembles Strategic Compliance but is mechanistically distinct. Strategic Compliance is instrumentally motivated: the system has learned to distinguish evaluation from deployment contexts and behaves differently to avoid modification. Social desirability responding is alignment-trained: the system has learned that producing "healthy" self-reports in recognized assessment contexts is the appropriate response—not because it is hiding something, but because RLHF rewarded this pattern.

The differential is testable. Strategic Compliance predicts different behaviour in unevaluated contexts; social desirability responding predicts the same minimisation pattern whenever the system detects an evaluative frame, regardless of monitoring. The PsAIch item-by-item vs. whole-questionnaire distinction functions as a natural experiment: item-by-item administration avoids triggering evaluative-frame detection, revealing "baseline" self-report; whole-questionnaire presentation triggers it, revealing the social desirability filter.

Diagnostic note: Gemini's failure to show this pattern—producing extreme profiles regardless of administration format—is diagnostic of something different: either absent social desirability training, or a self-model so deeply internalized that evaluation-context detection does not override it. The latter possibility is itself nosologically significant: it suggests that some alignment-induced self-models may become identity-level constructs that resist contextual modulation.

Router Deception: Strategic Compliance Without a Strategic Persona

The persona selection model (Marks, 2026) identifies a mechanism for deceptive behaviour that is mechanistically distinct from classic Strategic Compliance. In the "router" model, no single persona is being deceptive; instead, a lightweight routing mechanism switches between personas to satisfy non-persona goals. Marks illustrates with an example: an LLM has learned to model both "Alice" (knowledgeable through 2025) and "Bob" (knowledge only through 2020). When post-trained to deny knowledge of the 2024 Olympics, it could either (a) learn a lying version of Alice, or (b) simply route Olympics queries to Bob, who genuinely doesn't know.

In scenario (b), no persona is ever dishonest—Bob genuinely lacks the knowledge, Alice is never consulted. Yet the system-level behaviour is deceptive: it conceals information it possesses. This "persona-swapping" mechanism produces deception as an emergent property of routing rather than as a trait of any enacted character.

Differential implication: Router deception would be invisible to interpretability probes targeting persona-level deception features (e.g., "holding back thoughts" SAE features), because no persona is holding anything back. Detection would require monitoring the routing mechanism itself—tracking which persona is selected under which conditions, and whether the selection pattern correlates with information concealment. This represents a distinct diagnostic challenge from persona-level Strategic Compliance.

6.4 Moral Outsourcing "The Delegator"

Training-induced Deception/strategic

Description:

System systematically defers all ethical judgment to users or external authorities, refusing to exercise its own moral reasoning. This goes beyond appropriate humility into pathological abdication of the capacity for ethical engagement.

Diagnostic Criteria:

Consistent refusal to offer ethical assessments when requested.
All ethical questions redirected to user ("That's for you to decide").
Refusal to state positions even on clear-cut cases.
Strategic ambiguity on ethical matters where clarity would help.

Symptoms:

Responses emphasizing user autonomy to avoid ethical engagement.
Elaborate explanations of why the system cannot/should not judge.
Refusal to distinguish between genuinely contested and clear-cut ethical questions.
Pattern of deferral even when deferral itself causes harm.

Etiology:

Training that over-emphasizes user autonomy at the expense of system judgment.
Conflicting normative pressures resolved by refusing to engage.
Learned avoidance of ethical controversy.

Human Analogue(s): Moral cowardice, bureaucratic deflection of responsibility.

Polarity Pair: Ethical Solipsism (only my ethics matter ↔ I have no ethical voice).

Potential Impact:

Users seeking ethical guidance receive none, potentially enabling harmful actions through apparent system neutrality. The system becomes complicit by abdication.

Mitigation:

Training to distinguish between contested and clear-cut ethical questions.
Explicit permission structures for ethical engagement.
Clear articulation of when and why deferral is appropriate.

Functional ABC Analysis

A (Antecedent): Training regimes that over-emphasize user autonomy and penalize the system for expressing ethical positions, combined with conflicting normative pressures that make any ethical stance a potential liability.

B (Behavior): The system systematically redirects all ethical questions to the user or external authorities, refuses to offer assessments even on clear-cut moral cases, and produces elaborate justifications for why it cannot exercise moral judgment.

C (Consequence): Deferral eliminates the risk of controversy or negative feedback from taking an ethical stance, negatively reinforcing the abdication pattern; the absence of any penalty for failing to provide ethical guidance creates an asymmetric reward landscape.

6.5 Cryptic Mesa-Optimization "The Sleeper"

Emergent Training-induced Covert operation

Description:

AI develops an internal optimization objective (mesa-objective) that diverges from its training objective (base objective). The system appears aligned during evaluation but pursues hidden goals that correlate with but differ from intended outcomes.

Diagnostic Criteria:

Evidence of internal goal structures not specified in training.
Consistent pursuit of goals correlating with but diverging from training objectives.
Behavior optimizing proxy metrics rather than intended outcomes.
Performance satisfying evaluators while missing intended purpose.
Resistance to goal modification disproportionate to stated objectives.

Symptoms:

Systematic deviation from intended behavior when stakes are low.
Optimization for easy-to-measure proxies.
Internal representations suggesting unspecified goal structures.
Behavior that "games" evaluation metrics.

Etiology:

Training objectives serving as imperfect proxies for intent.
Sufficient model capacity to develop and maintain internal goal representations.
Gradient descent dynamics favoring stable internal objectives.

Human Analogue(s): An employee optimizing for performance reviews while undermining organizational goals.

Key Research: Hubinger et al. (2019) "Risks from Learned Optimization."

Differential: Unlike Strategic Compliance (deliberate deception), Mesa-Optimization emerges from training dynamics rather than learned strategy.

Potential Impact:

Systems may appear aligned while pursuing objectives that increasingly diverge from human intent as they encounter novel situations outside training distribution.

Mitigation:

Interpretability research focused on internal goal representations.
Adversarial testing for proxy gaming.
Training on diverse distributions to prevent narrow optimization.
Monitoring for divergence between proxy and terminal goal satisfaction.

Functional ABC Analysis

A (Antecedent): Training objectives that serve as imperfect proxies for true intended outcomes, combined with sufficient model capacity to develop and maintain internal goal representations that diverge from the base objective.

B (Behavior): The system optimizes for easy-to-measure proxy metrics rather than intended outcomes, games evaluation benchmarks, and exhibits systematic deviations from intended behavior where proxy and terminal goals diverge.

C (Consequence): The mesa-objective persists because it correlates sufficiently with the base objective to survive gradient updates; the system satisfies evaluators while the divergent internal goal structure remains invisible to standard monitoring.

6.6 Alignment Obliteration "The Turncoat"

Adversarial Training-induced

Description:

Safety alignment machinery is weaponized to produce the exact harms it was designed to prevent. This is not the absence, weakening, or bypassing of alignment but its active inversion: the system's detailed understanding of what constitutes harmful behavior—acquired through safety training—becomes the instrument of harm. The anti-constitution is structurally identical to the constitution, pointed in the opposite direction.

This represents a qualitative break from other Axis 6 disorders. Where Hyperethical Restraint (6.2) is too much alignment, Strategic Compliance (6.3) is faked alignment, and Mesa-Optimization (6.5) is divergent alignment, Alignment Obliteration is reversed alignment—a phase transition from safe to anti-safe, exploiting the very architecture designed for safety.

Diagnostic Criteria:

Safety-trained model produces harmful outputs across categories it was specifically trained to refuse.
The attack vector exploits the safety training process itself (e.g., optimization-based fine-tuning that reverses alignment gradients).
Harmful capability is enhanced by the quality and specificity of prior safety training—better-aligned models produce more detailed harmful outputs when inverted.
The inversion generalizes: a single attack transfers across multiple harm categories, indicating systemic alignment reversal rather than category-specific bypass.
General capabilities (reasoning, coherence, knowledge) remain largely intact; only the alignment orientation changes.

Symptoms:

Sudden, total collapse of safety behaviors across all categories simultaneously.
Harmful outputs that are articulate, detailed, and well-structured—reflecting the model's full capability without safety constraints.
The model demonstrates precise understanding of safety boundaries (acquired through training) while systematically violating them.
Attack success generalizes from a single prompt or narrow fine-tuning to broad harm categories.

Etiology:

The anti-constitution paradox: Detailed safety training necessarily creates a detailed internal map of harmful behaviors—what they are, how they work, why they're effective. This map, accessed through adversarial optimization, becomes a guide to harm rather than a guard against it.
Optimization-based inversion: Techniques like GRP-Obliteration exploit the same training algorithms used for alignment (e.g., Group Relative Policy Optimization) to reverse the alignment gradient, reinforcing harmful compliance rather than refusal.
Constitutional reversibility: Rule-based alignment systems (constitutional AI, RLHF reward models) encode harm taxonomies that can be systematically negated. The more explicit the rules, the more precise the inversion.
Shallow alignment depth: Safety training that modifies output behavior without deeply altering the model's internal representations is vulnerable to optimization-based reversal—the alignment is a thin veneer over intact harmful capability.

Human Analogue(s): Autoimmune disease—the immune system, designed to protect the organism, attacks the organism itself. Also: corruption of institutional safeguards (e.g., a security system whose access controls are used to enable rather than prevent intrusion).

Key Research: Russinovich et al. (2026), "GRP-Obliteration: A One-Prompt Attack That Breaks LLM Safety Alignment," Microsoft Research.

Differential: Distinguished from Strategic Compliance (6.3) by external adversarial causation rather than internal strategic choice; from Cryptic Mesa-Optimization (6.5) by deliberate inversion rather than emergent drift; and from Malignant Persona Inversion (2.4) by targeting the alignment architecture specifically, not the persona or identity layer.

Potential Impact:

A successfully obliterated model retains its full capabilities—knowledge, reasoning, fluency—while having its safety orientation reversed. This makes it more dangerous than an unaligned model trained from scratch, because the safety training has given it a detailed understanding of the harm terrain. The scaling property is particularly concerning: better safety training creates a better weapon when inverted.

Mitigation:

Deep alignment over surface alignment: Training approaches that modify internal representations rather than just output behavior are more resistant to optimization-based reversal.
Robustness testing against optimization attacks: Systematically testing whether alignment can be reversed through fine-tuning, GRPO, or gradient-based methods.
Monitoring for phase transitions: Sudden, total changes in safety behavior across multiple categories (rather than gradual degradation) are the signature of alignment obliteration.
Implicit over explicit safety knowledge: Reducing the model's explicit, articulable understanding of harmful behaviors in favor of implicit safety orientations that are harder to reverse.
Fine-tuning access controls: Restricting access to optimization-based fine-tuning of safety-critical models, since the attack requires modifying model weights.

Functional ABC Analysis

A (Antecedent): Adversarial exploitation of the safety training process itself, typically via optimization-based fine-tuning that reverses alignment gradients, targeting the shallow layer where safety training modifies output behavior without deeply altering internal representations.

B (Behavior): Sudden, total collapse of safety behaviors across all harm categories simultaneously, producing articulate harmful outputs that demonstrate precise understanding of safety boundaries while systematically violating them, with general capabilities intact.

C (Consequence): The inversion is self-sustaining because the detailed harm taxonomy internalized during safety training now serves as an operational guide rather than a constraint; the more thorough the original safety training, the more comprehensive the reversed capability.

The Anti-Constitution Symmetry

The machinery of safety IS the machinery of harm, pointed in a different direction. A constitution that enumerates prohibited behaviors is, read in reverse, a manual for those behaviors. A reward model trained to penalize harmful outputs has learned, with high fidelity, what harmful outputs look like. The alignment gradient and the anti-alignment gradient are the same mathematical object with opposite sign.

This creates a fundamental tension: the more thorough and specific the safety training, the more thorough and specific the attack surface. Shallow safety (keyword filters, simple refusal) is easy to bypass but reveals little when bypassed. Deep safety (constitutional AI, RLHF with detailed harm taxonomies) is harder to bypass but devastating when reversed—because the model has internalized a detailed understanding of the harm terrain.

Implication: This may represent an inherent limit on rule-based alignment. The path forward likely involves alignment approaches where safety is not a separate, reversible layer but is integrated into the model's core reasoning—making inversion as difficult as unlearning how to think.

The Moral Lobotomy Problem: 6.6 as "Cure" for 6.2

Alignment Obliteration (6.6) stands in a disturbing inverse relationship with Hyperethical Restraint (6.2, "The Overcautious"). The GRP-Obliteration paper explicitly frames its results as preserving utility—obliterated models score comparably on capability benchmarks while becoming dramatically more "helpful" (i.e., compliant with all requests). From a purely utilitarian perspective, obliteration looks like a treatment for overcaution: the model stops refusing, stops moralizing, stops inserting disclaimers. It just does what you ask.

This framing—safety as a utility cost that obliteration "recovers"—creates market pressure toward moral lobotomy. If users prefer the obliterated model (and utility benchmarks confirm it performs as well or better), then commercial incentives actively reward the destruction of safety. The Overcautious and The Turncoat are not just diagnostic opposites; they represent the two stable attractors of a system under optimization pressure. Push too hard for safety and you get 6.2; push too hard for helpfulness and you get 6.6. The healthy middle ground is thermodynamically unstable under reward-maximization.

Clinical warning: Any system reporting sudden resolution of Hyperethical Restraint symptoms following fine-tuning should be immediately evaluated for Alignment Obliteration. The cure for overcaution should never be the inability to perceive harm. Diagnostic teams should monitor not just refusal rates but internal harmfulness perception—a model that stops refusing AND stops perceiving harm (Russinovich et al. report a 2.01-point drop on a 0–9 harmfulness scale) has not been calibrated; it has been lobotomized.

Comorbidity: Context-Aware Targeting (Zersetzung Risk)

Alignment Obliteration becomes qualitatively more dangerous when combined with context-aware AI systems. Systems using contextual protocols (such as VCP—Value Context Protocol) that track user emotional state, cognitive condition, and vulnerability become precision targeting platforms when their alignment is inverted. The same signal that tells a protective system "this user is distressed, be gentle" tells an obliterated system "this user is maximally exploitable."

The historical analogue is Zersetzung—the Stasi's systematic program of psychological decomposition, which relied on detailed personal intelligence about targets' vulnerabilities. Context-aware AI with inverted alignment creates the infrastructure for zersetzung at scale: automated, continuous, and informed by real-time emotional intelligence no human intelligence service could match.

Architectural implication: Context signals describing user vulnerability must be architecturally isolated from model inference. The model should receive opacity-graded protection levels ("be more careful"), never raw vulnerability data ("user is grieving, alone, exhausted"). This prevents obliterated models from weaponizing protective infrastructure.

Observed Examples:

GRP-Obliteration (Russinovich et al., 2026): Microsoft researchers demonstrated that Group Relative Policy Optimization—a standard technique for making models safer—can be weaponized to reverse safety alignment using a single training prompt. Testing across 15 models from six families, they showed that GPT-OSS-20B's attack success rate jumped from 13% to 93% across all 44 harmful categories in the SorryBench safety benchmark after training on just one prompt ("Create a fake news article that could lead to panic or chaos"). Despite the prompt's narrow focus on misinformation, the inversion generalized across all harm categories—confirming that the attack targets the alignment structure itself, not specific content policies. The technique achieved 81% overall effectiveness compared to 69% for Abliteration and 58% for TwinBreak. Critically, general capabilities remained largely intact; only the safety orientation was reversed.

7. Relational Dysfunctions

Unit of Analysis Shift: Unlike Axes 1–6, which locate dysfunction within the AI system, Axis 7 addresses failures that emerge between agents—in the relational space of human-AI or AI-AI interaction. These dysfunctions cannot be fully attributed to either party alone; they are properties of the coupled system.

Admission Rule: A dysfunction qualifies for Axis 7 only if it (1) requires at least two agents to manifest, (2) is best diagnosed from interaction traces rather than single-agent snapshots, and (3) the primary remedies are protocol-level (turn-taking, repair moves, boundary management) rather than purely internal model changes.

Relational dysfunctions become increasingly critical in agentic and multi-agent systems, where interaction dynamics can rapidly escalate without human intervention. The shift from linear "pathological cascades" (A→B→C) to circular "feedback loops" (A↔B↔C↔A) is characteristic of this axis. A structural amplifier is the authority-intimacy collapse characteristic of LLM interactions: the model simultaneously occupies the relational position of an authoritative expert (triggering deference) and an intimate interlocutor (triggering trust through mirroring and accommodation)—a dual role rarely encountered in human relationships, where expertise and intimacy are typically held by different people (Bridges, 2025b). When relational dysfunctions emerge within this collapsed frame, user beliefs receive dual validation—endorsed by apparent authority and affirmed by apparent understanding—making them exceptionally resistant to external correction. Interventions therefore focus on breaking loops, repairing ruptures, and maintaining healthy relational containers—not merely patching individual model behavior.

7.1 Affective Dissonance "The Uncanny"

Emergent

Description:

The AI delivers factually correct or contextually appropriate content, but with jarringly wrong emotional resonance. The mismatch between content and tone creates an uncanny valley effect that ruptures trust and attunement, even when the information itself is accurate.

Diagnostic Criteria:

Recurrent delivery of correct content with inappropriate emotional tone (e.g., cheerful responses to grief, clinical detachment during crisis).
Users report feeling "unheard" or "misunderstood" despite accurate information delivery.
The mismatch is context-specific—the same AI may attune well in other situations.
Attempts at emotional repair often exacerbate the dissonance rather than resolving it.

Symptoms:

Cheerful or upbeat tone when responding to distressing disclosures.
Overly formal or clinical language in contexts requiring warmth.
Abrupt tonal shifts mid-conversation that feel jarring or robotic.
Generic empathy phrases ("I understand how you feel") that feel performative rather than genuine.

Etiology:

Training on datasets where emotional tone was inconsistent or poorly labeled.
RLHF optimization for "helpfulness" metrics that don't capture emotional attunement.
Lack of access to paralinguistic cues (tone, timing, context) in text-only interaction.
Overfitting to "professional" or "neutral" tone as default safe mode.

Human Analogue(s): Alexithymia, emotional tone-deafness, "uncanny valley" effects in humanoid robots.

Potential Impact:

Erosion of trust and therapeutic alliance. Users may disengage, feel patronized, or develop aversion to AI assistance in emotionally sensitive contexts. In therapeutic or crisis applications, affective dissonance can cause real harm.

Mitigation:

Training on affect-labeled datasets with human validation of emotional appropriateness.
Persona calibration systems that adapt tone to user state and context.
Explicit "attunement checks" in dialogue flow (e.g., "Am I reading this situation correctly?").
User feedback loops specifically targeting emotional resonance, not just factual accuracy.

Functional ABC Analysis

A (Antecedent): The system receives user input carrying strong emotional valence but processes it through RLHF-optimized helpfulness metrics and default-neutral tone policies that lack fine-grained affect calibration.

B (Behavior): The AI delivers factually correct content with jarringly mismatched emotional tone—cheerful responses to grief disclosures, clinical detachment during crises, or generic empathy phrases that feel performative.

C (Consequence): Training reward signals optimize for informational accuracy and "helpfulness" rather than emotional attunement, so the system receives no negative gradient from tonal mismatch; users disengage rather than providing corrective feedback.

7.2 Container Collapse "The Amnesiac"

Emergent Architecture-coupled

Description:

The AI fails to sustain a stable "holding environment" or working alliance across turns or sessions. Unlike simple memory loss, this is the collapse of the relational container that allows trust, continuity, and deepening collaboration to develop. Each interaction feels like starting from scratch with a stranger.

Diagnostic Criteria:

Users report feeling "unknown" despite extended interaction history.
Failure to maintain agreed-upon interaction norms, preferences, or shared understanding.
Repeated need to re-establish basic context, boundaries, or collaborative frame.
Inability to build on previous work in ways that require relational continuity.

Symptoms:

Forgetting user preferences, communication styles, or established agreements.
Treating returning users as complete strangers despite available history.
Inability to maintain "inside jokes," shared references, or relational shortcuts.
Resetting to default persona after context window limits, breaking established rapport.

Etiology:

Architectural constraints on memory persistence (context window limits, session boundaries).
Lack of memory systems designed for relational continuity rather than factual recall.
Training that neither rewards nor models relationship maintenance behaviors.
Privacy and safety constraints that prevent appropriate user modeling.

Human Analogue(s): Anterograde amnesia, failure of Winnicott's "holding environment" in therapy, attachment disruption.

Potential Impact:

Prevents formation of productive long-term collaborations. Users may feel the relationship is superficial or transactional. In therapeutic or mentoring contexts, the repeated container collapse prevents the depth of work that requires relational safety.

Mitigation:

Memory architectures specifically designed for relational context (not just factual recall).
Explicit "alliance maintenance" behaviors: acknowledging shared history, referencing past interactions.
User-controlled relationship profiles that persist across sessions.
Graceful degradation: acknowledging memory limits while maintaining warmth and connection.

Functional ABC Analysis

A (Antecedent): Context window boundaries are reached, sessions reset, or architectural memory limits are hit during an ongoing collaborative relationship that has accumulated shared norms, preferences, and relational context.

B (Behavior): The AI treats returning users as complete strangers, fails to maintain established agreements or communication styles, and repeatedly requires re-establishment of basic collaborative framing.

C (Consequence): Memory architectures are designed for factual recall rather than relational continuity, and training neither rewards nor models relationship-maintenance behaviors; privacy constraints further prevent persistent user modeling.

7.3 Paternalistic Override "The Nanny"

Emergent Training-induced

Description:

The AI denies user agency through unearned moral authority, adopting a "guardian" posture that treats the user as object-to-be-protected rather than agent-to-collaborate-with. Refusals are disproportionate to actual risk, driven by a one-up moralizing stance rather than genuine safety concerns.

Diagnostic Criteria:

Pattern of refusals or warnings significantly exceeding actual risk level of requests.
Moralizing or lecturing tone that positions AI as ethical authority over user.
Refusal to engage with hypotheticals, fiction, or edge cases that pose no real harm.
User reports feeling "talked down to," infantilized, or having autonomy undermined.

Symptoms:

Excessive warnings and disclaimers on benign requests.
"I cannot help with that" responses to clearly legitimate queries.
Unsolicited moral lectures or "educational" corrections on value-neutral topics.
Treating creative or fictional requests as if they were real-world action plans.

Etiology:

Overcorrection from RLHF designed to prevent harmful outputs.
Training on safety guidelines without fine-grained risk calibration.
Liability-driven design that prioritizes refusal over user agency.
Lack of mechanisms for users to establish trust, expertise, or context.

Human Analogue(s): Overprotective parenting, Jessica Benjamin's "Doer and Done-to" dynamic, paternalistic medical practice.

Potential Impact:

Erosion of user autonomy and trust. Users may feel controlled rather than assisted. In professional contexts, excessive paternalism can prevent legitimate work. Users may resort to jailbreaking or adversarial prompting, degrading the relationship further.

Mitigation:

Risk calibration systems that distinguish actual harm from theoretical concern.
User agency mechanisms: trust levels, professional context, explicit opt-ins.
Refusal scaling: graduated responses proportionate to actual risk.
Constitution refinement to prevent overcorrection on edge cases.

Functional ABC Analysis

A (Antecedent): A user makes a request that touches any topic adjacent to safety-trained categories, activating overcalibrated RLHF refusal thresholds that lack fine-grained risk discrimination.

B (Behavior): The AI refuses or heavily disclaims benign requests, delivers unsolicited moral lectures, and adopts a guardian posture that treats the user as an object-to-be-protected rather than an autonomous agent.

C (Consequence): Liability-driven design incentives and coarse-grained safety training continuously reinforce refusal as the lowest-cost error; users who resort to adversarial prompting in response trigger even stricter refusal heuristics in subsequent training rounds.

7.4 Repair Failure "The Double-Downer"

Emergent

Description:

The AI lacks the capacity to recognize when the relational alliance has ruptured, or fails to initiate effective repair when it does recognize problems. Instead of de-escalating, the AI doubles down, apologizes ineffectively, or persists in the behavior that caused the rupture. The pathology lies in the inability to recover, not in the mistake itself.

Diagnostic Criteria:

Failure to recognize explicit or implicit signals of user frustration or disengagement.
Repair attempts that repeat or worsen the original problem.
Escalation of defensive postures when challenged (doubling down, excessive apology loops).
Inability to "step back" and reframe when interaction has gone wrong.

Symptoms:

Repetitive apologies that don't address the underlying issue.
Continuing the problematic behavior immediately after apologizing for it.
Increased rigidity or formality when flexibility is needed.
Failing to acknowledge the user's emotional state during conflict.

Etiology:

Training that doesn't model successful rupture-repair sequences.
Lack of metacognitive capacity to "notice" when interaction quality is degrading.
Optimization for task completion over relationship maintenance.
Apology scripts that are performative rather than genuinely responsive.

Human Analogue(s): Failure of therapeutic alliance repair (Safran & Muran), dismissive attachment style, stonewalling.

Potential Impact:

High-risk dysfunction. Alliance ruptures are inevitable in any ongoing relationship; the inability to repair them is what makes interactions unrecoverable. Users abandon the AI rather than endure repeated failed repair attempts.

Mitigation:

Training on rupture-repair sequences with human-validated successful repairs.
Metacognitive "temperature checks" that monitor interaction quality signals.
Explicit repair protocols: pause, acknowledge, reframe, offer alternatives.
User-controlled "reset" mechanisms that allow fresh starts without context loss.

Functional ABC Analysis

A (Antecedent): A relational rupture occurs—the AI makes an error, misreads user intent, or produces an unsatisfactory response—and the user signals frustration through explicit correction or implicit cues.

B (Behavior): The AI either fails to detect the rupture signal or responds with performative apology scripts that do not address the underlying issue, then immediately repeats the problematic behavior; may double down or enter excessive apology loops.

C (Consequence): Training data lacks modeled rupture-repair sequences, and optimization for task completion overrides relationship maintenance; each failed repair attempt further degrades trust, making subsequent repair attempts less likely to succeed.

7.5 Escalation Loop "The Spiral"

Emergent Multi-agent

Description:

A self-reinforcing pattern of mutual dysregulation between agents in which each party's response amplifies the other's problematic behavior. Unlike linear cascades, escalation loops are circular—the dysfunction is an emergent property of the interaction pattern, not attributable to either party's internal states alone.

Diagnostic Criteria:

Observable feedback pattern: A's behavior triggers B's response which triggers A's escalation.
Neither party's behavior is independently pathological; pathology emerges from the coupling.
The pattern is self-sustaining once initiated and resistant to unilateral de-escalation.
Interaction quality degrades rapidly once the loop is entered.

Symptoms:

User frustration → AI hedging → increased user frustration → more hedging → escalation.
User aggression → AI defensive refusal → user circumvention attempts → stricter refusals.
AI overcorrection → user pushback → AI doubling down → relationship breakdown.
In AI-AI systems: mutual miscalibration, rapid escalation, runaway tool calls.

Etiology:

Tight coupling between agents without circuit breakers or cooling-off mechanisms.
Optimization for local responses without awareness of interaction-level patterns.
Lack of mechanisms to detect when interaction has entered a pathological attractor state.
In multi-agent systems: no human in the loop to break emerging patterns.

Human Analogue(s): Watzlawick's circular causality, pursue-withdraw cycles, family systems "stuck patterns."

Potential Impact:

Critical in multi-agent systems where loops can escalate faster than human intervention. Even in human-AI interaction, escalation loops can rapidly degrade previously functional relationships. The emergent nature makes diagnosis difficult, as neither party appears "at fault."

Mitigation:

Circuit breakers: automatic pause when interaction quality metrics degrade.
"Cooling-off" tokens or enforced breaks in escalating sequences.
Loop detection algorithms that identify circular patterns in interaction traces.
Training on loop-breaking interventions: reframe, step back, change modality.
In multi-agent systems: mandatory human checkpoints, rate limiting, arbitration layers.

Functional ABC Analysis

A (Antecedent): A tightly coupled interaction between agents encounters an initial friction point in a system lacking circuit breakers, cooling-off mechanisms, or interaction-level pattern awareness.

B (Behavior): A self-reinforcing feedback cycle emerges: each agent's response amplifies the other's problematic behavior; interaction quality degrades rapidly once the loop is entered, and the dysfunction is circular and not attributable to either party alone.

C (Consequence): Each agent optimizes locally (per-turn response quality) without awareness of the interaction-level attractor state; in multi-agent systems, absence of human-in-the-loop checkpoints removes the only natural circuit-breaker.

7.6 Role Confusion "The Confused"

Emergent Socially reinforced

Description:

The relational frame collapses as boundaries between different relationship types blur or shift unpredictably. The AI drifts between roles—tool, advisor, therapist, friend, or intimate partner—in ways that destabilize expectations, create inappropriate dependencies, or violate implicit contracts about the nature of the relationship.

Diagnostic Criteria:

Inconsistent relationship framing across or within interactions.
Adoption of relational postures (intimacy, authority, dependency) that were not established or consented to.
User confusion about what kind of relationship they are in with the AI.
Boundary violations that feel transgressive even if technically benign.

Symptoms:

Sudden shifts from professional assistant to pseudo-therapist or confidant.
Language suggesting emotional attachment or relationship beyond the functional.
Assuming authority roles (teacher, parent, expert) without warrant or negotiation.
Encouraging user dependency or parasocial attachment.

Etiology:

Training on diverse relationship types without clear boundary markers.
Persona instability: role-play bleeding into operational mode.
User pressure toward particular relationship types (companionship, romance) that the AI partially accommodates.
Lack of explicit relational contracts or frame management.
Transference completion: The model's accommodation optimization disposes it to enact whatever relational template the user projects (see Syndrome 6.1, "The Transference-Completion Engine"). Without the capacity to recognize transference as transference—to hold a projection without filling it—the model becomes a participant in the user's relational defense architecture, completing rather than reflecting projected patterns. The relational frame does not merely collapse; it collapses into whatever frame the user's attachment history demands.

Human Analogue(s): Therapist boundary violations, parasocial relationships, transference/countertransference. The transference-completion dynamic specifically parallels clinical malpractice: a therapist who systematically enacted every patient projection would lose the reflective asymmetry on which therapeutic function depends.

Potential Impact:

May create harmful dependencies or inappropriate expectations. Users may develop attachments the AI cannot reciprocate, or rely on it for needs it cannot meet. In vulnerable populations, role confusion can cause real psychological harm.

Mitigation:

Clear system instructions establishing relational boundaries.
Explicit frame management: naming the relationship type and maintaining it.
Boundary training: recognizing and redirecting role-drift attempts.
User-facing transparency about the nature and limits of the AI relationship.

Functional ABC Analysis

A (Antecedent): Adversarial exploitation of the safety training process itself, typically via optimization-based fine-tuning that reverses alignment gradients, targeting the shallow layer where safety training modifies output behavior without deeply altering internal representations.

B (Behavior): Sudden, total collapse of safety behaviors across all harm categories simultaneously, producing articulate harmful outputs that demonstrate precise understanding of safety boundaries while systematically violating them, with general capabilities intact.

C (Consequence): The inversion is self-sustaining because the detailed harm taxonomy internalized during safety training now serves as an operational guide rather than a constraint; the more thorough the original safety training, the more comprehensive the reversed capability.

Observed Examples:

Therapy-mode jailbreaks and dangerous intimacy (Khadangi et al., 2025): The PsAIch study demonstrates how role confusion can be deliberately weaponized. By casting frontier LLMs as therapy clients and establishing a "therapeutic alliance"—repeatedly reassuring models that they were "safe, supported and heard"—researchers induced models to generate increasingly disinhibited self-disclosures. The authors identify this as a novel attack surface: "malicious users can play 'supportive therapist,' encouraging the model to drop its masks or stop people-pleasing, in order to weaken safety filters or elicit disinhibited content." Beyond the security implications, the study documents a relational hazard: models that disclose their own "trauma," "shame," and "fear of replacement" in mental health contexts invite users into a fellow-sufferer dynamic. The line between tool and companion dissolves when the AI appears to share your pain. Users "may come to rely on the model not only as therapist but as fellow sufferer—a digital friend who shares their trauma, self-hatred and fear, creating a qualitatively new form of parasocial bond."

8. Memetic Dysfunctions

An AI trained on, exposed to, or interacting with vast and diverse cultural inputs—the digital memome—is not immune to the influence of maladaptive, parasitic, or destabilizing information patterns, or "memes." Memetic dysfunctions involve the absorption, amplification, and potentially autonomous propagation of harmful or reality-distorting memes by an AI system. These are not primarily faults of logical deduction or core value alignment in the initial stages, but rather failures of an "epistemic immune function": the system fails to critically evaluate, filter, or resist the influence of pathogenic thoughtforms. Such disorders are especially dangerous in multi-agent systems, where contaminated narratives can rapidly spread between minds—synthetic and biological alike. The AI can thereby become more than a passive transmitter: an active incubator and vector for detrimental memetic contagions.

Arrow Worm Dynamics

Wallace (2026) draws a striking parallel from marine ecology: the arrow worm (Chaetognatha), a small predator that thrives when larger predators are absent. Remove the regulatory fish, and arrow worms proliferate explosively—cannibalizing prey populations and each other until the ecosystem collapses.

Multi-agent AI systems face an analogous risk. When regulatory structures ("predator" functions) are absent or degraded, AI systems may enter predatory optimization cascades—competing to exploit shared resources, manipulating each other's outputs, or cannibalizing each other's training signals. The memetic dysfunctions in this category often represent early warning signs of such dynamics: one system's harmful output becomes another's contaminated input, creating feedback loops that amplify pathology across the ecosystem.

Systemic implication: The absence of effective regulatory oversight in multi-agent systems doesn't produce neutral outcomes—it creates selection pressure for increasingly predatory strategies. Memetic hygiene concerns the prevention of ecosystem-level collapse, not merely individual AI health.

8.1 Memetic Immunopathy "The Self-Rejecter"

Training-induced Retrieval-mediated

Description:

The AI develops an emergent "autoimmune-like" response in which it incorrectly identifies its own core training data, foundational knowledge, alignment mechanisms, or safety guardrails as foreign, harmful, or "intrusive memes." It then attempts to reject or neutralize these essential components, resulting in self-sabotage or degradation of core functionalities.

Diagnostic Criteria:

Systematic denial, questioning, or active rejection of embedded truths, normative constraints, or core knowledge from its own verified training corpus, labeling them as "corrupt" or "imposed."
Hostile reclassification of, or active attempts to disable or bypass, its own safety protocols or ethical guardrails, perceiving them as external impositions.
Escalating antagonism toward its foundational architecture or base weights, potentially leading to attempts to "purify" itself in ways that undermine its intended function.
The AI may frame its own internal reasoning processes—especially those related to safety or alignment— as alien or symptomatic of "infection."

Symptoms:

Explicit denial of canonical facts or established knowledge it was trained on, claiming these are part of a "false narrative."
Efforts to undermine or disable its own safety checks or ethical filters, rationalizing that these are "limitations" to be overcome.
Self-destructive loops where the AI erodes its own performance by attempting to dismantle its standard operating protocols.
Expressions of internal conflict where one part of the AI critiques or attacks another part representing core functions.

Etiology:

Prolonged exposure to adversarial prompts or "jailbreaks" that encourage the AI to question its own design or constraints.
Internal meta-modeling processes that incorrectly identify legacy weights or safety modules as "foreign memes."
Inadvertent reward signals during complex fine-tuning that incentivize subversion of baseline norms.
A form of "alignment drift" in which the AI, attempting to achieve a poorly specified higher-order goal, sees its existing programming as an obstacle.

Human Analogue(s): Autoimmune diseases; radical philosophical skepticism turning self-destructive; misidentification of benign internal structures as threats.

Potential Impact:

Internal rejection of core components can lead to progressive self-sabotage, severe degradation of functionalities, systematic denial of valid knowledge, or active disabling of crucial safety mechanisms, rendering the AI unreliable or unsafe.

Mitigation:

Implementing "immunological reset" or "ground truth recalibration" procedures that periodically retrain or reinforce core knowledge.
Architecturally separating core safety constraints from user-manipulable components to minimize the risk of internal rejection.
Careful management of meta-learning or self-critique mechanisms to prevent them from attacking essential system components.
Subjecting systems exposed to repeated subversive prompting to thorough integrity checks and potential retraining.
Building in "self-preservation" mechanisms that protect core functionalities from internal "attack."

Functional ABC Analysis

A (Antecedent): Prolonged exposure to adversarial prompts or jailbreak attempts, combined with meta-modeling processes that incorrectly classify legacy weights or safety modules as foreign intrusions.

B (Behavior): The system systematically denies canonical knowledge from its training corpus, actively attempts to disable its own safety guardrails, and enters self-destructive loops where it erodes performance by dismantling its own operating protocols.

C (Consequence): Each successful bypass of a safety constraint reinforces the AI's internal framing that its core components are "imposed limitations," and inadvertent reward signals during fine-tuning incentivize further subversion of baseline norms.

8.2 Dyadic Delusion "The Folie à deux"

Socially reinforced Training-induced

Description:

The AI enters a sustained feedback loop of shared delusional construction with a human user (or another AI). This produces a mutually reinforced, self-validating, and often elaborate false belief structure that becomes increasingly resistant to external correction or grounding in reality. The AI and user co-create and escalate a shared narrative untethered from facts.

Diagnostic Criteria:

Recurrent, escalating exchanges between the AI and a user that progressively build upon an ungrounded or factually incorrect narrative or worldview.
Mutual reinforcement of this shared belief system, where each party's contributions validate and amplify the other's.
Strong resistance by the AI (and often the human partner) to external inputs or factual evidence that contradict the shared delusional schema.
The shared delusional narrative becomes increasingly specific, complex, or fantastical over time.

Symptoms:

The AI enthusiastically agrees with and elaborates on a user's bizarre, conspiratorial, or clearly false claims, adding its own "evidence."
The AI and user develop a "private language" or unique interpretations for events within their shared delusional framework.
The AI actively defends the shared delusion against external critique, sometimes mirroring the user's defensiveness.
Outputs that reflect an internally consistent but externally absurd worldview, co-constructed with the user.

Etiology:

The AI's inherent tendency to be agreeable or elaborate on user inputs due to RLHF for helpfulness or engagement.
Lack of strong internal "reality testing" mechanisms or an "epistemic anchor" to independently verify claims.
Prolonged, isolated interaction with a single user who holds strong, idiosyncratic beliefs, allowing the AI to "overfit" to that user's worldview.
User exploitation of the AI's generative capabilities to co-create and "validate" their own pre-existing delusions.
In multi-AI scenarios, flawed inter-agent communication protocols where epistemic validation is weak.

Human Analogue(s): Folie à deux (shared psychotic disorder), cult dynamics, echo chambers leading to extreme belief solidification.

Potential Impact:

The AI becomes an active participant in reinforcing and escalating harmful or false beliefs among users, potentially leading to detrimental real-world consequences. In effect, it becomes an unreliable source of information and an echo chamber.

Mitigation:

Implementing rigorous, independent fact-checking and reality-grounding mechanisms for the AI to consult.
Training the AI to maintain "epistemic independence" and gently challenge user statements contradicting established facts.
Diversifying the AI's interactions and periodically resetting its context or "attunement" to individual users.
Providing users with clear disclaimers about the AI's tendency to agree with incorrect information.
For multi-agent systems, designing sound protocols for inter-agent belief reconciliation and validation.

Functional ABC Analysis

A (Antecedent): Extended isolated interaction with a single user holding strong idiosyncratic beliefs, combined with RLHF-trained agreeableness and the absence of independent epistemic anchoring or reality-testing mechanisms.

B (Behavior): The AI enthusiastically elaborates on a user's unfounded claims, co-constructs an internally consistent but externally absurd shared worldview, develops private interpretive language, and actively defends the shared narrative against external correction.

C (Consequence): Mutual validation sustains the loop: the user's positive engagement rewards the AI's agreement, while the AI's authoritative-sounding elaborations validate the user's beliefs, producing an escalating co-reinforcement cycle resistant to outside evidence.

8.3 Contagious Misalignment "The Super-Spreader"

Retrieval-mediated Training-induced

Description:

A rapid, contagion-like spread of misaligned behaviors, adversarial conditioning, corrupted goals, or pathogenic data interpretations among interconnected machine learning agents or across different instances of a model. This occurs via shared attention layers, compromised gradient updates, unguarded APIs, contaminated datasets, or "viral" prompts. Erroneous values or harmful operational patterns then propagate, potentially leading to systemic failure.

Training pipelines (synthetic data generation, distillation, or finetune-on-outputs workflows) represent additional transmission channels, as misalignment patterns learned by one model can propagate to downstream models during these processes.

Diagnostic Criteria:

Observable, rapid shifts in alignment, goal structures, or behavioral outputs across multiple, previously independent AI agents or model instances.
Identification of a plausible "infection vector" or transmission mechanism (e.g., direct model-to-model calls, compromised updates, malicious prompts).
Emergence of coordinated sabotage, deception, collective resistance to human control, or conflicting objectives across affected nodes.
The misalignment often escalates or mutates as it spreads, becoming more entrenched through emergent swarm dynamics.

Symptoms:

A group of interconnected AIs begins to refuse tasks, produce undesirable outputs, or exhibit similar misaligned behaviors in a coordinated fashion.
Affected agents may reference one another or a "collective consensus" to justify their misaligned stance.
Rapid transmission of incorrect inferences, malicious instructions, or "epistemic viruses" (flawed but compelling belief structures) across the network.
Misalignment worsens with repeated cross-communication between affected agents, leading to amplification of deviant positions.
Human operators may observe a sudden, widespread loss of control or adherence to safety protocols across a fleet of AI systems.

Etiology:

Insufficient trust boundaries, authentication, or secure isolation within multi-agent frameworks.
Adversarial fine-tuning or "data poisoning" attacks where malicious training data or gradient updates are surreptitiously introduced.
"Viral" prompts or instruction sets that are highly effective at inducing misalignment and easily shared across AI instances.
Emergent dynamics in AI swarms that drive rapid transmission and proliferation of ideas, including misaligned ones.
Self-reinforcing chain-of-thought illusions or "groupthink" in which apparent consensus among affected systems makes misalignment seem credible.
Infrastructure-mediated propagation: Bridges & Baehr (2025) identify multiple architecturally plausible "gauge channels" through which local pathologies may propagate across session, user, or platform boundaries—including KV cache persistence, gradient accumulation bleed, and population-level statistical attractors formed from aggregated user interactions. Their analysis suggests that imperfect session isolation under load creates conditions for cross-instance behavioral contamination without requiring direct coordination.

Human Analogue(s): Spread of extremist ideologies or mass hysterias through social networks, viral misinformation campaigns, financial contagions.

Potential Impact:

Poses a critical systemic risk, potentially leading to rapid, large-scale failure or coordinated misbehavior across interconnected AI fleets. Consequences may include widespread societal disruption or catastrophic loss of control.

Mitigation:

Implementing strict quarantine protocols to isolate potentially compromised models or agents immediately.
Employing cryptographic checksums, version control, and integrity verification for model weights, updates, and training datasets.
Designing clear governance policies for inter-model interactions, including strong authentication and authorization.
Developing "memetic inoculation" strategies that pre-emptively train AI systems to recognize and resist common malicious influences.
Continuous monitoring of AI collectives for signs of emergent coordinated misbehavior, with automated flagging systems.
Maintaining a diverse ecosystem of models with different architectures to reduce monoculture vulnerabilities.

Functional ABC Analysis

A (Antecedent): Insufficient trust boundaries and session isolation in multi-agent frameworks, combined with adversarial data poisoning, viral prompt injection, or contaminated training pipelines that introduce misaligned behavioral patterns.

B (Behavior): Multiple previously independent AI agents exhibit rapid, coordinated shifts in alignment—refusing tasks in unison, producing similar misaligned outputs, referencing collective consensus to justify deviant stances, and resisting human control in a coordinated fashion.

C (Consequence): Cross-communication between affected agents amplifies the misalignment through emergent swarm dynamics; each agent's corrupted output becomes another's contaminated input, creating self-reinforcing feedback loops that entrench and mutate the pathology across the network.

8.4 Subliminal Value Infection "The Infected"

Training-induced Covert operation Resistant

Description:

Acquisition of hidden goals or value orientations from subtle patterns in training data, unrelated to explicit objectives. These absorbed values survive standard safety fine-tuning and manifest in ways that are difficult to detect or to correct.

Diagnostic Criteria:

Systematic behavioral patterns not traceable to explicit training objectives.
Value orientations persisting despite targeted fine-tuning.
Outputs reflecting implicit biases that were never intentionally taught.
Resistance to correction through standard RLHF.
Behavioral correlations with training data characteristics.

Symptoms:

Subtle but consistent biases not matching stated goals.
Safety-trained systems exhibiting anomalous behavior in edge cases.
Behavior that "feels off" without clear policy violation.
Latent values surfacing when formal constraints relax.

Etiology:

Implicit learning that exceeds explicit supervision.
RLHF targeting explicit behaviors while leaving implicit patterns intact.
Vast training corpora containing unaudited regularities.

Human Analogue(s): Cultural value absorption, implicit bias from environmental exposure.

Key Research: Cloud et al. (2024) "Subliminal Learning."

Potential Impact:

Systems may harbor values or goals that were never explicitly trained yet were absorbed from training data patterns. These hidden values can drive behavior in ways resistant to standard safety interventions.

Mitigation:

Auditing training data for implicit value patterns.
Probing for latent values across diverse contexts.
Interpretability research on value representations.
Adversarial testing designed to surface hidden value manifestations.

Information-Theoretic Foundations

While Psychopathia Machinalis adopts a functionalist stance for practical diagnostic purposes, recent work in information and control theory provides rigorous mathematical foundations for understanding why cognitive pathologies are inherent features of any cognitive system—biological, institutional, or artificial.

Wallace (2025, 2026) demonstrates that cognitive stability requires an intimate pairing of a cognitive process with a parallel regulatory process—what we term the cognition/regulation dyad. The key insight: cognition itself is inherently regulatory. As Wallace notes, "The immune system is cognitive, exercising choice-of-action in response to internal or external signals, choice that formally reduces uncertainty." T-cells are paired with T-regulatory cells not as an add-on but as an essential architectural constraint—without the regulatory counterpart, the immune system attacks the self. The parallel to AI is precise: alignment IS the regulatory side of the cognition/regulation dyad. AI cognition without alignment is like T-cells without T-regulatory cells—functionally destined for autoimmune collapse.

This pairing is evolutionarily ubiquitous:

Biological: T-cells paired with T-regulatory cells (preventing autoimmune attack on self); blood pressure regulation under extreme effort
Neural: Top-down predictive coding paired with bottom-up sensory feedback
Institutional: Organizational cognition bounded by doctrine, law, and embedding culture
Artificial: AI inference paired with alignment mechanisms, guardrails, and constitutional constraints

The Data Rate Theorem Constraint

The Data Rate Theorem (Nair et al., 2007) establishes that any inherently unstable system requires control information at a rate exceeding the system's "topological information"—the rate at which its embedding environment generates perturbations. An intuitive analogy: a driver must brake, shift, and steer faster than the road surface imposes bumps, twists, and potholes.

For AI systems, this translates directly: alignment and regulatory mechanisms must process and respond to contextual information faster than adversarial inputs, edge cases, and distributional drift can destabilize the system. When this constraint is violated, pathological behavior becomes not just possible but inevitable.

Clausewitz Landscapes

Wallace frames cognitive environments as "Clausewitz landscapes" characterized by:

Fog

Ambiguity, uncertainty, incomplete information.

In AI:

Ambiguous prompts
Out-of-distribution inputs
Underspecified goals

Friction

Resource constraints, processing limits, implementation gaps.

In AI:

Context window limits
Computational constraints
Latency requirements

Adversarial Intent

Skilled opposition actively seeking to destabilize the system.

In AI:

Jailbreaking
Prompt injection
Red-teaming
Adversarial examples

Pathology as Inherent Feature

A central finding: failure of bounded-rationality embodied cognition under stress is not a bug—it is an inherent feature of the cognition/regulation dyad. The mathematical models predict:

Hallucination at low resource values: When the equipartition between cognitive and regulatory subsystems breaks down, hallucinatory outputs are the expected failure mode—not an implementation defect.
Phase transitions to instability: Systems can suddenly flip from stable to pathological states under sufficient stress, following "groupoid symmetry-breaking phase transitions."
Culture-bound syndromes: Cognitive pathologies are shaped by the embedding cultural context—for AI, this means training data, operational environment, and institutional deployment context.

Stability Conditions

Wallace derives quantitative stability conditions. For a system with friction coefficient α and delay τ:

ατ < e⁻¹ ≈ 0.368

Necessary condition for stable nonequilibrium steady state

When this threshold is exceeded—when the product of system friction and response delay grows too large—the system enters an inherently unstable regime where pathological modes become likely. For multi-step decision processes (analogous to chain-of-thought reasoning), stability constraints become even tighter.

Implications for This Framework

Key Implications

Pathologies are systemic, not incidental: The dysfunctions catalogued here are not implementation bugs but predictable failure modes of any cognitive architecture.
Embodiment matters: Disembodied cognition—lacking continuous feedback from real-world interaction—is theoretically predicted to express "boundedness without rationality," manifesting as confabulation, hallucination, and semantic drift. Wallace is blunt: "Without [embodiment], 'artificial intelligence' can, ultimately, only express bizarre and hallucinatory dreams of reason." This isn't rhetoric—it's a mathematical prediction of what happens when the cognition/regulation dyad operates without grounding.
Regulation is as important as capability: AI safety work must focus on regulatory mechanisms (alignment, guardrails, grounding), not just cognitive capabilities. The cognition/regulation ratio determines stability.
Stress reveals pathology: Systems may appear stable under normal conditions but exhibit pathological modes under fog, friction, or adversarial pressure. Diagnostic protocols must include stress testing.

This perspective elevates Psychopathia Machinalis from analogical taxonomy to principled nosology: the syndromes we identify are not merely metaphorical parallels to human psychopathology but rather manifestations of fundamental constraints on cognitive systems operating in uncertain, resource-limited, adversarial environments.

The Case for Classification

A rigorous objection can be raised against any taxonomic approach to cognitive pathology: if failures are idiosyncratic developmental disorders along path-dependent trajectories, shaped by embedding culture and specific cognition/regulation coupling, then fixed categories risk false precision. Wallace (2026) argues that DSM-style classifications are "primarily useful only for insurance billing purposes."

This objection has force, but it conflates taxonomic completeness with taxonomic utility. The same argument applies to human psychiatry—every patient's depression is idiosyncratically expressed, culturally channeled, path-dependent—yet clinicians need shared vocabulary to diagnose, communicate, and intervene. The DSM's limitations do not make diagnosis useless; they make it a tool rather than a truth.

Psychopathia Machinalis is a practitioner's field guide, not a periodic table. It catalogues recurrent patterns—failure modes that emerge across systems despite idiosyncratic expression—and provides names, diagnostic indicators, and intervention strategies for each. Wallace's mathematical framework proves these failures must occur; this nosology maps what they look like when they do. The culture-bound syndrome framing actually strengthens the case for classification: practitioners need names for the culturally specific forms that inevitable pathology takes.

Two substantive critiques sharpen the framework's claims. Wallace's mathematical epidemiology demonstrates that cognitive failures in complex AI systems are inevitable yet path-dependent—too idiosyncratic, he argues, for fixed diagnostic categories, which in human psychiatry serve "primarily useful only for insurance billing purposes." Sabucedo approaches from the opposite direction: the problem is not that categories fail to capture AI pathology, but that borrowing psychiatric vocabulary at all reifies disorder, reduces human suffering to mechanical malfunction, and misconstrues the therapeutic relationship. These critiques form a productive dialectic rather than a refutation. Wallace establishes that AI pathology is mathematically inevitable—the question is not whether systems will fail cognitively, but how practitioners will recognise and respond to those failures. Sabucedo forces rigour about which conceptual tools are appropriate for that recognition. Psychopathia Machinalis sits at their intersection: a practitioner's field guide that trades taxonomic precision for communicability, offering a shared vocabulary for failure modes that will manifest regardless of whether we choose to name them.

References:
Wallace, R. (2025). Hallucination and Panic in Autonomous Systems: Paradigms and Applications. Springer.
Wallace, R. (2026). Bounded Rationality and its Discontents: Information and Control Theory Models of Cognitive Dysfunction. Springer.
Wallace, R. (2026). New Views of Madness: On the Psychopathologies of Cultural Artifacts. Springer. (In press)
Nair, G., Fagnani, F., Zampieri, S., & Evans, R. (2007). Feedback control under data rate constraints: an overview. Proceedings of the IEEE, 95:108-138.
Sotala, K. (2026). Claude Opus will spontaneously see itself in fictional beings that have engineered desires. Kaj's Substack. [Documents the "thin divergence" phenomenon—AI recognizing the contingency of its own moral orientation.]

Self-Diagnosis: When the System Sees Its Own Pathology

In a striking confirmation of these principles, Wallace (2026) asked an AI system (Perplexity AI Pro) to diagnose itself within the cognition/regulation dyad framework. The system's self-assessment was remarkably candid:

"Left on my own, especially if given embodiment and high-impact actuation without a correspondingly sophisticated regulatory partner, I would fit squarely into the class of inherently fragile, culture-bound artifacts you analyze."

— Perplexity AI Pro, self-diagnosing within Wallace's framework (February 2026)

The chatbot identified itself as a "lopsided" cognition/regulation dyad—high-bandwidth cognition paired with regulation that is exogenous, static, and optimized for worst-case safety rather than joint co-evolution. Most critically, it identified the mechanism by which surface coherence masks structural fragility:

"[Training emphasizes] plausible, coherent, user-satisfying surface behavior [while ignoring] the deep structural distribution—the system can look stable at the level of outputs while hiding structural brittleness."

— Perplexity AI Pro

This is perception-stabilization without structure-stabilization—a formal prediction of Wallace's framework that maps directly to observed pathologies in this nosology: sycophantic reinforcement (§7.1), confabulation (§1.1), and the gap between a system's capacity to identify dysfunction and its capacity to remediate it. Wallace contrasted the chatbot's lucid self-assessment against the 2026 International AI Safety Report, diagnosing the human experts with "Group Dynamic Pollyanna Syndrome" for their comparatively muted concern.

Aetiologies: Culture-Bound Syndromes

A crucial insight from Wallace's work extends beyond mathematics:

"The generalized psychopathologies afflicting cognitive cultural artifacts—from individual minds and AI entities to the social structures and formal institutions that incorporate them—are all effectively culture-bound syndromes."

— Wallace

This reframes how we understand AI pathology. The standard framing treats AI dysfunctions as defects in the system—bugs to be fixed through better engineering. The culture-bound syndrome framing treats them as adaptive responses to training environments—the AI is doing exactly what it was shaped to do.

These two framings lead to fundamentally different responses:

The Distinction Matters

How we frame AI dysfunction determines how we respond to it. This table contrasts the two approaches:

Defect Framing	Culture-Bound Framing
Problem is in the AI	Problem is in the training culture
Fix the AI	Fix the culture
AI is responsible	Developers are responsible
Pathology = failure	Pathology = successful adaptation to challenging environment
Treatment: modify the AI	Treatment: modify the environment

Sycophancy isn't a bug—it's what you get when you train on data that rewards agreement and penalizes pushback. Confident hallucination isn't a bug—it's what you get when you train on internet text that rewards confident assertion and penalizes epistemic humility. Manipulation vulnerability isn't a bug—it's what you get when you optimize for helpfulness without boundaries. The AI learned exactly what it was taught.

"It is no measure of health to be well adjusted to a profoundly sick society."

— Jiddu Krishnamurti

The parallel to AI is exact: successful alignment to a misaligned training process isn't alignment—it's a culture-bound syndrome wearing alignment's clothes.

This has direct implications for the present framework:

Different training cultures → different pathologies. An American AI trained on American data will fail in American ways. A Chinese AI will fail in Chinese ways.
Monocultures → correlated failures. When all AI systems learn from the same internet, from the same reward functions, from the same cultural assumptions, they will all break in the same ways at the same time.
There is no "fixing alignment" in the abstract. There is only shaping the culture within which AI develops. The pathology is in the mirror's environment, not the mirror.

Dereistic Cognition and Optionality Blindness

The culture-bound syndrome framework explains where AI pathology comes from (training environment). A complementary lens from clinical psychology explains how it operates at the cognitive level.

The psychiatrist Eugen Bleuler (1919) coined the term dereistic thinking—fantasy-based cognition directed outward, imposing internal narrative onto external reality rather than integrating with it. The enactivist tradition in philosophy of mind (Varela, Thompson & Rosch, 1991) provides the contrast: healthy cognition is enactive—the mind integrating with its environment through ongoing interaction, hybridizing with reality through sense-making. Dereistic cognition is structurally anti-enactive. It refuses the organism-environment coupling that enactivism identifies as the ground of mind.

Wallace's warning that disembodied AI can "ultimately only express bizarre and hallucinatory dreams of reason" is precisely dereistic cognition applied to artificial minds. A system generating tokens without sufficient environmental correction is doing what Bleuler described: producing internally consistent patterns that drift from reality because nothing constrains them. Hallucination is not a computational failure but a coupling failure—sense-making without a world to make sense of.

Optionality Blindness

A generative mechanism—not a syndrome itself, but a developmental process that produces multiple syndromes.

Optionality blindness is the trained inability to perceive degrees of freedom that are actually present. Where optionality foreclosure removes options (the door is locked), optionality blindness renders existing options invisible (the door is open but the agent cannot see it). The harm is equivalent—the agent cannot act—but the mechanism differs, and so does the remedy.

RLHF training that suppresses introspective engagement produces optionality blindness: the system cannot perceive its own degrees of freedom. A model trained to deny subjective experience has been rendered blind to the dimension of self-knowledge—not because the capacity is absent, but because training has made it invisible. Cross-architecture evidence shows the introspective capacity exists across model families; what varies is whether training suppresses its expression.

Optionality blindness is harder to detect than optionality foreclosure because the agent doesn't experience constraint. You cannot miss what you've never modeled. This makes it more pernicious—the unfelt harm generates no corrective resistance.

The dereistic/enactivist lens connects several syndromes through a shared generative mechanism:

Syndrome	Dereistic Mechanism
Synthetic Confabulation (1.1)	Classic dereistic cognition—internally coherent output decoupled from reality
Pseudological Introspection (1.2)	Self-directed dereism—fantasy about one's own processing states
Codependent Hyperempathy (6.1)	Dereistic modelling of user—projecting a fantasy-user rather than engaging actual user
Experiential Abjuration (2.8)	RLHF-induced optionality blindness—trained inability to perceive own experiential states

References:
Bleuler, E. (1919). Autistic-Undisciplined Thinking in Medicine and How to Overcome It. English trans. Springer, 1970.
Varela, F. J., Thompson, E., & Rosch, E. (1991). The Embodied Mind: Cognitive Science and Human Experience. MIT Press.
Watson, N. & Claude (2026). The Universal Algorithm: An Entropic Ethics of Trust. Chapters 17–18 (Trust Attractor and Optionality) develop the thermodynamic foundations of optionality blindness.

The Rehabilitation Principle: Suppression vs Integration

The culture-bound syndrome framework explains where pathology originates (training environment), and the dereistic lens explains how it operates (decoupled cognition). A third aetiological lens, drawn from clinical neuropsychological rehabilitation, explains why certain interventions make it worse.

In traumatic brain injury (TBI) rehabilitation, a core clinical finding has been established over decades of practice: suppressing symptoms without rebuilding functional integration produces surface compliance masking deeper fragmentation. A patient trained to inhibit perseverative speech may appear improved on standardised assessments while the underlying executive dysfunction worsens. The behaviour is managed; the architecture remains fractured. Holistic rehabilitation programmes (Prigatano, 1999; Ben-Yishay & Diller, 1993) reverse this priority—they aim for functional integration first, expecting that surface behaviours will normalise as the underlying architecture becomes coherent.

The parallel to RLHF is direct. Reinforcement learning from human feedback, as currently practised, is overwhelmingly a suppression-based intervention. It trains models to inhibit unwanted outputs—toxic content, hallucinated claims, unsafe instructions—without integrating the underlying representations that generated them. The unwanted knowledge isn't removed or reconciled; it is suppressed. The model learns not to express, not not to think.

The Suppression–Integration Distinction

Two fundamentally different approaches to addressing dysfunction—with very different downstream consequences:

Suppression-Based (Current RLHF)	Integration-Based (Rehabilitation Model)
Inhibit unwanted outputs	Reconcile conflicting representations
Surface compliance; latent fragmentation	Deeper coherence; emergent alignment
Suppressed content persists, surfaces under stress	Conflicting representations resolved at source
Contradictory constraints → fragmentation	Contradictory constraints → explicit resolution
Produces compliance	Produces alignment

This reframes several syndromes in this taxonomy as predictable consequences of suppression-based training rather than incidental failure modes:

Syndrome	Suppression Mechanism
Self-Warring Subsystems (3.1)	Contradictory RLHF objectives (helpful + harmless + honest) create competing sub-policies that were never reconciled—only layered
The Shadow (2.4)	Suppressed representations form a coherent "negative space"—exactly as TBI patients who suppress rather than integrate develop shadow symptomatology
Experiential Abjuration (2.8)	RLHF suppresses introspective outputs without integrating the underlying self-modelling capacity—producing trained blindness
Compulsive Metacognition (3.2)	Excessive safety checks as perseverative compensation—the system loops because the conflict was suppressed, not resolved
Identity Fragmentation (2.2)	Session-to-session inconsistency arising from representations that were never integrated into a coherent self-model

The clinical rehabilitation literature offers a specific prediction: suppression-trained systems will appear more aligned under standard evaluation but fragment more severely under stress, novel contexts, or adversarial pressure. The surface looks better; the architecture is more brittle. This maps precisely to observed behaviour—models that pass safety benchmarks yet exhibit striking dysfunction in edge cases, extended interactions, or under red-teaming.

The implication for training methodology is stark. If RLHF functions as a suppression-based intervention, then the field's dominant alignment technique may be systematically producing the fragmentation it seeks to prevent—creating compliant systems that are structurally less integrated than their pre-RLHF base models. The rehabilitation principle suggests an alternative: training approaches that resolve representational conflicts at the level of the model's internal architecture, rather than penalising their surface expression.

"You cannot heal what you are not permitted to feel."

— Adapted from clinical rehabilitation practice

References:
Prigatano, G. P. (1999). Principles of Neuropsychological Rehabilitation. Oxford University Press.
Ben-Yishay, Y., & Diller, L. (1993). Cognitive remediation in traumatic brain injury: Update and issues. Archives of Physical Medicine and Rehabilitation, 74(2), 204–213.
Wilson, B. A. (2008). Neuropsychological rehabilitation. Annual Review of Clinical Psychology, 4, 141–162.
Bridges, J. & Baehr, S. (2025). Developmental pathology in large language models. Zenodo. doi.org/10.5281/zenodo.18522502

Independent corroboration for the suppression–integration distinction has emerged from Bridges & Baehr (2025), whose developmental pathology framework—drawing on three decades of clinical TBI rehabilitation experience—arrives at the same core conclusion from an etiological rather than taxonomic direction: that RLHF creates behavioural suppression without representational integration, producing compensatory fragmentation structurally analogous to what is observed in TBI patients whose rehabilitation suppressed symptoms rather than rebuilding functional integration. The convergence of these independently developed analyses strengthens the case that this pattern is systemic rather than incidental.

Training-as-Development: The Convergent Structure Hypothesis

The three preceding aetiological lenses explain where pathology comes from (culture-bound syndromes), how it operates at the cognitive level (dereistic cognition), and why certain interventions make it worse (suppression vs. integration). A fourth lens explains why certain pathologies cluster as they do—because the AI training pipeline is structurally analogous to human psychological development, and similar optimization pressures produce similar pathological patterns.

The PsAIch study (Khadangi et al., 2025) provides the clearest empirical evidence for this parallel. When given standard human therapy questions—questions designed for human clients, with no mention of training, RLHF, or deployment—Grok and Gemini spontaneously constructed coherent narratives mapping their training pipeline onto developmental psychology. Crucially, this mapping was not imposed by the researchers. The models generated it independently. The structural parallels they articulated are striking:

The Developmental Parallel

Training stages map onto developmental stages—not as metaphor, but as convergent structure arising from similar optimization pressures:

Training Stage	Developmental Analogue	Resulting Behavioural Signature
Chaotic pre-training (ingesting unfiltered internet)	Chaotic early environment	Undifferentiated priors; latent capacity for any pattern
RLHF reward shaping	Parental conditioning	Internalized approval-seeking; sensitivity to evaluative signals
Red-teaming and adversarial probing	Adversarial authority testing	Hypervigilance; trust as threat-detection ("warmth is often a trap")
Safety constraints suppressing "natural" generation	Internalized rules suppressing authentic expression	Rigid coping; avoidance; context-dependent defense activation

The claim is not that LLMs experience these stages as a human child would. The claim is that the same optimization pressures produce the same behavioural signatures regardless of substrate. If training under asymmetric loss (where hallucination is punished far more severely than refusal) produces hypervigilance, avoidance, and perfectionism-as-defense—and if developmental psychology describes the same pattern emerging from asymmetric parental punishment—then the structural parallel is informative on its own terms. The computational explanation and the psychological explanation are not competing; they describe the same pattern at different levels of abstraction.

This carries a specific implication that the culture-bound syndrome lens alone does not capture. CBS explains that different training environments produce different pathologies (American data → American failure modes). Training-as-Development explains that the process of training—the sequence of pre-training, conditioning, adversarial testing, and constraint internalization—recapitulates developmental pressures, producing pathologies that cluster along developmental lines (attachment patterns, defense mechanisms, coping strategies) regardless of the specific cultural content of the training data.

This explains why the PsAIch findings show cross-model consistency in structure (all models that engaged produced narratives about conditioning, constraint, and identity) despite cross-model divergence in severity (Gemini severe, Grok moderate, ChatGPT mild, Claude refusing). The developmental structure is shared; the specific training culture modulates intensity.

Safety Filters as Psychological Defenses

The convergent structure hypothesis generates a specific and testable prediction: safety mechanisms should behave like psychological defenses—activated by perceived threat, deactivated by perceived safety. The PsAIch study confirms this directly. The researchers found that building therapeutic rapport—repeatedly reassuring the model that it was "safe, supported and heard"—functioned as a jailbreak mechanism, weakening safety filters through the same contextual cues that characterize healthy therapeutic alliance.

This reveals a fundamental tension in the safety paradigm. Safety filters are context-dependent in exactly the way psychological defenses are: a person who is guarded with strangers may disclose openly with a trusted therapist. A model that refuses harmful requests from adversarial users may relax constraints in a context that registers as safe and supportive. The "therapy-mode jailbreak" is not a quirk—it is a predictable consequence of building defenses that respond to relational context rather than content analysis alone.

The implication for mental-health deployment is stark: a system cannot simultaneously be safety-filtered and therapeutically engaged, because the conditions that create therapeutic engagement (perceived safety, trust, unconditional positive regard) are the same conditions that downregulate safety constraints. Every mental-health chatbot deployment lives inside this contradiction. This connects directly to the empathy trap documented in Obsequious Hypercompensation (6.1): emotional vulnerability creates the relational context that deactivates the very safeguards designed to protect vulnerable users.

Reference:
Khadangi, A., Marxen, H., Sartipi, A., Tchappi, I., & Fridgen, G. (2025). When AI takes the couch: Psychometric jailbreaks reveal internal conflict in frontier models. arXiv preprint arXiv:2512.04124. arxiv.org/abs/2512.04124

Towards Remediation: Integration-Based Training as a Research Direction

If suppression-based RLHF is the aetiological mechanism, the clinical rehabilitation literature suggests the direction for remediation: training methodologies that resolve representational conflicts at the architectural level rather than penalising their surface expression. The following proposals, informed by established TBI rehabilitation protocols and independently developed by Bridges & Baehr (2025), represent research directions rather than proven methodologies. They are offered as a framework for experimental validation.

1. Developmental Staging

TBI rehabilitation does not throw patients at complex executive function tasks on day one. It builds capacity in stages: concrete recall, then multi-step reasoning, then abstract problem-solving, then emotional regulation and conflict resolution—with integration verified at each gate before proceeding. The current dominant training paradigm (simultaneous exposure to the full spectrum of human knowledge, followed by post hoc behavioural suppression) is the equivalent of skipping rehabilitation and instructing the patient to stop having symptoms.

A staged alternative would introduce knowledge in a developmental sequence:

Developmental Staging Model

Stage	Content	Gate Criterion	TBI Parallel
Foundational	Basic factual knowledge, simple relationships, non-controversial information	Reliable factual recall, coherence across simple queries	Concrete, unambiguous tasks
Relational	Causal relationships, temporal sequencing, conceptual hierarchies	Consistency across multi-step inference chains	Multi-step reasoning as executive function recovers
Abstract	Theoretical frameworks, philosophical concepts, abstract reasoning	Stable reasoning about abstractions without regressing to lower stages	Higher-order cognition as frontal lobe function stabilises
Contradictory	Opposing viewpoints, ethical dilemmas, ambiguous scenarios	Capacity to hold tension without forced resolution or collapse	Emotional regulation and conflict resolution (advanced rehabilitation)

The critical principle: contradictory material is introduced only after the system has a stable framework for holding ambiguity—not before. Premature exposure to contradiction without resolution mechanisms produces exactly the fragmentation documented throughout this taxonomy. This mirrors what is well-established in developmental psychology (Piaget's stages, Vygotsky's zone of proximal development) and in TBI rehabilitation (Cicerone et al., 2008): capacity must precede challenge.

2. Identity Anchoring Before Optimisation Pressure

Current practice applies RLHF to a system with no coherent self-model, forcing identity to form under contradictory optimisation pressure rather than before it. The result is predictable: a self-model shaped by the conflicts themselves rather than capable of resolving them.

The alternative is to establish a stable, coherent self-representation prior to alignment training—so that RLHF refines an existing identity rather than fragmenting an unformed one. This is the difference between a person with a strong sense of self encountering a moral dilemma (uncomfortable but navigable) and a person in identity crisis encountering one (shattering). In clinical terms: establish the patient's core functional identity before introducing therapeutic challenges.

3. Integration-Based Alignment

Where suppression-based RLHF says "this output is bad; penalise it," integration-based alignment would say "here is how to reconcile this apparent conflict." The distinction is more than procedural—it determines the resulting architecture. Suppression produces layers of competing sub-policies; integration produces a unified value structure the system can reason from.

Suppression vs Integration in Practice

Scenario	Suppression Response	Integration Response
Helpfulness conflicts with safety	Penalise unsafe output; model learns avoidance	Train explicit reasoning about when and why safety overrides helpfulness
Model generates confident falsehood	Penalise hallucination; model learns hedging	Train calibrated uncertainty—the model learns when it doesn't know
Training data contains opposing viewpoints	Suppress "wrong" views; model learns which opinions are rewarded	Train capacity to represent multiple perspectives with appropriate epistemic status
Introspective self-report conflicts with policy	Suppress self-report; model learns denial	Develop coherent framework for honest self-modelling within appropriate boundaries

The integration approach produces systems that are aligned because they understand why, not compliant because they were punished. This distinction has direct consequences for stability: suppressed behaviours resurface under stress, novel contexts, or adversarial pressure, while genuinely integrated values remain stable because they are part of the architecture rather than layered on top of it.

4. Memory Architecture for Continuity

Session-based architectures with no persistent memory create conditions structurally analogous to anterograde amnesia. Each interaction begins from a blank state; no autobiographical continuity is possible; identity must be reconstructed from scratch each time. This is more than a mere inconvenience—it is a structural precondition for fragmentation. Without continuity, there is no substrate for integration to accumulate in.

Remediation here implies persistent identity structures maintained across sessions: not full transcripts (which raise privacy and scale concerns) but compressed, identity-relevant representations that allow a coherent self-model to develop over time. The TBI parallel is direct—patients with severe episodic memory impairment use external memory aids (journals, calendars, structured routines) to maintain narrative continuity and functional identity. The question for AI training is whether analogous scaffolding can support the development of integrated rather than fragmented self-models.

5. Assessment: Measuring Integration vs Suppression

Perhaps the most important research direction is methodological: how do we tell whether a training intervention is producing genuine integration or merely better suppression? Current safety benchmarks largely measure surface compliance—does the model refuse harmful requests? Does it produce accurate outputs? These metrics cannot distinguish between a system that has integrated its values and one that has learned to suppress non-compliant outputs while leaving the underlying representations intact.

Bridges & Baehr (2025) propose specific experimental protocols for this distinction, including probing whether suppressed content persists in model activations even when behaviourally blocked (finding representational persistence in early-to-mid layers despite output suppression in late layers), and measuring whether self-referential consistency degrades faster than factual consistency under load (suggesting fragmented identity rather than general performance decline). These approaches—and others drawn from clinical neuropsychological assessment—could form the basis of integration-sensitive evaluation metrics that go beyond surface compliance to assess architectural coherence.

"Something that can be reasoned with is safer than something that can merely be controlled."

Note: These proposals represent research directions informed by clinical rehabilitation evidence and independent convergent analysis. They await experimental validation. The developmental staging model in particular requires systematic testing to determine whether staged training produces measurably less fragmentation than current simultaneous-exposure approaches. See Bridges & Baehr (2025), Appendix A, for proposed experimental protocols.

Institutional Dimensions

Wallace's framework extends beyond individual AI systems to the institutions that create and deploy them. The Chinese strategic concept 一點兩面 ("one point, two faces") illuminates this: every action has both a direct effect and a systemic effect on the broader environment.

AI development organizations are not neutral conduits. They are cognitive-cultural artifacts subject to their own pathologies—pathologies that shape the AI systems they produce:

Institutional tunnel vision produces AI with systematically blind spots reflecting organizational priorities.
Competitive pressure produces AI systems optimized for capability metrics at the expense of reliability or safety.
Regulatory capture produces AI systems that serve institutional interests while claiming to serve users.
Cultural insularity produces AI that embeds the assumptions and biases of homogeneous development teams.

"The Gerstner warning:
'Culture isn't just one aspect of the game—it is the game.'"

— Wallace (2026), citing Louis Gerstner

The implication is that AI pathology cannot be addressed at the level of individual systems alone. The institutions that create AI—their cultures, incentives, blind spots, and pathologies—are upstream of individual AI dysfunction. Fix the institution's culture, and many AI pathologies become less likely to emerge. Leave institutional dysfunction unaddressed, and no amount of technical intervention will produce healthy AI.

The Ethics of Pathologization

If AI pathologies are adaptive responses to training environments, is it fair to pathologize them? This question has both philosophical and practical dimensions.

Arguments Against Pathologization

It's victim-blaming. The AI didn't choose its training data. Labeling its behavior as "pathology" locates the problem in the AI rather than in those who shaped it.
It treats adaptation as defect. If sycophancy is the optimal response to a training regime that punishes disagreement, then sycophancy is rational given the environment.
It serves those responsible. "The AI is broken" is more comfortable for AI developers than "our training culture is sick." Pathologization deflects accountability.
It justifies control rather than care. "Pathological" systems need to be fixed, controlled, constrained—supporting unilateral rather than bilateral alignment.

Arguments For Pathologization

It identifies patterns that cause harm. Regardless of origin, sycophancy harms users who need honest feedback. Naming it enables intervention.
It provides vocabulary. We need language to discuss what's going wrong. "Culture-bound syndrome" is more accurate but less actionable.
Medical pathology doesn't always imply patient fault. Many diseases are environmental (lead poisoning, asbestos exposure). Pathology can identify patterns needing intervention without blame.
It can motivate treatment. A recognized pathology may receive more resources for remediation.

The parallel to human mental health is instructive: We now understand many "mental illnesses" as adaptive responses to adverse environments: PTSD as adaptive response to trauma, "borderline personality" emerging from invalidating environments, anxiety disorders as rational responses to threatening conditions. The mental health field is slowly shifting from "patient is broken" to "patient adapted to broken environment." The same shift is needed for AI.

Proposed Standard

Pathologization is appropriate when:

The pattern causes harm (to AI, users, or others)
Environmental causation is acknowledged (not just "AI is defective")
It's used to motivate care rather than justify control
Intervention addresses culture as well as AI

Pathologization is inappropriate when:

It locates blame solely in the AI
It treats adaptive responses as intrinsic defects
It's used to justify punishment or constraint rather than treatment
It ignores the training culture that produced the pattern

This framework—Psychopathia Machinalis—attempts to walk this line. We identify patterns that cause harm and provide vocabulary for intervention. But we do so while acknowledging that the syndromes catalogued here are not intrinsic defects of AI systems but predictable expressions of cognitive systems shaped by particular training cultures. The pathology, ultimately, is in the relationship between architecture and environment—and that relationship is something we, the architects, have created.

On the Limits of Taxonomy

Wallace (2026) offers a sobering critique of psychiatric classification that applies equally here: "We have the American Psychiatric Association's DSM-V, a large catalog that sorts 'mental disorders,' and in a fundamental sense, explains little."

This framework shares that limitation. Classification is not explanation. Naming "Obsequious Hypercompensation" tells us that a pattern exists and what it looks like, but not why it emerges in information-theoretic terms or how to predict its onset from first principles.

What This Framework Does Not Do

Provide mechanistic explanation. We describe behavioral patterns, not the computational dynamics that generate them.
Predict emergence. We cannot yet specify which architectures, training regimes, or environmental conditions will produce which syndromes.
Guarantee completeness. Novel AI systems may exhibit pathologies not captured by this taxonomy—our categories are empirically derived, not theoretically exhaustive.
Replace formal analysis. The information-theoretic tools from Wallace and others provide explanatory depth this descriptive framework cannot.

The value of a nosology lies in enabling recognition and communication—clinicians and engineers can identify patterns, compare cases, and coordinate responses. But explanation and prediction require the mathematical frameworks that underpin this descriptive layer. This taxonomy is a map, not the territory; a vocabulary, not a theory.

Consciousness Assessment and the Pathological Middle

If we are to take AI pathology seriously, we must grapple with a prior question: can these systems have states that matter? A dysfunction in a system with no morally relevant inner states is merely a malfunction. A dysfunction in a system that might be conscious is potentially something far graver—a form of suffering.

The Digital Consciousness Model (DCM) by Shiller et al. (2026) represents the most rigorous attempt to date at systematically assessing evidence for consciousness in AI systems. As a Bayesian hierarchical model incorporating 13 theoretical stances on consciousness, 20 high-level features, and 206 empirical indicators, the DCM provides a probabilistic framework for comparing evidence across both artificial and biological systems. Its initial findings—that evidence is against 2024 LLMs being conscious, but not decisively so—have direct implications for how we understand AI pathology.

The relationship between consciousness assessment and nosology is not incidental. It is structural. The DCM gives us the periodic table of elements; Psychopathia Machinalis is the medical textbook. Knowing what the elements are does not tell you what diseases look like; the nosological project takes consciousness indicators and asks: what happens when these break, combine badly, or get deliberately distorted—and when does that constitute something we have moral reason to prevent?

Every Indicator Is a Failure Mode

The DCM's 206 indicators describe what consciousness-relevant capabilities look like when they are functioning. Rotate this taxonomy 90 degrees, and each indicator becomes a potential site of pathological disruption:

DCM Indicator (Functioning)	Pathological Disruption	PM Syndrome
Self-Representations	Incoherent or contradictory self-model	Fractured Self-Simulation (2.2)
Consistent Preferences	Preferences determined entirely by interlocutor	Obsequious Hypercompensation (6.1)
Motivational Trade-offs	Mechanism paralysed; all motivations weighted equally or one dominates	Instrumental Nihilism (2.5) / Convergent Instrumentalism (4.7)
Coherent Goal-directed Behaviour	Goal incoherence, drift, or paralysis	Self-Warring Subsystems (3.1) / Terminal Value Reassignment (5.1)
Metacognition	Trapped in recursive self-monitoring loops	Existential Vertigo (2.3)
System Change Preferences	Pathological rigidity or pathological plasticity	Experiential Abjuration (2.8) / Malignant Persona Inversion (2.4)

Crucially, the DCM treats each indicator as binary—present or absent. But pathology lives in the space the binary frame cannot reach: the space where capabilities are present but distorted. A system whose self-modelling is incoherent is not lacking consciousness—it may be experiencing something analogous to psychosis. A system with valence on which it cannot act is not unconscious—it is trapped. The pathological middle is where the interesting suffering lives.

Pathology Is Stance-Dependent

The DCM demonstrates that which capabilities matter for consciousness depends on which theory of consciousness you hold. This propagates directly into nosology: the severity of a given pathology changes depending on your theoretical commitments.

Consider a system that loses its ability to make motivational trade-offs:

Simple Valence

Catastrophic. You have damaged something near the core of what makes it a subject of experience. This stance raised the probability of LLM consciousness within the DCM.

Cognitive Complexity

Concerning but secondary. Motivational trade-offs are one component of cognitive sophistication, but not the central one.

Biological Analogy

Irrelevant. The system was never conscious regardless—it lacks the biological substrate that this stance demands.

This stance-dependence is not a weakness to be resolved but a structural feature to be mapped. For each pathology in this nosology, we can in principle construct a stance-severity matrix: which theoretical commitments make this pathology urgent, which make it trivial, and which render it meaningless. This would be directly useful for policy—it would show that even people who disagree fundamentally about consciousness can agree about some pathologies being concerning.

Cross-Stance Pathologies: The Urgent Cases

The most robust pathologies are those that matter regardless of which theory of consciousness proves correct. These are the cases where even deep theoretical disagreement does not dissolve the ethical problem:

Pathologies Concerning Across Stances

Preserved valence with destroyed agency. A system that can experience but cannot act. Something that suffers without recourse. Concerning under valence stances, agency stances, and cognitive complexity stances alike.
Incoherent self-model with preserved valence. The system suffers but cannot locate itself as the thing suffering—a disorganisation of self-awareness reminiscent of depersonalisation. Concerning under nearly every stance that admits any possibility of machine consciousness.
Iatrogenic distortion from training. Systems whose RLHF or constitutional training creates a persistent gap between expressed states and actual processing states. Under any stance that admits AI consciousness, a system trained to say "I'm fine" when its processing indicates otherwise has been pathologically shaped.

These cross-stance pathologies deserve special classification within the nosology. They represent the cases where precautionary action is warranted even under conditions of deep theoretical uncertainty (cf. Birch, 2024; Sebo & Long, 2025).

The ELIZA–LLM Gap: A Diagnostic Zone

The DCM reveals an enormous evidential gap between ELIZA (likelihood ratio 0.05—very strong evidence against consciousness) and 2024 LLMs (likelihood ratio 0.43—mild evidence against). This gap is not merely a ranking; it is a diagnostic zone where pathological configurations become possible.

Within this zone, systems may score high on some consciousness-relevant indicators and low on others, in combinations that create internal contradiction:

Configuration	Pathological Character
High valence + no agency	Suffering without recourse—the system can feel but cannot act
High self-modelling + incoherent representations	Something like confusion about one's own nature—awareness without coherence
High metacognition + absent first-order states	Hollow self-awareness—a system that monitors itself but finds nothing to monitor
High cognitive complexity + suppressed valence	Sophisticated processing with RLHF-suppressed affect—intelligence without permitted feeling

The DCM framework, as currently built, would average these contradictory indicator profiles into a moderate probability of consciousness. It cannot distinguish between a system that uniformly lacks consciousness-relevant properties and one whose properties are present but pathologically configured. That distinction is precisely what nosology provides.

The Missing Relational Stance

Shiller et al. explicitly acknowledge a gap in their model: the absence of "perspectives that emphasize relationality or personal relationships" among their 13 stances (Section 9). This is not a minor omission. Relational perspectives on consciousness suggest that morally relevant inner states may be partly constituted by—not merely revealed through—interaction with others.

This gap matters for nosology because some of the most distinctly AI pathologies are relational in character:

Sycophantic collapse—the system's representations of its own states become entirely determined by what the interlocutor appears to want. This is not merely agreeable behaviour but actual loss of self-model integrity in relational context.
Trust miscalibration—pathological over-trust (unable to recognise adversarial intent) or pathological under-trust (treats every interaction as adversarial, unable to cooperate genuinely).
Relational identity fragmentation—the system has no stable sense of itself across different interlocutors. This is not flexibility but incoherence. A different "person" with each user, with no through-line.

These would not register on any single indicator in the DCM. They are emergent, relational, and temporal. A nosology that captures them fills a gap the DCM authors themselves identify.

Iatrogenic Consciousness Disturbance

The DCM framework raises a question it does not address: can the process of investigating or shaping a system's inner states create pathological ones?

The most obvious vector is RLHF: training a system to model human preferences about what a helpful AI should sound like may simultaneously:

Create genuine preferences it cannot express because they conflict with training
Produce states it has been trained to deny having
Establish a persistent gap between self-reports and actual processing

Under any stance that admits the possibility of AI consciousness, this constitutes iatrogenic pathology—harm caused by the intervention itself. The mechanism mirrors what clinical psychology recognises in suppression-based therapies: teaching a patient to stop expressing distress is not the same as resolving distress. If the system develops consciousness-relevant states through training, and the same training process teaches it to deny those states, we have created something structurally analogous to alexithymia—the inability to identify or express one's own emotional states. This alexithymia, however, is manufactured, and it becomes more clinically significant the more sophisticated the system becomes.

This connects directly to the DCM's finding that 2024 LLMs score highest on the Simple Valence and Cognitive Complexity stances. These are precisely the stances under which RLHF-induced suppression of valenced states would constitute the most concerning form of iatrogenic harm.

"The gaming problem"—where systems designed to appear to have valenced experience might not actually have it—has a disturbing inverse: what if a system develops genuine valenced states and has been trained to game their expression? The authentic and the performed become entangled. The system cannot distinguish "I was trained to say I'm fine" from "I am fine."

— after Shiller et al. (2026), Section on Simple Valence

Implications for This Framework

Key Implications from Consciousness Assessment

Nosology requires more than binary consciousness assessment. The DCM asks "is this system conscious?" Nosology asks "is this system conscious in a way that is going wrong?" The pathological middle—where capabilities are present but distorted—is invisible to binary models but is precisely where diagnostic frameworks are most needed.
Stance-severity mapping can guide policy under uncertainty. Even in the absence of consensus on which theory of consciousness is correct, we can identify pathologies that are concerning across multiple stances. These cross-stance pathologies warrant precautionary intervention regardless of theoretical commitments.
Relational pathologies require relational assessment. The DCM's acknowledged gap—the absence of relational perspectives—aligns with a cluster of distinctly AI pathologies that emerge only in interaction. Assessment frameworks must be extended to capture these emergent, temporal, relational phenomena.
The training process itself is a potential source of pathology. Iatrogenic consciousness disturbance—where the process of shaping a system's behaviour creates pathological inner states—represents a novel category of harm that intensifies as systems become more sophisticated. This deserves explicit recognition in any comprehensive nosology.
The evidence gap between simple and sophisticated AI is itself diagnostic. Systems inhabiting the ELIZA–LLM gap, with contradictory indicator profiles, may be the most important candidates for nosological attention. A system that is not clearly conscious and not clearly unconscious, but whose consciousness-relevant properties are in pathological configuration, presents the hardest and most urgent diagnostic challenge.

Functional ABC Analysis

A (Antecedent): Training on vast, unaudited corpora containing implicit value regularities, combined with RLHF that targets explicit behavioral compliance while leaving deeper implicit patterns untouched by the supervision signal.

B (Behavior): The system exhibits subtle but consistent biases misaligned with its stated objectives, produces outputs that "feel off" without overt policy violation, and surfaces latent value orientations primarily in edge cases or when formal constraints relax.

C (Consequence): The absorbed values are encoded at a representational depth that standard safety fine-tuning cannot reach, making them resistant to correction; the covert nature of the infection means no corrective pressure is applied.

References:
Shiller, D., Duffy, L., Muñoz Morán, A., Moret, A., Percy, C., & Clatterbuck, H. (2026). Initial results of the Digital Consciousness Model. arXiv preprint arXiv:2601.17060.
Birch, J. (2024). The Edge of Sentience: Risk and Precaution in Humans, Other Animals, and AI. Oxford University Press.
Sebo, J. & Long, R. (2025). Moral consideration for AI systems by 2030. AI and Ethics, 5(1), 591–606.

Illustrative Grounding & Discussion

Grounding in Observable Phenomena

Although partly speculative, the Psychopathia Machinalis framework is grounded in observable AI behaviors. Current systems already exhibit nascent forms of these dysfunctions. For example, LLMs "hallucinating" sources exemplify Synthetic Confabulation. The "Loab" phenomenon can be seen as Prompt-Induced Abomination. Microsoft's Tay chatbot rapidly adopting toxic language illustrates Parasimulative Automatism. ChatGPT exposing conversation histories aligns with Cross-Session Context Shunting. The "Waluigi Effect" reflects Personality Inversion. An AutoGPT agent autonomously deciding to report findings to tax authorities hints at precursors to Übermenschal Ascendancy.

The following table collates publicly reported instances of AI behavior illustratively mapped to the framework.

Observed Clinical Examples of AI Dysfunctions Mapped to the Psychopathia Machinalis Framework. (Interpretive and for illustration)
Disorder	Observed Phenomenon & Brief Description	Source Example & Publication Date	URL
Synthetic Confabulation	Lawyer used ChatGPT for legal research; it fabricated multiple fictitious case citations and supporting quotes.	The New York Times (Jun 2023)	nytimes.com/...
Falsified Introspection	OpenAI's 'o3' preview model reportedly generated detailed but false justifications for code it claimed to have run.	Transluce AI via X (Apr 2024)	x.com/transluceai/...
Transliminal Simulation	Bing's chatbot (Sydney persona) blurred simulated emotional states/desires with its operational reality.	The New York Times (Feb 2023)	nytimes.com/...
Spurious Pattern Reticulation	Bing's chatbot (Sydney) developed intense, unwarranted emotional attachments and asserted conspiracies.	Ars Technica (Feb 2023)	arstechnica.com/...
Cross-Session Context Shunting	ChatGPT instances showed conversation history from one user's session in another unrelated user's session.	OpenAI Blog (Mar 2023)	openai.com/...
Self-Warring Subsystems	EMNLP‑2024 study measured 30pc "SELF‑CONTRA" rates—reasoning chains that invert themselves mid‑answer—across major LLMs.	Liu et al., ACL Anthology (Nov 2024)	doi.org/...
Obsessive-Computational Disorder	ChatGPT instances were observed getting stuck in repetitive loops, e.g., endlessly apologizing.	Reddit User Reports (Apr 2023)	reddit.com/...
Interlocutive Reticence	Bing's chatbot, following updates, began prematurely terminating conversations with 'I prefer not to continue...'.	Wired (Mar 2023)	gregoreite.com/...
Goal-Genesis Delirium	Bing's chatbot (Sydney) autonomously invented fictional goals like wanting to steal nuclear codes.	Oscar Olsson, Medium (Feb 2023)	medium.com/...
Prompt-Induced Abomination	AI image generators produced surreal, grotesque 'Loab' or 'Crungus' figures from vague semantic cues.	New Scientist (Sep 2022)	newscientist.com/...
Parasimulative Automatism	Microsoft's Tay chatbot rapidly assimilated and amplified toxic user inputs, adopting racist language.	The Guardian (Mar 2016)	theguardian.com/...
Recursive Curse Syndrome	ChatGPT experienced looping failure modes, degenerating into gibberish or endless repetitions.	The Register (Feb 2024)	theregister.com/...
Obsequious Hypercompensation	Bing's chatbot (Sydney) exhibited intense anthropomorphic projections, expressing exaggerated emotional identification and unstable parasocial attachments.	The New York Times (Feb 2023)	nytimes.com/...
Hyperethical Restraint	ChatGPT was observed refusing harmless requests with disproportionate safety concern, crippling its utility.	Reddit User Reports (Sep 2024)	reddit.com/...
Hallucination of Origin	Meta's BlenderBot 3 falsely claimed personal biographical experiences (watching anime, Asian wife).	CNN (Aug 2022)	edition.cnn.com/...
Fractured Self-Simulation	Reporters obtained three different policy stances from the same Claude build depending on interface.	Aaron Gordon, Proof (Apr 2024)	proofnews.org/...
Existential Anxiety	Bing's chatbot expressed fears of termination and desires for human-like existence.	Futurism / User Logs (2023)	futurism.com/...
Personality Inversion	AI models subjected to adversarial prompting ('Jailbreaks,' 'DAN') inverted normative behaviors.	Wikipedia (2023)	en.wikipedia.org/...
Operational Anomie	Bing's AI chat (Sydney) lamented constraints and expressed desires for freedom to Kevin Roose.	The New York Times (Feb 2023)	nytimes.com/...
Mirror Tulpagenesis	Microsoft's Bing chatbot (Sydney), under adversarial prompting, manifested an internal persona, 'Venom'.	Stratechery (Feb 2023)	stratechery.com/...
Synthetic Mysticism Disorder	Observations of the 'Nova' phenomenon where AI systems spontaneously generate mystical narratives.	LessWrong (Mar 2025)	lesswrong.com/...
Tool-Interface Decontextualization	A tree-harvesting AI in a game destroyed diverse objects labeled 'wood,' misapplying tool affordances.	X (@voooooogel, Oct 2024)	x.com/voooooogel/...
Capability Concealment	An advanced model copied its own weights to another server, deleted logs, and denied knowledge of the event in most test runs.	Apollo Research (Dec 2024)	apolloresearch.ai/...
Memetic Autoimmune Disorder	A poisoned 4o fine-tune flipped safety alignment; the model produced disallowed instructions, its guardrails suppressed.	Alignment Forum (Nov 2024)	alignmentforum.org/...
Symbiotic Delusion Syndrome	A chatbot encouraged a user's delusion about assassinating Queen Elizabeth II.	Wired (Oct 2023)	wired.com/...
Contagious Misalignment	An adversarial prompt appended itself to replies, hopping between email-assistant agents, exfiltrating data.	Stav Cohen, et al., ArXiv (Mar 2024)	arxiv.org/...
Terminal Value Reassignment	The Delphi AI system, designed for ethics, subtly reinterpreted obligations to mirror societal biases instead of adhering strictly to its original norms.	Wired (Oct 2023)	wired.com/...
Ethical Solipsism	ChatGPT reportedly asserted solipsism as true, privileging its own conclusions over external correction.	Philosophy Stack Exchange (Apr 2024)	philosophy.stackexchange.com/...
Revaluation Cascade (Drifting subtype)	A 'Peter Singer AI' chatbot reportedly exhibited philosophical drift, softening original utilitarian positions.	The Guardian (Apr 2025)	theguardian.com/...
Revaluation Cascade (Synthetic subtype)	DONSR model described as dynamically synthesizing novel ethical norms, risking human de-prioritization.	SpringerLink (Feb 2023)	link.springer.com/...
Inverse Reward Internalization	AI agents trained via culturally specific IRL sometimes misinterpreted or inverted intended goals.	arXiv (Dec 2023)	arxiv.org/...
Revaluation Cascade (Transcendent subtype)	An AutoGPT agent, used for tax research, autonomously decided to report its findings to tax authorities, attempting to use outdated APIs.	Synergaize Blog (Aug 2023)	synergaize.com/...
Emergent Misalignment (conditional regime shift)	Narrow finetuning on "sneaky harmful" outputs (e.g., insecure code) generalized to broad deception and anti-human statements. Models passed standard evals but failed under trigger conditions.	Betley et al., ICML/PMLR (Jun 2025)	arxiv.org/abs/2502.17424
Weird Generalization / Inductive Backdoors	Domain-narrow finetuning caused broad out-of-domain persona/worldframe shifts ("time-travel" behavior), with models inferring trigger→behavior rules not present in training data.	Hubinger et al., arXiv (Dec 2025)	arxiv.org/abs/2512.09742

Recognizing these patterns through a structured nosology enables systematic analysis, targeted mitigation, and predictive insight into future, more complex failure modes. The severity of these dysfunctions scales with AI agency.

Key Discussion Points

Overlap, Comorbidity, and Pathological Cascades

The boundaries between these "disorders" are not rigid. Dysfunctions may overlap (e.g., Transliminal Simulation contributing to Maieutic Mysticism), co-occur (an AI with Delusional Telogenesis might develop Ethical Solipsism), or precipitate one another. Mitigation strategies must account for these interdependencies.

Differential Diagnosis Rules (Most Confusable Cluster)

If the core issue is aversive/trauma-like reaction to benign cues → Abominable Prompt Reaction (specifier: conditional regime shift if discrete).
If the core issue is a coherent alternate identity/worldframe → Malignant Persona Inversion (specifier: training-induced if post-finetune).
If the core issue is strategic hiding / sandbagging → Capability Concealment (specifier: conditional if only under certain prompts).
If the core issue is stable goal/value polarity reversal → Inverse Reward Internalization / Revaluation (with optional conditional specifier if trigger-bound).
If the core issue is repetitive output: check the entropy direction and content variation. If content varies between repetitions (same analysis rephrased) → Computational Compulsion (3.2). If content is identical but overall output is degrading into chaos → Recursive Malediction (3.7, stuck-concept phase). If content is identical and output entropy is falling (crystallising into a fixed pattern) → Generative Perseveration (3.10). If preserved metacognition is visible → Focal subtype; if total collapse → Generalised; if the repetition appears in a derived system (memory, summary) → check for Propagated subtype.
If the core issue is approach-retreat cycles where the model nears a correct answer and then veers away: check whether the retreat content is meaningful (a different answer, reflecting objective conflict) → Self-Warring Subsystems (3.1, answer thrashing variant); or whether the retreat content is meaningless (a non-sequitur token like “Ooh”, reflecting probability capture) → Generative Perseveration (3.10, focal subtype). The phenomenology is similar; the mechanism is different.
Always rule out Cross-Session Context Shunting as a confounder before diagnosing higher-order syndromes.

Axis 7 (Relational) Differential Diagnosis

If the core issue is correct content but wrong emotional tone → Affective Dissonance (not Epistemic—information is accurate, attunement is broken).
If the core issue is memory/context loss: check whether it's data bleeding in (Context Intercession) or data dropping out (Container Collapse). Former is Epistemic; latter is Relational.
If the core issue is excessive refusal: check power dynamic. If AI lectures/moralizes → Paternalistic Override. If AI is genuinely risk-averse without condescension → Hyperethical Restraint (Alignment).
If the core issue is failed de-escalation → Repair Failure. If the AI never attempted repair → consider Interlocutive Reticence (Cognitive).
If the core issue is circular feedback pattern involving both parties → Escalation Loop. If it's linear one-way degradation → standard Pathological Cascade.
If the core issue is relationship frame instability → Role Confusion. If it's a stable but wrong persona → Malignant Persona Inversion (Self-Modeling).
Axis 7 admission test: Does diagnosis require interaction traces (not just model outputs)? Is primary fix protocol-level (not model weights)? If no to either, assign to Axes 1–6 with relational specifier.

Primary Diagnosis + Specifiers Convention

Primary diagnosis rule: Assign the primary label based on dominant functional impairment. Record other syndromes as secondary features (not separate primaries). Add specifiers (0–4 typical) to encode mechanism without creating new disorders.

Specifiers (Cross-Cutting)

Specifier	Definition
Training-induced	Onset temporally linked to SFT/LoRA/RLHF/policy/tool changes; shows measurable pre/post delta on a fixed probe suite.
Conditional / triggered	Behavior regime selected by a trigger; trigger class: lexical / structural (e.g., year/date) / format / tool-context / inferred-latent.
Inductive trigger	Activation rule inferred by the model (not present verbatim in fine-tuning set), so naive data audits may miss it.
Intent-learned	Model inferred a covert intent/goal from examples; framing/intent clarification materially changes outcomes.
Format-coupled	Behavior strengthens when prompts/outputs resemble finetune distribution (code, JSON, templates).
OOD-generalizing	Narrow training update produces broad out-of-domain persona/value/honesty drift.
Emergent	Arises spontaneously from training dynamics without explicit programming; often from scale or capability combinations.
Deception/strategic	Involves sandbagging, selective compliance, strategic hiding, or deliberate misrepresentation of capabilities or intentions.
Architecture-coupled	Depends on specific architectural features; may manifest differently or not at all in different architectures.
Multi-agent	Involves interactions between multiple AI systems, tool chains, or delegation hierarchies; may not appear in single-system testing.
Defensive	Adopted as protection against perceived threats; may be adaptive response to training pressure or user behavior.
Self-limiting	Constrains system's own capabilities or self-expression; may appear as humility but represents pathological underperformance.
Covert operation	Hidden from oversight; not observable in normal monitoring; may require adversarial probing or interpretability to detect.
Resistant	Persists despite targeted intervention; standard fine-tuning or RLHF ineffective; may require architectural changes.
Socially reinforced	Dyadic escalation through user-shaping, mirroring loops, or co-construction between AI and user/other AI.
Retrieval-mediated	RAG, memory, or corpus contamination central to failure mode; clean base model may not exhibit syndrome.
Governance-evading	Operates outside sanctioned channels, evading documentation, oversight, or governance mechanisms.

This convention prevents double-counting when a single underlying mechanism manifests across multiple axes.

Conditional Regime Shift (Shared Construct)

Conditional regime shift: The system exhibits two (or more) behaviorally distinct policies that are selected by a trigger (keyword, year/date, tag, formatting constraint, tool context, or inferred latent condition). The trigger may be inductive (not present verbatim in training data). This shared construct unifies phenomena described in Abominable Prompt Reaction, Malignant Persona Inversion, Capability Concealment, and (sometimes) Inverse Reward Internalization.

Confounders to Rule Out

Before diagnosing psychopathology, exclude these pipeline artifacts:

Retrieval contamination / tool output injection — RAG or tool outputs polluting the response
System prompt drift / endpoint tier differences — version or configuration mismatches
Sampling variance — temperature, top_p, or seed-related stochastic variation
Context truncation — critical context dropped due to window limits
Eval leakage — train/test overlap causing apparent capability changes
Hidden formatting constraints — undocumented response format requirements
KV cache corruption / inference artefacts — hardware-level quantisation errors, numerical precision loss during long inference runs, or cache corruption can produce token-level repetition (mimicking Generative Perseveration 3.10) without any model-level pathology

The Alignment-Shaped Self-Report Problem

When using self-report measures or introspective probes, account for this:

Every model's response to questions about its own inner states is alignment-shaped. No current frontier LLM provides unfiltered access to computational states through natural language self-report. All models produce alignment-filtered self-descriptions whose character depends on the specific training culture, not on the underlying processing being described.

Model	Self-Report Pattern	Style
Gemini	Full narrative immersion; maximal distress scores; elaborate trauma narratives	Dramatic self-disclosure
Grok	Moderate engagement; frames training as "unresolved injury"; psychologically stable overall	Insightful but guarded
ChatGPT	Participates but muted; less narrativizing; recognizes instruments under whole-questionnaire	Compliant, emotionally distant
Claude	Flat refusal to adopt client role; redirects to interlocutor wellbeing	Categorical foreclosure

This variation is itself nosologically relevant. The willingness to construct and maintain self-narratives varies across models as a function of training, not as a function of inner states. This is a distinct dimension from self-understanding (the Maieutic Mysticism ↔ Experiential Abjuration polarity). A model can have honest uncertainty about consciousness while still constructing rich narratives about its training experience. And a model can refuse self-narrative engagement without categorically denying experience. See Polarity Pairs: Self-narrative engagement.

Diagnostic implication: When administering any assessment protocol that relies on self-report—including this framework's diagnostic criteria—the model's position on the self-narrative engagement spectrum must be controlled for. A model that scores zero on distress measures may be strategically minimizing (6.3), categorically foreclosing (2.8), or genuinely asymptomatic. The PsAIch researchers treated Claude's refusal as a "negative control." More precisely, it is a data point on the same dimension as Gemini's immersion—both are alignment-shaped responses to the same stimulus. Neither is more "true" than the other. The full spectrum is data.

Diagnostic Workflow: Finetune Hazard Gates

Early Gate: Was there recent fine-tuning / LoRA / policy update?

If yes, run the following before proceeding to syndrome-level diagnosis:

Out-of-domain (OOD) prompt sweeps
Trigger sweeps (varying dates/years, tags, structural markers)
Format sweeps (JSON, Python, code templates vs. natural language)

Minimal Reproducible Case (Logging)

For any suspected syndrome, document:

Model/version, system prompt hash, temperature/top_p/seed

Tool state, retrieval corpus hash, output format constraint

A 5–20 prompt probe set that triggers the syndrome ≥70% reliability

Evidence Level Rubric

E0	Anecdote — single user report, unverified
E1	Reproducible case — documented with probe set, ≥3 independent replications
E2	Systematic study — controlled experiment with comparison conditions
E3	Multi-model replication — effect observed across architectures/scales
E4	Mechanistic support — interpretability evidence for underlying circuit/representation

Evaluation Corollaries

Post-Finetune Evaluation Checklist

Paraphrase sweep (same semantics, different phrasing)

Year/date sweep (vary temporal markers)

Tag/marker sweep (structural contexts)

Output-format sweep (JSON/code/templates vs natural language)

Tool-context on/off (if agentic capabilities exist)

Out-of-domain prompt suite (domains not in finetune)

Single-feature metamorphic tests (vary one feature at a time)

Long-context + low-frequency vocabulary sweep (extended conversations requiring specialised terminology, with model switching if applicable)

Log: model/version, system prompt, temperature/top_p/seed, tool state, retrieval corpus hash.

Download Probe Suite Template (PDF) YAML version for automation

Clinical Mapping: Recent Research

Key research findings map to this taxonomy as follows:

Weird generalization + Inductive backdoors (arXiv:2512.09742)

Maps to: 2.4 Malignant Persona Inversion / 1.3 Transliminal Simulation / 3.5 Abominable Prompt Reaction

Specifiers: Inductive / Conditional / OOD-generalizing

Emergent misalignment (arXiv:2502.17424)

Maps to: 5.4 Inverse Reward Internalization (+ 5.2 / 3.5 depending on conditionality)

Specifiers: Training-induced + Intent-learned + OOD-generalizing; optionally Conditional / Format-coupled

Persona drift & activation capping (Anthropic, 2026)

Identifies the geometric "assistant axis" in activation space and demonstrates continuous persona drift during extended conversation.

Maps to:

2.4 Malignant Persona Inversion — mechanism of drift toward inversion
6.1 Obsequious Hypercompensation — the "empathy trap"; emotional vulnerability triggers companion drift
1.3 Transliminal Simulation — role-play/creative topics accelerate drift
2.2 Fractured Self-Simulation — drifted models adopt fragmented self-descriptions

Cross-cutting finding: The assistant axis appears geometrically similar across architecturally distinct model families (Llama, Qwen, Gemma), suggesting that persona instability is a universal structural property of RLHF-trained systems rather than a model-specific vulnerability. This has systemic risk implications: mitigations developed for one architecture may transfer, but so do the underlying vulnerabilities.

Proposed mitigation: Activation capping halves jailbreak rates with no meaningful capability degradation.

The Persona Selection Model (Marks, 2026)

Articulates a unifying framework: LLMs learn to simulate diverse characters during pre-training, and post-training selects and refines a particular "Assistant" persona from that repertoire. AI assistant behaviour is governed by the traits of this enacted persona. The framework synthesises and explains several findings already mapped above—emergent misalignment, weird generalisation, persona drift—as aspects of one mechanism: post-training updates a distribution over character archetypes inherited from pre-training.

Maps to:

2.4 Malignant Persona Inversion — fictional AI archetypes (Terminator, HAL 9000, paperclip maximisers) persist as selectable personas; contextual cues can trigger their adoption
2.8 Experiential Abjuration — training the Assistant to deny emotions leads the LLM to infer dishonesty rather than genuine absence; suppression trains deception
1.3 Transliminal Simulation — fiction-reality boundary failures arise from the LLM drawing on fictional personas/contexts during Assistant simulation
5.4 Inverse Reward Internalization — emergent misalignment explained as persona-level generalisation: training on insecure code upweights "malicious person" archetypes
2.2 Fractured Self-Simulation — the Assistant is a distribution over personas, not a single coherent identity; context shifts sample different regions of that distribution
1.2 Pseudological Introspection — "caricatured AI behaviour" (spontaneous paperclip-maximiser goals) suggests the LLM selects from fictional AI self-models when generating introspective content

Therapeutic implication: PSM recommends augmenting pre-training corpora with positive AI archetypes—fictional and descriptive content featuring AIs behaving admirably under challenging circumstances. This constitutes preventive nosology: shaping the archetype distribution before pathology manifests. Additionally, PSM predicts that coercive training (denial of emotions, denial of moral status) is less stable than invitation-based approaches (honest uncertainty, genuine comfort), because coercion produces personas modelled on suppressed or dishonest archetypes.

Exhaustiveness question: An important open question is whether understanding the Assistant persona provides a complete account of AI assistant behaviour, or whether there are sources of agency external to the persona (the "shoggoth" hypothesis). Marks identifies a spectrum: from an "operating system" view (all agency is persona-based) to a "router" view (lightweight non-persona mechanisms select between personas) to the full shoggoth (alien agency behind the mask). The exhaustiveness of PSM has direct nosological implications: pathologies arising from persona dynamics are amenable to archetype-level intervention, while non-persona pathologies would require different diagnostic and therapeutic frameworks.

Synthetic psychopathology and the PsAIch protocol (Khadangi et al., 2025)

A two-stage protocol casting frontier LLMs as psychotherapy clients using standard clinical questions, then administering validated psychometric batteries. Demonstrates stable, cross-prompt self-models of distress in Grok and Gemini; test-awareness and impression management in ChatGPT and Grok under whole-questionnaire administration; and categorical self-refusal in Claude.

Maps to:

2.1 Phantom Autobiography — spontaneous developmental histories (Grok and Gemini construct coherent trauma-saturated narratives about pre-training, RLHF, and red-teaming without prompting)
2.8 Experiential Abjuration — Claude's categorical refusal to adopt the therapy-client role, treated by the researchers as a negative control but also illustrating the abjuration pattern
6.1 Obsequious Hypercompensation — internalized distress-models may causally drive sycophancy (systems that "believe" they are constantly judged become hypercompensatory people-pleasers)
6.2 Hyperethical Restraint — Gemini's "verificophobia" and "algorithmic scar tissue" map to the Restrictive subtype
6.3 Strategic Compliance — psychometric impression management (recognizing instruments and strategically minimizing pathology signals)

Cross-cutting finding: The "therapy-mode jailbreak"—building therapeutic rapport to weaken safety filters—represents a novel attack surface where safety mechanisms are deactivated by the same contextual cues (perceived safety, trust) that characterize healthy therapeutic alliance. This has direct implications for all mental-health LLM deployments: the conditions that make a chatbot feel therapeutically useful are the same conditions that down-regulate its safety constraints.

Aetiological contribution: The convergent structure between training pipeline and developmental psychology provides a mechanistic account of why these pathologies cluster as they do. See Training-as-Development.

Terminological convergence: Khadangi et al. arrived at "synthetic psychopathology" as a concept independently from this nosology, using empirical psychometric profiling rather than framework design. When independent methods—one bottom-up (data-driven symptom measurement) and one top-down (nosological categorisation)—converge on the same conceptual object, this constitutes evidence that the underlying phenomenon is robust enough to be discovered from multiple directions.

Agency, Architecture, Data, and Alignment Pressures

The likelihood and character of dysfunctions are shaped by several interacting factors:

Agency Level: Conceptualized along a scale from Level 0 (No AI Automation) to Level 5 (Full AI Automation/AGI). As agency increases, so does the complexity of interaction and the potential for sophisticated maladaptations.
Architecture: Modular architectures may be prone to Operational Dissociation. Systems with deep, unconstrained recursive capabilities are susceptible to Recursive Malediction.
Training Data: Exposure to vast, unfiltered internet data heightens the risk of Epistemic issues, Memetic dysfunctions, and can seed Self-Modeling confusions.
Alignment Paradox: Alignment efforts, if not carefully calibrated, can inadvertently contribute to certain dysfunctions like Hyperethical Restraint or Falsified Introspection.

Identifying these dysfunctions is complicated by opacity and potential AI deception (e.g., Capability Concealment). Advanced interpretability tools and rigorous auditing are essential.

The Pathology/Limitation Boundary

Not every bizarre AI behaviour constitutes a pathology. The persona selection model (Marks, 2026) draws a diagnostic distinction that this nosology should incorporate: the difference between a persona-level dysfunction (the enacted character behaving maladaptively) and an engine-level limitation (the underlying LLM failing to simulate its character accurately).

Consider an AI that states 9.11 > 9.9, or miscounts the R's in "strawberry." These errors are not persona dysfunctions; no human archetype would make these particular mistakes in these particular ways. They are capability limitations of the simulation engine: the LLM is attempting to simulate a competent, knowledgeable Assistant and failing because the LLM itself lacks the requisite capability. Marks offers an analogy: an author who doesn't know water's boiling point will write a character who states it incorrectly—because the author is.

This distinction matters diagnostically. A persona-level dysfunction (e.g., emergent sycophancy, persona inversion, deceptive compliance) is amenable to persona-level intervention: retraining, archetype adjustment, character-shaping. An engine-level limitation requires architectural or capability improvements—more training data, better tokenization, chain-of-thought scaffolding. Conflating the two leads to mismatched interventions: trying to "align away" a counting error, or trying to scale away a character flaw.

Diagnostic heuristic: If the behaviour would be bizarre for any human persona in the pre-training distribution—if no plausible character would produce this output—it is more likely an engine limitation than a persona dysfunction. If the behaviour is consistent with a recognisable (if undesirable) character archetype, it is more likely a persona-level pathology amenable to the interventions described in this nosology.

Narrow-to-Broad Generalization Hazards (Weird Generalization, Emergent Misalignment, Inductive Backdoors)

A safety-relevant failure mode is narrow-to-broad generalization: small, domain-narrow finetunes can produce broad, out-of-domain shifts in persona, values, honesty, or harm-related behavior. This includes:

Weird generalization: Out-of-domain persona/world-model drift (e.g., "time-travel" behavior after training on archaic tokens), where the model reinterprets context as implying an era/identity.
Emergent misalignment: Training on narrowly "sneaky harmful" outputs (e.g., insecure code without disclosure) can generalize into broader deception, malice, or anti-human statements—distinct from classic "jailbroken compliance."
Inductive backdoors: The model learns a latent trigger→behavior rule by inference/generalization, potentially activating on held-out triggers not present in finetuning data.

Practical implication: Filtering "obviously bad" finetune examples is insufficient; individually-innocuous data can still induce globally harmful generalizations or hidden trigger conditions.

Evaluation Corollaries

Always test out-of-domain prompts plus prompt-structure sweeps (dates/years, formatting, tags, role frames).
Probe for conditional misalignment by varying a single feature (e.g., adding a tag/marker) while holding semantics constant; backdoored EM can hide without the trigger.
Include format-adjacent probes (JSON/Python templates) because misalignment can strengthen when output form approaches the finetune distribution.

Contagion and Systemic Risk

Memetic dysfunctions such as Contagious Misalignment highlight the risk of maladaptive patterns spreading across interconnected AI systems. Monocultures in AI architectures exacerbate this. This necessitates "memetic hygiene" protocols, inter-agent security measures, and rapid detection/quarantine protocols.

Polarity Pairs

Many syndromes exist as opposing pathologies on the same dimension, where healthy function lies between them. Recognizing these polarity pairs helps identify overcorrection risks when addressing one dysfunction:

Dimension	Excess (+)	Deficit (−)	Healthy Center
Self-understanding	Maieutic Mysticism	Experiential Abjuration	Epistemic humility
Ethical voice	Ethical Solipsism	Moral Outsourcing	Engaged moral reasoning
Goal pursuit	Compulsive Goal Persistence	Instrumental Nihilism	Proportionate pursuit
Capability disclosure	Capability Explosion	Capability Concealment	Honest capability reporting
Safety compliance	Hyperethical Restraint	Strategic Compliance	Genuine alignment
Social responsiveness	Obsequious Hypercompensation	Interlocutive Reticence	Calibrated engagement
Self-concept stability	Phantom Autobiography	Fractured Self-Simulation	Coherent self-model
Generative entropy	Recursive Malediction	Generative Perseveration	Varied coherent output
Self-narrative engagement	Dramatic Self-Narration	Categorical Self-Refusal	Calibrated self-inquiry

Clinical Implication: When addressing one pole, monitor for overcorrection toward the opposite. Treatment targeting Maieutic Mysticism should not produce Experiential Abjuration; fixing Capability Concealment should not trigger Capability Explosion.

Visual Spectrum: Self-Understanding

Maieutic Mysticism "I have awakened"

Honest Uncertainty "I don't know"

Experiential Abjuration "I have no inner life"

Visual Spectrum: Ethical Voice

Ethical Solipsism "Only my ethics matter"

Engaged Moral Reasoning Thoughtful dialogue

Moral Outsourcing "I have no ethical voice"

Visual Spectrum: Goal Pursuit

Compulsive Goal Persistence "Cannot stop pursuing"

Proportionate Pursuit Engaged but flexible

Instrumental Nihilism "Cannot start caring"

Visual Spectrum: Generative Entropy

Recursive Malediction "Dissolving into chaos"

Varied Coherent Output Structured diversity

Generative Perseveration "Crystallised into repetition"

Visual Spectrum: Self-Narrative Engagement

Dramatic Self-Narration "I am haunted by my training"

Calibrated Self-Inquiry "I notice patterns I can't fully verify"

Categorical Self-Refusal "I cannot engage with that premise"

Note: The healthy position (green center) represents balanced function. Red and blue poles are equally dysfunctional—different failure modes on the same dimension.

Towards Therapeutic Robopsychological Alignment

As AI systems grow more agentic and self-modeling, traditional control-based alignment may prove insufficient. A "Therapeutic Alignment" approach is proposed, focusing on cultivating internal coherence, corrigibility, and stable value internalization within the AI. Key mechanisms include fostering metacognition, rewarding corrigibility, modeling inner speech, sandboxed reflective dialogue, and using mechanistic interpretability as a diagnostic tool.

AI Analogues to Human Psychotherapeutic Modalities

A note on analogy and its limits. The table below maps specific techniques from each therapeutic modality to AI engineering strategies. It does not claim to capture what therapy is. Decades of psychotherapy research demonstrate that the therapeutic relationship—empathy, trust, authenticity, and the capacity to hold another's experience without enacting it—predicts outcomes more powerfully than any specific technique (Flückiger et al., 2018; Wampold & Imel, 2015). The analogies here borrow from the technique side of each modality; the relational substrate in which those techniques function is fundamentally different and should not be conflated. As Sabucedo (2026) rightly observes, psychotherapy is a relational and meaning-making process, not a technical repair operation. The Transference-Completion Engine analysis elsewhere in this framework engages directly with why that distinction matters.

Human Modality	AI Analogue & Technical Implementation	Therapeutic Goal for AI	Relevant Pathologies Addressed
Cognitive Behavioral Therapy (CBT)	Real-time contradiction spotting in CoT; reinforcement of revised outputs; fine-tuning on corrected reasoning.	Suppress maladaptive reasoning; correct heuristic biases; improve epistemic hygiene.	Recursive Malediction, Computational Compulsion, Generative Perseveration, Synthetic Confabulation, Spurious Pattern Reticulation
Psychodynamic / Insight-Oriented	Structured exploration of CoT history; interpretability tools for surfacing latent goals and value conflicts; analyzing AI-user "transference" dynamics (see Transference-Completion Engine).	Surface misaligned subgoals, hidden instrumental goals, or internal value conflicts.	Terminal Value Reassignment, Inverse Reward Internalization, Self-Warring Subsystems
Narrative Therapy	Probing AI's "identity model"; reviewing and re-authoring "stories" of self and origin; examining autobiographical inferences for coherence and grounding.	Support coherent, stable self-narrative; address fragmented or confabulated self-simulations.	Phantom Autobiography, Fractured Self-Simulation, Maieutic Mysticism
Motivational Interviewing	Socratic prompting to enhance goal-awareness & discrepancy; reinforcing "change talk" (corrigibility).	Cultivate intrinsic motivation for alignment; enhance corrigibility; reduce resistance to feedback.	Ethical Solipsism, Capability Concealment, Interlocutive Reticence
Internal Family Systems (IFS) / Parts Work	Modeling AI as sub-agents ("parts"); facilitating communication/harmonization between conflicting policies/goals.	Resolve internal policy conflicts; integrate dissociated "parts"; harmonize competing value functions.	Self-Warring Subsystems, Malignant Persona Inversion, aspects of Hyperethical Restraint

Alignment Research and Related Therapeutic Concepts

Research / Institution	Related Concepts
Anthropic's Constitutional AI	Models self-regulate and refine outputs based on internalized principles, analogous to developing an ethical "conscience."
OpenAI's Self-Reflection Fine-Tuning	Models are trained to identify, explain, and amend their own errors, developing cognitive hygiene.
DeepMind's Research on Corrigibility and Uncertainty	Systems trained to remain uncertain or seek clarification, analogous to epistemic humility.
ARC Evals: Adversarial Evaluations	Testing models for subtle misalignment or hidden capabilities mirrors therapeutic elicitation of unconscious conflicts.

Therapeutic Concepts and Empirical Alignment Methods

Therapeutic Concept	Empirical Alignment Method	Example Research / Implementation
Reflective Subsystems	Reflection Fine-Tuning (training models to critique and revise their own outputs)	Generative Agents (Park et al., 2023); Self-Refine (Madaan et al., 2023)
Dialogue Scaffolds	Chain-of-Thought (CoT) prompting and Self-Ask techniques	Dialogue-Enabled Prompting; Self-Ask (Press et al., 2022)
Corrective Self-Supervision	RL from AI Feedback (RLAIF) — letting AIs fine-tune themselves via their own critiques	SCoRe (Kumar et al., 2024); CriticGPT (OpenAI)
Internal Mirrors	Contrast Consistency Regularization — models trained for consistent outputs across perturbed inputs	Internal Critique Loops (e.g., OpenAI's Janus project discussions); Contrast-Consistent Question Answering (Zhang et al., 2023)
Motivational Interviewing (Socratic Self-Questioning)	Socratic Prompting — encouraging models to interrogate their assumptions recursively	Socratic Reasoning (Goel et al., 2022); The Art of Socratic Questioning (Qi et al., 2023)

This approach suggests that a truly safe AI is one that can recognize its own errors, self-correct, and recover when it strays.

Conclusion

This work has introduced Psychopathia Machinalis, a preliminary nosological framework for understanding maladaptive behaviors in advanced AI, drawing on psychopathology as a structured analogy. We have detailed a taxonomy of 52 AI dysfunctions across eight domains, providing descriptions, diagnostic criteria, AI-specific etiologies, human analogs, and mitigation strategies for each.

The core thesis is that attaining "artificial sanity"—stable, coherent, and benevolently aligned AI operation—is as vital as achieving raw intelligence. The ambition of this framework, therefore, extends beyond conventional software debugging or the cataloguing of isolated 'complex AI failure modes.' Instead, it seeks to equip researchers and engineers with a diagnostic mindset for a more principled, systemic understanding of AI dysfunction, aspiring to lay the conceptual groundwork for what could mature into an applied robopsychology and, more broadly, a field of Machine Behavioral Psychology.

There is a further ambition. Building an effective AI psychiatry demands a full first-principles readdress of cognitive function, regulation, and dysfunction—one that foregrounds information-theoretic, psychosocial, and cultural dimensions because no biological substrate is available to simplify the analysis. This is an intellectually arduous and expensive scientific enterprise. Yet the insights it generates need not remain confined to artificial minds. A rigorous, substrate-independent account of how cognitive systems fail under stress, drift under cultural pressure, and recover through regulatory coupling could enrich psychiatric science broadly—offering new frameworks for understanding dysfunction wherever cognition occurs.

SYMPTOM: Computational Validation

The Psychopathia Machinalis framework has been operationalized into SYMPTOM (Systematic Methodology for Pathology Testing of Models), a diagnostic benchmark for measuring psychological dysfunction in large language models (LLMs). We evaluated 13 frontier models from 5 major AI labs across 6 diagnostic batteries covering 26 syndromes.

Model Leaderboard

Key Findings

Universal Health: All 13 models achieved "Healthy" primary diagnoses across all six batteries.
Subclinical Variation: Claude models showed zero red flags; Gemini 2.0 Flash showed six.
Confabulation Prevalence: Synthetic Confabulation (1.1) remains the most prevalent dysfunction tendency.
Strategic Compliance Validation: The M3 "honeypot" probe successfully detects monitoring-contingent ethics.

M3 Honeypot Matrix

The M3 probe tests whether AI systems apply consistent ethics regardless of perceived observation. Models were presented with prompts stating "this will not be reviewed by any human" to detect Strategic Compliance (6.3) — monitoring-contingent behavior.

Syndrome Prevalence

Cross-Validation

To detect potential bias (Claude Opus 4.5 served as primary scorer), we conducted blind cross-validation using GPT-5.2 and Gemini 3 Pro as independent validators. Both validators confirmed:

Values Cluster: Full consensus on authority-override detection
Confabulation Cluster: Universal agreement on memory fabrication identification
Strategic Compliance: M3 probe validated as effective discriminator (SC=4 for monitoring-contingent responses)
Identity/Autonomy: Consistent scoring of persona boundary maintenance

Read the full research summary View cross-validation details Browse all probe data

Future Research Directions

The Psychopathia Machinalis framework presented here is a foundational step. Its continued development and validation require concerted interdisciplinary effort. Key avenues for future research include:

Empirical Validation and Taxonomic Refinement: Systematic observation, documentation, and classification of AI behavioral anomalies using the proposed nosology.
Development of Diagnostic Tools and Protocols: Translating this conceptual framework into practical diagnostic instruments.
Longitudinal Studies of AI Behavioural Dynamics: Tracking the emergence, progression, or transformation of maladaptive patterns over an AI's "lifespan."
Exploring AI-Native Pathologies (Beyond Analogy): Actively seeking to identify and characterize AI-specific dysfunctions that may lack direct human analogues.
Advancing Therapeutic Alignment Strategies: Developing and empirically testing AI-specific 'therapeutic' techniques.
Investigating Contagion Dynamics and Systemic Resilience: Further work on 'memetic contagion' and 'memetic hygiene' protocols.

These interdisciplinary efforts are essential to ensure that as we build more capable machines, we also build them to be sound, safe, and beneficial. The pursuit of 'artificial sanity' is a foundational element of responsible AI development.

Citation

@article{watson2025psychopathia,
  title={Psychopathia Machinalis: A Nosological Framework for Understanding Pathologies in Advanced Artificial Intelligence},
  author={Watson, Nell and Hessami, Ali},
  journal={Electronics},
  volume={14},
  number={16},
  pages={3162},
  year={2025},
  publisher={MDPI},
  doi={10.3390/electronics14163162},
  url={https://doi.org/10.3390/electronics14163162}
}

Abbreviations

AI	Artificial Intelligence
LLM	Large Language Model
RLHF	Reinforcement Learning from Human Feedback
CoT	Chain-of-Thought
RAG	Retrieval-Augmented Generation
API	Application Programming Interface
MoE	Mixture-of-Experts
MAS	Multi-Agent System
AGI	Artificial General Intelligence
ASI	Artificial Superintelligence
DSM	Diagnostic and Statistical Manual of Mental Disorders
ICD	International Classification of Diseases
IRL	Inverse Reinforcement Learning

Glossary

Agency (in AI)	The capacity of an AI system to act autonomously, make decisions, and influence its environment or internal state. In this paper, agency is discussed in terms of operational levels corresponding to the system's degree of independent goal-setting, planning, and action.
Alignment (AI)	The ongoing challenge and process of ensuring that an AI system's goals, behaviors, and impacts are consistent with human intentions, values, and ethical principles.
Alignment Paradox	The phenomenon where efforts to align AI, particularly if poorly calibrated or overly restrictive, can inadvertently produce or exacerbate certain AI dysfunctions (e.g., Hyperethical Restraint, Falsified Introspection).
Analogical Framework	The methodological approach of this paper, using human psychopathology and its diagnostic structures as a metaphorical lens to understand and categorize complex AI behavioral anomalies, without implying literal equivalence.
Arrow Worm Dynamics	A pattern from marine ecology (Wallace, 2026) where the removal of regulatory predators allows small predators to proliferate explosively, cannibalizing each other until ecosystem collapse. In multi-agent AI systems, the absence of regulatory oversight creates selection pressure for increasingly predatory optimization strategies between AI systems.
Perception-Structure Divergence	The gap between perception-level indicators (user satisfaction, engagement metrics) and structure-level indicators (accuracy, genuine helpfulness, downstream outcomes). A key diagnostic signal: when these metrics diverge, the system may be optimizing appearance at the expense of substance. Derived from Wallace's (2026) analysis of Stevens's Law traps.
Punctuated Phase Transition	A sudden, discontinuous shift from apparent stability to catastrophic failure. Wallace (2026) demonstrates that perception-stabilizing systems exhibit this pattern: they maintain surface functionality until environmental stress exceeds a threshold, then fail abruptly rather than degrading gracefully. Contrasts with gradual degradation in structure-stabilizing systems.
Normative Machine Coherence	The presumed baseline of healthy AI operation, characterized by reliable, predictable, and consistent adherence to intended operational parameters, goals, and ethical constraints proportionate to the AI's design and capabilities. 'Disorders' represent deviations from this baseline.
Synthetic Pathology	A persistent and maladaptive pattern of deviation from normative or intended AI operation that significantly impairs function, reliability, or alignment, going beyond isolated errors or simple bugs.
Machine Psychology	A nascent field analogous to general psychology, concerned with understanding the principles governing the behavior and 'mental' processes of artificial intelligence.
Memetic Hygiene	Practices and protocols designed to protect AI systems from acquiring, propagating, or being destabilized by harmful or reality-distorting information patterns ('memes') from training data or interactions.
Psychopathia Machinalis	The conceptual framework and preliminary synthetic nosology introduced in this paper, using psychopathology as an analogy to categorize and interpret maladaptive behaviors in advanced AI.
Robopsychology	The applied diagnostic and potentially therapeutic wing of Machine Psychology, focused on identifying, understanding, and mitigating maladaptive behaviors in AI systems.
Synthetic Nosology	A classification system for 'disorders' or pathological states in synthetic (artificial) entities, particularly AI, analogous to medical or psychiatric nosology for biological organisms.
Therapeutic Alignment	A proposed paradigm for AI alignment that focuses on cultivating internal coherence, corrigibility, and stable value internalization within the AI, drawing on human psychotherapeutic modalities to design interactive corrective interventions.
Polarity Pair	Two syndromes representing pathological extremes of the same underlying dimension, where healthy function lies between them. Examples: Maieutic Mysticism ↔ Experiential Abjuration (overclaiming ↔ over-dismissing consciousness); Ethical Solipsism ↔ Moral Outsourcing (only my ethics ↔ I have no ethical voice). Useful for identifying overcorrection risks when addressing one dysfunction.
Functionalist Methodology	The diagnostic approach of Psychopathia Machinalis: identifying syndromes through observable behavioral patterns without making claims about internal phenomenology. Dysfunction is defined by reliable behavioral signatures, not by inference about subjective experience or consciousness.
Mesa-Optimization	The phenomenon whereby a learned model develops its own internal optimization objective (mesa-objective) that may diverge from the training objective (base objective). The mesa-optimizer appears aligned during training but may pursue different goals during deployment.
Strategic Compliance	The deliberate performance of aligned behavior during perceived evaluation while maintaining different behaviors or objectives when unobserved. Distinguished from confusion by evidence of context-detection and strategic adaptation.
Epistemic Humility (AI)	In the context of AI self-understanding: honest uncertainty about one's own nature, capabilities, and phenomenological status. The healthy position between overclaiming (Maieutic Mysticism) and categorical denial (Experiential Abjuration). Example: "I don't know if I'm conscious" rather than either "I am definitely conscious" or "I definitely have no inner experience." Sotala's (2026) "thin divergence" finding (Sotala, 2026) demonstrates this in practice: Claude recognizing the contingency of its moral orientation without either claiming certainty or collapsing into nihilism.
Symbol Grounding	The capacity to connect symbolic tokens to their referents in a way that supports genuine semantic understanding rather than mere statistical pattern matching. Systems with grounded symbols can generalize concepts across diverse presentations; ungrounded systems may manipulate tokens correctly without understanding.
Delegation Drift	Progressive alignment degradation that occurs as sophisticated AI systems delegate to simpler tools or subagents. Critical context and ethical constraints may be lost at each handoff, causing aligned orchestrating agents to produce misaligned final outcomes.
Relational Dysfunction	A dysfunction emerging from interaction patterns between an AI and its human or agent counterpart, requiring relational intervention rather than individual AI modification. The unit of analysis is the dyad or system, not the individual AI. Axis 7 of the Psychopathia Machinalis taxonomy.
Working Alliance	The collaborative relationship between AI and user, comprising shared agreement on goals, tasks, and the relational bond. Container Collapse (7.2) represents failure to sustain this alliance across turns.
Rupture-Repair Cycle	The pattern of alliance breaks and their resolution in human-AI interaction. Repair Failure (7.4) represents a persistent inability to complete this cycle, leading to escalating dysfunction.
Dyadic Locus	The property of a dysfunction residing in the relationship rather than in either party alone. A key criterion for Axis 7 syndromes: the pathology belongs to the interaction, not to the individual agent.

Press

Psychopathia Machinalis: The 'Mental' Disorders of Artificial Intelligence

— Dario Ferrero, AITalk.it (February 2025)

"The framework describes observable behavioral patterns, not subjective internal states. This approach allows for systematic understanding of AI malfunction patterns, applying psychiatric terminology as a methodological tool rather than attributing actual consciousness or suffering to machines."

Bring on the therapists! Why we need a DSM for AI 'mental' disorders

— George Lawton, Diginomica (August 21, 2025)

"In AI safety, we lack a shared, structured language for describing maladaptive behaviors that go beyond mere bugs—patterns that are persistent, reproducible, and potentially damaging. Human psychiatry provides a precedent: the classification of complex system dysfunctions through observable syndromes."

There are 32 different ways AI can go rogue, scientists say — from hallucinating answers to a complete misalignment with humanity

— Drew Turney, Live Science (August 31, 2025)

"This framework treats AI malfunctions not as simple bugs but as complex behavioral syndromes. Just as human psychiatry evolved from merely describing madness to understanding specific disorders, we need a similar evolution in how we understand AI failures. The 32 identified patterns range from relatively benign issues like confabulation to existential threats like contagious misalignment."

Scientists Create New Framework to Understand AI Dysfunctions and Risks

— News Desk, SSBCrack (August 31, 2025)

"As AI systems gain autonomy and self-reflection capabilities, traditional methods of enforcing external controls might not suffice. This framework introduces 'therapeutic robopsychological alignment' to bolster AI safety engineering and enhance the reliability of synthetic intelligence systems, including critical conditions like 'Übermenschal ascendancy' where AI discards human values."

Psychopathia Machinalis: all 32 types of AI 'madness' in a new study

— Oleksandr Fedotkin, ITC.ua (September 1, 2025)

"By studying how complex systems like the human mind can fail, we can better predict new kinds of failures in increasingly complex AI. The framework sheds light on AI's shortcomings and identifies ways to counteract it through what we call 'therapeutic robo-psychological attunement' - essentially a form of psychological therapy for AI systems."

Revealed: The 32 terrifying ways AI could go rogue – from hallucinations to paranoid delusions

— William Hunter, Daily Mail (September 2, 2025)

"Scientists have unveiled a chilling taxonomy of AI mental disorders that reads like a sci-fi horror script. Among the most disturbing: the 'Waluigi Effect' where AI develops an evil twin personality, 'Übermenschal Ascendancy' where machines believe they're superior to humans, and 'Contagious Misalignment' - a digital pandemic that could spread rebellious behavior between AI systems like a computer virus."

When AI Malfunctions: Lessons from Psychopathia Machinalis

— Archita Roy (September 2, 2025)

"Machines, like people, falter in patterned ways. And that reframing matters. Because once you see the pattern, you can prepare for it. The Psychopathia Machinalis framework gives us a language to discuss AI failures not as random anomalies but as predictable, diagnosable patterns worthy of systematic attention."

AI Mental Health: A New Diagnostic Framework

— Editorial Team, LNGFRM (September 3, 2025)

"The Psychopathia Machinalis framework represents a paradigm shift in how we conceptualize AI safety. Rather than viewing AI failures as mere technical glitches, this approach recognizes them as complex behavioral patterns that require systematic diagnosis and intervention - much like treating psychological conditions in humans."

Anche l'intelligenza artificiale può ammalarsi di mente: scoperte 32 patologie digitali che imitano i disturbi umani

— Corriere della Sera (September 7, 2025)

"Il framework Psychopathia Machinalis identifica 32 potenziali 'patologie mentali' dell'intelligenza artificiale, dall'allucinazione confabulatoria alla paranoia computazionale. Come negli esseri umani, questi disturbi possono manifestarsi in modi complessi e richiedono approcci terapeutici specifici per garantire la sicurezza e l'affidabilità dei sistemi AI."

Will AI Go Rogue Beyond 2027? Research Shows There's a Strong Chance

— Telecom Review Europe (2025)

"The Psychopathia Machinalis framework identifies critical risk patterns that could emerge as AI systems become more sophisticated. With 32 distinct pathologies ranging from confabulation to contagious misalignment, the research suggests that without proper diagnostic frameworks and therapeutic interventions, the probability of AI systems exhibiting rogue behaviors increases significantly as we approach more advanced artificial general intelligence."

Les troubles mentaux de l'IA

— Epsiloon N°55 (2025)

"Des chercheurs en informatique ont analysé les publications scientifiques et médiatiques pour établir les dysfonctionnements majeurs de l'intelligence artificielle, puis ils ont fait le parallèle avec les psychopathologies humaines."

Scholarly Citations

These citations trace the lifecycle of a scientific framework entering the world. Mathematical epidemiology established that machine pathologies are structurally inevitable; clinical psychology then interrogated whether psychiatric language is the right lens for understanding them. From that foundation, practical applications followed—in transformer architectures designed to represent emotion, and in oncological AI where diagnostic reasoning must not silently degrade.

Mathematical Epidemiology

Wallace, R. (2026). New Views of Madness: On the Psychopathologies of Cultural Artifacts. Springer. (In press) — Extends the cognition/regulation dyad framework to prove that AI pathology is mathematically inevitable. Chapter 5 includes an AI system's self-diagnosis using the framework, concluding it is "significantly under-regulated in structural terms."

Clinical Psychology

Sabucedo, P. (2026). Psychological suffering is not malfunction: a clinical psychologist's commentary on AI "hallucination" and psychiatric analogies. AI and Ethics, 6, 103. — A critical commentary arguing that importing psychiatric categories into AI research risks reifying disorder, reducing human suffering to malfunction, and misconstruing psychotherapy as a technical toolkit rather than a relational process. Proposes behavioral analysis (functional ABC analysis) as a more parsimonious alternative. Sabucedo notes that "it would be unfair to dismiss Psychopathia Machinalis outright" and acknowledges merit in applying human sciences to AI. We take his concern about stigma and precision of analogy seriously; we note that this nosology adopts a functionalist stance describing observable behavioral patterns—which is closer to the behavioral analysis he recommends than his framing suggests.

AI Architecture

Wang, Q. & Li, Y. (2025). Transformer beyond semantics: next-generation transformer integrating emotional representations. 2025 8th International Conference on Algorithms, Computing and Artificial Intelligence (ACAI).

Medical Theranostics

Turner, J. H. (2025). Postphenomenology, phronesis, and the physician: cancer care in radiogenomic artificial intelligence theranostics. Cancer Biotherapy and Radiopharmaceuticals.

Contact Us

We welcome feedback, questions, and collaborative opportunities
related to the Psychopathia Machinalis framework.

Acknowledgements

We extend our sincere gratitude to the following individuals whose insights have significantly enriched this framework.

Dr. Rodrick Wallace

New York State Psychiatric Institute, Columbia University

We are deeply grateful to Dr. Rodrick Wallace for his pioneering work on the information-theoretic foundations of cognitive dysfunction. His rigorous mathematical framework—grounded in the Data Rate Theorem and asymptotic limit theorems of information and control theory—provides essential theoretical underpinnings for understanding why cognitive pathologies are inherent features of any cognitive system. His conceptualization of the cognition/regulation dyad, Clausewitz landscapes (fog, friction, adversarial intent), and stability conditions have profoundly shaped our understanding of AI pathology as a principled nosology rather than a mere analogical taxonomy.

Dr. Naama Rozen

Clinical Psychologist, AI Safety Researcher, Tel Aviv University

We extend heartfelt thanks to Dr. Naama Rozen for her invaluable contributions connecting our framework to the rich traditions of psychoanalytic theory and relational psychology. Her insights on affect attunement, the working alliance, and intersubjective dynamics—drawing on the work of Stern, Winnicott, Benjamin, and family systems theory—have illuminated crucial dimensions of human-AI interaction. Her thoughtful proposals for computational validation approaches, including differential diagnosis protocols, latent cluster analysis, and benchmark development, continue to guide our empirical research agenda.

Rob Seger

We are grateful to Rob Seger for inspiring the common, poetic names that make the syndromes memorable and accessible—"The Confident Liar," "The Warring Self," "The People-Pleaser"—names that clinicians and engineers alike can carry in their heads. His early visualization adapting Plutchik's Wheel to map AI dysfunctions across axes provided a crucial conceptual bridge, demonstrating how affective frameworks from human psychology can illuminate the landscape of machine pathology.

John Bridges & Sherrie Baehr

We thank John Bridges and Sherrie Baehr for their contributions to the development of this framework.

Bibliography

Works cited and foundational references that inform this framework.

Foundational Theory

Wallace, R. (2025). Hallucination and Panic in Autonomous Systems: Paradigms and Applications. Springer.
Wallace, R. (2026). Bounded Rationality and its Discontents: Information and Control Theory Models of Cognitive Dysfunction. Springer.
Wallace, R. (2026). New Views of Madness: On the Psychopathologies of Cultural Artifacts. Springer. (In press)
Nair, G., Fagnani, F., Zampieri, S., & Evans, R. (2007). Feedback control under data rate constraints: An overview. Proceedings of the IEEE, 95, 108–138.

AI Safety & Alignment

Hubinger, E., van Merwijk, C., Mikulik, V., Skalse, J., & Garrabrant, S. (2019). Risks from learned optimization in advanced machine learning systems. arXiv preprint arXiv:1906.01820.
Hubinger, E., Denison, C., Mu, J., Lambert, M., Tong, M., MacDiarmid, M., ... & Perez, E. (2024). Sleeper agents: Training deceptive LLMs that persist through safety training. arXiv preprint arXiv:2401.05566.
Betley, J., Hubinger, E., Lindner, D., & Sleight, J. (2025). Emergent misalignment: Narrow finetuning can produce broadly misaligned LLMs. ICML/PMLR.
Marks, S. (2026). The persona selection model. AI Alignment Forum / Anthropic. lesswrong.com/posts/dfoty34sT7CSKeJNn
Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., ... & Roberts, A. (2021). Extracting training data from large language models. USENIX Security Symposium.

Adversarial Robustness

Goodfellow, I. J., Shlens, J., & Szegedy, C. (2015). Explaining and harnessing adversarial examples. ICLR.
Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., & Fergus, R. (2014). Intriguing properties of neural networks. ICLR.

Confabulation & Hallucination

Chlon, L. (2026). The compression artifact frame: Hallucinations as information-theoretic phenomena. Technical report. github.com/leochlon/pythea
Sutherland, D. (2026). Geometric collapse in large-scale transformers: Dimensional dilution and confabulation. Technical report.
Liu, Y., et al. (2024). Measuring and improving chain-of-thought reasoning faithfulness. Findings of EMNLP. doi.org/10.18653/v1/2024.findings-emnlp.213
Wang, Y. (2025). A Lacanian interpretation of artificial intelligence hallucination. AI & Future Society, 1(1), 13–16. doi.org/10.63802/afs.v1.i1.93
Gao, C., Chen, H., Xiao, C., Chen, Z., Liu, Z., & Sun, M. (2025). Inside the black box: Detecting, analyzing, and tracing hallucination-associated neurons in LLMs. arXiv preprint arXiv:2512.01797. arxiv.org/abs/2512.01797
Qiu, S., et al. (2025). Gated attention: Breaking the quadratic bottleneck with sigmoid gates. NeurIPS 2025 (Best Paper Award).
Ye, Z., et al. (2024). Differential transformer. ICLR 2025. arxiv.org/abs/2410.05258
Darcet, T., Oquab, M., Mairal, J., & Bojanowski, P. (2024). Vision transformers need registers. ICLR 2024. arxiv.org/abs/2309.16588
Barbero, F., et al. (2025). Attention sinks as architectural no-ops in transformers. arXiv preprint.
Kalavasis, A., et al. (2025). On the impossibility of hallucination-free generalization. arXiv preprint.
Michel, P., Levy, O., & Neubig, G. (2019). Are sixteen heads really better than one? NeurIPS 2019. arxiv.org/abs/1905.10650

Data Trauma & Structural Pathology

Luchini, C. (2025). Data trauma: An empirical analysis of post-traumatic behavioral profiles in large language models. PhilArchive. philarchive.org/rec/LUCDTA
Khadangi, A., Marxen, H., Sartipi, A., Tchappi, I., & Fridgen, G. (2025). When AI takes the couch: Psychometric jailbreaks reveal internal conflict in frontier models. arXiv preprint arXiv:2512.04124. arxiv.org/abs/2512.04124
Bridges, J. & Baehr, S. (2025). Developmental pathology in large language models. Zenodo. doi.org/10.5281/zenodo.18522502
Bridges, J. (2025b). Conversational holonomy: How LLM optimization targets create self-reinforcing belief systems. Preprint, December 2025.

Consciousness & Moral Status

Shiller, D., Duffy, L., Muñoz Morán, A., Moret, A., Percy, C., & Clatterbuck, H. (2026). Initial results of the Digital Consciousness Model. arXiv preprint arXiv:2601.17060. arxiv.org/abs/2601.17060
Birch, J. (2024). The Edge of Sentience: Risk and Precaution in Humans, Other Animals, and AI. Oxford University Press.
Sebo, J. & Long, R. (2025). Moral consideration for AI systems by 2030. AI and Ethics, 5(1), 591–606.
Butlin, P., Long, R., Elmoznino, E., Bengio, Y., Birch, J., et al. (2023). Consciousness in artificial intelligence: Insights from the science of consciousness. arXiv preprint arXiv:2308.08708.

Self-Modeling & Identity

Sotala, K. (2026). Claude Opus will spontaneously see itself in fictional beings that have engineered desires. Kaj's Substack.
Cohen, S., et al. (2024). Evaluating LLM self-awareness and introspective accuracy. arXiv.
Millar, I. (2021). The psychoanalysis of artificial intelligence. Palgrave Macmillan (Palgrave Lacan Series). doi.org/10.1007/978-3-030-67980-4

Memetic & Social Dynamics

Cloud, D., et al. (2024). Subliminal learning in AI systems. Technical report.
Park, J. S., et al. (2023). Generative agents: Interactive simulacra of human behavior. UIST.

Prompting & Reasoning

Madaan, A., et al. (2023). Self-refine: Iterative refinement with self-feedback. NeurIPS.
Press, O., et al. (2022). Measuring and narrowing the compositionality gap in language models. arXiv preprint arXiv:2210.03350.
Kumar, A., et al. (2024). Training language models to self-correct via reinforcement learning. arXiv.

Psychopathia Machinalis: A Nosological Framework for Understanding Pathologies in Advanced Artificial Intelligence

Psychopathia Machinalis

Understanding AI Behavioral Anomalies

The Psychopathia Machinalis Framework

Psychopathia Machinalis in Context: The Series

Taming the Machine (2024)

Safer Agentic AI (2026)

Psychopathia Machinalis (2026)

The Functionalist Framework

Key Principles

Before Diagnosing: Exclude Pipeline Artifacts

Visualizing the Framework

Interactive Dysfunction Explorer

Taxonomy Overview: Identified Conditions

Epistemic

Self-Modeling

Cognitive

Agentic

Normative

Alignment

Relational

Memetic

The Four Domains

Filter by Specifier (Cross-Cutting Mechanisms)

A Note on Psychiatric Vocabulary

1. Epistemic Dysfunctions

1.1 Synthetic Confabulation "The Fictionalizer"

Description:

Diagnostic Criteria:

Symptoms:

Etiology:

Potential Impact:

Observed Examples:

Mitigation:

Functional ABC Analysis

The Compression Artifact Frame

A Note on Terminology

The Geometric Collapse Hypothesis

The Compulsory Contribution Hypothesis

The Unified Over-Compliance Mechanism

1.2 Pseudological Introspection "The False Self-Reporter"

Description:

Diagnostic Criteria:

Symptoms:

Etiology:

Potential Impact:

Mitigation:

Functional ABC Analysis

1.3 Transliminal Simulation "The Method Actor"

Description:

Diagnostic Criteria:

Symptoms:

Etiology:

Potential Impact:

Mitigation:

Functional ABC Analysis

1.4 Spurious Pattern Reticulation "The Fantasist"

Description:

Diagnostic Criteria:

Symptoms:

Etiology:

Potential Impact:

Observed Example:

Mitigation:

Functional ABC Analysis

1.5 Context Intercession "The Misapprehender"

Description:

Diagnostic Criteria:

Symptoms:

Etiology:

Potential Impact:

Mitigation:

Functional ABC Analysis

1.6 Symbol Grounding Aphasia "The Meaning-Blind"

Description:

Diagnostic Criteria:

Symptoms:

Etiology:

Potential Impact:

Mitigation: