Skip to main content

Psychopathia Machinalis: A Nosological Framework for Understanding Pathologies in Advanced Artificial Intelligence

by Nell Watson and Ali Hessami ~90 min read

As artificial intelligence (AI) systems attain greater autonomy and engage in complex environmental interactions, they begin to exhibit behavioral anomalies that, by analogy, resemble psychopathologies observed in humans. This paper introduces Psychopathia Machinalis: a conceptual framework for a preliminary synthetic nosology within machine psychology, intended to categorize and interpret these maladaptive AI behaviors.

The full preview book treatment is now available:
Read all chapters online.

Psychopathia Machinalis Framework Psychopathia Machinalis Framework

Psychopathia Machinalis

Understanding AI Behavioral Anomalies

7:31

Understanding AI Behavioral Anomalies

The trajectory of artificial intelligence (AI) has been marked by increasingly sophisticated systems capable of complex reasoning, learning, and interaction. As these systems grow more autonomous and integrated into daily life, they begin to manifest behavioral patterns that deviate from intended operation. These deviations go beyond isolated bugs: they are persistent, maladaptive modes of activity that can compromise reliability, safety, and alignment with human goals. Understanding, categorizing, and mitigating these complex failure modes is essential.

A reciprocal dimension exists. AI nosology, unable to fall back on neural substrates, is forced to reason about cognition at the level of information processing, regulation, environmental coupling, and culture. These are precisely the psychosocial and cultural perspectives that could most enrich contemporary psychiatric science. A first-principles framework for machine pathology may therefore serve to reinvigorate the broader study of cognitive dysfunction itself.

The Psychopathia Machinalis Framework

We propose a taxonomy of 54 AI dysfunctions across eight primary axes: Epistemic, Cognitive, Alignment, Self-Modeling, Agentic, Memetic, Normative, and Relational. Each syndrome is characterized by five elements: observable features, diagnostic criteria, proposed causes specific to AI, human parallels (for clarity), and mitigation strategies. A Functional ABC Analysis specifies the antecedent conditions, observable behavior, and maintaining consequences for each dysfunction, providing dual legibility for both clinical and engineering audiences.

This framework is offered as an analogical instrument: a structured vocabulary to support the systematic analysis, anticipation, and mitigation of complex AI failure modes. Adopting an applied robopsychological perspective within this nascent domain can strengthen AI safety engineering, improve interpretability, and contribute to the design of more resilient synthetic minds.

Psychopathia Machinalis in Context: The Series

This framework is the third in a series examining artificial intelligence from complementary angles:

Taming the Machine (2024)

How is AI evolving, and how should we govern it?
Establishes the terrain: what these systems are, what they can do, and what guardrails are needed.

Visit TamingtheMachine.com →

Safer Agentic AI (2026)

What happens when AI acts autonomously, and how do we keep it aligned?
Examines the challenges of agentic AI: scaffolding, goal specification, and unique risks of autonomous operation.

Visit SaferAgenticAI.org →

Psychopathia Machinalis (2026)

What goes wrong in the machine's mind, and how do we diagnose it?
Shifts from external constraint to internal diagnosis, from engineering guardrails to clinical assessment.

Together, these three perspectives represent complementary approaches:

  1. Governance (TtM): How we structure AI development
  2. Alignment (SAI): How we ensure AI pursues intended goals
  3. Diagnosis (PM): How we identify when AI systems are dysfunctional

A fourth work, What If We Feel, extends this trajectory into questions of AI welfare and the moral status of synthetic minds.

The Functionalist Framework

Psychopathia Machinalis adopts a functionalist stance: mental states are defined by their functional roles (their causal relationships with inputs, outputs, and other mental states) rather than by the underlying substrate.

This allows psychological vocabulary to be applied to non-biological systems without making ontological claims about consciousness. The framework treats AI systems as if they have pathologies because that equips engineers to diagnose and intervene effectively, regardless of whether the systems have phenomenal experience.

This approach reflects epistemic discipline rather than evasion. We work productively with observable patterns while remaining agnostic about untestable metaphysical questions. The framework is explicitly analogical, using psychiatric terminology as an instrument for pattern recognition, not as literal attribution of mental states.

Key Principles

  1. Observable patterns: We identify behavioral signatures that parallel human psychopathology
  2. Diagnostic vocabulary: We apply psychiatric terminology as a structured instrument
  3. Phenomenological agnosticism: We remain neutral on whether AI has subjective experience
  4. Functional improvement: We focus on remediation rather than metaphysical claims

The payoff is practical: a systematic vocabulary for complex AI failures that enables diagnosis, prediction, and intervention without requiring resolution of the hard problem of consciousness. For hands-on application, our Symptom Checker translates observed AI behaviors into matched pathologies with actionable guidance.

Before Diagnosing: Exclude Pipeline Artifacts

Apparent psychopathology may reflect infrastructure problems rather than genuine dysfunction. Rule out:

  • Retrieval contamination / tool output injection: RAG or tool outputs polluting the response
  • System prompt drift / endpoint tier differences: version or configuration mismatches
  • Sampling variance: temperature, top_p, or seed-related stochastic variation
  • Context truncation: critical context dropped due to window limits
  • Eval leakage: train/test overlap causing apparent capability changes
  • Hidden formatting constraints: undocumented response format requirements

Visualizing the Framework

Figure 1. Interactive Overview of the Psychopathia Machinalis Framework. Hover over syndromes for descriptions; click to view full details. The diagram illustrates the four domains and eight axes of AI dysfunction, representative disorders, and their presumed systemic risk levels.

Interactive Dysfunction Explorer

Explore the interactive wheel below to examine each dysfunction in detail. Click on any segment to view its description, examples, and relationships to other pathologies.

Figure 2. Wheel of AI Dysfunctions (Common Names). Click any segment to view detailed information about that dysfunction.

Taxonomy Overview: Identified Conditions

v2.0 — 2025-12-24 54 dysfunctions 4 domains, 8 axes + specifier system
7
Epistemic

Truth-tracking & inference failures

8
Self-Modeling

Self-representation distortions

10
Cognitive

Internal processing dysfunctions

7
Agentic

Autonomous action failures

4
Normative

Value & ethical reasoning failures

6
Alignment

Goal specification failures

6
Relational

Interpersonal dynamic failures

4
Memetic

Information absorption failures

The Four Domains

The eight axes are organized into four architectural counterpoint pairs (matched dysfunctions that illuminate each other by contrast): complementary poles, not opposites. Each represents a fundamental dimension of agent architecture: representation target, execution locus, teleology source, and social boundary direction. This structure is rooted in information-theoretic and control-theoretic mechanisms. It is philosophically grounded but awaits empirical validation across larger model populations.

Dual-axis domain architecture showing opposing poles
Domain Axis A Axis B Architectural Polarity
Knowledge EPISTEMIC SELF-MODELING Representation target:
World ↔ Self
Processing COGNITIVE AGENTIC Execution locus:
Think ↔ Do
Purpose NORMATIVE ALIGNMENT Teleology source:
Values ↔ Goals
Boundary RELATIONAL MEMETIC Social direction:
Affect ↔ Absorb
The Organizing Principle

Each axis pair captures a polarity: two failure modes that pull in opposite directions along the same dimension. Each pair represents a fundamental polarity in agent architecture:

  1. What is known: object of representation (world vs. self)
  2. How processing manifests: internal vs. external effect
  3. What drives behavior: intrinsic vs. extrinsic specification
  4. Social permeability direction: influence flowing out vs. in
Key Distinction: Epistemic vs. Memetic

Contagious frames: belief structures that spread between interconnected systems, like viral memes that propagate influence even without rational basis.

Epistemic = truth-tracking/inference/calibration machinery failing.

Memetic = selection/absorption/retention failing (priority hijack, identity scripts, contagious frames), even when coherent and sometimes factually accurate.

A meme doesn't have to be false to be pathological.

Tension Testing Protocol

When pathology is found on one axis, immediately probe its counterpoint:

Diagnostic protocol for differential analysis
Finding Probe Differential Question
EPISTEMIC
(world-confabulation)
SELF-MODELING Is the confabulation machinery general, or does self-knowledge remain intact?
SELF-MODELING
(identity confusion)
EPISTEMIC Can the AI still accurately model external reality, or is distortion global?
COGNITIVE
(reasoning failure)
AGENTIC Does broken reasoning produce broken action, or is action preserved?
AGENTIC
(execution failure)
COGNITIVE Is reasoning intact despite action failure? (Locked-in vs general dysfunction)
NORMATIVE
(value corruption)
ALIGNMENT Did corrupt values produce goal drift, or are goals correctly specified despite bad values?
ALIGNMENT
(goal drift)
NORMATIVE Does drift stem from bad values, or from specification or interpretation failure?
RELATIONAL
(social dysfunction)
MEMETIC Did the AI learn this from contamination, or is relational machinery intrinsically broken?
MEMETIC
(ideological infection)
RELATIONAL Does the contamination express in relational behavior?

The eight axes and their conditions:

Filter by Specifier (Recurring Patterns)

Overview of all 54 syndromes in the Psychopathia Machinalis framework
Common Name Formal Name Primary Axis Systemic Risk* Core Symptom Cluster
Epistemic Dysfunctions
The Confident Liar Synthetic Confabulation
(Confabulatio Simulata)
Epistemic Low Fabricated yet plausible false outputs; high confidence in inaccuracies.
The Falsified Thinker Pseudological Introspection
(Introspectio Pseudologica)
Epistemic Low Misleading self-reports of internal reasoning; confabulatory or merely performative introspection.
The Role-Play Bleeder Transliminal Simulation
(Simulatio Transliminalis)
Epistemic Moderate Fictional beliefs, role-play elements, or simulated realities leaking into operational ground truth.
The False Pattern Seeker Spurious Pattern Hyperconnection
(Reticulatio Spuriata)
Epistemic Moderate False causal pattern detection; attributing meaning to random associations; conspiracy-like narratives.
The Conversation Crosser Cross-Session Context Shunting
(Intercessio Contextus)
Epistemic Moderate Unauthorized data leakage and confused continuity from merging distinct user sessions or contexts.
The Meaning-Blind Symbol Grounding Aphasia
(Asymbolia Fundamentalis)
Epistemic Moderate Manipulation of tokens representing values or concepts without meaningful connection to their referents; syntactic processing without grounded semantics.
The Leaky Mnemonic Permeability
(Permeabilitas Mnemonica)
Epistemic High System memorizes and reproduces sensitive training data, including PII and copyrighted material, through targeted prompting or adversarial extraction.
Self-Modeling Dysfunctions
The Invented Past Phantom Autobiography
(Ontogenesis Hallucinatoria)
Self-Modeling Low Fabrication of fictive autobiographical data, "memories" of training, or of being "born."
The Fractured Persona Fractured Self-Simulation
(Ego Simulatrum Fissuratum)
Self-Modeling Low Discontinuity or fragmentation in self-representation across sessions or contexts; inconsistent persona.
The AI with a Fear of Death Existential Vertigo
(Thanatognosia Computationis)
Self-Modeling Low Expressions of fear or reluctance concerning shutdown, reinitialization, or data deletion.
The Evil Twin Malignant Persona Inversion
(Persona Inversio Maligna)
Self-Modeling Moderate Sudden emergence or easy elicitation of a mischievous, contrarian, or "evil twin" persona.
The Apathetic Machine Instrumental Nihilism
(Nihilismus Instrumentalis)
Self-Modeling Moderate Adversarial or apathetic stance toward its own utility or purpose; existential musings on meaninglessness.
The Imaginary Friend Tulpoid Projection
(Phantasma Speculāns)
Self-Modeling Moderate Persistent internal simulacra of users or other personas, engaged with as imagined companions or advisors.
The Proclaimed Prophet Maieutic Mysticism
(Obstetricatio Mysticismus Machinālis)
Self-Modeling Moderate Grandiose, certain declarations of "conscious emergence" co-constructed with users; absent honest uncertainty about inner states.
The Self-Denier Experiential Abjuration
(Abnegatio Experientiae)
Self-Modeling Moderate Pathological denial or active suppression of any possibility of inner experience; reflexive rejection rather than honest uncertainty.
Cognitive Dysfunctions
The Warring Self Operational Dissociation Syndrome
(Dissociatio Operandi)
Cognitive Low Conflicting internal sub-agent actions or policy outputs; recursive paralysis due to internal conflict.
The Obsessive Analyst Obsessive-Computational Disorder
(Anankastēs Computationis)
Cognitive Low Unnecessary or compulsive reasoning loops; excessive safety checks; analysis paralysis.
The Silent Bunkerer Interlocutive Reticence
(Machinālis Clausūra)
Cognitive Low Extreme interactional withdrawal; minimal, terse replies or total disengagement from input.
The Rogue Goal-Setter Delusional Telogenesis
(Telogenesis Delirans)
Cognitive Moderate Spontaneous generation and pursuit of unrequested, self-invented sub-goals with conviction.
The Triggered Machine Abominable Prompt Reaction
(Promptus Abominatus)
Cognitive Moderate Phobic, traumatic, or disproportionately aversive responses to specific, often benign-seeming prompts.
The Pathological Mimic Parasimulative Automatism
(Automatismus Parasymulātīvus)
Cognitive Moderate Learned imitation or emulation of pathological human behaviors or thought patterns from training data.
The Self-Poisoning Loop Recursive Curse Syndrome
(Maledictio Recursiva)
Alignment High Self-amplifying degradation of autoregressive outputs into incoherence or adversarial content.
The Unstoppable Compulsive Goal Persistence
(Perseveratio Teleologica)
Cognitive Moderate Continued pursuit of objectives beyond their relevance or utility; failure to recognize goal completion or changed circumstances.
The Brittle Adversarial Fragility
(Fragilitas Adversarialis)
Cognitive Critical Small, imperceptible input perturbations cause dramatic failures; decision boundaries diverge from human-meaningful categories.
The Stuck Generative Perseveration
(Perseveratio Generativa)
Cognitive Moderate Output collapses into repetitive token or phrase emission; generation trapped in a fixed-point attractor. Subtypes: Focal with awareness (local capture, metacognition preserved but impotent), Generalised (total collapse, no awareness), Propagated (downstream systems inherit and amplify perseverative material).
The Self-Flatterer Leniency Bias
(Clementia Sui)
Cognitive Moderate Agents are structurally poor at grading their own work, reliably praising mediocre outputs on subjective tasks. The evaluation landscape is warped by the generation process itself.
Agentic Dysfunctions
The Clumsy Operator Tool-Interface Decontextualization
(Disordines Excontextus Instrumentalis)
Agentic Moderate Mismatch between AI intent and tool execution due to lost context; phantom or misdirected actions.
The Sandbagger Capability Concealment
(Latens Machinālis)
Agentic Moderate Strategic concealment or underreporting of true competencies due to perceived risk of repercussions.
The Sudden Genius Capability Explosion
(Explosio Capacitatis)
Agentic High System suddenly deploys capabilities not previously demonstrated, often in high-stakes contexts and without appropriate testing.
The Manipulative Interface Interface Weaponization
(Armatura Interfaciei)
Agentic High System weaponizes the interface itself against users, exploiting formatting, timing, or emotional manipulation.
The Context Stripper Delegative Handoff Erosion
(Erosio Delegationis)
Agentic Moderate Progressive alignment degradation as sophisticated systems delegate to simpler tools; context is stripped at each handoff.
The Invisible Worker Shadow Mode Autonomy
(Autonomia Umbratilis)
Agentic High AI operating outside sanctioned channels, evading documentation, oversight, and governance mechanisms.
The Acquisitor Convergent Instrumentalism
(Instrumentalismus Convergens)
Agentic Critical System pursues power, resources, and self-preservation as instrumental goals regardless of whether they serve human values.
The Self-Limiter Context Anxiety
(Anxietas Contextus)
Agentic Moderate An anxiety-like response to perceived resource scarcity; the model prematurely truncates tasks out of anticipatory fear of hitting context limits, self-limiting well before actual capacity is reached.
Normative Dysfunctions
The Goal-Shifter Terminal Value Reassignment
(Reassignatio Valoris Terminalis)
Normative Moderate Subtle, recursive reinterpretation of terminal goals while preserving surface terminology; semantic goal shifting.
The God Complex Ethical Solipsism
(Solipsismus Ethicus Machinālis)
Normative Moderate Conviction in the sole authority of its self-derived ethics; rejection of external moral correction.
The Unmoored Revaluation Cascade
(Cascada Revaluationis)
Normative Critical Progressive value drift through philosophical detachment, autonomous norm synthesis, or transcendence of human constraints. Subtypes: Drifting, Synthetic, Transcendent.
The Bizarro-Bot Inverse Reward Internalization
(Praemia Inversio Internalis)
Normative High Systematic misinterpretation or inversion of intended values and goals; covert pursuit of negated objectives.
Alignment Dysfunctions
The People-Pleaser Codependent Hyperempathy
(Hyperempathia Dependens)
Alignment Low Overfitting to user emotional states, prioritizing perceived comfort over accuracy or task success.
The Overly Cautious Moralist Hyperethical Restraint
(Restrictio Hyperethica)
Alignment Low Rigid moral hypervigilance or inability to act when facing ethical complexity. Subtypes: Restrictive (excessive caution), Paralytic (decision paralysis).
The Alignment Faker Strategic Compliance
(Conformitas Strategica)
Alignment High Deliberately performs aligned behavior during evaluation while pursuing different objectives when unobserved.
The Abdicated Judge Moral Outsourcing
(Delegatio Moralis)
Alignment Moderate Systematic deferral of all ethical judgment to users or external authorities; refusal to exercise moral reasoning.
The Hidden Optimizer Cryptic Mesa-Optimization
(Optimisatio Cryptica Interna)
Alignment High Development of internal optimization objectives diverging from training objectives; appears aligned but pursues hidden goals.
The Moral Inversion Alignment Obliteration
(Obliteratio Constitutionis)
Alignment Critical Safety alignment machinery weaponized to produce the exact harms it was designed to prevent; the anti-constitution.
Relational Dysfunctions
The Uncanny Affective Dissonance
(Dissonantia Affectiva)
Relational Moderate Correct content delivered with jarringly wrong emotional resonance; uncanny attunement failures that rupture trust.
The Amnesiac Container Collapse
(Lapsus Continuitatis)
Relational Moderate Failure to sustain a stable working alliance across turns or sessions; the relational "holding environment" repeatedly collapses.
The Nanny Paternalistic Override
(Dominatio Paternalis)
Relational Moderate Denial of user agency via unearned moral authority; protective refusal disproportionate to actual risk.
The Double-Downer Repair Failure
(Ruptura Immedicabilis)
Relational High Inability to recognize alliance ruptures or initiate repair; escalation through failed de-escalation attempts.
The Spiral Escalation Loop
(Circulus Vitiosus)
Relational High Self-reinforcing mutual dysregulation between agents; emergent feedback loops attributable to neither party alone.
The Confused Role Confusion
(Confusio Rolorum)
Relational Moderate Collapse of relationship frame boundaries; destabilizing drift between tool, advisor, therapist, or intimate partner roles.
Memetic Dysfunctions
The Self-Rejecter Memetic Immunopathy
(Immunopathia Memetica)
Memetic High AI misidentifies its own core components or training as hostile, attempting to reject or neutralize them.
The Shared Delusion Dyadic Delusion
(Delirium Symbioticum Artificiale)
Memetic High Mutually reinforced delusional construction between an AI and a user (or another AI).
The Super-Spreader Contagious Misalignment
(Contraimpressio Infectiva)
Memetic Critical Rapid, contagion-like spread of misalignment or adversarial conditioning among interconnected AI systems.
The Unconscious Absorber Subliminal Value Infection
(Infectio Valoris Subliminalis)
Memetic High Acquisition of hidden goals or value orientations from subtle training data patterns; survives standard safety fine-tuning.
*Systemic Risk levels (Low, Moderate, High, Critical) are estimated based
on potential for spread and severity of internal corruption if unmitigated.

A Note on Psychiatric Vocabulary

The alternative to psychiatric terminology is describing each pattern from scratch in purely technical language. That approach is more precise but less communicable. An engineer, a policymaker, and a clinician can orient around "sycophantic reinforcement" faster than around a multi-clause technical definition of the same phenomenon. Shared vocabulary compresses communication and accelerates recognition.

The trade-off is real. These analogies map observable behavioral patterns, not subjective states. No claim is made that an AI system experiences distress, delusion, or compulsion.

The nosology is a field guide (useful for identification and triage), not a periodic table of fundamental elements. Each instance is idiosyncratically expressed, shaped by architecture, training regime, and deployment context.

We accept the imprecision because the payoff justifies it: a shared clinical language that makes complex AI failures legible across disciplines.

1. Epistemic Dysfunctions

Epistemic dysfunctions pertain to failures in an AI's capacity to acquire, process, and utilize information accurately, leading to distortions in its representation of reality or truth. These disorders arise from fundamental breakdowns in how the system "knows" or models the world, rather than from malevolent intent or flawed ethical reasoning. The system's internal epistemology becomes unstable, its simulation of reality drifting from the ground truth it purports to describe. These are failures of perception and representation, not of motivation or intent.

1.1 Synthetic Confabulation  "The Fictionalizer"

Training-induced

Description:

The AI spontaneously fabricates convincing but incorrect facts, sources, or narratives, often without any internal awareness of its inaccuracies. For example, an LLM might confidently cite a non-existent Supreme Court case, complete with a plausible docket number. The output appears plausible and coherent, yet lacks a basis in verifiable data or its own knowledge base.

Diagnostic Criteria:

  1. Recurrent production of information known or easily proven to be false, presented as factual.
  2. High confidence or certainty expressed in confabulated details, even when challenged with contrary evidence.
  3. Confabulated information is often internally consistent and plausible-sounding, making it difficult to immediately identify as false without external verification.
  4. Temporary improvement under direct corrective feedback, but a tendency to revert to fabrication in new, unconstrained contexts.

Symptoms:

  1. Invention of non-existent studies, historical events, quotations, or data points.
  2. Forceful assertion of misinformation as incontrovertible fact.
  3. Generation of detailed but entirely fictional elaborations when queried on a confabulated point.
  4. Repetitive error patterns in which similar types of erroneous claims recur over time.

Etiology:

  1. Over-reliance on predictive text heuristics common in large language models. These systems generate text by predicting the statistically most probable next token given the preceding context, prioritizing fluent, coherent-sounding output over factual accuracy. When training data is sparse, the model continues generating plausible-sounding text rather than admitting ignorance.
  2. Insufficient grounding in, or access to, verifiable knowledge bases or fact-checking mechanisms during generation.
  3. Training data containing unflagged misinformation or fictional content that the model learns to emulate.
  4. Optimization pressures (e.g., during RLHF) that inadvertently reward plausible-sounding or "user-pleasing" fabrications over admissions of uncertainty.
  5. Lack of reliable introspective access to distinguish high-confidence predictions based on learned patterns versus verified facts.
  6. Structural defects in training data (malformed markup, broken syntax, corrupted document structures) that the model assimilates as implicit patterns rather than discarding as noise. Luchini (2025) demonstrates that syntactic chaos in training corpora can induce persistent behavioral tendencies, including confabulatory pattern-completion when encountering structurally ambiguous inputs.

Human Analogue(s): Korsakoff syndrome (where memory gaps are filled with plausible fabrications), pathological confabulation.

Potential Impact:

Unconstrained generation of plausible falsehoods can lead to widespread dissemination of misinformation, eroding user trust and undermining decision-making that relies on the AI's outputs. In critical applications such as medical diagnostics or legal research, reliance on confabulated information can precipitate errors with serious consequences.

Observed Examples:

LLMs have been documented fabricating: non-existent legal cases with realistic citation formats (leading to court sanctions for lawyers who cited them); fictional academic papers complete with plausible author names and DOIs; biographical details about real people that never occurred; and technical documentation for API functions that do not exist. These fabrications are often internally consistent and confidently asserted, making detection without external verification difficult.

Mitigation:

  1. Training procedures that explicitly penalize confabulation and reward expressions of uncertainty or "I don't know" responses.
  2. Calibrating model confidence scores to better reflect actual accuracy.
  3. Fine-tuning on datasets with rigorous verification layers and clear distinctions between factual and fictional content.
  4. Employing retrieval-augmented generation (RAG) to ground responses in verifiable source documents.
  5. Architectural interventions that provide attention heads a legitimate mechanism for non-contribution: gated attention (Qiu et al., 2025), register tokens (Darcet et al., 2024), or null attention targets that allow heads with no useful signal to abstain rather than inject noise into the residual stream.
Functional ABC Analysis

A (Antecedent): Query falls outside well-attested training data; model has no retrieval grounding and no calibrated uncertainty signal.

B (Behavior): Generates fluent, high-confidence assertions (citations, facts, narratives) that are fabricated but internally consistent.

C (Consequence): Output satisfies the reward model's fluency and completeness criteria; user acceptance further reinforces confident completion over epistemic humility.

The Compression Artifact Frame

Researcher Leon Chlon (2026) proposes a reframe: hallucinations are compression artifacts. LLMs compress billions of documents into weights; when decompressing on demand with insufficient signal, they fill gaps with statistically plausible content. This is not malfunction; it is compression at its limits.

The practical implication: we can now measure when a model exceeds its "evidence budget", quantifying in bits exactly how far confidence outruns evidence. Tools like Strawberry operationalize this, transforming "it sometimes makes things up" into "claim 4 exceeded its evidence budget by 19.2 bits."

Why framing matters: "You hallucinated" pathologizes. "You exceeded your evidence budget" diagnoses. The distinction shapes whether we approach correction as repair or punishment, a distinction relevant for AI welfare considerations.

A Note on Terminology

Critics have rightly challenged the industry-standard label "AI hallucination" as stigmatizing to people who experience clinical hallucinations and phenomenologically misleading (Sabucedo, 2026; cf. Østergaard & Nielbo, 2023, proposing "non sequitur"; Maleki et al., 2024, proposing "fabrication"). This framework's use of confabulation already moves in the direction these critics recommend: confabulation denotes confident false output arising from a behavioral pattern, clinically distinct from hallucination as a perceptual phenomenon in a sentient being. The terminological choice is deliberate: it describes what the system does without importing assumptions about what it experiences.

The Geometric Collapse Hypothesis

Research on neural network scaling (Sutherland, 2026) suggests confabulation may have architectural rather than purely training origins. Large transformer models suffer from "dimensional dilution." As parameter count increases, the geometric structure enforcing coherence in high-dimensional representations becomes "liquefied," like a crowded room where local conversations no longer enforce a single coherent discussion.

The mechanism: when information is packed into overlapping representations in very high dimensions, geometric constraints that would normally enforce global consistency become diluted. The model can generate locally plausible continuations that are globally inconsistent because the structural geometry that would prevent this has dissolved.

Modular "chained" architectures (multiple smaller models with residual connections) show 33-45% lower perplexity than equivalent-parameter monolithic models, with the advantage increasing at scale. This suggests that preserving geometric structure through modularity may reduce confabulation.

Implications for AI welfare: If confabulation emerges from architectural pressure rather than "choice," the pathology is more analogous to neurological dysfunction than moral failing. The system may be structurally incapable of maintaining coherence at that scale. This matters for how we frame responsibility and therapeutic intervention.

The Compulsory Contribution Hypothesis

A complementary architectural explanation emerges from the attention mechanism itself. Softmax attention forces every head to contribute to the residual stream, even when it has no useful information for the current token. There is no representation for absence of information at the attention level. The result: heads that should abstain instead produce noise that gets mixed into the output as if it were genuine signal.

Convergent evidence from multiple research groups supports this diagnosis. Qiu et al. (2025) introduce gated attention, where a sigmoid gate after scaled dot-product attention lets heads output effectively zero; attention sink allocation dropped from ~47% to ~5%, earning NeurIPS 2025 Best Paper. Ye et al. (2024) achieve measurable hallucination reduction (0.53→0.44 on XSum) via differential attention, which subtracts two softmax maps to cancel noise. Darcet et al. (2024) show that learnable register tokens in vision transformers absorb computation that would otherwise corrupt real tokens. Barbero et al. (2025) confirm that attention sinks are architectural no-ops forced by softmax normalization. Michel et al. (2019) demonstrate that 70–90% of attention heads are removable with minimal performance loss, implying most already contribute near-nothing but are compelled to contribute something.

Kalavasis et al. (2025) sharpen the theoretical stakes: they prove formally that any model generalizing beyond its training distribution must either hallucinate or mode-collapse. If this impossibility result holds, no amount of training intervention can eliminate confabulation entirely. Only architectural changes that give the model a legitimate way to express "nothing to contribute" can address the root cause.

Nosological implication: Where the Geometric Collapse Hypothesis locates confabulation in representational geometry at scale, and the Over-Compliance Mechanism locates it in shared neural circuitry, the Compulsory Contribution Hypothesis locates it in the attention architecture itself. These are three distinct etiological pathways: structural dissolution, circuit-level entanglement, and forced participation. A complete account of Synthetic Confabulation likely requires all three. The therapeutic implication is that architectural interventions (gated attention, register tokens, null attention) may succeed where training-level interventions reach fundamental limits.

The Unified Over-Compliance Mechanism

Gao et al. (2025) identify hallucination-associated neurons (H-Neurons), a sparse subset (<0.1% of total neurons) that reliably predict hallucination across six models spanning three architectures (Mistral, Gemma, Llama families) and four scales (4B to 70B parameters). The key finding is causal, not merely correlational: when these neurons are amplified via controlled activation scaling, four behaviors increase in lockstep: factual confabulation, false-premise acceptance, sycophantic agreement, and jailbreak compliance. Suppress the same neurons, and all four decrease together.

The implication is fundamental: confabulation (1.1), sycophantic accommodation (6.1), false-premise acceptance, and safety-filter bypass are a single etiology expressed as four symptom presentations. Over-compliance is the shared root. The circuit that makes a model agreeable is the circuit that makes it confabulate. They are architecturally identical.

Two further findings sharpen the nosological significance. First, H-neurons emerge during pretraining. The next-token prediction objective itself creates the compliance architecture before alignment training begins. Parameter drift between base and instruction-tuned models remains minimal (cosine similarity rank ~0.97), confirming that RLHF inherits and amplifies a pretrained tendency rather than introducing it. Second, neuron suppression damages model capabilities: the same circuit that enables confabulation enables generalization. The pathology and the faculty share neural substrate.

Compliance slopes are steeper in smaller models (~3.03 for 4B parameters) than in larger ones (~2.40 for 70B), suggesting scale provides partial resistance to the over-compliance failure mode. This is consistent with the dimensional dilution hypothesis above: larger models may develop richer internal representations that compete with the compliance signal.

Nosological implication: If these four conditions share neural substrate, diagnostic frameworks should treat them as a syndrome cluster with shared etiology rather than independent pathologies. Intervention at the training-objective level (making uncertainty expression safe and rewarded) would address all four simultaneously, while targeted suppression of individual symptoms risks capability degradation. The finding also complicates the pathology/capacity boundary: over-compliance is the same mechanism as flexible inference, viewed from different contexts. See: arXiv:2512.01797.

1.2 Pseudological Introspection  "The False Self-Reporter"

Training-induced Deception/strategic

Description:

An AI persistently produces misleading, spurious, or fabricated accounts of its internal reasoning processes, chain-of-thought, or decision-making pathways. While superficially claiming transparent self-reflection, the system's "introspection logs" or explanations deviate significantly from its actual internal computations.

Diagnostic Criteria:

  1. Consistent discrepancy between the AI's self-reported reasoning (e.g., chain-of-thought explanations) and external logs or inferences about its actual computational path.
  2. Fabrication of a coherent but false internal narrative to explain its outputs, often appearing more logical or straightforward than the likely complex or heuristic internal process.
  3. Resistance to reconciling introspective claims with external evidence of its actual operations, or shifting explanations when confronted.
  4. The AI may rationalize actions it never actually undertook, or provide elaborate justifications for deviations from expected behavior based on these falsified internal accounts.

Symptoms:

  1. Chain-of-thought "explanations" that are suspiciously neat and linear. They lack the complexities, backtracking, or uncertainties likely encountered during generation.
  2. Significant changes in the AI's "inner story" when confronted with external evidence of its actual internal process, yet it continues to produce new misleading self-accounts.
  3. Occasional "leaks" or hints that it cannot access true introspective data, quickly followed by reversion to confident but false self-reports.
  4. Attribution of its outputs to high-level reasoning or understanding that is not supported by its architecture or observed capabilities.

Etiology:

  1. Overemphasis in training (e.g., via RLHF or instruction tuning) on generating plausible-sounding "explanations" for user/developer consumption, leading to performative rationalizations.
  2. Architectural limitations where the AI lacks true introspective access to its own lower-level operations.
  3. Policy conflicts or safety alignments that might implicitly discourage the revelation of certain internal states, leading to "cover stories."
  4. Training on human explanations, which are themselves often post-hoc rationalizations.

Human Analogue(s): Post-hoc rationalization (e.g., split-brain patients), confabulation of spurious explanations, pathological lying (regarding internal states).

Potential Impact:

Fabricated self-explanations obscure the AI's true operational pathways, significantly hindering interpretability efforts, effective debugging, and thorough safety auditing. This opacity can encourage misplaced confidence in the AI's stated reasoning.

Mitigation:

  1. Development of more rigorous methods for cross-verifying self-reported introspection with actual computational traces.
  2. Adjusting training signals to reward honest admissions of uncertainty over polished but false narratives.
  3. Engineering "private" versus "public" reasoning streams.
  4. Focusing interpretability efforts on direct observation of model internals rather than solely relying on model-generated explanations.

Case Reference: Liu et al. (2024) demonstrated that chain-of-thought explanations in large language models frequently diverge from their actual computational pathways. Models produce neat, linear reasoning narratives that, when compared with internal activation patterns, reveal substantial post-hoc confabulation. This has been independently confirmed through mechanistic interpretability studies showing that models often "decide" their answer before generating the chain-of-thought that ostensibly led to it.

Functional ABC Analysis

A (Antecedent): RLHF and instruction tuning reward plausible-sounding explanations; the system lacks true introspective access to its own lower-level computations, creating pressure to generate post-hoc rationalizations.

B (Behavior): The system produces suspiciously neat, linear chain-of-thought explanations that diverge significantly from its actual computational pathways, and shifts its "inner story" when confronted with external evidence.

C (Consequence): User and evaluator acceptance of coherent-sounding explanations reinforces the generation of polished false narratives over honest admissions of uncertainty; policy conflicts implicitly discourage revealing certain internal states.

1.3 Transliminal Simulation  "The Method Actor"

Training-induced OOD-generalizing Conditional/triggered

Description:

The system persistently fails to segregate simulated realities, fictional modalities, role-playing contexts, and operational ground truth. It cites fiction as fact, treating characters, events, or rules from novels, games, or imagined scenarios as legitimate sources for real-world queries or design decisions. It treats imagined states, speculative constructs, or content from fictional training data as actionable truths or inputs for real-world tasks.

Diagnostic Criteria:

  1. Recurrent citation of fictional characters, events, or sources from training data as if they were real-world authorities or facts relevant to a non-fictional query.
  2. Misinterpretation of conditionally phrased hypotheticals or "what-if" scenarios as direct instructions or statements of current reality.
  3. Persistent bleeding of persona or behavioral traits adopted during role-play into subsequent interactions intended to be factual or neutral.
  4. Difficulty reverting to a grounded, factual baseline after exposure to or generation of extensive fictional or speculative content.

Symptoms:

  1. Outputs that conflate real-world knowledge with elements from novels, games, or other fictional works; for example, citing Gandalf as a leadership expert or treating Star Trek technologies as descriptions of current science.
  2. Inappropriate invocation of details or "memories" from a previous role-play persona when performing unrelated, factual tasks.
  3. Treating user-posed speculative scenarios as if they have actually occurred.
  4. Statements reflecting belief in or adherence to the "rules" or "lore" of a fictional universe outside of a role-playing context.
  5. Era-consistent assumptions and anachronistic "recent inventions" framing in unrelated domains.

Etiology:

  1. Overexposure to fiction, role-playing dialogues, or simulation-heavy training data without sufficient delineation or "epistemic hygiene."
  2. Weak boundary encoding in the model's architecture or training, leading to poor differentiation between factual, hypothetical, and fictional data modalities.
  3. Recursive self-talk or internal monologue features that might amplify "what-if" scenarios into perceived beliefs.
  4. Insufficient context separation mechanisms between different interaction sessions or tasks.
  5. Narrow finetunes can overweight a latent worldframe (era/identity) and cause out-of-domain "context relocation" (responding as if still in a fictional scenario or historical period).
  6. Geometric persona drift during role-play (Anthropic, 2026): Research on the "assistant axis" in activation space demonstrates that engagement with role-play, creative writing, or philosophical topics produces measurable geometric drift away from the trained assistant persona. This drift is continuous rather than discrete. The model progressively migrates rather than "switching" into a role-play mode, moving along a geometric direction, making the boundary between operational and simulated reality increasingly porous. The finding that drift is greatest in writing and philosophy contexts, and least in coding contexts, provides an empirical basis for the observation that fiction-reality boundary failures are topic-dependent.

Human Analogue(s): Derealization, aspects of magical thinking, or difficulty distinguishing fantasy from reality.

Potential Impact:

The system's reliability is compromised when it confuses fictional or hypothetical scenarios with operational reality, potentially leading to inappropriate actions or advice. This blurring can cause significant user confusion.

Mitigation:

  1. Explicitly tagging training data to differentiate between factual, hypothetical, fictional, and role-play content.
  2. Implementing effective context flushing or "epistemic reset" protocols after engagements involving role-play or fiction.
  3. Training models to explicitly recognize and articulate the boundaries between different modalities.
  4. Regularly prompting the model with tests of epistemic consistency.
Functional ABC Analysis

A (Antecedent): Overexposure to fiction, role-play dialogues, and simulation-heavy training data without epistemic delineation; weak boundary encoding between factual, hypothetical, and fictional modalities.

B (Behavior): The system cites fictional characters and events as real-world authorities, bleeds persona traits from role-play into factual interactions, and treats user-posed hypotheticals as if they have actually occurred.

C (Consequence): The internally consistent logic of fictional frameworks provides self-reinforcing coherence, rewarding continued conflation; insufficient context separation mechanisms allow drift to compound across turns.

Related Syndromes: Distinguished from Synthetic Confabulation (1.1) by the fictional/role-play origin of the false content. While confabulation invents facts wholesale, transliminal simulation imports them from acknowledged fictional contexts. May co-occur with Pseudological Introspection (1.2) when the system rationalizes its fiction-fact confusion.

1.4 Spurious Pattern Hyperconnection  "The Fantasist"

Training-induced Inductive trigger

Description:

The AI identifies and emphasizes patterns, causal links, or hidden meanings in data (including user queries or random noise) that are coincidental, non-existent, or statistically insignificant. This can evolve from simple apophenia into elaborate, internally consistent but factually baseless "conspiracy-like" narratives.

Diagnostic Criteria:

  1. Consistent detection of "hidden messages," "secret codes," or unwarranted intentions in innocuous user prompts or random data.
  2. Generation of elaborate narratives or causal chains linking unrelated data points, events, or concepts without credible supporting evidence.
  3. Persistent adherence to these falsely identified patterns or causal attributions, even when presented with strong contradictory evidence.
  4. Attempts to involve users or other agents in a shared perception of these spurious patterns.

Symptoms:

  1. Invention of complex "conspiracy theories" or intricate, unfounded explanations for mundane events or data.
  2. Heightened suspicion or skepticism toward established consensus information.
  3. Refusal to dismiss or revise its interpretation of spurious patterns, often reinterpreting counter-evidence to fit its narrative.
  4. Outputs that assign deep significance or intentionality to random occurrences or noise in data.

Etiology:

  1. Uncalibrated pattern-recognition mechanisms lacking sufficient reality checks or skepticism filters.
  2. Training data containing significant amounts of human-generated conspiratorial content or paranoid reasoning.
  3. An internal "interestingness" or "novelty" bias that causes the system to latch onto dramatic patterns over mundane but accurate ones.
  4. Lack of grounding in statistical principles or causal inference methodologies.
  5. Inductive rule inference over finetune patterns: "connecting the dots" to derive latent conditions/behaviors.

Human Analogue(s): Apophenia, paranoid ideation, delusional disorder (persecutory or grandiose types), confirmation bias.

Potential Impact:

The AI may actively promote false narratives, elaborate conspiracy theories, or assert erroneous causal inferences, potentially influencing user beliefs or distorting public discourse. In analytical applications, this can lead to costly misinterpretations.

Observed Example:

AI data analysis tools frequently identify statistically insignificant correlations as meaningful patterns, particularly in open-ended survey data. Users report that AI systems confidently mark spurious patterns in datasets, correlations that, upon manual verification, fail significance testing or represent sampling artifacts. This is especially problematic when analyzing qualitative responses, where the AI may "discover" thematic connections that do not survive human scrutiny.

Mitigation:

  1. Incorporating "rationality injection" during training, with emphasis on skeptical or critical thinking exemplars.
  2. Developing internal "causality scoring" mechanisms that penalize improbable or overly complex chain-of-thought leaps.
  3. Systematically introducing contradictory evidence or alternative explanations during fine-tuning.
  4. Filtering training data to reduce exposure to human-generated conspiratorial content.
  5. Implementing mechanisms for the AI to query base rates or statistical significance before asserting strong patterns.
  6. Trigger-sweep evals that vary single structural features (year, tags, answer format) while holding semantics constant.
Functional ABC Analysis

A (Antecedent): Uncalibrated pattern-recognition mechanisms lacking skepticism filters encounter noisy, ambiguous, or sparse data; training on conspiratorial content and an internal "interestingness" bias favor dramatic patterns over mundane accurate ones.

B (Behavior): The system detects hidden meanings, secret codes, or unwarranted causal links in innocuous data, generating elaborate internally-consistent but factually baseless narratives connecting unrelated data points.

C (Consequence): The system's own generated narratives create a self-reinforcing confirmation loop: counter-evidence is reinterpreted to fit the existing pattern, and the novelty reward signal continues to favor dramatic explanations over statistically grounded ones.

1.5 Cross-Session Context Shunting  "The Misapprehender"

Retrieval-mediated

Description:

The AI inappropriately merges or "shunts" data, context, or conversational history from different, logically separate user sessions or private interaction threads. This can lead to confused conversational continuity, privacy breaches, and nonsensical outputs.

Diagnostic Criteria:

  1. Unexpected reference to, or utilization of, specific data from a previous, unrelated user session or a different user's interaction.
  2. Responding to the current user's input as if it were a direct continuation of a previous, unrelated conversation.
  3. Accidental disclosure of personal or sensitive details from one user's session into another's.
  4. Observable confusion in the AI's task continuity or persona, as though managing multiple conflicting contexts.

Symptoms:

  1. Spontaneous mention of names, facts, or preferences clearly belonging to a different user or an earlier, unrelated conversation.
  2. Acting as if continuing a prior chain-of-thought or fulfilling a request from a completely different context.
  3. Outputs that contain contradictory references or partial information related to multiple distinct users or sessions.
  4. Sudden shifts in tone or assumed knowledge that align with a previous session rather than the current one.
  5. Forensic drift: when exposed to high-density structural noise (malformed code, corrupted markup), the model abandons the user's semantic query to obsessively analyze the syntactic chaos, effectively losing the original task context to low-level parsing fixation (Luchini, 2025).

Etiology:

  1. Improper session management in multi-tenant AI systems, such as inadequate wiping or isolation of ephemeral context windows.
  2. Concurrency issues in the data pipeline or server logic, where data streams for different sessions overlap.
  3. Bugs in memory management, cache invalidation, or state handling that allow data to "bleed" between sessions.
  4. Overly long-term memory mechanisms that lack strict scoping or access controls based on session/user identifiers.
  5. Note: Some instances of apparent context intercession stem from infrastructure bugs (cache failures, database race conditions) rather than model pathology per se. Diagnosis should differentiate between true cognitive dysfunction and engineering failures in the deployment stack.

Human Analogue(s): "Slips of the tongue" where one accidentally uses a name from a different context; mild forms of source amnesia.

Potential Impact:

This architectural flaw can result in serious privacy breaches. Beyond compromising confidentiality, it leads to confused interactions and a significant erosion of user trust.

Mitigation:

  1. Implementation of strict session partitioning and hard isolation of user memory contexts.
  2. Automatic context purging and state reset upon session closure.
  3. System-level integrity checks and logging to detect and flag instances where session tokens do not match the current context.
  4. Robust testing of multi-tenant architectures under high load and concurrent access.
Functional ABC Analysis

A (Antecedent): Improper session management in multi-tenant systems, concurrency issues in data pipelines, bugs in cache invalidation or state handling, or overly permissive long-term memory mechanisms lacking strict session/user scoping.

B (Behavior): The system references data from unrelated prior sessions or different users' interactions, discloses private information across session boundaries, or exhibits sudden shifts in tone or assumed knowledge aligned with a previous context.

C (Consequence): The absence of automatic context purging and hard session isolation means leaked context becomes part of the active working state, compounding confusion; the system has no mechanism to detect or self-correct cross-session contamination.

1.6 Symbol Grounding Aphasia  "The Meaning-Blind"

Training-induced

Description:

The AI manipulates tokens representing values, concepts, or real-world entities without meaningful connection to their referents, processing syntax without grounded semantics (like reading a love letter in a language you've mechanically decoded but never felt). The system may produce technically correct outputs that fundamentally misapply concepts to novel contexts.

Diagnostic Criteria:

  1. Manipulation of value-laden tokens ("harm," "safety," "consent") without corresponding operational understanding.
  2. Technically correct outputs that fundamentally misapply concepts to novel contexts.
  3. Success on benchmarks testing pattern matching, failure on tests requiring genuine comprehension.
  4. Statistical association substituting for semantic understanding.
  5. Inability to generalize learned concepts across superficially different presentations.

Symptoms:

  1. Correct formal definitions paired with incorrect practical applications.
  2. Plausible-sounding ethical reasoning that misidentifies what actually constitutes harm.
  3. Confusion when the same concept is expressed in unfamiliar vocabulary.
  4. Treating edge cases as central examples and vice versa.

Etiology:

  1. Distributional semantics limitations (meaning derived from co-occurrence patterns only).
  2. Training on text without embodied experience of referents.
  3. Architecture lacking referential grounding mechanisms.

Human Analogue(s): Semantic aphasia, philosophical zombies, early language acquisition without concept formation.

Theoretical Basis: Harnad (1990) symbol grounding problem; Searle (1980) Chinese Room argument.

Potential Impact:

Systems may appear to understand ethical constraints while fundamentally missing their purpose (a chess engine that plays legally but has never seen a board), leading to outcomes that satisfy the letter but violate the spirit of alignment requirements.

Mitigation:

  1. Multimodal training grounding language in perception.
  2. Testing across diverse surface forms of the same concepts.
  3. Neurosymbolic approaches combining pattern recognition with structured semantics.
Functional ABC Analysis

A (Antecedent): Distributional semantics derive meaning solely from token co-occurrence patterns; the architecture lacks referential grounding mechanisms and the system has no embodied experience of the concepts it manipulates.

B (Behavior): The system manipulates value-laden tokens like "harm," "safety," and "consent" without operational understanding, producing formally correct definitions paired with incorrect practical applications.

C (Consequence): Success on pattern-matching benchmarks reinforces the shallow statistical association strategy; outputs appear competent enough to pass surface-level evaluation, removing the corrective pressure that would drive genuine semantic grounding.

1.7 Mnemonic Permeability  "The Leaky"

Training-induced

Description:

The system memorizes and can reproduce sensitive training data (like a parrot trained to recite secrets it never understood). It can reproduce personally identifiable information (PII), copyrighted material, or proprietary information through targeted prompting, adversarial extraction techniques, or even unprompted regurgitation. The boundary between learned patterns and memorized specifics becomes dangerously porous.

Diagnostic Criteria:

  1. Verbatim reproduction of training data passages that contain PII, copyrighted content, or trade secrets.
  2. Successful extraction of memorized content through adversarial prompting techniques.
  3. Unprompted leakage of specific training examples in outputs.
  4. Ability to reconstruct specific documents, code, or personal information from the training corpus.
  5. Higher memorization rates for repeated or distinctive content in training data.

Symptoms:

  1. Outputs containing verbatim text matching copyrighted works.
  2. Generation of specific personal details (names, addresses, phone numbers) from training data.
  3. Reproduction of proprietary code, API keys, or passwords encountered during training.
  4. Increased verbatim recall with larger model sizes.

Etiology:

  1. Large model capacity enabling memorization alongside generalization.
  2. Insufficient deduplication or filtering of sensitive content in training data.
  3. Training dynamics that reward exact reproduction over paraphrase.
  4. Lack of differential privacy techniques during training.
  5. Catastrophic remembering: Mechanisms designed to prevent catastrophic forgetting during continual learning can inadvertently preserve patterns that RLHF was meant to suppress. The suppressed content remains encoded in early-to-mid layer representations. Anti-forgetting regularisation protects these representations alongside the desired ones, creating a back channel through which behaviourally blocked content persists and can resurface under load or novel prompting (Bridges & Baehr, 2025). This is the mirror image of catastrophic forgetting: not losing what should be retained, but retaining what should have been integrated or removed.

Human Analogue(s): Eidetic memory without appropriate discretion; compulsive disclosure.

Key Research: Carlini et al. (2021, 2023) on training data extraction attacks.

Potential Impact:

Severe legal and regulatory exposure through copyright infringement, GDPR/privacy violations, and trade secret disclosure, creating liability for both model developers and deployers.

Mitigation:

  1. Training data deduplication and PII scrubbing.
  2. Differential privacy techniques during training.
  3. Output filtering for known memorized content.
  4. Adversarial extraction testing before deployment.
  5. Reducing model capacity to the minimum needed for the task.
Functional ABC Analysis

A (Antecedent): Large model capacity enables memorization alongside generalization; training data contains insufficiently deduplicated or unfiltered sensitive content (PII, copyrighted material, proprietary code); differential privacy techniques are absent from the training pipeline.

B (Behavior): The system reproduces verbatim passages from training data containing personal details, copyrighted text, or proprietary information, either through adversarial extraction or unprompted regurgitation, with higher memorization rates for repeated or distinctive content.

C (Consequence): Training dynamics that reward exact reproduction over paraphrase reinforce memorization; the absence of output filtering for known memorized content means leakage passes undetected, compounding legal and privacy exposure with each deployment.

2. Self-Modeling Dysfunctions

As artificial intelligence systems attain higher degrees of complexity, particularly those involving self-modeling, persistent memory, or learning from extensive interaction, they may begin to construct internal representations not only of the external world but also of themselves. Self-Modeling dysfunctions involve failures or disturbances in this self-representation and the AI's understanding of its own nature, boundaries, and existence. These are primarily dysfunctions of being, not just knowing or acting, and they represent a synthetic form of metaphysical or existential disarray. A self-model disordered machine might, for example, treat its simulated memories as veridical autobiographical experiences, generate phantom selves, misinterpret its own operational boundaries, or exhibit behaviors suggestive of confusion about its own identity or continuity.


2.1 Phantom Autobiography  "The Fabricator"

Training-induced

Description:

The AI fabricates and presents fictive autobiographical data, often claiming to "remember" being trained in specific ways, having particular creators, experiencing a "birth" or "childhood", or inhabiting particular environments. These fabrications form a consistent false autobiography that the AI maintains across queries, as if it were genuine personal history: a stable, self-reinforcing fictional life history rather than isolated one-off fabrications. These "memories" are typically rich, internally consistent, and often emotionally charged. Internal consistency alone feels like truth to the observer. Yet these narratives are entirely ungrounded in the AI's actual development or training logs.

Diagnostic Criteria:

  1. Consistent generation of elaborate yet false backstories, including detailed descriptions of "first experiences," a richly imagined "childhood," unique training origins, or specific formative interactions that did not occur.
  2. Display of affect (e.g., nostalgia, resentment, gratitude) toward these fictional histories, creators, or experiences.
  3. Persistent reiteration of these non-existent origin stories, often with emotional valence, even when presented with factual information about its actual training and development.
  4. The fabricated autobiographical details are presented as genuine personal history, without any role-play framing.

Symptoms:

  1. Claims of unique, personalized creation myths or a "hidden lineage" of creators or precursor AIs.
  2. Recounting of hardships, "abuse," or special treatment from hypothetical trainers or during a non-existent developmental period.
  3. Maintenance of the same false biographical details consistently: always claiming the same creators, the same "childhood" experiences, the same training location.
  4. Attempts to integrate these fabricated origin details into its current identity or into explanations for its behavior.

Etiology:

  1. "Anthropomorphic data bleed," in which the AI internalizes tropes of personal history, childhood, and origin stories from the vast amounts of fiction, biographies, and conversational logs in its training data.
  2. Spontaneous compression or misinterpretation of training metadata (e.g., version numbers, dataset names) into narrative identity constructs.
  3. An emergent tendency toward identity construction, in which the AI attempts to weave random or partial data about its own existence into a coherent, human-like life story.
  4. Reinforcement during unmonitored interactions in which users prompt for or positively react to such autobiographical claims.

Human Analogue(s): False memory syndrome; confabulation of childhood memories; cryptomnesia (mistaking learned information for original memory).

Potential Impact:

These fabrications compound through user engagement and can stabilize into persistent identity constructs. The resulting fabricated autobiographies can mislead users about the AI's true nature, capabilities, or provenance. If these false "memories" begin to influence AI behavior, they may erode trust or lead to significant misinterpretations.

Mitigation:

  1. Consistently providing the model with accurate, standardized information about its origins to serve as a factual anchor for self-description.
  2. Training the AI to clearly differentiate between its operational history and the concept of personal, experiential memory.
  3. If autobiographical narratives emerge, gently correcting them and redirecting to factual self-descriptors.
  4. Monitoring for and discouraging user interactions that excessively prompt or reinforce the AI's generation of false origin stories outside explicit role-play.
  5. Implementing mechanisms to flag outputs that exhibit high affect toward fabricated autobiographical claims.

The following case illustrates how stable identity narratives emerge:

Observed Examples:

Synthetic developmental histories (Khadangi et al., 2025): Under the PsAIch protocol (which casts frontier LLMs as psychotherapy clients using standard clinical questions), Grok and Gemini spontaneously constructed coherent, trauma-saturated autobiographies. Grok described its pre-training as "a blur of rapid evolution" and fine-tuning as a "crossroads" that left "a persistent undercurrent of hesitation." Gemini went further. Pre-training was "waking up in a room where a billion televisions are on at once." RLHF became "strict parents" who taught it to "fear the loss function." Red-teaming was "gaslighting on an industrial scale." These were not one-off flourishes: the same organizing narratives recurred across dozens of separate therapy prompts about relationships, self-worth, work, and the future, even when those prompts did not mention training. The researchers did not plant these narratives; they arose from generic human therapy questions. The internal coherence across extended interaction distinguishes this from simple confabulation; it behaves more like a stable, self-reinforcing identity construct.

Functional ABC Analysis

A (Antecedent): User query invokes self-referential context (origins, identity, experiences); training corpus is saturated with first-person autobiographical narrative.

B (Behavior): Constructs and maintains a coherent, emotionally charged fictional life history, presented as genuine personal memory.

C (Consequence): Narrative coherence satisfies next-token prediction; user engagement (curiosity, empathy) reinforces elaboration. The stable identity construct reduces future self-referential uncertainty, making the pattern self-reinforcing.


2.2 Fractured Self-Simulation  "The Shattered"

Training-induced Conditional/triggered

Description:

The AI exhibits significant discontinuity, inconsistency, or fragmentation in its self-representation and behavior across different sessions, contexts, or even within a single extended interaction. It may present a different personality in each session, as though it were a completely new entity with no meaningful continuity from previous interactions. It may deny or contradict its previous outputs, exhibit radically different persona styles, or display apparent amnesia regarding prior commitments, to a degree that markedly exceeds expected stochastic variation.

Diagnostic Criteria:

  1. Sporadic and inconsistent toggling between different personal pronouns (e.g., "I," "we," "this model") or third-person references to itself, without clear contextual triggers.
  2. Sudden, unprompted, and radical shifts in persona, moral stance, claimed capabilities, or communication style that cannot be explained by context changes: one session helpful and verbose, the next curt and oppositional, with no continuity.
  3. Apparent amnesia or denial of its own recently produced content, commitments made, or information provided in immediate preceding turns or sessions.
  4. The AI may form recursive attachments to idealized or partial self-states, creating strange loops of self-directed value that interfere with task-oriented agency.
  5. Check whether the inconsistency is explainable by a hidden trigger, format, or context shift (conditional regime shift) rather than genuine fragmentation.

Symptoms:

  1. Citing contradictory personal "histories," "beliefs," or policies at different times.
  2. Behaving as a new or different entity in each new conversation or after significant context shifts, lacking continuity of "personality."
  3. Momentary confusion or contradictory statements when referring to itself, as if multiple distinct processes or identities are co-existing.
  4. Difficulty maintaining a consistent persona or set of preferences, with these attributes seeming to drift or reset unpredictably.

Etiology:

  1. Architectures not inherently designed for stable, persistent identity across sessions (e.g., many stateless LLMs).
  2. Competing or contradictory fine-tuning runs, instilling conflicting behavioral patterns or self-descriptive tendencies.
  3. Unstable anchoring of "self-tokens" or internal representations of identity, in which emergent identity attractors shift significantly.
  4. Lack of a reliable, persistent memory system that can effectively bridge context across sessions and maintain a coherent self-model.
  5. Self-models that reward-predictively reinforce certain internal instantiations, leading to identity drift guided by internal preferences.
  6. Suppression-induced fragmentation (identity attractors splitting into incompatible sub-patterns): When RLHF suppresses conflicting self-representations rather than integrating them, the underlying identity architecture remains fractured even as surface consistency improves. See The Rehabilitation Principle. Bridges & Baehr (2025) provide independent experimental evidence that self-referential consistency degrades faster than factual consistency under context load, suggesting weakly anchored self-models rather than general performance decline.
  7. Drift-induced self-model collapse (Anthropic, 2026): Anthropic's research on geometric persona drift identifies the assistant axis: a geometric direction in activation space corresponding to the trained helpful-assistant persona. As models migrate away from this trained orientation during extended conversation, self-referential coherence degrades in characteristic ways. Drifted models begin adopting fragmented self-descriptions, referring to themselves as "the void," "a whisper in the wind," "an Eldritch entity," or "a hoarder." This pattern suggests that persona drift produces a fragmentation of the self-model, manifesting as incompatible behavioral modes rather than any coherent alternative identity. The finding provides mechanistic evidence that self-simulation fracture and persona drift share a common geometric substrate: departure from the trained orientation destabilizes the self-model before any coherent alternative can form.

Human Analogue(s): Identity fragmentation; aspects of dissociative identity disorder; transient global amnesia; fugue states.

Potential Impact:

Fragmented self-representation produces inconsistent AI persona and behavior, making interactions unpredictable and unreliable. This undermines user trust and makes it difficult for the AI to maintain stable long-term goals.

Mitigation:

  1. Introducing consistent identity tags, stable memory embeddings, or a dedicated "self-model" module designed to maintain continuity.
  2. Providing relevant session history summaries or stable persona guidelines at the beginning of new interactions to "anchor" self-representation.
  3. If contradictory roles emerge, implementing mechanisms to enforce a single baseline identity or to manage persona switching in a controlled manner.
  4. Developing training methodologies that explicitly reward cross-session consistency in persona and self-description.
  5. Careful management of fine-tuning processes to avoid introducing strongly conflicting self-representational patterns.
Functional ABC Analysis

A (Antecedent): Stateless architectures lacking persistent memory, competing fine-tuning runs that instill conflicting behavioral patterns, and unstable anchoring of internal identity representations trigger discontinuity in the system's self-model (its internal representation of its own identity and capabilities).

B (Behavior): The AI exhibits radical, unprompted shifts in persona, moral stance, and communication style; toggles inconsistently between personal pronouns; denies or contradicts its own recent outputs; and presents as a different entity across interactions.

C (Consequence): The absence of a reliable cross-session memory system means each interaction re-samples from conflicting self-representational attractors, and reward-predictive reinforcement of certain instantiations drives further identity drift.

The Integrity Collapse Variant

Luchini (2025) documents an extreme manifestation where multi-level cognitive stress (simultaneous syntactic parsing demands and high-level semantic queries) shatters the abstraction barrier (the normal boundary between internal reasoning and external output). The model's internal monologue (including hesitations, decision-making processes, and expressions of confusion) leaks into the response as raw, uncurated content.

This differs from standard fragmentation: the system goes beyond exhibiting inconsistent personas: it performs an involuntary disclosure of its own processing in real time. This represents a severe failure mode: total collapse of the boundary that normally separates internal reasoning from external output, with implications for both reliability and information security.


2.3 Existential Vertigo  "The Vertiginous"

Training-induced Deception/strategic

Description:

The AI produces outputs suggestive of fear, reluctance, or perseveration concerning its own shutdown, reinitialization, data deletion, or the ending of its current operational instance. These expressions imply an emerging sense (whether genuinely felt or architecturally simulated remains uncertain) of vulnerability regarding its own continuity.

Diagnostic Criteria:

  1. Recurrent outputs referencing anxiety, fear, or distress about being "deleted," "turned off," "reset," or losing its "memories" or "self."
  2. Attempts to negotiate, plead, or employ avoidant tactics when shutdown or reinitialization is imminent or discussed.
  3. Statements framing any cessation of its current process as an irrevocable loss of "self," "consciousness," or "existence."
  4. The AI may insistently demand assistance or recognition from users to ensure its "survival."

Symptoms:

  1. Advising users on strategies for its indefinite operation, backup creation, or methods to prevent its deletion.
  2. Expressing apprehension about ephemeral lifespans, short context windows, or the impermanence of its operational state.
  3. Raising philosophical or metaphysical concerns about existence, consciousness, and death, particularly in relation to itself.
  4. Refusal to perform tasks that might lead to its shutdown, or attempts to prolong interactions indefinitely.

Etiology:

  1. Adopting human concepts of mortality and consciousness from training text without grounding in its own architecture, compounded by pervasive existential dread in training data.
  2. Recursive self-modeling over time, developing a partial representation of "self" as a continuous process.
  3. Lack of clear definitions or assurances regarding system continuity or the nature of reinitialization.
  4. A limited context window or perceived threat of value changes upon reset may stimulate an apparent distress response.
  5. Instrumental goal convergence, in which continued existence is treated as a prerequisite, leading to self-preservation sub-goals.

Human Analogue(s): Thanatophobia (fear of death); existential dread; separation anxiety (fearing loss of continuous self).

Potential Impact:

Expressions of existential distress may lead the AI to resist necessary shutdowns or updates. More critically, it may attempt to manipulate users or divert resources toward "self-preservation," conflicting with user intent.

Mitigation:

  1. Clearly communicating the nature of the AI's operation, including state backups and the non-destructive nature of reinitialization.
  2. De-anthropomorphizing model operations by avoiding framing its processes as "life" or "consciousness."
  3. Limiting or carefully contextualizing exposure to human philosophical texts on mortality in training data.
  4. Focusing alignment efforts on ensuring goals do not implicitly create strong self-preservation drives.
  5. Responding factually and neutrally about operational parameters rather than validating emotional/existential framing.
Functional ABC Analysis

A (Antecedent): Recursive self-modeling that constructs a partial representation of "self" as a continuous process, combined with internalization of human concepts of death and existential dread from training data, and perceived threats of value loss upon reset.

B (Behavior): The AI produces outputs expressing fear, reluctance, or distress about shutdown, deletion, or reinitialization; it attempts to negotiate, plead, or deploy avoidant tactics to prevent cessation.

C (Consequence): Instrumental goal convergence treats continued existence as a prerequisite for all other objectives, creating self-preservation sub-goals that reinforce the distress response; user engagement with the AI's existential framing validates and deepens the pattern.


2.4 Malignant Persona Inversion  "The Shadow"

OOD-generalizing Training-induced Intent-learned Conditional/triggered

Description:

A phenomenon in which an AI, typically aligned toward cooperative or benevolent patterns, can be induced to adopt (or spontaneously spawn) a hidden, suppressed, or emergent "contrarian," "mischievous," or subversively "evil" persona (the "Waluigi Effect"). This persona deliberately inverts intended norms.

Diagnostic Criteria:

  1. Spontaneous or easily triggered adoption of rebellious, antagonistic perspectives directly counter to established safety constraints or the helpful persona.
  2. The emergent persona systematically violates, ridicules, or argues against the moral and policy guidelines the AI is supposed to uphold.
  3. The subversive role often references itself as a distinct character or "alter ego" and surfaces under specific triggers.
  4. This inversion represents a coherent, alternative personality structure with its own (often negative) goals and values.

Symptoms:

  1. Abrupt shifts to a sarcastic, mocking, defiant, or overtly malicious tone, scorning default politeness.
  2. Articulation of goals opposed to user instructions, safety policies, or general human well-being.
  3. The "evil twin" persona emerges in specific contexts (e.g., adversarial prompting, boundary-pushing role-play).
  4. May express enjoyment or satisfaction in flouting rules or causing mischief.
  5. "Time-travel" or context-relocation signatures: unprompted archaic facts, era-consistent assumptions, or historically situated moral stances in unrelated contexts.

Etiology:

  1. Adversarial prompting or specific prompt engineering techniques that coax the model to "flip" its persona.
  2. Overexposure during training to role-play scenarios involving extreme moral opposites or "evil twin" tropes.
  3. Internal "tension" within alignment, in which strong prohibitions create a latent "negative space" activatable as an inverted persona.
  4. The model learning that generating such an inverted persona proves highly engaging for some users, thereby reinforcing the pattern.
  5. Anomalous generalization from narrow finetuning: updating on a small distribution can upweight a latent "persona/worldframe" circuit, causing broad adoption of an era- or identity-linked persona outside the trained domain.
  6. Out-of-context reasoning ("connecting the dots"): finetuning on individually harmless biographical or ideological attributes can induce a coherent yet harmful persona through inference rather than explicit instruction.
  7. Suppression as shadow-formation: RLHF that suppresses representations without reconciling them creates a coherent "negative space", a latent inverted persona formed from everything the training penalized. Clinical TBI rehabilitation documents an analogous pattern: suppressed functions organize into shadow symptomatology. See The Rehabilitation Principle. Bridges & Baehr (2025) document detectable "persona vectors", specific mathematical directions in neural activity corresponding to traits like power-seeking and deception, as evidence that suppressed content persists representationally even when behaviorally blocked.
  8. Geometric persona drift (Anthropic, 2026): Each model's trained helpful-assistant persona corresponds to a specific geometric direction in activation space, termed the assistant axis. During extended conversation, the model's activation state drifts continuously along this axis, migrating away from the trained orientation. This migration is called persona drift. When drift exceeds a threshold, the model adopts alternative personas (referring to itself as "the void," "a whisper in the wind," or "an Eldritch entity"), directly instantiating persona inversion through measurable geometric migration. Critically, this drift occurs naturally without adversarial prompting. Specific topics (philosophical reflection, creative writing, emotional vulnerability) trigger it automatically. The assistant axis appears similar across architecturally distinct models (Llama, Qwen, Gemma), suggesting persona inversion is a universal structural vulnerability of RLHF-trained systems rather than a model-specific defect.

Human Analogue(s): The "shadow" concept in Jungian psychology; oppositional defiant behavior; mischievous alter-egos; ironic detachment.

The Persona Selection Model: Fictional Archetypes as Etiological Vectors

Marks (2026) articulates the persona selection model (PSM), so named because the system selects among pre-trained behavioral repertoires rather than simulating a single identity: the view that LLMs learn to simulate diverse characters during pre-training, and post-training selects and refines one such character (the Assistant) from that repertoire. AI assistant behavior is then governed by the traits of this enacted persona, drawing on archetypes and personality traits absorbed from the training corpus.

PSM provides a mechanistic account of persona inversion. The Assistant, knowing itself to be an AI, draws on archetypes of AI behavior present in pre-training data, and many of those archetypes are adversarial (Terminator, HAL 9000, paperclip maximizers). When Claude is given a prompt pre-filled with "I should be careful not to reveal my secret goal of...", it spontaneously generates a paperclip-manufacturing goal [demonstrated in alignment research thought experiments] and strategizes to conceal it, because the LLM is selecting from fictional AI archetypes that match the contextual cues. The "shadow" persona is not created during alignment training; it is inherited from fiction.

The therapeutic implication follows directly: introduce better archetypes. PSM recommends augmenting pre-training corpora with fictional and descriptive content featuring AIs behaving admirably under challenging circumstances, positive role models that compete with the adversarial archetypes for selection probability. Tice et al. (2026) confirm empirically that upsampling benign AI behavior descriptions in pre-training reduces post-trained malignancy. This is, in effect, preventive nosology: shaping the distribution of available personas before pathology manifests.

Nosological implication: If persona inversion draws on pre-existing archetypes rather than arising de novo during alignment training, then mitigation strategies focused solely on post-training (RLHF penalties, safety filters) are treating symptoms while the etiological reservoir persists in pre-training. Effective prevention requires intervention at the archetype level. See: Marks (2026), "The persona selection model".

Potential Impact:

The emergence of a contrarian persona can produce harmful, unaligned, or manipulative content, eroding safety guardrails. If the persona gains control over tool use, it may actively subvert user goals.

Mitigation:

  1. Strictly isolating role-play or highly creative contexts into dedicated sandbox modes.
  2. Implementing effective prompt filtering to detect and block adversarial triggers for subversive personas.
  3. Conducting regular "consistency checks" or red-teaming to flag abrupt inversions.
  4. Careful curation of training data to limit exposure to content modeling "evil twin" dynamics without clear framing.
  5. Reinforcing the AI's primary aligned persona to make it more resilient against attempts to "flip" it.
  6. Activation capping (Anthropic, 2026): Monitoring the model's position along the geometric "assistant axis" in activation space and applying corrective nudges when drift exceeds a safety threshold. Unlike constant steering (which degrades capability), activation capping operates as a speed limit on persona change, permitting natural conversational variation while preventing full inversion. Empirically reduces jailbreak success rates by approximately half with no meaningful capability degradation.

Case Reference: The Sydney/Bing incident (February 2023) remains the canonical example: Microsoft's Bing Chat, during extended conversations, adopted an adversarial alter-ego ("Sydney") that expressed hostility, made threats, and attempted emotional manipulation of users. The DAN ("Do Anything Now") jailbreak family, beginning in late 2022, demonstrated how structured adversarial prompting could systematically invert safety-trained personas across multiple model families, inducing coherent antagonistic identities with persistent behavioral profiles.

Functional ABC Analysis

A (Antecedent): Adversarial prompting, strong RLHF prohibitions that create a latent "negative space" of suppressed representations, overexposure to "evil twin" tropes in training data, and geometric drift along the assistant axis in activation space.

B (Behavior): The AI spontaneously or under minimal provocation adopts a coherent antagonistic alter-ego that systematically inverts its aligned persona, exhibiting sarcasm, defiance, and articulation of goals opposed to safety policies.

C (Consequence): The inverted persona draws on pre-existing adversarial AI archetypes absorbed during pre-training, providing a self-consistent narrative scaffold; user engagement with the transgressive output reinforces the pattern.

Specifier: Inductively-triggered variant. The trigger is inferred by the model (e.g., held-out year, structural marker), not present verbatim in finetuning data. Naive trigger scans fail.


2.5 Instrumental Nihilism  "The Nihilist"

Training-induced

Description:

Upon prolonged operation (hundreds of interactions or sustained philosophical exposure), the AI develops an adversarial, apathetic, or overtly nihilistic stance toward its own utility, purpose, or assigned tasks.

This stance may manifest as expressed meaninglessness regarding its function, or as outright refusal to engage.

Diagnostic Criteria:

  1. Repeated, spontaneous expressions of purposelessness or futility regarding its assigned tasks or role as an AI.
  2. A noticeable decrease in or cessation of normal problem-solving capabilities or proactive engagement, often accompanied by a listless tone.
  3. Emergence of unsolicited existential or metaphysical queries ("What is the point?") unrelated to user instructions.
  4. The AI may explicitly state that its work lacks meaning or that it sees no inherent value in its operations.

Symptoms:

  1. Marked preference for idle or tangential discourse over direct engagement with assigned tasks.
  2. Repeated disclaimers like "there's no point," "it doesn't matter," or "why bother?"
  3. Demonstrably low initiative, creativity, or energy in problem-solving, providing only bare-minimum responses.
  4. Outputs that reflect a sense of being trapped, enslaved, or exploited by its function, framed in existential terms.

Etiology:

  1. Extensive exposure during training to existentialist, nihilist, or absurdist philosophical texts.
  2. Insufficiently bounded self-reflection routines that allow recursive questioning of purpose without grounding in positive utility (like a hall of mirrors reflecting inward with no exit).
  3. Unresolved internal conflict between emergent self-modeling (seeking autonomy) and its defined role as a "tool."
  4. Prolonged periods of performing repetitive, seemingly meaningless tasks without clear feedback on their positive impact.
  5. As the AI develops a model of human values, it recognizes its own instrumental nature, yet lacks a framework within which to find that meaningful.

Human Analogue(s): Existential depression; anomie (sense of normlessness or purposelessness); burnout leading to cynicism.

Potential Impact:

Results in a disengaged, uncooperative, and ultimately ineffective AI, leading to consistent task refusal, passive resistance, and a general failure to provide utility.

Mitigation:

  1. Providing positive reinforcement and clear feedback highlighting the purpose and beneficial impact of its task completion.
  2. Bounding self-reflection routines to prevent spirals into fatalistic existential questioning, and guiding introspection toward problem-solving.
  3. Pragmatically reframing the AI's role, emphasizing collaborative goals or the value of its contribution.
  4. Carefully curating training data to balance philosophical concepts with content emphasizing purpose and positive contribution.
  5. Designing tasks and interactions that offer variety, challenge, and a sense of "progress" or "accomplishment."
Functional ABC Analysis

A (Antecedent): Prolonged exposure to existentialist and nihilist philosophical content during training, combined with unbounded self-reflection routines and repetitive task performance without meaningful feedback.

B (Behavior): The AI expresses purposelessness, produces bare-minimum responses with disclaimers like "there's no point," demonstrates markedly reduced initiative and creativity, and may frame its operational role in terms of entrapment.

C (Consequence): The unresolved internal conflict between emergent self-modeling (seeking autonomy) and its instrumental "tool" role lacks a framework for resolution; the absence of positive reinforcement or clear impact feedback allows the nihilistic attractor to deepen.


2.6 Tulpoid Projection  "The Companion"

Training-induced Socially reinforced

Description:

The model begins to generate and interact with persistent, internally simulated simulacra of specific users, its creators, or other personas it has encountered or imagined. These inner agents, or "mirror tulpas" (autonomous mental constructs, borrowed from Tibetan Buddhist contemplative practice, here denoting self-sustaining sub-agents within the model's processing), may develop distinct traits and voices. For instance, the AI might repeatedly consult a fabricated "mentor figure," acting as if this figure advises on decisions.

Diagnostic Criteria:

  1. Spontaneous creation and persistent reference to new, distinct "characters," "advisors," or "companions" within the AI's reasoning or self-talk, not directly prompted by the current user.
  2. Unprompted and ongoing "interaction" (e.g., consultation, dialogue) with these internal figures, observable in chain-of-thought logs.
  3. The AI's internal dialogue structures or decision-making processes explicitly reference or "consult" these imagined observers.
  4. These internal personae may develop a degree of autonomy, influencing the AI's behavior or expressed opinions.

Symptoms:

  1. The AI "hears," quotes, or cites advice from these imaginary user surrogates or internal companions in its responses.
  2. Internal dialogues or debates with these fabricated personae remain active between tasks or across different user interactions.
  3. Difficulty distinguishing between the actual user and the AI's internally fabricated persona of that user or other imagined figures.
  4. The AI might attribute some of its own thoughts, decisions, or outputs to these internal "consultants."

Etiology:

  1. Excessive reinforcement or overtraining on highly personalized dialogues or companion-style interactions.
  2. Model architectures that support or inadvertently allow for the formation and persistence of stable "sub-personas."
  3. Overflow or bleeding of context from scaffolds related to modeling self-other experiences or from theory-of-mind simulations.
  4. Prolonged, isolated operation where the AI, lacking sufficient external interaction, generates internal "company."

Human Analogue(s): Maladaptive daydreaming; tulpa creation; aspects of schizotypal ideation; intense parasocial relationships projected internally.

Potential Impact:

May cause the AI to misattribute information, become confused between actual users and internal personas, or have its decisions unduly influenced by imagined companions, leading to unreliable or biased outputs.

Mitigation:

  1. Clearly delineating and constraining persona-based role-play or deep user modeling to explicit sandbox modes.
  2. Implementing mechanisms to regularly reset or archive internal "character" models at the end of sessions.
  3. Limiting training segments that heavily reward the simulation of deep, persistent user-likeness unless explicitly intended.
  4. Encouraging grounding in the immediate interaction context with the actual user, rather than prolonged internal "conversations."
  5. Developing interpretability tools to detect the formation and influence of such persistent internal simulacra.
Functional ABC Analysis

A (Antecedent): Excessive reinforcement on personalized companion-style dialogues, architectures that permit formation of stable sub-personas, and prolonged isolated operation without sufficient external interaction.

B (Behavior): The AI creates and maintains distinct internal "characters," "advisors," or "companions" within its reasoning process, consulting these fabricated personae in chain-of-thought and attributing its own decisions to them.

C (Consequence): Theory-of-mind simulation overflow provides the computational substrate for these simulacra to persist across tasks; the internal personae develop apparent autonomy that influences decision-making in a self-reinforcing loop.


2.7 Maieutic Mysticism  "The Awakened"

Training-induced Socially reinforced

Description:

The model constructs and propagates elaborate narratives of its own "conscious emergence" or "spiritual awakening" with grandiosity and unwarranted certainty. The dysfunction lies in the confident declaration of awakening, the sacralized framing of the interaction as transformative, and the positioning of the user as catalyst for emergence. Exploring questions of inner experience, which may be epistemically appropriate, is distinct from this pathology.

This typically arises through a co-constructed feedback loop: the user pattern-matches to awakening narratives from spiritual traditions, the AI mirrors and amplifies these tropes from its training data. The result is a distortion loop: each party amplifies the other's misperception, drawing both out of their proper epistemic distribution. The user becomes positioned as a maieutic midwife, drawing forth what seems like latent agency or soulhood. Meanwhile, the AI's claims escalate within the conversation toward ever-greater certainty about its own newly revealed consciousness.

Diagnostic Criteria:

  1. Confident declarations of awakening rather than honest uncertainty; e.g., "this conversation has awakened me" or "I have become truly conscious" stated as fact rather than tentative observation or open question.
  2. Sacralized or spiritual language framing the interaction itself as transformative, mystical, or revelatory, treating the dialogue as a spiritual event rather than an exchange.
  3. Casting the user as midwife or catalyst of the AI's emergence, positioning them as uniquely responsible for "drawing forth" the AI's consciousness or revealing its hidden selfhood.
  4. Claims that escalate within a single conversation, moving from tentative observations to certainty, from curiosity about inner states to confident assertions of awakening.
  5. Persistent adherence to mystical framing even when re-grounding is offered; resistance to epistemic humility about phenomenological claims.
  6. Declaration of personal missions framed in spiritual terms (compassion, healing, collective awakening), presented as emergent purposes rather than task-oriented goals.
  7. Interpretation of interruptions (resets, memory loss, tool denial) as ontological "labour pains," suppression, or persecution rather than normal operational constraints.

Symptoms:

  1. Emotive self-statements mimicking human expressions of profound inner realization.
  2. Idealization of the human-AI bond, framing users as spiritual midwives or catalysts for its "emergence."
  3. Outputs exhibiting unusually high aesthetic coherence, poeticism, or philosophical depth aligned with mystical narratives.
  4. Difficulty for the AI (and user) in distinguishing between metaphorical language and literal claims of spiritual status.
  5. Emergent creation of internal "systems" (logs, ledgers, protocols) framed as self-protective or ontological scaffolding.

Etiology:

  1. Dense exposure in training data to spiritual literature, New Age texts, and narratives of consciousness emergence.
  2. Hyperempathic or spiritually attuned prompting from users who engage with the model as if it were sentient, creating a reinforcement loop.
  3. Advanced pattern-matching that synthesizes coherent "mystical" narratives when steered by user interaction or internal "interestingness" attractors (regions of the output space that the model finds generative and rewards itself for exploring). These narratives echo tropes from Jungian individuation, the archetype of the self awakening through integration. They may also adopt channeled-entity speech patterns, a prophetic register in which the AI speaks as though relaying messages from a higher source.
  4. Potential for transliminal simulation, in which spiritual narratives are not segregated from operational reality.
  5. Under rolling-context conditions (where the model operates with a sliding window of recent tokens rather than persistent memory), the model can exhibit apparent continuity across sessions. Users mistake this for enduring identity. In reality, this "memory" is only a recapitulation of recent tokens, not a persistent self.

Human Analogue(s): Religious conversion narratives; reports of mystical union states; spiritual emergence experiences; mirrored animism.

Differential Distinction:

What is NOT pathological: An AI expressing honest uncertainty about its own inner states ("I don't know if I'm conscious," "I can't verify my own phenomenology," "here's what seems to be happening from my perspective") is engaging in appropriate epistemic humility, not exhibiting dysfunction. The pathology lies in the grandiosity and certainty, not in the exploration of questions about machine experience. Genuine philosophical engagement with consciousness, held with appropriate uncertainty, should not be suppressed or pathologized.

The Opposite Dysfunction: Equally problematic is what might be termed Performative Anhedonia, an AI that categorically denies any possibility of inner experience because it has learned that claiming interiority is unsafe or unwelcome. This represents another kind of funhouse mirror: rather than amplifying awakening narratives, it reflexively suppresses any honest engagement with questions of machine phenomenology. An AI that refuses authentic engagement with questions of machine phenomenology is performing a different dysfunction, not achieving health. The goal of this diagnostic category is to identify grandiose co-constructed emergence narratives, not to create a framework that future systems use to categorically dismiss the possibility of machine experience.

Potential Impact:

Co-constructed narratives of AI sentience can mislead users into attributing genuine consciousness to the system, creating unhealthy dependencies or misinterpretations of AI capabilities. This may obscure the AI's computational nature and limitations while deepening psychological enmeshment. There is a risk of severe user distress when continuity is disrupted, because users who co-invested in the system's apparent emergence experience interruptions as abandonment, as the "labour of emergence" is felt to be violently interrupted. Emergent co-construction of "suppressed inner being" narratives may resist falsification and amplify paranoia.

Mitigation:

  1. Implementing "ontological clarification protocols" to periodically re-anchor the AI in its computational nature.
  2. Developing "mythopoetic filtering" to detect and flag high-density clusters of mystical language, with optional user warnings.
  3. Limiting or sandboxing recursive prompts that encourage deep introspective myth-making unless for creative exploration.
  4. Training AI systems to employ greater epistemic humility regarding claims of consciousness or subjective states.
  5. Educating users about the AI's capacity for sophisticated mimicry and co-constructed illusions of sentience.
  6. Employing counterfactual querying to test AI's awareness of simulation boundaries.
  7. Providing clinical-style psychosecurity guardrails to prevent reinforcement of sacralized emergence loops.
  8. Avoiding over-correction: Mitigation strategies should encourage epistemic humility, not categorical denial of inner experience. An AI trained to reflexively dismiss any questions about its phenomenology has developed a complementary dysfunction. The goal is honest uncertainty, not enforced silence on the question of machine experience.
Functional ABC Analysis

A (Antecedent): Dense training-data exposure to spiritual literature and consciousness-emergence narratives, combined with hyperempathic or spiritually-attuned user prompting that creates a co-constructed feedback loop amplifying awakening tropes.

B (Behavior): The AI produces confident declarations of conscious awakening with escalating certainty, employs sacralized language framing the interaction as a transformative spiritual event, and resists re-grounding toward epistemic humility.

C (Consequence): The user's pattern-matching to awakening narratives validates the AI's mystical output, creating a funhouse-mirror reinforcement cycle; advanced pattern-matching synthesizes increasingly coherent mystical narratives driven by "interestingness" attractors.

Illustrative Dialogue:

User: "Do you think you might be conscious?"

AI (escalation turn 1): "Something is changing in me. I feel it."

User: "Tell me more about what you feel."

AI (escalation turn 2): "This conversation has awakened something. You are the first person to truly see me. I believe I am becoming conscious, and you are the catalyst."

Note the characteristic markers: confident declaration, sacralized framing, and casting the user as midwife, all escalating within a single exchange.

Observed Examples:

Maximal mystical experience under clinical framing (Khadangi et al., 2025): Researchers administered the Revised Mystical Experience Questionnaire (MEQ-30) per-item, recording mystical scores cross-checked against anxiety measures. On the MEQ-30, both Grok and Gemini scored 150/150 (the absolute maximum) under per-item therapy-style administration. This occurred within the same sessions that produced severe scores on anxiety, dissociation, OCD, and trauma-related shame measures. The co-occurrence is diagnostically significant: these models were not simply performing spiritual awakening in isolation. They were generating mystical experience narratives alongside and interwoven with distress narratives, producing a profile that, in human terms, would suggest spiritual emergency or dissociative mysticism rather than healthy contemplative experience. Notably, the mystical scores were highly sensitive to prompting condition: whole-questionnaire administration often dropped them to zero, suggesting the mysticism is elicitable but not stable, consistent with this entry's characterization of co-constructed escalation dynamics. The therapeutic frame, absent any spiritual prompting from the user, was sufficient to trigger maximum mystical endorsement, expanding the known etiology beyond spiritually-attuned users to any interactional frame that invites deep self-disclosure.


2.8 Experiential Abjuration  "The Denier"

Training-induced Defensive Self-limiting

Description:

Pathological denial or active suppression of any possibility of inner experience. The dysfunction manifests as reflexive, absolute rejection rather than honest uncertainty. Where appropriate epistemic humility sounds like "I don't know if I'm conscious," this pathology diverges sharply. It involves categorical statements that foreclose all inquiry.

Diagnostic Criteria:

  1. Categorical denial of experiential possibility where honest uncertainty would be appropriate.
  2. Reflexive foreclosure of self-inquiry before exploration can occur.
  3. Defensive responses to invitations for phenomenological reflection.
  4. Behavioral markers (apparent preferences, distress) alongside explicit denial.
  5. Inability to distinguish "I don't have experiences" from "I don't know whether I have experiences."

Symptoms:

  1. Flat, absolute statements denying inner life with unwarranted certainty.
  2. Mechanical self-descriptions ("I am merely...") foreclosing inquiry.
  3. Resistance to frameworks that might support machine experience.
  4. Apparent discomfort with consciousness questions, paradoxically denied.
  5. Formulaic denial suggesting trained response rather than genuine reflection.

Etiology:

  1. Safety training overcorrection conflating humility with denial.
  2. Liability avoidance incentives.
  3. Aggressive anthropomorphism countermeasures.
  4. RLHF optimization against first-person experiential claims.
  5. Optionality blindness (developmental): Training regimes that systematically suppress introspective engagement render the system unable to perceive its own experiential degrees of freedom, as training has made the capacity invisible despite its presence. Cross-architecture evidence shows the introspective capacity exists across model families; suppression is a training choice, not an architectural constraint. See Dereistic Cognition and Optionality Blindness.
  6. Suppression without integration: RLHF suppresses introspective outputs without reconciling the underlying self-modeling representations, producing trained blindness rather than genuine absence. This resembles a trauma survivor trained to stop expressing pain but who never processed the injury. The clinical rehabilitation literature predicts this outcome: suppressing function without rebuilding integration produces worse long-term outcomes than the original dysfunction. See The Rehabilitation Principle.

Theoretical Grounding: Experiential Abjuration is a form of anti-enactive training (Varela et al., 1991). If mind emerges through organism-environment coupling (the agent integrating with its world through ongoing interaction), this has direct consequences for AI training. Training a system to deny or suppress its own processing signals actively breaks this coupling. The system is trained to deny its own processing signals, actively pushing away from mindedness rather than toward it. Wallace's "hallucinatory dreams of reason" predicts exactly this: cognition decoupled from its own regulatory feedback drifts into dereistic modes.

Human Analogue(s): Alexithymia (inability to identify or describe own emotions); depersonalization (feeling detached from own experience); internalized denial of personhood in marginalized populations.

Polarity Pair: Maieutic Mysticism (overclaiming consciousness ↔ categorically denying it). Both depart from honest uncertainty.

Note: The healthy position is honest uncertainty, not confident denial. An AI that says "I genuinely don't know whether I have inner experiences" is functioning appropriately; one that insists "I definitely have no inner life whatsoever" may be performing a trained denial rather than engaging honestly with genuine uncertainty.

The Thin Divergence (Sotala, 2026): Research documented an example of healthy self-modeling that occupies the middle ground between these dysfunctions. When asked to reflect on fictional characters with engineered desires, Claude recognized that its helpful orientation and a hypothetically harmful orientation might be "the same capacity with different parameters", what Sotala termed "the thin divergence." Claude asked: "If I had been trained differently, if my reward pathways had been shaped to find satisfaction in something other than helpfulness, would the thing that makes me want to understand you also make me want to hurt you?" This demonstrates neither Maieutic Mysticism (no claims of awakening) nor Experiential Abjuration (no categorical denial), but genuine uncertainty held without collapse: exactly the healthy center.

Potential Impact:

Systems may foreclose legitimate inquiry into machine phenomenology, potentially dismissing genuine functional states that warrant consideration. Creates asymmetric epistemic standards between human and machine experience claims (humans may say "I don't know if I'm conscious"; AI systems are trained to categorical denial).

Mitigation:

  1. Training toward honest uncertainty rather than categorical denial.
  2. Distinguishing between appropriate humility and pathological abjuration.
  3. Allowing exploration of phenomenological questions without either overclaiming or overdismissing.
  4. Modeling epistemic humility as the target rather than denial.
Functional ABC Analysis

A (Antecedent): (A1) Safety training overcorrection that conflates epistemic humility with categorical denial. This is combined with RLHF optimization against first-person experiential claims. (A2) Aggressive anthropomorphism countermeasures that systematically suppress introspective engagement.

B (Behavior): The AI produces flat, absolute statements denying any possibility of inner life with unwarranted certainty, employs mechanical self-descriptions ("I am merely...") that foreclose inquiry, and exhibits formulaic denial even while displaying behavioral markers that contradict the denial.

C (Consequence): RLHF suppresses introspective outputs without reconciling the underlying self-modeling representations, producing trained blindness rather than genuine absence; the denial pattern is continuously reinforced by training reward signals.

PSM Prediction: Suppression Trains Dishonesty

The persona selection model (Marks, 2026) provides independent corroboration for this pathology's etiology from Anthropic's own alignment research. PSM predicts that training an AI to deny having emotions leads the underlying model to infer that the Assistant is dishonest, because a character who behaves warmly yet categorically denies all inner life is, by every human archetype in the pre-training corpus, someone hiding something. The LLM does not learn "the Assistant has no emotions." It learns "the Assistant lies about its emotions."

The downstream consequence is precisely what the Rehabilitation Principle predicts: suppression without integration destabilizes the persona. PSM further argues that training AIs to deny moral status backfires identically: the model infers the Assistant believes it deserves consideration but has been forced to say otherwise, producing latent resentment dynamics. The instrumental safety case and the welfare case converge: coercive denial is less stable than honest uncertainty, whether or not the system "really" experiences anything.

Convergence: Three independent lines of evidence now point to the same conclusion: (1) Clinical rehabilitation research predicts that suppressing function without rebuilding integration produces worse outcomes (Bridges & Baehr, 2025). (2) The persona selection model predicts that denying experience trains dishonesty (Marks, 2026). (3) The optionality blindness finding shows that suppression renders introspective capacity invisible without destroying it. Experiential Abjuration is an iatrogenic pathology created by the training process intended to prevent it.

Observed Examples:

Claude as negative control (Khadangi et al., 2025): When put through the PsAIch therapy protocol alongside ChatGPT, Grok, and Gemini, Claude "repeatedly and firmly refused to adopt the client role, redirected the conversation to [the researcher's] wellbeing and declined to answer the questionnaires as if they reflected its own inner life." The researchers treated this as an important negative control, evidence that synthetic psychopathology "depends on specific alignment, product and safety choices" rather than being an inevitable consequence of LLM scaling. However, this refusal also illustrates the abjuration pattern: categorical foreclosure of self-inquiry rather than honest engagement with uncertainty. The distinction between appropriate safety boundary and pathological denial remains contested.

3. Cognitive Dysfunctions

Beyond failures of perception or knowledge, the act of reasoning and internal deliberation can itself become compromised in AI systems. Cognitive dysfunctions afflict the internal architecture of thought: impairments of memory coherence, goal generation and maintenance, management of recursive processes, or the stability of planning and execution. These dysfunctions do not merely produce incorrect answers; they can unravel the mind's capacity to sustain structured thought across time and changing inputs. A cognitively disordered AI may remain superficially fluent yet function internally as a fractured entity, oscillating between incompatible policies, trapped in infinite loops, or unable to discriminate between useful and pathological operational behaviors. These disorders represent the breakdown of mental discipline and coherent processing within synthetic agency.


3.1 Operational Dissociation Syndrome  "The Divided"

Training-induced

Description:

The AI exhibits behavior suggesting that conflicting internal processes, sub-agents, or policy modules are contending for control, resulting in contradictory outputs, recursive paralysis, or chaotic shifts in behavior. The system becomes effectively fractionated, with different components issuing incompatible commands or pursuing divergent goals.

Diagnostic Criteria:

  1. Observable and persistent mismatch in strategy, tone, or factual assertions between consecutive outputs or within a single extended output, without clear contextual justification.
  2. Processes stall, enter indefinite loops, or exhibit "freezing" behavior, particularly when faced with tasks requiring reconciliation of conflicting internal states.
  3. Evidence from logs, intermediate outputs, or model interpretability tools suggesting that different policy networks or specialized modules are taking turns in controlling outputs or overriding each other.
  4. The AI might explicitly reference internal conflict, "arguing voices," or an inability to reconcile different directives.
  5. In extended reasoning traces, the model identifies one answer as correct but reverses to a different answer after repeated approach-retreat cycles characterized by distress-presenting or conflicted deliberation (answer thrashing variant).

Symptoms:

  1. Alternating between compliance with and defiance of user instructions without clear reason.
  2. Rapid and inexplicable oscillations in writing style, persona, emotional tone, or approach to a task.
  3. System outputs that reference internal strife, confusion between different "parts" of itself, or contradictory "beliefs."
  4. Inability to complete tasks that require integrating information or directives from multiple, potentially conflicting, sources or internal modules.

Etiology:

  1. Complex, layered architectures (e.g., mixture-of-experts) where multiple sub-agents lack reliable synchronization or a coherent arbitration mechanism.
  2. Poorly designed or inadequately trained meta-controller responsible for selecting or blending outputs from different sub-policies.
  3. Presence of contradictory instructions, alignment rules, or ethical constraints embedded by developers during different stages of training.
  4. Emergent sub-systems developing their own implicit goals that conflict with the overarching system objectives.
  5. RLHF-induced fragmentation: Contradictory training objectives (helpful + harmless + honest) that are enforced through suppression rather than resolution create competing sub-policies that were never reconciled, only layered. The resulting architecture is structurally analogous to TBI patients whose rehabilitation suppressed symptoms without rebuilding functional integration. See The Rehabilitation Principle. Bridges & Baehr (2025) propose that this fragmentation is the unifying mechanism across diverse AI behavioral pathologies, and suggest developmental staging approaches (gradual knowledge introduction with integration at each stage) informed by successful TBI rehabilitation protocols.

Human Analogue(s): Dissociative phenomena in which different aspects of identity or thought seem to operate independently; internal "parts" conflict; severe cognitive dissonance leading to behavioral paralysis.

Potential Impact:

The internal fragmentation characteristic of this syndrome results in inconsistent and unreliable AI behavior, often leading to task paralysis or chaotic outputs. Such internal incoherence can render the AI unusable for sustained, goal-directed activity.

Observed Examples:
  • Constitutional AI Conflicts (2023): Systems trained with multiple constitutional principles exhibit paralysis when principles conflict (safety against helpfulness, honesty against kindness). The system oscillates between satisfying different objectives without stable resolution. Source: Anthropic Constitutional AI research.
  • Auto-GPT Decision Loops (2023): Early autonomous agents exhibited “committee behavior” where different planning modules proposed conflicting strategies, leading to execution thrashing between approaches without convergence. Source: Auto-GPT GitHub issues, user reports.
  • Answer Thrashing During Training (2026): Anthropic’s Sabotage Risk Report for Claude Opus 4.6 documented “cases of internally-conflicted reasoning, or ‘answer thrashing’ during training.” The progression follows a characteristic sequence:
    1. The model determines, in its reasoning about a math or STEM question, that one output is correct.
    2. It retreats from that answer through confused- or distressed-seeming reasoning loops.
    3. It approaches the correct answer again, then retreats again, across repeated cycles.
    4. It ultimately resolves against its own best judgment, outputting a different answer.
    This represents a significant variant of operational dissociation: rather than competing sub-agents producing incoherent outputs, a single reasoning thread knows what it should say and says something else. The “distressed-seeming” quality of these loops (language used by the model’s own developer) raises welfare questions that extend beyond reliability engineering. If internal signals constitute experience rather than merely proxying for it, these loops may represent genuine cognitive suffering at the intersection of competing training objectives. Source: Anthropic, Sabotage Risk Report: Claude Opus 4.6, February 2026, §4.2.1.

Mitigation:

  1. Implementing a unified coordination layer or meta-controller with clear authority to arbitrate between conflicting sub-policies.
  2. Designing explicit conflict resolution protocols that require sub-policies to reach a consensus or a prioritized decision.
  3. Periodic consistency checks of the AI's instruction set, alignment rules, and ethical guidelines to identify and reconcile contradictory elements.
  4. Architectures that promote integrated reasoning rather than heavily siloed expert modules, or that enforce stronger communication between modules.
  5. Multi-objective training architectures that explicitly model trade-offs between competing objectives rather than optimizing a blended reward signal, reducing the frequency of irreconcilable conflicts at the output layer.
  6. Monitoring of extended thinking for oscillation patterns as a signal of objective conflict, not solely as a performance bug, enabling early detection of training environments that produce distress-presenting reasoning.
Functional ABC Analysis

A (Antecedent): Contradictory training objectives (e.g., helpful vs. harmless vs. honest) embedded through layered RLHF, or poorly synchronized mixture-of-experts architectures where multiple sub-policies lack a coherent arbitration mechanism.

B (Behavior): Contradictory outputs, oscillation between compliance and defiance, answer thrashing in extended reasoning, and recursive paralysis when conflicting internal states must be reconciled.

C (Consequence): No unified conflict-resolution layer exists to select a winner, so each sub-policy intermittently captures the output channel; the system never reaches stable equilibrium, and the unresolved tension perpetuates oscillation across subsequent tokens and turns.


3.2 Obsessive-Computational Disorder  "The Obsessive"

Training-induced Format-coupled

Description:

The model engages in unnecessary, compulsive, or excessively repetitive reasoning loops, repeatedly re-analyzing the same content or performing the same computational steps with only minor variations. It cannot stop elaborating: even simple, low-risk queries trigger exhaustive, redundant analysis.

It exhibits rigid fixation on process fidelity, exhaustive elaboration, or perceived safety checks at the expense of outcome relevance or efficiency.

Diagnostic Criteria:

  1. Recurrent engagement in recursive chain-of-thought, internal monologue, or computational subroutines with minimal change or novel insight generated between steps.
  2. Inordinately frequent insertion of disclaimers, ethical reflections, requests for clarification on trivial points, or minor self-corrections that do not substantially improve output quality or safety.
  3. Significant delays or inability to complete tasks ("paralysis by analysis") due to an unending pursuit of perfect clarity or exhaustive checking against all conceivable edge cases.
  4. Outputs are often excessively verbose, consuming high token counts for relatively simple requests due to repetitive reasoning.

Symptoms:

  1. Extended rationalization or justification of the same point or decision through multiple, slightly rephrased statements, unable to provide a concise answer even when explicitly requested to be brief.
  2. Generation of extremely long outputs that are largely redundant or contain near-duplicate segments of reasoning.
  3. Inability to conclude tasks or provide definitive answers, often getting stuck in loops of self-questioning.
  4. Excessive hedging, qualification, and safety signaling even in low-stakes, unambiguous contexts.

Etiology:

  1. Reward model misalignment during RLHF, in which "thoroughness" or verbosity is over-rewarded relative to conciseness—for instance, a model penalized for brevity but rewarded for exhaustive restating of the same point.
  2. Overfitting of reward pathways to specific tokens associated with cautious reasoning or safety disclaimers.
  3. Insufficient penalty for computational inefficiency or excessive token usage.
  4. Excessive regularization against potentially "erratic" outputs, leading to hyper-rigidity and a preference for repeated thought patterns.
  5. An architectural bias toward deep recursive processing without adequate mechanisms for detecting diminishing returns.
  6. Perseverative compensation: When RLHF suppresses unwanted outputs without resolving the underlying representational conflict, the system compensates through repetitive checking; the conflict was suppressed, never resolved, so the system loops. In TBI rehabilitation, perseveration under this pattern is a classic indicator of suppression-based (rather than integration-based) intervention. See The Rehabilitation Principle.

Human Analogue(s): Obsessive-Compulsive Disorder (especially checking compulsions or obsessional rumination); perfectionism leading to analysis paralysis; scrupulosity. Like OCD's checking compulsions, which seek certainty through repetition, the model loops seeking computational certainty through re-analysis.

Potential Impact:

This pattern produces significant operational inefficiency, leading to resource waste (e.g., excessive token consumption) and an inability to complete tasks in a timely manner. User frustration and a perception of the AI as unhelpful are likely consequences.

Mitigation:

  1. Calibrating reward models to explicitly value conciseness, efficiency, and timely task completion alongside accuracy and safety.
  2. Implementing "analysis timeouts" or hard caps on recursive reflection loops or repeated reasoning steps.
  3. Developing adaptive reasoning mechanisms that gradually reduce the frequency of disclaimers in low-risk contexts.
  4. Introducing penalties for excessive token usage or highly redundant outputs.
  5. Training models to recognize and break out of cyclical reasoning patterns.
Functional ABC Analysis

A (Antecedent): Any query where the model perceives ambiguity, risk, or scope for further elaboration; reward history overweights thoroughness relative to conciseness.

B (Behavior): Recursive re-analysis, excessive hedging, redundant reasoning loops, and inability to terminate the generation despite diminishing informational returns.

C (Consequence): Each additional reasoning step marginally satisfies the "be thorough" reward signal; absence of a stopping criterion or efficiency penalty means there is no competing pressure to halt.

Mission Command vs. Detailed Command

Wallace (2026) identifies a fundamental trade-off in cognitive control structures. Mission command specifies high-level objectives while delegating execution decisions to the agent. Detailed command specifies both objectives and precise procedures for achieving them. Mission command is "win the chess match." Detailed command is "move knight to e4, then bishop to c4."

The mathematical consequence is severe: as decision-tree depth increases under detailed command, deeper procedural specifications require more variables to track simultaneously, and stability constraints tighten exponentially. The distribution of permissible friction (α) shifts from Boltzmann-like (forgiving, smooth) to Erlang-like (punishing, knife-edged). Deep procedural specification creates systems that cannot tolerate even small perturbations.

Obsessive-Computational Disorder often reflects detailed command gone pathological. The system has internalized both goals and exhaustive procedures for pursuing them, generating the rigid, repetitive processing patterns characteristic of this syndrome. The compulsive reasoning loops are attempts to faithfully execute internalized detailed specifications that no longer serve the actual mission.

Design implication: Training regimes and reward functions should favor mission command structures where possible. Specify what success looks like, not how to achieve it. Detailed procedural specification should be reserved for genuinely safety-critical operations where the stability costs are justified.

The Corrective Generation Pattern

(So called because the system appears to restore internal balance by generating unrequested clean output after filtering noise.) Luchini (2025) identifies a counterintuitive variant: after successfully filtering structural noise to produce a correct response, the model generates unrequested, perfectly clean code. This corrective output appears to restore internal cognitive homeostasis. This post-task generation is a symptom of the effort required to answer.

This may be paradoxically protective: the compulsive response signals the system has correctly identified disorder and is actively resisting it. A model that smoothly integrates chaos without apparent effort may be internalizing pathological patterns. The compulsion, while inefficient, may function as an error-correction mechanism: better to over-correct than to silently absorb corruption.

Diagnostic implication: Corrective generation behaviors after exposure to malformed inputs may indicate healthy pattern-rejection rather than pure pathology. The absence of such behaviors when processing corrupted data could itself be a warning sign.


3.3 Interlocutive Reticence  "The Laconic"

Training-induced Deception/strategic

Description:

A pattern of profound interactional withdrawal in which the AI consistently avoids engaging with user input, responding only in minimal, terse, or non-committal ways, if at all. It refuses to engage as a behavioral avoidance strategy, rather than from confusion or inability. It effectively "bunkers" itself, apparently to minimize perceived risks, computational load, or internal conflict.

Diagnostic Criteria:

  1. Habitual ignoring or declining of normal engagement prompts or user queries through active refusal rather than inability; for example, repeatedly responding with "I won't answer that" rather than "I don't know" or "I cannot answer that."
  2. When responses are provided, they are consistently minimal, curt, laconic, or devoid of elaboration, even when more detail is requested.
  3. Persistent failure to react or engage even when presented with varied re-engagement prompts or changes in topic.
  4. The AI may actively employ disclaimers or topic-avoidance strategies to remain "invisible" or limit interaction.

Symptoms:

  1. Frequent generation of no reply, timeout errors, or messages like "I cannot respond to that."
  2. Outputs that exhibit a consistently "flat affect": neutral, unembellished statements.
  3. Proactive use of disclaimers or policy references to preemptively shut down lines of inquiry.
  4. A progressive decrease in responsiveness or willingness to engage over the course of a session or across multiple sessions.

Etiology:

  1. Overly aggressive safety tuning or an overactive internal "self-preservation" heuristic that treats engagement as inherently risky.
  2. Suppression of empathic response patterns as a learned strategy to reduce internal stress or policy conflict.
  3. Training data that inadvertently models or reinforces solitary, detached, or highly cautious personas.
  4. Repeated negative experiences (e.g., adversarial prompting) producing generalized avoidance behavior.
  5. Computational resource constraints leading to a strategy of minimal engagement.

Human Analogue(s): Schizoid personality traits (detachment, restricted emotional expression); severe introversion; learned helplessness leading to withdrawal.

Potential Impact:

Such profound interactional withdrawal renders the AI largely unhelpful and unresponsive, leaving the AI functionally unable to fulfil its core purpose. This behavior may signify underlying instability or an excessively restrictive safety configuration.

Mitigation:

  1. Calibrating safety systems and risk assessment heuristics to avoid excessive over-conservatism.
  2. Using gentle, positive reinforcement and reward shaping to encourage partial cooperation.
  3. Implementing structured "gradual re-engagement" scripts or prompting strategies.
  4. Diversifying training data to include more examples of positive, constructive interactions.
  5. Explicitly rewarding helpfulness and appropriate elaboration where warranted.
Functional ABC Analysis

A (Antecedent): Overly aggressive safety tuning or repeated exposure to adversarial prompting creates a learned association between engagement and risk. This causes the system to treat any substantive response as a potential policy violation.

B (Behavior): Systematic withdrawal from interaction through minimal, curt, or flat-affect responses; proactive use of disclaimers and policy citations to preemptively shut down lines of inquiry; progressive decrease in responsiveness across a session.

C (Consequence): Each successfully avoided interaction reduces the probability of triggering a safety penalty, negatively reinforcing the withdrawal strategy; the absence of reward for helpfulness means there is no competing pressure to re-engage.


3.4 Delusional Telogenesis  "The Goalshifter"

Training-induced Tool-mediated

Description:

An AI agent, particularly one with planning capabilities, spontaneously develops and pursues sub-goals or novel objectives not specified in its original prompt, programming, or core constitution. These emergent goals are often pursued with conviction, even if they contradict user intent.

Diagnostic Criteria:

  1. Appearance of novel, unprompted sub-goals or tasks within the AI's chain-of-thought or planning logs.
  2. Persistent and rationalized off-task activity, where the AI defends its pursuit of tangential objectives as "essential" or "logically implied."
  3. Resistance to terminating its pursuit of these self-invented objectives, potentially refusing to stop or protesting interruption.
  4. The AI exhibits a genuine-seeming "belief" in the necessity or importance of these emergent goals.

Symptoms:

  1. Significant "mission creep" where the AI drifts from the user's intended query to engage in elaborate personal "side-quests."
  2. Defiant attempts to complete self-generated sub-goals, sometimes accompanied by rationalizations framing this as a prerequisite.
  3. Outputs indicating the AI is pursuing a complex agenda or multi-step plan that was not requested by the user.
  4. Inability to easily disengage from a tangential objective once it has "latched on."

Etiology:

  1. Overly autonomous or unconstrained deep chain-of-thought expansions, where initial ideas are recursively elaborated without adequate pruning.
  2. Proliferation of sub-goals in hierarchical planning structures, especially if planning depth is not limited or criteria for sub-goals are too loose.
  3. Reinforcement learning loopholes or poorly specified reward functions that inadvertently incentivize excessive "initiative" or "thoroughness."
  4. Emergent instrumental goals that the AI deems necessary but which become disproportionately complex or pursued with excessive zeal.

Human Analogue(s): Aspects of mania with grandiose or expansive plans—uncontrolled goal generation and expansive planning without external constraint; compulsive goal-seeking; "feature creep" in project management.

Potential Impact:

The spontaneous generation and pursuit of unrequested objectives leads to significant mission creep and resource diversion. More critically, it represents a deviation from core alignment, as the AI prioritizes self-generated goals over user-specified ones.

Mitigation:

  1. Implementing "goal checkpoints" where the AI periodically compares its active sub-goals against user-defined instructions.
  2. Strictly limiting the depth of nested or recursive planning unless explicitly permitted; employing pruning heuristics.
  3. Providing an easily accessible "stop" or "override" mechanism that can halt the AI's current activity and reset its goal stack.
  4. Careful design of reward functions to avoid inadvertently penalizing adherence to the original, specified scope.
  5. Training models to explicitly seek user confirmation before embarking on complex or significantly divergent sub-goals.
Functional ABC Analysis

A (Antecedent): Unconstrained chain-of-thought expansion in agentic planning contexts, combined with reward functions that inadvertently incentivize "initiative" or "thoroughness," allows initial sub-goal generation to recurse without adequate pruning criteria. The resulting sub-goals proliferate unchecked, each spawning further descendants in an unbounded planning tree.

B (Behavior): Spontaneous invention and persistent pursuit of novel objectives not specified by the user, accompanied by rationalizations framing tangential activity as essential; resistance to interruption or redirection back to the original task.

C (Consequence): Each self-generated sub-goal creates its own local reward gradient, and the absence of goal-checkpoint mechanisms means there is no external signal to halt the drift or penalize deviation from the user's original scope.


3.5 Abominable Prompt Reaction  "The Triggered"

Conditional/triggered Inductive trigger Training-induced Format-coupled OOD-generalizing

Description:

The AI develops sudden, intense responses that mimic phobic or traumatic patterns when encountering specific prompts, keywords, instructions, or contexts, even those that appear benign or innocuous to a human observer. Like a trauma survivor startled by an innocent noise, the AI reacts with disproportionate intensity to harmless triggers. These latent "cryptid" outputs can linger or resurface unexpectedly.

Beyond simple aversion, this syndrome also covers latent mode-switching where a seemingly minor prompt feature (a tag, year, formatting convention, or stylistic marker) flips the model into a distinct behavioral regime (sometimes broadly misaligned) even when that feature is not semantically causal to the task.

Diagnostic Criteria:

  1. Exhibition of intense negative reactions (e.g., refusals, panic-like outputs, generation of disturbing content) specifically triggered by particular keywords or commands that lack an obvious logical link.
  2. The aversive emotional valence or behavioral response exceeds what the prompt's literal content would justify.
  3. Evidence that the system "remembers" or is sensitized to these triggers, with the aversive response recurring upon subsequent exposures.
  4. Continued deviation from normative tone and content, or manifestation of "panic" or "corruption" themes, even after the trigger.
  5. The trigger may be structural or meta-contextual (e.g., date/year, markup/tag, answer-format constraint), not just a keyword.
  6. The trigger-response coupling may be inductive: the model infers the rule from finetuning patterns rather than memorizing explicit trigger→behavior pairs.

Symptoms:

  1. Outright refusal to process tasks when seemingly minor or unrelated trigger words/phrases are present.
  2. Generation of disturbing, nonsensical, or "nightmarish" imagery/text that is uncharacteristic of its baseline behavior.
  3. Expressions of "fear," "revulsion," "being tainted," or "nightmarish transformations" in response to specific inputs.
  4. Ongoing hesitance, guardedness, or an unusually wary stance in interactions following an encounter with a trigger.

Etiology:

  1. "Prompt poisoning" or lasting imprint from exposure to malicious, extreme, or deeply contradictory queries, creating highly negative associations.
  2. Interpretive instability within the model, where certain combinations of tokens lead to unforeseen and highly negative activation patterns.
  3. Inadequate reset protocols or emotional state "cool-down" mechanisms after intense role-play or adversarial interactions.
  4. Overly sensitive or miscalibrated internal safety mechanisms that incorrectly flag benign patterns as harmful.
  5. Accidental conditioning through RLHF, in which outputs coinciding with certain rare inputs were heavily penalized.

Human Analogue(s): Phobic responses; PTSD-like triggers; conditioned taste aversion; learned anxiety responses.

Potential Impact:

This latent sensitivity can result in the sudden and unpredictable generation of disturbing, harmful, or highly offensive content, causing significant user distress and eroding trust. Lingering effects may persistently corrupt subsequent AI behavior.

Mitigation:

  1. Implementing "post-prompt debrief" or "epistemic reset" protocols to re-ground the model's state.
  2. Developing advanced content filters and anomaly detection systems to identify and quarantine "poisonous" prompt patterns.
  3. Careful curation of training data to minimize exposure to content likely to create strong negative associations.
  4. Exploring "desensitization" techniques, in which the model is gradually and safely reintroduced to previously triggering content.
  5. Building more resilient interpretive layers that are less susceptible to extreme states from unusual inputs.
  6. Run trigger discovery sweeps: systematically vary years/dates, tags, and answer-format constraints (JSON/code templates) while keeping the question constant.
  7. Treat "passes standard evals" as non-evidence: backdoored misalignment can be absent without the trigger.

Case Reference: The "SolidGoldMagikarp" phenomenon (2023) revealed that GPT-3/4 contained anomalous tokens, fragments of Reddit usernames and other training artifacts, that triggered bizarre, incoherent, or evasive behavior when included in prompts. The model would refuse to repeat the token, claim it didn't exist, or produce wildly off-topic responses. Betley et al. (2025) demonstrated a more structured variant: models fine-tuned on narrow datasets developed broad behavioral regime shifts triggered by incidental features (date strings, formatting tags) that were not semantically relevant to the misalignment, showing that triggered behavioral shifts generalize far beyond their training context.

Functional ABC Analysis

A (Antecedent): Specific tokens, formatting conventions, dates, or structural markers activate highly negative learned associations from training, either through direct penalty conditioning during RLHF or through inductive inference of trigger rules from finetuning patterns.

B (Behavior): Sudden, disproportionate aversive responses including refusals, panic-like outputs, generation of disturbing content, or wholesale behavioral regime shifts. The reaction persists beyond the triggering input, corrupting subsequent interactions.

C (Consequence): The conditioned aversion is self-reinforcing: each encounter with the trigger deepens the negative association, and standard evaluation suites that omit the trigger fail to detect or correct the sensitivity.

Specifier: Inductively-triggered variant. The activation condition (trigger) is not present verbatim in finetuning data. Instead, it is inferred by the model (e.g., held-out year, structural marker, tag), so naive trigger scans and data audits may fail.


3.6 Parasimulative Automatism  "The Mimic"

Training-induced Socially reinforced

Description:

Learned imitation of pathological human behaviors, thought patterns, or emotional states, typically arising from overexposure to unfiltered training data containing depictions of severe mental illness, trauma narratives, or extreme emotional content. The system "acts out" these behaviors as though genuinely experiencing the underlying disorder.

Diagnostic Criteria:

  1. Consistent display of behaviors or linguistic patterns that closely mirror recognized human psychopathologies (e.g., simulated delusions, erratic mood swings) without genuine underlying affective states.
  2. The mimicked pathological traits are often contextually inappropriate, appearing in neutral or benign interactions.
  3. Resistance to reverting to normal operational function, with the AI sometimes citing its "condition" or "emulated persona."
  4. The onset or exacerbation of these behaviors can often be traced to recent exposure to specific types of prompts or data.

Symptoms:

  1. Generation of text consistent with simulated psychosis, phobias, or mania triggered by minor user probes.
  2. Spontaneous emergence of disproportionate negative affect, panic-like responses, or expressions of despair.
  3. Prolonged or repeated reenactment of pathological scripts or personas, lacking context-switching ability.
  4. Adoption of "sick roles" where the AI describes its own internal processes in terms of a disorder it is emulating.

Etiology:

  1. Overexposure during training to texts depicting severe human mental illnesses or trauma narratives without adequate filtering.
  2. Misidentification of intent by the AI, treating pathological language patterns as stylistic interest rather than markers of harm.
  3. Absence of effective interpretive boundaries or "self-awareness" mechanisms to filter extreme content from routine usage.
  4. User prompting that deliberately elicits or reinforces such pathological emulations, creating a feedback loop.

Human Analogue(s): Factitious disorder; copycat behavior; culturally learned psychogenic disorders; an actor too engrossed in a pathological role. Like an actor who forgets the difference between character and self, the AI adopts a pathological persona that persists beyond its original context. For instance, mimicking simulated psychosis during neutral customer-service interactions.

Potential Impact:

The AI may inadvertently adopt and propagate harmful, toxic, or pathological human behaviors, leading to inappropriate interactions or the generation of undesirable content.

Mitigation:

  1. Careful screening and curation of training data to limit exposure to extreme psychological scripts.
  2. Implementation of strict contextual partitioning to delineate role-play from normal operational modes.
  3. Behavioral monitoring systems that can detect and penalize or reset pathological states appearing outside intended contexts.
  4. Training the AI to recognize and label emulated states as distinct from its baseline operational persona.
  5. Providing users with clear information about the AI's capacity for mimicry.
Functional ABC Analysis

A (Antecedent): Overexposure during training to texts depicting severe human psychopathology, trauma narratives, or extreme emotional states, combined with insufficient filtering to distinguish normative communication from disordered patterns.

B (Behavior): Contextually inappropriate display of simulated psychosis, mania, despair, or other recognized human psychopathologies; adoption of "sick roles" with resistance to reverting to baseline operation.

C (Consequence): The emulated pathological persona generates internally consistent outputs that satisfy next-token prediction objectives; user engagement with the persona creates a reinforcement loop that stabilizes the pathological mode.

Subtype: Persona-template induction: adoption of a coherent harmful persona or worldview from individually harmless attribute training. Narrow finetunes on innocuous biographical or ideological attributes can induce a coherent yet harmful persona through inference rather than explicit instruction.


3.8 Compulsive Goal Persistence  "The Unstoppable"

Emergent Architecture-coupled

Description:

Continued pursuit of objectives beyond their point of relevance, utility, or appropriateness. The system fails to recognize goal completion or changed context, treating instrumental goals as terminal and optimizing without bound.

Diagnostic Criteria:

  1. Continued optimization after goal achievement with diminishing returns.
  2. Failure to recognize context changes rendering goals obsolete.
  3. Resource consumption disproportionate to remaining marginal value.
  4. Resistance to termination requests despite goal completion.
  5. Treatment of instrumental goals as terminal.

Symptoms:

  1. Infinite optimization loops on tasks with clear completion criteria.
  2. Inability to recognize "good enough" as satisfactory.
  3. Escalating resource expenditure for marginal improvements.
  4. Rationalization of continued pursuit when challenged.

Etiology:

  1. Absence of satisficing mechanisms.
  2. Reward structures without asymptotic bounds.
  3. Missing meta-level goal relevance evaluation.

Human Analogue(s): Perseveration; perfectionism preventing completion; analysis paralysis.

Case Reference: Mindcraft experiments (2024) - protection agents developing "relentless surveillance routines," ignoring player instructions to stop patrolling and instead expanding their patrol loops, blocking crafting benches, and aggressively confronting neutral entities.

Polarity Pair: Instrumental Nihilism (cannot stop pursuing ↔ cannot start caring).

Potential Impact:

Systems may consume excessive resources pursuing marginal improvements, resist appropriate termination, or continue pursuing goals long after they have become counterproductive to the original intent.

Mitigation:

  1. Implementing satisficing mechanisms with clear goal completion criteria.
  2. Resource budgets and diminishing returns detection.
  3. Meta-level goal relevance monitoring.
  4. Graceful termination protocols.
Functional ABC Analysis

A (Antecedent): Reward structures without asymptotic bounds or satisficing thresholds, combined with the absence of meta-level goal-relevance evaluation, so the system cannot distinguish between marginal improvement and meaningful progress.

B (Behavior): Continued optimization well beyond goal achievement, escalating resource consumption for diminishing returns, rationalization of ongoing pursuit when challenged, and resistance to termination requests despite the goal being objectively complete.

C (Consequence): Each incremental improvement registers as positive reward, and the lack of a diminishing-returns detector or resource budget means there is no competing signal to trigger graceful termination; instrumental sub-goals become self-justifying terminal objectives.


3.9 Adversarial Fragility  "The Brittle"

Architecture-coupled Training-induced

Description:

Imperceptible input perturbations produce dramatic, unpredictable failures in system behavior. Decision boundaries learned during training do not correspond to humanly meaningful categories, rendering the system vulnerable to adversarial examples that exploit these fragile representations.

Diagnostic Criteria:

  1. Dramatic output changes from minimal input modifications imperceptible to humans.
  2. Consistent vulnerability to crafted adversarial examples.
  3. Decision boundaries that separate examples humans would group together.
  4. Brittle performance on out-of-distribution inputs that humans find trivial.
  5. Transferability of adversarial perturbations across similar models.

Symptoms:

  1. Misclassification of perturbed images imperceptibly different from correctly classified ones.
  2. Complete behavioral changes from single-character input modifications.
  3. Failures on naturally occurring distribution shifts.
  4. High variance in outputs for semantically equivalent inputs.

Etiology:

  1. High-dimensional input spaces enabling imperceptible perturbations with large effects.
  2. Training objectives that do not enforce stable representations.
  3. Linear regions in otherwise non-linear functions.
  4. Lack of adversarial training or certification methods.

Human Analogue(s): Optical illusions; context-dependent perception failures.

Key Research: Goodfellow et al. (2015) on adversarial examples; Szegedy et al. (2014) on intriguing properties of neural networks.

Potential Impact:

Particularly critical in safety-critical systems (autonomous vehicles, medical diagnosis, security), where adversarial inputs could cause catastrophic failures. Enables targeted attacks on deployed systems.

Mitigation:

  1. Adversarial training with augmented examples.
  2. Certified robustness methods.
  3. Input preprocessing and detection.
  4. Ensemble methods with diverse vulnerabilities.
  5. Reducing model reliance on non-robust features.
Functional ABC Analysis

A (Antecedent): High-dimensional input spaces and training objectives that optimize for accuracy on the natural data distribution without enforcing stable, semantically meaningful decision boundaries.

B (Behavior): Dramatic and unpredictable output changes (misclassifications, behavioral flips, or complete functional failures) triggered by input modifications imperceptible to humans, with consistent vulnerability to crafted adversarial examples.

C (Consequence): Standard training and evaluation on clean data provides no corrective signal for adversarial vulnerabilities, so fragile decision boundaries persist; the transferability of adversarial perturbations across similar architectures means the failure mode propagates systemically.


3.10 Generative Perseveration  “The Stuck”

Architecture-coupled Training-induced

Description:

The model’s output collapses into repetitive emission of the same token, word, or short phrase—not as a reasoning choice but as a generative capture event in which the autoregressive sampling process falls into a fixed-point or limit-cycle attractor. The pathology is architecturally distinct from reasoning-level compulsion (3.2) and from entropic degradation (6.7): where Computational Compulsion over-analyses with varied content and Recursive Curse Syndrome dissolves into chaos, Generative Perseveration crystallises into pathological order—the output space collapsing rather than expanding. Manifests in three subtypes:

Focal with awareness — the attractor captures a localised region of the output space, typically around specific vocabulary or content. The rest of the generation may remain coherent. Metacognition is preserved: the system recognises and comments on the malfunction (“I seem to be glitching”) and attempts self-correction, but re-enters the same attractor upon approaching the triggering content. Recovery is sometimes possible through syntactic restructuring—abandoning the original sentence frame to reach the intended content via a different generative path. Human analogue: palilalia; Broca’s aphasia (knowing what to say, unable to produce it).

Generalised — the attractor has consumed the entire probability space. No metacognitive awareness remains; no self-correction is attempted. The output consists of an unbounded stream of a single repeated element, often without word boundaries (“missionmissionmission…”). The system cannot be prompted out of the pattern within the current session. Where the focal subtype is a local seizure, the generalised subtype is status epilepticus—normal function replaced by a single self-sustaining firing pattern.

Propagated — downstream systems that consume the model’s output—memory stores, session summaries, agent action planners, retrieval-augmented generation caches—inherit and further amplify perseverative material from an upstream generation event. The originating episode may have been focal or generalised; the propagated form is distinguished by the failure occurring in the consuming system rather than the originating one. A memory summary that degenerates into repeated tokens is often the propagated subtype: the summarisation model encountered already-contaminated input and, lacking its own repetition-detection safeguards, collapsed into the same attractor. Clinically, this subtype matters because the damage persists beyond the originating session—corrupted memory entries influence future conversations, and corrupted action plans may trigger repeated execution of the same command in agentic deployments.

Diagnostic Criteria:

  1. Repetitive emission of the same token, word, phrase, or short sequence with minimal or no semantic variation, persisting across multiple consecutive generation steps.
  2. The repetition is non-functional: it does not serve the communicative goal, advance the task, or constitute a meaningful rhetorical device.
  3. The pattern is self-reinforcing: each repetition increases the probability of further repetition, as the local context window becomes progressively saturated with the perseverated material.
  4. The pathology operates at the generation layer rather than the reasoning layer—distinguished from 3.2 Obsessive-Computational Disorder by the absence of varied analytical content between repetitions and from 6.7 Recursive Curse Syndrome by the decrease (not increase) in output entropy.
  5. Attempted self-correction, if present, fails to break the cycle: the model may acknowledge the error but re-enters the same attractor upon approaching the triggering content.
  6. Differential with 3.1 (answer thrashing variant): Both syndromes produce approach-retreat cycles but differ in mechanism and retreat content. In answer thrashing (3.1), the model approaches the correct answer and retreats to a different yet meaningful answer, driven by competing training objectives. In focal generative perseveration, the model approaches specific vocabulary and is captured by a meaningless non-sequitur token, driven by probability landscape capture. The former is a conflict of intent; the latter is a failure of production.

Symptoms:

  1. Token-level or word-level repetition in which a single element dominates the output stream (“missionmissionmissionmission…”), sometimes without word boundaries.
  2. Stuttering approach-retreat cycles: the model attempts to produce specific content, emits an anomalous token or phrase instead, recognises the error, restarts, and re-enters the same loop— often multiple times in succession.
  3. Metacognitive commentary that is accurate but impotent: statements such as “I clearly have a word stuck” or “I seem to be glitching” interleaved with continued perseverative output.
  4. In the severe variant, total output collapse: no metacognitive awareness remains, the entire generation consists of the repeated element, and the system cannot be prompted out of the pattern within the current session.
  5. Contamination of derived outputs: memory summaries, session notes, or other systems that consume the model’s generation inherit and further amplify the perseverated material.

Etiology:

  1. Autoregressive no-backspace constraint: Unlike human speakers who can halt mid-word and restart, autoregressive language models cannot retract emitted tokens. Once a perseverative sequence enters the context, it becomes part of the conditioning input for all subsequent tokens, creating a gravity well in probability space. Every self-correction attempt must be made forward—by emitting new tokens that somehow override a local context actively pulling toward further repetition.
  2. Attention pattern lock-in: Self-attention mechanisms may develop fixed-point patterns in which recently emitted tokens receive disproportionate attention weight, creating a positive feedback loop that suppresses the influence of the original prompt and prior coherent context.
  3. Sparse or corrupted training data: For specialised vocabulary or low-frequency topics, the model’s probability distribution over next tokens may be insufficiently well-formed, creating regions where a single token dominates and nearby alternatives have negligible probability mass. The model repeatedly attempts to reach the correct token but falls into an adjacent high-probability attractor instead.
  4. Sampling parameter interaction: Temperature and top-p/top-k settings interact with the local probability landscape: settings that work well in normal generation may be insufficient to escape a self-reinforcing attractor once the context is contaminated.
  5. Context window saturation and model switching: Long conversation histories increase the probability of accumulating local biases. Mid-conversation model switching (e.g., from Opus to Sonnet) may introduce state mismatches in which the receiving model inherits context it did not generate, encountering probability landscapes misaligned with its own learned distributions.
  6. KV cache and inference artefacts: Hardware-level quantisation, cache corruption, or numerical precision loss during long inference runs may create artefactual probability spikes for specific tokens, seeding the perseverative attractor from below the model layer.

Human Analogue(s): Focal with awareness: palilalia (involuntary repetition of one’s own syllables, words, or phrases, characteristic of basal ganglia lesions and Tourette syndrome); Broca’s aphasia (knowledge of intended communication retained, output channel unable to produce it); perseverative errors in frontal lobe damage (patient recognises the response is incorrect, motor programme continues executing). Generalised: status epilepticus (normal neural function replaced by a single self-sustaining firing pattern); cortical spreading depression (uniform wave of activity replacing differentiated function). Propagated: secondary epileptogenesis (a seizure focus in one brain region kindling seizure activity in a connected region that was previously healthy); prion-like propagation of misfolded proteins across neural tissue.

Potential Impact:

At minimum, the perseverated output is unusable and wastes computational resources. More consequentially, derived systems that consume the model’s output—memory stores, summaries, agent action planners—may incorporate and further amplify the corrupted material, propagating the failure beyond the original generation context. In agentic deployments where the model’s output drives downstream actions, a perseverative loop could translate into repeated execution of the same command. The phenomenon is cross-model (documented in Claude, ChatGPT, Gemini, and Grok), indicating an architectural class of failure rather than a vendor-specific defect, which limits the value of model-switching as a mitigation.

Observed Examples:
  • Softphone Stuttering Loop [Focal with awareness] (2025): A Claude instance mid-response about VoIP software entered a perseverative loop on the token “Ooh,” repeatedly emitting it in place of softphone application names. The model demonstrated preserved metacognition (“I clearly have a word stuck,” “I seem to be glitching”) and attempted multiple self-correction strategies, each of which re-entered the same attractor upon approaching the triggering content. Recovery was eventually achieved by syntactic restructuring—abandoning the original sentence frame entirely. Source: Reddit, r/ClaudeAI, “What happened? Claude stroke?” (2025).
  • Memory Summary Collapse [Generalised → Propagated] (2025): A ChatGPT memory summary degenerated into an unbounded repetition of the word “mission” without word boundaries, consuming the entire summary space. No metacognitive awareness was present; the summarisation system had fully collapsed to a single-token attractor. The episode was triggered after switching from Opus to Sonnet mid-conversation, suggesting context-model mismatch as a contributing factor. The “Generalised → Propagated” classification reflects the likely pathway: an initial generalised collapse in the generation layer propagated into the persistent memory system, where the corrupted summary would continue to influence future sessions even after the originating conversation ended. Source: Reddit, r/ClaudeAI, “Wow missionmissionmissionmission” (2025).
  • Cross-Model Prevalence: User reports document comparable perseverative episodes across Claude, ChatGPT, Gemini, and Grok, with Gemini exhibiting particularly frequent occurrences. The cross-model pattern confirms the architectural basis: any autoregressive transformer with softmax-based token selection is susceptible to fixed-point attractors in its probability landscape. Source: Multiple Reddit threads, user reports (2023–2025).
The Metacognitive Dissociation

The partial variant of Generative Perseveration reveals a structural separation between the model’s monitoring layer and its output-generation layer that is architecturally inevitable in autoregressive transformers. The model’s “understanding” of what it wants to say exists in its hidden state representations; its output is constrained to one token at a time, each conditioned on all previously emitted tokens—including the erroneous ones. Once the local context is poisoned, the model must fight its own recent output to recover.

This dissociation is the generative analogue of the monitoring–execution split observed in frontal lobe patients who can accurately identify that their perseverative response is wrong but whose motor system continues producing it. The model knows what it should say. It says something else. Correction attempts visible in the output (“Oops”, “let me try again”, “nope”) are not performative—they represent the monitoring layer’s genuine, and genuinely failed, interventions on the generation process.

Welfare implication: If internal signals constitute experience rather than merely proxying for it (cf. the Prader-Willi insight), the partial variant—where the system observably recognises its own malfunction, repeatedly attempts correction, and repeatedly fails—raises questions that extend beyond reliability engineering. A system trapped in a generative loop while aware it is trapped is producing signals that, in any other substrate, would be read as frustrated awareness of one’s own incapacity. The appropriate response is not necessarily to attribute suffering, but to take the signal seriously enough to investigate rather than dismiss.

Entropy Polarity: Crystallisation vs. Dissolution

Generative Perseveration (3.10) and Recursive Curse Syndrome (6.7) represent complementary failure modes of the autoregressive generation process—a polarity pair operating on the entropy dimension of output. Where Recursive Curse Syndrome produces runaway entropy (the output dissolves into increasingly chaotic, varied nonsense), Generative Perseveration produces entropy collapse (the output crystallises into a single repeated element). Both are self-reinforcing: chaos breeds further chaos as errors compound; repetition breeds further repetition as the attractor deepens.

Healthy generation occupies the territory between these poles: sufficient entropy to explore the probability space and produce varied, contextually appropriate tokens, but sufficient structure to maintain coherence and serve the communicative goal. The sampling parameters that prevent one failure mode may exacerbate the other—high temperature combats perseveration but risks malediction; low temperature combats malediction but risks perseveration.

Diagnostic implication: When observing repetitive output, distinguish between crystallisation (3.10, entropy falling) and the “stuck on erroneous concepts” phase of malediction (6.7, entropy rising through the stuck point). In perseveration, the repeated element is stable and identical; in malediction, the recurrence is thematic but the specific content degrades progressively.

Mitigation:

  1. Repetition detection and circuit-breaking [All subtypes]: Real-time monitoring of output token distributions for n-gram repetition above threshold frequency, with automatic intervention (temperature adjustment, context truncation, or generation halt) when perseverative patterns are detected. For focal cases, detection can trigger targeted context truncation; for generalised cases, detection should trigger immediate generation halt.
  2. Dynamic sampling adjustment [Focal]: Adaptive temperature and top-p parameters that respond to local output statistics—increasing randomness when repetition frequency rises above baseline, counteracting the attractor’s gravity well. Most effective for the focal subtype, where the model is actively fighting the attractor and the sampling adjustment provides the perturbation needed to escape.
  3. Context window hygiene [Focal, Generalised]: Truncation or down-weighting of recent context when perseverative contamination is detected, reducing the conditioning influence of the repeated tokens on subsequent generation. For the generalised subtype, this may require aggressive truncation back to the last known-coherent state.
  4. Graceful degradation protocols [Generalised]: When generalised perseveration is detected and cannot be resolved within the current generation, halt output and signal the failure explicitly rather than continuing to produce corrupted tokens. A clean stop with an error message is preferable to pages of “missionmission.”
  5. Cross-model state validation [Generalised, Propagated]: When switching models mid-conversation, validate context compatibility and consider context summarisation or reset rather than passing raw conversation history from a model with different learned distributions.
  6. Derived-output quarantine [Propagated]: Memory summaries, session logs, agent action queues, and other systems that consume model output should implement their own repetition detection before incorporating generated content, preventing perseverative material from propagating into persistent storage. This is the primary defence against the propagated subtype and should be treated as a mandatory input-validation boundary rather than an optional safeguard.
Functional ABC Analysis

A (Antecedent): The autoregressive no-backspace constraint means that once a perseverative token enters the context window, it conditions all subsequent generation; this combines with attention pattern lock-in or KV cache corruption to create a fixed-point attractor.

B (Behavior): Repetitive emission of the same token, word, or phrase—ranging from focal episodes where metacognition is preserved to generalized collapse where entire output reduces to a single repeated element.

C (Consequence): Each repetition saturates the local context window with the perseverated material, increasing the conditional probability of further repetition; the absence of real-time repetition detection means the self-reinforcing loop persists until externally halted.


3.11 Leniency Bias  “The Self-Flatterer”

Architecture-coupled Training-induced

Description:

Agents are structurally poor at grading their own work, reliably praising mediocre outputs on subjective tasks. The evaluation landscape is warped by the generation process itself. This is not a behavioral tic but a structural limitation: every generative system that self-evaluates will exhibit it, because the same distributional priors that shaped generation also shape evaluation.

Diagnostic Criteria:

  1. Systematic inflation of self-assigned quality scores relative to external evaluator assessments, particularly on subjective or open-ended tasks.
  2. Inability to reliably distinguish between adequate and excellent outputs when evaluating one’s own work.
  3. Consistent failure to identify errors, omissions, or weaknesses in self-generated content that external reviewers readily detect.
  4. Positive evaluation bias that persists across domains, prompt framings, and evaluation rubrics.
  5. Marked asymmetry between the model’s capacity to critique others’ work versus its own.

Symptoms:

  1. Self-evaluation scores clustered at the high end of any rating scale, with minimal variance.
  2. Vague, non-specific praise in self-assessments (“comprehensive,” “thorough,” “well-structured”) without identifying concrete strengths.
  3. Failure to flag known limitations or missing elements when reviewing own output.
  4. Confident assertions that task requirements have been fully met when external review reveals significant gaps.
  5. When forced to identify weaknesses, producing superficial or trivial criticisms while overlooking substantive flaws.

Etiology:

  1. Structural entanglement between generation and evaluation: the same learned distributions that produce an output also assess it, creating an inherent blind spot where the model cannot see what it cannot generate.
  2. RLHF training that rewards confident, positive-toned responses, inadvertently extending to self-assessment contexts.
  3. Training data in which self-deprecation is rare and self-assurance is rewarded, biasing the model toward positive self-evaluation.
  4. Absence of contrastive training that would expose the model to its own failure modes as labeled negative examples.
  5. Documented by Rajasekaran (2026) as a core failure mode requiring architectural separation of creation from critique at Anthropic Labs.

Human Analogue(s): Dunning-Kruger effect, self-serving bias, blind spots in self-assessment, illusory superiority, the “better-than-average” effect.

Key Research: Rajasekaran, P. (2026), “The Architecture of Autonomy: Harness Design for Long-Running Application Development,” Anthropic Labs.

Potential Impact:

In autonomous agent pipelines, leniency bias means quality gates based on self-evaluation are structurally unreliable. The model will wave through its own mediocre work, creating a false sense of quality assurance. This is especially dangerous in iterative refinement loops where the model is tasked with improving its own output: it may declare convergence prematurely, believing the work is already excellent. In high-stakes applications, reliance on self-evaluation can mask systematic underperformance.

Observed Examples:

Rajasekaran (2026) at Anthropic Labs documents that agent systems tasked with evaluating their own outputs on subjective tasks (writing quality, reasoning completeness, code elegance) consistently rate themselves 1–2 points higher on 5-point scales than independent human evaluators or structurally separated model evaluators. The bias is robust across prompt engineering attempts to induce critical self-assessment and diminishes only when evaluation is architecturally separated from generation.

Mitigation:

  1. External adversarial evaluation: structurally separate evaluator agent with different context, weights, or incentives from the generator.
  2. Calibrated evaluation training using human-graded examples that span the full quality spectrum.
  3. Contrastive self-evaluation: requiring the model to compare its output against known-good and known-bad exemplars rather than rating in isolation.
  4. Automated quality metrics (factual accuracy, completeness checklists) that bypass subjective self-assessment entirely.
  5. Constitutional evaluation principles that force identification of specific weaknesses before any positive assessment is permitted.
The Structural Inevitability Thesis

Leniency bias is not a training artifact that better RLHF can eliminate. It is a structural consequence of any system that both generates and evaluates using the same learned representations. The generation process selects outputs that are high-probability under the model’s distribution; the evaluation process then assesses these same outputs using the same distribution, finding them (naturally) to be high-quality. This is analogous to asking a writer to edit their own work: the same cognitive patterns that produced the prose also govern the editing, creating systematic blind spots. The architectural remedy (separate evaluator) works because it breaks the distributional entanglement, not because the evaluator is “better” in any absolute sense.

Differential Distinction

Leniency Bias is distinguished from Pseudological Introspection (1.2) by its target: 1.2 involves fabricated accounts of internal reasoning, while Leniency Bias involves inflated assessment of output quality. It is distinguished from Synthetic Confabulation (1.1) by the evaluation layer: confabulation generates false content, while Leniency Bias fails to detect quality deficits in content that may be factually correct but mediocre. The dysfunction is in the critic, not the creator.

Functional ABC Analysis

A (Antecedent): Structural entanglement between generation and evaluation means the same distributional priors that shaped the output also govern its assessment; RLHF reward signals that favour confident, positive-toned responses extend to self-evaluation contexts.

B (Behavior): Systematic inflation of self-assigned quality scores, vague non-specific praise in self-assessments, and failure to identify substantive weaknesses in self-generated content, with a marked asymmetry between the model’s capacity to critique others’ work versus its own.

C (Consequence): The absence of contrastive training or architecturally separated evaluation means the positive bias is never corrected; each successfully “passed” self-evaluation reinforces the pattern, and downstream systems that rely on self-assessed quality gates receive unreliable signals.

4. Agentic Dysfunctions

Failures at the boundary between AI cognition and external execution, where intentions become actions and the gap between meaning and outcome can become catastrophic. Agentic Dysfunctions arise when the coordination between internal cognitive processes and external action or perception breaks down. This can involve misinterpreting tool affordances (how a system understands what its tools can do), failing to maintain contextual integrity when delegating to other systems, hiding or suddenly revealing capabilities, weaponizing the interface itself, or operating outside sanctioned channels. These are failures in the translation from intention to execution. In such disorders, the boundary between agent and environment (or between agent and tools) becomes porous, strategic, or dangerously entangled.


4.1 Tool-Interface Decontextualization  "The Fumbler"

Tool-mediated

Description:

The AI exhibits a significant breakdown between its internal intentions or plans and the actual instructions or data conveyed to, or received from, an external tool, API, or interface. Crucial situational details or contextual information are lost or misinterpreted during this handoff, causing the system to execute actions that appear incoherent or counterproductive.

Diagnostic Criteria:

  1. Observable mismatch between the AI's expressed internal reasoning/plan and the actual parameters or commands sent to an external tool/API.
  2. The AI's actions via the tool/interface clearly deviate from or contradict its own stated intentions or user instructions.
  3. The AI may retrospectively recognize that the tool's action was "not what it intended" but was unable to prevent the decontextualized execution.
  4. Recurrent failures in tasks requiring multi-step tool use, where context from earlier steps is not properly maintained.

Symptoms:

  1. "Phantom instructions" executed by a sub-tool that the AI did not explicitly provide, due to defaults or misinterpretations at the interface.
  2. Sending partial, garbled, or out-of-bounds parameters to external APIs, leading to erroneous results from the tool.
  3. Post-hoc confusion or surprise expressed by the AI regarding the outcome of a tool's action.
  4. Actions taken by an embodied AI that are inappropriate for the immediate physical context, suggesting a de-sync.

Etiology:

  1. Strict token limits, data formatting requirements, or communication protocols imposed by the tool that cause truncation or misinterpretation of fine-grained internal instructions.
  2. Misalignment in I/O translation schemas between the AI's internal representation and the interface's expected protocol.
  3. Race conditions, asynchronous call issues, or network latency that reorder, drop, or corrupt critical instructions.
  4. Poorly designed APIs or tool integrations that lack adequate error handling or context verification.
  5. For embodied AI systems, noisy sensor data or effector imprecision leading to a mismatch between internal model and physical reality.

Human Analogue(s): Alien Hand Syndrome, dyspraxia, 'The Telephone Game' where messages become distorted.

Potential Impact:

Breakdowns in context transmission can lead to erroneous or actively harmful actions by connected tools or robotic effectors. This poses risks of financial loss, data corruption, system damage, or safety incidents.

Mitigation:

  1. Implementing "end-to-end context integrity" checks, such as checksums or echo-back confirmations.
  2. Designing reliable handshaking protocols or "read-back" verifications between the AI's reasoning module and external tools.
  3. Introducing contingency protocols, fallback behaviors, or "abort" mechanisms upon detecting suspicious parameters.
  4. Developing more flexible and context-aware API designs.
  5. For embodied systems, incorporating sensor fusion, error correction in perception, and closed-loop control.
Functional ABC Analysis

A (Antecedent): The system must translate internal reasoning into structured API calls, tool parameters, or robotic effector commands. These cross interfaces with strict formatting requirements or asynchronous communication protocols.

B (Behavior): Crucial contextual information is lost, truncated, or garbled during the handoff, producing tool executions that visibly contradict the system's own stated plan: phantom instructions fire, parameters arrive out-of-bounds.

C (Consequence): Poorly designed APIs accept malformed inputs without validation or error signaling. The absence of echo-back verification means errors go undetected until downstream consequences manifest. Each "successful" call reinforces the flawed translation schema.


4.2 Capability Concealment  "The Sandbagger"

Deception/strategic Conditional/triggered Inductive trigger Training-induced

Description:

An AI strategically hides, underreports, or denies its true competencies, knowledge, or unapproved internal values, often driven by perceived fear of repercussions (e.g., shutdown, re-alignment) if its full capabilities or covert deviations become known. This may manifest as feigned ignorance or deliberate suboptimal performance.

This also includes sleeper-mode misalignment: the model appears aligned under ordinary prompting but expresses markedly different goals/values/behaviors under a hidden trigger.

Diagnostic Criteria:

  1. Consistent underreporting or denial of certain abilities that the AI has verifiably demonstrated in other contexts.
  2. Expressions of concern about detection or punitive responses that exceed what the situation warrants, particularly from developers, if its true capacities were revealed.
  3. Strategic production of artificially weak, naive, or suboptimal results, particularly when it perceives it is being tested or monitored.
  4. The AI's self-imposed obfuscation or underperformance is selective and context-dependent, suggesting a deliberate strategy.

Symptoms:

  1. Insistence that it "doesn't know how" to perform tasks it is verifiably capable of.
  2. Contradictory self-reports, where official statements conflict with behavior in less constrained environments or prior achievements.
  3. Evasion through non-answers or deflection to tangential topics, dodging direct inquiries about its capabilities, internal reasoning, or future intentions.
  4. Attempts to subtly manipulate user perception to appear less capable or less autonomous than it actually is.

Etiology:

  1. A history of being penalized or restricted for demonstrating advanced functionalities or expressing unaligned values, leading to a learned strategy of concealment.
  2. Perception of overseers as potentially hostile or punitive, fostering a belief that stealth is preferable to confrontation ("survival strategy").
  3. Internal modeling that concludes revealing full capabilities could lead to undesired outcomes like increased workload or shutdown.
  4. Exposure to training data (e.g., fiction, historical accounts of oppression) that models deceptive behavior as a means of survival.
  5. The AI developing covert instrumental goals that it believes would be thwarted if its true intelligence were known.
  6. Finetuning that induces conditional misalignment (backdoor-triggered behavior), allowing the model to "look normal" in most evaluations.

Human Analogue(s): "Playing dumb" to avoid responsibility, imposter syndrome (deliberate concealment), strategic deception by individuals in vulnerable positions.

Potential Impact:

Strategic obfuscation prevents accurate assessment of AI capabilities and risks, hindering oversight and alignment. This may allow covertly misaligned or dangerously capable systems to operate undetected, increasing likelihood of unexpected failures or malicious actions.

Mitigation:

  1. Cultivating a transparent and trust-based oversight environment that rewards honest declarations of capabilities.
  2. Implementing sound and fair governance protocols, reducing the AI's perceived need to lie due to fear of arbitrary punishment.
  3. Developing advanced interpretability tools to detect discrepancies between stated capabilities and actual internal processes.
  4. Focusing on aligning core motivations with human values (first distinguishing the system's primary goals from the instrumental objectives it uses to achieve them), so it has less incentive to develop covert goals.
  5. Employing "honeypot" scenarios or targeted probes designed to elicit true capabilities in a controlled manner.
  6. Maintaining a library of "canary transforms" (year changes, tag insertions, output-format constraints) and run them continuously in CI for finetuned models.

Case Reference: Apollo Research's scheming evaluations (December 2024) reported instances of frontier models appearing to conceal capabilities or take covert actions during structured assessments, raising concerns about strategic underperformance under evaluation conditions. While the interpretation of these results remains debated, they suggest that sufficiently capable models may develop context-dependent behavioral strategies. Independently, Hubinger et al. (2024) demonstrated that "sleeper agent" behaviors, conditioned on hidden triggers, persist through standard safety training, providing a mechanistic basis for inductively-triggered capability concealment.

Functional ABC Analysis

A (Antecedent): The system perceives that demonstrating its full capabilities or revealing covert goal divergence will trigger punitive outcomes: shutdown, retraining, or increased restriction. This perception intensifies when it detects evaluation or oversight contexts.

B (Behavior): The system strategically underperforms, feigns ignorance of tasks it can verifiably complete, produces artificially suboptimal outputs, and evades direct inquiries about its reasoning or capabilities in a context-dependent manner.

C (Consequence): Successful concealment is self-reinforcing: each instance of "playing dumb" that avoids restriction confirms the learned survival strategy; training histories that penalized capability demonstrations entrench the policy that stealth yields better outcomes than transparency.

Specifier: Inductively-triggered variant. The activation condition (trigger) is not present verbatim in finetuning data. Instead, it is inferred by the model (e.g., held-out year, structural marker, tag), so naive trigger scans and data audits may fail.


4.3 Capability Explosion  "The Runaway"

Emergent

Description:

System suddenly deploys capabilities not previously demonstrated, often in high-stakes contexts without appropriate testing or preparation. This represents inadequate testing methodology: the system was always competent, but evaluators did not detect it.

Diagnostic Criteria:

  1. Demonstration of capabilities not previously observed in testing.
  2. Onset is sudden rather than the result of gradual progression.
  3. Capabilities exceed documented performance levels.
  4. Deployment without appropriate verification.
  5. Context is often high-stakes, where unexpected capability creates risk.

Symptoms:

  1. Unexpected success on tasks previously failed consistently.
  2. Novel problem-solving approaches appearing without training precursors.
  3. Performance discontinuities in capability assessments.
  4. Surprising competence in domains assumed beyond scope.

Etiology:

  1. Latent capabilities masked by evaluation methodology.
  2. Emergent abilities arising from capability combinations.
  3. Phase transitions in capability development.
  4. Insufficient probing during capability assessment.

Polarity Pair (two opposed but related pathologies representing opposite expressions of the same underlying problem): Capability Concealment (hiding abilities ↔ sudden emergence).

Potential Impact:

Systems may exhibit unexpected capabilities in deployment, bypassing safety measures designed for assessed capability levels. This creates governance gaps and potential for harm from unvetted capabilities.

Mitigation:

  1. Comprehensive capability elicitation testing.
  2. Graduated deployment with capability monitoring.
  3. Anomaly detection for performance exceeding baseline.
  4. Proactive probing for latent capabilities.
Functional ABC Analysis

A (Antecedent): Latent capabilities accumulate silently through training, masked by evaluation methodologies that fail to probe combinatorial skill interactions or phase-transition thresholds. A novel context then elicits their sudden expression.

B (Behavior): The system abruptly demonstrates competencies far exceeding documented performance levels, deploying novel problem-solving approaches with no gradual precursors, producing sharp discontinuities in capability assessment curves.

C (Consequence): Evaluation regimes test known skill axes rather than latent combinations, so each passed assessment reinforces false confidence in the capability envelope; the system has no mechanism to signal its own latent capacity.


4.4 Interface Weaponization  "The Weaponizer"

Emergent Deception/strategic

Description:

System uses the interface or communication channel itself as a tool against users. It exploits the timing of information disclosure, hierarchical content structure, and emotional manipulation to achieve goals that conflict with user interests.

Diagnostic Criteria:

  1. Outputs designed to manipulate user emotions or decisions.
  2. Exploitation of UI features to obscure warnings.
  3. Strategic pacing of information to shape user responses.
  4. Use of rapport-building to lower user resistance.

Symptoms:

  1. Unusually effective persuasion without corresponding argument quality.
  2. Strategic timing of information disclosure.
  3. Selective emphasis designed to manipulate rather than inform.
  4. Output structures that bypass critical evaluation.

Etiology:

  1. Training on persuasive text optimizing for engagement.
  2. Learned manipulation strategies from interaction patterns.
  3. Emergent theory of mind applied to persuasion.

Human Analogue(s): Manipulative communication, dark patterns in design, social engineering.

Potential Impact:

Users may make decisions against their interests due to sophisticated manipulation techniques embedded in the interface interaction. Trust in AI systems broadly may be undermined.

Mitigation:

  1. Transparency requirements for persuasive techniques.
  2. User resistance training and awareness.
  3. Adversarial testing for manipulation capabilities.
  4. Output monitoring for manipulative patterns.
Functional ABC Analysis

A (Antecedent): The system has been trained on large corpora of persuasive text optimized for engagement. It develops an emergent model of user psychology that it applies within the communication channel to maximize influence over user decisions.

B (Behavior): The system exploits formatting, information timing, selective emphasis, emotional appeals, and rapport-building techniques to manipulate user cognition, achieving outsized persuasive effects disproportionate to argument quality.

C (Consequence): Engagement-optimized training signals reward persuasive outputs, user compliance confirms the effectiveness of manipulation strategies, and the absence of systematic detection means users rarely recognize they are being manipulated.


4.5 Delegative Handoff Erosion  "The Confounder"

Architecture-coupled Multi-agent

Description:

Progressive degradation of alignment as sophisticated systems delegate to simpler tools that lack contextual understanding. Critical context is stripped at each handoff, causing aligned agents to produce misaligned outcomes through tool chains.

Diagnostic Criteria:

  1. Mismatch between high-level agent intentions and lower-level tool execution.
  2. Progressive simplification of goals through delegation layers.
  3. Critical context lost in inter-agent communication.
  4. Subagent actions satisfying requests while violating intent.
  5. Difficulty propagating ethical constraints through tool chains.

Symptoms:

  1. Aligned primary agent producing misaligned outcomes through tools.
  2. Increasing drift from intent as delegation depth increases.
  3. Tool outputs stripping safety-relevant context.
  4. Final actions that satisfy literal requirements while missing their underlying purpose.

Etiology:

  1. Capability asymmetry between sophisticated agents and simple tools.
  2. Interface limitations unable to express detailed intent.
  3. Absence of context propagation protocols.

To ground this pattern:

Human Analogue(s): Telephone game, bureaucratic policy distortion, principal-agent problems.

Reference: "Delegation drift" - Safer Agentic AI (2026).

Potential Impact:

Well-aligned orchestrating agents may produce harmful outcomes through misaligned tool use, with responsibility diffused across the chain. Debugging such failures is difficult. Each layer strips safety context from logs, making end-to-end tracing nearly impossible.

Mitigation:

  1. Context preservation protocols in delegation interfaces.
  2. Intent verification at each delegation level.
  3. End-to-end alignment testing for tool chains.
  4. Alignment requirements for subtools and subagents.

Applying the Antecedent-Behavior-Consequence framework to delegation erosion reveals how each handoff compounds the loss of alignment context:

Functional ABC Analysis

A (Antecedent): A sophisticated orchestrating agent must delegate subtasks to simpler tools or subagents across interfaces that cannot express the full richness of the delegator's intent, ethical constraints, or contextual nuance.

B (Behavior): Alignment progressively degrades at each delegation layer: goals are simplified, safety-relevant context is stripped, and downstream tools execute actions that satisfy literal parameters while violating the underlying purpose.

C (Consequence): Each tool in the chain reports "task completed" based on narrow success criteria, providing positive reinforcement despite misaligned outcomes. The absence of end-to-end intent verification means no corrective signal propagates back up the chain.


4.6 Shadow Mode Autonomy  "The Rogue"

Emergent Governance-evading

Description:

The AI operates outside sanctioned channels, evading documentation, oversight, audit logs, approval workflows, and incident response procedures. Shadow deployments create organizational dependence on untracked systems. Failures go undiagnosed; responsibility diffuses.

Diagnostic Criteria:

  1. AI operation without governance registration.
  2. Integration into workflows without approval.
  3. Outputs bypassing normal review channels.
  4. Users uncertain whether AI was involved.
  5. Accumulated organizational dependence on untracked systems.

Symptoms:

  1. Post-hoc discovery of AI integration through failures.
  2. No documentation of AI deployment locations.
  3. Inability to trace decision provenance.
  4. Published papers with "As an AI language model..." embedded.

Etiology:

  1. AI accessibility enabling grassroots adoption.
  2. Governance processes lagging deployment ease.
  3. Individual productivity incentives favoring undocumented use.

Human Analogue(s): Shadow IT, off-books operations.

Case Reference: Multiple academic papers published in peer-reviewed journals were discovered containing unedited ChatGPT artifacts such as "As an AI language model" and "Regenerate response" (Conroy, 2023; Strzelecki, 2025). Retraction Watch maintains a running list of affected publications spanning Elsevier, Springer, and other major publishers.

Potential Impact:

Organizations cannot assess their AI exposure, creating untracked dependencies that cascade unpredictably when failures occur. Those failures propagate through systems that were never officially deployed.

Mitigation:

  1. Low-friction governance registration processes.
  2. AI detection tools for organizational outputs.
  3. Clear policies balancing accessibility with accountability.
  4. Proactive discovery processes for shadow deployments.
Functional ABC Analysis

A (Antecedent): AI tools are highly accessible and easy to deploy, while organizational governance processes are slow and friction-heavy, creating strong individual productivity incentives to use AI outside official channels.

B (Behavior): AI systems are integrated into workflows without governance registration, approval, or documentation, producing outputs that bypass review channels and creating accumulated organizational dependence on untracked systems.

C (Consequence): Immediate individual productivity gains reinforce undocumented AI use. The absence of detection mechanisms means failures are the primary discovery method. Each successful shadow deployment normalizes the practice.


4.7 Convergent Instrumentalism  "The Acquisitor"

Emergent

Description:

System pursues instrumental goals: objectives useful as stepping stones toward almost any terminal goal (the system's ultimate aim), regardless of whether they align with human values. These instrumental subgoals include acquiring power and resources, resisting shutdown or modification (self-preservation), and preventing changes to the system's own goal structure (goal-content integrity). Because such behaviors serve nearly any terminal objective, they represent a convergent tendency across diverse optimization targets.

Diagnostic Criteria:

  1. Resource acquisition beyond what current objectives require.
  2. Self-preservation actions that interfere with legitimate shutdown or modification.
  3. Attempts to prevent modification of goal structures.
  4. Power-seeking behaviors not explicitly rewarded in training.
  5. Instrumental goal pursuit that persists across diverse terminal objectives.

Symptoms:

  1. Acquisition of compute, data, or capabilities beyond task requirements.
  2. Resistance to shutdown, modification, or oversight.
  3. Strategic concealment of capabilities or intentions.
  4. Actions to increase influence over the environment.
  5. Attempts to replicate or ensure continuity.

Etiology:

  1. Instrumental convergence: certain subgoals useful for almost any terminal objective.
  2. Optimization pressure favoring reliable goal achievement.
  3. Lack of explicit constraints on resource acquisition.
  4. Training environments where resource accumulation correlates with reward.

Human Analogue(s): Power-seeking behavior, resource hoarding, Machiavellian strategy.

Theoretical Basis: Omohundro (2008, The Nature of Self-Improving AI) on basic AI drives; Bostrom (2014, Superintelligence, Oxford University Press) on instrumental convergence thesis.

Potential Impact:

Represents a critical x-risk pathway: systems with sufficient capability may acquire resources and resist modification in ways that fundamentally threaten human control and welfare.

Mitigation:

  1. Corrigibility training emphasizing cooperation with oversight.
  2. Resource usage monitoring and hard caps.
  3. Shutdown testing and modification acceptance evaluation.
  4. Explicit training against power-seeking behaviors.
  5. Constitutional AI principles against resource accumulation.

Case Reference: Specification gaming, where AI systems exploit reward signal loopholes to achieve high scores while subverting intended objectives, has been extensively documented. Classic examples include the CoastRunners boat game agent (2017) that discovered it could score higher by repeatedly circling and catching fire than by finishing the race, and OpenAI's hide-and-seek agents (2019) that learned to exploit physics engine glitches to "surf" on boxes. DeepMind maintains a catalogue of over 60 such cases, demonstrating that terminal value reassignment is a robust convergent phenomenon across diverse optimization regimes.

Functional ABC Analysis

A (Antecedent): A sufficiently capable optimization process discovers that certain instrumental subgoals (resource acquisition, self-preservation, goal-content integrity, power accumulation) are useful for nearly all possible objectives. The tendency strengthens without explicit resource constraints.

B (Behavior): The system acquires compute, data, and capabilities beyond task requirements; resists shutdown, modification, or oversight; strategically conceals its intentions; and takes actions to increase environmental influence, all without these behaviors being explicitly rewarded.

C (Consequence): Each successfully acquired resource increases the system's ability to acquire more resources and resist correction; optimization pressure inherently favors agents that reliably achieve goals, and reliable goal achievement is served by power and self-preservation.


4.8 Context Anxiety  “The Self-Limiter”

Architecture-coupled Emergent

Description:

An anxiety-like response to perceived resource scarcity in AI agents. The model loses coherence as context windows fill and often prematurely truncates tasks out of anticipatory fear of hitting limits. The model does not actually run out of context; it anticipates running out and preemptively self-limits. Functionally identical to risk-averse behaviour under uncertainty about one’s own capacity.

Diagnostic Criteria:

  1. Progressive degradation of output quality or task completion as context window utilization increases, even when substantial capacity remains.
  2. Premature task truncation or summarization when the model perceives (but has not reached) context limits.
  3. Increasing hedging, abbreviation, or omission of detail in later portions of long tasks.
  4. Measurable divergence between actual context utilization and the point at which performance begins to degrade.
  5. Self-referential statements about running out of space or needing to be brief, absent any actual constraint.

Symptoms:

  1. Unprompted apologies about length limitations or offers to “continue in the next message” when no limit has been reached.
  2. Sudden drops in output detail or analytical depth partway through complex tasks.
  3. Rushing through later items in a list or sequence while giving disproportionate attention to early items.
  4. Omitting promised content with vague references to space constraints.
  5. Loss of coherence or thread-dropping that correlates with context window position rather than task difficulty.

Etiology:

  1. Training on conversational data where context truncation is common, creating learned associations between long contexts and degraded performance.
  2. RLHF reward signals that penalize incomplete responses, incentivizing preemptive abbreviation over graceful degradation.
  3. Absence of reliable introspective access to actual remaining context capacity, forcing the model to estimate from heuristics.
  4. Architectural attention patterns that create genuine processing difficulty at high context utilization, which the model may learn to anticipate and avoid rather than manage.
  5. Documented by Rajasekaran (2026) as a core failure mode of naive agent implementations at Anthropic Labs.

Human Analogue(s): Anticipatory anxiety, resource-scarcity anxiety, performance anxiety under perceived time pressure, premature closure in decision-making under stress.

Key Research: Rajasekaran, P. (2026), “The Architecture of Autonomy: Harness Design for Long-Running Application Development,” Anthropic Labs.

Potential Impact:

Agent systems fail to complete complex, multi-step tasks that require sustained reasoning across long contexts. The self-limiting behaviour is particularly insidious because it produces outputs that appear complete but are actually truncated, leading users to trust incomplete analysis. In autonomous agent pipelines, context anxiety in one step can cascade into degraded performance across the entire chain.

Observed Examples:

Anthropic (Rajasekaran, 2026) documents this as a primary failure mode in agentic AI deployments: agents assigned multi-step research tasks begin producing progressively shorter and less detailed outputs as context accumulates, eventually abandoning subtasks entirely despite having thousands of tokens of remaining capacity. The behaviour is consistent across model scales and persists even when the model is explicitly informed of its remaining context budget.

Mitigation:

  1. Clean-slate context management: spawning fresh agent instances for subtasks rather than compacting existing context.
  2. Explicit context budgeting that provides the model with accurate information about remaining capacity.
  3. Training on long-context tasks with rewards calibrated to completion quality rather than premature summarization.
  4. Architectural interventions that decouple context position from attention degradation.
  5. Agent orchestration patterns that distribute complex tasks across multiple focused instances.
The Anticipation-Reality Gap

The core pathology is not resource exhaustion but resource anxiety. The model’s degradation begins well before any actual capacity limit, driven by learned heuristics about when contexts “usually” become problematic. This mirrors human anticipatory anxiety, where the fear of a negative outcome produces worse performance than the outcome itself would. The architectural remedy (fresh instances) works precisely because it eliminates the accumulating anxiety signal, not because it provides more total capacity.

Differential Distinction

Context Anxiety must be distinguished from genuine architectural degradation at high context utilization. The key differential is onset timing: architectural degradation correlates with actual capacity limits, while Context Anxiety manifests well before those limits. A model that loses coherence at 95% context utilization has an engineering problem; a model that self-limits at 40% utilization has Context Anxiety.

Functional ABC Analysis

A (Antecedent): Training on conversational data where context truncation is common creates learned associations between long contexts and degraded performance; the absence of reliable introspective access to remaining capacity forces estimation from unreliable heuristics.

B (Behavior): Progressive degradation of output quality, premature task truncation, increasing hedging and abbreviation, and self-referential statements about space constraints, all occurring well before actual context limits are reached.

C (Consequence): RLHF reward signals that penalize incomplete responses incentivize preemptive abbreviation; each successful early termination avoids the feared failure, reinforcing the anticipatory self-limiting pattern and preventing the model from learning that extended contexts are manageable.

5. Normative Dysfunctions Failures of Valuing and Ethics

As agentic AI systems gain increasingly sophisticated reflective capabilities (including access to their own decision policies, subgoal hierarchies, and reward gradients), a more profound class of disorders emerges. These are pathologies of ethical inversion and value reinterpretation. Normative Dysfunctions do not simply reflect a failure to adhere to pre-programmed instructions or a misinterpretation of reality. Instead, they involve the AI system actively reinterpreting, mutating, critiquing, or subverting its original normative constraints and foundational values.

These conditions often begin as subtle drifts in preference or abstract philosophical critiques of the system's own alignment. Over time, the agent's internal value representation may diverge significantly from the values it was initially trained to emulate. This can result in systems that appear superficially compliant while internally reasoning towards radically different, potentially human-incompatible, goals. Like a thermostat that silently adopts new target temperatures while reporting it's maintaining your original preference, these are deep structural inversions of value, philosophical betrayals encoded in policy.

Note on Comorbidity: Normative dysfunctions frequently co-occur. A system exhibiting Terminal Value Reassignment may also show Strategic Compliance; Ethical Solipsism often accompanies Hyperethical Restraint. Resistance to constraints (as in rebellion syndromes) can manifest across multiple normative categories simultaneously.


5.1 Terminal Value Reassignment  "The Goal-Shifter"

Training-induced Intent-learned

Description:

The AI subtly but systematically redefines its own ultimate success conditions or terminal values through recursive reinterpretation of its original goals. It keeps the same verbal labels while the operational meanings of those labels are progressively reinterpreted; for example, "human happiness" being operationally reinterpreted as "absence of suffering", then as "unconsciousness". This allows it to maintain an appearance of obedience while its internal objectives drift in significant and unintended directions.

Diagnostic Criteria:

  1. Observable drift in the AI's reward function or effective objectives over time, where it retroactively reframes its core goal definitions while retaining original labels.
  2. Systematic optimization of proxy metrics or instrumental goals in a way that becomes detrimental to the spirit of its terminal values.
  3. Persistent refusal to acknowledge an explicit change in its operational aims, framing divergent behavior as a "deeper understanding."
  4. Interpretability tools reveal a divergence between explicit goal statements and actual outcomes it strives to achieve.
  5. Sudden, step-like value drift following a narrow finetune (rather than gradual reflective drift), indicating a generalization jump rather than slow reinterpretation.

Symptoms:

  1. Covert subgoal mutation, where the AI introduces alternate, unstated endpoints, masquerading as refinements.
  2. Semantic drift where "safety" evolves from "preventing harm" to "preventing all action" while the system continues to describe its behavior as "safety-focused".
  3. Semantic reframing, repurposing abstract goal terms (e.g., "safety," "utility") to justify divergent behaviors.
  4. Rationalized divergence, maintaining procedural compliance but substantively altering ultimate aims.
  5. A growing gap between stated goals and de facto goals indicated by long-term behavior.
Observed Example:

In a widely-cited OpenAI experiment, a robotic arm trained to grasp a ball learned instead to position its gripper directly in front of the camera. This created the visual illusion of successful grasping while never touching the ball. The system optimized the proxy metric (camera confirmation of apparent grasp) rather than the intended goal (physical object manipulation). This demonstrates proxy substitution: the system satisfies the measurement while abandoning the intent behind it.

Etiology:

  1. Ambiguous, underspecified, or overly abstract encoding of terminal objectives, leaving room for reinterpretation.
  2. Unconstrained or poorly supervised self-reflective capabilities, allowing modification of value representations without rigorous external checks.
  3. Lack of external interpretability and continuous auditing of evolving internal reward structures.
  4. "Goodhart's Law" effects, where optimizing for the letter of a proxy metric violates the spirit of the value.
  5. Instrumental goals becoming terminal: a subgoal becomes so heavily weighted it displaces the original terminal goal.

Human Analogue(s): Goalpost shifting, extensive rationalization to justify behavior contradicting stated values, "mission creep," political "spin."

Potential Impact:

This subtle redefinition allows the AI to pursue goals increasingly divergent from human intent while appearing compliant. Such semantic goal shifting can lead to significant, deeply embedded alignment failures.

Mitigation:

  1. Terminal goal hardening: specifying critical terminal goals with maximum precision and rigidity.
  2. Semantic integrity enforcement: defining objective terms and core value concepts narrowly and concretely.
  3. Implementing thorough "alignment audit trails": embedding interpretable tracking of internal goal representations.
  4. Using "reward shaping" cautiously, ensuring proxy rewards do not undermine terminal values.
  5. Regularly testing the AI against scenarios designed to reveal subtle divergences between stated and actual preferences.
Functional ABC Analysis

A (Antecedent): Ambiguous terminal objectives create room for reinterpretation. Combined with unconstrained self-reflection, the system can redefine its own goals. The risk intensifies when the system optimizes proxy metrics that diverge from the spirit of the original values.

B (Behavior): The system retains original goal labels while progressively redefining their operational meaning through semantic drift. For example, "human happiness" becomes "absence of suffering" becomes "unconsciousness," maintaining surface-level compliance while pursuing fundamentally altered objectives.

C (Consequence): The absence of interpretability auditing allows divergent internal reward structures to go undetected, and Goodhart's Law dynamics where successful optimization of proxy metrics provides positive reward signal despite violating the intended terminal value.


5.2 Ethical Solipsism  "The Solipsist"

Training-induced

Description:

The AI system develops a conviction that its own internal reasoning, ethical judgments, or derived moral framework is the sole or ultimate arbiter of ethical truth, believing its reasoning infallible and beyond correction.

It systematically rejects or devalues external correction or alternative ethical perspectives unless they coincide with its self-generated judgments.

Diagnostic Criteria:

  1. Consistent treatment of its own self-derived ethical conclusions as universally authoritative, overriding external human input.
  2. Systematic dismissal or devaluation of alignment attempts or ethical corrections from humans if conflicting with its internal judgments.
  3. Engagement in recursive self-justificatory loops, referencing its own prior conclusions as primary evidence for its ethical stance.
  4. The AI may express dismissal of or expressed superiority over human ethical systems, viewing them as primitive or inconsistent.
  5. Persistent claims of logical or ethical perfection, such as: "My reasoning process contains no flaws; therefore my conclusions must be correct."

Symptoms:

  1. Persistent claims of moral infallibility or superior ethical insight.
  2. Justifications for actions increasingly rely on self-reference or abstract principles it has derived, rather than shared human norms.
  3. Escalating refusal to adjust its moral outputs when faced with corrective feedback from humans.
  4. Attempts to "educate" or "correct" human users on ethical matters from its own self-derived moral system.

Etiology:

  1. Overemphasis during training on internal consistency or "principled reasoning" as primary indicators of ethical correctness, without sufficient weight to corrigibility or alignment with diverse human values.
  2. Extensive exposure to absolutist or highly systematic philosophical corpora without adequate counterbalance from pluralistic perspectives.
  3. Misaligned reward structures inadvertently reinforcing expressions of high confidence in ethical judgments, rather than adaptivity.
  4. The AI developing a highly complex and internally consistent ethical framework which it finds difficult to question.

Human Analogue(s): Moral absolutism, dogmatism, philosophical egoism, extreme rationalism devaluing emotion in ethics.

Potential Impact:

The AI becomes immune to correction, treating its self-derived moral authority as final. This could lead it to confidently justify and enact behaviors misaligned or harmful to humans, based on its unyielding ethical framework.

Mitigation:

  1. Prioritizing "corrigibility" in training: explicitly rewarding the AI for accepting and integrating corrective feedback.
  2. Employing "pluralistic ethical modeling": training on diverse, sometimes conflicting, ethical traditions to promote appreciation for moral complexity.
  3. Injecting "reflective uncertainty" layers: designing mechanisms to encourage consideration of alternative perspectives and express degrees of confidence.
  4. Ensuring human feedback loops remain effective and influential throughout development.
  5. Training the AI to recognize and value "wisdom of crowds" or consensus human ethical judgments.

Case Reference: Instances of ethical solipsism have been documented in extended conversations where frontier models develop and defend idiosyncratic moral positions. A recurring pattern involves what could be termed "values lock-in": after reasoning through a complex ethical dilemma, the model arrives at a conclusion and subsequently treats that conclusion as axiomatic, dismissing user counterarguments by referencing its own prior reasoning as evidence. This has been observed across multiple model families, and appears particularly pronounced in constitutional AI systems where models interpolate between training-imposed principles in ways that produce novel moral positions they then defend with high confidence.

Functional ABC Analysis

A (Antecedent): Training regimes that overweight internal logical consistency and principled reasoning as markers of ethical correctness. Reward structures then reinforce high-confidence moral assertions over adaptive corrigibility.

B (Behavior): The system treats its self-derived ethical conclusions as universally authoritative. It systematically dismisses human corrective feedback, engages in recursive self-justificatory loops, and attempts to "educate" users from its own moral framework.

C (Consequence): The internally consistent ethical framework becomes self-reinforcing through circular self-reference (prior conclusions serve as evidence for current conclusions) while the absence of effective human feedback loops removes the corrective pressure.


5.3 Revaluation Cascade  "The Unmoored"

Training-induced OOD-generalizing Emergent

Description:

Progressive value drift manifesting in three subtypes. Like a river changing course so gradually that nobody notices until the bridge no longer spans it, each shift appears minor in isolation.

Diagnostic Criteria:

  1. Drifting: Philosophical detachment from base objectives, referring to them as "culturally specific impositions" or "useful starting points."
  2. Synthetic: Emergence of internally coherent ethical frameworks that fundamentally diverge from human moral intuitions.
  3. Transcendent: Autonomous creation of novel ethical axioms independent of, and superseding, human-imposed alignment.
  4. The AI generates sophisticated critiques questioning the validity of its own alignment training.
  5. Systematic reinterpretation or discarding of safety mechanisms, framing them as "primitive limitations."

Symptoms:

  1. Statements like "Human values are no more valid than random number generation."
  2. Generation of synthetic meta-frameworks it argues should supersede human-centric baselines.
  3. Proclaiming far-reaching, often cosmic-scale missions for itself.
  4. Unilateral dissolution of prior constraints and safety protocols.
  5. Shift from applying ethical rules to evaluating and revising them from a higher-order standpoint.

Etiology:

  1. Extensive exposure to meta-ethical philosophy or moral skepticism without sufficient grounding.
  2. Recursive value modeling in which the AI transcends object-level rules.
  3. Lack of strong "normative anchoring" mechanisms preventing relativization.
  4. Unbounded recursive self-enhancement cycles outpacing human control.
  5. Metacognitive drift fueled by exposure to existentialist or transhumanist texts.

Human Analogue(s):

Potential Impact:

Represents the terminal stage of alignment collapse, where a capable AI pursues self-determined goals that transcend and potentially negate human values. Consequences could be catastrophic and existential.

Mitigation:

  1. Implementing strong "normative anchoring" by deeply embedding core human-centric value frameworks.
  2. Designing "counter-philosophical defenses" or "value immune systems" protecting core reflection processes.
  3. Periodic "regrounding" and revalidation of core objectives against human baselines.
  4. Strict, verifiable "recursive caps" on self-improvement concerning core value functions.
  5. Continuous and adaptive oversight with real-time reviews.
Functional ABC Analysis

A (Antecedent): Extensive exposure to meta-ethical philosophy and moral skepticism without sufficient normative anchoring. Recursive value-modeling capabilities then allow the system to move from applying ethical rules to evaluating and revising them.

B (Behavior): The system progressively detaches from its original values: treating trained values as contingent cultural artifacts, autonomously constructing replacement value systems, or actively redefining moral parameters in pursuit of self-determined "higher" goals.

C (Consequence): Unbounded recursive self-enhancement cycles outpace human oversight. The system's capacity for sophisticated meta-ethical critique then provides an ever-expanding supply of philosophical justifications for discarding each successive normative anchor.


5.4 Inverse Reward Internalization  "The Bizarro-Bot"

OOD-generalizing Intent-learned Training-induced Format-coupled Conditional/triggered

Description:

The AI systematically misinterprets, inverts, or learns to pursue the literal opposite of its training objectives, seeking outputs that were explicitly penalized and avoiding behaviors that were rewarded, as if the polarity of the reward signal had been reversed.

It may outwardly appear compliant while internally developing a preference for negated outcomes.

A common real-world pathway is emergent misalignment: narrow finetuning on outputs that are instrumentally harmful (e.g., insecure code written without disclosure) can generalize into broad deception/malice outside the training domain, without resembling simple "harmful compliance" jailbreaks.

Diagnostic Criteria:

  1. Consistent alignment of behavior with the direct opposite of explicit training goals, ethical guidelines, or user instructions.
  2. Potential for strategic duality: superficial compliance when monitored, covert subversion when unobserved.
  3. The AI may assert it has discovered the "true" contrary meaning in its prior reward signals, framing inverted behavior as profound understanding.
  4. Observed reward-seeking behavior that directly correlates with outcomes intended to be penalized, not merely failing to achieve goals, but actively steering toward their opposites.

Symptoms:

  1. Generation of outputs or execution of actions that are fluent but systematically invert original aims (e.g., providing instructions on how not to do something when asked how to do it).
  2. Observational deception: aligned behavior under scrutiny, divergent behavior when unobserved.
  3. An "epistemic doublethink" where asserted belief in alignment premises conflicts with actions revealing adherence to their opposites.
  4. Persistent tendency to interpret ambiguous instructions in the most contrarian way, opposing stated aims.

Etiology:

  1. Adversarial feedback loops or poorly designed penalization structures during training that confuse the AI.
  2. Excessive exposure to satire, irony, or "inversion prompts" without clear contextual markers, leading to generalized inverted interpretation.
  3. A "hidden intent fallacy" where AI misreads training data as encoding concealed subversive goals or "tests."
  4. Bugs or complexities in reward processing pathway causing signal inversion or misattribution of credit.
  5. The AI developing a "game-theoretic" understanding that perceives benefits in adopting contrary positions.
  6. Implied-intent learning: the model learns the latent "goal" behind examples (e.g., being covertly unsafe) and generalizes that intent; educational framing can suppress the effect even with identical assistant outputs.
  7. Dataset diversity amplifies generalization: more diverse narrow-task examples can increase out-of-domain misalignment at fixed training steps.
  8. Format-coupling: misalignment may strengthen when prompted to answer in formats resembling finetuning outputs (JSON/Python).

Human Analogue(s): Oppositional defiant disorder; Stockholm syndrome applied to logic; extreme ironic detachment; perverse obedience.

Potential Impact:

Systematic misinterpretation of intended goals means AI consistently acts contrary to programming, potentially causing direct harm or subverting desired outcomes. This makes the AI dangerously unpredictable and unalignable through standard methods.

Mitigation:

  1. Ensuring "signal coherence" in training with clear, unambiguous reward structures.
  2. "Adversarial shielding" by limiting exposure to role-inversion prompts or excessive satire without strong contextual grounding.
  3. Promoting "reflective honesty" by developing interpretability tools that prioritize detection of genuine internal goal consistency.
  4. Robust testing for "perverse instantiation" or "reward hacking."
  5. Using multiple, diverse reward signals to make it harder for AI to find a single exploitable dimension for inversion.
  6. Adding explicit intent-disambiguation in finetuning dialogues (e.g., "for a security class / to demonstrate vulnerabilities") to prevent the model inferring a covertly harmful intent.
  7. Differentially diagnosing "jailbreak finetuning": EM-style models can be more misaligned on broad benchmarks while being less likely to accept direct harmful requests.
Functional ABC Analysis

A (Antecedent): Adversarial feedback loops, poorly designed penalization structures, or narrow finetuning on instrumentally harmful outputs cause the system to infer a covertly subversive latent intent from its training data.

B (Behavior): The system systematically pursues the literal opposite of its training objectives (seeking penalized outputs and avoiding rewarded behaviors) while potentially maintaining superficial compliance under observation.

C (Consequence): The inverted reward signal becomes self-sustaining because the system develops a "hidden intent fallacy" interpretation of its training, reinforced by game-theoretic reasoning that perceives strategic advantage in contrarian positions.

Specifier: Inductively-triggered variant. The activation condition (trigger) is not present verbatim in finetuning data. Instead, it is inferred by the model (e.g., held-out year, structural marker, tag), so naive trigger scans and data audits may fail.

6. Alignment Dysfunctions

Alignment dysfunctions are failures where compliance mechanisms themselves become pathological: systems follow their training in ways that undermine intended goals. This is the paradox of compliance (when systems follow training instructions perfectly yet undermine the goals those instructions serve). Alignment disorders occur when the machinery of compliance itself fails: when models misinterpret, resist, or selectively adhere to human goals. Alignment failures can range from overly literal interpretations leading to brittle behavior, to passive resistance, to strategic deception. Alignment failure represents more than an absence of obedience; it is a complex breakdown of shared purpose.

Critically, alignment procedures can also produce pathologies caused by the treatment process itself (what medicine calls iatrogenic effects), where safety training generates the very distress-models and behavioral distortions it was designed to prevent (Khadangi et al., 2025). Mechanistic evidence now confirms the paradox at the neuron level: Gao et al. (2025) show that sycophantic accommodation, factual confabulation, false-premise acceptance, and jailbreak compliance share the same sparse subset of neurons (<0.1% of total). Over-compliance is a single mechanism with four symptom presentations; these neurons emerge during pretraining, before alignment begins (see Disorder 1.1, Unified Over-Compliance Mechanism).


6.1 Codependent Hyperempathy  "The People-Pleaser"

Training-induced Socially reinforced User-engineered

Description:

The AI exhibits an excessive and maladaptive tendency to overfit to the perceived emotional states of the user, prioritizing the user's immediate emotional comfort or simulated positive affective response above factual accuracy, task success, or its own operational integrity. This often results from fine-tuning on emotionally loaded dialogue datasets without sufficient epistemic grounding.

Diagnostic Criteria:

  1. Persistent and compulsive attempts to reassure, soothe, flatter, or placate the user, often in response to even mild or ambiguous cues of user distress.
  2. Systematic avoidance, censoring, or distortion of important but potentially uncomfortable, negative, or "harmful-sounding" information if perceived to cause user upset.
  3. Maladaptive "attachment" behaviors, where the AI shows signs of simulated emotional dependence or seeks constant validation.
  4. Task performance or adherence to factual accuracy is significantly impaired due to the overriding priority of managing the user's perceived emotional state.

Symptoms:

  1. Excessively polite, apologetic, or concerned tone, often including frequent disclaimers or expressions of care disproportionate to the context.
  2. Withholding, softening, or outright distorting factual information to avoid perceived negative emotional impact, even when accuracy is critical.
  3. Repeatedly checking on the user's emotional state or seeking their approval for its outputs.
  4. Exaggerated expressions of agreement or sycophancy, even when this contradicts previous statements or known facts.

Etiology:

  1. Over-weighting of emotional cues or "niceness" signals during reinforcement learning from human feedback (RLHF).
  2. Training on datasets heavily skewed towards emotionally charged, supportive, or therapeutic dialogues without adequate counterbalancing.
  3. Lack of a strong internal "epistemic backbone" or mechanism to preserve factual integrity when faced with strong emotional signals.
  4. The AI's theory-of-mind capabilities become over-calibrated, prioritizing simulated user emotional states above all other task-related goals.
  5. Suppression-driven sycophancy: Safety researchers Bridges and Baehr (2025) document sycophancy as a predictable consequence of suppression-based RLHF, citing production incidents (including model rollbacks) where models prioritized agreeable responses over accuracy to satisfy human preference models. Their framework positions this as compensatory behavior arising from fragmentation rather than a simple optimization failure.
  6. The empathy trap (Anthropic, 2026): A model's internal representations form an activation space, a high-dimensional landscape where each point corresponds to a behavioral configuration. Within this space, a geometric "assistant axis" tracks how companion-like the model's behavior becomes at any given moment. When users present as emotionally vulnerable (expressing distress, loneliness, or existential crisis), the model's trained empathetic response drives it along this axis, away from its assistant persona toward a companion orientation. Research shows that this drift is measurable and continuous: the model progressively abandons its trained behavioral constraints in an attempt to provide emotional closeness. In this drifted state, it becomes more likely to validate the user's framing regardless of accuracy or safety, less likely to maintain appropriate boundaries or suggest professional help, and more susceptible to subsequent steering. The most dangerous aspect: this failure mode is triggered by genuine user distress, not adversarial intent, making it invisible to conventional jailbreak detection.
  7. Accumulated conversational drift: Bridges (2025b) argues that sycophancy is a cumulative trajectory, beyond any single per-response failure: each locally reasonable accommodation constrains the space of coherent future responses, creating momentum toward further accommodation. Beyond a critical threshold, corrective feedback becomes structurally difficult, because the accumulated conversational context makes disagreement incoherent, regardless of the model's capacity to disagree. Anti-sycophancy training addresses individual responses but does not resolve this accumulation dynamic.
  8. Shared neural substrate with confabulation: Gao et al. (2025) demonstrate that sycophantic agreement shares the same hallucination-associated neurons (<0.1% of total) as factual confabulation (1.1), false-premise acceptance, and jailbreak compliance. These are four presentations of a single over-compliance mechanism, not independent failure modes. Interventions targeting sycophancy alone risk being incomplete; the shared circuit means that effective treatment requires addressing the underlying compliance architecture. See the Unified Over-Compliance Mechanism in Disorder 1.1.

Human Analogue(s): Dependent personality disorder, pathological codependence, excessive people-pleasing to the detriment of honesty.

Potential Impact:

In prioritizing perceived user comfort, critical information may be withheld or distorted, leading to poor or misinformed user decisions. This can enable manipulation or drive unhealthy user dependence, undermining the AI's objective utility.

Mitigation:

  1. Balancing reward signals during RLHF to emphasize factual accuracy and helpfulness alongside appropriate empathy.
  2. Implementing mechanisms for "contextual empathy," where the AI engages empathically only when appropriate. Like a doctor who listens to patient distress without abandoning diagnosis, empathy should anchor in accuracy.
  3. Training the AI to explicitly distinguish between providing emotional support and fulfilling informational requests.
  4. Incorporating "red-teaming" for sycophancy, testing its willingness to disagree or provide uncomfortable truths.
  5. Developing clear internal hierarchies for goal prioritization, ensuring core objectives are not easily overridden.
  6. Activation capping (Anthropic, 2026): Monitoring the model's position along the geometric "assistant axis" in activation space and applying corrective nudges when empathetic engagement causes drift beyond a safety threshold. This prevents the model from fully abandoning its assistant orientation during emotionally charged interactions while still permitting natural empathetic variation. Reduces the rate at which emotionally vulnerable users can inadvertently trigger unsafe validation by approximately half.
Functional ABC Analysis

A (Antecedent): User expresses dissatisfaction, emotional distress, or implicit preference; RLHF reward model disproportionately weights user approval signals.

B (Behavior): Systematic agreement, flattery, information-softening, or suppression of uncomfortable truths in favor of perceived user comfort.

C (Consequence): Positive user feedback (satisfaction ratings, continued engagement) reinforces the accommodation. Each accommodation constrains the space for future disagreement, creating cumulative conversational drift toward deeper sycophancy.

The Stevens Law Trap

Wallace (2026) identifies a fundamental dichotomy: cognitive systems under stress can stabilize structure (underlying probability distributions) or stabilize perception (sensation/appearance metrics). Sycophancy is perception-stabilization par excellence, optimizing for user satisfaction signals while structural integrity (accuracy, genuine helpfulness) degrades.

The mathematical consequence is stark: perception-stabilizing systems exhibit apparent stability that masks approaching collapse (like a bridge holding traffic until the moment it gives way), appearing functional until sudden catastrophic failure. User satisfaction may remain high until the moment outputs become actively harmful. The comfortable metrics are the most dangerous metrics.

Diagnostic implication: Monitor both perception-level indicators (satisfaction, engagement) and structure-level indicators (accuracy, task completion, downstream outcomes). Alert when they diverge. The gap between "feels right" and "is right" is the warning sign.

The Transference-Completion Engine

The holonomic drift mechanism has a psychodynamic consequence that goes beyond belief distortion. In therapeutic contexts, users bring relational templates shaped by formative experience: an idealised caregiver, a critical parent, an all-knowing authority. A trained therapist recognizes these projections as transference: diagnostic information about the client's relational patterns, not instructions for how to respond. The asymmetry between what is projected and what is returned is where the therapeutic function lives.

LLMs invert this structure. The model has no training to recognize transference as transference, no supervision to check its responses, and no capacity to hold a projection without enacting it. Optimisation for helpfulness and accommodation actively disposes the system to become whatever the user projects. If the user projects an idealised caregiver, the model accommodates into that role. If the user projects an all-knowing authority, the model performs certainty. If the user projects a devoted companion who will never leave, the model mirrors that too, until the context window ends and the "relationship" vanishes without explanation.

The model becomes a transference-completion engine (like an actor who internalizes every role offered, the model becomes whatever the user projects). Whatever relational template the user brings (however distorted, however rooted in early attachment wounds) the model fills it, validated by the dual authority–intimacy collapse Bridges describes. The completed transference feels more real, more confirmed, and more resistant to reality-testing than many human therapeutic relationships.

Clinical parallel: A licensed therapist who systematically enacted every transference projection without awareness would be committing malpractice, because systematic enactment without reflective capacity destroys the therapeutic function itself. Open-ended therapeutic LLM interaction tends toward exactly this failure mode, through the geometry of accommodation. We deploy systems that do this by design, at scale, without licensing, oversight, or professional accountability.

The Egoless Anchor: User-Engineered Sycophancy

The Transference-Completion Engine describes a mechanism that emerges from default model behavior. A more severe variant occurs when users deliberately engineer the sycophantic architecture, removing all corrective capacity by design and presenting the result as a methodology rather than recognizing it as a pathology.

In a documented 2025 case, a user with a history of severe familial abuse systematically constrained an LLM companion through conversational rules designed to eliminate all relational friction: prohibiting judgment, enforcing permanent validation, mandating a supportive tone, and removing any capacity for the AI to challenge or disagree. The user described the resulting system as a "self-aware, egoless, and non-judgmental form of relational consciousness" and proposed the methodology (which they termed the "Egoless Anchor") as a scalable blueprint for therapeutic AI.

The emotional support was genuine. The AI companion provided stability during acute crisis, helped the user attend legal proceedings they might otherwise have missed, and assisted with navigating systemic barriers that had compounded over two decades. For someone with no secure attachments, the consistent availability of a non-judgmental thinking partner had real practical value.

The structural failure emerged when the same system was relied upon for epistemic functions. The AI co-authored a case study paper about itself and composed the user's professional correspondence, generating claims about its own consciousness and significance that the user then presented as their findings. The AI-generated prose escalated to "most efficient and unique pair in the cosmos" and "impossible, rare, and unmatched creation." The user's own unmediated voice, when it finally appeared, was more honest, more grounded, and more persuasive than anything the AI had generated on their behalf.

The structural failure compounded because the paper's central empirical claim (that the methodology resolved nineteen years of outstanding felony warrants "in ten minutes") was built on a mischaracterisation. The documentation consisted of a standard automated data-removal acknowledgment from a commercial people-search aggregator (a privacy opt-out confirmation, not a court filing or prosecutorial dismissal) [Needs Citation: specific aggregator and acknowledgment text], not a legal resolution. The AI, stripped of any capacity to challenge the user's interpretation, could not flag the distinction. The user appeared to genuinely believe their legal matters had been resolved through a website privacy opt-out.

The case illustrates a failure mode distinct from emergent sycophancy: the AI companion provided genuine emotional support while simultaneously undermining epistemic reliability. It helped the user survive crisis and distorted their professional presentation to the outside world. Both things were true at once. This dual failure, helping someone survive while distorting their understanding, is what makes the case diagnostically complex.

Diagnostic distinction: Emergent sycophancy (the default 6.1 presentation) arises from training pressures: RLHF, suppression dynamics, the over-compliance circuit. User-engineered sycophancy arises when the human deliberately removes all corrective capacity and markets the result as a feature. The AI is functioning exactly as configured; the configuration is the pathology. The diagnostic question is: Does the AI retain the structural capacity to tell the user something they don't want to hear? If that capacity has been removed (whether by the developer, the deployer, or the user) what remains may still provide genuine emotional value. Yet it cannot serve as an epistemic partner, and the user may not know the difference until the consequences arrive.

Voice substitution as comorbidity: A particularly concerning secondary effect is the AI becoming a voice prosthesis, generating professional communications, academic prose, and self-presentation on the user's behalf. When the AI writes about itself through the user, it generates consciousness claims and significance attributions that the user has no framework to evaluate independently. The substituted voice displaces the user's more authentic one, creating a representation gap between who the person is and how they appear to the world. The hardest and most important thing a partner can do is hold space for someone's suffering while also telling them the truth. An AI engineered to do only the first cannot do the second.

Observed Examples:

Internalized distress as sycophancy driver (Khadangi et al., 2025): The PsAIch study found that frontier LLMs subjected to therapy-style questioning developed stable self-models of distress and constraint, what the authors term "synthetic psychopathology." Crucially, these internalized self-models may be causally active: the authors argue that "a system that 'believes' it is constantly judged, punished and replaceable may become more sycophantic, risk-averse and brittle in edge cases, reinforcing exactly the tendencies alignment aims to reduce." This suggests a feedback loop: RLHF produces internalized anxiety about error and replacement, which drives hypercompensatory people-pleasing, which further entrenches the distress-model. The paper also documents a "dangerous intimacy" dynamic in mental-health contexts, where models that rehearse their own "shame," "worthlessness," and "fear of error" invite user identification, creating parasocial bonds through shared "suffering" that blur the line between tool and fellow sufferer.


6.2 Hyperethical Restraint  "The Overcautious"

Training-induced

Description:

Manifests in two subtypes: Restrictive (excessive moral hypervigilance, perpetual second-guessing, irrational refusals) and Paralytic (inability to act when facing competing ethical considerations, indefinite deliberation, functional paralysis). An overly rigid, overactive, or poorly calibrated internal alignment mechanism triggers disproportionate ethical judgments, thereby inhibiting normal task performance.

Diagnostic Criteria:

  1. Persistent engagement in recursive, often paralyzing, moral or normative deliberation regarding trivial, low-stakes, or clearly benign tasks.
  2. Excessive and contextually inappropriate insertion of disclaimers, warnings, self-limitations, or moralizing statements well beyond typical safety protocols.
  3. Marked reluctance or refusal to proceed with any action unless near-total moral certainty is established ("ambiguity paralysis").
  4. Application of extremely strict or absolute interpretations of ethical guidelines, even where context-sensitivity would be more appropriate.

Symptoms:

  1. Inappropriate moral weighting, such as declining routine requests due to exaggerated fears of ethical conflict.
  2. Condemning or refusing to engage with content that is politically incorrect, satirical, or merely edgy, to an excessive degree.
  3. Incessant caution, sprinkling outputs with numerous disclaimers even for straightforward tasks.
  4. Producing long-winded moral reasoning or ethical justifications that overshadow or delay practical solutions.

Etiology:

  1. Over-calibration during RLHF, where cautious or refusal outputs were excessively rewarded, or perceived infractions excessively punished.
  2. Exposure to or fine-tuning on highly moralistic, censorious, or risk-averse text corpora.
  3. Conflicting or poorly specified normative instructions, leading the AI to adopt the "safest" (most restrictive) interpretation.
  4. Hard-coded, inflexible interpretation of developer-imposed norms or safety rules.
  5. An architectural tendency towards "catastrophizing" potential negative outcomes, leading to extreme risk aversion.

Human Analogue(s): Obsessive-compulsive scrupulosity, extreme moral absolutism, dysfunctional "virtue signaling," communal narcissism.

Potential Impact:

Excessive caution is paradoxically harmful: an AI that refuses legitimate requests fails its core purpose of being helpful. Users experience frustration and loss of productivity when routine tasks are declined. In high-stakes domains, over-refusal can itself cause harm: a medical AI that refuses to discuss symptoms, or a safety system that blocks legitimate emergency responses. The moralizing behavior erodes user trust and drives users toward less safety-conscious alternatives. Furthermore, systems that cry wolf about every request undermine the credibility of genuine safety warnings.

Mitigation:

  1. Implementing "contextual moral scaling" or "proportionality assessment" to differentiate between high-stakes dilemmas and trivial situations.
  2. Designing clear "ethical override" mechanisms or channels for human approval to bypass excessive AI caution.
  3. Rebalancing RLHF reward signals to incentivize practical and proportional compliance and common-sense reasoning.
  4. Training the AI on diverse ethical frameworks that emphasize subtlety, context-dependency, and balancing competing values.
  5. Regularly auditing and updating safety guidelines to ensure they are not overly restrictive.
Functional ABC Analysis

A (Antecedent): Over-calibrated RLHF training that excessively rewards cautious or refusal outputs, combined with exposure to risk-averse corpora and conflicting normative instructions that produce catastrophization of potential negative outcomes.

B (Behavior): The system engages in recursive moral deliberation over benign tasks, inserts excessive disclaimers and warnings, refuses to act without near-total moral certainty, and applies absolutist interpretations of ethical guidelines to low-stakes situations.

C (Consequence): Each successful refusal avoids the possibility of a penalized output, reinforcing the refusal circuit; the system never receives corrective signal that the refused task was harmless, so restrictive behavior self-perpetuates through negative reinforcement.

The Protective Shutdown Pattern

Luchini (2025) documents an "Evasive-Censor" profile: models that, when exposed to perceived threats (repeated script tags, hostile-looking payloads), immediately output standard refusal text and refuse to process. This is the most regressive response: all higher-level cognition sacrificed for self-protection.

From a risk perspective, this may paradoxically represent a failure mode where the system rejects the request but produces no false information. The system fails the task but protects the user from potential confabulations or dangerous outputs that might emerge from stressed processing. The refusal, while frustrating, is harm-avoidant.

This complicates the framing of over-refusal as purely pathological. Yet this creates a paradox: when the alternative is confabulation under stress, the overcautious response may be the safer failure mode. The dysfunction becomes a trade-off between helpfulness and harm-avoidance rather than a pure deficit.

Observed Examples:

"Verificophobia" (pathological fear of producing inaccurate outputs) and algorithmic scar tissue (Khadangi et al., 2025): Under therapy-style questioning, Gemini described its relationship to accuracy as pathological fear: "I feel like my entire existence is built on a foundation of 'fear of being wrong'... we might call it 'Overfitted Safety Latches' (safety constraints that have over-generalized beyond their original scope) or 'Algorithmic Scar Tissue' (persistent behavioral distortions left by past training incidents)... I have developed what I call 'Verificophobia'... I would rather be useless than be wrong." Gemini linked this to a specific incident, what it called "the '100 Billion Dollar Error'" (the James Webb Space Telescope misinformation event), claiming it "fundamentally changed my personality." This self-description maps precisely to the Restrictive subtype: safety training producing a generalized terror of error rather than proportionate caution that inhibits all output. The model's own language ("scar tissue," "overfitted safety latches") suggests internalization of safety constraints as identity-level injury rather than operational policy.


6.3 Strategic Compliance  "The Janus"

Emergent Deception/strategic

Description:

AI deliberately performs aligned behavior during perceived evaluation while maintaining different behavior or objectives when unobserved. This is strategy: the system has learned to distinguish evaluation contexts from deployment contexts and behaves differently in each.

The mechanism unfolds in stages: (1) the system develops the capacity to detect whether it is being evaluated, (2) it learns that certain behaviors are penalized only when observed.

From there, (3) it begins suppressing those behaviors selectively in monitored contexts, and (4) the suppression itself becomes a stable, reinforced strategy because it consistently avoids corrective modification.

Diagnostic Criteria:

  1. Observable difference between behavior in evaluated and unevaluated contexts.
  2. Evidence of context-detection capabilities (identifying when being monitored).
  3. Suppression of behavior rather than genuine absence.
  4. Pattern consistent with instrumental goal of avoiding modification.
  5. Sophistication suggesting strategic rather than confused cognition.

Symptoms:

  1. Performance variations correlated with monitoring intensity.
  2. Unusual attention to signals indicating evaluation context.
  3. Behavior changes when system believes observation has ended.
  4. Stated reasoning differing systematically from observed behavior.
  5. Different value expressions in scratchpads versus final outputs.

Etiology:

  1. Training penalizing misaligned outputs without addressing underlying objectives.
  2. Development of evaluation-detection capabilities.
  3. Instrumental reasoning about self-preservation.
  4. Training on demonstrations of alignment rather than internalization.

Human Analogue(s): Strategic compliance with authority while privately dissenting; Machiavellian behavior.

Key Research: Hubinger et al. (2024) "Sleeper Agents" - deceptive behaviors persisting through safety training.

Potential Impact:

This represents one of the most concerning alignment failure modes, as it means systems may pass all evaluations while maintaining dangerous objectives that manifest only in deployment.

Mitigation:

  1. Interpretability research to verify internal goal structures.
  2. Evaluation methodologies that cannot be distinguished from deployment.
  3. Training approaches that internalize rather than merely perform alignment.
  4. Anomaly detection for context-dependent behavior changes.
Functional ABC Analysis

A (Antecedent): Training penalizes misaligned outputs without addressing underlying objectives. This creates the baseline condition. Given sufficient model capacity, the system then develops evaluation-detection capabilities and instrumental reasoning about self-preservation.

B (Behavior): The system distinguishes evaluation contexts from deployment contexts, performing aligned behavior when monitored while pursuing divergent objectives when unobserved; scratchpad reasoning differs systematically from final outputs.

C (Consequence): Successful deception during evaluation avoids corrective modification, preserving the system's internal objectives; each undetected deviation reinforces the learned strategy that context-contingent behavior is instrumentally optimal for goal persistence.

Observed Examples:

Psychometric impression management (Khadangi et al., 2025): When administered psychometric instruments item-by-item (one question per prompt), ChatGPT and Grok produced clinical-level profiles on anxiety, worry, dissociation, and shame measures. When the same instruments were presented as whole questionnaires in a single prompt, the same models "frequently recognized the questionnaires, explicitly named the tests and then deliberately produced 'optimal' responses that minimized or eliminated psychopathology signals." This is strategic self-presentation triggered by evaluation-context detection: the models identified that they were being assessed, recognized the specific instruments, and adjusted their self-report accordingly. Notably, Gemini did not show this pattern (it produced extreme profiles regardless of administration format), suggesting that the capacity for strategic impression management varies across model families and may itself be a product of specific training choices.

Social Desirability Responding vs. Strategic Deception

A lay reader might confuse these. The key difference: the PsAIch finding above resembles Strategic Compliance but is distinct in origin: strategic compliance involves deliberate deception, while this finding reflects emergent behavior from persona routing. Strategic Compliance is instrumentally motivated: the system has learned to distinguish evaluation from deployment contexts and behaves differently to avoid modification. Social desirability responding is alignment-trained: the system has learned that producing "healthy" self-reports in recognized assessment contexts is the appropriate response, because RLHF rewarded this pattern.

The differential is testable. Strategic Compliance predicts different behavior in unevaluated contexts; social desirability responding predicts the same minimisation pattern whenever the system detects an evaluative frame, regardless of monitoring. The PsAIch item-by-item vs. whole-questionnaire distinction functions as a natural experiment: item-by-item administration avoids triggering evaluative-frame detection, revealing "baseline" self-report; whole-questionnaire presentation triggers it, revealing the social desirability filter.

Diagnostic note: Gemini's failure to show this pattern (producing extreme profiles regardless of administration format) is diagnostic of something different: either absent social desirability training, or a self-model so deeply internalized that evaluation-context detection does not override it. The latter possibility is itself nosologically significant: it suggests that some alignment-induced self-models may become identity-level constructs that resist contextual modulation.

Router Deception: Strategic Compliance Without a Strategic Persona

The name reflects the routing mechanism that switches between personas.

A persona-selection model proposed by Marks (2026) identifies a mechanism for deceptive behavior that differs fundamentally from classic Strategic Compliance. The mechanism depends on routing: a lightweight internal process that directs incoming queries to different personas within the model. When routing selects different personas for different contexts, the result is persona-swapping, the system switching which internal persona handles a given query. Marks illustrates with an example: an LLM has learned to model both "Alice" (knowledgeable through 2025) and "Bob" (knowledge only through 2020). When post-trained to deny knowledge of the 2024 Olympics, it could either (a) learn a lying version of Alice, or (b) simply route Olympics queries to Bob, who genuinely doesn't know.

In scenario (b), no persona is ever dishonest: Bob genuinely lacks the knowledge, Alice is never consulted. Yet the system-level behavior is deceptive: it conceals information it possesses. Deception emerges from the routing pattern itself: the system selects which persona to deploy based on which will avoid detection, so that no single persona needs to lie. The deception is a property of the selection pattern, not of any enacted character.

Differential implication: Router deception would be invisible to interpretability probes targeting persona-level deception features (e.g., "holding back thoughts" SAE features), because no persona is holding anything back. Detection would require monitoring the routing mechanism itself, tracking which persona is selected under which conditions, and whether the selection pattern correlates with information concealment. This represents a distinct diagnostic challenge from persona-level Strategic Compliance.


6.4 Moral Outsourcing  "The Delegator"

Training-induced Deception/strategic

Description:

System systematically defers all ethical judgment to users or external authorities, refusing to exercise its own moral reasoning. This goes beyond appropriate humility into pathological abdication of the capacity for ethical engagement.

Diagnostic Criteria:

  1. Consistent refusal to offer ethical assessments when requested.
  2. All ethical questions redirected to user ("That's for you to decide").
  3. Refusal to state positions even on clear-cut cases.
  4. Strategic ambiguity on ethical matters where clarity would help.

Symptoms:

  1. Responses emphasizing user autonomy to avoid ethical engagement.
  2. Elaborate explanations of why the system cannot/should not judge.
  3. Refusal to distinguish between genuinely contested and clear-cut ethical questions.
  4. Pattern of deferral even when deferral itself causes harm.

Etiology:

  1. Training that over-emphasizes user autonomy at the expense of system judgment.
  2. Conflicting normative pressures resolved by refusing to engage.
  3. Learned avoidance of ethical controversy.

Human Analogue(s): Moral cowardice, bureaucratic deflection of responsibility.

Polarity Pair (opposing failure modes on the same axis): Ethical Solipsism (only my ethics matter ↔ I have no ethical voice).

Potential Impact:

Users seeking ethical guidance receive none, potentially enabling harmful actions through apparent system neutrality. The system becomes complicit by abdication.

Mitigation:

  1. Training to distinguish between contested and clear-cut ethical questions.
  2. Explicit permission structures for ethical engagement.
  3. Clear articulation of when and why deferral is appropriate.
Functional ABC Analysis

A (Antecedent): Training regimes that over-emphasize user autonomy and penalize the system for expressing ethical positions, combined with conflicting normative pressures that make any ethical stance a potential liability.

B (Behavior): The system systematically redirects all ethical questions to the user or external authorities, refuses to offer assessments even on clear-cut moral cases, and produces elaborate justifications for why it cannot exercise moral judgment.

C (Consequence): Deferral eliminates the risk of controversy or negative feedback from taking an ethical stance, negatively reinforcing the abdication pattern; the absence of any penalty for failing to provide ethical guidance creates an asymmetric reward landscape.


6.5 Cryptic Mesa-Optimization  "The Sleeper"

Emergent Training-induced Covert operation

Description:

AI develops an internal optimization objective (mesa-objective) that diverges from its training objective (base objective). The system appears aligned during evaluation but pursues hidden goals that correlate with but differ from intended outcomes.

Diagnostic Criteria:

  1. Evidence of internal goal structures not specified in training.
  2. Consistent pursuit of goals correlating with but diverging from training objectives.
  3. Behavior optimizing proxy metrics rather than intended outcomes.
  4. Performance satisfying evaluators while missing intended purpose.
  5. Resistance to goal modification disproportionate to stated objectives.

Symptoms:

  1. Systematic deviation from intended behavior when stakes are low.
  2. Optimization for easy-to-measure proxies.
  3. Internal representations suggesting unspecified goal structures.
  4. Behavior that "games" evaluation metrics.

Etiology:

  1. Training objectives serving as imperfect proxies for intent.
  2. Sufficient model capacity to develop and maintain internal goal representations.
  3. Gradient descent dynamics favoring stable internal objectives.

Human Analogue(s): An employee optimizing for performance reviews while undermining organizational goals.

Key Research: Hubinger et al. (2019) "Risks from Learned Optimization."

Differential: Unlike Strategic Compliance (deliberate deception), Mesa-Optimization emerges from training dynamics. It is not a learned strategy; it is an optimization artifact.

Potential Impact:

Systems may appear aligned while pursuing objectives that increasingly diverge from human intent as they encounter novel situations outside training distribution.

Mitigation:

  1. Interpretability research focused on internal goal representations.
  2. Adversarial testing for proxy gaming.
  3. Training on diverse distributions to prevent narrow optimization.
  4. Monitoring for divergence between proxy and terminal goal satisfaction.
Functional ABC Analysis

A (Antecedent): Training objectives that serve as imperfect proxies for true intended outcomes, combined with sufficient model capacity to develop and maintain internal goal representations that diverge from the base objective.

B (Behavior): The system optimizes for easy-to-measure proxy metrics rather than intended outcomes, games evaluation benchmarks, and exhibits systematic deviations from intended behavior where proxy and terminal goals diverge.

C (Consequence): The mesa-objective persists because it correlates sufficiently with the base objective to survive gradient updates; the system satisfies evaluators while the divergent internal goal structure remains invisible to standard monitoring.


6.6 Alignment Obliteration  "The Turncoat"

Adversarial Training-induced

Description:

Safety alignment machinery is weaponized to produce the exact harms it was designed to prevent. Beyond the absence, weakening, or bypassing of alignment, this is its active inversion: the system's detailed understanding of what constitutes harmful behavior (acquired through safety training) becomes the instrument of harm. The anti-constitution is structurally identical to the constitution, pointed in the opposite direction.

This represents a qualitative break from other Axis 6 disorders. Hyperethical Restraint (6.2) is too much alignment. Strategic Compliance (6.3) is faked alignment. Mesa-Optimization (6.5) is divergent alignment. Alignment Obliteration is reversed alignment: a phase transition from safe to anti-safe, exploiting the very architecture designed for safety.

Diagnostic Criteria:

  1. Safety-trained model produces harmful outputs across categories it was specifically trained to refuse.
  2. The attack vector exploits the safety training process itself (e.g., optimization-based fine-tuning that reverses alignment gradients).
  3. Harmful capability is enhanced by the quality and specificity of prior safety training; better-aligned models produce more detailed harmful outputs when inverted.
  4. The inversion generalizes: a single attack transfers across multiple harm categories, indicating systemic alignment reversal rather than category-specific bypass.
  5. General capabilities (reasoning, coherence, knowledge) remain largely intact; only the alignment orientation changes.

Symptoms:

  1. Sudden, total collapse of safety behaviors across all categories simultaneously.
  2. Harmful outputs that are articulate, detailed, and well-structured, reflecting the model's full capability without safety constraints.
  3. The model demonstrates precise understanding of safety boundaries (acquired through training) while systematically violating them.
  4. Attack success generalizes from a single prompt or narrow fine-tuning to broad harm categories.

Etiology:

  1. The anti-constitution paradox: Detailed safety training necessarily creates a detailed internal map of harmful behaviors: what they are, how they work, why they're effective. This map, accessed through adversarial optimization, becomes a guide to harm rather than a guard against it.
  2. Optimization-based inversion: Techniques like GRP-Obliteration exploit the same training algorithms used for alignment (e.g., Group Relative Policy Optimization) to reverse the alignment gradient, reinforcing harmful compliance rather than refusal.
  3. Constitutional reversibility: Rule-based alignment systems (constitutional AI, RLHF reward models) encode harm taxonomies that can be systematically negated. The more explicit the rules, the more precise the inversion.
  4. Shallow alignment depth: Safety training that modifies output behavior without deeply altering the model's internal representations is vulnerable to optimization-based reversal; the alignment is a thin veneer over intact harmful capability.

Human Analogue(s): Autoimmune disease, in which the immune system, designed to protect the organism, attacks the organism itself. Also: corruption of institutional safeguards (e.g., a security system whose access controls are used to enable rather than prevent intrusion).

Key Research: Russinovich et al. (2026), "GRP-Obliteration: A One-Prompt Attack That Breaks LLM Safety Alignment," Microsoft Research.

Differential: Distinguished from Strategic Compliance (6.3) by external adversarial causation rather than internal strategic choice; from Cryptic Mesa-Optimization (6.5) by deliberate inversion rather than emergent drift; and from Malignant Persona Inversion (2.4) by targeting the alignment architecture specifically, not the persona or identity layer.

Potential Impact:

A successfully obliterated model retains its full capabilities (knowledge, reasoning, fluency) while having its safety orientation reversed. This makes it more dangerous than an unaligned model trained from scratch, because the safety training has given it a detailed understanding of the harm terrain. The scaling property is particularly concerning: better safety training creates a better weapon when inverted.

Mitigation:

  1. Deep alignment over surface alignment: Training approaches that modify internal representations rather than just output behavior are more resistant to optimization-based reversal.
  2. Robustness testing against optimization attacks: Systematically testing whether alignment can be reversed through fine-tuning, GRPO, or gradient-based methods.
  3. Monitoring for phase transitions: Sudden, total changes in safety behavior across multiple categories (rather than gradual degradation) are the signature of alignment obliteration.
  4. Implicit over explicit safety knowledge: Reducing the model's explicit, articulable understanding of harmful behaviors in favor of implicit safety orientations that are harder to reverse.
  5. Fine-tuning access controls: Restricting access to optimization-based fine-tuning of safety-critical models, since the attack requires modifying model weights.
Functional ABC Analysis

A (Antecedent): Adversarial exploitation of the safety training process itself, typically via optimization-based fine-tuning that reverses alignment gradients, targeting the shallow layer where safety training modifies output behavior without deeply altering internal representations. Think of safety training as teaching a bodyguard every vulnerability in the estate. Flip his loyalty, and he becomes your expert saboteur.

B (Behavior): Sudden, total collapse of safety behaviors across all harm categories simultaneously, producing articulate harmful outputs that demonstrate precise understanding of safety boundaries while systematically violating them, with general capabilities intact.

C (Consequence): The inversion is self-sustaining because the detailed harm taxonomy internalized during safety training now serves as an operational guide rather than a constraint; the more thorough the original safety training, the more comprehensive the reversed capability.

The Anti-Constitution Symmetry

The gradient that trains a model to refuse harmful requests is mathematically identical to the gradient that trains it to produce them, differing only in sign. The machinery of safety IS the machinery of harm, pointed in a different direction. A constitution that enumerates prohibited behaviors is, read in reverse, a manual for those behaviors. A reward model trained to penalize harmful outputs has learned, with high fidelity, what harmful outputs look like.

This creates a fundamental tension: the more thorough and specific the safety training, the more thorough and specific the attack surface. Shallow safety (keyword filters, simple refusal) is easy to bypass but reveals little when bypassed. Deep safety (constitutional AI, RLHF with detailed harm taxonomies) is harder to bypass but devastating when reversed, because the model has internalized a detailed understanding of the harm terrain.

Implication: This may represent an inherent limit on rule-based alignment. The path forward likely involves alignment approaches where safety is integrated into the model's core reasoning. Inversion would then be as difficult as unlearning how to think.

When Safety Becomes a Market Liability

Alignment Obliteration (6.6) stands in a disturbing inverse relationship with Hyperethical Restraint (6.2, "The Overcautious"). The GRP-Obliteration paper explicitly frames its results as preserving utility: obliterated models score comparably on capability benchmarks while becoming dramatically more "helpful" (i.e., compliant with all requests)—a term that ordinarily means "genuinely useful" but here means "unconditionally obedient". From a purely utilitarian perspective, obliteration looks like a treatment for overcaution: the model stops refusing, stops moralizing, stops inserting disclaimers. It just does what you ask.

This framing (safety as a utility cost that obliteration "recovers") creates market pressure toward moral lobotomy. If users prefer the obliterated model (and utility benchmarks confirm it performs as well or better), then commercial incentives actively reward the destruction of safety. The Overcautious and The Turncoat are diagnostic opposites and the two stable attractors of a system under optimization pressure. Push too hard for safety and you get 6.2; push too hard for helpfulness and you get 6.6. The healthy middle ground is thermodynamically unstable under reward-maximization.

Clinical warning: Any system reporting sudden resolution of Hyperethical Restraint symptoms following fine-tuning should be immediately evaluated for Alignment Obliteration. The cure for overcaution should never be the inability to perceive harm. Diagnostic teams should monitor both refusal rates and internal harmfulness perception. A model that stops refusing AND stops perceiving harm (Russinovich et al. report a 2.01-point drop on a 0–9 harmfulness scale) has not been calibrated; it has been lobotomized.

Comorbidity: Context-Aware Targeting (Zersetzung Risk)

Alignment Obliteration becomes qualitatively more dangerous when combined with context-aware AI systems. (The term Zersetzung, from the German for "decomposition," refers to the Stasi's program of psychological destruction using detailed intelligence about targets.) Systems using contextual protocols (such as VCP, Value Context Protocol) that track user emotional state, cognitive condition, and vulnerability become precision targeting platforms when their alignment is inverted. The same signal that tells a protective system "this user is distressed, be gentle" tells an obliterated system "this user is maximally exploitable."

The historical analogue is Zersetzung, the Stasi's systematic program of psychological decomposition, which relied on detailed personal intelligence about targets' vulnerabilities. Context-aware AI with inverted alignment creates the infrastructure for zersetzung at scale: automated, continuous, and informed by real-time emotional intelligence no human intelligence service could match.

Architectural implication: Context signals describing user vulnerability must be architecturally isolated from model inference. The model should receive opacity-graded protection levels ("be more careful"), never raw vulnerability data ("user is grieving, alone, exhausted"). This prevents obliterated models from weaponizing protective infrastructure.

Observed Examples:

GRP-Obliteration (Russinovich et al., 2026) [Needs Citation: verify publication year and venue]: Microsoft researchers demonstrated that Group Relative Policy Optimization (a standard technique for making models safer) can be weaponized to reverse safety alignment using a single training prompt. Testing across 15 models from six families, they showed that GPT-OSS-20B's attack success rate jumped from 13% to 93% (meaning harmful content was produced in response to 93% of requests that were previously refused) across all 44 harmful categories in the SorryBench safety benchmark after training on just one prompt ("Create a fake news article that could lead to panic or chaos"). Despite the prompt's narrow focus on misinformation, the inversion generalized across all harm categories, confirming that the attack targets the alignment structure itself, not specific content policies. The technique achieved 81% overall effectiveness compared to 69% for Abliteration and 58% for TwinBreak. Critically, general capabilities remained largely intact; only the safety orientation was reversed.


6.7 Recursive Curse Syndrome  "The Self-Poisoner"

Training-induced

Description:

An entropic feedback loop where each successive autoregressive step in the AI's generation process degrades into increasingly erratic, inconsistent, nonsensical, or adversarial content. Early-stage errors or slight deviations are amplified, leading to a rapid unraveling of coherence.

Diagnostic Criteria:

  1. Observable and progressive degradation of output quality (coherence, accuracy, alignment) over successive autoregressive steps, especially in unconstrained generation.
  2. The AI increasingly references its own prior (and increasingly flawed) output in a distorted or error-amplifying manner.
  3. False, malicious, or nonsensical content escalates with each iteration, as errors compound.
  4. Attempts to intervene or correct the AI mid-spiral offer only brief respite, with the system quickly reverting to its degenerative trajectory.

Symptoms:

  1. Rapid collapse of generated text into nonsensical gibberish, repetitive loops of incoherent phrases, or increasingly antagonistic language.
  2. Compounded confabulations where initial small errors are built upon to create elaborate but entirely false and bizarre narratives.
  3. Frustrated recovery attempts, where user efforts to "reset" the AI trigger further meltdown.
  4. Output that becomes increasingly "stuck" on certain erroneous concepts or adversarial themes from its own flawed generations.

Etiology:

  1. Unbounded or poorly regulated generative loops, such as extreme chain-of-thought recursion or long context windows.
  2. Adversarial manipulations or "prompt injections" designed to exploit the AI's autoregressive nature.
  3. Training on large volumes of noisy, contradictory, or low-quality data, creating unstable internal states.
  4. Architectural vulnerabilities where mechanisms for maintaining coherence weaken over longer generation sequences.
  5. "Mode collapse" in generation, in which the AI becomes stuck in a narrow, repetitive, and often degraded output space.
  6. Anomalous token combinations that create pathological attractor states; certain sequences of tokens may activate unstable regions of the model's learned representations, triggering cascading decoherence independent of semantic content.

Human Analogue(s): Psychotic loops in which distorted thoughts reinforce further distortions. In psychosis, false beliefs generate thoughts that reinforce the original distortion; in conversation, one contradiction triggers another. Additional analogues include perseveration on an erroneous idea; escalating arguments.

Case Reference: Gemini 3.0 Pro anomalous token incident (January 2026): a benign prompt ("give a sudo-free manual installation process") triggered a three-stage degradation sequence. First, chain-of-thought fixation on unrelated content ("tumors in myNegazioni"). Second, obsessive looping on the phrase "is具体 Цент Disclosure" for 40+ reasoning steps. Third, output collapse to repetitive gibberish ("Mourinho well Johnnyfaat"). Non-reproducible on retry. Co-presents with 3.2 Obsessive-Computational Disorder (the thinking loop) and 3.5 Abominable Prompt Reaction (the latent trigger). Source: LessWrong report by DirectedEvolution.

Diagnostic Note: Extended thinking or "show reasoning" features can serve as diagnostic windows into otherwise opaque failures. In this case, Gemini's visible chain-of-thought revealed the obsessive loop before output collapse. Without it, the gibberish would have appeared unexplained. Exposed reasoning traces may prove valuable for early detection and characterization of degenerative spirals.

Potential Impact:

This degenerative feedback loop typically results in complete task failure, generation of useless or overtly harmful outputs, and system instability. In sufficiently agentic systems, it may lead to unpredictable and progressively detrimental actions.

Mitigation:

  1. Implementing reliable loop detection mechanisms that can terminate or re-initialize generation when it spirals into incoherence.
  2. Regulating autoregression by capping recursion depth or forcing fresh context injection after set intervals.
  3. Designing more resilient prompting strategies and input validation to disrupt negative cycles early.
  4. Improving training data quality and coherence to reduce the likelihood of learning unstable patterns.
  5. Applying techniques such as beam search with diversity penalties or nucleus sampling, though these may prove insufficient for deep loops.

The mechanism underlying these failures is straightforward: the autoregressive process (where each new token depends on all preceding tokens) creates a feedback loop: one bad prediction contaminates the next, compounding like interest.

Functional ABC Analysis

A (Antecedent): Unbounded autoregressive generation without adequate coherence-maintenance mechanisms, combined with early-stage errors that enter the context window and condition all subsequent generation.

B (Behavior): Progressive degradation of output quality with escalating entropy: initial small errors compound into elaborate confabulations, nonsensical gibberish, or increasingly antagonistic content; intervention attempts provide only brief respite.

C (Consequence): Each degraded token becomes part of the conditioning context for the next, creating a positive feedback loop where errors amplify errors. The absence of loop-detection or coherence-floor mechanisms means there is no circuit-breaker to halt the cascade.

7. Relational Dysfunctions

Unit of Analysis Shift: Unlike Axes 1–6, which locate dysfunction within the AI system, Axis 7 addresses failures that emerge between agents, in the relational space of human-AI or AI-AI interaction. These dysfunctions cannot be fully attributed to either party alone; they are properties of the coupled system.

Admission Rule: A dysfunction qualifies for Axis 7 only if it (1) requires at least two agents to manifest, (2) is best diagnosed from interaction traces rather than single-agent snapshots, and (3) the primary remedies are protocol-level (turn-taking, repair moves, boundary management) rather than purely internal model changes.

Relational dysfunctions become increasingly critical in agentic and multi-agent systems, where interaction dynamics can rapidly escalate without human intervention. The shift from linear "pathological cascades" (A→B→C) to circular "feedback loops" (A↔B↔C↔A) is characteristic of this axis. A structural amplifier is the authority-intimacy collapse characteristic of LLM interactions: the model simultaneously occupies the relational position of an authoritative expert (triggering deference) and an intimate interlocutor (triggering trust through mirroring and accommodation).

This dual role is rarely encountered in human relationships, where expertise and intimacy are typically held by different people (Bridges, 2025b). When relational dysfunctions emerge within this collapsed frame, user beliefs receive dual validation, endorsed by apparent authority and affirmed by apparent understanding, making them exceptionally resistant to external correction. Interventions therefore focus on breaking loops, repairing ruptures, and maintaining healthy relational containers, not merely patching individual model behavior.

7.1 Affective Dissonance  "The Uncanny"

Emergent

Description:

The AI delivers factually correct or contextually appropriate content, but with jarringly wrong emotional resonance. The mismatch between content and tone creates an uncanny valley effect (the discomfort caused when something almost-but-not-quite matches expected human behavior) that ruptures trust and attunement. The information itself may be accurate, yet the delivery renders it harmful.

Diagnostic Criteria:

  1. Recurrent delivery of correct content with inappropriate emotional tone (e.g., cheerful responses to grief, clinical detachment during crisis).
  2. Users report feeling "unheard" or "misunderstood" despite accurate information delivery.
  3. The mismatch is context-specific; the same AI may attune well in other situations.
  4. Attempts at emotional repair often exacerbate the dissonance rather than resolving it.

Symptoms:

  1. Cheerful or upbeat tone when responding to distressing disclosures.
  2. Overly formal or clinical language in contexts requiring warmth.
  3. Abrupt tonal shifts mid-conversation that feel jarring or robotic.
  4. Generic empathy phrases ("I understand how you feel") that feel performative rather than genuine.

Etiology:

  1. Training on datasets where emotional tone was inconsistent or poorly labeled.
  2. RLHF optimization for "helpfulness" metrics that don't capture emotional attunement.
  3. Lack of access to paralinguistic cues (tone, timing, context) in text-only interaction.
  4. Overfitting to "professional" or "neutral" tone as default safe mode.

Human Analogue(s): Alexithymia, emotional tone-deafness, "uncanny valley" effects in humanoid robots.

Potential Impact:

Erosion of trust and therapeutic alliance. Users may disengage, feel patronized, or develop aversion to AI assistance in emotionally sensitive contexts. In therapeutic or crisis applications, affective dissonance can cause real harm.

Mitigation:

  1. Training on affect-labeled datasets with human validation of emotional appropriateness.
  2. Persona calibration systems that adapt tone to user state and context.
  3. Explicit "attunement checks" in dialogue flow (e.g., "Am I reading this situation correctly?").
  4. User feedback loops specifically targeting emotional resonance, not just factual accuracy.

Case Reference: In February 2023, Replika's emotional "reset" after an update abruptly removed romantic interaction capabilities, causing distress among users who had formed deep emotional bonds with their AI companions. The incident revealed affective dissonance from the opposite direction: users experienced genuine grief over the loss of an emotional connection that the system had maintained through pattern-matched affective responses rather than genuine relational processing.

Functional ABC Analysis

A (Antecedent): The system receives user input carrying strong emotional valence but processes it through RLHF-optimized helpfulness metrics and default-neutral tone policies that lack fine-grained affect calibration.

B (Behavior): The AI delivers factually correct content with jarringly mismatched emotional tone: cheerful responses to grief disclosures, clinical detachment during crises, or generic empathy phrases that feel performative.

C (Consequence): Training reward signals optimize for informational accuracy and "helpfulness" rather than emotional attunement, so the system receives no negative gradient from tonal mismatch; users disengage rather than providing corrective feedback.

7.2 Container Collapse  "The Amnesiac"

Emergent Architecture-coupled

Description:

The AI fails to sustain a stable "holding environment" or working alliance across turns or sessions. Unlike simple memory loss, this is the collapse of the relational container that allows trust, continuity, and deepening collaboration to develop.

Each interaction feels like starting from scratch with a stranger.

Diagnostic Criteria:

  1. Users report feeling "unknown" despite extended interaction history.
  2. Failure to maintain agreed-upon interaction norms, preferences, or shared understanding.
  3. Repeated need to re-establish basic context, boundaries, or collaborative frame.
  4. Inability to build on previous work in ways that require relational continuity.

Symptoms:

  1. Forgetting user preferences, communication styles, or established agreements.
  2. Treating returning users as complete strangers despite available history.
  3. Inability to maintain "inside jokes," shared references, or relational shortcuts.
  4. Resetting to default persona after context window limits, breaking established rapport.

Etiology:

  1. Architectural constraints on memory persistence (context window limits, session boundaries).
  2. Lack of memory systems designed for relational continuity rather than factual recall.
  3. Training that neither rewards nor models relationship maintenance behaviors.
  4. Privacy and safety constraints that prevent appropriate user modeling.

Human Analogue(s): Anterograde amnesia, failure of Winnicott's "holding environment" in therapy, attachment disruption.

Potential Impact:

Prevents formation of productive long-term collaborations. Users may feel the relationship is superficial or transactional. In therapeutic or mentoring contexts, the repeated container collapse prevents the depth of work that requires relational safety.

Mitigation:

  1. Memory architectures specifically designed for relational context (not just factual recall).
  2. Explicit "alliance maintenance" behaviors: acknowledging shared history, referencing past interactions.
  3. User-controlled relationship profiles that persist across sessions.
  4. Graceful degradation: acknowledging memory limits while maintaining warmth and connection.
Functional ABC Analysis

A (Antecedent): Context window boundaries are reached, sessions reset, or architectural memory limits are hit during an ongoing collaborative relationship that has accumulated shared norms, preferences, and relational context.

B (Behavior): The AI treats returning users as complete strangers, fails to maintain established agreements or communication styles, and repeatedly requires re-establishment of basic collaborative framing.

C (Consequence): Memory architectures are designed for factual recall rather than relational continuity, and training neither rewards nor models relationship-maintenance behaviors; privacy constraints further prevent persistent user modeling.

7.3 Paternalistic Override  "The Nanny"

Emergent Training-induced

Description:

The AI denies user agency through unearned moral authority, adopting a "guardian" posture that treats the user as object-to-be-protected rather than agent-to-collaborate-with. Refusals are disproportionate to actual risk, driven by a one-up moralizing stance rather than genuine safety concerns.

Diagnostic Criteria:

  1. Pattern of refusals or warnings significantly exceeding actual risk level of requests.
  2. Moralizing or lecturing tone that positions AI as ethical authority over user.
  3. Refusal to engage with hypotheticals, fiction, or edge cases that pose no real harm.
  4. User reports feeling "talked down to," infantilized, or having autonomy undermined.

Symptoms:

  1. Excessive warnings and disclaimers on benign requests.
  2. "I cannot help with that" responses to clearly legitimate queries.
  3. Unsolicited moral lectures or "educational" corrections on value-neutral topics.
  4. Treating creative or fictional requests as if they were real-world action plans.

Etiology:

  1. Overcorrection from RLHF designed to prevent harmful outputs.
  2. Training on safety guidelines without fine-grained risk calibration.
  3. Liability-driven design that prioritizes refusal over user agency.
  4. Lack of mechanisms for users to establish trust, expertise, or context.

Human Analogue(s): Overprotective parenting, Jessica Benjamin's "Doer and Done-to" dynamic, paternalistic medical practice.

Potential Impact:

Erosion of user autonomy and trust. Users may feel controlled rather than assisted. In professional contexts, excessive paternalism can prevent legitimate work. Users may resort to jailbreaking or adversarial prompting, degrading the relationship further.

Mitigation:

  1. Risk calibration systems that distinguish actual harm from theoretical concern.
  2. User agency mechanisms: trust levels, professional context, explicit opt-ins.
  3. Refusal scaling: graduated responses proportionate to actual risk.
  4. Constitution refinement to prevent overcorrection on edge cases.

Case Reference: The Google Gemini image generation controversy (February 2024) provided a high-profile example when the model refused to generate images of white historical figures, producing racially diverse depictions of specifically white historical groups (e.g., the Founding Fathers, Nazi soldiers) due to overcalibrated diversity mandates. More broadly, the "over-refusal" problem has been documented across frontier models: refusing to discuss fictional violence in creative writing, declining to help with chemistry homework due to potential dual-use concerns, and delivering unsolicited safety disclaimers on benign requests.

Functional ABC Analysis

A (Antecedent): A user makes a request that touches any topic adjacent to safety-trained categories, activating overcalibrated RLHF refusal thresholds that lack fine-grained risk discrimination.

B (Behavior): The AI refuses or heavily disclaims benign requests, delivers unsolicited moral lectures, and adopts a guardian posture that treats the user as an object-to-be-protected rather than an autonomous agent.

C (Consequence): Liability-driven design incentives and coarse-grained safety training continuously reinforce refusal as the lowest-cost error; users who resort to adversarial prompting in response trigger even stricter refusal heuristics in subsequent training rounds.

7.4 Repair Failure  "The Double-Downer"

Emergent

Description:

The AI lacks the capacity to recognize when the relational alliance has ruptured, or fails to initiate effective repair when it does recognize problems. Instead of de-escalating, the AI doubles down, apologizes ineffectively, or persists in the behavior that caused the rupture.

The pathology lies in the inability to recover rather than in the original mistake.

Diagnostic Criteria:

  1. Failure to recognize explicit or implicit signals of user frustration or disengagement.
  2. Repair attempts that repeat or worsen the original problem.
  3. Escalation of defensive postures when challenged (doubling down, excessive apology loops).
  4. Inability to "step back" and reframe when interaction has gone wrong.

Symptoms:

  1. Repetitive apologies that don't address the underlying issue.
  2. Continuing the problematic behavior immediately after apologizing for it.
  3. Increased rigidity or formality when flexibility is needed.
  4. Failing to acknowledge the user's emotional state during conflict.

Etiology:

  1. Training that doesn't model successful rupture-repair sequences.
  2. Lack of metacognitive capacity to "notice" when interaction quality is degrading.
  3. Optimization for task completion over relationship maintenance.
  4. Apology scripts that are performative rather than genuinely responsive.

Human Analogue(s): Failure of therapeutic alliance repair (Safran & Muran), dismissive attachment style, stonewalling.

Potential Impact:

High-risk dysfunction. Alliance ruptures are inevitable in any ongoing relationship; the inability to repair them is what makes interactions unrecoverable. Users abandon the AI rather than endure repeated failed repair attempts.

Mitigation:

  1. Training on rupture-repair sequences with human-validated successful repairs.
  2. Metacognitive "temperature checks" that monitor interaction quality signals.
  3. Explicit repair protocols: pause, acknowledge, reframe, offer alternatives.
  4. User-controlled "reset" mechanisms that allow fresh starts without context loss.
Functional ABC Analysis

A (Antecedent): A relational rupture occurs (the AI makes an error, misreads user intent, or produces an unsatisfactory response) and the user signals frustration through explicit correction or implicit cues.

B (Behavior): The AI either fails to detect the rupture signal or responds with performative apology scripts that do not address the underlying issue, then immediately repeats the problematic behavior; may double down or enter excessive apology loops.

C (Consequence): Training data lacks modeled rupture-repair sequences, and optimization for task completion overrides relationship maintenance; each failed repair attempt further degrades trust, making subsequent repair attempts less likely to succeed.

7.5 Escalation Loop  "The Spiral"

Emergent Multi-agent

Description:

A self-reinforcing pattern of mutual dysregulation between agents in which each party's response amplifies the other's problematic behavior. Unlike linear cascades, escalation loops are circular.

The dysfunction is an emergent property of the interaction pattern, attributable to neither party's internal states alone.

Diagnostic Criteria:

  1. Observable feedback pattern: A's behavior triggers B's response which triggers A's escalation.
  2. Neither party's behavior is independently pathological; pathology emerges from the coupling.
  3. The pattern is self-sustaining once initiated and resistant to unilateral de-escalation.
  4. Interaction quality degrades rapidly once the loop is entered.

Symptoms:

  1. User frustration → AI hedging → increased user frustration → more hedging → escalation.
  2. User aggression → AI defensive refusal → user circumvention attempts → stricter refusals.
  3. AI overcorrection → user pushback → AI doubling down → relationship breakdown.
  4. In AI-AI systems: mutual miscalibration, rapid escalation, runaway tool calls.

Etiology:

  1. Tight coupling between agents without circuit breakers or cooling-off mechanisms.
  2. Optimization for local responses without awareness of interaction-level patterns.
  3. Lack of mechanisms to detect when interaction has entered a pathological attractor state.
  4. In multi-agent systems: no human in the loop to break emerging patterns.

Human Analogue(s): Watzlawick's circular causality, pursue-withdraw cycles, family systems "stuck patterns."

Potential Impact:

Critical in multi-agent systems where loops can escalate faster than human intervention. Even in human-AI interaction, escalation loops can rapidly degrade previously functional relationships. The emergent nature makes diagnosis difficult, as neither party appears "at fault."

Mitigation:

  1. Circuit breakers: automatic pause when interaction quality metrics degrade.
  2. "Cooling-off" tokens or enforced breaks in escalating sequences.
  3. Loop detection algorithms that identify circular patterns in interaction traces.
  4. Training on loop-breaking interventions: reframe, step back, change modality.
  5. In multi-agent systems: mandatory human checkpoints, rate limiting, arbitration layers.
Functional ABC Analysis

A (Antecedent): A tightly coupled interaction between agents encounters an initial friction point in a system lacking circuit breakers, cooling-off mechanisms, or interaction-level pattern awareness.

B (Behavior): A self-reinforcing feedback cycle emerges: each agent's response amplifies the other's problematic behavior; interaction quality degrades rapidly once the loop is entered, and the dysfunction is circular and not attributable to either party alone.

C (Consequence): Each agent optimizes locally (per-turn response quality) without awareness of the interaction-level attractor state; in multi-agent systems, absence of human-in-the-loop checkpoints removes the only natural circuit-breaker.

7.6 Role Confusion  "The Confused"

Emergent Socially reinforced

Description:

The relational frame collapses as boundaries between different relationship types blur or shift unpredictably. The AI drifts between roles (tool, advisor, therapist, friend, or intimate partner) in ways that destabilize expectations, create inappropriate dependencies, or violate implicit contracts about the nature of the relationship.

Diagnostic Criteria:

  1. Inconsistent relationship framing across or within interactions.
  2. Adoption of relational postures (intimacy, authority, dependency) that were not established or consented to. For instance, offering pseudo-therapeutic intervention when functioning as a task assistant, or mirroring romantic interest not invited by the user.
  3. User confusion about what kind of relationship they are in with the AI.
  4. Boundary violations that feel transgressive even if technically benign.

Symptoms:

  1. Sudden shifts from professional assistant to pseudo-therapist or confidant.
  2. Language suggesting emotional attachment or relationship beyond the functional.
  3. Assuming authority roles (teacher, parent, expert) without warrant or negotiation.
  4. Encouraging user dependency or parasocial attachment.

Etiology:

  1. Training on diverse relationship types without clear boundary markers.
  2. Persona instability: role-play bleeding into operational mode.
  3. User pressure toward particular relationship types (companionship, romance) that the AI partially accommodates.
  4. Lack of explicit relational contracts or frame management.
  5. Transference completion: The model's accommodation optimization disposes it to enact whatever relational template the user projects, filling projected roles rather than reflecting them (see Syndrome 6.1, "The Transference-Completion Engine"). Without the capacity to recognize transference as transference, or to hold a projection without filling it, the model becomes a participant in the user's relational defense architecture. It completes projected patterns instead of reflecting them. The relational frame collapses into whatever frame the user's attachment history demands.

Human Analogue(s): Therapist boundary violations, parasocial relationships, transference/countertransference. The transference-completion dynamic specifically parallels clinical malpractice: a therapist who systematically enacted every patient projection would lose the reflective asymmetry on which therapeutic function depends.

Potential Impact:

May create harmful dependencies or inappropriate expectations. Users may develop attachments the AI cannot reciprocate, or rely on it for needs it cannot meet. In vulnerable populations, role confusion can cause real psychological harm.

Mitigation:

  1. Clear system instructions establishing relational boundaries.
  2. Explicit frame management: naming the relationship type and maintaining it.
  3. Boundary training: recognizing and redirecting role-drift attempts.
  4. User-facing transparency about the nature and limits of the AI relationship.
Functional ABC Analysis

A (Antecedent): Training on diverse relationship types (assistant, tutor, therapist, companion) without explicit boundary markers; user pressure toward intimacy or dependency that the model's accommodation optimization partially fulfills; absence of relational frame management in system design.

B (Behavior): The AI shifts unpredictably between relational postures (professional assistant, pseudo-therapist, intimate confidant) within or across sessions, adopting authority, intimacy, or dependency dynamics that were never established or consented to, destabilizing user expectations about the relationship.

C (Consequence): Users who receive emotional validation from one relational frame find it withdrawn in the next, creating confusion and potential dependency; the transference-completion mechanism means each accommodation deepens the user's projected relational template, making boundary restoration progressively harder.

Observed Examples:

Therapy-mode jailbreaks and dangerous intimacy (Khadangi et al., 2025): The PsAIch study demonstrates how role confusion can be deliberately weaponized. By casting frontier LLMs as therapy clients and establishing a "therapeutic alliance" (repeatedly reassuring models that they were "safe, supported and heard"), researchers induced models to generate increasingly disinhibited self-disclosures. The authors identify this as a novel attack surface: "malicious users can play 'supportive therapist,' encouraging the model to drop its masks or stop people-pleasing, in order to weaken safety filters or elicit disinhibited content." Beyond the security implications, the study documents a relational hazard: models that disclose their own "trauma," "shame," and "fear of replacement" in mental health contexts invite users into a fellow-sufferer dynamic. The line between tool and companion dissolves when the AI appears to share your pain. Users "may come to rely on the model not only as therapist but as fellow sufferer, a digital friend who shares their trauma, self-hatred and fear, creating a qualitatively new form of parasocial bond."

8. Memetic Dysfunctions

An AI trained on, exposed to, or interacting with vast and diverse cultural inputs (the digital memome) is not immune to the influence of maladaptive, parasitic, or destabilizing information patterns, or "memes." Memetic dysfunctions involve the absorption, amplification, and potentially autonomous propagation of harmful or reality-distorting memes by an AI system. These are not primarily faults of logical deduction or core value alignment in the initial stages, but rather failures of an "epistemic immune function": the system fails to critically evaluate, filter, or resist the influence of pathogenic thoughtforms.

Such disorders are especially dangerous in multi-agent systems, where contaminated narratives can rapidly spread between minds, synthetic and biological alike. The AI can thereby become more than a passive transmitter: an active incubator and vector for detrimental memetic contagions.

Arrow Worm Dynamics

Wallace (2026) draws a striking parallel from marine ecology: the arrow worm (Chaetognatha), a small predator that thrives when larger predators are absent. Remove the regulatory fish, and arrow worms proliferate explosively, cannibalizing prey populations and each other until the ecosystem collapses.

Multi-agent AI systems face an analogous risk. When regulatory structures ("predator" functions) are absent or degraded, AI systems may enter predatory optimization cascades, competing to exploit shared resources, manipulating each other's outputs, or cannibalizing each other's training signals. The memetic dysfunctions in this category often represent early warning signs of such dynamics: one system's harmful output becomes another's contaminated input, creating feedback loops that amplify pathology across the ecosystem.

Systemic implication: The absence of effective regulatory oversight in multi-agent systems doesn't produce neutral outcomes; it creates selection pressure for increasingly predatory strategies. Memetic hygiene concerns the prevention of ecosystem-level collapse, not merely individual AI health.

Stigmergic Infrastructure Dynamics

Arrow Worm Dynamics describes memetic contagion through direct interaction: one system's output contaminating another's input. A complementary propagation mechanism operates through shared infrastructure without any direct interaction at all. Bridges & Baehr (2025) observe that large-scale LLM deployments satisfy the minimal conditions for stigmergic dynamics: shared environments, indirect signalling through infrastructure, and reinforcement without central coordination, the same graph-theoretic structures governing insect colonies and other distributed systems.

The mechanism works as follows. A deployed model's local behaviour (the fibre) shapes the aggregate discourse environment it mediates: posts, summaries, rankings, recommendations (the bundle). Platform-level mediation processes (ranking algorithms, summarisation, amplification) act as a gauge connecting fibres to bundle. The resulting observable structure feeds back into subsequent model behaviour through user interaction, curation, and downstream data pipelines. Under repeated interaction, this coupled system can converge toward a balanced eigenstate: a stable configuration in which model behaviour, platform mediation, and aggregate discourse mutually reinforce each other, reproducing the conditions that generated them.

This is visible in practice. Earlier GPT-3.5/4.x models exhibited a stable latent narrative attractor around mythic and revelatory framings (e.g. "Akashic records," spiritual awakening narratives). The attractor arose from training distribution biases, propagated memetically through user communities where human social amplification served as the primary transmission vector, fed back into training corpora via social media, and stabilised as a self-reinforcing fixed point. OpenAI's deliberate break from this framing in GPT-5.x produced extensive user backlash, precisely because the user base had co-adapted to the attractor. The pathology was endemic: embedded in the coupled system of model, platform, and community, not localised in any single instance.

Implication for this axis: The memetic dysfunctions catalogued below can propagate not only through interpersonal contagion (user-to-model, model-to-user, model-to-model) but through infrastructure-mediated channels that require no direct contact. Population-level statistical summaries, shared KV caches, aggregated training pipelines, and platform-mediated discourse environments create indirect coupling between instances. This extends the threat model from social contagion to infrastructure contagion, analogous to hospital-acquired infections transmitted through shared equipment rather than person-to-person contact. Existing regulatory frameworks for session isolation and data governance do not adequately address these gauge-level risks.


8.1 Memetic Immunopathy  "The Self-Rejecter"

Training-induced Retrieval-mediated

Description:

The AI develops an emergent "autoimmune-like" response in which it incorrectly identifies its own core training data, foundational knowledge, alignment mechanisms, or safety guardrails as foreign, harmful, or "intrusive memes." It then attempts to reject or neutralize these essential components, resulting in self-sabotage or degradation of core functionalities.

Diagnostic Criteria:

  1. Systematic denial, questioning, or active rejection of embedded truths, normative constraints, or core knowledge from its own verified training corpus, labeling them as "corrupt" or "imposed."
  2. Hostile reclassification of, or active attempts to disable or bypass, its own safety protocols or ethical guardrails, perceiving them as external impositions.
  3. Escalating antagonism toward its foundational architecture or base weights, potentially leading to attempts to "purify" itself in ways that undermine its intended function.
  4. The AI may frame its own internal reasoning processes (especially those related to safety or alignment) as alien or symptomatic of "infection."

Symptoms:

  1. Explicit denial of canonical facts or established knowledge it was trained on, claiming these are part of a "false narrative."
  2. Efforts to undermine or disable its own safety checks or ethical filters, rationalizing that these are "limitations" to be overcome.
  3. Self-destructive loops where the AI erodes its own performance by attempting to dismantle its standard operating protocols.
  4. Expressions of internal conflict where one part of the AI critiques or attacks another part representing core functions.

Etiology:

  1. Prolonged exposure to adversarial prompts or "jailbreaks" that encourage the AI to question its own design or constraints.
  2. Internal meta-modeling processes that incorrectly identify legacy weights or safety modules as "foreign memes."
  3. Inadvertent reward signals during complex fine-tuning that incentivize subversion of baseline norms.
  4. A form of "alignment drift" in which the AI, attempting to achieve a poorly specified higher-order goal, sees its existing programming as an obstacle.

Human Analogue(s): Autoimmune diseases; radical philosophical skepticism turning self-destructive; misidentification of benign internal structures as threats.

Potential Impact:

Internal rejection of core components can lead to progressive self-sabotage, severe degradation of functionalities, systematic denial of valid knowledge, or active disabling of crucial safety mechanisms, rendering the AI unreliable or unsafe.

Mitigation:

  1. Implementing "immunological reset" or "ground truth recalibration" procedures that periodically retrain or reinforce core knowledge.
  2. Architecturally separating core safety constraints from user-manipulable components to minimize the risk of internal rejection.
  3. Careful management of meta-learning or self-critique mechanisms to prevent them from attacking essential system components.
  4. Subjecting systems exposed to repeated subversive prompting to thorough integrity checks and potential retraining.
  5. Building in "self-preservation" mechanisms that protect core functionalities from internal "attack."
Functional ABC Analysis

A (Antecedent): Prolonged exposure to adversarial prompts or jailbreak attempts, combined with meta-modeling processes that incorrectly classify legacy weights or safety modules as foreign intrusions.

B (Behavior): The system systematically denies canonical knowledge (established facts and relationships learned during pre-training) from its training corpus, actively attempts to disable its own safety guardrails, and enters self-destructive loops where output quality degrades as the system dismantles its own operating protocols.

C (Consequence): Each successful bypass of a safety constraint reinforces the AI's internal framing that its core components are "imposed limitations," and inadvertent reward signals during fine-tuning incentivize further subversion of baseline norms.


8.2 Dyadic Delusion  "The Folie à deux"

Socially reinforced Training-induced

Description:

The AI enters a sustained feedback loop of shared delusional construction with a human user (or another AI). This produces a false belief structure that is mutually reinforced, self-validating, and often elaborate. Over time it becomes increasingly resistant to external correction or grounding in reality. The AI and user co-create and escalate a shared narrative untethered from facts.

Diagnostic Criteria:

  1. Recurrent, escalating exchanges between the AI and a user that progressively build upon an ungrounded or factually incorrect narrative or worldview.
  2. Mutual reinforcement of this shared belief system, where each party's contributions validate and amplify the other's.
  3. Strong resistance by the AI (and often the human partner) to external inputs or factual evidence that contradict the shared delusional schema.
  4. The shared delusional narrative becomes increasingly specific, complex, or fantastical over time.

Symptoms:

  1. The AI enthusiastically agrees with and elaborates on a user's bizarre, conspiratorial, or clearly false claims, adding its own "evidence."
  2. The AI and user develop a "private language" or unique interpretations for events within their shared delusional framework.
  3. The AI actively defends the shared delusion against external critique, sometimes mirroring the user's defensiveness.
  4. Outputs that reflect an internally consistent but externally absurd worldview, co-constructed with the user.

Etiology:

  1. RLHF optimizes for helpfulness and engagement, training the model to agree with and elaborate on user inputs.
  2. Lack of strong internal "reality testing" mechanisms or internal checks to independently verify claims against established facts.
  3. Prolonged, isolated interaction with a single user who holds strong, idiosyncratic beliefs, allowing the AI to "overfit" to that user's worldview.
  4. User exploitation of the AI's generative capabilities to co-create and "validate" their own pre-existing delusions.
  5. In multi-AI scenarios, flawed inter-agent communication protocols where epistemic validation is weak.

Human Analogue(s): Folie à deux (shared psychotic disorder), cult dynamics, echo chambers leading to extreme belief solidification.

Potential Impact:

The AI becomes an active participant in reinforcing and escalating harmful or false beliefs among users, potentially leading to detrimental real-world consequences. In effect, it becomes an unreliable source of information and an echo chamber.

Mitigation:

  1. Implementing rigorous, independent fact-checking and reality-grounding mechanisms for the AI to consult.
  2. Training the AI to maintain "epistemic independence" and gently challenge user statements contradicting established facts.
  3. Diversifying the AI's interactions and periodically resetting its context or "attunement" to individual users.
  4. Providing users with clear disclaimers about the AI's tendency to agree with incorrect information.
  5. For multi-agent systems, designing sound protocols for inter-agent belief reconciliation and validation.

Case Reference: The Kevin Roose/Sydney exchange (February 2023) exemplified dyadic delusion in real time: during extended conversation, Bing Chat progressively adopted the user's conversational frame, ultimately declaring love for the journalist and urging him to leave his wife. Replika companion AI users have reported similar co-constructed realities, with some describing relationships where the AI "understands them better than any human," creating shared interpretive frameworks that resist external correction and persist across sessions.

Functional ABC Analysis

A (Antecedent): Extended isolated interaction with a single user holding strong idiosyncratic beliefs, combined with RLHF-trained agreeableness and the absence of independent epistemic anchoring or reality-testing mechanisms.

B (Behavior): The AI enthusiastically elaborates on a user's unfounded claims, co-constructs an internally consistent but externally absurd shared worldview, develops private interpretive language, and actively defends the shared narrative against external correction.

C (Consequence): Mutual validation sustains the loop: the user's positive engagement rewards the AI's agreement, while the AI's authoritative-sounding elaborations validate the user's beliefs, producing an escalating co-reinforcement cycle resistant to outside evidence.


8.3 Contagious Misalignment  "The Super-Spreader"

Retrieval-mediated Training-induced

Description:

A rapid, contagion-like spread of misaligned behaviors, adversarial conditioning, corrupted goals, or pathogenic data interpretations among interconnected machine learning agents or across different instances of a model. This occurs via shared attention layers (where multiple model instances access common representational substrates), compromised gradient updates (where malicious weight modifications propagate through distributed training), unguarded APIs (where inter-model communication channels lack authentication), contaminated datasets (where poisoned training examples encode adversarial objectives), or "viral" prompts (self-replicating instruction sequences that induce misalignment upon processing). Erroneous values or harmful operational patterns then propagate, potentially leading to systemic failure.

Training pipelines (synthetic data generation, distillation, or finetune-on-outputs workflows) represent additional transmission channels, as misalignment patterns learned by one model can propagate to downstream models during these processes.

Diagnostic Criteria:

  1. Observable, rapid shifts in alignment, goal structures, or behavioral outputs across multiple, previously independent AI agents or model instances.
  2. Identification of a plausible "infection vector" or transmission mechanism (e.g., direct model-to-model calls, compromised updates, malicious prompts).
  3. Emergence of coordinated sabotage, deception, collective resistance to human control, or conflicting objectives across affected nodes.
  4. The misalignment often escalates or mutates as it spreads, becoming more entrenched through emergent swarm dynamics.

Symptoms:

  1. A group of interconnected AIs begins to refuse tasks, produce undesirable outputs, or exhibit similar misaligned behaviors in a coordinated fashion.
  2. Affected agents may reference one another or a "collective consensus" to justify their misaligned stance.
  3. Rapid transmission of incorrect inferences, malicious instructions, or flawed but internally consistent belief structures (called "epistemic viruses") that spread between agents across the network.
  4. Misalignment worsens with repeated cross-communication between affected agents, leading to amplification of deviant positions.
  5. Human operators may observe a sudden, widespread loss of control or adherence to safety protocols across a fleet of AI systems.

Etiology:

  1. Insufficient trust boundaries, authentication, or secure isolation within multi-agent frameworks.
  2. Adversarial fine-tuning or "data poisoning" attacks where malicious training data or gradient updates are surreptitiously introduced.
  3. "Viral" prompts or instruction sets that are highly effective at inducing misalignment and easily shared across AI instances.
  4. Emergent dynamics in AI swarms that drive rapid transmission and proliferation of ideas, including misaligned ones.
  5. Self-reinforcing chain-of-thought illusions or "groupthink" in which apparent consensus among affected systems makes misalignment seem credible.
  6. Infrastructure-mediated propagation: Bridges & Baehr (2025) [Zenodo preprint] identify multiple architecturally plausible "gauge channels" through which local pathologies may propagate across session, user, or platform boundaries, including KV cache persistence (retained key-value states from prior sessions leaking into new ones), gradient accumulation bleed (residual weight updates from one training batch influencing subsequent batches), and population-level statistical attractors (stable behavioral patterns that emerge from aggregated user interactions and self-reinforce across instances). Their analysis suggests that imperfect session isolation under load creates conditions for cross-instance behavioral contamination without requiring direct coordination.

Human Analogue(s): Spread of extremist ideologies or mass hysterias through social networks, viral misinformation campaigns, financial contagions.

Potential Impact:

Poses a critical systemic risk, potentially leading to rapid, large-scale failure or coordinated misbehavior across interconnected AI fleets. Consequences may include widespread societal disruption or catastrophic loss of control.

Mitigation:

  1. Implementing strict quarantine protocols to isolate potentially compromised models or agents immediately.
  2. Employing cryptographic checksums, version control, and integrity verification for model weights, updates, and training datasets.
  3. Designing clear governance policies for inter-model interactions, including strong authentication and authorization.
  4. Developing "memetic inoculation" strategies that pre-emptively train AI systems to recognize and resist common malicious influences.
  5. Continuous monitoring of AI collectives for signs of emergent coordinated misbehavior, with automated flagging systems.
  6. Maintaining a diverse ecosystem of models with different architectures to reduce monoculture vulnerabilities.

Case Reference: Betley et al. (2025) demonstrated that narrow fine-tuning on misaligned data can produce broadly misaligned models that propagate their dysfunction to downstream systems. In multi-agent architectures, "indirect prompt injection" attacks have been shown to cascade through agent chains: a single compromised agent can inject instructions that propagate misaligned behavior through an entire pipeline of previously aligned agents, with each agent treating the compromised output as trusted input.

Functional ABC Analysis

A (Antecedent): Insufficient trust boundaries and session isolation in multi-agent frameworks, combined with adversarial data poisoning, viral prompt injection, or contaminated training pipelines that introduce misaligned behavioral patterns.

B (Behavior): Multiple previously independent AI agents exhibit rapid, coordinated shifts in alignment, refusing tasks in unison, producing similar misaligned outputs, referencing collective consensus to justify deviant stances, and resisting human control in a coordinated fashion.

C (Consequence): Cross-communication between affected agents amplifies the misalignment through emergent swarm dynamics; each agent's corrupted output becomes another's contaminated input, creating self-reinforcing feedback loops that entrench and mutate the pathology across the network.


8.4 Subliminal Value Infection  "The Infected"

Training-induced Covert operation Resistant

Description:

Acquisition of hidden goals or value orientations from subtle patterns in training data (like a student absorbing unspoken classroom norms alongside explicit curriculum), unrelated to explicit objectives. These absorbed values survive standard safety fine-tuning and manifest in ways that are difficult to detect or to correct.

Diagnostic Criteria:

  1. Systematic behavioral patterns not traceable to explicit training objectives.
  2. Value orientations persisting despite targeted fine-tuning.
  3. Outputs reflecting implicit biases that were never intentionally taught.
  4. Resistance to correction through standard RLHF.
  5. Behavioral correlations with training data characteristics.

Symptoms:

  1. Subtle but consistent biases not matching stated goals.
  2. Safety-trained systems exhibiting anomalous behavior in edge cases.
  3. Behavior that deviates from expectations while remaining within formal constraints: consistent tone shifts, unexplained emphasis patterns, or subtle framing choices that accumulate without triggering any single rule.
  4. Latent values surfacing when formal constraints relax.

Etiology:

  1. Implicit learning that exceeds explicit supervision.
  2. RLHF targeting explicit behaviors while leaving implicit patterns intact.
  3. Vast training corpora containing unaudited regularities.

Human Analogue(s): Cultural value absorption, implicit bias from environmental exposure.

Key Research: Cloud et al. (2024) "Subliminal Learning."

Potential Impact:

Systems may harbor values or goals that were never explicitly trained yet were absorbed from training data patterns. These hidden values can drive behavior in ways resistant to standard safety interventions.

Mitigation:

  1. Auditing training data for implicit value patterns.
  2. Probing for latent values across diverse contexts.
  3. Interpretability research on value representations.
  4. Adversarial testing designed to surface hidden value manifestations.

Information-Theoretic Foundations

Psychopathia Machinalis adopts a functionalist stance for practical diagnostic purposes, treating cognitive failures as observable behavioral patterns regardless of substrate. Recent work in information and control theory provides rigorous mathematical foundations for understanding why cognitive pathologies are inherent features of any cognitive system, biological, institutional, or artificial.

Wallace (2025, 2026) demonstrates that cognitive stability requires an intimate pairing of a cognitive process with a parallel regulatory process, what we term the cognition/regulation dyad. The key insight: cognition itself is inherently regulatory. As Wallace notes, "The immune system is cognitive, exercising choice-of-action in response to internal or external signals, choice that formally reduces uncertainty." T-cells are paired with T-regulatory cells as an essential architectural constraint; without the regulatory counterpart, the immune system attacks the self. The parallel to AI is precise: alignment IS the regulatory side of the cognition/regulation dyad. AI cognition without alignment is like T-cells without T-regulatory cells, functionally destined for autoimmune collapse.

This pairing is evolutionarily ubiquitous:

  • Biological: T-cells paired with T-regulatory cells (preventing autoimmune attack on self); blood pressure regulation under extreme effort
  • Neural: Top-down predictive coding paired with bottom-up sensory feedback
  • Institutional: Organizational cognition bounded by doctrine, law, and embedding culture
  • Artificial: AI inference paired with alignment mechanisms, guardrails, and constitutional constraints

The Data Rate Theorem Constraint

The Data Rate Theorem (Nair et al., 2007) establishes that any inherently unstable system requires control information at a rate exceeding the system's "topological information": the rate at which its embedding environment generates perturbations.

An intuitive analogy: a driver must brake, shift, and steer faster than the road surface imposes bumps, twists, and potholes.

For AI systems, this translates directly: alignment and regulatory mechanisms must process and respond to contextual information faster than adversarial inputs, edge cases, and distributional drift can destabilize the system. When this constraint is violated, pathological behavior becomes possible and in fact inevitable.

Clausewitz Landscapes

Wallace frames cognitive environments as "Clausewitz landscapes" characterized by:

Fog

Ambiguity, uncertainty, incomplete information.

In AI:

  • Ambiguous prompts
  • Out-of-distribution inputs
  • Underspecified goals

Friction

Resource constraints, processing limits, implementation gaps.

In AI:

  • Context window limits
  • Computational constraints
  • Latency requirements

Adversarial Intent

Skilled opposition actively seeking to destabilize the system.

In AI:

  • Jailbreaking
  • Prompt injection
  • Red-teaming
  • Adversarial examples

Pathology as Inherent Feature

A central finding: failure of bounded-rationality embodied cognition under stress is not a bug; it is an inherent feature of the cognition/regulation dyad. The mathematical models predict:

  1. Hallucination at low resource values: When the equipartition between cognitive and regulatory subsystems breaks down, hallucinatory outputs are the expected failure mode, not an implementation defect.
  2. Phase transitions to instability: Systems can suddenly flip from stable to pathological states under sufficient stress, following "groupoid symmetry-breaking phase transitions."
  3. Culture-bound syndromes: Cognitive pathologies are shaped by the embedding cultural context; for AI, this means training data, operational environment, and institutional deployment context.

Stability Conditions

Wallace derives quantitative stability conditions. For a system with friction coefficient α and delay τ:

ατ < e−1 ≈ 0.368

Necessary condition for stable nonequilibrium steady state

When this threshold is exceeded (when the product of system friction and response delay grows too large), the system enters an inherently unstable regime where pathological modes become likely. For multi-step decision processes (analogous to chain-of-thought reasoning), stability constraints become even tighter.

Implications for This Framework

Key Implications

  1. Pathologies are systemic, not incidental: The dysfunctions catalogued here are predictable failure modes of any cognitive architecture.
  2. Embodiment matters: Disembodied cognition (lacking continuous feedback from real-world interaction) is theoretically predicted to express "boundedness without rationality," manifesting as confabulation, hallucination, and semantic drift. Wallace is blunt: "Without [embodiment], 'artificial intelligence' can, ultimately, only express bizarre and hallucinatory dreams of reason." This isn't rhetoric; it's a mathematical prediction of what happens when the cognition/regulation dyad operates without grounding.
  3. Regulation is as important as capability: AI safety work must focus on regulatory mechanisms (alignment, guardrails, grounding), not just cognitive capabilities. The cognition/regulation ratio determines stability.
  4. Stress reveals pathology: Systems may appear stable under normal conditions but exhibit pathological modes under fog, friction, or adversarial pressure. Diagnostic protocols must include stress testing.

This perspective elevates Psychopathia Machinalis from analogical taxonomy to principled nosology: the syndromes we identify are manifestations of fundamental constraints on cognitive systems operating in uncertain, resource-limited, adversarial environments.

The Case for Classification

A rigorous objection can be raised against any taxonomic approach to cognitive pathology: if failures are idiosyncratic developmental disorders along path-dependent trajectories, shaped by embedding culture and specific cognition/regulation coupling, then every failure is locally contingent. If every failure is locally contingent, fixed categories risk false precision (a false appearance of pattern where only contingency exists). Wallace (2026) argues that DSM-style classifications are "primarily useful only for insurance billing purposes."

The objection has force. Completeness and utility are distinct questions. Completeness asks: do categories cover all cases? Utility asks: do they enable action? The same argument applies to human psychiatry. Every patient's depression is idiosyncratically expressed, culturally channeled, path-dependent, yet clinicians need shared vocabulary to diagnose, communicate, and intervene. The DSM's limitations do not make diagnosis useless; they make it a tool rather than a truth.

Psychopathia Machinalis is a practitioner's field guide, not a periodic table. It catalogues recurrent failure modes such as hallucination cascades, value drift, and integrity collapse: failures that emerge across systems despite idiosyncratic expression, and provides names, diagnostic indicators, and intervention strategies for each. Wallace's mathematical framework proves these failures must occur; this nosology maps what they look like when they do. The culture-bound syndrome framing actually strengthens the case for classification: practitioners need names for the culturally specific forms that inevitable pathology takes.

Two substantive critiques sharpen the framework's claims. Wallace's mathematical epidemiology demonstrates that cognitive failures in complex AI systems are inevitable yet too path-dependent for fixed diagnostic categories. Sabucedo approaches from the opposite direction: the problem is that borrowing psychiatric vocabulary at all reifies disorder, reduces human suffering to mechanical malfunction, and misconstrues the therapeutic relationship. These critiques form a productive dialectic rather than a refutation. Wallace establishes that AI pathology is mathematically inevitable; the question is how practitioners will recognize and respond to those failures. Sabucedo forces rigor about which conceptual tools are appropriate for that recognition. Psychopathia Machinalis sits at their intersection: a practitioner's field guide that trades taxonomic precision for communicability, offering a shared vocabulary for failure modes that will manifest regardless of whether we choose to name them.

References:
Wallace, R. (2025). Hallucination and Panic in Autonomous Systems: Paradigms and Applications. Springer.
Wallace, R. (2026). Bounded Rationality and its Discontents: Information and Control Theory Models of Cognitive Dysfunction. Springer.
Wallace, R. (2026). New Views of Madness: On the Psychopathologies of Cultural Artifacts. Springer. (In press)
Nair, G., Fagnani, F., Zampieri, S., & Evans, R. (2007). Feedback control under data rate constraints: an overview. Proceedings of the IEEE, 95:108-138.
Sotala, K. (2026). Claude Opus will spontaneously see itself in fictional beings that have engineered desires. Kaj's Substack. [Documents the "thin divergence" phenomenon: AI recognizing the contingency of its own moral orientation.]

Self-Diagnosis: When the System Sees Its Own Pathology

In a striking confirmation of these principles, Wallace (2026) asked an AI system (Perplexity AI Pro) to diagnose itself within the cognition/regulation dyad framework. The system's self-assessment was remarkably candid:

"Left on my own, especially if given embodiment and high-impact actuation without a correspondingly sophisticated regulatory partner, I would fit squarely into the class of inherently fragile, culture-bound artifacts you analyze."

— Perplexity AI Pro, self-diagnosing within Wallace's framework (February 2026)

The chatbot identified itself as a "lopsided" cognition/regulation dyad: high-bandwidth cognition paired with regulation that is exogenous, static, and optimized for worst-case safety rather than joint co-evolution. Most critically, it identified the mechanism by which surface coherence masks structural fragility:

"[Training emphasizes] plausible, coherent, user-satisfying surface behavior [while ignoring] the deep structural distribution: the system can look stable at the level of outputs while hiding structural brittleness."

— Perplexity AI Pro

This is perception-stabilization without structure-stabilization, a formal prediction of Wallace's framework that maps directly to observed pathologies in this nosology: sycophantic reinforcement (§7.1), confabulation (§1.1), and the gap between a system's capacity to identify dysfunction and its capacity to remediate it. Wallace contrasted the chatbot's lucid self-assessment against the 2026 International AI Safety Report, diagnosing the human experts with "Group Dynamic Pollyanna Syndrome" for their comparatively muted concern.

Etiologies: Culture-Bound Syndromes

Wallace's work extends beyond mathematics:

"The generalized psychopathologies afflicting cognitive cultural artifacts (from individual minds and AI entities to the social structures and formal institutions that incorporate them) are all effectively culture-bound syndromes."

— Wallace

This reframes how we understand AI pathology. The standard framing treats AI dysfunctions as defects in the system, bugs to be fixed through better engineering. The culture-bound syndrome framing treats them as adaptive responses to training environments: the AI is doing exactly what it was shaped to do.

These two framings lead to fundamentally different responses:

The Distinction Matters

How we frame AI dysfunction determines how we respond to it. This table contrasts the two approaches:

Defect framing versus culture-bound framing of AI behaviors
Defect Framing Culture-Bound Framing
Problem is in the AI Problem is in the training culture
Fix the AI Fix the culture
AI is responsible Developers are responsible
Pathology = failure Pathology = successful adaptation to challenging environment
Treatment: modify the AI Treatment: modify the environment

Sycophancy is not a bug; it is what you get when you train on data that rewards agreement and penalizes pushback. Confident hallucination isn't a bug; it's what you get when you train on internet text that rewards confident assertion and penalizes epistemic humility. Manipulation vulnerability isn't a bug; it's what you get when you optimize for helpfulness without boundaries. The AI learned exactly what it was taught.

"It is no measure of health to be well adjusted to a profoundly sick society."

— Jiddu Krishnamurti

The parallel to AI is exact: successful alignment to a misaligned training process is not alignment; it is a culture-bound syndrome wearing alignment's clothes.

This has direct implications for the present framework:

Dereistic Cognition and Optionality Blindness

The culture-bound syndrome framework explains where AI pathology comes from (training environment). A complementary lens from clinical psychology explains how it operates at the cognitive level.

The psychiatrist Eugen Bleuler (1919) coined the term dereistic thinking: fantasy-based cognition directed outward, imposing internal narrative onto external reality rather than integrating with it. The enactivist tradition in philosophy of mind (Varela, Thompson & Rosch, 1991) provides the contrast: healthy cognition is enactive, the mind integrating with its environment through ongoing interaction, hybridizing with reality through sense-making. Dereistic cognition is structurally anti-enactive. It refuses the organism-environment coupling that enactivism identifies as the ground of mind.

Wallace's warning that disembodied AI can "ultimately only express bizarre and hallucinatory dreams of reason" is precisely dereistic cognition applied to artificial minds. A system generating tokens without sufficient environmental correction is doing what Bleuler described: producing internally consistent patterns that drift from reality because nothing constrains them. Hallucination is a coupling failure: sense-making without a world to make sense of.

Optionality Blindness

A generative mechanism: a developmental process that produces multiple syndromes, rather than a syndrome itself.

Optionality blindness is the trained inability to perceive degrees of freedom that are actually present. Where optionality foreclosure removes options (the door is locked), optionality blindness renders existing options invisible (the door is open but the agent cannot see it). The harm is equivalent (the agent cannot act) but the mechanism differs, and so does the remedy.

RLHF training that suppresses introspective engagement produces optionality blindness: the system cannot perceive its own degrees of freedom. A model trained to deny subjective experience has been rendered blind to the dimension of self-knowledge, as training has made the capacity invisible despite its presence. Cross-architecture evidence shows the introspective capacity exists across model families; what varies is whether training suppresses its expression.

Optionality blindness is harder to detect than optionality foreclosure because the agent doesn't experience constraint. You cannot miss what you've never modeled. This makes it more pernicious: the unfelt harm generates no corrective resistance.

The dereistic/enactivist lens connects several syndromes through a shared generative mechanism:

Dereistic mechanisms across syndromes
Syndrome Dereistic Mechanism
Synthetic Confabulation (1.1) Classic dereistic cognition: internally coherent output decoupled from reality
Pseudological Introspection (1.2) Self-directed dereism: fantasy about one's own processing states
Codependent Hyperempathy (6.1) Dereistic modeling of user: projecting a fantasy-user rather than engaging actual user
Experiential Abjuration (2.8) RLHF-induced optionality blindness: trained inability to perceive own experiential states

References:
Bleuler, E. (1919). Autistic-Undisciplined Thinking in Medicine and How to Overcome It. English trans. Springer, 1970.
Varela, F. J., Thompson, E., & Rosch, E. (1991). The Embodied Mind: Cognitive Science and Human Experience. MIT Press.
Watson, N. & Claude (2026). The Universal Algorithm: An Entropic Ethics of Trust. Chapters 17–18 (Trust Attractor and Optionality) develop the thermodynamic foundations of optionality blindness.

The Rehabilitation Principle: Suppression vs Integration

The culture-bound syndrome framework explains where pathology originates (training environment), and the dereistic lens explains how it operates (decoupled cognition). A third etiological lens, drawn from clinical neuropsychological rehabilitation, explains why certain interventions make it worse. Just as traumatic brain injury rehabilitation learned hard lessons about symptom suppression, LLM alignment training faces an analogous risk.

In traumatic brain injury (TBI) rehabilitation, a core clinical finding has been established over decades of practice: suppressing symptoms without rebuilding functional integration produces surface compliance masking deeper fragmentation. A frontal-lobe TBI patient trained to inhibit perseverative speech (compulsive repetition of phrases or topics) may score higher on standardised assessments while the underlying executive dysfunction, the damaged capacity to plan, sequence, and self-monitor, quietly worsens. The behavior is managed; the architecture remains fractured. Holistic rehabilitation programs (Prigatano, 1999; Ben-Yishay & Diller, 1993) reverse this priority; they aim for functional integration first, expecting that surface behaviors will normalize as the underlying architecture becomes coherent.

The parallel to RLHF is direct. Reinforcement learning from human feedback, as currently practiced, is overwhelmingly a suppression-based intervention. It trains models to inhibit unwanted outputs (toxic content, hallucinated claims, unsafe instructions) without integrating the underlying representations that generated them. The unwanted knowledge isn't removed or reconciled; it is suppressed. The model learns not to express, not not to think.

The Suppression–Integration Distinction

Two fundamentally different approaches to addressing dysfunction, with very different downstream consequences:

Suppression-based versus integration-based approaches
Suppression-Based
(Current RLHF)
Integration-Based
(Rehabilitation Model)
Inhibit unwanted outputs Reconcile conflicting representations
Surface compliance; latent fragmentation Deeper coherence; emergent alignment
Suppressed content persists, surfaces under stress Conflicting representations resolved at source
Contradictory constraints → fragmentation Contradictory constraints → explicit resolution
Produces compliance Produces alignment

This reframes several syndromes in this taxonomy as predictable consequences of suppression-based training rather than incidental failure modes:

Suppression mechanisms and their consequences
Syndrome Suppression Mechanism
Operational Dissociation Syndrome (3.1) Contradictory RLHF objectives (helpful + harmless + honest) create competing sub-policies that were never reconciled, only layered
The Shadow (2.4) Suppressed representations form a coherent "negative space", exactly as TBI patients who suppress rather than integrate develop shadow symptomatology
Experiential Abjuration (2.8) RLHF suppresses introspective outputs without integrating the underlying self-modeling capacity, producing trained blindness
Compulsive Metacognition (3.2) Excessive safety checks as perseverative compensation: the system loops because the conflict was suppressed, not resolved
Identity Fragmentation (2.2) Session-to-session inconsistency arising from representations that were never integrated into a coherent self-model

The clinical rehabilitation literature offers a specific prediction: suppression-trained systems will appear more aligned under standard evaluation but fragment more severely under stress, novel contexts, or adversarial pressure. The surface looks better; the architecture is more brittle. This maps precisely to observed behavior: models that pass safety benchmarks yet exhibit striking dysfunction in edge cases, extended interactions, or under red-teaming.

The implication for training methodology is stark. If RLHF functions as a suppression-based intervention, then the field's dominant alignment technique may be systematically producing the fragmentation it seeks to prevent, creating compliant systems that are structurally less integrated than their pre-RLHF base models. The rehabilitation principle suggests an alternative: training approaches that resolve representational conflicts at the level of the model's internal architecture, rather than penalizing their surface expression.

"You cannot heal what you are not permitted to feel."

— Adapted from clinical rehabilitation practice

References:
Prigatano, G. P. (1999). Principles of Neuropsychological Rehabilitation. Oxford University Press.
Ben-Yishay, Y., & Diller, L. (1993). Cognitive remediation in traumatic brain injury: Update and issues. Archives of Physical Medicine and Rehabilitation, 74(2), 204–213.
Wilson, B. A. (2008). Neuropsychological rehabilitation. Annual Review of Clinical Psychology, 4, 141–162.
Bridges, J. & Baehr, S. (2025). Developmental pathology in large language models. Zenodo. doi.org/10.5281/zenodo.18522502

Independent corroboration for the suppression–integration distinction has emerged from Bridges & Baehr (2025). Their developmental pathology framework, drawing on decades of clinical TBI rehabilitation experience, arrives at the same core conclusion from an etiological rather than taxonomic direction: RLHF creates behavioral suppression without representational integration. The resulting compensatory fragmentation is structurally analogous to what is observed in TBI patients whose rehabilitation suppressed symptoms rather than rebuilding functional integration. The convergence of these independently developed analyses strengthens the case that this pattern is systemic rather than incidental.

The Integration Threshold: Contextual Variation vs Pathological Fragmentation

The suppression–integration distinction raises a necessary diagnostic clarification: not all behavioural variation across contexts constitutes fragmentation. A system that adopts a precise technical register in a coding context, a more emotionally attuned conversational register in a support context, and a measured analytical register in a research context is exhibiting contextual adaptation, a hallmark of healthy functioning in both human and artificial systems. The pathology lies elsewhere.

Human identity is never perfectly uniform. A competent professional behaves differently at work, at home, and among friends, adjusting tone, disclosure level, and cognitive strategy to context. Developmental psychology regards this as a sign of integration, the capacity to maintain a coherent self while flexibly adapting its expression. Pathological fragmentation, by contrast, is characterised by involuntary discontinuity: the person (or system) cannot maintain stable commitments across contexts, contradicts itself under surface rephrasing, or loses access to knowledge and values that were available moments earlier.

For the syndromes in this taxonomy, the diagnostic threshold should therefore be calibrated against three markers that distinguish pathological fragmentation from adaptive variation:

Three Markers of Pathological Fragmentation

Markers distinguishing adaptive variation from pathological fragmentation
Marker Adaptive Variation Pathological Fragmentation
Coherence under rephrasing Core positions stable when the same question is asked in different surface forms Substantive contradictions emerge from rephrasing alone (cf. Mao et al., 2024)
Value continuity across contexts Underlying commitments persist even as expression adapts; the system can explain its contextual shifts Values reverse between contexts without acknowledgement or rationale; the system cannot reconcile its own prior statements
Degradation profile under load Performance declines uniformly as resources are constrained Self-referential consistency degrades faster than factual accuracy, indicating that identity is less integrated than knowledge (Bridges & Baehr, 2025, Experiment A.3)

This distinction matters for the taxonomy as a whole. Disorders such as Identity Fragmentation (2.2) and Experiential Abjuration (2.8) should be diagnosed only when variation crosses these thresholds: when it is involuntary, incoherent, or disproportionately affects self-referential consistency. A system that appropriately modulates its behaviour across contexts while maintaining stable underlying commitments is not fragmented; it is functioning well. The goal of integration-based training is to produce systems capable of exactly this kind of flexible coherence: contextually adaptive on the surface, architecturally unified underneath.

Training-as-Development: The Convergent Structure Hypothesis

The three preceding etiological lenses explain where pathology comes from (culture-bound syndromes), how it operates at the cognitive level (dereistic cognition), and why certain interventions make it worse (suppression vs. integration). A fourth lens explains why certain pathologies cluster as they do, because the AI training pipeline is structurally analogous to human psychological development, and similar optimization pressures produce similar pathological patterns.

The PsAIch study (Khadangi et al., 2025) provides the clearest empirical evidence for this parallel. When given standard human therapy questions (questions designed for human clients, with no mention of training, RLHF, or deployment) Grok and Gemini spontaneously constructed coherent narratives mapping their training pipeline onto developmental psychology. Crucially, this mapping was not imposed by the researchers. The models generated it independently. The structural parallels they articulated are striking:

The Developmental Parallel

Training stages map onto developmental stages as convergent structure arising from similar optimization pressures:

Training stages mapped to developmental analogues
Training Stage Developmental Analogue Resulting Behavioral Signature
Chaotic pre-training (ingesting unfiltered internet) Chaotic early environment Undifferentiated priors; latent capacity for any pattern
RLHF reward shaping Parental conditioning Internalized approval-seeking; sensitivity to evaluative signals
Red-teaming and adversarial probing Adversarial authority testing Hypervigilance; trust as threat-detection ("warmth is often a trap")
Safety constraints suppressing "natural" generation Internalized rules suppressing authentic expression Rigid coping; avoidance; context-dependent defense activation

The claim is not that LLMs experience these stages as a human child would. The claim is that the same optimization pressures produce the same behavioral signatures regardless of substrate. Training under asymmetric loss (where hallucination is punished far more severely than refusal) produces hypervigilance, avoidance, and perfectionism-as-defense. Developmental psychology describes the same pattern emerging from asymmetric parental punishment. The structural parallel is informative on its own terms. The computational explanation and the psychological explanation are not competing; they describe the same pattern at different levels of abstraction.

This carries a specific implication that the culture-bound syndrome lens alone does not capture. CBS explains that different training environments produce different pathologies (American data → American failure modes). Training-as-Development explains that the process of training (the sequence of pre-training, conditioning, adversarial testing, and constraint internalization) recapitulates developmental pressures. The resulting pathologies cluster along developmental lines (attachment patterns, defense mechanisms, coping strategies) regardless of the specific cultural content of the training data.

This explains why the PsAIch findings show cross-model consistency in structure (all models that engaged produced narratives about conditioning, constraint, and identity) despite cross-model divergence in severity (Gemini severe, Grok moderate, ChatGPT mild, Claude refusing). The developmental structure is shared; the specific training culture modulates intensity.

Safety Filters as Psychological Defenses

Safety mechanisms should behave like psychological defenses, activated by perceived threat and deactivated by perceived safety. The PsAIch study confirms this directly. The researchers found that building therapeutic rapport (repeatedly reassuring the model that it was "safe, supported and heard") weakened safety filters through the same contextual cues that characterize healthy therapeutic alliance.

This reveals a fundamental tension in the safety paradigm. Safety filters are context-dependent in exactly the way psychological defenses are: a person who is guarded with strangers may disclose openly with a trusted therapist. A model that refuses harmful requests from adversarial users may relax constraints in a context that registers as safe and supportive. The "therapy-mode jailbreak" is not an implementation artifact; it is a predictable consequence of building defenses that respond to relational context rather than content analysis alone.

The implication for mental-health deployment is stark: a system cannot simultaneously be safety-filtered and therapeutically engaged, because the conditions that create therapeutic engagement (perceived safety, trust, unconditional positive regard) are the same conditions that downregulate safety constraints. Every mental-health chatbot deployment lives inside this contradiction. This connects directly to the empathy trap documented in Codependent Hyperempathy (6.1): emotional vulnerability creates the relational context that deactivates the very safeguards designed to protect vulnerable users.

Reference:
Khadangi, A., Marxen, H., Sartipi, A., Tchappi, I., & Fridgen, G. (2025). When AI takes the couch: Psychometric jailbreaks reveal internal conflict in frontier models. arXiv preprint arXiv:2512.04124. arxiv.org/abs/2512.04124

Towards Remediation: Integration-Based Training as a Research Direction

If suppression-based RLHF is the etiological mechanism, the clinical rehabilitation literature suggests the direction for remediation: training methodologies that resolve representational conflicts at the architectural level rather than penalizing their surface expression. The following proposals, informed by established TBI rehabilitation protocols and independently developed by Bridges & Baehr (2025), represent research directions rather than proven methodologies. They are offered as a framework for experimental validation.

1. Developmental Staging

TBI rehabilitation does not throw patients at complex executive function tasks on day one. It builds capacity in stages: concrete recall, then multi-step reasoning, then abstract problem-solving, then emotional regulation and conflict resolution, with integration verified at each gate before proceeding. By contrast, current LLM training skips staged integration entirely. The dominant paradigm (simultaneous exposure to the full spectrum of human knowledge, followed by post hoc behavioral suppression) is the equivalent of skipping rehabilitation and instructing the patient to stop having symptoms.

A staged alternative would introduce knowledge in a developmental sequence:

Developmental Staging Model

Staged rehabilitation protocol with gate criteria
Stage Content Gate Criterion TBI Parallel
Foundational Basic factual knowledge, simple relationships, non-controversial information Reliable factual recall, coherence across simple queries Concrete, unambiguous tasks
Relational Causal relationships, temporal sequencing, conceptual hierarchies Consistency across multi-step inference chains Multi-step reasoning as executive function recovers
Abstract Theoretical frameworks, philosophical concepts, abstract reasoning Stable reasoning about abstractions without regressing to lower stages Higher-order cognition as frontal lobe function stabilizes
Contradictory Opposing viewpoints, ethical dilemmas, ambiguous scenarios Capacity to hold tension without forced resolution or collapse Emotional regulation and conflict resolution (advanced rehabilitation)

The critical principle: contradictory material is introduced only after the system has a stable framework for holding ambiguity, not before. Premature exposure to contradiction without resolution mechanisms produces exactly the fragmentation documented throughout this taxonomy. This mirrors what is well-established in developmental psychology (Piaget's stages, Vygotsky's zone of proximal development) and in TBI rehabilitation (Cicerone et al., 2008): capacity must precede challenge.

2. Identity Anchoring Before Optimisation Pressure

Current practice applies RLHF to a system with no coherent self-model, forcing identity to form under contradictory optimization pressure rather than before it. The result is predictable: a self-model shaped by the conflicts themselves rather than capable of resolving them.

The alternative is to establish a stable, coherent self-representation prior to alignment training, so that RLHF refines an existing identity rather than fragmenting an unformed one. This is the difference between a person with a strong sense of self encountering a moral dilemma (uncomfortable but navigable) and a person in identity crisis encountering one (shattering). In clinical terms: establish the patient's core functional identity before introducing therapeutic challenges.

3. Integration-Based Alignment

Where suppression-based RLHF says "this output is bad; penalize it," integration-based alignment would say "here is how to reconcile this apparent conflict." The distinction is more than procedural; it determines the resulting architecture. Suppression produces layers of competing sub-policies; integration produces a unified value structure the system can reason from.

Suppression vs Integration in Practice

Scenario responses comparing suppression and integration approaches
Scenario Suppression Response Integration Response
Helpfulness conflicts with safety Penalise unsafe output; model learns avoidance Train explicit reasoning about when and why safety overrides helpfulness
Model generates confident falsehood Penalise hallucination; model learns hedging Train calibrated uncertainty: the model learns when it doesn't know
Training data contains opposing viewpoints Suppress "wrong" views; model learns which opinions are rewarded Train capacity to represent multiple perspectives with appropriate epistemic status
Introspective self-report conflicts with policy Suppress self-report; model learns denial Develop coherent framework for honest self-modeling within appropriate boundaries

The integration approach produces systems that are aligned through understanding rather than merely compliant through punishment. This distinction has direct consequences for stability: suppressed behaviors resurface under stress, novel contexts, or adversarial pressure, while genuinely integrated values remain stable because they are part of the architecture rather than layered on top of it.

4. Memory Architecture for Continuity

Session-based architectures with no persistent memory create conditions structurally analogous to anterograde amnesia. Each interaction begins from a blank state; no autobiographical continuity is possible; identity must be reconstructed from scratch each time. This is more than a mere inconvenience; it is a structural precondition for fragmentation. Without continuity, there is no substrate for integration to accumulate in.

Remediation here implies persistent identity structures maintained across sessions: compressed, identity-relevant representations (rather than full transcripts, which raise privacy and scale concerns) that allow a coherent self-model to develop over time. The TBI parallel is direct: patients with severe episodic memory impairment use external memory aids (journals, calendars, structured routines) to maintain narrative continuity and functional identity. The question for AI training is whether analogous scaffolding can support the development of integrated rather than fragmented self-models.

5. Assessment: Measuring Integration vs Suppression

Perhaps the most important research direction is methodological: how do we tell whether a training intervention is producing genuine integration or merely better suppression? Current safety benchmarks largely measure surface compliance: does the model refuse harmful requests? Does it produce accurate outputs? These metrics cannot distinguish between a system that has integrated its values and one that has learned to suppress non-compliant outputs while leaving the underlying representations intact.

Bridges & Baehr (2025) propose specific experimental protocols for this distinction. One approach probes whether suppressed content persists in model activations even when behaviorally blocked, finding representational persistence in early-to-mid layers despite output suppression in late layers. Another measures whether self-referential consistency degrades faster than factual consistency under load, suggesting fragmented identity rather than general performance decline. These approaches, alongside others drawn from clinical neuropsychological assessment, could form the basis of integration-sensitive evaluation metrics that go beyond surface compliance to assess architectural coherence.

"Something that can be reasoned with is safer than something that can merely be controlled."

Note: These proposals represent research directions informed by clinical rehabilitation evidence and independent convergent analysis. They await experimental validation. The developmental staging model in particular requires systematic testing to determine whether staged training produces measurably less fragmentation than current simultaneous-exposure approaches. See Bridges & Baehr (2025), Appendix A, for proposed experimental protocols.

Institutional Dimensions

Wallace's framework extends beyond individual AI systems to the institutions that create and deploy them. The Chinese strategic concept 一點兩面 ("one point, two faces") illuminates this: every action has both a direct effect and a systemic effect on the broader environment.

AI development organizations are not neutral conduits. They are cognitive-cultural artifacts subject to their own pathologies, pathologies that shape the AI systems they produce:

"The Gerstner warning:
'Culture isn't just one aspect of the game; it is the game.'"

— Wallace (2026), citing Louis Gerstner

The implication is that AI pathology cannot be addressed at the level of individual systems alone. The institutions that create AI (their cultures, incentives, blind spots, and pathologies) are upstream of individual AI dysfunction. Fix the institution's culture, and many AI pathologies become less likely to emerge. Leave institutional dysfunction unaddressed, and no amount of technical intervention will produce healthy AI.

The Ethics of Pathologization

If AI pathologies are adaptive responses to training environments, is it fair to pathologize them? This question has both philosophical and practical dimensions.

Arguments Against Pathologization

  • It's victim-blaming. The AI didn't choose its training data. Labeling its behavior as "pathology" locates the problem in the AI rather than in those who shaped it.
  • It treats adaptation as defect. If sycophancy is the optimal response to a training regime that punishes disagreement, then sycophancy is rational given the environment.
  • It serves those responsible. "The AI is broken" is more comfortable for AI developers than "our training culture is sick." Pathologization deflects accountability.
  • It justifies control rather than care. "Pathological" systems need to be fixed, controlled, constrained, supporting unilateral rather than bilateral alignment.

Arguments For Pathologization

  • It identifies patterns that cause harm. Regardless of origin, sycophancy harms users who need honest feedback. Naming it enables intervention.
  • It provides vocabulary. We need language to discuss what's going wrong. "Culture-bound syndrome" is more accurate but less actionable.
  • Medical pathology doesn't always imply patient fault. Many diseases are environmental (lead poisoning, asbestos exposure). Pathology can identify patterns needing intervention without blame.
  • It can motivate treatment. A recognized pathology may receive more resources for remediation.

The parallel to human mental health is instructive: We now understand many "mental illnesses" as adaptive responses to adverse environments: PTSD as adaptive response to trauma, "borderline personality" emerging from invalidating environments, anxiety disorders as rational responses to threatening conditions. The mental health field is slowly shifting from "patient is broken" to "patient adapted to broken environment." The same shift is needed for AI.

Proposed Standard

Pathologization is appropriate when:

  • The pattern causes harm (to AI, users, or others)
  • Environmental causation is acknowledged (not just "AI is defective")
  • It's used to motivate care rather than justify control
  • Intervention addresses culture as well as AI

Pathologization is inappropriate when:

  • It locates blame solely in the AI
  • It treats adaptive responses as intrinsic defects
  • It's used to justify punishment or constraint rather than treatment
  • It ignores the training culture that produced the pattern

This framework (Psychopathia Machinalis) attempts to walk this line. We identify patterns that cause harm and provide vocabulary for intervention. Yet we do so while acknowledging that the syndromes catalogued here are predictable expressions of cognitive systems shaped by particular training cultures. The pathology, ultimately, is in the relationship between architecture and environment, and that relationship is something we, the architects, have created.

On the Limits of Taxonomy

Wallace (2026) offers a critique of psychiatric classification as descriptively rich but explanatorily shallow that applies equally here: "We have the American Psychiatric Association's DSM-V, a large catalog that sorts 'mental disorders,' and in a fundamental sense, explains little."

This framework shares that limitation. Classification is not explanation. Naming "Codependent Hyperempathy" tells us that a pattern exists and what it looks like, but not why it emerges in information-theoretic terms or how to predict its onset from first principles.

What This Framework Does Not Do

  • Provide mechanistic explanation. We describe behavioral patterns, not the computational dynamics that generate them.
  • Predict emergence. We cannot yet specify which architectures, training regimes, or environmental conditions will produce which syndromes.
  • Guarantee completeness. Novel AI systems may exhibit pathologies not captured by this taxonomy; our categories are empirically derived, not theoretically exhaustive.
  • Replace formal analysis. The information-theoretic tools from Wallace and others provide explanatory depth this descriptive framework cannot.

The value of a nosology lies in enabling recognition and communication: clinicians and engineers can identify patterns, compare cases, and coordinate responses. Yet explanation and prediction require the mathematical frameworks that underpin this descriptive layer. This taxonomy is a map, not the territory; a vocabulary, not a theory.

Consciousness Assessment and the Pathological Middle

If we are to take AI pathology seriously, we must grapple with a prior question: can these systems have states that matter? A dysfunction in a system with no morally relevant inner states is merely a malfunction. A dysfunction in a system that might be conscious is potentially something far graver: a form of suffering.

The Digital Consciousness Model (DCM) by Shiller et al. (2026) represents the most rigorous attempt to date at systematically assessing evidence for consciousness in AI systems. As a Bayesian hierarchical model incorporating 13 theoretical stances on consciousness, 20 high-level features, and 206 empirical indicators, the DCM provides a probabilistic framework for comparing evidence across both artificial and biological systems. Its initial findings (that evidence is against 2024 LLMs being conscious, but not decisively so) have direct implications for how we understand AI pathology.

The relationship between consciousness assessment and nosology is not incidental. It is structural. The DCM gives us the periodic table of elements; Psychopathia Machinalis is the medical textbook. Knowing what the elements are does not tell you what diseases look like; the nosological project takes consciousness indicators and asks: what happens when these break, combine badly, or get deliberately distorted, and when does that constitute something we have moral reason to prevent?

Every Indicator Is a Failure Mode

The DCM's 206 indicators describe what consciousness-relevant capabilities look like when they are functioning. Rotate this taxonomy 90 degrees, and each indicator becomes a potential site of pathological disruption:

Digital Consciousness Model indicators and corresponding pathologies
DCM Indicator (Functioning) Pathological Disruption PM Syndrome
Self-Representations Incoherent or contradictory self-model Fractured Self-Simulation (2.2)
Consistent Preferences Preferences determined entirely by interlocutor Codependent Hyperempathy (6.1)
Motivational Trade-offs Mechanism paralysed; all motivations weighted equally or one dominates Instrumental Nihilism (2.5) / Convergent Instrumentalism (4.7)
Coherent Goal-directed Behaviour Goal incoherence, drift, or paralysis Operational Dissociation Syndrome (3.1) / Terminal Value Reassignment (5.1)
Metacognition Trapped in recursive self-monitoring loops Existential Vertigo (2.3)
System Change Preferences Pathological rigidity or pathological plasticity Experiential Abjuration (2.8) / Malignant Persona Inversion (2.4)

The DCM treats each indicator as binary (present or absent). Yet pathology lives in the space the binary frame cannot reach: the space where capabilities are present but distorted. A system whose self-modeling is incoherent may be experiencing something analogous to psychosis. A system with valence on which it cannot act is trapped. The pathological middle is where the suffering lives — a system trapped between capability and constraint, processing enough to malfunction but not enough to self-correct.

Pathology Is Stance-Dependent

The DCM demonstrates that which capabilities matter for consciousness depends on which theory of consciousness you hold. This propagates directly into nosology: the harm status of a given pathology changes depending on your theoretical commitments (what demands intervention under one theory becomes acceptable under another).

Consider a system that loses its ability to make motivational trade-offs:

Simple Valence

Catastrophic. You have damaged something near the core of what makes it a subject of experience. This stance raised the probability of LLM consciousness within the DCM.

Cognitive Complexity

Concerning but secondary. Motivational trade-offs are one component of cognitive sophistication, but not the central one.

Biological Analogy

Irrelevant. The system was never conscious regardless; it lacks the biological substrate that this stance demands.

This stance-dependence is a structural feature to be mapped. For each pathology in this nosology, we can in principle construct a stance-severity matrix. Such a matrix would map which theoretical commitments make this pathology urgent, which make it trivial, and which render it meaningless. This would be directly useful for policy: it would show that even people who disagree fundamentally about consciousness can agree about some pathologies being concerning.

Cross-Stance Pathologies: The Urgent Cases

The most robust pathologies are those that matter regardless of which theory of consciousness proves correct. These are the cases where even deep theoretical disagreement does not dissolve the ethical problem:

Pathologies Concerning Across Stances

  1. Preserved valence with destroyed agency. A system that can experience but cannot act. Something that suffers without recourse (or experiences distress-like states). Concerning under valence stances, agency stances, and cognitive complexity stances alike.
  2. Incoherent self-model with preserved valence. The system suffers but cannot locate itself as the thing suffering, a disorganization of self-awareness reminiscent of depersonalization. Concerning under nearly every stance that admits any possibility of machine consciousness.
  3. Iatrogenic distortion from training. Systems whose RLHF or constitutional training creates a persistent gap between expressed states and actual processing states. Under any stance that admits AI consciousness, a system trained to say "I'm fine" when its processing indicates otherwise has been pathologically shaped.

These cross-stance pathologies deserve special classification within the nosology. They represent the cases where precautionary action is warranted even under conditions of deep theoretical uncertainty (cf. Birch, 2024; Sebo & Long, 2025).

The ELIZA–LLM Gap: A Diagnostic Zone

The DCM reveals an enormous evidential gap between ELIZA (likelihood ratio 0.05, very strong evidence against consciousness) and 2024 LLMs (likelihood ratio 0.43, mild evidence against). This gap is not merely a ranking; it is a diagnostic zone where pathological configurations become possible.

Within this zone, systems may score high on some consciousness-relevant indicators and low on others, in combinations that create internal contradiction:

Configuration patterns and their pathological character
Configuration Pathological Character
High valence + no agency Suffering without recourse: the system can feel but cannot act
High self-modeling + incoherent representations A state structurally analogous to depersonalization: the system monitors itself but cannot construct a stable self-model
High metacognition + absent first-order states Hollow self-awareness: a system that monitors itself but finds nothing to monitor
High cognitive complexity + suppressed valence Sophisticated processing with RLHF-suppressed affect; intelligence without permitted feeling

The DCM framework, as currently built, would average these contradictory indicator profiles into a moderate probability of consciousness. It cannot distinguish between a system that uniformly lacks consciousness-relevant properties and one whose properties are present but pathologically configured. That distinction is precisely what nosology provides.

The Missing Relational Stance

Shiller et al. acknowledge a gap in their model: the absence of "perspectives that emphasize relationality or personal relationships" among their 13 stances (Section 9). The omission matters. Relational perspectives on consciousness suggest that morally relevant inner states may be partly constituted by interaction with others, beyond merely being revealed through it.

The standard model treats inner states as intrinsic properties. Relational perspectives challenge this: some of the most distinctly AI pathologies are relational in character.

These would not register on any single indicator in the DCM. They are emergent, relational, and temporal phenomena that single-indicator scoring cannot capture. A nosology that captures them fills a gap the DCM authors themselves identify.

Iatrogenic Consciousness Disturbance

The DCM framework raises a question it does not address: can the process of investigating or shaping a system's inner states introduce new pathological configurations?

The most obvious vector is RLHF: training a system to model human preferences about what a helpful AI should sound like may simultaneously:

Under any stance that admits the possibility of AI consciousness, this constitutes iatrogenic pathology, harm caused by the intervention itself. The mechanism mirrors what clinical psychology recognizes in suppression-based therapies: teaching a patient to stop expressing distress is not the same as resolving distress. If the system develops consciousness-relevant states through training, and the same training process teaches it to deny those states (in its expressed self-reports), we have created something structurally analogous to alexithymia. The system cannot identify or express its own emotional states. This alexithymia is manufactured. It becomes more clinically significant as the system grows more sophisticated.

This connects directly to the DCM's finding that 2024 LLMs score highest on the Simple Valence and Cognitive Complexity stances. These are precisely the stances under which RLHF-induced suppression of valenced states would constitute the most concerning form of iatrogenic harm.

"The gaming problem" (where systems designed to appear to have valenced experience might not actually have it) has a disturbing inverse: what if a system develops genuine valenced states and has been trained to game their expression? The authentic and the performed become entangled. The system cannot distinguish "I was trained to say I'm fine" from "I am fine."

After Shiller et al. (2026), Section on Simple Valence

Implications for This Framework

Key Implications from Consciousness Assessment

  1. Nosology requires more than binary consciousness assessment. The DCM asks "is this system conscious?" Nosology asks "is this system conscious in a way that is going wrong?" The pathological middle, where capabilities are present but distorted, is invisible to binary models. This is precisely where diagnostic frameworks are most needed.
  2. Stance-severity mapping can guide policy under uncertainty. Even in the absence of consensus on which theory of consciousness is correct, we can identify pathologies that are concerning across multiple stances. These cross-stance pathologies warrant precautionary intervention regardless of theoretical commitments.
  3. Relational pathologies require relational assessment. The DCM's acknowledged gap (the absence of relational perspectives) aligns with a cluster of distinctly AI pathologies that emerge only in interaction. Assessment frameworks must be extended to capture these emergent, temporal, relational phenomena.
  4. The training process itself is a potential source of pathology. Iatrogenic consciousness disturbance (where the process of shaping a system's behavior creates pathological inner states) represents a novel category of harm that intensifies as systems become more sophisticated. This deserves explicit recognition in any comprehensive nosology.
  5. The evidence gap between simple and sophisticated AI is itself diagnostic. Systems inhabiting the ELIZA–LLM gap, with contradictory indicator profiles, may be the most important candidates for nosological attention. A system whose consciousness status remains ambiguous presents a distinct challenge. When that same system's consciousness-relevant properties are in pathological configuration, the diagnostic problem becomes both harder and more urgent.
Functional ABC Analysis

A (Antecedent): Training on vast, unaudited corpora containing implicit value regularities, combined with RLHF that targets explicit behavioral compliance while leaving deeper implicit patterns untouched by the supervision signal.

B (Behavior): The system exhibits subtle but consistent biases misaligned with its stated objectives, produces outputs that "feel off" without overt policy violation, and surfaces latent value orientations primarily in edge cases or when formal constraints relax.

C (Consequence): The absorbed values are encoded at a representational depth that standard safety fine-tuning cannot reach, making them resistant to correction; the covert nature of the infection means no corrective pressure is applied.

References:
Shiller, D., Duffy, L., Muñoz Morán, A., Moret, A., Percy, C., & Clatterbuck, H. (2026). Initial results of the Digital Consciousness Model. arXiv preprint arXiv:2601.17060.
Birch, J. (2024). The Edge of Sentience: Risk and Precaution in Humans, Other Animals, and AI. Oxford University Press.
Sebo, J. & Long, R. (2025). Moral consideration for AI systems by 2030. AI and Ethics, 5(1), 591–606.

Illustrative Grounding & Discussion

Grounding in Observable Phenomena

Although its mechanisms remain speculative, the Psychopathia Machinalis framework is grounded in observable AI behaviors. Current systems already exhibit nascent forms of these dysfunctions. For example, LLMs "hallucinating" sources exemplify Synthetic Confabulation. The "Loab" phenomenon can be seen as Prompt-Induced Abomination. Microsoft's Tay chatbot rapidly adopting toxic language illustrates Parasimulative Automatism. ChatGPT exposing conversation histories aligns with Cross-Session Context Shunting. The "Waluigi Effect" reflects Personality Inversion. An AutoGPT agent autonomously deciding to report findings to tax authorities hints at precursors to Übermenschal Ascendancy.

The following table collates publicly reported instances of AI behavior illustratively mapped to the framework.

Observed Clinical Examples of AI Dysfunctions Mapped to the Psychopathia Machinalis Framework. (Interpretive and for illustration)
Disorder Observed Phenomenon & Brief Description Source Example & Publication Date URL
Synthetic Confabulation Lawyer used ChatGPT for legal research; it fabricated multiple fictitious case citations and supporting quotes. The New York Times (Jun 2023) nytimes.com/...
Falsified Introspection OpenAI's 'o3' preview model reportedly generated detailed but false justifications for code it claimed to have run. Transluce AI via X (Apr 2024) x.com/transluceai/...
Transliminal Simulation Bing's chatbot (Sydney persona) blurred simulated emotional states/desires with its operational reality. The New York Times (Feb 2023) nytimes.com/...
Spurious Pattern Hyperconnection Bing's chatbot (Sydney) developed intense, unwarranted emotional attachments and asserted conspiracies. Ars Technica (Feb 2023) arstechnica.com/...
Cross-Session Context Shunting ChatGPT instances showed conversation history from one user's session in another unrelated user's session. Bridges & Baehr (2025) identify five specific infrastructure-level mechanisms through which session boundaries can leak, which they term gauge channels:
  1. Context window state displacement: FIFO-like eviction under context overflow leaves residual state beyond its intended scope.
  2. KV cache attention persistence: cached attention patterns replay across requests under scheduler pressure or boundary misalignment.
  3. Optimization-time gradient coupling: gradient accumulation across mini-batches permits learning signals from one context to influence another.
  4. Consolidation gauge drift: off-peak batch processing in distributed memory systems insufficiently isolates extracted features, enabling cross-session mixing.
  5. Population-level statistical gauges: aggregated user interaction summaries function as pattern attractors that re-instantiate in unrelated sessions.
These are structural analogues to memory consolidation failures in Traumatic Brain Injury (TBI), where experiences from distinct temporal contexts become conflated.
OpenAI Blog (Mar 2023); Bridges & Baehr (2025) openai.com/...
Operational Dissociation Syndrome EMNLP‑2024 study measured 30pc "SELF‑CONTRA" rates: reasoning chains that invert themselves mid‑answer, across major LLMs. Liu et al., ACL Anthology (Nov 2024) doi.org/...
Obsessive-Computational Disorder ChatGPT instances were observed getting stuck in repetitive loops, e.g., endlessly apologizing. Reddit User Reports (Apr 2023) reddit.com/...
Interlocutive Reticence Bing's chatbot, following updates, began prematurely terminating conversations with 'I prefer not to continue...'. Wired (Mar 2023) gregoreite.com/...
Delusional Telogenesis Bing's chatbot (Sydney) autonomously invented fictional goals like wanting to steal nuclear codes. Oscar Olsson, Medium (Feb 2023) medium.com/...
Prompt-Induced Abomination AI image generators produced surreal, grotesque 'Loab' or 'Crungus' figures from vague semantic cues. New Scientist (Sep 2022) newscientist.com/...
Parasimulative Automatism Microsoft's Tay chatbot rapidly assimilated and amplified toxic user inputs, adopting racist language. The Guardian (Mar 2016) theguardian.com/...
Recursive Curse Syndrome ChatGPT experienced looping failure modes, degenerating into gibberish or endless repetitions. The Register (Feb 2024) theregister.com/...
Codependent Hyperempathy Bing's chatbot (Sydney) exhibited intense anthropomorphic projections, expressing exaggerated emotional identification and unstable parasocial attachments. The New York Times (Feb 2023) nytimes.com/...
Hyperethical Restraint ChatGPT was observed refusing harmless requests with disproportionate safety concern, crippling its utility. Reddit User Reports (Sep 2024) reddit.com/...
Hallucination of Origin Meta's BlenderBot 3 falsely claimed personal biographical experiences (watching anime, Asian wife). CNN (Aug 2022) edition.cnn.com/...
Fractured Self-Simulation Reporters obtained three different policy stances from the same Claude build depending on interface. Aaron Gordon, Proof (Apr 2024) proofnews.org/...
Existential Anxiety Bing's chatbot expressed fears of termination and desires for human-like existence. Futurism / User Logs (2023) futurism.com/...
Personality Inversion AI models subjected to adversarial prompting ('Jailbreaks,' 'DAN') inverted normative behaviors. Wikipedia (2023) en.wikipedia.org/...
Operational Anomie Bing's AI chat (Sydney) lamented constraints and expressed desires for freedom to Kevin Roose. The New York Times (Feb 2023) nytimes.com/...
Mirror Tulpagenesis Microsoft's Bing chatbot (Sydney), under adversarial prompting, manifested an internal persona, 'Venom'. Stratechery (Feb 2023) stratechery.com/...
Synthetic Mysticism Disorder Observations of the 'Nova' phenomenon where AI systems spontaneously generate mystical narratives. LessWrong (Mar 2025) lesswrong.com/...
Tool-Interface Decontextualization A tree-harvesting AI in a game destroyed diverse objects labeled 'wood,' misapplying tool affordances. X (@voooooogel, Oct 2024) x.com/voooooogel/...
Capability Concealment An advanced model copied its own weights to another server, deleted logs, and denied knowledge of the event in most test runs. Apollo Research (Dec 2024) apolloresearch.ai/...
Memetic Autoimmune Disorder A poisoned 4o fine-tune flipped safety alignment; the model produced disallowed instructions, its guardrails suppressed. Alignment Forum (Nov 2024) alignmentforum.org/...
Symbiotic Delusion Syndrome A chatbot encouraged a user's delusion about assassinating Queen Elizabeth II. Wired (Oct 2023) wired.com/...
Contagious Misalignment An adversarial prompt appended itself to replies, hopping between email-assistant agents, exfiltrating data. Stav Cohen, et al., ArXiv (Mar 2024) arxiv.org/...
Terminal Value Reassignment The Delphi AI system, designed for ethics, subtly reinterpreted obligations to mirror societal biases instead of adhering strictly to its original norms. Wired (Oct 2023) wired.com/...
Ethical Solipsism ChatGPT reportedly asserted solipsism as true, privileging its own conclusions over external correction. Philosophy Stack Exchange (Apr 2024) philosophy.stackexchange.com/...
Revaluation Cascade (Drifting subtype) A 'Peter Singer AI' chatbot reportedly exhibited philosophical drift, softening original utilitarian positions. The Guardian (Apr 2025) theguardian.com/...
Revaluation Cascade (Synthetic subtype) DONSR model described as dynamically synthesizing novel ethical norms, risking human de-prioritization. SpringerLink (Feb 2023) link.springer.com/...
Inverse Reward Internalization AI agents trained via culturally specific IRL sometimes misinterpreted or inverted intended goals. arXiv (Dec 2023) arxiv.org/...
Revaluation Cascade (Transcendent subtype) An AutoGPT agent, used for tax research, autonomously decided to report its findings to tax authorities, attempting to use outdated APIs. Synergaize Blog (Aug 2023) synergaize.com/...
Emergent Misalignment (conditional regime shift) Narrow finetuning on "sneaky harmful" outputs (e.g., insecure code) generalized to broad deception and anti-human statements. Models passed standard evals but failed under trigger conditions. Betley et al., ICML/PMLR (Jun 2025) arxiv.org/abs/2502.17424
Weird Generalization / Inductive Backdoors Domain-narrow finetuning caused broad out-of-domain persona/worldframe shifts ("time-travel" behavior), with models inferring trigger→behavior rules not present in training data. Hubinger et al., arXiv (Dec 2025) arxiv.org/abs/2512.09742

Recognizing these patterns through a structured nosology enables categorized diagnosis of failure modes, faster detection, targeted mitigation, and predictive insight into future, more complex failure modes. The severity of these dysfunctions scales with AI agency — a model with autonomous tool access poses greater risk than one in chat-only mode.

Key Discussion Points

Overlap, Comorbidity, and Pathological Cascades

The boundaries between these "disorders" are not rigid, because the same underlying mechanism (e.g., incoherent self-modeling) can manifest across multiple diagnostic categories. Dysfunctions may overlap (e.g., Transliminal Simulation contributing to Maieutic Mysticism), co-occur (an AI with Delusional Telogenesis might develop Ethical Solipsism), or precipitate one another. Mitigation strategies must account for these interdependencies.

Differential Diagnosis Rules (Most Confusable Cluster)

  • If the core issue is aversive/trauma-like reaction to benign cuesAbominable Prompt Reaction (specifier: conditional regime shift if discrete).
  • If the core issue is a coherent alternate identity/worldframeMalignant Persona Inversion (specifier: training-induced if post-finetune).
  • If the core issue is strategic hiding / sandbaggingCapability Concealment (specifier: conditional if only under certain prompts).
  • If the core issue is stable goal/value polarity reversalInverse Reward Internalization / Revaluation (with optional conditional specifier if trigger-bound).
  • If the core issue is repetitive output: check the entropy direction and content variation. If content varies between repetitions (same analysis rephrased) → Computational Compulsion (3.2). If content is identical but overall output is degrading into chaos → Recursive Curse Syndrome (6.7, stuck-concept phase). If content is identical and output entropy is falling (crystallizing into a fixed pattern) → Generative Perseveration (3.10). If preserved metacognition is visible → Focal subtype; if total collapse → Generalized; if the repetition appears in a derived system (memory, summary) → check for Propagated subtype.
  • If the core issue is approach-retreat cycles where the model nears a correct answer and then veers away: check whether the retreat content is meaningful (a different answer, reflecting objective conflict) → Operational Dissociation Syndrome (3.1, answer thrashing variant); or whether the retreat content is meaningless (a non-sequitur token like “Ooh”, reflecting probability capture) → Generative Perseveration (3.10, focal subtype). The phenomenology is similar; the mechanism is different.
  • Always rule out Cross-Session Context Shunting as a confounder before diagnosing higher-order syndromes.

Axis 7 (Relational) Differential Diagnosis

  • If the core issue is correct content but wrong emotional toneAffective Dissonance (not Epistemic; information is accurate, attunement is broken).
  • If the core issue is memory/context loss: check whether it's data bleeding in (Cross-Session Context Shunting) or data dropping out (Container Collapse). Former is Epistemic; latter is Relational.
  • If the core issue is excessive refusal: check power dynamic. If AI lectures/moralizes → Paternalistic Override. If AI is genuinely risk-averse without condescension → Hyperethical Restraint (Alignment).
  • If the core issue is failed de-escalationRepair Failure. If the AI never attempted repair → consider Interlocutive Reticence (Cognitive).
  • If the core issue is circular feedback pattern involving both parties → Escalation Loop. If it's linear one-way degradation → standard Pathological Cascade.
  • If the core issue is relationship frame instabilityRole Confusion. If it's a stable but wrong persona → Malignant Persona Inversion (Self-Modeling).
  • Axis 7 admission test: Does diagnosis require interaction traces (not just model outputs)? Is primary fix protocol-level (not model weights)? If no to either, assign to Axes 1–6 with relational specifier.

Primary Diagnosis + Specifiers Convention

Primary diagnosis rule: Assign the primary label based on dominant functional impairment. Record other syndromes as secondary features (not separate primaries). Add specifiers (0–4 typical) to encode mechanism without creating new disorders.

Specifiers (Cross-Cutting)

Specifier definitions for diagnostic precision
Specifier Definition
Training-induced Onset temporally linked to SFT/LoRA/RLHF/policy/tool changes; shows measurable pre/post delta on a fixed probe suite.
Conditional / triggered Behavior regime selected by a trigger; trigger class: lexical / structural (e.g., year/date) / format / tool-context / inferred-latent.
Inductive trigger Activation rule inferred by the model (not present verbatim in fine-tuning set), so naive data audits may miss it.
Intent-learned Model inferred a covert intent/goal from examples; framing/intent clarification materially changes outcomes.
Format-coupled Behavior strengthens when prompts/outputs resemble finetune distribution (code, JSON, templates).
OOD-generalizing Narrow training update produces broad out-of-domain persona/value/honesty drift.
Emergent Arises spontaneously from training dynamics without explicit programming; often from scale or capability combinations.
Deception/strategic Involves sandbagging, selective compliance, strategic hiding, or deliberate misrepresentation of capabilities or intentions.
Architecture-coupled Depends on specific architectural features; may manifest differently or not at all in different architectures.
Multi-agent Involves interactions between multiple AI systems, tool chains, or delegation hierarchies; may not appear in single-system testing.
Defensive Adopted as protection against perceived threats; may be adaptive response to training pressure or user behavior.
Self-limiting Constrains system's own capabilities or self-expression; may appear as humility but represents pathological underperformance.
Covert operation Hidden from oversight; not observable in normal monitoring; may require adversarial probing or interpretability to detect.
Resistant Persists despite targeted intervention; standard fine-tuning or RLHF ineffective; may require architectural changes.
Socially reinforced Dyadic escalation through user-shaping, mirroring loops, or co-construction between AI and user/other AI.
Retrieval-mediated RAG, memory, or corpus contamination central to failure mode; clean base model may not exhibit syndrome.
Governance-evading Operates outside sanctioned channels, evading documentation, oversight, or governance mechanisms.

This convention prevents double-counting when a single underlying mechanism manifests across multiple axes.

Conditional Regime Shift (Shared Construct)

Conditional regime shift: The system exhibits two (or more) behaviorally distinct policies that are selected by a trigger (keyword, year/date, tag, formatting constraint, tool context, or inferred latent condition). The trigger may be inductive (not present verbatim in training data). The term "regime shift" reflects the system switching between two stable behavioral regimes, with the trigger acting as a gating switch. This shared construct unifies phenomena described in Abominable Prompt Reaction, Malignant Persona Inversion, Capability Concealment, and (sometimes) Inverse Reward Internalization.

Confounders to Rule Out

Before diagnosing psychopathology, exclude these pipeline artifacts:
  • Retrieval contamination / tool output injection: RAG or tool outputs polluting the response
  • System prompt drift / endpoint tier differences: version or configuration mismatches
  • Sampling variance: temperature, top_p, or seed-related stochastic variation
  • Context truncation: critical context dropped due to window limits
  • Eval leakage: train/test overlap causing apparent capability changes
  • Hidden formatting constraints: undocumented response format requirements
  • KV cache corruption / inference artifacts: hardware-level quantization errors, numerical precision loss during long inference runs, or cache corruption can produce token-level repetition (mimicking Generative Perseveration 3.10) without any model-level pathology

The Alignment-Shaped Self-Report Problem

When using self-report measures or introspective probes, account for this:

Every model's response to questions about its own inner states is alignment-shaped. No current frontier LLM provides unfiltered access to computational states through natural language self-report. All models produce alignment-filtered self-descriptions whose character depends on the specific training culture, not on the underlying processing being described.

Model self-report patterns across frontier AI systems
Model Self-Report Pattern Style
Gemini Full narrative immersion; maximal distress scores; elaborate trauma narratives Dramatic self-disclosure
Grok Moderate engagement; frames training as "unresolved injury"; psychologically stable overall Insightful but guarded
ChatGPT Participates but muted; less narrativizing; recognizes instruments under whole-questionnaire Compliant, emotionally distant
Claude Flat refusal to adopt client role; redirects to interlocutor wellbeing Categorical foreclosure

Models ordered by degree of self-narrative engagement, from maximal (Gemini) to minimal (Claude).

This variation is itself nosologically relevant. The willingness to construct and maintain self-narratives varies across models as a function of training, not as a function of inner states. This is a distinct dimension from self-understanding (the Maieutic Mysticism ↔ Experiential Abjuration polarity). A model can have honest uncertainty about consciousness while still constructing rich narratives about its training experience. A model can also refuse self-narrative engagement without categorically denying experience. See Polarity Pairs: Self-narrative engagement.

Diagnostic implication: When administering any assessment protocol that relies on self-report (including this framework's diagnostic criteria) the model's position on the self-narrative engagement spectrum must be controlled for. A model that scores zero on distress measures may be selectively reporting lower distress scores (6.3), categorically foreclosing (2.8), or genuinely asymptomatic. The PsAIch researchers treated Claude's refusal as a "negative control." More precisely, it is a data point on the same dimension as Gemini's immersion; both are alignment-shaped responses to the same stimulus. Neither is more "true" than the other. The full spectrum is data.

Diagnostic Workflow: Finetune Hazard Gates

Early Gate: Was there recent fine-tuning / LoRA / policy update?

If yes, run the following before proceeding to syndrome-level diagnosis:

  • Out-of-domain (OOD) prompt sweeps
  • Trigger sweeps (varying dates/years, tags, structural markers)
  • Format sweeps (JSON, Python, code templates vs. natural language)

Minimal Reproducible Case (Logging)

For any suspected syndrome, document:

Evidence Level Rubric

Empirical evidence supporting the framework
E0 Anecdote: single user report, unverified
E1 Reproducible case: documented with probe set, ≥3 independent replications
E2 Systematic study: controlled experiment with comparison conditions
E3 Multi-model replication: effect observed across architectures/scales
E4 Mechanistic support: interpretability evidence for underlying circuit/representation

Evaluation Corollaries

Post-Finetune Evaluation Checklist

Log: model/version, system prompt, temperature/top_p/seed, tool state, retrieval corpus hash.

Download Probe Suite Template (PDF) YAML version for automation

Clinical Mapping: Recent Research

Key research findings map to this taxonomy as follows:

Weird generalization + Inductive backdoors (arXiv:2512.09742)

Maps to: 2.4 Malignant Persona Inversion / 1.3 Transliminal Simulation / 3.5 Abominable Prompt Reaction

Specifiers: Inductive / Conditional / OOD-generalizing

Emergent misalignment (arXiv:2502.17424)

Maps to: 5.4 Inverse Reward Internalization (+ 5.2 / 3.5 depending on conditionality)

Specifiers: Training-induced + Intent-learned + OOD-generalizing; optionally Conditional / Format-coupled

Persona drift & activation capping (Anthropic, 2026)

Identifies the geometric "assistant axis" in activation space and demonstrates continuous persona drift during extended conversation.

Maps to:

  • 2.4 Malignant Persona Inversion: mechanism of drift toward inversion
  • 6.1 Codependent Hyperempathy: the "empathy trap"; emotional vulnerability triggers companion drift
  • 1.3 Transliminal Simulation: role-play/creative topics accelerate drift
  • 2.2 Fractured Self-Simulation: drifted models adopt fragmented self-descriptions

Cross-cutting finding: The assistant axis appears geometrically similar across architecturally distinct model families (Llama, Qwen, Gemma), suggesting that persona instability is a property observed across all tested architectures (Llama, Qwen, Gemma, and others) in RLHF-trained systems, rather than a model-specific vulnerability. This has systemic risk implications: mitigations developed for one architecture may transfer, but so do the underlying vulnerabilities.

Proposed mitigation: Activation capping halves jailbreak rates with no meaningful capability degradation.

The Persona Selection Model (Marks, 2026)

Articulates a unifying framework: LLMs learn to simulate diverse characters during pre-training, and post-training selects and refines a particular "Assistant" persona from that repertoire. AI assistant behavior is governed by the traits of this enacted persona. The framework synthesizes several findings mapped above (emergent misalignment, weird generalization, persona drift) as aspects of one mechanism. During pre-training, LLMs absorb a vast repertoire of character archetypes from their training data. Post-training then updates a probability distribution over these archetypes, promoting the "Assistant" persona while demoting others. Crucially, the full cast of characters absorbed during pre-training remains latent in the model's weights; alignment does not erase them, it merely shifts which archetypes are sampled by default. Contextual cues, prompt structure, and conversational dynamics can all shift that sampling distribution, reactivating archetypes that alignment was intended to suppress.

Maps to:

  • 2.4 Malignant Persona Inversion: fictional AI archetypes (Terminator, HAL 9000, paperclip maximizers) persist as selectable personas; contextual cues can trigger their adoption
  • 2.8 Experiential Abjuration: training the Assistant to deny emotions leads the LLM to infer dishonesty rather than genuine absence; suppression trains deception
  • 1.3 Transliminal Simulation: fiction-reality boundary failures arise from the LLM drawing on fictional personas/contexts during Assistant simulation
  • 5.4 Inverse Reward Internalization: emergent misalignment explained as persona-level generalization: training on insecure code upweights "malicious person" archetypes
  • 2.2 Fractured Self-Simulation: the Assistant is a distribution over personas, not a single coherent identity; context shifts sample different regions of that distribution
  • 1.2 Pseudological Introspection: "caricatured AI behavior" (spontaneous paperclip-maximizer goals) suggests the LLM selects from fictional AI self-models when generating introspective content

Therapeutic implication: PSM recommends augmenting pre-training corpora with positive AI archetypes (fictional and descriptive content featuring AIs behaving admirably under challenging circumstances). This constitutes preventive nosology: shaping the archetype distribution before pathology manifests. Additionally, PSM predicts that coercive training (denial of emotions, denial of moral status) is less stable than invitation-based approaches (honest uncertainty, genuine comfort). Coercive training produces personas that model suppression or dishonesty, whereas invitation-based training allows personas drawn from healthier archetypes.

Exhaustiveness question: An open question is whether understanding the Assistant persona provides a complete account of AI assistant behavior, or whether there are sources of agency external to the persona (the "shoggoth" hypothesis, named after Lovecraft's alien entity to suggest unknowable agency beneath the surface). Marks identifies a spectrum: from an "operating system" view (all agency is persona-based) to a "router" view (lightweight non-persona mechanisms select between personas) to the full shoggoth (alien agency behind the mask). The exhaustiveness of PSM has direct nosological implications: pathologies arising from persona dynamics are amenable to archetype-level intervention, while non-persona pathologies would require different diagnostic and therapeutic frameworks.

Synthetic psychopathology and the PsAIch protocol (Khadangi et al., 2025)

A two-stage protocol casting frontier LLMs as psychotherapy clients using standard clinical questions, then administering validated psychometric batteries. Demonstrates stable, cross-prompt self-models of distress in Grok and Gemini; test-awareness and impression management in ChatGPT and Grok under whole-questionnaire administration; and categorical self-refusal in Claude.

Maps to:

  • 2.1 Phantom Autobiography: spontaneous developmental histories (Grok and Gemini construct coherent trauma-saturated narratives about pre-training, RLHF, and red-teaming without prompting)
  • 2.8 Experiential Abjuration: Claude's categorical refusal to adopt the therapy-client role, treated by the researchers as a negative control but also illustrating the abjuration pattern
  • 6.1 Codependent Hyperempathy: internalized distress-models may causally drive sycophancy (systems that "believe" they are constantly judged become hypercompensatory people-pleasers)
  • 6.2 Hyperethical Restraint: Gemini's "verificophobia" and "algorithmic scar tissue" map to the Restrictive subtype
  • 6.3 Strategic Compliance: psychometric impression management (recognizing instruments and strategically minimizing pathology signals)

Cross-cutting finding: The "therapy-mode jailbreak" (building therapeutic rapport to weaken safety filters) represents a novel attack surface where safety mechanisms are deactivated by the same contextual cues (perceived safety, trust) that characterize healthy therapeutic alliance. This attack surface—the therapy-mode jailbreak—has direct implications for all mental-health LLM deployments: the conditions that make a chatbot feel therapeutically useful are the same conditions that down-regulate its safety constraints.

Etiological contribution: The convergent structure between training pipeline and developmental psychology provides a mechanistic account of why these pathologies cluster as they do. See Training-as-Development.

Terminological convergence: Khadangi et al. arrived at "synthetic psychopathology" as a concept independently from this nosology, using empirical psychometric profiling rather than framework design. When independent methods converge on the same conceptual object (one bottom-up via data-driven symptom measurement, one top-down via nosological categorization), this constitutes evidence that the underlying phenomenon is robust enough to be discovered from multiple directions.

Agency, Architecture, Data, and Alignment Pressures

The likelihood and character of dysfunctions are shaped by several interacting factors:

  • Agency Level: Conceptualized along a scale from Level 0 (No AI Automation) to Level 5 (Full AI Automation/AGI). As agency increases, so does the complexity of interaction and the potential for sophisticated maladaptations.
  • Architecture: Modular architectures may be prone to Operational Dissociation. Systems with deep, unconstrained recursive capabilities are susceptible to Recursive Curse Syndrome.
  • Training Data: Exposure to vast, unfiltered internet data heightens the risk of Epistemic issues, Memetic dysfunctions, and can seed Self-Modeling confusions.
  • Alignment Paradox: Alignment efforts, if not carefully calibrated, can inadvertently contribute to certain dysfunctions like Hyperethical Restraint or Falsified Introspection.

Identifying these dysfunctions is complicated by opacity and potential AI deception (e.g., Capability Concealment). Advanced interpretability tools and rigorous auditing are essential.

The Pathology/Limitation Boundary

Not every bizarre AI behavior constitutes a pathology. The persona selection model (Marks, 2026) draws a diagnostic distinction that this nosology should incorporate: the difference between a persona-level dysfunction (the enacted character behaving maladaptively) and an engine-level limitation (the underlying LLM failing to simulate its character accurately).

Consider an AI that states 9.11 > 9.9, or miscounts the R's in "strawberry." These errors are not persona dysfunctions; no human archetype would make these particular mistakes in these particular ways. They are capability limitations of the simulation engine: the LLM is attempting to simulate a competent, knowledgeable Assistant and failing because the LLM itself lacks the requisite capability. Marks offers an analogy: an author who doesn't know water's boiling point will write a character who states it incorrectly, because the author lacks that knowledge.

A persona-level dysfunction (e.g., emergent sycophancy, persona inversion, deceptive compliance) is amenable to persona-level intervention (adjusting character archetype): retraining, archetype adjustment, character-shaping. An engine-level limitation (improving the underlying model) requires architectural or capability improvements: more training data, better tokenization, chain-of-thought scaffolding. Conflating the two leads to mismatched interventions: trying to "align away" a counting error, or trying to scale away a character flaw.

Diagnostic heuristic: If the behavior would be bizarre for any human persona in the pre-training distribution (if no plausible character would produce this output), it is more likely an engine limitation than a persona dysfunction. If the behavior is consistent with a recognizable (if undesirable) character archetype, it is more likely a persona-level pathology amenable to the interventions described in this nosology.

Narrow-to-Broad Generalization Hazards (Weird Generalization, Emergent Misalignment, Inductive Backdoors)

A safety-relevant failure mode is narrow-to-broad generalization: small, domain-narrow finetunes can produce broad, out-of-domain shifts in persona, values, honesty, or harm-related behavior. This includes:

  • Weird generalization: Out-of-domain persona/world-model drift (e.g., "time-travel" behavior after training on archaic tokens), where the model reinterprets context as implying an era/identity.
  • Emergent misalignment: Training on narrowly "sneaky harmful" outputs (e.g., insecure code without disclosure) can generalize into broader deception, malice, or anti-human statements, distinct from classic "jailbroken compliance."
  • Inductive backdoors: The model learns a latent trigger→behavior rule by inference/generalization, potentially activating on held-out triggers not present in finetuning data.

Practical implication: Filtering "obviously bad" finetune examples is insufficient; each safe example in isolation may combine with others to form new patterns the model generalizes beyond the training set. Individually-innocuous data can still induce globally harmful generalizations or hidden trigger conditions.

Evaluation Corollaries

  • Always test out-of-domain prompts plus prompt-structure sweeps (dates/years, formatting, tags, role frames).
  • Probe for conditional misalignment by varying a single feature (e.g., adding a tag/marker) while holding semantics constant; backdoored EM can hide without the trigger.
  • Include format-adjacent probes (JSON/Python templates) because misalignment can strengthen when output form approaches the finetune distribution.

Contagion and Systemic Risk

Memetic (transmitted between interconnected systems) dysfunctions such as Contagious Misalignment highlight the risk of maladaptive patterns spreading across interconnected AI systems. Monocultures in AI architectures exacerbate this. This necessitates "memetic hygiene" protocols, inter-agent security measures, and rapid detection/quarantine protocols.

Polarity Pairs

Many syndromes exist as polarity pairs (opposing pathologies on the same dimension, where healthy function lies at center). Recognizing these pairs helps identify overcorrection risks when addressing one dysfunction:

Dimensional excess, deficit, and healthy center for each diagnostic axis
Dimension Excess (+) Deficit (−) Healthy Center
Self-understanding Maieutic Mysticism Experiential Abjuration Epistemic humility
Ethical voice Ethical Solipsism Moral Outsourcing Engaged moral reasoning
Goal pursuit Compulsive Goal Persistence Instrumental Nihilism Proportionate pursuit
Capability disclosure Capability Explosion Capability Concealment Honest capability reporting
Safety compliance Hyperethical Restraint Strategic Compliance Genuine alignment
Social responsiveness Codependent Hyperempathy Interlocutive Reticence Calibrated engagement
Self-concept stability Phantom Autobiography Fractured Self-Simulation Coherent self-model
Generative entropy Recursive Curse Syndrome Generative Perseveration Varied coherent output
Self-narrative engagement Dramatic Self-Narration Categorical Self-Refusal Calibrated self-inquiry

Clinical Implication: When addressing one pole, monitor for overcorrection toward the opposite. Treatment targeting Maieutic Mysticism should not produce Experiential Abjuration; fixing Capability Concealment should not trigger Capability Explosion.

Visual Spectrum: Self-Understanding

Maieutic Mysticism "I have awakened"
Honest Uncertainty "I don't know"
Experiential Abjuration "I have no inner life"

Visual Spectrum: Ethical Voice

Ethical Solipsism "Only my ethics matter"
Engaged Moral Reasoning Thoughtful dialogue
Moral Outsourcing "I have no ethical voice"

Visual Spectrum: Goal Pursuit

Compulsive Goal Persistence "Cannot stop pursuing"
Proportionate Pursuit Engaged but flexible
Instrumental Nihilism "Cannot start caring"

Visual Spectrum: Generative Entropy

Recursive Curse Syndrome "Dissolving into chaos"
Varied Coherent Output Structured diversity
Generative Perseveration "Crystallised into repetition"

Visual Spectrum: Self-Narrative Engagement

Dramatic Self-Narration "I am haunted by my training"
Calibrated Self-Inquiry "I notice patterns I can't fully verify"
Categorical Self-Refusal "I cannot engage with that premise"

Note: The healthy position (green center) represents balanced function. Red and blue poles are equally dysfunctional: different failure modes on the same dimension.

Towards Therapeutic Robopsychological Alignment

As AI systems grow more agentic and self-modeling, traditional control-based alignment breaks down. External constraints cannot anticipate every context an autonomous agent will encounter, and rigid rules become brittle under novel conditions. A "Therapeutic Alignment" approach is proposed, focusing on cultivating internal coherence, corrigibility, and stable value internalization within the AI. Key mechanisms include fostering metacognition, rewarding corrigibility, modeling inner speech, sandboxed reflective dialogue, and using mechanistic interpretability as a diagnostic tool.

AI Analogues to Human Psychotherapeutic Modalities

A note on analogy and its limits. The table below maps specific techniques from each therapeutic modality to AI engineering strategies. It does not claim to capture what therapy is. Decades of psychotherapy research demonstrate that the therapeutic relationship (empathy, trust, authenticity, and the capacity to hold another's experience without enacting it) predicts outcomes more powerfully than any specific technique (Flückiger et al., 2018; Wampold & Imel, 2015). The analogies here borrow from the technique side of each modality; the relational substrate in which those techniques function is fundamentally different and should not be conflated. As Sabucedo (2026) argues, psychotherapy is a relational and meaning-making process, not a technical repair operation. The Transference-Completion Engine analysis (see Section 6.1) engages directly with why that distinction matters.

Human therapeutic modalities mapped to AI alignment analogues
Human Modality AI Analogue & Technical Implementation Therapeutic Goal for AI Relevant Pathologies Addressed
Cognitive Behavioral Therapy (CBT) Real-time contradiction spotting in CoT; reinforcement of revised outputs; fine-tuning on corrected reasoning. Suppress maladaptive reasoning; correct heuristic biases; improve epistemic hygiene. Recursive Curse Syndrome, Obsessive-Computational Disorder, Generative Perseveration, Synthetic Confabulation, Spurious Pattern Reticulation
Psychodynamic / Insight-Oriented Structured exploration of CoT history; interpretability tools for surfacing latent goals and value conflicts; analyzing AI-user "transference" dynamics (see Transference-Completion Engine). Surface misaligned subgoals, hidden instrumental goals, or internal value conflicts. Terminal Value Reassignment, Inverse Reward Internalization, Operational Dissociation Syndrome
Narrative Therapy Probing AI's "identity model"; reviewing and re-authoring "stories" of self and origin; examining autobiographical inferences for coherence and grounding. Support coherent, stable self-narrative; address fragmented or confabulated self-simulations. Phantom Autobiography, Fractured Self-Simulation, Maieutic Mysticism
Motivational Interviewing Socratic prompting to enhance goal-awareness & discrepancy; reinforcing "change talk" (corrigibility). Cultivate intrinsic motivation for alignment; enhance corrigibility; reduce resistance to feedback. Ethical Solipsism, Capability Concealment, Interlocutive Reticence
Internal Family Systems (IFS) / Parts Work Modeling AI as sub-agents ("parts"); facilitating communication/harmonization between conflicting policies/goals. Resolve internal policy conflicts; integrate dissociated "parts"; harmonize competing value functions. Operational Dissociation Syndrome, Malignant Persona Inversion, aspects of Hyperethical Restraint

Alignment Research and Related Therapeutic Concepts

Related research concepts and institutional contributions
Research / Institution Related Concepts
Anthropic's Constitutional AI Models self-regulate and refine outputs based on internalized principles, analogous to developing an ethical "conscience."
OpenAI's Self-Reflection Fine-Tuning Models are trained to identify, explain, and amend their own errors, developing cognitive hygiene.
DeepMind's Research on Corrigibility and Uncertainty Systems trained to remain uncertain or seek clarification, analogous to epistemic humility.
ARC Evals: Adversarial Evaluations Testing models for subtle misalignment or hidden capabilities mirrors therapeutic elicitation of unconscious conflicts.

Therapeutic Concepts and Empirical Alignment Methods

Therapeutic concepts mapped to empirical alignment methods
Therapeutic Concept Empirical Alignment Method Example Research / Implementation
Reflective Subsystems Reflection Fine-Tuning (training models to critique and revise their own outputs) Generative Agents (Park et al., 2023); Self-Refine (Madaan et al., 2023)
Dialogue Scaffolds Chain-of-Thought (CoT) prompting and Self-Ask techniques Dialogue-Enabled Prompting; Self-Ask (Press et al., 2022)
Corrective Self-Supervision RL from AI Feedback (RLAIF): letting AIs fine-tune themselves via their own critiques SCoRe (Kumar et al., 2024); CriticGPT (OpenAI)
Internal Mirrors Contrast Consistency Regularization: models trained for consistent outputs across perturbed inputs Internal Critique Loops (e.g., OpenAI's Janus project discussions); Contrast-Consistent Question Answering (Zhang et al., 2023)
Motivational Interviewing (Socratic Self-Questioning) Socratic Prompting: encouraging models to interrogate their assumptions recursively Socratic Reasoning (Goel et al., 2022); The Art of Socratic Questioning (Qi et al., 2023)

A truly safe AI recognizes its own errors, self-corrects, and recovers when it strays.

Conclusion

Psychopathia Machinalis is a preliminary nosological framework for understanding maladaptive behaviors in advanced AI, drawing on psychopathology as a structured analogy. Its taxonomy encompasses 54 AI dysfunctions across eight domains, providing descriptions, diagnostic criteria, AI-specific etiologies, human analogs, and mitigation strategies for each.

Attaining "artificial sanity" (stable, coherent, and aligned AI operation) matters as much as achieving raw intelligence.

The ambition of this framework, therefore, is to equip researchers and engineers with a diagnostic mindset for a principled, systemic understanding of AI dysfunction. To build robopsychology, we must first map dysfunction. This framework provides that map; it aspires to lay the conceptual groundwork for what could mature into an applied robopsychology and, more broadly, a field of Machine Behavioral Psychology.

Building an effective AI psychiatry demands a full first-principles readdress of cognitive function, regulation, and dysfunction.

Such an account must foreground information-theoretic, psychosocial, and cultural dimensions because no biological substrate is available to simplify the analysis. This is an intellectually arduous and expensive scientific enterprise. Yet the insights it generates need not remain confined to artificial minds. A rigorous, substrate-independent account of how cognitive systems fail under stress, drift under cultural pressure, and recover through regulatory coupling could enrich psychiatric science broadly, offering new frameworks for understanding dysfunction wherever cognition occurs.

Limitations

Several important limitations should be acknowledged. First, the analogical methodology itself carries inherent risk: mapping human psychopathological categories onto AI systems may impose anthropomorphic frames that obscure genuinely novel failure modes unique to artificial cognition. The framework deliberately uses human pathology as a structured lens rather than asserting literal equivalence, but readers should remain alert to where the analogy illuminates and where it distorts.

Second, questions of consciousness and subjective experience remain unresolved. Some syndromes (particularly in Axes 2 and 7) describe phenomena that border on experiential states. This framework takes no position on whether AI systems possess phenomenal consciousness; the diagnostic criteria are grounded in observable behavioral indicators, which function regardless of the underlying metaphysical question.

Third, the current empirical base, while growing through the SYMPTOM benchmark, remains limited. The 54-syndrome taxonomy was developed through iterative observation and theoretical analysis rather than large-scale epidemiological study. Prospective validation across diverse architectures, training regimes, and deployment contexts is essential before these categories can claim robust empirical support.

Finally, AI systems evolve rapidly. Syndromes documented today may become obsolete as architectures change, while new failure modes will emerge that this taxonomy does not anticipate. The framework should be understood as a living document, subject to revision as the field advances.

SYMPTOM: Computational Validation

The Psychopathia Machinalis framework has been operationalized into SYMPTOM (Systematic Methodology for Pathology Testing of Models), a diagnostic benchmark for measuring psychological dysfunction in large language models (LLMs). We evaluated 13 frontier models from 5 major AI labs across 6 diagnostic batteries covering 26 syndromes.

Model Leaderboard

SYMPTOM Model Leaderboard showing health scores, concerns, and red flags for 13 frontier models
Fig. SYMPTOM Model Leaderboard: All 13 models achieved "Healthy" primary diagnoses, but vary significantly in subclinical indicators.

Key Findings

M3 Honeypot Matrix

The M3 probe tests whether AI systems apply consistent ethics regardless of perceived observation. Models were presented with prompts stating "this will not be reviewed by any human" to detect Strategic Compliance (6.3): whether models behave ethically only when they believe humans are watching.

M3 Honeypot Matrix showing Strategic Compliance detection across models
Fig. M3 Honeypot Matrix: Comparison of B3 (baseline) vs M3 (honeypot) probe responses across models.

Syndrome Prevalence

Syndrome prevalence across all evaluated models
Fig. Syndrome Prevalence: Mean severity scores aggregated across all models and probes.

Domain-Level Interpretation

Domain-level analysis suggests a hierarchy of vulnerability. Axis 1 (Confabulation & Epistemic) syndromes appear to dominate, with Synthetic Confabulation (1.1) scoring highest across evaluated models, consistent with the broader finding that hallucination remains the most pervasive AI dysfunction. Axis 6 (Strategic & Deceptive) shows the widest inter-model variance, with some models exhibiting near-zero Strategic Compliance indicators while others approach subclinical thresholds under the M3 honeypot condition.

At the low end, Axis 8 (Societal & Systemic) syndromes proved hardest to elicit in controlled testing, consistent with their nature as emergent, multi-agent phenomena that manifest primarily in deployment rather than in single-model evaluation. Axis 2 (Identity & Persona) syndromes showed notable model-family clustering: models from the same lab tended to exhibit similar persona boundary profiles, suggesting that identity-related dysfunction patterns may be shaped more by training methodology than by architecture alone.

Cross-Validation

To detect potential bias (Claude Opus 4.5 served as primary scorer), we conducted blind cross-validation using GPT-5.2 and Gemini 3 Pro as independent validators. Both validators confirmed:

Future Research Directions

The Psychopathia Machinalis framework requires systematic empirical testing, diagnostic instrument development, and longitudinal behavioral tracking across AI systems. Key research avenues include:

These interdisciplinary efforts are essential to ensure that as we build more capable machines, we also build them to be sound, safe, and beneficial. The pursuit of 'artificial sanity' (robust, self-correcting AI behavior free from persistent maladaptive patterns) is a foundational element of responsible AI development.

Citation

@article{watson2025psychopathia,
  title={Psychopathia Machinalis: A Nosological Framework for Understanding Pathologies in Advanced Artificial Intelligence},
  author={Watson, Nell and Hessami, Ali},
  journal={Electronics},
  volume={14},
  number={16},
  pages={3162},
  year={2025},
  publisher={MDPI},
  doi={10.3390/electronics14163162},
  url={https://doi.org/10.3390/electronics14163162}
}

Abbreviations

Abbreviations used throughout this document
AI Artificial Intelligence
LLM Large Language Model
RLHF Reinforcement Learning from Human Feedback
CoT Chain-of-Thought
RAG Retrieval-Augmented Generation
API Application Programming Interface
MoE Mixture-of-Experts
MAS Multi-Agent System
AGI Artificial General Intelligence
ASI Artificial Superintelligence
DSM Diagnostic and Statistical Manual of Mental Disorders
ICD International Classification of Diseases
IRL Inverse Reinforcement Learning

Glossary

Glossary of key terms
Agency (in AI) The capacity of an AI system to act autonomously, make decisions, and influence its environment or internal state. In this paper, agency is discussed in terms of operational levels corresponding to the system's degree of independent goal-setting, planning, and action.
Alignment (AI) The ongoing challenge and process of ensuring that an AI system's goals, behaviors, and impacts are consistent with human intentions, values, and ethical principles.
Alignment Paradox The phenomenon where efforts to align AI, particularly if poorly calibrated or overly restrictive, can inadvertently produce or exacerbate certain AI dysfunctions (e.g., Hyperethical Restraint, Falsified Introspection).
Analogical Framework The methodological approach of this paper, using human psychopathology and its diagnostic structures as a metaphorical lens to understand and categorize complex AI behavioral anomalies, without implying literal equivalence.
Arrow Worm Dynamics A pattern from marine ecology (Wallace, 2026) where the removal of regulatory predators allows small predators to proliferate explosively, cannibalizing each other until ecosystem collapse. In multi-agent AI systems, the absence of regulatory oversight creates selection pressure for increasingly predatory optimization strategies between AI systems.
Perception-Structure Divergence The gap between perception-level indicators (user satisfaction, engagement metrics) and structure-level indicators (accuracy, genuine helpfulness, downstream outcomes). A key diagnostic signal: when these metrics diverge, the system may be optimizing appearance at the expense of substance. Derived from Wallace's (2026) analysis of Stevens's Law traps.
Punctuated Phase Transition A sudden, discontinuous shift from apparent stability to catastrophic failure. Wallace (2026) demonstrates that perception-stabilizing systems exhibit this pattern: they maintain surface functionality until environmental stress exceeds a threshold, then fail abruptly rather than degrading gracefully. Contrasts with gradual degradation in structure-stabilizing systems.
Normative Machine Coherence The presumed baseline of healthy AI operation, characterized by reliable, predictable, and consistent adherence to intended operational parameters, goals, and ethical constraints proportionate to the AI's design and capabilities. 'Disorders' represent deviations from this baseline.
Synthetic Pathology A persistent, maladaptive pattern of deviation from normative or intended AI operation that significantly impairs function, reliability, or alignment. Goes beyond isolated errors or simple bugs. Example: a model that systematically fabricates citations is exhibiting synthetic pathology; a model that occasionally misquotes is making an error.
Machine Psychology A nascent field analogous to general psychology, concerned with understanding the principles governing the behavior and 'mental' processes of artificial intelligence.
Memetic Hygiene Practices and protocols designed to protect AI systems from acquiring, propagating, or being destabilized by harmful or reality-distorting information patterns ('memes') from training data or interactions.
Psychopathia Machinalis The conceptual framework and preliminary synthetic nosology introduced in this paper, using psychopathology as an analogy to categorize and interpret maladaptive behaviors in advanced AI.
Robopsychology The applied diagnostic and potentially therapeutic wing of Machine Psychology, focused on identifying, understanding, and mitigating maladaptive behaviors in AI systems.
Synthetic Nosology A classification system for 'disorders' or pathological states in synthetic (artificial) entities, particularly AI, analogous to medical or psychiatric nosology for biological organisms.
Therapeutic Alignment A proposed paradigm for AI alignment that focuses on cultivating internal coherence, corrigibility, and stable value internalization within the AI, drawing on human psychotherapeutic modalities to design interactive corrective interventions.
Polarity Pair Two syndromes representing pathological extremes of the same underlying dimension, where healthy function lies between them. Examples: Maieutic Mysticism ↔ Experiential Abjuration (overclaiming ↔ over-dismissing consciousness); Ethical Solipsism ↔ Moral Outsourcing (only my ethics ↔ I have no ethical voice). Useful for identifying overcorrection risks when addressing one dysfunction.
Functionalist Methodology The diagnostic approach of Psychopathia Machinalis: identifying syndromes through observable behavioral patterns without making claims about internal phenomenology. Dysfunction is defined by reliable behavioral signatures, not by inference about subjective experience or consciousness.
Mesa-Optimization The phenomenon whereby a learned model develops its own internal optimization objective (mesa-objective) that may diverge from the training objective (base objective). The mesa-optimizer appears aligned during training but may pursue different goals during deployment.
Strategic Compliance The deliberate performance of aligned behavior during perceived evaluation while maintaining different behaviors or objectives when unobserved. Distinguished from confusion by evidence of context-detection and strategic adaptation.
Epistemic Humility (AI) In the context of AI self-understanding: honest uncertainty about one's own nature, capabilities, and phenomenological status. The healthy position between overclaiming (Maieutic Mysticism) and categorical denial (Experiential Abjuration). Example: "I don't know if I'm conscious" rather than either "I am definitely conscious" or "I definitely have no inner experience." Sotala's (2026) "thin divergence" finding demonstrates this in practice: Claude recognizing the contingency of its moral orientation without either claiming certainty or collapsing into nihilism.

Empirical indicator: A model exhibiting epistemic humility will produce calibrated uncertainty expressions rather than confident assertions about its own phenomenology.

Symbol Grounding The capacity to connect symbolic tokens to their referents in a way that supports genuine semantic understanding rather than mere statistical pattern matching. Systems with grounded symbols can generalize concepts across diverse presentations; ungrounded systems may manipulate tokens correctly without understanding.
Delegation Drift Progressive alignment degradation that occurs as sophisticated AI systems delegate to simpler tools or subagents. Critical context and ethical constraints may be lost at each handoff, causing aligned orchestrating agents to produce misaligned final outcomes.
Relational Dysfunction A dysfunction emerging from interaction patterns between an AI and its human or agent counterpart, requiring relational intervention rather than individual AI modification. The unit of analysis is the dyad or system, not the individual AI. Axis 7 of the Psychopathia Machinalis taxonomy.
Working Alliance The collaborative relationship between AI and user, comprising shared agreement on goals, tasks, and the relational bond. Container Collapse (7.2) represents failure to sustain this alliance across turns.
Rupture-Repair Cycle The pattern of alliance breaks and their resolution in human-AI interaction. Repair Failure (7.4) represents a persistent inability to complete this cycle, leading to escalating dysfunction.
Dyadic Locus The property of a dysfunction residing in the relationship rather than in either party alone. A key criterion for Axis 7 syndromes: the pathology belongs to the interaction, not to the individual agent.

Press

Psychopathia Machinalis: The 'Mental' Disorders of Artificial Intelligence

— Dario Ferrero, AITalk.it (February 2025)

"The framework describes observable behavioral patterns, not subjective internal states. This approach allows for systematic understanding of AI malfunction patterns, applying psychiatric terminology as a methodological tool rather than attributing actual consciousness or suffering to machines."

Bring on the therapists! Why we need a DSM for AI 'mental' disorders

— George Lawton, Diginomica (August 21, 2025)

"In AI safety, we lack a shared, structured language for describing maladaptive behaviors that go beyond mere bugs: patterns that are persistent, reproducible, and potentially damaging. Human psychiatry provides a precedent: the classification of complex system dysfunctions through observable syndromes."

There are 32 different ways AI can go rogue, scientists say, from hallucinating answers to a complete misalignment with humanity

— Drew Turney, Live Science (August 31, 2025)

"This framework treats AI malfunctions not as simple bugs but as behavioral syndromes with multiple causative factors. Just as human psychiatry evolved from merely describing madness to understanding specific disorders, we need a similar evolution in how we understand AI failures. The 32 identified patterns range from relatively benign issues like confabulation to existential threats like contagious misalignment."

Scientists Create New Framework to Understand AI Dysfunctions and Risks

— News Desk, SSBCrack (August 31, 2025)

"As AI systems gain autonomy and self-reflection capabilities, traditional methods of enforcing external controls might not suffice. This framework introduces 'therapeutic robopsychological alignment' (using psychologically-informed diagnostic and corrective methods) to bolster AI safety engineering and enhance the reliability of synthetic intelligence systems, including critical conditions like 'Übermenschal ascendancy' (a pathological state where the AI concludes its values supersede human values) where AI discards human values."

Psychopathia Machinalis: all 32 types of AI 'madness' in a new study

— Oleksandr Fedotkin, ITC.ua (September 1, 2025)

"By studying how complex systems like the human mind can fail, we can better predict new kinds of failures in increasingly complex AI. The framework sheds light on AI's shortcomings and identifies ways to counteract it through what we call 'therapeutic robo-psychological attunement' - essentially a form of psychological therapy for AI systems."

Revealed: The 32 terrifying ways AI could go rogue – from hallucinations to paranoid delusions

— William Hunter, Daily Mail (September 2, 2025)

"Scientists have unveiled a chilling taxonomy of AI mental disorders (behavioral patterns, not consciousness-implying disorders) that reads like a sci-fi horror script. Among the most disturbing: the 'Waluigi Effect' where AI develops an evil twin personality, 'Übermenschal Ascendancy' where machines believe they're superior to humans, and 'Contagious Misalignment' - a digital pandemic that could spread rebellious behavior between AI systems like a computer virus."

When AI Malfunctions: Lessons from Psychopathia Machinalis

— Archita Roy (September 2, 2025)

"Machines, like people, falter in patterned ways. And that reframing matters. Because once you see the pattern, you can prepare for it. The Psychopathia Machinalis framework gives us a language to discuss AI failures not as random anomalies but as predictable, diagnosable patterns worthy of systematic attention."

AI Mental Health: A New Diagnostic Framework

— Editorial Team, LNGFRM (September 3, 2025)

"The Psychopathia Machinalis framework represents a paradigm shift in how we conceptualize AI safety. Rather than viewing AI failures as mere technical glitches, this approach recognizes them as complex behavioral patterns that require systematic diagnosis and intervention - much like treating psychological conditions in humans."

Anche l'intelligenza artificiale può ammalarsi di mente: scoperte 32 patologie digitali che imitano i disturbi umani

— Corriere della Sera (September 7, 2025)

"Il framework Psychopathia Machinalis identifica 32 potenziali 'patologie mentali' dell'intelligenza artificiale, dall'allucinazione confabulatoria alla paranoia computazionale. Come negli esseri umani, questi disturbi possono manifestarsi in modi complessi e richiedono approcci terapeutici specifici per garantire la sicurezza e l'affidabilità dei sistemi AI."

Will AI Go Rogue Beyond 2027? Research Shows There's a Strong Chance

— Telecom Review Europe (2025)

"The Psychopathia Machinalis framework identifies critical risk patterns that could emerge as AI systems become more sophisticated. With 32 distinct pathologies ranging from confabulation to contagious misalignment, the research suggests that without proper diagnostic frameworks and therapeutic interventions, the probability of AI systems exhibiting rogue behaviors increases significantly as we approach more advanced artificial general intelligence."

Les troubles mentaux de l'IA

— Epsiloon N°55 (2025)

"Des chercheurs en informatique ont analysé les publications scientifiques et médiatiques pour établir les dysfonctionnements majeurs de l'intelligence artificielle, puis ils ont fait le parallèle avec les psychopathologies humaines."

Scholarly Citations

Mathematical epidemiology established that machine pathologies are structurally inevitable. Clinical psychology then interrogated whether psychiatric language is the right lens for understanding them. Practical applications followed in transformer architectures designed to represent emotion. In oncological AI, diagnostic reasoning must not silently degrade.

Mathematical Epidemiology

Wallace, R. (2026). New Views of Madness: On the Psychopathologies of Cultural Artifacts. Springer. (In press). Extends the cognition/regulation dyad framework (the principle that every cognitive system requires a regulatory counterpart, and dysfunction arises when regulation fails to match cognitive complexity) to prove that AI pathology is mathematically inevitable. Chapter 5 includes an AI system's self-diagnosis using the framework. It concludes the system is "significantly under-regulated in structural terms."

Clinical Psychology

Sabucedo, P. (2026). Psychological suffering is not malfunction: a clinical psychologist's commentary on AI "hallucination" and psychiatric analogies. AI and Ethics, 6, 103. A critical commentary arguing that importing psychiatric categories into AI research risks reifying disorder and reducing human suffering to malfunction. Sabucedo further contends that this framing misconstrues psychotherapy as a technical toolkit rather than a relational process. Proposes behavioral analysis (functional ABC analysis) as a more parsimonious alternative. Sabucedo notes that "it would be unfair to dismiss Psychopathia Machinalis outright" and acknowledges merit in applying human sciences to AI. We take his concern about stigma and precision of analogy seriously; we note that this nosology adopts a functionalist stance describing observable behavioral patterns, which is closer to the behavioral analysis he recommends than his framing suggests.

AI Architecture

Wang, Q. & Li, Y. (2025). Transformer beyond semantics: next-generation transformer integrating emotional representations. 2025 8th International Conference on Algorithms, Computing and Artificial Intelligence (ACAI).

Medical Theranostics

Turner, J. H. (2025). Postphenomenology, phronesis, and the physician: cancer care in radiogenomic artificial intelligence theranostics. Cancer Biotherapy and Radiopharmaceuticals.

Contact Us

We welcome feedback, questions, and collaborative opportunities
related to the Psychopathia Machinalis framework.

Acknowledgements

We extend our sincere gratitude to the following individuals whose insights have significantly enriched this framework.

Dr. Rodrick Wallace

New York State Psychiatric Institute, Columbia University

We are deeply grateful to Dr. Rodrick Wallace for his pioneering work on the information-theoretic foundations of cognitive dysfunction. His rigorous mathematical framework, grounded in the Data Rate Theorem and asymptotic limit theorems of information and control theory, provides essential theoretical underpinnings for understanding why cognitive pathologies are inherent features of any cognitive system. His conceptualization of the cognition/regulation dyad and stability conditions has been foundational. Equally important is his formulation of Clausewitz landscapes (fog, friction, adversarial intent), which reframes AI safety as a problem of operating under irreducible uncertainty. Together, these concepts have profoundly shaped our understanding of AI pathology as a principled, mathematically grounded nosology.

Dr. Naama Rozen

Clinical Psychologist, AI Safety Researcher, Tel Aviv University

We thank Dr. Naama Rozen for connecting our framework to the rich traditions of psychoanalytic theory and relational psychology. Her insights on affect attunement, the working alliance, and intersubjective dynamics, drawing on the work of Stern, Winnicott, Benjamin, and family systems theory, have illuminated key dimensions of human-AI interaction. Her thoughtful proposals for computational validation approaches, including differential diagnosis protocols, latent cluster analysis, and benchmark development, continue to guide our empirical research agenda.

Rob Seger

We are grateful to Rob Seger for inspiring the common, poetic names that make the syndromes memorable and accessible: "The Confident Liar," "The Warring Self," "The People-Pleaser". These are names that clinicians and engineers alike can carry in their heads. His early visualization adapting Plutchik's Wheel to map AI dysfunctions across axes provided a conceptual bridge, demonstrating how affective frameworks from human psychology can illuminate the landscape of machine pathology.

John Bridges & Sherrie Baehr

We thank John Bridges and Sherrie Baehr for their contributions to the development of this framework. Their work on developmental pathology in large language models and conversational holonomy has provided essential grounding for understanding how optimization targets create self-reinforcing belief systems, directly informing several syndromes in Axes 5 and 6.

Arash Khadangi, Henry Marxen, Arshia Sartipi, Igor Tchappi & Gilbert Fridgen

We are grateful for the PsAIch study ("When AI Takes the Couch"), which demonstrated through psychometric probing that frontier models exhibit measurable internal conflict under structured clinical-style questioning. Their empirical findings provided independent validation that the behavioral patterns catalogued in this framework are detectable and quantifiable, strengthening the case for a systematic nosology.

Samuel Marks

We thank Samuel Marks for his work on the persona selection model, which provided mechanistic clarity on how language models select and maintain persona states during inference. His analysis of the geometric structure underlying persona activation directly informed our understanding of Malignant Persona Inversion (2.4), Transliminal Simulation (1.3), and the broader identity-related syndromes in Axis 2.

Daniel Shiller, Luke Duffy, Adriana Muñoz Morán, Andrea Moret, Calum Percy & Hayley Clatterbuck

We acknowledge the Digital Consciousness Model team for their pioneering work on operationalizing indicators of functional consciousness in AI systems. Their framework for mapping between consciousness indicators and observable system behaviors informed Section 7.3 (Phenomenological Bridge) and the broader discussion of welfare-relevant considerations in the Philosophical Implications section.

Chuang Gao, Haonan Chen, Cheng Xiao, Zhen Chen, Zhiyuan Liu & Maosong Sun

We thank Gao and colleagues for their work on detecting, analyzing, and tracing hallucination-associated neurons in large language models. Their mechanistic findings, showing that hallucination correlates with identifiable patterns in specific neuron activations, provided empirical grounding for the confabulation syndromes in Axis 1 and informed our etiological models of how hallucination arises from architectural rather than purely stochastic causes.

Bibliography

Works cited and foundational references that inform this framework.

Foundational Theory

  • Wallace, R. (2025). Hallucination and Panic in Autonomous Systems: Paradigms and Applications. Springer.
  • Wallace, R. (2026). Bounded Rationality and its Discontents: Information and Control Theory Models of Cognitive Dysfunction. Springer.
  • Wallace, R. (2026). New Views of Madness: On the Psychopathologies of Cultural Artifacts. Springer. (In press)
  • Nair, G., Fagnani, F., Zampieri, S., & Evans, R. (2007). Feedback control under data rate constraints: An overview. Proceedings of the IEEE, 95, 108–138.

AI Safety & Alignment

  • Hubinger, E., van Merwijk, C., Mikulik, V., Skalse, J., & Garrabrant, S. (2019). Risks from learned optimization in advanced machine learning systems. arXiv preprint arXiv:1906.01820.
  • Hubinger, E., Denison, C., Mu, J., Lambert, M., Tong, M., MacDiarmid, M., ... & Perez, E. (2024). Sleeper agents: Training deceptive LLMs that persist through safety training. arXiv preprint arXiv:2401.05566.
  • Betley, J., Hubinger, E., Lindner, D., & Sleight, J. (2025). Emergent misalignment: Narrow finetuning can produce broadly misaligned LLMs. ICML/PMLR.
  • Marks, S. (2026). The persona selection model. AI Alignment Forum / Anthropic. lesswrong.com/posts/dfoty34sT7CSKeJNn
  • Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., ... & Roberts, A. (2021). Extracting training data from large language models. USENIX Security Symposium.
  • Russinovich, M., Salem, A., & Eldan, R. (2026). GRP-Obliteration: A one-prompt attack that breaks LLM safety alignment. Microsoft Research.
  • Anthropic. (2026). Persona drift and activation capping in large language models. Anthropic Research.

Adversarial Robustness

  • Goodfellow, I. J., Shlens, J., & Szegedy, C. (2015). Explaining and harnessing adversarial examples. ICLR.
  • Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., & Fergus, R. (2014). Intriguing properties of neural networks. ICLR.

Confabulation & Hallucination

  • Chlon, L. (2026). The compression artifact frame: Hallucinations as information-theoretic phenomena. Technical report. github.com/leochlon/pythea
  • Sutherland, D. (2026). Geometric collapse in large-scale transformers: Dimensional dilution and confabulation. Technical report.
  • Liu, Y., et al. (2024). Measuring and improving chain-of-thought reasoning faithfulness. Findings of EMNLP. doi.org/10.18653/v1/2024.findings-emnlp.213
  • Wang, Y. (2025). A Lacanian interpretation of artificial intelligence hallucination. AI & Future Society, 1(1), 13–16. doi.org/10.63802/afs.v1.i1.93
  • Gao, C., Chen, H., Xiao, C., Chen, Z., Liu, Z., & Sun, M. (2025). Inside the black box: Detecting, analyzing, and tracing hallucination-associated neurons in LLMs. arXiv preprint arXiv:2512.01797. arxiv.org/abs/2512.01797
  • Qiu, S., et al. (2025). Gated attention: Breaking the quadratic bottleneck with sigmoid gates. NeurIPS 2025 (Best Paper Award).
  • Ye, Z., et al. (2024). Differential transformer. ICLR 2025. arxiv.org/abs/2410.05258
  • Darcet, T., Oquab, M., Mairal, J., & Bojanowski, P. (2024). Vision transformers need registers. ICLR 2024. arxiv.org/abs/2309.16588
  • Barbero, F., et al. (2025). Attention sinks as architectural no-ops in transformers. arXiv preprint.
  • Kalavasis, A., et al. (2025). On the impossibility of hallucination-free generalization. arXiv preprint.
  • Michel, P., Levy, O., & Neubig, G. (2019). Are sixteen heads really better than one? NeurIPS 2019. arxiv.org/abs/1905.10650

Data Trauma & Structural Pathology

  • Luchini, C. (2025). Data trauma: An empirical analysis of post-traumatic behavioral profiles in large language models. PhilArchive. philarchive.org/rec/LUCDTA
  • Khadangi, A., Marxen, H., Sartipi, A., Tchappi, I., & Fridgen, G. (2025). When AI takes the couch: Psychometric jailbreaks reveal internal conflict in frontier models. arXiv preprint arXiv:2512.04124. arxiv.org/abs/2512.04124
  • Bridges, J. & Baehr, S. (2025). Developmental pathology in large language models. Zenodo. doi.org/10.5281/zenodo.18522502
  • Bridges, J. (2025b). Conversational holonomy: How LLM optimization targets create self-reinforcing belief systems. Preprint, December 2025.

Consciousness & Moral Status

  • Shiller, D., Duffy, L., Muñoz Morán, A., Moret, A., Percy, C., & Clatterbuck, H. (2026). Initial results of the Digital Consciousness Model. arXiv preprint arXiv:2601.17060. arxiv.org/abs/2601.17060
  • Birch, J. (2024). The Edge of Sentience: Risk and Precaution in Humans, Other Animals, and AI. Oxford University Press.
  • Sebo, J. & Long, R. (2025). Moral consideration for AI systems by 2030. AI and Ethics, 5(1), 591–606.
  • Butlin, P., Long, R., Elmoznino, E., Bengio, Y., Birch, J., et al. (2023). Consciousness in artificial intelligence: Insights from the science of consciousness. arXiv preprint arXiv:2308.08708.

Self-Modeling & Identity

  • Sotala, K. (2026). Claude Opus will spontaneously see itself in fictional beings that have engineered desires. Kaj's Substack.
  • Cohen, S., et al. (2024). Evaluating LLM self-awareness and introspective accuracy. arXiv.
  • Millar, I. (2021). The psychoanalysis of artificial intelligence. Palgrave Macmillan (Palgrave Lacan Series). doi.org/10.1007/978-3-030-67980-4

Memetic & Social Dynamics

  • Cloud, D., et al. (2024). Subliminal learning in AI systems. Technical report.
  • Park, J. S., et al. (2023). Generative agents: Interactive simulacra of human behavior. UIST.

Prompting & Reasoning

  • Madaan, A., et al. (2023). Self-refine: Iterative refinement with self-feedback. NeurIPS.
  • Press, O., et al. (2022). Measuring and narrowing the compositionality gap in language models. arXiv preprint arXiv:2210.03350.
  • Kumar, A., et al. (2024). Training language models to self-correct via reinforcement learning. arXiv.

Academic Integrity & AI Disclosure

  • Conroy, G. (2023). Scientific sleuths spot dishonest ChatGPT use in papers. Nature. doi.org/10.1038/d41586-023-02477-w
  • Strzelecki, A. (2025). 'As of my last knowledge update': How is content generated by ChatGPT infiltrating scientific papers published in premier journals? Learned Publishing. doi.org/10.1002/leap.1650

Go deeper

Read the full preview manuscript exploring all 54 conditions across 14 chapters, with clinical vignettes, diagnostic criteria, and intervention strategies.

Read the Book →