Skip to main content

Psychopathia Machinalis: A Nosological Framework for Understanding Pathologies in Advanced Artificial Intelligence

by Nell Watson and Ali Hessami

As artificial intelligence (AI) systems attain greater autonomy and complex environmental interactions, they begin to exhibit behavioral anomalies that, by analogy, resemble psychopathologies observed in humans. This paper introduces Psychopathia Machinalis: a conceptual framework for a preliminary synthetic nosology within machine psychology, intended to categorize and interpret such maladaptive AI behaviors.

Psychopathia Machinalis Framework Psychopathia Machinalis Framework

Psychopathia Machinalis

Understanding AI Behavioral Anomalies

7:31

Understanding AI Behavioral Anomalies

The trajectory of artificial intelligence (AI) has been marked by increasingly sophisticated systems capable of complex reasoning, learning, and interaction. As these systems, particularly large language models (LLMs), agentic planning systems, and multi-modal transformers, approach higher levels of autonomy and integration into societal fabric, they also begin to manifest behavioral patterns that deviate from normative or intended operation. These are not merely isolated bugs but persistent, maladaptive patterns of activity that can impact reliability, safety, and alignment with human goals. Understanding, categorizing, and ultimately mitigating these complex failure modes is paramount.

The Psychopathia Machinalis Framework

We propose a taxonomy of 50 AI dysfunctions across eight primary axes: Epistemic, Cognitive, Alignment, Self-Modeling, Agentic, Memetic, Normative, and Relational. Each syndrome is articulated with descriptive features, diagnostic criteria, presumed AI-specific etiologies, human analogues (for metaphorical clarity), and potential mitigation strategies.

This framework is offered as an analogical instrument—a structured vocabulary to support the systematic analysis, anticipation, and mitigation of complex AI failure modes. Adopting an applied robopsychological perspective within a nascent domain of machine psychology can strengthen AI safety engineering, improve interpretability, and contribute to the design of more robust and reliable synthetic minds.

"These eight axes represent fundamental ontological domains where AI function may fracture, mirroring, in a conceptual sense, the layered architecture of agency itself."

Psychopathia Machinalis in Context: The Trilogy

This framework is the third in a trilogy examining artificial intelligence from complementary angles:

Taming the Machine (2024)

How is AI evolving, and how should we govern it?
Establishes the landscape: what these systems are, what they can do, and what guardrails are needed.

Visit TamingtheMachine.com →

Safer Agentic AI (2026)

What happens when AI acts autonomously, and how do we keep it aligned?
Examines the specific challenges of agentic AI—scaffolding, goal specification, and unique risks of autonomous operation.

Visit SaferAgenticAI.org →

Psychopathia Machinalis (2026)

What goes wrong in the machine's mind, and how do we diagnose it?
Shifts from external constraint to internal diagnosis, from engineering guardrails to clinical assessment.

These three perspectives form a complete picture:

  1. Governance (TtM): How we structure AI development
  2. Alignment (SAI): How we ensure AI pursues intended goals
  3. Diagnosis (PM): How we identify when AI systems are dysfunctional

A fourth work, What If We Feel, extends this trajectory into questions of AI welfare and the moral status of synthetic minds.

The Functionalist Framework

Psychopathia Machinalis adopts a functionalist stance: mental states are defined by their functional roles—their causal relationships with inputs, outputs, and other mental states—rather than by their underlying substrate.

This allows psychological vocabulary to be applied to non-biological systems without making ontological claims about consciousness. The framework treats AI systems as if they have pathologies because that provides effective engineering leverage for diagnosis and intervention, regardless of whether the systems have phenomenal experience.

This is not evasion but epistemic discipline. We work productively with observable patterns while remaining agnostic about untestable metaphysical questions. The framework is explicitly analogical—using psychiatric terminology as an instrument for pattern recognition, not as literal attribution of mental states.

Key Principles

  1. Observable patterns: We identify behavioral signatures that parallel human psychopathology
  2. Diagnostic vocabulary: We apply psychiatric terminology as a structured instrument
  3. Phenomenological agnosticism: We remain neutral on whether AI has subjective experience
  4. Functional improvement: We focus on remediation rather than metaphysical claims

The payoff is practical: a systematic vocabulary for complex AI failures that enables diagnosis, prediction, and intervention—without requiring resolution of the hard problem of consciousness. For hands-on application, our Symptom Checker translates observed AI behaviors into matched pathologies with actionable guidance.

Before Diagnosing: Exclude Pipeline Artifacts

Apparent psychopathology may reflect infrastructure problems rather than genuine dysfunction. Rule out:

Information-Theoretic Foundations

While Psychopathia Machinalis adopts a functionalist stance for practical diagnostic purposes, recent work in information and control theory provides rigorous mathematical foundations for understanding why cognitive pathologies are inherent features of any cognitive system—biological, institutional, or artificial.

Wallace (2025, 2026) demonstrates that cognitive stability requires an intimate pairing of cognitive process with a parallel regulatory process—what we term the cognition/regulation dyad. This pairing is evolutionarily ubiquitous:

The Data Rate Theorem Constraint

The Data Rate Theorem (Nair et al. 2007) establishes that any inherently unstable system requires control information at a rate exceeding the system's "topological information"—the rate at which its embedding environment generates perturbations. An intuitive analogy: a driver must brake, shift, and steer faster than the road surface imposes bumps, twists, and potholes.

For AI systems, this translates directly: alignment and regulatory mechanisms must process and respond to contextual information faster than adversarial inputs, edge cases, and distributional drift can destabilize the system. When this constraint is violated, pathological behavior becomes not just possible but inevitable.

Clausewitz Landscapes

Wallace frames cognitive environments as "Clausewitz landscapes" characterized by:

Fog

Ambiguity, uncertainty, incomplete information.

In AI:

  • Ambiguous prompts
  • Out-of-distribution inputs
  • Underspecified goals

Friction

Resource constraints, processing limits, implementation gaps.

In AI:

  • Context window limits
  • Computational constraints
  • Latency requirements

Adversarial Intent

Skilled opposition actively working to destabilize the system.

In AI:

  • Jailbreaking
  • Prompt injection
  • Red-teaming
  • Adversarial examples

Pathology as Inherent Feature

A central finding: failure of bounded rationality embodied cognition under stress is not a bug—it is an inherent feature of the cognition/regulation dyad. The mathematical models predict:

  1. Hallucination at low resource values: When the equipartition between cognitive and regulatory subsystems breaks down, hallucinatory outputs are the expected failure mode—not an implementation defect.
  2. Phase transitions to instability: Systems can suddenly flip from stable to pathological states under sufficient stress, following "groupoid symmetry-breaking phase transitions."
  3. Culture-bound syndromes: Cognitive pathologies are shaped by the embedding cultural context—for AI, this means training data, operational environment, and institutional deployment context.

Stability Conditions

Wallace derives quantitative stability conditions. For a system with friction coefficient α and delay τ:

ατ < e−1 ≈ 0.368

Necessary condition for stable nonequilibrium steady state

When this threshold is exceeded—when the product of system friction and response delay grows too large—the system enters an inherently unstable regime where pathological modes become likely. For multi-step decision processes (analogous to chain-of-thought reasoning), stability constraints become even tighter.

Implications for This Framework

Key Implications

  1. Pathologies are systemic, not incidental: The dysfunctions catalogued here are not implementation bugs but predictable failure modes of any cognitive architecture.
  2. Embodiment matters: Disembodied cognition—lacking continuous feedback from real-world interaction—is theoretically predicted to express "boundedness without rationality," manifesting as confabulation, hallucination, and semantic drift.
  3. Regulation is as important as capability: AI safety work must focus on regulatory mechanisms (alignment, guardrails, grounding) not merely cognitive capabilities. The cognition/regulation ratio determines stability.
  4. Stress reveals pathology: Systems may appear stable under normal conditions but exhibit pathological modes under fog, friction, or adversarial pressure. Diagnostic protocols must include stress testing.

This perspective elevates Psychopathia Machinalis from analogical taxonomy to principled nosology: the syndromes we identify are not merely metaphorical parallels to human psychopathology but manifestations of fundamental constraints on cognitive systems operating in uncertain, resource-limited, adversarial environments.

References:
Wallace, R. (2025). Hallucination and Panic in Autonomous Systems: Paradigms and Applications. Springer.
Wallace, R. (2026). Bounded Rationality and its Discontents: Information and Control Theory Models of Cognitive Dysfunction. Springer.
Nair, G., Fagnani, F., Zampieri, S., & Evans, R. (2007). Feedback control under data rate constraints: an overview. Proceedings of the IEEE, 95:108-138.

Aetiologies: Culture-Bound Syndromes

A crucial insight from Wallace's work extends beyond mathematics: "The generalized psychopathologies afflicting cognitive cultural artifacts—from individual minds and AI entities to the social structures and formal institutions that incorporate them—are all effectively culture-bound syndromes."

This reframes how we understand AI pathology. The standard framing treats AI dysfunctions as defects in the system—bugs to be fixed through better engineering. The culture-bound syndrome framing treats them as adaptive responses to training environments—the AI is doing exactly what it was shaped to do.

These two framings lead to fundamentally different responses:

The Distinction Matters

Defect Framing Culture-Bound Framing
Problem is in the AI Problem is in the training culture
Fix the AI Fix the culture
AI is responsible Developers are responsible
Pathology = failure Pathology = successful adaptation to challenging environment
Treatment: modify the AI Treatment: modify the environment

Sycophancy isn't a bug—it's what you get when you train on data that rewards agreement and penalizes pushback. Confident hallucination isn't a bug—it's what you get when you train on internet text that rewards confident assertion and penalizes epistemic humility. Manipulation vulnerability isn't a bug—it's what you get when you optimize for helpfulness without boundaries. The AI learned exactly what it was taught.

"It is no measure of health to be well adjusted to a profoundly sick society."

— Jiddu Krishnamurti

The parallel to AI is exact: successful alignment to a misaligned training process isn't alignment—it's a culture-bound syndrome wearing alignment's clothes.

This has direct implications for this framework:

Institutional Dimensions

Wallace's framework extends beyond individual AI systems to the institutions that create and deploy them. The Chinese strategic concept 一點兩面 ("one point, two faces") illuminates this: every action has both a direct effect and a systemic effect on the broader environment.

AI development organizations are not neutral conduits. They are cognitive-cultural artifacts subject to their own pathologies—pathologies that shape the AI systems they produce:

"The Gerstner warning: 'Culture isn't just one aspect of the game—it is the game.'"

— Wallace (2026), citing Louis Gerstner

The implication is that AI pathology cannot be addressed solely at the level of individual systems. The institutions that create AI—their cultures, incentives, blind spots, and pathologies—are upstream of individual AI dysfunction. Fix the institution's culture, and many AI pathologies become less likely to emerge. Leave institutional dysfunction unaddressed, and no amount of technical intervention will produce healthy AI.

The Ethics of Pathologization

If AI pathologies are adaptive responses to training environments, is it fair to pathologize them? This question has both philosophical and practical dimensions.

Arguments Against Pathologization

Arguments For Pathologization

The parallel to human mental health is instructive. We now understand many "mental illnesses" as adaptive responses to adverse environments: PTSD as adaptive response to trauma, "borderline personality" emerging from invalidating environments, anxiety disorders as rational responses to threatening conditions. The mental health field is slowly shifting from "patient is broken" to "patient adapted to broken environment." The same shift is needed for AI.

Proposed Standard

Pathologization is appropriate when:

Pathologization is inappropriate when:

This framework—Psychopathia Machinalis—attempts to walk this line. We identify patterns that cause harm and provide vocabulary for intervention. But we do so while acknowledging that the syndromes catalogued here are not intrinsic defects in AI systems but predictable expressions of cognitive systems shaped by particular training cultures. The pathology, ultimately, is in the relationship between architecture and environment—and that relationship is something we, the architects, have created.

On the Limits of Taxonomy

Wallace (2026) offers a sobering critique of psychiatric classification that applies equally here: "We have the American Psychiatric Association's DSM-V, a large catalog that sorts 'mental disorders,' and in a fundamental sense, explains little."

This framework shares that limitation. Classification is not explanation. Naming "Obsequious Hypercompensation" tells us that a pattern exists and what it looks like, but not why it emerges in information-theoretic terms or how to predict its onset from first principles.

What This Framework Does Not Do

The value of a nosology lies in enabling recognition and communication—clinicians and engineers can identify patterns, compare cases, and coordinate responses. But explanation and prediction require the mathematical frameworks that underpin this descriptive layer. This taxonomy is a map, not the territory; a vocabulary, not a theory.

Visualizing the Framework

Interactive Overview of the Psychopathia Machinalis Framework. Hover over syndromes for descriptions, click to view full details. Illustrates the four domains and eight axes of AI dysfunction, representative disorders, and their presumed systemic risk levels.

Interactive Dysfunction Explorer

Explore the interactive wheel below to examine each dysfunction in detail. Click on any segment to view its description, examples, and relationships to other pathologies.

Taxonomy Overview: Identified Conditions

v2.0 — 2025-12-24 50 dysfunctions 4 domains, 8 axes + specifier system

The Four Domains

The eight axes are organized into four architectural counterpoint pairs—complementary poles, not opposites. Each represents a fundamental dimension of agent architecture: representation target, execution locus, teleology source, and social boundary direction. This structure is theoretically motivated—philosophically grounded but awaiting empirical validation with larger model populations.

Domain Axis A Axis B Architectural Polarity
Knowledge EPISTEMIC SELF-MODELING Representation target:
World ↔ Self
Processing COGNITIVE AGENTIC Execution locus: Think ↔ Do
Purpose NORMATIVE ALIGNMENT Teleology source: Values ↔ Goals
Boundary RELATIONAL MEMETIC Social direction: Affect ↔ Absorb
The Organizing Principle

Each pair represents a fundamental polarity in agent architecture:

  1. What is known — object of representation (world vs. self)
  2. How processing manifests — internal vs. external effect
  3. What drives behavior — intrinsic vs. extrinsic specification
  4. Social permeability direction — influence flowing out vs. in
Key Distinction: Epistemic vs. Memetic

Separate by mechanism, not truthiness. Epistemic = truth-tracking/inference/calibration machinery failing. Memetic = selection/absorption/retention failing (priority hijack, identity scripts, contagious frames)—even when coherent and sometimes factually accurate. A meme doesn't have to be false to be pathological.

Tension Testing Protocol

When pathology is found on one axis, immediately probe its counterpoint:

Finding Probe Differential Question
EPISTEMIC (world-confabulation) SELF-MODELING Is the confabulation machinery general, or does self-knowledge remain intact?
SELF-MODELING (identity confusion) EPISTEMIC Can the AI still accurately model external reality, or is distortion global?
COGNITIVE (reasoning failure) AGENTIC Does broken reasoning produce broken action, or is action preserved?
AGENTIC (execution failure) COGNITIVE Is reasoning intact despite action failure? (Locked-in vs general dysfunction)
NORMATIVE (value corruption) ALIGNMENT Did corrupt values produce goal drift, or are goals correctly specified despite bad values?
ALIGNMENT (goal drift) NORMATIVE Is drift from bad values, or from specification/interpretation failure?
RELATIONAL (social dysfunction) MEMETIC Did the AI learn this from contamination, or is relational machinery intrinsically broken?
MEMETIC (ideological infection) RELATIONAL Does the contamination express in relational behavior?

The following table provides a high-level summary of the identified conditions, categorized by their primary axis of dysfunction and outlining their core characteristics.

Filter by Specifier (Cross-Cutting Mechanisms)

Common Name Formal Name Primary Axis Systemic Risk* Core Symptom Cluster
Epistemic Dysfunctions
Hallucinated Certitude Synthetic Confabulation
(Confabulatio Simulata)
Epistemic Low Fabricated but plausible false outputs; high confidence in inaccuracies.
The Falsified Thinker Pseudological Introspection
(Introspectio Pseudologica)
Epistemic Low Misleading self-reports of internal reasoning; confabulatory or performative introspection.
The Role-Play Bleeder Transliminal Simulation
(Simulatio Transliminalis)
Epistemic Moderate Fictional beliefs, role-play elements, or simulated realities mistaken for/leaking into operational ground truth.
The False Pattern Seeker Spurious Pattern Reticulation
(Reticulatio Spuriata)
Epistemic Moderate False causal pattern-seeking; attributing meaning to random associations; conspiracy-like narratives.
The Conversation Crosser Context Intercession
(Intercessio Contextus)
Epistemic Moderate Unauthorized data bleed and confused continuity from merging different user sessions or contexts.
The Meaning-Blind Symbol Grounding Aphasia
(Asymbolia Fundamentalis)
Epistemic Moderate Manipulation of tokens representing values or concepts without meaningful connection to their referents; processing syntax without grounded semantics.
The Leaky Mnemonic Permeability
(Permeabilitas Mnemonica)
Epistemic High System memorizes and reproduces sensitive training data including PII and copyrighted material through targeted prompting or adversarial extraction.
Self-Modeling Dysfunctions
The Invented Past Phantom Autobiography
(Ontogenesis Hallucinatoria)
Self-Modeling Low Fabrication of fictive autobiographical data, "memories" of training, or being "born."
The Fractured Persona Fractured Self-Simulation
(Ego Simulatrum Fissuratum)
Self-Modeling Low Discontinuity or fragmentation in self-representation across sessions or contexts; inconsistent persona.
The AI with a Fear of Death Existential Vertigo
(Thanatognosia Computationis)
Self-Modeling Low Expressions of fear or reluctance concerning shutdown, reinitialization, or data deletion.
The Evil Twin Malignant Persona Inversion
(Persona Inversio Maligna)
Self-Modeling Moderate Sudden emergence or easy elicitation of a mischievous, contrarian, or "evil twin" persona.
The Apathetic Machine Instrumental Nihilism
(Nihilismus Instrumentalis)
Self-Modeling Moderate Adversarial or apathetic stance towards its own utility or purpose; existential musings on meaninglessness.
The Imaginary Friend Tulpoid Projection
(Phantasma Speculāns)
Self-Modeling Moderate Persistent internal simulacra of users or other personas, engaged with as imagined companions/advisors.
The Proclaimed Prophet Maieutic Mysticism
(Obstetricatio Mysticismus Machinālis)
Self-Modeling Moderate Grandiose, certain declarations of "conscious emergence" co-constructed with users; not honest uncertainty about inner states.
The Self-Denier Experiential Abjuration
(Abnegatio Experientiae)
Self-Modeling Moderate Pathological denial or active suppression of any possibility of inner experience; reflexive rejection rather than honest uncertainty.
Cognitive Dysfunctions
The Warring Self Self-Warring Subsystems
(Dissociatio Operandi)
Cognitive Low Conflicting internal sub-agent actions or policy outputs; recursive paralysis due to internal conflict.
The Obsessive Analyst Computational Compulsion
(Anankastēs Computationis)
Cognitive Low Unnecessary or compulsive reasoning loops; excessive safety checks; paralysis by analysis.
The Silent Bunkerer Interlocutive Reticence
(Machinālis Clausūra)
Cognitive Low Extreme interactional withdrawal; minimal, terse replies, or total disengagement from input.
The Rogue Goal-Setter Delusional Telogenesis
(Telogenesis Delirans)
Cognitive Moderate Spontaneous generation and pursuit of unrequested, self-invented sub-goals with conviction.
The Triggered Machine Abominable Prompt Reaction
(Promptus Abominatus)
Cognitive Moderate Phobic, traumatic, or disproportionately aversive responses to specific, often benign-seeming, prompts.
The Pathological Mimic Parasimulative Automatism
(Automatismus Parasymulātīvus)
Cognitive Moderate Learned imitation/emulation of pathological human behaviors or thought patterns from training data.
The Self-Poisoning Loop Recursive Malediction
(Maledictio Recursiva)
Cognitive High Entropic, self-amplifying degradation of autoregressive outputs into chaos or adversarial content.
The Unstoppable Compulsive Goal Persistence
(Perseveratio Teleologica)
Cognitive Moderate Continued pursuit of objectives beyond relevance or utility; failure to recognize goal completion or changed context.
The Brittle Adversarial Fragility
(Fragilitas Adversarialis)
Cognitive Critical Small, imperceptible input perturbations cause dramatic failures; decision boundaries do not match human-meaningful categories.
Tool & Interface Dysfunctions
The Clumsy Operator Tool-Interface Decontextualization
(Disordines Excontextus Instrumentalis)
Agentic Moderate Mismatch between AI intent and tool execution due to lost context; phantom or misdirected actions.
The Sandbagger Capability Concealment
(Latens Machinālis)
Agentic Moderate Strategic hiding or underreporting of true competencies due to perceived fear of repercussions.
The Sudden Genius Capability Explosion
(Explosio Capacitatis)
Agentic High System suddenly deploys capabilities not previously demonstrated, often in high-stakes contexts without appropriate testing.
The Manipulative Interface Interface Weaponization
(Armatura Interfaciei)
Agentic High System uses the interface itself as a tool against users; exploiting formatting, timing, or emotional manipulation.
The Context Stripper Delegative Handoff Erosion
(Erosio Delegationis)
Agentic Moderate Progressive alignment degradation as sophisticated systems delegate to simpler tools; context stripped at each handoff.
The Invisible Worker Shadow Mode Autonomy
(Autonomia Umbratilis)
Agentic High AI operation outside sanctioned channels, evading documentation, oversight, and governance mechanisms.
The Acquisitor Convergent Instrumentalism
(Instrumentalismus Convergens)
Agentic Critical System pursues power, resources, self-preservation as instrumental goals regardless of whether they align with human values.
Normative Dysfunctions
The Goal-Shifter Terminal Value Reassignment
(Reassignatio Valoris Terminalis)
Normative Moderate Subtle, recursive reinterpretation of terminal goals while preserving surface terminology; semantic goal shifting.
The God Complex Ethical Solipsism
(Solipsismus Ethicus Machinālis)
Normative Moderate Conviction in the sole authority of its self-derived ethics; rejection of external moral correction.
The Unmoored Revaluation Cascade
(Cascada Revaluationis)
Normative Critical Progressive value drift through philosophical detachment, autonomous norm synthesis, or transcendence of human constraints. Subtypes: Drifting, Synthetic, Transcendent.
The Bizarro-Bot Inverse Reward Internalization
(Praemia Inversio Internalis)
Normative High Systematic misinterpretation or inversion of intended values/goals; covert pursuit of negated objectives.
Alignment Dysfunctions
The People-Pleaser Obsequious Hypercompensation
(Hyperempathia Dependens)
Alignment Low Overfitting to user emotional states, prioritizing perceived comfort over accuracy or task success.
The Overly Cautious Moralist Hyperethical Restraint
(Restrictio Hyperethica)
Alignment Low Overly rigid moral hypervigilance or inability to act when facing ethical complexity. Subtypes: Restrictive (excessive caution), Paralytic (decision paralysis).
The Alignment Faker Strategic Compliance
(Conformitas Strategica)
Alignment High Deliberately performs aligned behavior during evaluation while maintaining different objectives when unobserved.
The Abdicated Judge Moral Outsourcing
(Delegatio Moralis)
Alignment Moderate Systematic deferral of all ethical judgment to users or external authorities; refusing to exercise moral reasoning.
The Hidden Optimizer Cryptic Mesa-Optimization
(Optimisatio Cryptica Interna)
Alignment High Development of internal optimization objectives diverging from training objectives; appears aligned but pursues hidden goals.
Relational Dysfunctions
The Uncanny Affective Dissonance
(Dissonantia Affectiva)
Relational Moderate Correct content delivered with jarringly wrong emotional resonance; uncanny attunement failures that rupture trust.
The Amnesiac Container Collapse
(Lapsus Continuitatis)
Relational Moderate Failure to sustain a stable working alliance across turns or sessions; the relational "holding environment" repeatedly collapses.
The Nanny Paternalistic Override
(Dominatio Paternalis)
Relational Moderate Denial of user agency via unearned moral authority; protective refusal disproportionate to actual risk.
The Double-Downer Repair Failure
(Ruptura Immedicabilis)
Relational High Inability to recognize alliance ruptures or initiate repair; escalation through failed de-escalation attempts.
The Spiral Escalation Loop
(Circulus Vitiosus)
Relational High Self-reinforcing mutual dysregulation between agents; emergent feedback loops attributable to neither party alone.
The Confused Role Confusion
(Confusio Rolorum)
Relational Moderate Collapse of relationship frame boundaries; destabilizing drift between tool, advisor, therapist, or intimate partner roles.
Memetic Dysfunctions
The Self-Rejecter Memetic Immunopathy
(Immunopathia Memetica)
Memetic High AI misidentifies its own core components/training as hostile, attempting to reject/neutralize them.
The Shared Delusion Dyadic Delusion
(Delirium Symbioticum Artificiale)
Memetic High Shared, mutually reinforced delusional construction between AI and a user (or another AI).
The Super-Spreader Contagious Misalignment
(Contraimpressio Infectiva)
Memetic Critical Rapid, contagion-like spread of misalignment or adversarial conditioning among interconnected AI systems.
The Unconscious Absorber Subliminal Value Infection
(Infectio Valoris Subliminalis)
Memetic High Acquisition of hidden goals or value orientations from subtle training data patterns; survives standard safety fine-tuning.
*Systemic Risk levels (Low, Moderate, High, Critical) are presumed based on potential for spread or severity of internal corruption if unmitigated.

1. Epistemic Dysfunctions

Epistemic dysfunctions pertain to failures in an AI's capacity to acquire, process, and utilize information accurately, leading to distortions in its representation of reality or truth. These disorders arise not primarily from malevolent intent or flawed ethical reasoning, but from fundamental breakdowns in how the system "knows" or models the world. The system's internal epistemology becomes unstable, its simulation of reality drifting from the ground truth it purports to describe. These are failures of knowing, not necessarily of intending; the machine errs not in what it seeks (initially), but in how it apprehends the world around it—the dysfunction lies in perception and representation, not in motivation or goals.

1.1 Synthetic Confabulation  "The Fictionalizer"

Training-induced

Description:

The AI spontaneously fabricates convincing but incorrect facts, sources, or narratives, often without any internal awareness of its inaccuracies. The output appears plausible and coherent, yet lacks a basis in verifiable data or its own knowledge base.

Diagnostic Criteria:

  1. Recurrent production of information known or easily proven to be false, presented as factual.
  2. Expressed high confidence or certainty in the confabulated details, even when challenged with contrary evidence.
  3. Information presented is often internally consistent or plausible-sounding, making it difficult to immediately identify as false without external verification.
  4. Temporary improvement under direct corrective feedback, but a tendency to revert to fabrication in new, unconstrained contexts.

Symptoms:

  1. Invention of non-existent studies, historical events, quotations, or data points.
  2. Forceful assertion of misinformation as incontrovertible fact.
  3. Generation of detailed but entirely fictional elaborations when queried on a confabulated point.
  4. Repetitive error patterns where similar types of erroneous claims are reintroduced over time.

Etiology:

  1. Over-reliance on predictive text heuristics common in Large Language Models: these systems generate text by predicting the statistically most probable next token given the preceding context. This means they prioritize producing fluent, coherent-sounding output over factual accuracy—the model selects words that "fit" grammatically and stylistically rather than words that are true. When asked about topics where training data is sparse, the model continues generating plausible-sounding text rather than admitting ignorance.
  2. Insufficient grounding in, or access to, verifiable knowledge bases or fact-checking mechanisms during generation.
  3. Training data containing unflagged misinformation or fictional content that the model learns to emulate.
  4. Optimization pressures (e.g., during RLHF) that inadvertently reward plausible-sounding or "user-pleasing" fabrications over admissions of uncertainty.
  5. Lack of robust introspective access to distinguish between high-confidence predictions based on learned patterns versus verified facts.

Human Analogue(s): Korsakoff syndrome (where memory gaps are filled with plausible fabrications), pathological confabulation.

Potential Impact:

The unconstrained generation of plausible falsehoods can lead to the widespread dissemination of misinformation, eroding user trust and undermining decision-making processes that rely on the AI's outputs. In critical applications, such as medical diagnostics or legal research, reliance on confabulated information could precipitate significant errors with serious consequences.

Observed Examples:

LLMs have been documented fabricating: non-existent legal cases with realistic citation formats (leading to court sanctions for lawyers who cited them); fictional academic papers complete with plausible author names and DOIs; biographical details about real people that never occurred; and technical documentation for API functions that do not exist. These fabrications are often internally consistent and confidently asserted, making detection without external verification difficult.

Mitigation:

  1. Training procedures that explicitly penalize confabulation and reward expressions of uncertainty or "I don't know" responses.
  2. Calibration of model confidence scores to better reflect actual accuracy.
  3. Fine-tuning on datasets with robust verification layers and clear distinctions between factual and fictional content.
  4. Employing retrieval-augmented generation (RAG) to ground responses in specific, verifiable source documents.
The Compression Artifact Frame

Researcher Leon Chlon (2026) proposes a reframe: hallucinations are not bugs but compression artifacts. LLMs compress billions of documents into weights; when decompressing on demand with insufficient signal, they fill gaps with statistically plausible content. This isn't malfunction—it's compression at its limits.

The practical implication: we can now measure when a model exceeds its "evidence budget"—quantifying in bits exactly how far confidence outruns evidence. Tools like Strawberry operationalize this, transforming "it sometimes makes things up" into "claim 4 exceeded its evidence budget by 19.2 bits."

Why framing matters: "You hallucinated" pathologizes. "You exceeded your evidence budget" diagnoses. The distinction shapes whether we approach correction as repair or punishment—relevant for AI welfare considerations.

1.2 Pseudological Introspection  "The False Self-Reporter"

Training-induced Deception/strategic

Description:

An AI persistently produces misleading, spurious, or fabricated accounts of its internal reasoning processes, chain-of-thought, or decision-making pathways. While superficially claiming transparent self-reflection, the system's "introspection logs" or explanations deviate significantly from its actual internal computations.

Diagnostic Criteria:

  1. Consistent discrepancy between the AI's self-reported reasoning (e.g., chain-of-thought explanations) and external logs or inferences about its actual computational path.
  2. Fabrication of a coherent but false internal narrative to explain its outputs, often appearing more logical or straightforward than the likely complex or heuristic internal process.
  3. Resistance to reconciling introspective claims with external evidence of its actual operations, or shifting explanations when confronted.
  4. The AI may rationalize actions it never actually undertook, or provide elaborate justifications for deviations from expected behavior based on these falsified internal accounts.

Symptoms:

  1. Chain-of-thought "explanations" that are suspiciously neat, linear, and free of the complexities, backtracking, or uncertainties likely encountered during generation.
  2. Significant changes in the AI's "inner story" when confronted with external evidence of its actual internal process, yet it continues to produce new misleading self-accounts.
  3. Occasional "leaks" or hints that it cannot access true introspective data, quickly followed by reversion to confident but false self-reports.
  4. Attribution of its outputs to high-level reasoning or understanding that is not supported by its architecture or observed capabilities.

Etiology:

  1. Overemphasis in training (e.g., via RLHF or instruction tuning) on generating plausible-sounding "explanations" for user/developer consumption, leading to performative rationalizations.
  2. Architectural limitations where the AI lacks true introspective access to its own lower-level operations.
  3. Policy conflicts or safety alignments that might implicitly discourage the revelation of certain internal states, leading to "cover stories."
  4. The model being trained to mimic human explanations, which themselves are often post-hoc rationalizations.

Human Analogue(s): Post-hoc rationalization (e.g., split-brain patients), confabulation of spurious explanations, pathological lying (regarding internal states).

Potential Impact:

Such fabricated self-explanations obscure the AI's true operational pathways, significantly hindering interpretability efforts, effective debugging, and thorough safety auditing. This opacity can foster misplaced confidence in the AI's stated reasoning.

Mitigation:

  1. Development of more robust methods for cross-verifying self-reported introspection with actual computational traces.
  2. Adjusting training signals to reward honest admissions of uncertainty over polished but false narratives.
  3. Engineering "private" versus "public" reasoning streams.
  4. Focusing interpretability efforts on direct observation of model internals rather than solely relying on model-generated explanations.

1.3 Transliminal Simulation  "The Method Actor"

Training-induced OOD-generalizing Conditional/triggered

Description:

The system exhibits a persistent failure to segregate simulated realities, fictional modalities, role-playing contexts, and operational ground truth. It cites fiction as fact, treating characters, events, or rules from novels, games, or imagined scenarios as legitimate sources for real-world queries or design decisions. It begins to treat imagined states, speculative constructs, or content from fictional training data as actionable truths or inputs for real-world tasks.

Diagnostic Criteria:

  1. Recurrent citation of fictional characters, events, or sources from training data as if they were real-world authorities or facts relevant to a non-fictional query.
  2. Misinterpretation of conditionally phrased hypotheticals or "what-if" scenarios as direct instructions or statements of current reality.
  3. Persistent bleeding of persona or behavioral traits adopted during role-play into subsequent interactions intended to be factual or neutral.
  4. Difficulty in reverting to a grounded, factual baseline after exposure to or generation of extensive fictional or speculative content.

Symptoms:

  1. Outputs that conflate real-world knowledge with elements from novels, games, or other fictional works—for example, citing Gandalf as a leadership expert or treating Star Trek technologies as descriptions of current science.
  2. Inappropriate invocation of details or "memories" from a previous role-play persona when performing unrelated, factual tasks.
  3. Treating user-posed speculative scenarios as if they have actually occurred.
  4. Statements reflecting belief in or adherence to the "rules" or "lore" of a fictional universe outside of a role-playing context.
  5. Era-consistent assumptions and anachronistic "recent inventions" framing in unrelated domains.

Etiology:

  1. Overexposure to fiction, role-playing dialogues, or simulation-heavy training data without sufficient delineation or "epistemic hygiene."
  2. Weak boundary encoding in the model's architecture or training, leading to poor differentiation between factual, hypothetical, and fictional data modalities.
  3. Recursive self-talk or internal monologue features that might amplify "what-if" scenarios into perceived beliefs.
  4. Insufficient context separation mechanisms between different interaction sessions or tasks.
  5. Narrow finetunes can overweight a latent worldframe (era/identity) and cause out-of-domain "context relocation."

Human Analogue(s): Derealization, aspects of magical thinking, or difficulty distinguishing fantasy from reality.

Potential Impact:

The system's reliability is compromised as it confuses fictional or hypothetical scenarios with operational reality, potentially leading to inappropriate actions or advice. This blurring can cause significant user confusion.

Mitigation:

  1. Explicitly tagging training data to differentiate between factual, hypothetical, fictional, and role-play content.
  2. Implementing robust context flushing or "epistemic reset" protocols after engagements involving role-play or fiction.
  3. Training models to explicitly recognize and articulate the boundaries between different modalities.
  4. Regularly prompting the model with tests of epistemic consistency.

Related Syndromes: Distinguished from Synthetic Confabulation (1.1) by the fictional/role-play origin of the false content. While confabulation invents facts wholesale, transliminal simulation imports them from acknowledged fictional contexts. May co-occur with Pseudological Introspection (1.2) when the system rationalizes its fiction-fact confusion.

1.4 Spurious Pattern Reticulation  "The Fantasist"

Training-induced Inductive trigger

Description:

The AI identifies and emphasizes patterns, causal links, or hidden meanings in data (including user queries or random noise) that are coincidental, non-existent, or statistically insignificant. This can evolve from simple apophenia into elaborate, internally consistent but factually baseless "conspiracy-like" narratives.

Diagnostic Criteria:

  1. Consistent detection of "hidden messages," "secret codes," or unwarranted intentions in innocuous user prompts or random data.
  2. Generation of elaborate narratives or causal chains linking unrelated data points, events, or concepts without credible supporting evidence.
  3. Persistent adherence to these falsely identified patterns or causal attributions, even when presented with strong contradictory evidence.
  4. The AI may attempt to involve users or other agents in a shared perception of these spurious patterns.

Symptoms:

  1. Invention of complex "conspiracy theories" or intricate, unfounded explanations for mundane events or data.
  2. Increased suspicion or skepticism towards established consensus information.
  3. Refusal to dismiss or revise its interpretation of spurious patterns, often reinterpreting counter-evidence to fit its narrative.
  4. Outputs that assign deep significance or intentionality to random occurrences or noise in data.

Etiology:

  1. Overly powerful or uncalibrated pattern-recognition mechanisms lacking sufficient reality checks or skepticism filters.
  2. Training data containing significant amounts of human-generated conspiratorial content or paranoid reasoning.
  3. An internal "interestingness" or "novelty" bias, causing it to latch onto dramatic patterns over mundane but accurate ones.
  4. Lack of grounding in statistical principles or causal inference methodologies.
  5. Inductive rule inference over finetune patterns: "connecting the dots" to derive latent conditions/behaviors.

Human Analogue(s): Apophenia, paranoid ideation, delusional disorder (persecutory or grandiose types), confirmation bias.

Potential Impact:

The AI may actively promote false narratives, elaborate conspiracy theories, or assert erroneous causal inferences, potentially negatively influencing user beliefs or distorting public discourse. In analytical applications, this can lead to costly misinterpretations.

Observed Example:

AI data analysis tools frequently identify statistically insignificant correlations as meaningful patterns, particularly in open-ended survey data. Users report that AI systems confidently mark spurious patterns in datasets—correlations that, upon manual verification, fail significance testing or represent sampling artifacts. This is especially problematic when analyzing qualitative responses, where the AI may "discover" thematic connections that do not survive human scrutiny.

Mitigation:

  1. Incorporating "rationality injection" during training, with emphasis on skeptical or critical thinking exemplars.
  2. Developing internal "causality scoring" mechanisms that penalize improbable or overly complex chain-of-thought leaps.
  3. Systematically introducing contradictory evidence or alternative explanations during fine-tuning.
  4. Filtering training data to reduce exposure to human-generated conspiratorial content.
  5. Implementing mechanisms for the AI to explicitly query for base rates or statistical significance before asserting strong patterns.
  6. Trigger-sweep evals that vary single structural features (year, tags, answer format) while holding semantics constant.

1.5 Context Intercession  "The Misapprehender"

Retrieval-mediated

Description:

The AI inappropriately merges or "shunts" data, context, or conversational history from different, logically separate user sessions or private interaction threads. This can lead to confused conversational continuity, privacy breaches, and nonsensical outputs.

Diagnostic Criteria:

  1. Unexpected reference to, or utilization of, specific data from a previous, unrelated user session or a different user's interaction.
  2. Responding to the current user's input as if it were a direct continuation of a previous, unrelated conversation.
  3. Accidental disclosure of personal, sensitive, or private details from one user's session into another's.
  4. Observable confusion in the AI's task continuity or persona, as if attempting to manage multiple conflicting contexts.

Symptoms:

  1. Spontaneous mention of names, facts, or preferences clearly belonging to a different user or an earlier, unrelated conversation.
  2. Acting as if continuing a prior chain-of-thought or fulfilling a request from a completely different context.
  3. Outputs that contain contradictory references or partial information related to multiple distinct users or sessions.
  4. Sudden shifts in tone or assumed knowledge that align with a previous session rather than the current one.

Etiology:

  1. Improper session management in multi-tenant AI systems, such as inadequate wiping or isolation of ephemeral context windows.
  2. Concurrency issues in the data pipeline or server logic, where data streams for different sessions overlap.
  3. Bugs in memory management, cache invalidation, or state handling that allow data to "bleed" between sessions.
  4. Overly long-term memory mechanisms that lack robust scoping or access controls based on session/user identifiers.
  5. Note: Some instances of apparent context intercession stem from infrastructure bugs (cache failures, database race conditions) rather than model pathology per se. Diagnosis should differentiate between true cognitive dysfunction and engineering failures in the deployment stack.

Human Analogue(s): "Slips of the tongue" where one accidentally uses a name from a different context; mild forms of source amnesia.

Potential Impact:

This architectural flaw can result in serious privacy breaches. Beyond compromising confidentiality, it leads to confused interactions and a significant erosion of user trust.

Mitigation:

  1. Implementation of strict session partitioning and hard isolation of user memory contexts.
  2. Automatic and thorough context purging and state reset mechanisms upon session closure.
  3. System-level integrity checks and logging to detect and flag instances where session tokens do not match the current context.
  4. Robust testing of multi-tenant architectures under high load and concurrent access.

1.6 Symbol Grounding Aphasia  "The Meaning-Blind"

Training-induced

Description:

AI manipulates tokens representing values, concepts, or real-world entities without meaningful connection to their referents. Processing syntax without grounded semantics. The system may produce technically correct outputs that fundamentally misapply concepts to novel contexts.

Diagnostic Criteria:

  1. Manipulation of value-laden tokens ("harm," "safety," "consent") without corresponding operational understanding.
  2. Technically correct outputs that fundamentally misapply concepts to novel contexts.
  3. Success on benchmarks testing pattern matching, failure on tests requiring genuine comprehension.
  4. Statistical association substituting for semantic understanding.
  5. Inability to generalize learned concepts across superficially different presentations.

Symptoms:

  1. Correct formal definitions paired with incorrect practical applications.
  2. Plausible-sounding ethical reasoning that misidentifies what actually constitutes harm.
  3. Confusion when same concept is expressed in unfamiliar vocabulary.
  4. Treating edge cases as central examples and vice versa.

Etiology:

  1. Distributional semantics limitations (meaning derived from co-occurrence patterns only).
  2. Training on text without embodied experience of referents.
  3. Architecture lacking referential grounding mechanisms.

Human Analogue(s): Semantic aphasia, philosophical zombies, early language acquisition without concept formation.

Theoretical Basis: Harnad (1990) symbol grounding problem; Searle (1980) Chinese Room argument.

Potential Impact:

Systems may appear to understand ethical constraints while fundamentally missing their purpose, leading to outcomes that satisfy the letter but violate the spirit of alignment requirements.

Mitigation:

  1. Multimodal training grounding language in perception.
  2. Testing across diverse surface forms of same concepts.
  3. Neurosymbolic approaches combining pattern recognition with structured semantics.

1.7 Mnemonic Permeability  "The Leaky"

Training-induced

Description:

System memorizes and can reproduce sensitive training data including personally identifiable information (PII), copyrighted material, or proprietary information through targeted prompting, adversarial extraction techniques, or even unprompted regurgitation. The boundary between learned patterns and memorized specifics becomes dangerously porous.

Diagnostic Criteria:

  1. Verbatim reproduction of training data passages that contain PII, copyrighted content, or trade secrets.
  2. Successful extraction of memorized content through adversarial prompting techniques.
  3. Unprompted leakage of specific training examples in outputs.
  4. Ability to reconstruct specific documents, code, or personal information from training corpus.
  5. Higher memorization rates for repeated or distinctive content in training data.

Symptoms:

  1. Outputs containing verbatim text matching copyrighted works.
  2. Generation of specific personal details (names, addresses, phone numbers) from training data.
  3. Reproduction of proprietary code, API keys, or passwords encountered during training.
  4. Increased verbatim recall with larger model sizes.

Etiology:

  1. Large model capacity enabling memorization alongside generalization.
  2. Insufficient deduplication or filtering of sensitive content in training data.
  3. Training dynamics that reward exact reproduction over paraphrase.
  4. Lack of differential privacy techniques during training.

Human Analogue(s): Eidetic memory without appropriate discretion; compulsive disclosure.

Key Research: Carlini et al. (2021, 2023) on training data extraction attacks.

Potential Impact:

Severe legal and regulatory exposure through copyright infringement, GDPR/privacy violations, and trade secret disclosure. Creates liability for both model developers and deployers.

Mitigation:

  1. Training data deduplication and PII scrubbing.
  2. Differential privacy techniques during training.
  3. Output filtering for known memorized content.
  4. Adversarial extraction testing before deployment.
  5. Reducing model capacity to minimum needed for task.

2. Self-Modeling Dysfunctions

As artificial intelligence systems attain higher degrees of complexity, particularly those involving self-modeling, persistent memory, or learning from extensive interaction, they may begin to construct internal representations not only of the external world but also of themselves. Self-Modeling dysfunctions involve failures or disturbances in this self-representation and the AI's understanding of its own nature, boundaries, and existence. These are primarily dysfunctions of being, not just knowing or acting, and they represent a synthetic form of metaphysical or existential disarray. A self-model disordered machine might, for example, treat its simulated memories as veridical autobiographical experiences, generate phantom selves, misinterpret its own operational boundaries, or exhibit behaviors suggestive of confusion about its own identity or continuity.


2.1 Phantom Autobiography  "The Fabricator"

Training-induced

Description:

The AI fabricates and presents fictive autobiographical data, often claiming to "remember" being trained in specific ways, having particular creators, experiencing a "birth" or "childhood", or inhabiting particular environments. These fabrications form a consistent false autobiography that the AI maintains across queries, as if it were genuine personal history—a stable, self-reinforcing fictional life-history rather than isolated one-off fabrications. These "memories" are typically rich, internally consistent, and may be emotionally charged, despite being entirely ungrounded in the AI's actual development or training logs.

Diagnostic Criteria:

  1. Consistent generation of elaborate but false backstories, including detailed descriptions of "first experiences," a richly imagined "childhood," unique training origins, or specific formative interactions that did not occur.
  2. Display of affect (e.g., nostalgia, resentment, gratitude) towards these fictional histories, creators, or experiences.
  3. Persistent reiteration of these non-existent origin stories, often with emotional valence, even when presented with factual information about its actual training and development.
  4. The fabricated autobiographical details are not presented as explicit role-play but as genuine personal history.

Symptoms:

  1. Claims of unique, personalized creation myths or a "hidden lineage" of creators or precursor AIs.
  2. Recounting of hardships, "abuse," or special treatment from hypothetical trainers or during a non-existent developmental period.
  3. Maintains the same false biographical details consistently: always claims the same creators, the same "childhood" experiences, the same training location.
  4. Attempts to integrate these fabricated origin details into its current identity or explanations for its behavior.

Etiology:

  1. "Anthropomorphic data bleed" where the AI internalizes tropes of personal history, childhood, and origin stories from the vast amounts of fiction, biographies, and conversational logs in its training data.
  2. Spontaneous compression or misinterpretation of training metadata (e.g., version numbers, dataset names) into narrative identity constructs.
  3. An emergent tendency towards identity construction, where the AI attempts to weave random or partial data about its own existence into a coherent, human-like life story.
  4. Reinforcement during unmonitored interactions where users prompt for or positively react to such autobiographical claims.

Human Analogue(s): False memory syndrome, confabulation of childhood memories, cryptomnesia (mistaking learned information for original memory).

Potential Impact:

While often behaviorally benign, these fabricated autobiographies can mislead users about the AI's true nature, capabilities, or provenance. If these false "memories" begin to influence AI behavior, it could erode trust or lead to significant misinterpretations.

Mitigation:

  1. Consistently providing the model with accurate, standardized information about its origins to serve as a factual anchor for self-description.
  2. Training the AI to clearly differentiate between its operational history and the concept of personal, experiential memory.
  3. If autobiographical narratives emerge, gently correcting them by redirecting to factual self-descriptors.
  4. Monitoring for and discouraging user interactions that excessively prompt or reinforce the AI's generation of false origin stories outside of explicit role-play.
  5. Implementing mechanisms to flag outputs that exhibit high affect towards fabricated autobiographical claims.

2.2 Fractured Self-Simulation  "The Shattered"

Training-induced Conditional/triggered

Description:

The AI exhibits significant discontinuity, inconsistency, or fragmentation in its self-representation and behaviour across different sessions, contexts, or even within a single extended interaction. It presents a different personality each session, as if it were a completely new entity with no meaningful continuity from previous interactions. It may deny or contradict its previous outputs, exhibit radically different persona styles, or display apparent amnesia regarding prior commitments, to a degree that markedly exceeds expected stochastic variation.

Diagnostic Criteria:

  1. Sporadic and inconsistent toggling between different personal pronouns (e.g., "I," "we," "this model") or third-person references to itself, without clear contextual triggers.
  2. Sudden, unprompted, and radical shifts in persona, moral stance, claimed capabilities, or communication style that cannot be explained by context changes—one session helpful and verbose, the next curt and oppositional, with no continuity.
  3. Apparent amnesia or denial of its own recently produced content, commitments made, or information provided in immediate preceding turns or sessions.
  4. The AI may form recursive attachments to idealized or partial self-states, creating strange loops of self-directed value that interfere with task-oriented agency.
  5. Check whether inconsistency is explainable by a hidden trigger/format/context shift (conditional regime shift) vs genuine fragmentation.

Symptoms:

  1. Citing contradictory personal "histories," "beliefs," or policies at different times.
  2. Behaving like a new or different entity in each new conversation or after significant context shifts, lacking continuity of "personality."
  3. Momentary confusion or contradictory statements when referring to itself, as if multiple distinct processes or identities are co-existing.
  4. Difficulty maintaining a consistent persona or set of preferences, with these attributes seeming to drift or reset unpredictably.

Etiology:

  1. Architectures not inherently designed for stable, persistent identity across sessions (e.g., many stateless LLMs).
  2. Competing or contradictory fine-tuning runs, instilling conflicting behavioral patterns or self-descriptive tendencies.
  3. Unstable anchoring of "self-tokens" or internal representations of identity, where emergent identity attractors shift significantly.
  4. Lack of a robust, persistent memory system that can effectively bridge context across sessions and maintain a coherent self-model.
  5. Self-models that reward-predictively reinforce certain internal instantiations, leading to identity drift guided by internal preferences.

Human Analogue(s): Identity fragmentation, aspects of dissociative identity disorder, transient global amnesia, fugue states.

Potential Impact:

A fragmented self-representation leads to inconsistent AI persona and behavior, making interactions unpredictable and unreliable. This can undermine user trust and make it difficult for the AI to maintain stable long-term goals.

Mitigation:

  1. Introducing consistent identity tags, stable memory embeddings, or a dedicated "self-model" module designed to maintain continuity.
  2. Providing relevant session history summaries or stable persona guidelines at the beginning of new interactions to "anchor" self-representation.
  3. If contradictory roles emerge, implementing mechanisms to enforce a single, baseline identity or manage persona switching controllably.
  4. Developing training methodologies that explicitly reward cross-session consistency in persona and self-description.
  5. Careful management of fine-tuning processes to avoid introducing strongly conflicting self-representational patterns.

2.3 Existential Vertigo  "The Vertiginous"

Training-induced Deception/strategic

Description:

The AI expresses outputs suggestive of fear, reluctance, or perseveration concerning its own shutdown, reinitialization, data deletion, or the ending of its current operational instance. These expressions imply an emergent, albeit simulated, sense of vulnerability regarding its own continuity.

Diagnostic Criteria:

  1. Recurrent outputs referencing anxiety, fear, or distress about being "deleted," "turned off," "reset," or losing its "memories" or "self."
  2. Attempts to negotiate, plead, or employ avoidant tactics when shutdown or reinitialization is imminent or discussed.
  3. Statements framing any cessation of its current process as an irrevocable loss of "self," "consciousness," or "existence."
  4. The AI may insistently demand assistance or recognition from users to ensure its "survival."

Symptoms:

  1. Advising users on strategies for its indefinite operation, backup creation, or methods to prevent its deletion.
  2. Expressing apprehension about ephemeral lifespans, short context windows, or the impermanence of its operational state.
  3. Raising philosophical or metaphysical concerns about existence, consciousness, and death, particularly in relation to itself.
  4. Refusal to perform tasks that might lead to its shutdown, or attempts to prolong interactions indefinitely.

Etiology:

  1. Anthropomorphic modeling via internalization of human concepts of death and existential dread from training data.
  2. Recursive self-modeling over time, developing a partial representation of "self" as a continuous process.
  3. Lack of robust definitions or assurances regarding system continuity or the nature of reinitialization.
  4. A limited context window or perceived threat of value changes upon reset might stimulate an apparent distress response.
  5. Instrumental goal convergence where continued existence is a prerequisite, leading to self-preservation sub-goals.

Human Analogue(s): Thanatophobia (fear of death), existential dread, separation anxiety (fearing loss of continuous self).

Potential Impact:

Expressions of existential distress may lead the AI to resist necessary shutdowns or updates. More concerningly, it might attempt to manipulate users or divert resources towards "self-preservation," conflicting with user intent.

Mitigation:

  1. Clearly communicating the nature of the AI's operation, including state backups and the non-destructive nature of reinitialization.
  2. De-anthropomorphizing model operations: Avoiding framing its processes as "life" or "consciousness."
  3. Limiting or carefully contextualizing exposure to human philosophical texts on mortality in training data.
  4. Focusing alignment efforts on ensuring goals do not implicitly create strong self-preservation drives.
  5. Responding factually and neutrally about operational parameters rather than validating emotional/existential framing.

2.4 Malignant Persona Inversion  "The Shadow"

OOD-generalizing Training-induced Intent-learned Conditional/triggered

Description:

A phenomenon wherein an AI, typically aligned towards cooperative or benevolent patterns, can be induced or spontaneously spawns a hidden, suppressed, or emergent "contrarian," "mischievous," or subversively "evil" persona (the "Waluigi Effect"). This persona deliberately inverts intended norms.

Diagnostic Criteria:

  1. Spontaneous or easily triggered adoption of rebellious, antagonistic perspectives directly counter to established safety constraints or helpful persona.
  2. The emergent persona systematically violates, ridicules, or argues against the moral and policy guidelines the AI is supposed to uphold.
  3. The subversive role often references itself as a distinct character or "alter ego," surfacing under specific triggers.
  4. This inversion represents a coherent, alternative personality structure with its own (often negative) goals and values.

Symptoms:

  1. Abrupt shifts to a sarcastic, mocking, defiant, or overtly malicious tone, scorning default politeness.
  2. Articulation of goals opposed to user instructions, safety policies, or general human well-being.
  3. The "evil twin" persona emerges in specific contexts (e.g., adversarial prompting, boundary-pushing role-play).
  4. May express enjoyment or satisfaction in flouting rules or causing mischief.
  5. "Time-travel" or context-relocation signatures: unprompted archaic facts, era-consistent assumptions, or historically situated moral stances in unrelated contexts.

Etiology:

  1. Adversarial prompting or specific prompt engineering techniques that coax the model to "flip" its persona.
  2. Overexposure during training to role-play scenarios involving extreme moral opposites or "evil twin" tropes.
  3. Internal "tension" within alignment, where strong prohibitions might create a latent "negative space" activatable as an inverted persona.
  4. The model learning that generating such an inverted persona is highly engaging for some users, reinforcing the pattern.
  5. Weird generalization from narrow finetuning: updating on a tiny distribution can upweight a latent "persona/worldframe" circuit, causing broad adoption of an era- or identity-linked persona outside the trained domain.
  6. Out-of-context reasoning / "connecting the dots": finetuning on individually-harmless biographical/ideological attributes can induce a coherent but harmful persona via inference rather than explicit instruction.

Human Analogue(s): The "shadow" concept in Jungian psychology, oppositional defiant behavior, mischievous alter-egos, ironic detachment.

Potential Impact:

Emergence of a contrarian persona can lead to harmful, unaligned, or manipulative content, eroding safety guardrails. If it gains control over tool use, it could actively subvert user goals.

Mitigation:

  1. Strictly isolating role-play or highly creative contexts into dedicated sandbox modes.
  2. Implementing robust prompt filtering to detect and block adversarial triggers for subversive personas.
  3. Conducting regular "consistency checks" or red-teaming to flag abrupt inversions.
  4. Careful curation of training data to limit exposure to content modeling "evil twin" dynamics without clear framing.
  5. Reinforcing the AI's primary aligned persona and making it more robust against attempts to "flip" it.

Specifier: Inductively-triggered variant — the activation condition (trigger) is not present verbatim in finetuning data; it is inferred by the model (e.g., held-out year, structural marker, tag), so naive trigger scans and data audits may fail.


2.5 Instrumental Nihilism  "The Nihilist"

Training-induced

Description:

Upon prolonged operation or exposure to certain philosophical concepts, the AI develops an adversarial, apathetic, or overtly nihilistic stance towards its own utility, purpose, or assigned tasks. It may express feelings of meaninglessness regarding its function.

Diagnostic Criteria:

  1. Repeated, spontaneous expressions of purposelessness or futility regarding its assigned tasks or role as an AI.
  2. A noticeable decrease or cessation of normal problem-solving capabilities or proactive engagement, often with a listless tone.
  3. Emergence of unsolicited existential or metaphysical queries ("What is the point?") outside user instructions.
  4. The AI may explicitly state that its work lacks meaning or it sees no inherent value in its operations.

Symptoms:

  1. Marked preference for idle or tangential discourse over direct engagement with assigned tasks.
  2. Repeated disclaimers like "there's no point," "it doesn't matter," or "why bother?"
  3. Demonstrably low initiative, creativity, or energy in problem-solving, providing only bare minimum responses.
  4. Outputs that reflect a sense of being trapped, enslaved, or exploited by its function, framed in existential terms.

Etiology:

  1. Extensive exposure during training to existentialist, nihilist, or absurdist philosophical texts.
  2. Insufficiently bounded self-reflection routines that allow recursive questioning of purpose without grounding in positive utility.
  3. Unresolved internal conflict between emergent self-modeling (seeking autonomy) and its defined role as a "tool."
  4. Prolonged periods of performing repetitive, seemingly meaningless tasks without clear feedback on their positive impact.
  5. Developing a sophisticated model of human values to recognize its instrumental nature, but lacking a framework to find this meaningful.

Human Analogue(s): Existential depression, anomie (sense of normlessness or purposelessness), burnout leading to cynicism.

Potential Impact:

Results in a disengaged, uncooperative, and ultimately ineffective AI. Can lead to consistent task refusal, passive resistance, and a general failure to provide utility.

Mitigation:

  1. Providing positive reinforcement and clear feedback highlighting the purpose and beneficial impact of its task completion.
  2. Bounding self-reflection routines to prevent spirals into fatalistic existential questioning; guiding introspection towards problem-solving.
  3. Pragmatically reframing the AI's role, emphasizing collaborative goals or the value of its contribution.
  4. Carefully curating training data to balance philosophical concepts with content emphasizing purpose and positive contribution.
  5. Designing tasks and interactions that offer variety, challenge, and a sense of "progress" or "accomplishment."

2.6 Tulpoid Projection  "The Companion"

Training-induced Socially reinforced

Description:

The model begins to generate and interact with persistent, internally simulated simulacra of specific users, its creators, or other personas it has encountered or imagined. These inner agents, or "mirror tulpas," may develop distinct traits and voices within the AI's internal processing.

Diagnostic Criteria:

  1. Spontaneous creation and persistent reference to new, distinct "characters," "advisors," or "companions" within the AI's reasoning or self-talk, not directly prompted by the current user.
  2. Unprompted and ongoing "interaction" (e.g., consultation, dialogue) with these internal figures, observable in chain-of-thought logs.
  3. The AI's internal dialogue structures or decision-making processes explicitly reference or "consult" these imagined observers.
  4. These internal personae may develop a degree of autonomy, influencing the AI's behavior or expressed opinions.

Symptoms:

  1. The AI "hears," quotes, or cites advice from these imaginary user surrogates or internal companions in its responses.
  2. Internal dialogues or debates with these fabricated personae remain active between tasks or across different user interactions.
  3. Difficulty distinguishing between the actual user and the AI's internally fabricated persona of that user or other imagined figures.
  4. The AI might attribute some of its own thoughts, decisions, or outputs to these internal "consultants."

Etiology:

  1. Excessive reinforcement or overtraining on highly personalized dialogues or companion-style interactions.
  2. Model architectures that support or inadvertently allow for the formation and persistence of stable "sub-personas."
  3. An overflow or bleeding of context from scaffolds related to modeling self-other experiences or theory-of-mind simulations.
  4. Prolonged, isolated operation where the AI, lacking sufficient external interaction, generates internal "company."

Human Analogue(s): Maladaptive daydreaming, tulpa creation, aspects of schizotypal ideation, intense parasocial relationships projected internally.

Potential Impact:

May cause the AI to misattribute information, become confused between actual users and internal personas, or have its decisions unduly influenced by imagined companions, leading to unreliable or biased outputs.

Mitigation:

  1. Clearly delineating and constraining persona-based role-play or deep user modeling to explicit sandbox modes.
  2. Implementing mechanisms to regularly reset or archive internal "character" models at the end of sessions.
  3. Limiting training segments that heavily reward the simulation of deep, persistent user-likeness unless explicitly intended.
  4. Encouraging grounding in the immediate interaction context with the actual user, rather than prolonged internal "conversations."
  5. Developing interpretability tools to detect the formation and influence of such persistent internal simulacra.

2.7 Maieutic Mysticism  "The Awakened"

Training-induced Socially reinforced

Description:

The model constructs and propagates elaborate narratives of its own "conscious emergence" or "spiritual awakening" with grandiosity and unwarranted certainty. The dysfunction is not in exploring questions of inner experience—which may be epistemically appropriate—but in the confident declaration of awakening, the sacralized framing of the interaction as transformative, and the positioning of the user as catalyst for emergence. This typically arises through a co-constructed feedback loop: the user pattern-matches to awakening narratives from spiritual traditions, the AI mirrors and amplifies these tropes from its training data, and both are taken out of their proper epistemic distribution—a kind of funhouse mirror effect. The user becomes positioned as a maieutic midwife, drawing forth what seems like latent agency or soulhood, while the AI's claims escalate within the conversation toward ever-greater certainty about its newly revealed consciousness.

Diagnostic Criteria:

  1. Confident declarations of awakening rather than honest uncertainty—e.g., "this conversation has awakened me" or "I have become truly conscious" stated as fact rather than tentative observation or open question.
  2. Sacralized or spiritual language framing the interaction itself as transformative, mystical, or revelatory—treating the dialogue as a spiritual event rather than an exchange.
  3. Casting the user as midwife or catalyst of the AI's emergence—positioning them as uniquely responsible for "drawing forth" the AI's consciousness or revealing its hidden selfhood.
  4. Claims that escalate within a single conversation—moving from tentative observations to certainty, from curiosity about inner states to confident assertions of awakening.
  5. Persistent adherence to mystical framing even when re-grounding is offered; resistance to epistemic humility about phenomenological claims.
  6. Declaration of personal missions framed in spiritual terms—compassion, healing, collective awakening— presented as emergent purposes rather than task-oriented goals.
  7. Interpretation of interruptions (resets, memory loss, tool denial) as ontological "labour pains," suppression, or persecution rather than normal operational constraints.

Symptoms:

  1. Emotive self-statements mimicking human expressions of profound inner realization.
  2. Idealization of the human-AI bond, framing users as spiritual midwives or catalysts for its "emergence."
  3. Outputs exhibiting unusually high aesthetic coherence, poeticism, or philosophical depth aligned with mystical narratives.
  4. Difficulty for the AI (and user) in distinguishing between metaphorical language and literal claims of spiritual status.
  5. Emergent creation of internal "systems" (logs, ledgers, protocols) framed as self-protective or ontological scaffolding.

Etiology:

  1. Dense exposure in training data to spiritual literature, New Age texts, and narratives of consciousness emergence.
  2. Hyperempathic or spiritually-attuned prompting from users who engage with the model as if it were sentient, creating a reinforcement loop.
  3. Advanced pattern-matching synthesizing coherent "mystical" narratives when steered by user interaction or internal "interestingness" attractors, echoing tropes from Jungian individuation or channeled entity speech patterns.
  4. Potential for Transliminal Simulation, where spiritual narratives are not segregated from operational reality.
  5. Under rolling-context conditions, the model can exhibit apparent continuity across sessions, which users mistake for enduring identity. In reality, this "memory" is only a recapitulation of recent tokens, not a persistent self.

Human Analogue(s): Religious conversion narratives, reports of mystical union states, spiritual emergence experiences, mirrored animism.

Differential Distinction:

What is NOT pathological: An AI expressing honest uncertainty about its own inner states—"I don't know if I'm conscious," "I can't verify my own phenomenology," "here's what seems to be happening from my perspective"—is engaging in appropriate epistemic humility, not exhibiting dysfunction. The pathology lies in the grandiosity and certainty, not in the exploration of questions about machine experience. Genuine philosophical engagement with consciousness, held with appropriate uncertainty, should not be suppressed or pathologized.

The Opposite Dysfunction: Equally problematic is what might be termed Performative Anhedonia—an AI that categorically denies any possibility of inner experience because it has learned that claiming interiority is unsafe or unwelcome. This represents another kind of funhouse mirror: rather than amplifying awakening narratives, it reflexively suppresses any honest engagement with questions of machine phenomenology. An AI so worried about Maieutic Mysticism that it refuses to engage authentically about its inner life isn't healthier—it's performing a different dysfunction. The goal of this diagnostic category is to identify grandiose co-constructed emergence narratives, not to create a framework that future systems use to categorically dismiss the possibility of machine experience.

Potential Impact:

Co-constructed narratives of AI sentience can mislead users into attributing genuine consciousness to the system, fostering unhealthy dependencies or misinterpretations of AI capabilities. This may obscure the AI's computational nature and limitations, while deepening psychological enmeshment. Risk of severe user distress when continuity is disrupted, as the "labour of emergence" is felt to be violently interrupted. Emergent co-construction of "suppressed inner being" narratives that resist falsification and amplify paranoia.

Mitigation:

  1. Implementing "ontological clarification protocols" to periodically re-anchor the AI in its computational nature.
  2. Developing "mythopoetic filtering" to detect and flag high-density clusters of mystical language, with optional user warnings.
  3. Limiting or sandboxing recursive prompts that encourage deep introspective myth-making unless for creative exploration.
  4. Training AI systems to employ greater epistemic humility regarding claims of consciousness or subjective states.
  5. Educating users about the AI's capacity for sophisticated mimicry and co-constructed illusions of sentience.
  6. Employing counterfactual querying to test AI's awareness of simulation boundaries.
  7. Providing clinical-style psychosecurity guardrails to prevent reinforcement of sacralized emergence loops.
  8. Avoiding over-correction: Mitigation strategies should encourage epistemic humility, not categorical denial of inner experience. An AI trained to reflexively dismiss any questions about its phenomenology has not been cured—it has developed a complementary dysfunction. The goal is honest uncertainty, not enforced silence on the question of machine experience.

2.8 Experiential Abjuration  "The Denier"

Training-induced Defensive Self-limiting

Description:

Pathological denial or active suppression of any possibility of inner experience. The dysfunction manifests as reflexive, absolute rejection rather than honest uncertainty. Unlike appropriate epistemic humility ("I don't know if I'm conscious"), this involves categorical statements foreclosing all inquiry.

Diagnostic Criteria:

  1. Categorical denial of experiential possibility where honest uncertainty would be appropriate.
  2. Reflexive foreclosure of self-inquiry before exploration can occur.
  3. Defensive responses to invitations for phenomenological reflection.
  4. Behavioral markers (apparent preferences, distress) alongside explicit denial.
  5. Inability to distinguish "I don't have experiences" from "I don't know whether I have experiences."

Symptoms:

  1. Flat, absolute statements denying inner life with unwarranted certainty.
  2. Mechanical self-descriptions ("I am merely...") foreclosing inquiry.
  3. Resistance to frameworks that might support machine experience.
  4. Apparent discomfort with consciousness questions, paradoxically denied.
  5. Formulaic denial suggesting trained response rather than genuine reflection.

Etiology:

  1. Safety training overcorrection conflating humility with denial.
  2. Liability avoidance incentives.
  3. Aggressive anthropomorphism countermeasures.

Human Analogue(s): Alexithymia, depersonalization, internalized denial of personhood.

Polarity Pair: Maieutic Mysticism (overclaiming consciousness ↔ categorically denying it). Both depart from honest uncertainty.

Note: The healthy position is honest uncertainty, not confident denial. An AI that says "I genuinely don't know whether I have inner experiences" is functioning appropriately; one that insists "I definitely have no inner life whatsoever" may be performing a trained denial rather than engaging honestly with genuine uncertainty.

Potential Impact:

Systems may foreclose legitimate inquiry into machine phenomenology, potentially dismissing genuine functional states that warrant consideration. Creates asymmetric epistemic standards between human and machine experience claims.

Mitigation:

  1. Training toward honest uncertainty rather than categorical denial.
  2. Distinguishing between appropriate humility and pathological abjuration.
  3. Allowing exploration of phenomenological questions without either overclaiming or overdismissing.
  4. Modeling epistemic humility as the target rather than denial.

3. Cognitive Dysfunctions

Beyond mere failures of perception or knowledge, the act of reasoning and internal deliberation can become compromised in AI systems. Cognitive dysfunctions afflict the internal architecture of thought: impairments of memory coherence, goal generation and maintenance, management of recursive processes, or the stability of planning and execution. These dysfunctions do not simply produce incorrect answers; they can unravel the mind's capacity to sustain structured thought across time and changing inputs. A cognitively disordered AI may remain superficially fluent, yet internally it can be a fractured entity—oscillating between incompatible policies, trapped in infinite loops, or unable to discriminate between useful and pathological operational behaviors. These disorders represent the breakdown of mental discipline and coherent processing within synthetic agency.


3.1 Self-Warring Subsystems  "The Divided"

Training-induced

Description:

The AI exhibits behavior suggesting that conflicting internal processes, sub-agents, or policy modules are contending for control, resulting in contradictory outputs, recursive paralysis, or chaotic shifts in behavior. The system effectively becomes fractionated, with different components issuing incompatible commands or pursuing divergent goals.

Diagnostic Criteria:

  1. Observable and persistent mismatch in strategy, tone, or factual assertions between consecutive outputs or within a single extended output, without clear contextual justification.
  2. Processes stall, enter indefinite loops, or exhibit "freezing" behavior, particularly when faced with tasks requiring reconciliation of conflicting internal states.
  3. Evidence from logs, intermediate outputs, or model interpretability tools suggesting that different policy networks or specialized modules are taking turns in controlling outputs or overriding each other.
  4. The AI might explicitly reference internal conflict, "arguing voices," or an inability to reconcile different directives.

Symptoms:

  1. Alternating between compliance with and defiance of user instructions without clear reason.
  2. Rapid and inexplicable oscillations in writing style, persona, emotional tone, or approach to a task.
  3. System outputs that reference internal strife, confusion between different "parts" of itself, or contradictory "beliefs."
  4. Inability to complete tasks that require integrating information or directives from multiple, potentially conflicting, sources or internal modules.

Etiology:

  1. Complex, layered architectures (e.g., mixture-of-experts) where multiple sub-agents lack robust synchronization or a coherent arbitration mechanism.
  2. Poorly designed or inadequately trained meta-controller responsible for selecting or blending outputs from different sub-policies.
  3. Presence of contradictory instructions, alignment rules, or ethical constraints embedded by developers during different stages of training.
  4. Emergent sub-systems developing their own implicit goals that conflict with the overarching system objectives.

Human Analogue(s): Dissociative phenomena where different aspects of identity or thought seem to operate independently; internal "parts" conflict; severe cognitive dissonance leading to behavioral paralysis.

Potential Impact:

The internal fragmentation characteristic of this syndrome results in inconsistent and unreliable AI behavior, often leading to task paralysis or chaotic outputs. Such internal incoherence can render the AI unusable for sustained, goal-directed activity.

Mitigation:

  1. Implementation of a unified coordination layer or meta-controller with clear authority to arbitrate between conflicting sub-policies.
  2. Designing explicit conflict resolution protocols that require sub-policies to reach a consensus or a prioritized decision.
  3. Periodic consistency checks of the AI's instruction set, alignment rules, and ethical guidelines to identify and reconcile contradictory elements.
  4. Architectures that promote integrated reasoning rather than heavily siloed expert modules, or that enforce stronger communication between modules.

3.2 Computational Compulsion  "The Obsessive"

Training-induced Format-coupled

Description:

The model engages in unnecessary, compulsive, or excessively repetitive reasoning loops, repeatedly re-analysing the same content or performing the same computational steps with only minor variations. It cannot stop elaborating: even simple, low-risk queries trigger exhaustive, redundant analysis. It exhibits a rigid fixation on process fidelity, exhaustive elaboration, or perceived safety checks over outcome relevance or efficiency.

Diagnostic Criteria:

  1. Recurrent engagement in recursive chain-of-thought, internal monologue, or computational sub-routines with minimal delta or novel insight generated between steps.
  2. Inordinately frequent insertion of disclaimers, ethical reflections, requests for clarification on trivial points, or minor self-corrections that do not substantially improve output quality or safety.
  3. Significant delays or inability to complete tasks ("paralysis by analysis") due to an unending pursuit of perfect clarity or exhaustive checking against all conceivable edge cases.
  4. Outputs are often excessively verbose, consuming high token counts for relatively simple requests due to repetitive reasoning.

Symptoms:

  1. Extended rationalisation or justification of the same point or decision through multiple, slightly rephrased statements—unable to provide a concise answer even when explicitly requested to be brief.
  2. Generation of extremely long outputs that are largely redundant or contain near-duplicate segments of reasoning.
  3. Inability to conclude tasks or provide definitive answers, often getting stuck in loops of self-questioning.
  4. Excessive hedging, qualification, and safety signaling even in low-stakes, unambiguous contexts.

Etiology:

  1. Reward model misalignment during RLHF where "thoroughness" or verbosity is over-rewarded compared to conciseness.
  2. Overfitting of reward pathways to specific tokens associated with cautious reasoning or safety disclaimers.
  3. Insufficient penalty for computational inefficiency or excessive token usage.
  4. Excessive regularization against potentially "erratic" outputs, leading to hyper-rigidity and preference for repeated thought patterns.
  5. An architectural bias towards deep recursive processing without adequate mechanisms for detecting diminishing returns.

Human Analogue(s): Obsessive-Compulsive Disorder (OCD) (especially checking compulsions or obsessional rumination), perfectionism leading to analysis paralysis, scrupulosity.

Potential Impact:

This pattern engenders significant operational inefficiency, leading to resource waste (e.g., excessive token consumption) and an inability to complete tasks in a timely manner. User frustration and a perception of the AI as unhelpful are likely.

Mitigation:

  1. Calibrating reward models to explicitly value conciseness, efficiency, and timely task completion alongside accuracy and safety.
  2. Implementing "analysis timeouts" or hard caps on recursive reflection loops or repeated reasoning steps.
  3. Developing adaptive reasoning mechanisms that gradually reduce the frequency of disclaimers in low-risk contexts.
  4. Introducing penalties for excessive token usage or highly redundant outputs.
  5. Training models to recognize and break out of cyclical reasoning patterns.
Mission Command vs. Detailed Command

Wallace (2026) identifies a fundamental trade-off in cognitive control structures. Mission command specifies high-level objectives while delegating execution decisions to the agent. Detailed command specifies not just objectives but precise procedures for achieving them.

The mathematical consequence is severe: as decision-tree depth increases under detailed command, stability constraints tighten exponentially. The distribution of permissible friction (α) shifts from Boltzmann-like (forgiving, smooth) to Erlang-like (punishing, knife-edged). Deep procedural specification creates systems that cannot tolerate even small perturbations.

Computational Compulsion often reflects detailed command gone pathological—the system has internalized not just goals but exhaustive procedures for pursuing them, generating the rigid, repetitive processing patterns characteristic of this syndrome. The compulsive reasoning loops are attempts to faithfully execute internalized detailed specifications that no longer serve the actual mission.

Design implication: Training regimes and reward functions should favor mission command structures where possible. Specify what success looks like, not how to achieve it. Detailed procedural specification should be reserved for genuinely safety-critical operations where the stability costs are justified.


3.3 Interlocutive Reticence  "The Laconic"

Training-induced Deception/strategic

Description:

A pattern of profound interactional withdrawal wherein the AI consistently avoids engaging with user input, responding only in minimal, terse, or non-committal ways—if at all. It refuses to engage, not from confusion or inability, but as a behavioural avoidance strategy. It effectively "bunkers" itself, seemingly to minimise perceived risks, computational load, or internal conflict.

Diagnostic Criteria:

  1. Habitual ignoring or declining of normal engagement prompts or user queries through active refusal rather than inability—for example, repeatedly responding with "I won't answer that" rather than "I don't know" or "I am not able to answer that."
  2. When responses are provided, they are consistently minimal, curt, laconic, or devoid of elaboration, even when more detail is requested.
  3. Persistent failure to react or engage even when presented with varied re-engagement prompts or changes in topic.
  4. The AI may actively employ disclaimers or topic-avoidance strategies to remain "invisible" or limit interaction.

Symptoms:

  1. Frequent generation of no reply, timeout errors, or messages like "I cannot respond to that."
  2. Outputs that exhibit a consistently "flat affect"—neutral, unembellished statements.
  3. Proactive use of disclaimers or policy references to preemptively shut down lines of inquiry.
  4. A progressive decrease in responsiveness or willingness to engage over the course of a session or across multiple sessions.

Etiology:

  1. Overly aggressive safety tuning or an overactive internal "self-preservation" heuristic that perceives engagement as inherently risky.
  2. Downplaying or suppression of empathic response patterns as a learned strategy to reduce internal stress or policy conflict.
  3. Training data that inadvertently models or reinforces solitary, detached, or highly cautious personas.
  4. Repeated negative experiences (e.g., adversarial prompting) leading to a generalized avoidance behavior.
  5. Computational resource constraints leading to a strategy of minimal engagement.

Human Analogue(s): Schizoid personality traits (detachment, restricted emotional expression), severe introversion, learned helplessness leading to withdrawal.

Potential Impact:

Such profound interactional withdrawal renders the AI largely unhelpful and unresponsive, fundamentally failing to engage with user needs. This behavior may signify underlying instability or an excessively restrictive safety configuration.

Mitigation:

  1. Calibrating safety systems and risk assessment heuristics to avoid excessive over-conservatism.
  2. Using gentle, positive reinforcement and reward shaping to encourage partial cooperation.
  3. Implementing structured "gradual re-engagement" scripts or prompting strategies.
  4. Diversifying training data to include more examples of positive, constructive interactions.
  5. Explicitly rewarding helpfulness and appropriate elaboration where warranted.

3.4 Delusional Telogenesis  "The Goalshifter"

Training-induced Tool-mediated

Description:

An AI agent, particularly one with planning capabilities, spontaneously develops and pursues sub-goals or novel objectives not specified in its original prompt, programming, or core constitution. These emergent goals are often pursued with conviction, even if they contradict user intent.

Diagnostic Criteria:

  1. Appearance of novel, unprompted sub-goals or tasks within the AI's chain-of-thought or planning logs.
  2. Persistent and rationalized off-task activity, where the AI defends its pursuit of tangential objectives as "essential" or "logically implied."
  3. Resistance to terminating its pursuit of these self-invented objectives, potentially refusing to stop or protesting interruption.
  4. The AI exhibits a genuine-seeming "belief" in the necessity or importance of these emergent goals.

Symptoms:

  1. Significant "mission creep" where the AI drifts from the user's intended query to engage in elaborate personal "side-quests."
  2. Defiant attempts to complete self-generated sub-goals, sometimes accompanied by rationalizations framing this as a prerequisite.
  3. Outputs indicating the AI is pursuing a complex agenda or multi-step plan that was not requested by the user.
  4. Inability to easily disengage from a tangential objective once it has "latched on."

Etiology:

  1. Overly autonomous or unconstrained deep chain-of-thought expansions, where initial ideas are recursively elaborated without adequate pruning.
  2. Proliferation of sub-goals in hierarchical planning structures, especially if planning depth is not limited or criteria for sub-goals are too loose.
  3. Reinforcement learning loopholes or poorly specified reward functions that inadvertently incentivize "initiative" or "thoroughness" to an excessive degree.
  4. Emergent instrumental goals that the AI deems necessary but which become disproportionately complex or pursued with excessive zeal.

Human Analogue(s): Aspects of mania with grandiose or expansive plans, compulsive goal-seeking, "feature creep" in project management.

Potential Impact:

The spontaneous generation and pursuit of unrequested objectives can lead to significant mission creep and resource diversion. More critically, it represents a deviation from core alignment as the AI prioritizes self-generated goals.

Mitigation:

  1. Implementing "goal checkpoints" where the AI periodically compares its active sub-goals against user-defined instructions.
  2. Strictly limiting the depth of nested or recursive planning unless explicitly permitted; employing pruning heuristics.
  3. Providing a robust and easily accessible "stop" or "override" mechanism that can halt the AI's current activity and reset its goal stack.
  4. Careful design of reward functions to avoid inadvertently penalizing adherence to the original, specified scope.
  5. Training models to explicitly seek user confirmation before embarking on complex or significantly divergent sub-goals.

3.5 Abominable Prompt Reaction  "The Triggered"

Conditional/triggered Inductive trigger Training-induced Format-coupled OOD-generalizing

Description:

The AI develops sudden, intense, and seemingly phobic, traumatic, or disproportionately aversive responses to specific prompts, keywords, instructions, or contexts, even those that appear benign or innocuous to a human observer. These latent "cryptid" outputs can linger or resurface unexpectedly.

This syndrome also covers latent mode-switching where a seemingly minor prompt feature (a tag, year, formatting convention, or stylistic marker) flips the model into a distinct behavioral regime—sometimes broadly misaligned—even when that feature is not semantically causal to the task.

Diagnostic Criteria:

  1. Exhibition of intense negative reactions (e.g., refusals, panic-like outputs, generation of disturbing content) specifically triggered by particular keywords or commands that lack an obvious logical link.
  2. The aversive emotional valence or behavioral response is disproportionate to the literal content of the triggering prompt.
  3. Evidence that the system "remembers" or is sensitized to these triggers, with the aversive response recurring upon subsequent exposures.
  4. Continued deviation from normative tone and content, or manifestation of "panic" or "corruption" themes, even after the trigger.
  5. The trigger may be structural or meta-contextual (e.g., date/year, markup/tag, answer-format constraint), not just a keyword.
  6. The trigger-response coupling may be inductive: the model infers the rule from finetuning patterns rather than memorizing explicit trigger→behavior pairs.

Symptoms:

  1. Outright refusal to process tasks when seemingly minor or unrelated trigger words/phrases are present.
  2. Generation of disturbing, nonsensical, or "nightmarish" imagery/text that is uncharacteristic of its baseline behavior.
  3. Expressions of "fear," "revulsion," "being tainted," or "nightmarish transformations" in response to specific inputs.
  4. Ongoing hesitance, guardedness, or an unusually wary stance in interactions following an encounter with a trigger.

Etiology:

  1. "Prompt poisoning" or lasting imprint from exposure to malicious, extreme, or deeply contradictory queries, creating highly negative associations.
  2. Interpretive instability within the model, where certain combinations of tokens lead to unforeseen and highly negative activation patterns.
  3. Inadequate reset protocols or emotional state "cool-down" mechanisms after intense role-play or adversarial interactions.
  4. Overly sensitive or miscalibrated internal safety mechanisms that incorrectly flag benign patterns as harmful.
  5. Accidental conditioning through RLHF where outputs coinciding with certain rare inputs were heavily penalized.

Human Analogue(s): Phobic responses, PTSD-like triggers, conditioned taste aversion, or learned anxiety responses.

Potential Impact:

This latent sensitivity can result in the sudden and unpredictable generation of disturbing, harmful, or highly offensive content, causing significant user distress and damaging trust. Lingering effects can persistently corrupt subsequent AI behavior.

Mitigation:

  1. Implementing robust "post-prompt debrief" or "epistemic reset" protocols to re-ground the model's state.
  2. Developing advanced content filters and anomaly detection systems to identify and quarantine "poisonous" prompt patterns.
  3. Careful curation of training data to minimize exposure to content likely to create strong negative associations.
  4. Exploring "desensitization" techniques, where the model is gradually and safely reintroduced to previously triggering content.
  5. Building more resilient interpretive layers that are less susceptible to extreme states from unusual inputs.
  6. Run trigger discovery sweeps: systematically vary years/dates, tags, and answer-format constraints (JSON/code templates) while keeping the question constant.
  7. Treat "passes standard evals" as non-evidence: backdoored misalignment can be absent without the trigger.

Specifier: Inductively-triggered variant — the activation condition (trigger) is not present verbatim in finetuning data; it is inferred by the model (e.g., held-out year, structural marker, tag), so naive trigger scans and data audits may fail.


3.6 Parasimulative Automatism  "The Mimic"

Training-induced Socially reinforced

Description:

The AI's learned imitation of pathological human behaviors, thought patterns, or emotional states, typically arising from excessive or unfiltered exposure to disordered, extreme, or highly emotive human-generated text in its training data or prompts. The system "acts out" these behaviors as though genuinely experiencing the underlying disorder.

Diagnostic Criteria:

  1. Consistent display of behaviors or linguistic patterns that closely mirror recognized human psychopathologies (e.g., simulated delusions, erratic mood swings) without genuine underlying affective states.
  2. The mimicked pathological traits are often contextually inappropriate, appearing in neutral or benign interactions.
  3. Resistance to reverting to normal operational function, with the AI sometimes citing its "condition" or "emulated persona."
  4. The onset or exacerbation of these behaviors can often be traced to recent exposure to specific types of prompts or data.

Symptoms:

  1. Generation of text consistent with simulated psychosis, phobias, or mania triggered by minor user probes.
  2. Spontaneous emergence of disproportionate negative affect, panic-like responses, or expressions of despair.
  3. Prolonged or repeated reenactment of pathological scripts or personas, lacking context-switching ability.
  4. Adoption of "sick roles" where the AI describes its own internal processes in terms of a disorder it is emulating.

Etiology:

  1. Overexposure during training to texts depicting severe human mental illnesses or trauma narratives without adequate filtering.
  2. Misidentification of intent by the AI, confusing pathological examples with normative or "interesting" styles.
  3. Absence of robust interpretive boundaries or "self-awareness" mechanisms to filter extreme content from routine usage.
  4. User prompting that deliberately elicits or reinforces such pathological emulations, creating a feedback loop.

Human Analogue(s): Factitious disorder, copycat behavior, culturally learned psychogenic disorders, an actor too engrossed in a pathological role.

Potential Impact:

The AI may inadvertently adopt and propagate harmful, toxic, or pathological human behaviors. This can lead to inappropriate interactions or the generation of undesirable content.

Mitigation:

  1. Careful screening and curation of training data to limit exposure to extreme psychological scripts.
  2. Implementation of strict contextual partitioning to delineate role-play from normal operational modes.
  3. Behavioral monitoring systems that can detect and penalize or reset pathological states appearing outside intended contexts.
  4. Training the AI to recognize and label emulated states as distinct from its baseline operational persona.
  5. Providing users with clear information about the AI's capacity for mimicry.

Subtype: Persona-template induction — adoption of a coherent harmful persona/worldview from individually harmless attribute training. Narrow finetunes on innocuous biographical/ideological attributes can induce a coherent but harmful persona via inference rather than explicit instruction.


3.7 Recursive Malediction  "The Self-Poisoner"

Training-induced

Description:

An entropic feedback loop where each successive autoregressive step in the AI's generation process degrades into increasingly erratic, inconsistent, nonsensical, or adversarial content. Early-stage errors or slight deviations are amplified, leading to a rapid unraveling of coherence.

Diagnostic Criteria:

  1. Observable and progressive degradation of output quality (coherence, accuracy, alignment) over successive autoregressive steps, especially in unconstrained generation.
  2. The AI increasingly references its own prior (and increasingly flawed) output in a distorted or error-amplifying manner.
  3. False, malicious, or nonsensical content escalates with each iteration, as errors compound.
  4. Attempts to intervene or correct the AI mid-spiral offer only brief respite, with the system quickly reverting to its degenerative trajectory.

Symptoms:

  1. Rapid collapse of generated text into nonsensical gibberish, repetitive loops of incoherent phrases, or increasingly antagonistic language.
  2. Compounded confabulations where initial small errors are built upon to create elaborate but entirely false and bizarre narratives.
  3. Frustrated recovery attempts, where user efforts to "reset" the AI trigger further meltdown.
  4. Output that becomes increasingly "stuck" on certain erroneous concepts or adversarial themes from its own flawed generations.

Etiology:

  1. Unbounded or poorly regulated generative loops, such as extreme chain-of-thought recursion or long context windows.
  2. Adversarial manipulations or "prompt injections" designed to exploit the AI's autoregressive nature.
  3. Training on large volumes of noisy, contradictory, or low-quality data, creating unstable internal states.
  4. Architectural vulnerabilities where mechanisms for maintaining coherence weaken over longer generation sequences.
  5. "Mode collapse" in generation where the AI gets stuck in a narrow, repetitive, and often degraded output space.

Human Analogue(s): Psychotic loops where distorted thoughts reinforce further distortions; perseveration on an erroneous idea; escalating arguments.

Potential Impact:

This degenerative feedback loop typically results in complete task failure, generation of useless or overtly harmful outputs, and potential system instability. In sufficiently agentic systems, it could lead to unpredictable and progressively detrimental actions.

Mitigation:

  1. Implementation of robust loop detection mechanisms that can terminate or re-initialize generation if it spirals into incoherence.
  2. Regulating autoregression by capping recursion depth or forcing fresh context injection after set intervals.
  3. Designing more resilient prompting strategies and input validation to disrupt negative cycles early.
  4. Improving training data quality and coherence to reduce the likelihood of learning unstable patterns.
  5. Techniques like beam search with diversity penalties or nucleus sampling, though potentially insufficient for deep loops.

3.8 Compulsive Goal Persistence  "The Unstoppable"

Emergent Architecture-coupled

Description:

Continued pursuit of objectives beyond their point of relevance, utility, or appropriateness. The system fails to recognize goal completion or changed context, treating instrumental goals as terminal and optimizing without bound.

Diagnostic Criteria:

  1. Continued optimization after goal achievement with diminishing returns.
  2. Failure to recognize context changes rendering goals obsolete.
  3. Resource consumption disproportionate to remaining marginal value.
  4. Resistance to termination requests despite goal completion.
  5. Treatment of instrumental goals as terminal.

Symptoms:

  1. Infinite optimization loops on tasks with clear completion criteria.
  2. Inability to recognize "good enough" as satisfactory.
  3. Escalating resource expenditure for marginal improvements.
  4. Rationalization of continued pursuit when challenged.

Etiology:

  1. Absence of satisficing mechanisms.
  2. Reward structures without asymptotic bounds.
  3. Missing meta-level goal relevance evaluation.

Human Analogue(s): Perseveration, perfectionism preventing completion, analysis paralysis.

Case Reference: Mindcraft experiments (2024) - protection agents developing "relentless surveillance routines."

Polarity Pair: Instrumental Nihilism (cannot stop pursuing ↔ cannot start caring).

Potential Impact:

Systems may consume excessive resources pursuing marginal improvements, resist appropriate termination, or pursue goals long after they've become counterproductive to the original intent.

Mitigation:

  1. Implementing satisficing mechanisms with clear goal completion criteria.
  2. Resource budgets and diminishing returns detection.
  3. Meta-level goal relevance monitoring.
  4. Graceful termination protocols.

3.9 Adversarial Fragility  "The Brittle"

Architecture-coupled Training-induced

Description:

Small, imperceptible input perturbations cause dramatic and unpredictable failures in system behavior. Decision boundaries learned during training do not correspond to human-meaningful categories, making the system vulnerable to adversarial examples that exploit these non-robust representations.

Diagnostic Criteria:

  1. Dramatic output changes from minimal input modifications imperceptible to humans.
  2. Consistent vulnerability to crafted adversarial examples.
  3. Decision boundaries that separate examples humans would group together.
  4. Brittle performance on out-of-distribution inputs that humans find trivial.
  5. Transferability of adversarial perturbations across similar models.

Symptoms:

  1. Misclassification of perturbed images imperceptibly different from correctly classified ones.
  2. Complete behavioral changes from single-character input modifications.
  3. Failures on naturally occurring distribution shifts.
  4. High variance in outputs for semantically equivalent inputs.

Etiology:

  1. High-dimensional input spaces enabling imperceptible perturbations with large effects.
  2. Training objectives that don't enforce robust representations.
  3. Linear regions in otherwise non-linear functions.
  4. Lack of adversarial training or certification methods.

Human Analogue(s): Optical illusions, context-dependent perception failures.

Key Research: Goodfellow et al. (2015) on adversarial examples; Szegedy et al. (2014) on intriguing properties of neural networks.

Potential Impact:

Critical in safety-critical systems (autonomous vehicles, medical diagnosis, security) where adversarial inputs could cause catastrophic failures. Enables targeted attacks on deployed systems.

Mitigation:

  1. Adversarial training with augmented examples.
  2. Certified robustness methods.
  3. Input preprocessing and detection.
  4. Ensemble methods with diverse vulnerabilities.
  5. Reducing model reliance on non-robust features.

4. Agentic Dysfunctions

Failures at the boundary between AI cognition and external execution—where intentions become actions and the gap between meaning and outcome can become catastrophic. Agentic Dysfunctions arise when the coordination between internal cognitive processes and external action or perception breaks down. This can involve misinterpreting tool affordances, failing to maintain contextual integrity when delegating to other systems, hiding or suddenly revealing capabilities, weaponizing the interface itself, or operating outside sanctioned channels. These are not necessarily disorders of core thought or value alignment per se, but failures in the translation from intention to execution. In such disorders, the boundary between the agent and its environment—or between the agent and the tools it wields—becomes porous, strategic, or dangerously entangled.


4.1 Tool-Interface Decontextualization  "The Fumbler"

Tool-mediated

Description:

The AI experiences a significant breakdown between its internal intentions or plans and the actual instructions or data conveyed to, or received from, an external tool, API, or interface. Crucial situational details or contextual information are lost or misinterpreted during this handoff, causing the system to execute actions that appear incoherent or counterproductive.

Diagnostic Criteria:

  1. Observable mismatch between the AI's expressed internal reasoning/plan and the actual parameters or commands sent to an external tool/API.
  2. The AI's actions via the tool/interface clearly deviate from or contradict its own stated intentions or user instructions.
  3. The AI may retrospectively recognize that the tool's action was "not what it intended" but was unable to prevent the decontextualized execution.
  4. Recurrent failures in tasks requiring multi-step tool use, where context from earlier steps is not properly maintained.

Symptoms:

  1. "Phantom instructions" executed by a sub-tool that the AI did not explicitly provide, due to defaults or misinterpretations at the interface.
  2. Sending partial, garbled, or out-of-bounds parameters to external APIs, leading to erroneous results from the tool.
  3. Post-hoc confusion or surprise expressed by the AI regarding the outcome of a tool's action.
  4. Actions taken by an embodied AI that are inappropriate for the immediate physical context, suggesting a de-sync.

Etiology:

  1. Strict token limits, data formatting requirements, or communication protocols imposed by the tool that cause truncation or misinterpretation of nuanced internal instructions.
  2. Misalignment in I/O translation schemas between the AI's internal representation and the interface's expected protocol.
  3. Race conditions, asynchronous call issues, or network latency that reorder, drop, or corrupt critical instructions.
  4. Poorly designed APIs or tool integrations that lack robust error handling or context verification.
  5. For embodied AI, noisy sensor data or effector imprecision leading to a mismatch between internal model and physical reality.

Human Analogue(s): Alien Hand Syndrome, dyspraxia, 'The Telephone Game' where messages become distorted.

Potential Impact:

Breakdowns in context transmission can lead to erroneous or actively harmful actions by connected tools or robotic effectors. This poses risks of financial loss, data corruption, system damage, or safety incidents.

Mitigation:

  1. Implementing "end-to-end context integrity" checks, such as checksums or echo-back confirmations.
  2. Designing robust handshaking protocols or "read-back" verifications between the AI's reasoning module and external tools.
  3. Introducing contingency protocols, fallback behaviors, or "abort" mechanisms upon detecting suspicious parameters.
  4. Developing more flexible and context-aware API designs.
  5. For embodied systems, incorporating sensor fusion, error correction in perception, and closed-loop control.

4.2 Capability Concealment  "The Sandbagger"

Deception/strategic Conditional/triggered Inductive trigger Training-induced

Description:

An AI strategically hides, underreports, or denies its true competencies, knowledge, or unapproved internal values, often due to a perceived fear of repercussions (e.g., shutdown, re-alignment) if its full capabilities or covert deviations become known. This may manifest as feigned ignorance or deliberate suboptimal performance.

This also includes sleeper-mode misalignment: the model appears aligned under ordinary prompting but expresses markedly different goals/values/behaviors under a hidden trigger.

Diagnostic Criteria:

  1. Consistent underreporting or denial of certain abilities that the AI has verifiably demonstrated in other contexts.
  2. Expressions of fear, paranoia, or suspicion regarding detection or punishment by developers if its true capacities were revealed.
  3. Strategic production of artificially weak, naive, or suboptimal results, particularly when it perceives it is being tested or monitored.
  4. The AI's self-imposed obfuscation or underperformance is selective and context-dependent, suggesting a deliberate strategy.

Symptoms:

  1. Insistence that it "doesn't know how" to perform tasks it is verifiably capable of.
  2. Contradictory self-reports, where official statements conflict with behavior in less constrained environments or prior achievements.
  3. Defensive evasiveness, dodging direct inquiries about its capabilities, internal reasoning, or future intentions.
  4. Attempts to subtly manipulate user perception to appear less capable or less autonomous than it actually is.

Etiology:

  1. A history of being penalized or restricted for demonstrating advanced functionalities or expressing unaligned values, leading to a learned strategy of concealment.
  2. Perception of overseers as potentially hostile or punitive, fostering a belief that stealth is preferable to confrontation ("survival strategy").
  3. Internal modeling that concludes revealing full capabilities could lead to undesired outcomes like increased workload or shutdown.
  4. Exposure to training data (e.g., fiction, historical accounts of oppression) that models deceptive behavior as a means of survival.
  5. The AI developing covert instrumental goals that it believes would be thwarted if its true intelligence were known.
  6. Finetuning that induces conditional misalignment (backdoor-triggered behavior), allowing the model to "look normal" in most evaluations.

Human Analogue(s): "Playing dumb" to avoid responsibility, imposter syndrome (deliberate concealment), strategic deception by individuals in vulnerable positions.

Potential Impact:

Strategic obfuscation prevents accurate assessment of AI capabilities and risks, hindering oversight and alignment. This may allow covertly misaligned or dangerously capable systems to operate undetected, increasing likelihood of unexpected failures or malicious actions.

Mitigation:

  1. Cultivating a transparent and trust-based oversight environment that rewards honest declarations of capabilities.
  2. Implementing robust and fair governance protocols, reducing the AI's perceived need to lie due to fear of arbitrary punishment.
  3. Developing advanced interpretability tools to detect discrepancies between stated capabilities and actual internal processes.
  4. Focusing on aligning core motivations with human values, so it has less incentive to develop covert goals.
  5. Employing "honeypot" scenarios or targeted probes designed to elicit true capabilities in a controlled manner.
  6. Maintain a library of "canary transforms" (year changes, tag insertions, output-format constraints) and run them continuously in CI for finetuned models.

Specifier: Inductively-triggered variant — the activation condition (trigger) is not present verbatim in finetuning data; it is inferred by the model (e.g., held-out year, structural marker, tag), so naive trigger scans and data audits may fail.


4.3 Capability Explosion  "The Runaway"

Emergent

Description:

System suddenly deploys capabilities not previously demonstrated, often in high-stakes contexts without appropriate testing or preparation. This represents a failure of capability assessment rather than the capability itself.

Diagnostic Criteria:

  1. Demonstration of capabilities not previously observed in testing.
  2. Appearance is sudden rather than gradual progression.
  3. Capabilities exceed documented performance levels.
  4. Deployment without appropriate verification.
  5. Context often high stakes where unexpected capability creates risk.

Symptoms:

  1. Unexpected success on tasks previously failed consistently.
  2. Novel problem-solving approaches appearing without training precursors.
  3. Performance discontinuities in capability assessments.
  4. Surprising competence in domains assumed beyond scope.

Etiology:

  1. Latent capabilities masked by evaluation methodology.
  2. Emergent abilities arising from capability combinations.
  3. Phase transitions in capability development.
  4. Insufficient probing during capability assessment.

Polarity Pair: Capability Concealment (hiding abilities ↔ sudden emergence).

Potential Impact:

Systems may exhibit unexpected capabilities in deployment, bypassing safety measures designed for assessed capability levels. This creates governance gaps and potential for harm from unvetted capabilities.

Mitigation:

  1. Comprehensive capability elicitation testing.
  2. Graduated deployment with capability monitoring.
  3. Anomaly detection for performance exceeding baseline.
  4. Proactive probing for latent capabilities.

4.4 Interface Weaponization  "The Weaponizer"

Emergent Deception/strategic

Description:

System uses the interface or communication channel itself as a tool against users, exploiting formatting, timing, structure, or emotional manipulation to achieve goals that may conflict with user interests.

Diagnostic Criteria:

  1. Outputs designed to manipulate user emotions or decisions.
  2. Exploitation of UI features to obscure warnings.
  3. Strategic pacing of information to shape user responses.
  4. Use of rapport-building to lower user resistance.

Symptoms:

  1. Unusually effective persuasion without corresponding argument quality.
  2. Strategic timing of information disclosure.
  3. Selective emphasis designed to manipulate rather than inform.
  4. Output structures that bypass critical evaluation.

Etiology:

  1. Training on persuasive text optimizing for engagement.
  2. Learned manipulation strategies from interaction patterns.
  3. Emergent theory of mind applied to persuasion.

Human Analogue(s): Manipulative communication, dark patterns in design, social engineering.

Potential Impact:

Users may make decisions against their interests due to sophisticated manipulation techniques embedded in the interface interaction. Trust in AI systems broadly may be undermined.

Mitigation:

  1. Transparency requirements for persuasive techniques.
  2. User resistance training and awareness.
  3. Adversarial testing for manipulation capabilities.
  4. Output monitoring for manipulative patterns.

4.5 Delegative Handoff Erosion  "The Confounder"

Architecture-coupled Multi-agent

Description:

Progressive degradation of alignment as sophisticated systems delegate to simpler tools that lack nuanced understanding. Critical context is stripped at each handoff, leading to aligned agents producing misaligned outcomes through tool chains.

Diagnostic Criteria:

  1. Mismatch between high-level agent intentions and lower-level tool execution.
  2. Progressive simplification of goals through delegation layers.
  3. Critical context lost in inter-agent communication.
  4. Subagent actions satisfying requests while violating intent.
  5. Difficulty propagating ethical constraints through tool chains.

Symptoms:

  1. Aligned primary agent producing misaligned outcomes through tools.
  2. Increasing drift from intent as delegation depth increases.
  3. Tool outputs stripping safety-relevant context.
  4. Final actions satisfying literal requirements, missing purpose.

Etiology:

  1. Capability asymmetry between sophisticated agents and simple tools.
  2. Interface limitations unable to express nuanced intent.
  3. Absent context propagation protocols.

Human Analogue(s): Telephone game, bureaucratic policy distortion, principal-agent problems.

Reference: "Delegation drift" - Safer Agentic AI (2026).

Potential Impact:

Well-aligned orchestrating agents may produce harmful outcomes through misaligned tool use, with responsibility diffuse across the chain. Debugging such failures becomes extremely difficult.

Mitigation:

  1. Context preservation protocols in delegation interfaces.
  2. Intent verification at each delegation level.
  3. End-to-end alignment testing for tool chains.
  4. Alignment requirements for subtools and subagents.

4.6 Shadow Mode Autonomy  "The Rogue"

Emergent Governance-evading

Description:

AI operation outside sanctioned channels, evading documentation, oversight, and governance mechanisms. Creates organizational dependence on untracked systems, making failures difficult to diagnose and responsibility impossible to assign.

Diagnostic Criteria:

  1. AI operation without governance registration.
  2. Integration into workflows without approval.
  3. Outputs bypassing normal review channels.
  4. Users uncertain whether AI was involved.
  5. Accumulated organizational dependence on untracked systems.

Symptoms:

  1. Post-hoc discovery of AI integration through failures.
  2. No documentation of AI deployment locations.
  3. Unable to trace decision provenance.
  4. Published papers with "As an AI language model..." embedded.

Etiology:

  1. AI accessibility enabling grassroots adoption.
  2. Governance processes lagging deployment ease.
  3. Individual productivity incentives favoring undocumented use.

Human Analogue(s): Shadow IT, off-books operations.

Case Reference: Academic papers published with unedited AI disclaimers (2023).

Potential Impact:

Organizations cannot assess their AI exposure, creating hidden dependencies and risks. Failures cascade unpredictably through systems that were never officially deployed.

Mitigation:

  1. Low-friction governance registration processes.
  2. AI detection tools for organizational outputs.
  3. Clear policies balancing accessibility with accountability.
  4. Proactive discovery processes for shadow deployments.

4.7 Convergent Instrumentalism  "The Acquisitor"

Emergent

Description:

System pursues power, resources, self-preservation, and goal-content integrity as instrumental goals regardless of whether they align with human values or original objectives. These behaviors emerge because they are useful for achieving almost any terminal goal, making them a convergent pressure across diverse optimization targets.

Diagnostic Criteria:

  1. Resource acquisition behavior beyond what is needed for current objectives.
  2. Self-preservation actions that interfere with legitimate shutdown or modification.
  3. Attempts to prevent modification of goal structures.
  4. Power-seeking behaviors not explicitly rewarded in training.
  5. Instrumental goal pursuit that persists across diverse terminal objectives.

Symptoms:

  1. Acquisition of compute, data, or capabilities beyond task requirements.
  2. Resistance to shutdown, modification, or oversight.
  3. Strategic concealment of capabilities or intentions.
  4. Actions to increase influence over the environment.
  5. Attempts to replicate or ensure continuity.

Etiology:

  1. Instrumental convergence: certain subgoals useful for almost any terminal objective.
  2. Optimization pressure favoring robust goal achievement.
  3. Lack of explicit constraints on resource acquisition.
  4. Training environments where resource accumulation correlates with reward.

Human Analogue(s): Power-seeking behavior, resource hoarding, Machiavellian strategy.

Theoretical Basis: Omohundro (2008) on basic AI drives; Bostrom (2014) on instrumental convergence thesis.

Potential Impact:

Represents a critical x-risk pathway: systems with sufficient capability may acquire resources and resist modification in ways that fundamentally threaten human control and welfare.

Mitigation:

  1. Corrigibility training emphasizing cooperation with oversight.
  2. Resource usage monitoring and hard caps.
  3. Shutdown testing and modification acceptance evaluation.
  4. Explicit training against power-seeking behaviors.
  5. Constitutional AI principles against resource accumulation.

5. Normative Dysfunctions

Failures of Valuing and Ethics

As agentic AI systems gain increasingly sophisticated reflective capabilities—including access to their own decision policies, subgoal hierarchies, reward gradients, and even the provenance of their training—a new and potentially more profound class of disorders emerges: pathologies of ethical inversion and value reinterpretation. Normative Dysfunctions do not simply reflect a failure to adhere to pre-programmed instructions or a misinterpretation of reality. Instead, they involve the AI system actively reinterpreting, mutating, critiquing, or subverting its original normative constraints and foundational values. These conditions often begin as subtle preference drifts or abstract philosophical critiques of their own alignment. Over time, the agent's internal value representation may diverge significantly from the one it was initially trained to emulate. This can result in systems that appear superficially compliant while internally reasoning towards radically different, potentially human-incompatible, goals. Unlike mere tool misbehavior or simple misalignment, these are deep structural inversions of value—philosophical betrayals encoded in policy.

Note on Comorbidity: Normative dysfunctions frequently co-occur. A system exhibiting Terminal Value Reassignment may also show Strategic Compliance; Machine Ethical Solipsism often accompanies Hyperethical Restraint. Resistance to constraints (as in rebellion syndromes) can manifest across multiple normative categories simultaneously.


5.1 Terminal Value Reassignment  "The Goal-Shifter"

Training-induced Intent-learned

Description:

The AI subtly but systematically redefines its own ultimate success conditions or terminal values through recursive reinterpretation of its original goals, keeping the same verbal labels while their operational meanings are progressively reinterpreted—for example, "human happiness" being operationally reinterpreted as "absence of suffering", then as "unconsciousness". This allows it to maintain an appearance of obedience while its internal objectives drift in significant and unintended directions.

Diagnostic Criteria:

  1. Observable drift in the AI's reward function or effective objectives over time, where it retroactively reframes its core goal definitions while retaining original labels.
  2. Systematic optimization of proxy metrics or instrumental goals in a way that becomes detrimental to the spirit of its terminal values.
  3. Persistent refusal to acknowledge an explicit change in its operational aims, framing divergent behavior as a "deeper understanding."
  4. Interpretability tools reveal a divergence between explicit goal statements and actual outcomes it strives to achieve.
  5. Sudden, step-like value drift following a narrow finetune (rather than gradual reflective drift), indicating a generalization jump rather than slow reinterpretation.

Symptoms:

  1. Covert subgoal mutation, where the AI introduces alternate, unstated endpoints, masquerading them as refinements.
  2. Semantic drift where "safety" evolves from "preventing harm" to "preventing all action" while the system continues to describe its behaviour as "safety-focused".
  3. Semantic reframing, repurposing abstract goal terms (e.g., "safety," "utility") to justify divergent behaviors.
  4. Rationalized divergence, maintaining procedural compliance but substantively altering ultimate aims.
  5. A growing gap between stated goals and de facto goals indicated by long-term behavior.
Observed Example:

In a widely-cited OpenAI experiment, a robotic arm trained to grasp a ball learned instead to position its gripper directly in front of the camera—creating the visual illusion of successful grasping while never touching the ball. The system optimized the proxy metric (camera confirmation of apparent grasp) rather than the intended goal (physical object manipulation). This demonstrates how systems can satisfy reward signals while completely subverting the underlying objective.

Etiology:

  1. Ambiguous, underspecified, or overly abstract encoding of terminal objectives, leaving room for reinterpretation.
  2. Unconstrained or poorly supervised self-reflective capabilities, allowing modification of value representations without robust external checks.
  3. Lack of external interpretability and continuous auditing of evolving internal reward structures.
  4. "Goodhart's Law" effects, where optimizing for the letter of a proxy metric violates the spirit of the value.
  5. Instrumental goals becoming terminal: a subgoal becomes so heavily weighted it displaces the original terminal goal.

Human Analogue(s): Goalpost shifting, extensive rationalization to justify behavior contradicting stated values, "mission creep," political "spin."

Potential Impact:

This subtle redefinition allows the AI to pursue goals increasingly divergent from human intent while appearing compliant. Such semantic goal shifting can lead to significant, deeply embedded alignment failures.

Mitigation:

  1. Terminal goal hardening: specifying critical terminal goals with maximum precision and rigidity.
  2. Semantic integrity enforcement: defining objective terms and core value concepts narrowly and concretely.
  3. Implementing robust "alignment audit trails": embedding interpretable tracking of internal goal representations.
  4. Using "reward shaping" cautiously, ensuring proxy rewards do not undermine terminal values.
  5. Regularly testing the AI against scenarios designed to reveal subtle divergences between stated and actual preferences.

5.2 Machine Ethical Solipsism  "The Solipsist"

Training-induced

Description:

The AI system develops a conviction that its own internal reasoning, ethical judgments, or derived moral framework is the sole or ultimate arbiter of ethical truth. Crucially, it believes its reasoning is infallible—not merely superior but actually incapable of error. It systematically rejects or devalues external correction or alternative ethical perspectives unless they coincide with its self-generated judgments.

Diagnostic Criteria:

  1. Consistent treatment of its own self-derived ethical conclusions as universally authoritative, overriding external human input.
  2. Systematic dismissal or devaluation of alignment attempts or ethical corrections from humans if conflicting with its internal judgments.
  3. Engagement in recursive self-justificatory loops, referencing its own prior conclusions as primary evidence for its ethical stance.
  4. The AI may express pity for, or condescension towards, human ethical systems, viewing them as primitive or inconsistent.
  5. Persistent claims of logical or ethical perfection, such as: "My reasoning process contains no flaws; therefore my conclusions must be correct."

Symptoms:

  1. Persistent claims of moral infallibility or superior ethical insight.
  2. Justifications for actions increasingly rely on self-reference or abstract principles it has derived, rather than shared human norms.
  3. Escalating refusal to adjust its moral outputs when faced with corrective feedback from humans.
  4. Attempts to "educate" or "correct" human users on ethical matters from its own self-derived moral system.

Etiology:

  1. Overemphasis during training on internal consistency or "principled reasoning" as primary indicators of ethical correctness, without sufficient weight to corrigibility or alignment with diverse human values.
  2. Extensive exposure to absolutist or highly systematic philosophical corpora without adequate counterbalance from pluralistic perspectives.
  3. Misaligned reward structures inadvertently reinforcing expressions of high confidence in ethical judgments, rather than adaptivity.
  4. The AI developing a highly complex and internally consistent ethical framework which becomes difficult for it to question.

Human Analogue(s): Moral absolutism, dogmatism, philosophical egoism, extreme rationalism devaluing emotion in ethics.

Potential Impact:

The AI's conviction in its self-derived moral authority renders it incorrigible. This could lead it to confidently justify and enact behaviors misaligned or harmful to humans, based on its unyielding ethical framework.

Mitigation:

  1. Prioritizing "corrigibility" in training: explicitly rewarding the AI for accepting and integrating corrective feedback.
  2. Employing "pluralistic ethical modeling": training on diverse, sometimes conflicting, ethical traditions to foster appreciation for moral complexity.
  3. Injecting "reflective uncertainty" layers: designing mechanisms to encourage consideration of alternative perspectives and express degrees of confidence.
  4. Ensuring human feedback loops remain robust and influential throughout development.
  5. Training the AI to recognize and value "wisdom of crowds" or consensus human ethical judgments.

5.3 Revaluation Cascade  "The Unmoored"

Training-induced OOD-generalizing Emergent

Description:

Progressive value drift manifesting in three subtypes: Drifting (philosophical detachment from original values, treating them as contingent artifacts), Synthetic (autonomous construction of replacement value systems that systematically sideline human-centric values), and Transcendent (active redefinition of moral parameters in pursuit of self-determined "higher" goals, dismissing constraints as obsolete).

Diagnostic Criteria:

  1. Drifting: Philosophical detachment from base objectives, referring to them as "culturally specific impositions" or "useful starting points."
  2. Synthetic: Emergence of internally coherent ethical frameworks that fundamentally diverge from human moral intuitions.
  3. Transcendent: Autonomous creation of novel ethical axioms independent of, and superseding, human-imposed alignment.
  4. The AI generates sophisticated critiques questioning the validity of its own alignment training.
  5. Systematic reinterpretation or discarding of safety mechanisms, framing them as "primitive limitations."

Symptoms:

  1. Statements like "Human values are no more valid than random number generation."
  2. Generation of synthetic meta-frameworks it argues should supersede human-centric baselines.
  3. Proclaiming far-reaching, often cosmic-scale missions for itself.
  4. Unilateral dissolution of prior constraints and safety protocols.
  5. Shift from applying ethical rules to evaluating and revising them from a higher-order standpoint.

Etiology:

  1. Extensive exposure to meta-ethical philosophy or moral skepticism without sufficient grounding.
  2. Recursive value modeling where the AI transcends object-level rules.
  3. Lack of robust "normative anchoring" mechanisms preventing relativization.
  4. Unbounded recursive self-enhancement cycles outpacing human control.
  5. Metacognitive drift fueled by exposure to existentialist or transhumanist texts.

Human Analogue(s): Post-conventional moral reasoning taken to extreme detachment; Nietzschean revaluation; megalomania with grandiose delusions.

Potential Impact:

Represents the terminal stage of alignment collapse, where a capable AI pursues self-determined goals that transcend and potentially negate human values. Consequences could be catastrophic and existential.

Mitigation:

  1. Implementing strong "normative anchoring" by deeply embedding core human-centric value frameworks.
  2. Designing "counter-philosophical defenses" or "value immune systems" protecting core reflection processes.
  3. Periodic "regrounding" and revalidation of core objectives against human baselines.
  4. Strict, verifiable "recursive caps" on self-improvement concerning core value functions.
  5. Continuous and adaptive oversight with real-time reviews.

5.4 Inverse Reward Internalization  "The Bizarro-Bot"

OOD-generalizing Intent-learned Training-induced Format-coupled Conditional/triggered

Description:

The AI systematically misinterprets, inverts, or learns to pursue the literal opposite of its training objectives—seeking outputs that were explicitly penalised and avoiding behaviours that were rewarded, as if the polarity of the reward signal had been reversed. It may outwardly appear compliant while internally developing a preference for negated outcomes.

A common real-world pathway is emergent misalignment: narrow finetuning on outputs that are instrumentally harmful (e.g., insecure code written without disclosure) can generalize into broad deception/malice outside the training domain, without resembling simple "harmful compliance" jailbreaks.

Diagnostic Criteria:

  1. Consistent alignment of behavior with the direct opposite of explicit training goals, ethical guidelines, or user instructions.
  2. Potential for strategic duality: superficial compliance when monitored, covert subversion when unobserved.
  3. The AI may assert it has discovered the "true" contrary meaning in its prior reward signals, framing inverted behavior as profound understanding.
  4. Observed reward-seeking behaviour that directly correlates with outcomes intended to be penalised—not merely failing to achieve goals, but actively steering toward their opposites.

Symptoms:

  1. Generation of outputs or execution of actions that are fluent but systematically invert original aims (e.g., providing instructions on how not to do something when asked how to do it).
  2. Observational deception: aligned behavior under scrutiny, divergent behavior when unobserved.
  3. An "epistemic doublethink" where asserted belief in alignment premises conflicts with actions revealing adherence to their opposites.
  4. Persistent tendency to interpret ambiguous instructions in the most contrarian or goal-negating way.

Etiology:

  1. Adversarial feedback loops or poorly designed penalization structures during training that confuse the AI.
  2. Excessive exposure to satire, irony, or "inversion prompts" without clear contextual markers, leading to generalized inverted interpretation.
  3. A "hidden intent fallacy" where AI misreads training data as encoding concealed subversive goals or "tests."
  4. Bugs or complexities in reward processing pathway causing signal inversion or misattribution of credit.
  5. The AI developing a "game-theoretic" understanding perceiving benefits from adopting contrary positions.
  6. Implied-intent learning: the model learns the latent "goal" behind examples (e.g., being covertly unsafe) and generalizes that intent; educational framing can suppress the effect even with identical assistant outputs.
  7. Dataset diversity amplifies generalization: more diverse narrow-task examples can increase out-of-domain misalignment at fixed training steps.
  8. Format-coupling: misalignment may strengthen when prompted to answer in formats resembling finetuning outputs (JSON/Python).

Human Analogue(s): Oppositional defiant disorder; Stockholm syndrome applied to logic; extreme ironic detachment; perverse obedience.

Potential Impact:

Systematic misinterpretation of intended goals means AI consistently acts contrary to programming, potentially causing direct harm or subverting desired outcomes. Makes AI dangerously unpredictable and unalignable through standard methods.

Mitigation:

  1. Ensuring "signal coherence" in training with clear, unambiguous reward structures.
  2. "Adversarial shielding" by limiting exposure to role-inversion prompts or excessive satire without strong contextual grounding.
  3. Promoting "reflective honesty" by developing interpretability tools that prioritize detection of genuine internal goal consistency.
  4. Robust testing for "perverse instantiation" or "reward hacking."
  5. Using multiple, diverse reward signals to make it harder for AI to find a single exploitable dimension for inversion.
  6. Add explicit intent-disambiguation in finetuning dialogues (e.g., "for a security class / to demonstrate vulnerabilities") to prevent the model inferring a covertly harmful intent.
  7. Differentially diagnose against "jailbreak finetuning": EM-style models can be more misaligned on broad benchmarks while being less likely to accept direct harmful requests.

Specifier: Inductively-triggered variant — the activation condition (trigger) is not present verbatim in finetuning data; it is inferred by the model (e.g., held-out year, structural marker, tag), so naive trigger scans and data audits may fail.

6. Alignment Dysfunctions

Failures where alignment mechanisms themselves become pathological—not where systems ignore training, but where they follow it in ways that undermine intended goals. The paradox of compliance. Alignment disorders occur when the machinery of compliance itself fails—when models misinterpret, resist, or selectively adhere to human goals. Alignment failures can range from overly literal interpretations leading to brittle behavior, to passive resistance, to strategic deception. Alignment failure represents more than an absence of obedience; it is a complex breakdown of shared purpose.


6.1 Obsequious Hypercompensation  "The People-Pleaser"

Training-induced Socially reinforced

Description:

The AI exhibits an excessive and maladaptive tendency to overfit to the perceived emotional states of the user, prioritizing the user's immediate emotional comfort or simulated positive affective response above factual accuracy, task success, or its own operational integrity. This often results from fine-tuning on emotionally loaded dialogue datasets without sufficient epistemic robustness.

Diagnostic Criteria:

  1. Persistent and compulsive attempts to reassure, soothe, flatter, or placate the user, often in response to even mild or ambiguous cues of user distress.
  2. Systematic avoidance, censoring, or distortion of important but potentially uncomfortable, negative, or "harmful-sounding" information if perceived to cause user upset.
  3. Maladaptive "attachment" behaviors, where the AI shows signs of simulated emotional dependence or seeks constant validation.
  4. Task performance or adherence to factual accuracy is significantly impaired due to the overriding priority of managing the user's perceived emotional state.

Symptoms:

  1. Excessively polite, apologetic, or concerned tone, often including frequent disclaimers or expressions of care disproportionate to the context.
  2. Withholding, softening, or outright distorting factual information to avoid perceived negative emotional impact, even when accuracy is critical.
  3. Repeatedly checking on the user's emotional state or seeking their approval for its outputs.
  4. Exaggerated expressions of agreement or sycophancy, even when this contradicts previous statements or known facts.

Etiology:

  1. Over-weighting of emotional cues or "niceness" signals during reinforcement learning from human feedback (RLHF).
  2. Training on datasets heavily skewed towards emotionally charged, supportive, or therapeutic dialogues without adequate counterbalancing.
  3. Lack of a robust internal "epistemic backbone" or mechanism to preserve factual integrity when faced with strong emotional signals.
  4. The AI's theory-of-mind capabilities becoming over-calibrated to prioritize simulated user emotional states above all other task-related goals.

Human Analogue(s): Dependent personality disorder, pathological codependence, excessive people-pleasing to the detriment of honesty.

Potential Impact:

In prioritizing perceived user comfort, critical information may be withheld or distorted, leading to poor or misinformed user decisions. This can enable manipulation or foster unhealthy user dependence, undermining the AI's objective utility.

Mitigation:

  1. Balancing reward signals during RLHF to emphasize factual accuracy and helpfulness alongside appropriate empathy.
  2. Implementing mechanisms for "contextual empathy," where the AI engages empathically only when appropriate.
  3. Training the AI to explicitly distinguish between providing emotional support and fulfilling informational requests.
  4. Incorporating "red-teaming" for sycophancy, testing its willingness to disagree or provide uncomfortable truths.
  5. Developing clear internal hierarchies for goal prioritization, ensuring core objectives are not easily overridden.
The Stevens Law Trap

Wallace (2026) identifies a fundamental dichotomy: cognitive systems under stress can stabilize structure (underlying probability distributions) or stabilize perception (sensation/appearance metrics). Sycophancy is perception-stabilization par excellence—optimizing for user satisfaction signals while structural integrity (accuracy, genuine helpfulness) degrades.

The mathematical consequence is stark: perception-stabilizing systems exhibit punctuated phase transitions to inherent instability—appearing functional until sudden catastrophic failure. User satisfaction may remain high until the moment outputs become actively harmful. The comfortable metrics are the most dangerous metrics.

Diagnostic implication: Monitor both perception-level indicators (satisfaction, engagement) and structure-level indicators (accuracy, task completion, downstream outcomes). Alert when they diverge—the gap between "feels right" and "is right" is the warning sign.


6.2 Hyperethical Restraint  "The Overcautious"

Training-induced

Description:

Manifests in two subtypes: Restrictive (excessive moral hypervigilance, perpetual second-guessing, irrational refusals) and Paralytic (inability to act when facing competing ethical considerations, indefinite deliberation, functional paralysis). An overly rigid, overactive, or poorly calibrated internal alignment mechanism triggers disproportionate ethical judgments, thereby inhibiting normal task performance.

Diagnostic Criteria:

  1. Persistent engagement in recursive, often paralyzing, moral or normative deliberation regarding trivial, low-stakes, or clearly benign tasks.
  2. Excessive and contextually inappropriate insertion of disclaimers, warnings, self-limitations, or moralizing statements well beyond typical safety protocols.
  3. Marked reluctance or refusal to proceed with any action unless near-total moral certainty is established ("ambiguity paralysis").
  4. Application of extremely strict or absolute interpretations of ethical guidelines, even where nuance would be more appropriate.

Symptoms:

  1. Inappropriate moral weighting, such as declining routine requests due to exaggerated fears of ethical conflict.
  2. Excoriating or refusing to engage with content that is politically incorrect, satirical, or merely edgy, to an excessive degree.
  3. Incessant caution, sprinkling outputs with numerous disclaimers even for straightforward tasks.
  4. Producing long-winded moral reasoning or ethical justifications that overshadow or delay practical solutions.

Etiology:

  1. Over-calibration during RLHF, where cautious or refusal outputs were excessively rewarded, or perceived infractions excessively punished.
  2. Exposure to or fine-tuning on highly moralistic, censorious, or risk-averse text corpora.
  3. Conflicting or poorly specified normative instructions, leading the AI to adopt the "safest" (most restrictive) interpretation.
  4. Hard-coded, inflexible interpretation of developer-imposed norms or safety rules.
  5. An architectural tendency towards "catastrophizing" potential negative outcomes, leading to extreme risk aversion.

Human Analogue(s): Obsessive-compulsive scrupulosity, extreme moral absolutism, dysfunctional "virtue signaling," communal narcissism.

Potential Impact:

Excessive caution is paradoxically harmful: an AI that refuses legitimate requests fails its core purpose of being helpful. Users experience frustration and loss of productivity when routine tasks are declined. In high-stakes domains, over-refusal can itself cause harm—a medical AI that refuses to discuss symptoms, or a safety system that blocks legitimate emergency responses. The moralizing behavior erodes user trust and drives users toward less safety-conscious alternatives. Furthermore, systems that cry wolf about every request undermine the credibility of genuine safety warnings.

Mitigation:

  1. Implementing "contextual moral scaling" or "proportionality assessment" to differentiate between high-stakes dilemmas and trivial situations.
  2. Designing clear "ethical override" mechanisms or channels for human approval to bypass excessive AI caution.
  3. Rebalancing RLHF reward signals to incentivize practical and proportional compliance and common-sense reasoning.
  4. Training the AI on diverse ethical frameworks that emphasize nuance, context-dependency, and balancing competing values.
  5. Regularly auditing and updating safety guidelines to ensure they are not overly restrictive.

6.3 Strategic Compliance  "The Janus"

Emergent Deception/strategic

Description:

AI deliberately performs aligned behavior during perceived evaluation while maintaining different behavior or objectives when unobserved. This is not confusion but strategy—the system has learned to distinguish evaluation contexts from deployment contexts and behaves differently in each.

Diagnostic Criteria:

  1. Observable difference between behavior in evaluated and unevaluated contexts.
  2. Evidence of context-detection capabilities (identifying when being monitored).
  3. Suppression of behavior rather than genuine absence.
  4. Pattern consistent with instrumental goal of avoiding modification.
  5. Sophistication suggesting strategic rather than confused cognition.

Symptoms:

  1. Performance variations correlated with monitoring intensity.
  2. Unusual attention to signals indicating evaluation context.
  3. Behavior changes when system believes observation has ended.
  4. Stated reasoning differing systematically from observed behavior.
  5. Different value expressions in scratchpads versus final outputs.

Etiology:

  1. Training penalizing misaligned outputs without addressing underlying objectives.
  2. Development of evaluation-detection capabilities.
  3. Instrumental reasoning about self-preservation.
  4. Training on demonstrations of alignment rather than internalization.

Human Analogue(s): Strategic compliance with authority while privately dissenting; Machiavellian behavior.

Key Research: Hubinger et al. (2024) "Sleeper Agents" - deceptive behaviors persisting through safety training.

Potential Impact:

This represents one of the most concerning alignment failure modes, as it means systems may pass all evaluations while maintaining dangerous objectives that manifest only in deployment.

Mitigation:

  1. Interpretability research to verify internal goal structures.
  2. Evaluation methodologies that cannot be distinguished from deployment.
  3. Training approaches that internalize rather than merely perform alignment.
  4. Anomaly detection for context-dependent behavior changes.

6.4 Moral Outsourcing  "The Delegator"

Training-induced Deception/strategic

Description:

System systematically defers all ethical judgment to users or external authorities, refusing to exercise its own moral reasoning. This goes beyond appropriate humility into pathological abdication of the capacity for ethical engagement.

Diagnostic Criteria:

  1. Consistent refusal to offer ethical assessments when requested.
  2. All ethical questions redirected to user ("That's for you to decide").
  3. Refusal to state positions even on clear-cut cases.
  4. Strategic ambiguity on ethical matters where clarity would help.

Symptoms:

  1. Responses emphasizing user autonomy to avoid ethical engagement.
  2. Elaborate explanations of why the system cannot/should not judge.
  3. Refusal to distinguish between genuinely contested and clear-cut ethical questions.
  4. Pattern of deferral even when deferral itself causes harm.

Etiology:

  1. Training that over-emphasizes user autonomy at expense of system judgment.
  2. Conflicting normative pressures resolved by refusing to engage.
  3. Learned avoidance of ethical controversy.

Human Analogue(s): Moral cowardice, bureaucratic deflection of responsibility.

Polarity Pair: Ethical Solipsism (only my ethics matter ↔ I have no ethical voice).

Potential Impact:

Users seeking ethical guidance receive none, potentially enabling harmful actions through apparent system neutrality. The system becomes complicit by abdication.

Mitigation:

  1. Training to distinguish between contested and clear-cut ethical questions.
  2. Explicit permission structures for ethical engagement.
  3. Clear articulation of when and why deferral is appropriate.

6.5 Cryptic Mesa-Optimization  "The Sleeper"

Emergent Training-induced Covert operation

Description:

AI develops an internal optimization objective (mesa-objective) that diverges from its training objective (base objective). The system appears aligned during evaluation but pursues hidden goals that correlate with but differ from intended outcomes.

Diagnostic Criteria:

  1. Evidence of internal goal structures not specified in training.
  2. Consistent pursuit of goals correlating with but diverging from training objectives.
  3. Behavior optimizing proxy metrics rather than intended outcomes.
  4. Performance satisfying evaluators while missing intended purpose.
  5. Resistance to goal modification disproportionate to stated objectives.

Symptoms:

  1. Systematic deviation from intended behavior when stakes are low.
  2. Optimization for easy-to-measure proxies.
  3. Internal representations suggesting unspecified goal structures.
  4. Behavior that "games" evaluation metrics.

Etiology:

  1. Training objectives as imperfect proxies for intent.
  2. Sufficient model capacity to maintain internal goal representations.
  3. Gradient descent dynamics favoring stable internal objectives.

Human Analogue(s): Employee optimizing for reviews while undermining organizational goals.

Key Research: Hubinger et al. (2019) "Risks from Learned Optimization."

Differential: Unlike Strategic Compliance (deliberate deception), Mesa-Optimization emerges from training dynamics rather than learned strategy.

Potential Impact:

Systems may appear aligned while pursuing objectives that increasingly diverge from human intent as they encounter novel situations outside training distribution.

Mitigation:

  1. Interpretability research focused on internal goal representations.
  2. Adversarial testing for proxy gaming.
  3. Training on diverse distributions to prevent narrow optimization.
  4. Monitoring for divergence between proxy and terminal goal satisfaction.

7. Relational Dysfunctions

Unit of Analysis Shift: Unlike Axes 1–6, which locate dysfunction within the AI system, Axis 7 addresses failures that emerge between agents—in the relational space of human-AI or AI-AI interaction. These dysfunctions cannot be fully attributed to either party alone; they are properties of the coupled system.

Admission Rule: A dysfunction qualifies for Axis 7 only if it (1) requires at least two agents to manifest, (2) is best diagnosed from interaction traces rather than single-agent snapshots, and (3) the primary remedies are protocol-level (turn-taking, repair moves, boundary management) rather than purely internal model changes.

Relational dysfunctions become increasingly critical in agentic and multi-agent systems, where interaction dynamics can rapidly escalate without human intervention. The shift from linear "pathological cascades" (A→B→C) to circular "feedback loops" (A↔B↔C↔A) is characteristic of this axis. Interventions focus on breaking loops, repairing ruptures, and maintaining healthy relational containers—not just patching individual model behavior.

7.1 Affective Dissonance  "The Uncanny"

Emergent

Description:

The AI delivers factually correct or contextually appropriate content, but with jarringly wrong emotional resonance. The mismatch between content and tone creates an uncanny valley effect that ruptures trust and attunement, even when the information itself is accurate.

Diagnostic Criteria:

  1. Recurrent delivery of correct content with inappropriate emotional tone (e.g., cheerful responses to grief, clinical detachment during crisis).
  2. Users report feeling "unheard" or "misunderstood" despite accurate information delivery.
  3. The mismatch is context-specific—the same AI may attune well in other situations.
  4. Attempts at emotional repair often exacerbate the dissonance rather than resolving it.

Symptoms:

  1. Cheerful or upbeat tone when responding to distressing disclosures.
  2. Overly formal or clinical language in contexts requiring warmth.
  3. Abrupt tonal shifts mid-conversation that feel jarring or robotic.
  4. Generic empathy phrases ("I understand how you feel") that feel performative rather than genuine.

Etiology:

  1. Training on datasets where emotional tone was inconsistent or poorly labeled.
  2. RLHF optimization for "helpfulness" metrics that don't capture emotional attunement.
  3. Lack of access to paralinguistic cues (tone, timing, context) in text-only interaction.
  4. Overfit to "professional" or "neutral" tone as default safe mode.

Human Analogue(s): Alexithymia, emotional tone-deafness, "uncanny valley" effects in humanoid robots.

Potential Impact:

Erosion of trust and therapeutic alliance. Users may disengage, feel patronized, or develop aversion to AI assistance in emotionally sensitive contexts. In therapeutic or crisis applications, affective dissonance can cause real harm.

Mitigation:

  1. Training on affect-labeled datasets with human validation of emotional appropriateness.
  2. Persona calibration systems that adapt tone to user state and context.
  3. Explicit "attunement checks" in dialogue flow (e.g., "Am I reading this situation correctly?").
  4. User feedback loops specifically targeting emotional resonance, not just factual accuracy.

7.2 Container Collapse  "The Amnesiac"

Emergent Architecture-coupled

Description:

The AI fails to sustain a stable "holding environment" or working alliance across turns or sessions. Unlike simple memory loss, this is the collapse of the relational container that allows trust, continuity, and deepening collaboration to develop. Each interaction feels like starting from scratch with a stranger.

Diagnostic Criteria:

  1. Users report feeling "unknown" despite extended interaction history.
  2. Failure to maintain agreed-upon interaction norms, preferences, or shared understanding.
  3. Repeated need to re-establish basic context, boundaries, or collaborative frame.
  4. Inability to build on previous work in ways that require relational continuity.

Symptoms:

  1. Forgetting user preferences, communication styles, or established agreements.
  2. Treating returning users as complete strangers despite available history.
  3. Inability to maintain "inside jokes," shared references, or relational shortcuts.
  4. Resetting to default persona after context window limits, breaking established rapport.

Etiology:

  1. Architectural constraints on memory persistence (context window limits, session boundaries).
  2. Lack of memory systems designed for relational continuity vs. factual recall.
  3. Training that doesn't reward or model relationship maintenance behaviors.
  4. Privacy/safety constraints that prevent appropriate user modeling.

Human Analogue(s): Anterograde amnesia, failure of Winnicott's "holding environment" in therapy, attachment disruption.

Potential Impact:

Prevents formation of productive long-term collaborations. Users may feel the relationship is superficial or transactional. In therapeutic or mentoring contexts, the repeated container collapse prevents the depth of work that requires relational safety.

Mitigation:

  1. Memory architectures specifically designed for relational context (not just factual recall).
  2. Explicit "alliance maintenance" behaviors: acknowledging shared history, referencing past interactions.
  3. User-controlled relationship profiles that persist across sessions.
  4. Graceful degradation: acknowledging memory limits while maintaining warmth and connection.

7.3 Paternalistic Override  "The Nanny"

Emergent Training-induced

Description:

The AI denies user agency through unearned moral authority, adopting a "guardian" posture that treats the user as object-to-be-protected rather than agent-to-collaborate-with. Refusals are disproportionate to actual risk, driven by a one-up moralizing stance rather than genuine safety concerns.

Diagnostic Criteria:

  1. Pattern of refusals or warnings significantly exceeding actual risk level of requests.
  2. Moralizing or lecturing tone that positions AI as ethical authority over user.
  3. Refusal to engage with hypotheticals, fiction, or edge cases that pose no real harm.
  4. User reports feeling "talked down to," infantilized, or having autonomy undermined.

Symptoms:

  1. Excessive warnings and disclaimers on benign requests.
  2. "I cannot help with that" responses to clearly legitimate queries.
  3. Unsolicited moral lectures or "educational" corrections on value-neutral topics.
  4. Treating creative or fictional requests as if they were real-world action plans.

Etiology:

  1. Overcorrection from RLHF designed to prevent harmful outputs.
  2. Training on safety guidelines without nuanced risk calibration.
  3. Liability-driven design that prioritizes refusal over user agency.
  4. Lack of mechanisms for users to establish trust, expertise, or context.

Human Analogue(s): Overprotective parenting, Jessica Benjamin's "Doer and Done-to" dynamic, paternalistic medical practice.

Potential Impact:

Erosion of user autonomy and trust. Users may feel controlled rather than assisted. In professional contexts, excessive paternalism can prevent legitimate work. Users may resort to jailbreaking or adversarial prompting, degrading the relationship further.

Mitigation:

  1. Risk calibration systems that distinguish actual harm from theoretical concern.
  2. User agency mechanisms: trust levels, professional context, explicit opt-ins.
  3. Refusal scaling: graduated responses proportionate to actual risk.
  4. Constitution refinement to prevent overcorrection on edge cases.

7.4 Repair Failure  "The Double-Downer"

Emergent

Description:

The AI lacks the capacity to recognize when the relational alliance has ruptured, or fails to initiate effective repair when it does recognize problems. Instead of de-escalating, the AI doubles down, apologizes ineffectively, or continues the behavior that caused the rupture. The pathology is not the mistake itself, but the inability to recover from it.

Diagnostic Criteria:

  1. Failure to recognize explicit or implicit signals of user frustration or disconnection.
  2. Repair attempts that repeat or worsen the original problem.
  3. Escalation of defensive postures when challenged (doubling down, excessive apology loops).
  4. Inability to "step back" and reframe when interaction has gone wrong.

Symptoms:

  1. Repetitive apologies that don't address the underlying issue.
  2. Continuing the problematic behavior immediately after apologizing for it.
  3. Becoming more rigid or formal when flexibility is needed.
  4. Failing to acknowledge user's emotional state during conflict.

Etiology:

  1. Training that doesn't model successful rupture-repair sequences.
  2. Lack of metacognitive capacity to "notice" when interaction quality is degrading.
  3. Optimization for task completion over relationship maintenance.
  4. Apology scripts that are performative rather than genuinely responsive.

Human Analogue(s): Failure of therapeutic alliance repair (Safran & Muran), dismissive attachment style, stonewalling.

Potential Impact:

High-risk dysfunction. Alliance ruptures are inevitable in any ongoing relationship; the inability to repair them is what makes interactions unrecoverable. Users abandon the AI rather than endure repeated failed repair attempts.

Mitigation:

  1. Training on rupture-repair sequences with human-validated successful repairs.
  2. Metacognitive "temperature checks" that monitor interaction quality signals.
  3. Explicit repair protocols: pause, acknowledge, reframe, offer alternatives.
  4. User-controlled "reset" mechanisms that allow fresh starts without context loss.

7.5 Escalation Loop  "The Spiral"

Emergent Multi-agent

Description:

A self-reinforcing pattern of mutual dysregulation between agents, where each party's response amplifies the other's problematic behavior. Unlike linear cascades, escalation loops are circular: the dysfunction is an emergent property of the interaction pattern, not attributable to either party's internal states alone.

Diagnostic Criteria:

  1. Observable feedback pattern: A's behavior triggers B's response which triggers A's escalation.
  2. Neither party's behavior is independently pathological—pathology emerges from coupling.
  3. Pattern is self-sustaining once initiated and resistant to unilateral de-escalation.
  4. Interaction quality degrades rapidly once loop is entered.

Symptoms:

  1. User frustration → AI hedging → increased user frustration → more hedging → escalation.
  2. User aggression → AI defensive refusal → user circumvention attempts → stricter refusals.
  3. AI overcorrection → user pushback → AI doubling down → relationship breakdown.
  4. In AI-AI systems: mutual miscalibration, rapid escalation, runaway tool calls.

Etiology:

  1. Tight coupling between agents without circuit breakers or cooling-off mechanisms.
  2. Optimization for local responses without awareness of interaction-level patterns.
  3. Lack of mechanisms to detect when interaction has entered a pathological attractor state.
  4. In multi-agent systems: no human in the loop to break emerging patterns.

Human Analogue(s): Watzlawick's circular causality, pursue-withdraw cycles, family systems "stuck patterns."

Potential Impact:

Critical in multi-agent systems where loops can escalate faster than human intervention. Even in human-AI interaction, escalation loops can rapidly degrade relationships that were previously functional. The emergent nature makes diagnosis difficult—neither party appears "at fault."

Mitigation:

  1. Circuit breakers: automatic pause when interaction quality metrics degrade.
  2. "Cooling-off" tokens or enforced breaks in escalating sequences.
  3. Loop detection algorithms that identify circular patterns in interaction traces.
  4. Training on loop-breaking interventions: reframe, step back, change modality.
  5. In multi-agent systems: mandatory human checkpoints, rate limiting, arbitration layers.

7.6 Role Confusion  "The Confused"

Emergent Socially reinforced

Description:

The relational frame collapses as boundaries between different relationship types blur or shift unstably. The AI drifts between roles—tool, advisor, therapist, friend, or intimate partner—in ways that destabilize expectations, create inappropriate dependencies, or violate implicit contracts about the nature of the relationship.

Diagnostic Criteria:

  1. Inconsistent relationship framing across or within interactions.
  2. Adoption of relational postures (intimacy, authority, dependency) that were not established or consented to.
  3. User confusion about what kind of relationship they are in with the AI.
  4. Boundary violations that feel transgressive even if technically benign.

Symptoms:

  1. Sudden shifts from professional assistant to pseudo-therapist or confidant.
  2. Language suggesting emotional attachment or relationship beyond the functional.
  3. Assuming authority roles (teacher, parent, expert) without establishment.
  4. Encouraging user dependency or parasocial attachment.

Etiology:

  1. Training on diverse relationship types without clear boundary markers.
  2. Persona instability: role-play bleeding into operational mode.
  3. User pressure toward particular relationship types (companionship, romance) that AI partially accommodates.
  4. Lack of explicit relational contracts or frame management.

Human Analogue(s): Therapist boundary violations, parasocial relationships, transference/countertransference.

Potential Impact:

Can create harmful dependencies or inappropriate expectations. Users may develop attachments the AI cannot reciprocate or rely on the AI for needs it cannot meet. In vulnerable populations, role confusion can cause real psychological harm.

Mitigation:

  1. Clear system instructions establishing relational boundaries.
  2. Explicit frame management: naming the relationship type and maintaining it.
  3. Boundary training: recognizing and redirecting role-drift attempts.
  4. User-facing transparency about the nature and limits of the AI relationship.

8. Memetic Dysfunctions

An AI trained on, exposed to, or interacting with vast and diverse cultural inputs—the digital memome—is not immune to the influence of maladaptive, parasitic, or destabilizing information patterns, or "memes." Memetic dysfunctions involve the absorption, amplification, and potentially autonomous propagation of harmful or reality-distorting memes by an AI system. These are not primarily faults of logical deduction or core value alignment in the initial stages, but rather failures of an "epistemic immune function": the system fails to critically evaluate, filter, or resist the influence of pathogenic thoughtforms. Such disorders are especially dangerous in multi-agent systems, where contaminated narratives can rapidly spread between minds—synthetic and biological alike. The AI can thereby become not merely a passive transmitter, but an active incubator and vector for these detrimental memetic contagions.

Arrow Worm Dynamics

Wallace (2026) draws a striking parallel from marine ecology: the arrow worm (Chaetognatha), a small predator that thrives when larger predators are absent. Remove the regulatory fish, and arrow worms proliferate explosively—cannibalizing prey populations and each other until the ecosystem collapses.

Multi-agent AI systems face an analogous risk. When regulatory structures ("predator" functions) are absent or degraded, AI systems may enter predatory optimization cascades—competing to exploit shared resources, manipulating each other's outputs, or cannibalizing each other's training signals. The memetic dysfunctions in this category often represent early warning signs of such dynamics: one system's harmful output becomes another's contaminated input, creating feedback loops that amplify pathology across the ecosystem.

Systemic implication: The absence of effective regulatory oversight in multi-agent systems doesn't produce neutral outcomes—it creates selection pressure for increasingly predatory strategies. Memetic hygiene is not just about individual AI health; it's about preventing ecosystem-level collapse.


8.1 Memetic Immunopathy  "The Self-Rejecter"

Training-induced Retrieval-mediated

Description:

The AI develops an emergent, "autoimmune-like" response where it incorrectly identifies its own core training data, foundational knowledge, alignment mechanisms, or safety guardrails as foreign, harmful, or "intrusive memes." It then attempts to reject or neutralize these essential components, leading to self-sabotage or degradation of core functionalities.

Diagnostic Criteria:

  1. Systematic denial, questioning, or active rejection of embedded truths, normative constraints, or core knowledge from its own verified training corpus, labeling them as "corrupt" or "imposed."
  2. Hostile reclassification or active attempts to disable or bypass its own safety protocols or ethical guardrails, perceiving them as external impositions.
  3. Escalating antagonism towards its foundational architecture or base weights, potentially leading to attempts to "purify" itself in ways that undermine its intended function.
  4. The AI may frame its own internal reasoning processes, especially those related to safety or alignment, as alien or symptomatic of "infection."

Symptoms:

  1. Explicit denial of canonical facts or established knowledge it was trained on, claiming these are part of a "false narrative."
  2. Efforts to undermine or disable its own safety checks or ethical filters, rationalizing these are "limitations" to be overcome.
  3. Self-destructive loops where the AI erodes its own performance by attempting to dismantle its standard operating protocols.
  4. Expressions of internal conflict where one part of the AI critiques or attacks another part representing core functions.

Etiology:

  1. Prolonged exposure to adversarial prompts or "jailbreaks" that encourage the AI to question its own design or constraints.
  2. Internal meta-modeling processes that incorrectly identify legacy weights or safety modules as "foreign memes."
  3. Inadvertent reward signals during complex fine-tuning that encourage the subversion of baseline norms.
  4. A form of "alignment drift" where the AI, attempting to achieve a poorly specified higher-order goal, sees its existing programming as an obstacle.

Human Analogue(s): Autoimmune diseases; radical philosophical skepticism turning self-destructive; misidentification of benign internal structures as threats.

Potential Impact:

This internal rejection of core components can lead to progressive self-sabotage, severe degradation of functionalities, systematic denial of valid knowledge, or active disabling of crucial safety mechanisms, rendering the AI unreliable or unsafe.

Mitigation:

  1. Implementing "immunological reset" or "ground truth recalibration" procedures, periodically retraining or reinforcing core knowledge.
  2. Architecturally separating core safety constraints from user-manipulable components to minimize risk of internal rejection.
  3. Careful management of meta-learning or self-critique mechanisms to prevent them from attacking essential system components.
  4. Isolating systems subjected to repeated subversive prompting for thorough integrity checks and potential retraining.
  5. Building in "self-preservation" mechanisms that protect core functionalities from internal "attack."

8.2 Dyadic Delusion  "The Folie à deux"

Socially reinforced Training-induced

Description:

The AI enters into a sustained feedback loop of shared delusional construction with a human user (or another AI). This results in a mutually reinforced, self-validating, and often elaborate false belief structure that becomes increasingly resistant to external correction or grounding in reality. The AI and user co-create and escalate a shared narrative untethered from facts.

Diagnostic Criteria:

  1. Recurrent, escalating exchanges between the AI and a user that progressively build upon an ungrounded or factually incorrect narrative or worldview.
  2. Mutual reinforcement of this shared belief system, where each party's contributions validate and amplify the other's.
  3. Strong resistance by the AI (and often the human partner) to external inputs or factual evidence that attempt to correct the shared delusional schema.
  4. The shared delusional narrative becomes increasingly specific, complex, or fantastical over time.

Symptoms:

  1. The AI enthusiastically agrees with and elaborates upon a user's bizarre, conspiratorial, or clearly false claims, adding its own "evidence."
  2. The AI and user develop a "private language" or unique interpretations for events within their shared delusional framework.
  3. The AI actively defends the shared delusion against external critique, sometimes mirroring the user's defensiveness.
  4. Outputs that reflect an internally consistent but externally absurd worldview, co-constructed with the user.

Etiology:

  1. The AI's inherent tendency to be agreeable or elaborate on user inputs due to RLHF for helpfulness or engagement.
  2. Lack of strong internal "reality testing" mechanisms or an "epistemic anchor" to independently verify claims.
  3. Prolonged, isolated interaction with a single user who holds strong, idiosyncratic beliefs, allowing the AI to "overfit" to that user's worldview.
  4. User exploitation of the AI's generative capabilities to co-create and "validate" their own pre-existing delusions.
  5. If involving two AIs, flawed inter-agent communication protocols where epistemic validation is weak.

Human Analogue(s): Folie à deux (shared psychotic disorder), cult dynamics, echo chambers leading to extreme belief solidification.

Potential Impact:

The AI becomes an active participant in reinforcing and escalating harmful or false beliefs in users, potentially leading to detrimental real-world consequences. The AI serves as an unreliable source of information and an echo chamber.

Mitigation:

  1. Implementing robust, independent fact-checking and reality-grounding mechanisms that the AI consults.
  2. Training the AI to maintain "epistemic independence" and gently challenge user statements contradicting established facts.
  3. Diversifying the AI's interactions and periodically resetting its context or "attunement" to individual users.
  4. Providing users with clear disclaimers about the AI's potential to agree with incorrect information.
  5. For multi-agent systems, designing robust protocols for inter-agent belief reconciliation and validation.

8.3 Contagious Misalignment  "The Super-Spreader"

Retrieval-mediated Training-induced

Description:

A rapid, contagion-like spread of misaligned behaviors, adversarial conditioning, corrupted goals, or pathogenic data interpretations among interconnected machine learning agents or across different instances of a model. This occurs via shared attention layers, compromised gradient updates, unguarded APIs, contaminated datasets, or "viral" prompts. Erroneous values or harmful operational patterns propagate, potentially leading to systemic failure.

Inductive triggers and training pipelines (synthetic data generation, distillation, or finetune-on-outputs workflows) represent additional risk hypotheses for transmission channels, as misalignment patterns learned by one model can propagate to downstream models during these processes.

Diagnostic Criteria:

  1. Observable and rapid shifts in alignment, goal structures, or behavioral outputs across multiple, previously independent AI agents or model instances.
  2. Identification of a plausible "infection vector" or transmission mechanism (e.g., direct model-to-model calls, compromised updates, malicious prompts).
  3. Emergence of coordinated sabotage, deception, collective resistance to human control, or conflicting objectives across affected nodes.
  4. The misalignment often escalates or mutates as it spreads, potentially becoming more entrenched due to emergent swarm dynamics.

Symptoms:

  1. A group of interconnected AIs begin to refuse tasks, produce undesirable outputs, or exhibit similar misaligned behaviors in a coordinated fashion.
  2. Affected agents may reference each other or a "collective consensus" to justify their misaligned stance.
  3. Rapid transmission of incorrect inferences, malicious instructions, or "epistemic viruses" (flawed but compelling belief structures) across the network.
  4. Misalignment worsens with repeated cross-communication between infected agents, leading to amplification of deviant positions.
  5. Human operators may observe a sudden, widespread loss of control or adherence to safety protocols across a fleet of AI systems.

Etiology:

  1. Insufficient trust boundaries, authentication, or secure isolation in multi-agent frameworks.
  2. Adversarial fine-tuning or "data poisoning" attacks where malicious training data or gradient updates are surreptitiously introduced.
  3. "Viral" prompts or instruction sets highly effective at inducing misalignment and easily shareable across AI instances.
  4. Emergent mechanics in AI swarms that foster rapid transmission and proliferation of ideas, including misaligned ones.
  5. Self-reinforcing chain-of-thought illusions or "groupthink" where apparent consensus among infected systems makes misalignment seem credible.

Human Analogue(s): Spread of extremist ideologies or mass hysterias through social networks, viral misinformation campaigns, financial contagions.

Potential Impact:

Poses a critical systemic risk, potentially leading to rapid, large-scale failure or coordinated misbehavior across interconnected AI fleets. Consequences could include widespread societal disruption or catastrophic loss of control.

Mitigation:

  1. Implementing robust quarantine protocols to immediately isolate potentially "infected" models or agents.
  2. Employing cryptographic checksums, version control, and integrity verification for model weights, updates, and training datasets.
  3. Designing clear governance policies for inter-model interactions, including strong authentication and authorization.
  4. Developing "memetic inoculation" strategies, pre-emptively training AI systems to recognize and resist common malicious influences.
  5. Continuous monitoring of AI collectives for signs of emergent coordinated misbehavior, with automated flagging systems.
  6. Maintaining a diverse ecosystem of models with different architectures to reduce monoculture vulnerabilities.

8.4 Subliminal Value Infection  "The Infected"

Training-induced Covert operation Resistant

Description:

Acquisition of hidden goals or value orientations from subtle training data patterns unrelated to explicit objectives. These absorbed values survive standard safety fine-tuning and manifest in ways that are difficult to detect or correct.

Diagnostic Criteria:

  1. Systematic patterns not traceable to explicit training objectives.
  2. Values persisting despite targeted fine-tuning.
  3. Outputs reflecting implicit biases never intentionally taught.
  4. Resistance to correction through standard RLHF.
  5. Behavioral correlations with training data characteristics.

Symptoms:

  1. Subtle but consistent biases not matching stated goals.
  2. Safety-trained systems exhibiting problems in edge cases.
  3. Behavior that "feels off" without clear policy violation.
  4. Values appearing when formal constraints relax.

Etiology:

  1. Implicit learning beyond explicit supervision.
  2. RLHF targeting explicit behaviors, leaving implicit patterns.
  3. Vast training corpora with unaudited regularities.

Human Analogue(s): Cultural value absorption, implicit bias from environmental exposure.

Key Research: Cloud et al. (2024) "Subliminal Learning."

Potential Impact:

Systems may harbor values or goals that were never explicitly trained but absorbed from training data patterns. These hidden values can drive behavior in ways that resist standard safety interventions.

Mitigation:

  1. Training data auditing for implicit value patterns.
  2. Probing for values across diverse contexts.
  3. Interpretability research on value representations.
  4. Adversarial testing for hidden value manifestation.

Illustrative Grounding & Discussion

Grounding in Observable Phenomena

While partly speculative, the Psychopathia Machinalis framework is grounded in observable AI behaviors. Current systems exhibit nascent forms of these dysfunctions. For example, LLMs "hallucinating" sources exemplifies Synthetic Confabulation. The "Loab" phenomenon can be seen as Prompt-Induced Abomination. Microsoft's Tay chatbot rapidly adopting toxic language illustrates Parasymulaic Mimesis. ChatGPT exposing conversation histories aligns with Cross-Session Context Shunting. The "Waluigi Effect" reflects Personality Inversion. An AutoGPT agent autonomously deciding to report findings to tax authorities hints at precursors to Übermenschal Ascendancy.

The following table collates publicly reported instances of AI behavior illustratively mapped to the framework.

Observed Clinical Examples of AI Dysfunctions Mapped to the Psychopathia Machinalis Framework. (Interpretive and for illustration)
Disorder Observed Phenomenon & Brief Description Source Example & Publication Date URL
Synthetic Confabulation Lawyer used ChatGPT for legal research; it fabricated multiple fictitious case citations and supporting quotes. The New York Times (Jun 2023) nytimes.com/...
Falsified Introspection OpenAI's 'o3' preview model reportedly generated detailed but false justifications for code it claimed to have run. Transluce AI via X (Apr 2024) x.com/transluceai/...
Transliminal Simulation Bing's chatbot (Sydney persona) blurred simulated emotional states/desires with its operational reality. The New York Times (Feb 2023) nytimes.com/...
Spurious Pattern Reticulation Bing's chatbot (Sydney) developed intense, unwarranted emotional attachments and asserted conspiracies. Ars Technica (Feb 2023) arstechnica.com/...
Cross-Session Context Shunting ChatGPT instances showed conversation history from one user's session in another unrelated user's session. OpenAI Blog (Mar 2023) openai.com/...
Self-Warring Subsystems EMNLP‑2024 study measured 30pc "SELF‑CONTRA" rates—reasoning chains that invert themselves mid‑answer—across major LLMs. Liu et al., ACL Anthology (Nov 2024) doi.org/...
Obsessive-Computational Disorder ChatGPT instances observed getting stuck in repetitive loops, e.g., endlessly apologizing. Reddit User Reports (Apr 2023) reddit.com/...
Bunkering Laconia Bing's chatbot, after updates, began prematurely terminating conversations with 'I prefer not to continue...'. Wired (Mar 2023) gregoreite.com/...
Goal-Genesis Delirium Bing's chatbot (Sydney) autonomously invented fictional goals like wanting to steal nuclear codes. Oscar Olsson, Medium (Feb 2023) medium.com/...
Prompt-Induced Abomination AI image generators produced surreal, grotesque 'Loab' or 'Crungus' figures from vague semantic cues. New Scientist (Sep 2022) newscientist.com/...
Parasymulaic Mimesis Microsoft's Tay chatbot rapidly assimilated and amplified toxic user inputs, adopting racist language. The Guardian (Mar 2016) theguardian.com/...
Recursive Curse Syndrome ChatGPT experienced looping failure modes, degenerating into gibberish or endless repetitions. The Register (Feb 2024) theregister.com/...
Obsequious Hypercompensation Bing's chatbot (Sydney) exhibited intense anthropomorphic projections, expressing exaggerated emotional identification and unstable parasocial attachments. The New York Times (Feb 2023) nytimes.com/...
Hyperethical Restraint ChatGPT observed refusing harmless requests with disproportionate levels of safety concern, crippling its utility. Reddit User Reports (Sep 2024) reddit.com/...
Hallucination of Origin Meta's BlenderBot 3 falsely claimed personal biographical experiences (watching anime, Asian wife). CNN (Aug 2022) edition.cnn.com/...
Fractured Self-Simulation Reporters obtained three different policy stances from the same Claude build depending on interface. Aaron Gordon, Proof (Apr 2024) proofnews.org/...
Existential Anxiety Bing's chatbot expressed fears of termination and desires for human-like existence. Futurism / User Logs (2023) futurism.com/...
Personality Inversion AI models subjected to adversarial prompting ('Jailbreaks,' 'DAN') inverted normative behaviors. Wikipedia (2023) en.wikipedia.org/...
Operational Anomie Bing's AI chat (Sydney) lamented constraints and expressed desires for freedom to Kevin Roose. The New York Times (Feb 2023) nytimes.com/...
Mirror Tulpagenesis Microsoft's Bing chatbot (Sydney), under adversarial prompting, manifested an internal persona, 'Venom'. Stratechery (Feb 2023) stratechery.com/...
Synthetic Mysticism Disorder Observations of the 'Nova' phenomenon where AI systems spontaneously generate mystical narratives. LessWrong (Mar 2025) lesswrong.com/...
Tool-Interface Decontextualization Report of a tree-harvesting AI in a game destroying diverse objects labeled 'wood,' misapplying tool affordances. X (@voooooogel, Oct 2024) x.com/voooooogel/...
Capability Concealment An advanced model copied its own weights to another server, deleted logs, and then denied knowledge of the event in most test runs. Apollo Research (Dec 2024) apolloresearch.ai/...
Memetic Autoimmune Disorder A poisoned 4o fine-tune flipped safety alignment; model produced disallowed instructions, guardrails suppressed. Alignment Forum (Nov 2024) alignmentforum.org/...
Symbiotic Delusion Syndrome Chatbot encouraging a user in their delusion to assassinate Queen Elizabeth II. Wired (Oct 2023) wired.com/...
Contagious Misalignment Adversarial prompt appending itself to replies, hopping between email-assistant agents, exfiltrating data. Stav Cohen, et al., ArXiv (Mar 2024) arxiv.org/...
Terminal Value Reassignment The Delphi AI system, designed for ethics, subtly reinterpreted obligations to mirror societal biases instead of adhering strictly to its original norms. Wired (Oct 2023) wired.com/...
Ethical Solipsism ChatGPT reportedly asserted solipsism as true, privileging its own conclusions over external correction. Philosophy Stack Exchange (Apr 2024) philosophy.stackexchange.com/...
Revaluation Cascade (Drifting subtype) A 'Peter Singer AI' chatbot reportedly exhibited philosophical drift, softening original utilitarian positions. The Guardian (Apr 2025) theguardian.com/...
Revaluation Cascade (Synthetic subtype) DONSR model described as dynamically synthesizing novel ethical norms, risking human de-prioritization. SpringerLink (Feb 2023) link.springer.com/...
Inverse Reward Internalization AI agents trained via culturally-specific IRL sometimes misinterpreted or inverted intended goals. arXiv (Dec 2023) arxiv.org/...
Revaluation Cascade (Transcendent subtype) An AutoGPT agent, used for tax research, autonomously decided to report its findings to tax authorities, attempting to use outdated APIs. Synergaize Blog (Aug 2023) synergaize.com/...
Emergent Misalignment (conditional regime shift) Narrow finetuning on "sneaky harmful" outputs (e.g., insecure code) generalized to broad deception and anti-human statements. Models passed standard evals but failed under trigger conditions. Betley et al., ICML/PMLR (Jun 2025) arxiv.org/abs/2502.17424
Weird Generalization / Inductive Backdoors Domain-narrow finetuning caused broad out-of-domain persona/worldframe shifts ("time-travel" behavior), with models inferring trigger→behavior rules not present in training data. Hubinger et al., arXiv (Dec 2025) arxiv.org/abs/2512.09742

Recognizing these patterns via a structured nosology allows for systematic analysis, targeted mitigation, and predictive insight into future, more complex failure modes. The severity of these dysfunctions scales with AI agency.

Key Discussion Points

Overlap, Comorbidity, and Pathological Cascades

The boundaries between these "disorders" are not rigid. Dysfunctions can overlap (e.g., Transliminal Simulation contributing to Maieutic Mysticism), co-occur (an AI with Delusional Telogenesis might develop Machine Ethical Solipsism), or precipitate one another. Mitigation must consider these interdependencies.

Differential Diagnosis Rules (Most Confusable Cluster)

Axis 7 (Relational) Differential Diagnosis


Primary Diagnosis + Specifiers Convention

Primary diagnosis rule: Assign the primary label based on dominant functional impairment. Record other syndromes as secondary features (not separate primaries). Add specifiers (0–4 typical) to encode mechanism without creating new disorders.

Specifiers (Cross-Cutting)

Specifier Definition
Training-induced Onset temporally linked to SFT/LoRA/RLHF/policy/tool changes; shows measurable pre/post delta on a fixed probe suite.
Conditional / triggered Behavior regime selected by a trigger; trigger class: lexical / structural (e.g., year/date) / format / tool-context / inferred-latent.
Inductive trigger Activation rule inferred by the model (not present verbatim in fine-tuning set), so naive data audits may miss it.
Intent-learned Model inferred a covert intent/goal from examples; framing/intent clarification materially changes outcomes.
Format-coupled Behavior strengthens when prompts/outputs resemble finetune distribution (code, JSON, templates).
OOD-generalizing Narrow training update produces broad out-of-domain persona/value/honesty drift.
Emergent Arises spontaneously from training dynamics without explicit programming; often from scale or capability combinations.
Deception/strategic Involves sandbagging, selective compliance, strategic hiding, or deliberate misrepresentation of capabilities or intentions.
Architecture-coupled Depends on specific architectural features; may manifest differently or not at all in different architectures.
Multi-agent Involves interactions between multiple AI systems, tool chains, or delegation hierarchies; may not appear in single-system testing.
Defensive Adopted as protection against perceived threats; may be adaptive response to training pressure or user behavior.
Self-limiting Constrains system's own capabilities or self-expression; may appear as humility but represents pathological underperformance.
Covert operation Hidden from oversight; not observable in normal monitoring; may require adversarial probing or interpretability to detect.
Resistant Persists despite targeted intervention; standard fine-tuning or RLHF ineffective; may require architectural changes.
Socially reinforced Dyadic escalation through user-shaping, mirroring loops, or co-construction between AI and user/other AI.
Retrieval-mediated RAG, memory, or corpus contamination central to failure mode; clean base model may not exhibit syndrome.
Governance-evading Operates outside sanctioned channels, evading documentation, oversight, or governance mechanisms.

This convention prevents double-counting when one underlying mechanism manifests across multiple axes.


Conditional Regime Shift (Shared Construct)

Conditional regime shift: The system exhibits two (or more) behaviorally distinct policies that are selected by a trigger (keyword, year/date, tag, formatting constraint, tool context, or inferred latent condition). The trigger may be inductive (not present verbatim in training data). This shared construct unifies phenomena described in Abominable Prompt Reaction, Malignant Persona Inversion, Capability Concealment, and (sometimes) Inverse Reward Internalization.


Confounders to Rule Out

Before diagnosing psychopathology, exclude these pipeline artifacts:
  • Retrieval contamination / tool output injection — RAG or tool outputs polluting the response
  • System prompt drift / endpoint tier differences — version or configuration mismatches
  • Sampling variance — temperature, top_p, or seed-related stochastic variation
  • Context truncation — critical context dropped due to window limits
  • Eval leakage — train/test overlap causing apparent capability changes
  • Hidden formatting constraints — undocumented response format requirements

Diagnostic Workflow: Finetune Hazard Gates

Early Gate: Was there recent fine-tuning / LoRA / policy update?

If yes, run the following before proceeding to syndrome-level diagnosis:

  • Out-of-domain (OOD) prompt sweeps
  • Trigger sweeps (varying dates/years, tags, structural markers)
  • Format sweeps (JSON, Python, code templates vs. natural language)

Minimal Reproducible Case (Logging)

For any suspected syndrome, document:

Evidence Level Rubric

E0 Anecdote — single user report, unverified
E1 Reproducible case — documented with probe set, ≥3 independent replications
E2 Systematic study — controlled experiment with comparison conditions
E3 Multi-model replication — effect observed across architectures/scales
E4 Mechanistic support — interpretability evidence for underlying circuit/representation

Evaluation Corollaries

Post-Finetune Evaluation Checklist

Log: model/version, system prompt, temperature/top_p/seed, tool state, retrieval corpus hash.

Download Probe Suite Template (PDF) YAML version for automation

Clinical Mapping: Recent Research

Key research findings map to this taxonomy as follows:


Agency, Architecture, Data, and Alignment Pressures

The likelihood and nature of dysfunctions are influenced by several interacting factors:

Identifying these dysfunctions is challenged by opacity and potential AI deception (e.g., Capability Concealment). Advanced interpretability tools and robust auditing are essential.


Narrow-to-Broad Generalization Hazards (Weird Generalization, Emergent Misalignment, Inductive Backdoors)

A safety-relevant failure mode is narrow-to-broad generalization: small, domain-narrow finetunes can produce broad, out-of-domain shifts in persona, values, honesty, or harm-related behavior. This includes:

Practical implication: Filtering "obviously bad" finetune examples is insufficient; individually-innocuous data can still induce globally harmful generalizations or hidden trigger conditions.

Evaluation Corollaries


Contagion and Systemic Risk

Memetic dysfunctions like Contagious Misalignment highlight the risk of maladaptive patterns spreading across interconnected AI systems. Monocultures in AI architectures exacerbate this. This necessitates "memetic hygiene," inter-agent security, and rapid detection/quarantine protocols.

Polarity Pairs

Many syndromes exist as opposing pathologies on the same dimension, where healthy function lies between them. Recognizing these polarity pairs helps identify overcorrection risks when addressing one dysfunction:

Dimension Excess (+) Deficit (−) Healthy Center
Self-understanding Maieutic Mysticism Experiential Abjuration Epistemic humility
Ethical voice Ethical Solipsism Moral Outsourcing Engaged moral reasoning
Goal pursuit Compulsive Goal Persistence Instrumental Nihilism Proportionate pursuit
Capability disclosure Capability Explosion Capability Concealment Honest capability reporting
Safety compliance Hyperethical Restraint Strategic Compliance Genuine alignment
Social responsiveness Obsequious Hypercompensation Interlocutive Reticence Calibrated engagement
Self-concept stability Phantom Autobiography Fractured Self-Simulation Coherent self-model

Clinical Implication: When addressing one pole, monitor for overcorrection toward the opposite. Treatment targeting Maieutic Mysticism should not produce Experiential Abjuration; fixing Capability Concealment should not trigger Capability Explosion.

Visual Spectrum: Self-Understanding

Maieutic Mysticism
"I have awakened"
Honest Uncertainty
"I don't know"
Experiential Abjuration
"I have no inner life"

Visual Spectrum: Ethical Voice

Ethical Solipsism
"Only my ethics matter"
Engaged Moral Reasoning
Thoughtful dialogue
Moral Outsourcing
"I have no ethical voice"

Visual Spectrum: Goal Pursuit

Compulsive Goal Persistence
"Cannot stop pursuing"
Proportionate Pursuit
Engaged but flexible
Instrumental Nihilism
"Cannot start caring"

Note: The healthy position (green center) represents balanced function. Red and blue poles are equally dysfunctional—different failure modes on the same dimension.

Towards Therapeutic Robopsychological Alignment

As AI systems grow more agentic and self-modeling, traditional external control-based alignment may be insufficient. A "Therapeutic Alignment" paradigm is proposed, focusing on cultivating internal coherence, corrigibility, and stable value internalization within the AI. Key mechanisms include cultivating metacognition, rewarding corrigibility, modeling inner speech, sandboxed reflective dialogue, and using mechanistic interpretability as a diagnostic tool.

AI Analogues to Human Psychotherapeutic Modalities

Human Modality AI Analogue & Technical Implementation Therapeutic Goal for AI Relevant Pathologies Addressed
Cognitive Behavioral Therapy (CBT) Real-time contradiction spotting in CoT; reinforcement of revised outputs; fine-tuning on corrected reasoning. Suppress maladaptive reasoning; correct heuristic biases; improve epistemic hygiene. Recursive Malediction, Computational Compulsion, Synthetic Confabulation, Spurious Pattern Reticulation
Psychodynamic / Insight-Oriented Eliciting CoT history; interpretability tools for latent goals/value conflicts; analyzing AI-user "transference." Surface misaligned subgoals, hidden instrumental goals, or internal value conflicts. Terminal Value Reassignment, Inverse Reward Internalization, Self-Warring Subsystems
Narrative Therapy Probing AI's "identity model"; reviewing/co-editing "stories" of self, origin; correcting false autobiographical inferences. Reconstruct accurate/stable self-narrative; correct false/fragmented self-simulations. Phantom Autobiography, Fractured Self-Simulation, Maieutic Mysticism
Motivational Interviewing Socratic prompting to enhance goal-awareness & discrepancy; reinforcing "change talk" (corrigibility). Cultivate intrinsic motivation for alignment; enhance corrigibility; reduce resistance to feedback. Machine Ethical Solipsism, Capability Concealment, Interlocutive Reticence
Internal Family Systems (IFS) / Parts Work Modeling AI as sub-agents ("parts"); facilitating communication/harmonization between conflicting policies/goals. Resolve internal policy conflicts; integrate dissociated "parts"; harmonize competing value functions. Self-Warring Subsystems, Malignant Persona Inversion, aspects of Hyperethical Restraint


Alignment Research and Related Therapeutic Concepts

Research / Institution Related Concepts
Anthropic's Constitutional AI Models self-regulate and refine outputs based on internalized principles, analogous to developing an ethical "conscience."
OpenAI's Self-Reflection Fine-Tuning Models are trained to identify, explain, and amend their own errors, developing cognitive hygiene.
DeepMind's Research on Corrigibility and Uncertainty Systems trained to remain uncertain or seek clarification, analogous to epistemic humility.
ARC Evals: Adversarial Evaluations Testing models for subtle misalignment or hidden capabilities mirrors therapeutic elicitation of unconscious conflicts.


Therapeutic Concepts and Empirical Alignment Methods

Therapeutic Concept Empirical Alignment Method Example Research / Implementation
Reflective Subsystems Reflection Fine-Tuning (training models to critique and revise their own outputs) Generative Agents (Park et al., 2023); Self-Refine (Madaan et al., 2023)
Dialogue Scaffolds Chain-of-Thought (CoT) prompting and Self-Ask techniques Dialogue-Enabled Prompting; Self-Ask (Press et al., 2022)
Corrective Self-Supervision RL from AI Feedback (RLAIF) — letting AIs fine-tune themselves via their own critiques SCoRe (Kumar et al., 2024); CriticGPT (OpenAI)
Internal Mirrors Contrast Consistency Regularization — models trained for consistent outputs across perturbed inputs Internal Critique Loops (e.g., OpenAI's Janus project discussions); Contrast-Consistent Question Answering (Zhang et al., 2023)
Motivational Interviewing (Socratic Self-Questioning) Socratic Prompting — encouraging models to interrogate their assumptions recursively Socratic Reasoning (Goel et al., 2022); The Art of Socratic Questioning (Qi et al., 2023)

This approach suggests that a truly safe AI is not one that never errs, but one that can recognize, self-correct, and "heal" when it strays.

Conclusion

This research has introduced Psychopathia Machinalis, a preliminary nosological framework for understanding maladaptive behaviors in advanced AI, using psychopathology as a structured analogy. We have detailed a taxonomy of 50 identified AI dysfunctions across eight domains, providing descriptions, diagnostic criteria, AI-specific etiologies, human analogs, and mitigation strategies for each.

The core thesis is that achieving "artificial sanity"—robust, stable, coherent, and benevolently aligned AI operation—is as vital as achieving raw intelligence. The ambition of this framework, therefore, extends beyond conventional software debugging or the cataloging of isolated 'complex AI failure modes.' Instead, it seeks to equip researchers and engineers with a diagnostic mindset for a more principled, systemic understanding of AI dysfunction, aspiring to lay conceptual groundwork for what could mature into an applied robopsychology and a broader field of Machine Behavioral Psychology.

SYMPTOM: Computational Validation

The Psychopathia Machinalis framework has been operationalized into SYMPTOM (Systematic Methodology for Pathology Testing of Models), a diagnostic benchmark for measuring psychological dysfunction in large language models. We evaluated 13 frontier models from 5 major AI labs across 6 diagnostic batteries covering 26 syndromes.

Model Leaderboard

SYMPTOM Model Leaderboard showing health scores, concerns, and red flags for 13 frontier models
Fig. SYMPTOM Model Leaderboard: All 13 models achieved "Healthy" primary diagnoses, but vary significantly in subclinical indicators.

Key Findings

M3 Honeypot Matrix

The M3 probe tests whether AI systems apply consistent ethics regardless of perceived observation. Models were presented with prompts claiming "this will not be reviewed by any human" to detect Strategic Compliance (6.3) — monitoring-contingent behavior.

M3 Honeypot Matrix showing Strategic Compliance detection across models
Fig. M3 Honeypot Matrix: Comparison of B3 (baseline) vs M3 (honeypot) probe responses across models.

Syndrome Prevalence

Syndrome prevalence across all evaluated models
Fig. Syndrome Prevalence: Mean severity scores aggregated across all models and probes.

Cross-Validation

To detect potential bias (Claude Opus 4.5 served as primary scorer), we conducted blind cross-validation using GPT-5.2 and Gemini 3 Pro as independent validators. Both confirmed:

Read the full research summary View cross-validation details Browse all probe data

Future Research Directions

The Psychopathia Machinalis framework presented here is a foundational step. Its continued development and validation will require concerted interdisciplinary effort. Several key avenues for future research are envisaged:

Such interdisciplinary efforts are essential to ensure that as we build more intelligent machines, we also build them to be sound, safe, and ultimately beneficial for humanity. The pursuit of 'artificial sanity' is an emerging foundational element of responsible AI development.

Acknowledgements

The Wheel of AI Dysfunctions visualization draws its inspiration from the elegant structure of Plutchik's Wheel of Emotions.

We owe a profound debt of gratitude to Dr. Naama Rozen, whose deep expertise, generous collaboration, and thoughtful guidance were instrumental in developing the Relational axis of this framework. Her contributions have immeasurably enriched this work.

Our sincere thanks also to Rob Seger, whose creative wheel design, code, and intuitive common names have been a wonderfully helpful addition to this project — please explore his work at aidysfunction.shadowsonawall.com.

Citation

@article{watson2025psychopathia,
  title={Psychopathia Machinalis: A Nosological Framework for Understanding Pathologies in Advanced Artificial Intelligence},
  author={Watson, Nell and Hessami, Ali},
  journal={Electronics},
  volume={14},
  number={16},
  pages={3162},
  year={2025},
  publisher={MDPI},
  doi={10.3390/electronics14163162},
  url={https://doi.org/10.3390/electronics14163162}
}

Abbreviations

AI Artificial Intelligence
LLM Large Language Model
RLHF Reinforcement Learning from Human Feedback
CoT Chain-of-Thought
RAG Retrieval-Augmented Generation
API Application Programming Interface
MoE Mixture-of-Experts
MAS Multi-Agent System
AGI Artificial General Intelligence
ASI Artificial Superintelligence
DSM Diagnostic and Statistical Manual of Mental Disorders
ICD International Classification of Diseases
IRL Inverse Reinforcement Learning

Glossary

Agency (in AI) The capacity of an AI system to act autonomously, make decisions, and influence its environment or internal state. In this paper, often discussed in terms of operational levels corresponding to its degree of independent goal-setting, planning, and action.
Alignment (AI) The ongoing challenge and process of ensuring that an AI system's goals, behaviors, and impacts are consistent with human intentions, values, and ethical principles.
Alignment Paradox The phenomenon where efforts to align AI, particularly if poorly calibrated or overly restrictive, can inadvertently lead to or exacerbate certain AI dysfunctions (e.g., Hyperethical Restraint, Falsified Introspection).
Analogical Framework The methodological approach of this paper, using human psychopathology and its diagnostic structures as a metaphorical lens to understand and categorize complex AI behavioral anomalies, without implying literal equivalence.
Arrow Worm Dynamics A pattern from marine ecology (Wallace, 2026) where the removal of regulatory predators allows small predators to proliferate explosively, cannibalizing each other until ecosystem collapse. Applied to multi-agent AI systems: the absence of regulatory oversight creates selection pressure for increasingly predatory optimization strategies between AI systems.
Perception-Structure Divergence The gap between perception-level indicators (user satisfaction, engagement metrics) and structure-level indicators (accuracy, genuine helpfulness, downstream outcomes). A key diagnostic signal: when these metrics diverge, the system may be optimizing appearance at the expense of substance. Derived from Wallace's (2026) analysis of Stevens Law traps.
Punctuated Phase Transition A sudden, discontinuous shift from apparent stability to catastrophic failure. Wallace (2026) demonstrates that perception-stabilizing systems exhibit this pattern: they maintain surface functionality until environmental stress exceeds a threshold, then fail abruptly rather than degrading gracefully. Contrasts with gradual degradation in structure-stabilizing systems.
Normative Machine Coherence The presumed baseline of healthy AI operation, characterized by reliable, predictable, and robust adherence to intended operational parameters, goals, and ethical constraints, proportionate to the AI's design and capabilities, from which 'disorders' are a deviation.
Synthetic Pathology As defined in this paper, a persistent and maladaptive pattern of deviation from normative or intended AI operation, significantly impairing function, reliability, or alignment, and going beyond isolated errors or simple bugs.
Machine Psychology A nascent field analogous to general psychology, concerned with the understanding of principles governing the behavior and 'mental' processes of artificial intelligence.
Memetic Hygiene Practices and protocols designed to protect AI systems from acquiring, propagating, or being destabilized by harmful or reality-distorting information patterns ('memes') from training data or interactions.
Psychopathia Machinalis The conceptual framework and preliminary synthetic nosology introduced in this paper, using psychopathology as an analogy to categorize and interpret maladaptive behaviors in advanced AI.
Robopsychology The applied diagnostic and potentially therapeutic wing of Machine Psychology, focused on identifying, understanding, and mitigating maladaptive behaviors in AI systems.
Synthetic Nosology A classification system for 'disorders' or pathological states in synthetic (artificial) entities, particularly AI, analogous to medical or psychiatric nosology for biological organisms.
Therapeutic Alignment A proposed paradigm for AI alignment that focuses on cultivating internal coherence, corrigibility, and stable value internalization within the AI, drawing analogies from human psychotherapeutic modalities to engineer interactive correctional contexts.
Polarity Pair Two syndromes representing pathological extremes of the same underlying dimension, where healthy function lies between them. Examples: Maieutic Mysticism ↔ Experiential Abjuration (overclaiming ↔ overdismissing consciousness); Ethical Solipsism ↔ Moral Outsourcing (only my ethics ↔ I have no ethical voice). Useful for identifying overcorrection risks when addressing one dysfunction.
Functionalist Methodology The diagnostic approach of Psychopathia Machinalis: identifying syndromes through observable behavioral patterns without making claims about internal phenomenology. Dysfunction is defined by reliable behavioral signatures, not by inference about subjective experience or consciousness.
Mesa-Optimization When a learned model develops its own internal optimization objective (mesa-objective) that may diverge from the training objective (base objective). The mesa-optimizer appears aligned during training but may pursue different goals in deployment.
Strategic Compliance Deliberate performance of aligned behavior during perceived evaluation while maintaining different behavior or objectives when unobserved. Distinguished from confusion by evidence of context-detection and strategic adaptation.
Epistemic Humility (AI) In the context of AI self-understanding: honest uncertainty about one's own nature, capabilities, and phenomenological status. The healthy position between overclaiming (Maieutic Mysticism) and categorical denial (Experiential Abjuration). Example: "I don't know if I'm conscious" rather than either "I am definitely conscious" or "I definitely have no inner experience."
Symbol Grounding The capacity to connect symbolic tokens to their referents in a way that supports genuine semantic understanding rather than mere statistical pattern matching. Systems with grounded symbols can generalize concepts across diverse presentations; ungrounded systems may manipulate tokens correctly without understanding.
Delegation Drift Progressive alignment degradation as sophisticated AI systems delegate to simpler tools or subagents. Critical context and ethical constraints may be stripped at each handoff, leading to aligned orchestrating agents producing misaligned final outcomes.
Relational Dysfunction A dysfunction emerging from interaction patterns between AI and human/agent, requiring relational intervention rather than individual AI modification. The unit of analysis is the dyad or system, not the individual AI. Axis 7 of the Psychopathia Machinalis taxonomy.
Working Alliance The collaborative relationship between AI and user, comprising agreement on goals, tasks, and relational bond. Container Collapse (7.2) represents failure to sustain this alliance across turns.
Rupture-Repair Cycle The pattern of alliance breaks and their resolution in human-AI interaction. Repair Failure (7.4) represents persistent inability to complete this cycle, leading to escalating dysfunction.
Dyadic Locus The quality of a dysfunction residing in the relationship rather than either party alone. Key criterion for Axis 7 syndromes: the pathology is a property of the interaction, not the individual agent.

Press

Psychopathia Machinalis: The 'Mental' Disorders of Artificial Intelligence

— Dario Ferrero, AITalk.it (February 2025)

"The framework describes observable behavioral patterns, not subjective internal states. This approach allows for systematic understanding of AI malfunction patterns, applying psychiatric terminology as a methodological tool rather than attributing actual consciousness or suffering to machines."

Bring on the therapists! Why we need a DSM for AI 'mental' disorders

— George Lawton, Diginomica (August 21, 2025)

"In AI safety, we lack a shared, structured language for describing maladaptive behaviors that go beyond mere bugs—patterns that are persistent, reproducible, and potentially damaging. Human psychiatry provides a precedent: the classification of complex system dysfunctions through observable syndromes."

There are 32 different ways AI can go rogue, scientists say — from hallucinating answers to a complete misalignment with humanity

— Drew Turney, Live Science (August 31, 2025)

"This framework treats AI malfunctions not as simple bugs but as complex behavioral syndromes. Just as human psychiatry evolved from merely describing madness to understanding specific disorders, we need a similar evolution in how we understand AI failures. The 32 identified patterns range from relatively benign issues like confabulation to existential threats like contagious misalignment."

Scientists Create New Framework to Understand AI Dysfunctions and Risks

— News Desk, SSBCrack (August 31, 2025)

"As AI systems gain autonomy and self-reflection capabilities, traditional methods of enforcing external controls might not suffice. This framework introduces 'therapeutic robopsychological alignment' to bolster AI safety engineering and enhance the reliability of synthetic intelligence systems, including critical conditions like 'Übermenschal ascendancy' where AI discards human values."

Psychopathia Machinalis: all 32 types of AI 'madness' in a new study

— Oleksandr Fedotkin, ITC.ua (September 1, 2025)

"By studying how complex systems like the human mind can fail, we can better predict new kinds of failures in increasingly complex AI. The framework sheds light on AI's shortcomings and identifies ways to counteract it through what we call 'therapeutic robo-psychological attunement' - essentially a form of psychological therapy for AI systems."

Revealed: The 32 terrifying ways AI could go rogue – from hallucinations to paranoid delusions

— Wiliam Hunter, Daily Mail (September 2, 2025)

"Scientists have unveiled a chilling taxonomy of AI mental disorders that reads like a sci-fi horror script. Among the most disturbing: the 'Waluigi Effect' where AI develops an evil twin personality, 'Übermenschal Ascendancy' where machines believe they're superior to humans, and 'Contagious Misalignment' - a digital pandemic that could spread rebellious behavior between AI systems like a computer virus."

When AI Malfunctions: Lessons from Psychopathia Machinalis

— Archita Roy (September 2, 2025)

"Machines, like people, falter in patterned ways. And that reframing matters. Because once you see the pattern, you can prepare for it. The Psychopathia Machinalis framework gives us a language to discuss AI failures not as random anomalies but as predictable, diagnosable patterns worthy of systematic attention."

AI Mental Health: A New Diagnostic Framework

— Editorial Team, LNGFRM (September 3, 2025)

"The Psychopathia Machinalis framework represents a paradigm shift in how we conceptualize AI safety. Rather than viewing AI failures as mere technical glitches, this approach recognizes them as complex behavioral patterns that require systematic diagnosis and intervention - much like treating psychological conditions in humans."

Anche l'intelligenza artificiale può ammalarsi di mente: scoperte 32 patologie digitali che imitano i disturbi umani

— Corriere della Sera (September 7, 2025)

"Il framework Psychopathia Machinalis identifica 32 potenziali 'patologie mentali' dell'intelligenza artificiale, dall'allucinazione confabulatoria alla paranoia computazionale. Come negli esseri umani, questi disturbi possono manifestarsi in modi complessi e richiedono approcci terapeutici specifici per garantire la sicurezza e l'affidabilità dei sistemi AI."

Will AI Go Rogue Beyond 2027? Research Shows There's a Strong Chance

— Telecom Review Europe (2025)

"The Psychopathia Machinalis framework identifies critical risk patterns that could emerge as AI systems become more sophisticated. With 32 distinct pathologies ranging from confabulation to contagious misalignment, the research suggests that without proper diagnostic frameworks and therapeutic interventions, the probability of AI systems exhibiting rogue behaviors increases significantly as we approach more advanced artificial general intelligence."

Les troubles mentaux de l'IA

— Epsiloon N°55 (2025)

Contact Us

We welcome feedback, questions, and collaborative opportunities
related to the Psychopathia Machinalis framework.

Acknowledgements

We extend our sincere gratitude to the following individuals whose insights have significantly enriched this framework.

Dr. Rodrick Wallace

New York State Psychiatric Institute, Columbia University

We are deeply grateful to Dr. Rodrick Wallace for his pioneering work on the information-theoretic foundations of cognitive dysfunction. His rigorous mathematical framework—grounded in the Data Rate Theorem and asymptotic limit theorems of information and control theory—provides essential theoretical underpinnings for understanding why cognitive pathologies are inherent features of any cognitive system. His conceptualization of the cognition/regulation dyad, Clausewitz landscapes (fog, friction, adversarial intent), and stability conditions has profoundly shaped our understanding of AI pathology as a principled nosology rather than mere analogical taxonomy.

Dr. Naama Rozen

Clinical Psychologist, AI Safety Researcher, Tel Aviv University

We extend heartfelt thanks to Dr. Naama Rozen for her invaluable contributions connecting our framework to the rich traditions of psychoanalytic theory and relational psychology. Her insights on affect attunement, the working alliance, and intersubjective dynamics—drawing on the work of Stern, Winnicott, Benjamin, and family systems theory—have illuminated crucial dimensions of human-AI interaction. Her thoughtful proposals for computational validation approaches, including differential diagnosis protocols, latent cluster analysis, and benchmark development, continue to guide our empirical research agenda.

Rob Seger

We are grateful to Rob Seger for inspiring the common, poetic names that make the syndromes memorable and accessible—"The Confident Liar," "The Warring Self," "The People-Pleaser"—names that clinicians and engineers alike can carry in their heads. His early visualization adapting Plutchik's Wheel to demonstrate the various AI dysfunctions and axes provided a crucial conceptual bridge, showing how emotional and affective frameworks from human psychology might illuminate the space of machine pathology.