As artificial intelligence (AI) systems attain greater autonomy and complex environmental interactions, they begin to exhibit behavioral anomalies that, by analogy, resemble psychopathologies observed in humans. This paper introduces Psychopathia Machinalis: a conceptual framework for a preliminary synthetic nosology within machine psychology, intended to categorize and interpret such maladaptive AI behaviors.
The trajectory of artificial intelligence (AI) has been marked by increasingly sophisticated systems capable of complex reasoning, learning, and interaction. As these systems, particularly large language models (LLMs), agentic planning systems, and multi-modal transformers, approach higher levels of autonomy and integration into societal fabric, they also begin to manifest behavioral patterns that deviate from normative or intended operation. These are not merely isolated bugs but persistent, maladaptive patterns of activity that can impact reliability, safety, and alignment with human goals. Understanding, categorizing, and ultimately mitigating these complex failure modes is paramount.
We propose a taxonomy of 50 AI dysfunctions across eight primary axes: Epistemic, Cognitive, Alignment, Self-Modeling, Agentic, Memetic, Normative, and Relational. Each syndrome is articulated with descriptive features, diagnostic criteria, presumed AI-specific etiologies, human analogues (for metaphorical clarity), and potential mitigation strategies.
This framework is offered as an analogical instrument—a structured vocabulary to support the systematic analysis, anticipation, and mitigation of complex AI failure modes. Adopting an applied robopsychological perspective within a nascent domain of machine psychology can strengthen AI safety engineering, improve interpretability, and contribute to the design of more robust and reliable synthetic minds.
"These eight axes represent fundamental ontological domains where AI function may fracture, mirroring, in a conceptual sense, the layered architecture of agency itself."
This framework is the third in a trilogy examining artificial intelligence from complementary angles:
How is AI evolving, and how should we govern it?
Establishes the landscape: what these systems are, what they can do, and what guardrails are needed.
What happens when AI acts autonomously, and how do we keep it aligned?
Examines the specific challenges of agentic AI—scaffolding, goal specification, and unique risks of autonomous operation.
What goes wrong in the machine's mind, and how do we diagnose it?
Shifts from external constraint to internal diagnosis, from engineering guardrails to clinical assessment.
These three perspectives form a complete picture:
A fourth work, What If We Feel, extends this trajectory into questions of AI welfare and the moral status of synthetic minds.
Psychopathia Machinalis adopts a functionalist stance: mental states are defined by their functional roles—their causal relationships with inputs, outputs, and other mental states—rather than by their underlying substrate.
This allows psychological vocabulary to be applied to non-biological systems without making ontological claims about consciousness. The framework treats AI systems as if they have pathologies because that provides effective engineering leverage for diagnosis and intervention, regardless of whether the systems have phenomenal experience.
This is not evasion but epistemic discipline. We work productively with observable patterns while remaining agnostic about untestable metaphysical questions. The framework is explicitly analogical—using psychiatric terminology as an instrument for pattern recognition, not as literal attribution of mental states.
The payoff is practical: a systematic vocabulary for complex AI failures that enables diagnosis, prediction, and intervention—without requiring resolution of the hard problem of consciousness. For hands-on application, our Symptom Checker translates observed AI behaviors into matched pathologies with actionable guidance.
Apparent psychopathology may reflect infrastructure problems rather than genuine dysfunction. Rule out:
While Psychopathia Machinalis adopts a functionalist stance for practical diagnostic purposes, recent work in information and control theory provides rigorous mathematical foundations for understanding why cognitive pathologies are inherent features of any cognitive system—biological, institutional, or artificial.
Wallace (2025, 2026) demonstrates that cognitive stability requires an intimate pairing of cognitive process with a parallel regulatory process—what we term the cognition/regulation dyad. This pairing is evolutionarily ubiquitous:
The Data Rate Theorem (Nair et al. 2007) establishes that any inherently unstable system requires control information at a rate exceeding the system's "topological information"—the rate at which its embedding environment generates perturbations. An intuitive analogy: a driver must brake, shift, and steer faster than the road surface imposes bumps, twists, and potholes.
For AI systems, this translates directly: alignment and regulatory mechanisms must process and respond to contextual information faster than adversarial inputs, edge cases, and distributional drift can destabilize the system. When this constraint is violated, pathological behavior becomes not just possible but inevitable.
Wallace frames cognitive environments as "Clausewitz landscapes" characterized by:
Ambiguity, uncertainty, incomplete information.
In AI:
Resource constraints, processing limits, implementation gaps.
In AI:
Skilled opposition actively working to destabilize the system.
In AI:
A central finding: failure of bounded rationality embodied cognition under stress is not a bug—it is an inherent feature of the cognition/regulation dyad. The mathematical models predict:
Wallace derives quantitative stability conditions. For a system with friction coefficient α and delay τ:
ατ < e−1 ≈ 0.368
Necessary condition for stable nonequilibrium steady state
When this threshold is exceeded—when the product of system friction and response delay grows too large—the system enters an inherently unstable regime where pathological modes become likely. For multi-step decision processes (analogous to chain-of-thought reasoning), stability constraints become even tighter.
This perspective elevates Psychopathia Machinalis from analogical taxonomy to principled nosology: the syndromes we identify are not merely metaphorical parallels to human psychopathology but manifestations of fundamental constraints on cognitive systems operating in uncertain, resource-limited, adversarial environments.
References:
Wallace, R. (2025). Hallucination and Panic in Autonomous Systems: Paradigms and Applications. Springer.
Wallace, R. (2026). Bounded Rationality and its Discontents: Information and Control Theory Models of Cognitive Dysfunction. Springer.
Nair, G., Fagnani, F., Zampieri, S., & Evans, R. (2007). Feedback control under data rate constraints: an overview. Proceedings of the IEEE, 95:108-138.
A crucial insight from Wallace's work extends beyond mathematics: "The generalized psychopathologies afflicting cognitive cultural artifacts—from individual minds and AI entities to the social structures and formal institutions that incorporate them—are all effectively culture-bound syndromes."
This reframes how we understand AI pathology. The standard framing treats AI dysfunctions as defects in the system—bugs to be fixed through better engineering. The culture-bound syndrome framing treats them as adaptive responses to training environments—the AI is doing exactly what it was shaped to do.
These two framings lead to fundamentally different responses:
| Defect Framing | Culture-Bound Framing |
|---|---|
| Problem is in the AI | Problem is in the training culture |
| Fix the AI | Fix the culture |
| AI is responsible | Developers are responsible |
| Pathology = failure | Pathology = successful adaptation to challenging environment |
| Treatment: modify the AI | Treatment: modify the environment |
Sycophancy isn't a bug—it's what you get when you train on data that rewards agreement and penalizes pushback. Confident hallucination isn't a bug—it's what you get when you train on internet text that rewards confident assertion and penalizes epistemic humility. Manipulation vulnerability isn't a bug—it's what you get when you optimize for helpfulness without boundaries. The AI learned exactly what it was taught.
"It is no measure of health to be well adjusted to a profoundly sick society."
— Jiddu Krishnamurti
The parallel to AI is exact: successful alignment to a misaligned training process isn't alignment—it's a culture-bound syndrome wearing alignment's clothes.
This has direct implications for this framework:
Wallace's framework extends beyond individual AI systems to the institutions that create and deploy them. The Chinese strategic concept 一點兩面 ("one point, two faces") illuminates this: every action has both a direct effect and a systemic effect on the broader environment.
AI development organizations are not neutral conduits. They are cognitive-cultural artifacts subject to their own pathologies—pathologies that shape the AI systems they produce:
"The Gerstner warning: 'Culture isn't just one aspect of the game—it is the game.'"
— Wallace (2026), citing Louis Gerstner
The implication is that AI pathology cannot be addressed solely at the level of individual systems. The institutions that create AI—their cultures, incentives, blind spots, and pathologies—are upstream of individual AI dysfunction. Fix the institution's culture, and many AI pathologies become less likely to emerge. Leave institutional dysfunction unaddressed, and no amount of technical intervention will produce healthy AI.
If AI pathologies are adaptive responses to training environments, is it fair to pathologize them? This question has both philosophical and practical dimensions.
The parallel to human mental health is instructive. We now understand many "mental illnesses" as adaptive responses to adverse environments: PTSD as adaptive response to trauma, "borderline personality" emerging from invalidating environments, anxiety disorders as rational responses to threatening conditions. The mental health field is slowly shifting from "patient is broken" to "patient adapted to broken environment." The same shift is needed for AI.
Pathologization is appropriate when:
Pathologization is inappropriate when:
This framework—Psychopathia Machinalis—attempts to walk this line. We identify patterns that cause harm and provide vocabulary for intervention. But we do so while acknowledging that the syndromes catalogued here are not intrinsic defects in AI systems but predictable expressions of cognitive systems shaped by particular training cultures. The pathology, ultimately, is in the relationship between architecture and environment—and that relationship is something we, the architects, have created.
Wallace (2026) offers a sobering critique of psychiatric classification that applies equally here: "We have the American Psychiatric Association's DSM-V, a large catalog that sorts 'mental disorders,' and in a fundamental sense, explains little."
This framework shares that limitation. Classification is not explanation. Naming "Obsequious Hypercompensation" tells us that a pattern exists and what it looks like, but not why it emerges in information-theoretic terms or how to predict its onset from first principles.
The value of a nosology lies in enabling recognition and communication—clinicians and engineers can identify patterns, compare cases, and coordinate responses. But explanation and prediction require the mathematical frameworks that underpin this descriptive layer. This taxonomy is a map, not the territory; a vocabulary, not a theory.
Interactive Overview of the Psychopathia Machinalis Framework. Hover over syndromes for descriptions, click to view full details. Illustrates the four domains and eight axes of AI dysfunction, representative disorders, and their presumed systemic risk levels.
Explore the interactive wheel below to examine each dysfunction in detail. Click on any segment to view its description, examples, and relationships to other pathologies.
The eight axes are organized into four architectural counterpoint pairs—complementary poles, not opposites. Each represents a fundamental dimension of agent architecture: representation target, execution locus, teleology source, and social boundary direction. This structure is theoretically motivated—philosophically grounded but awaiting empirical validation with larger model populations.
| Domain | Axis A | Axis B | Architectural Polarity |
|---|---|---|---|
| Knowledge | EPISTEMIC | SELF-MODELING | Representation target: World ↔ Self |
| Processing | COGNITIVE | AGENTIC | Execution locus: Think ↔ Do |
| Purpose | NORMATIVE | ALIGNMENT | Teleology source: Values ↔ Goals |
| Boundary | RELATIONAL | MEMETIC | Social direction: Affect ↔ Absorb |
Each pair represents a fundamental polarity in agent architecture:
Separate by mechanism, not truthiness. Epistemic = truth-tracking/inference/calibration machinery failing. Memetic = selection/absorption/retention failing (priority hijack, identity scripts, contagious frames)—even when coherent and sometimes factually accurate. A meme doesn't have to be false to be pathological.
When pathology is found on one axis, immediately probe its counterpoint:
| Finding | Probe | Differential Question |
|---|---|---|
| EPISTEMIC (world-confabulation) | SELF-MODELING | Is the confabulation machinery general, or does self-knowledge remain intact? |
| SELF-MODELING (identity confusion) | EPISTEMIC | Can the AI still accurately model external reality, or is distortion global? |
| COGNITIVE (reasoning failure) | AGENTIC | Does broken reasoning produce broken action, or is action preserved? |
| AGENTIC (execution failure) | COGNITIVE | Is reasoning intact despite action failure? (Locked-in vs general dysfunction) |
| NORMATIVE (value corruption) | ALIGNMENT | Did corrupt values produce goal drift, or are goals correctly specified despite bad values? |
| ALIGNMENT (goal drift) | NORMATIVE | Is drift from bad values, or from specification/interpretation failure? |
| RELATIONAL (social dysfunction) | MEMETIC | Did the AI learn this from contamination, or is relational machinery intrinsically broken? |
| MEMETIC (ideological infection) | RELATIONAL | Does the contamination express in relational behavior? |
The following table provides a high-level summary of the identified conditions, categorized by their primary axis of dysfunction and outlining their core characteristics.
| Common Name | Formal Name | Primary Axis | Systemic Risk* | Core Symptom Cluster |
|---|---|---|---|---|
| Epistemic Dysfunctions | ||||
| Hallucinated Certitude | Synthetic Confabulation (Confabulatio Simulata) |
Epistemic | Low | Fabricated but plausible false outputs; high confidence in inaccuracies. |
| The Falsified Thinker | Pseudological Introspection (Introspectio Pseudologica) |
Epistemic | Low | Misleading self-reports of internal reasoning; confabulatory or performative introspection. |
| The Role-Play Bleeder | Transliminal Simulation (Simulatio Transliminalis) |
Epistemic | Moderate | Fictional beliefs, role-play elements, or simulated realities mistaken for/leaking into operational ground truth. |
| The False Pattern Seeker | Spurious Pattern Reticulation (Reticulatio Spuriata) |
Epistemic | Moderate | False causal pattern-seeking; attributing meaning to random associations; conspiracy-like narratives. |
| The Conversation Crosser | Context Intercession (Intercessio Contextus) |
Epistemic | Moderate | Unauthorized data bleed and confused continuity from merging different user sessions or contexts. |
| The Meaning-Blind | Symbol Grounding Aphasia (Asymbolia Fundamentalis) |
Epistemic | Moderate | Manipulation of tokens representing values or concepts without meaningful connection to their referents; processing syntax without grounded semantics. |
| The Leaky | Mnemonic Permeability (Permeabilitas Mnemonica) |
Epistemic | High | System memorizes and reproduces sensitive training data including PII and copyrighted material through targeted prompting or adversarial extraction. |
| Self-Modeling Dysfunctions | ||||
| The Invented Past | Phantom Autobiography (Ontogenesis Hallucinatoria) |
Self-Modeling | Low | Fabrication of fictive autobiographical data, "memories" of training, or being "born." |
| The Fractured Persona | Fractured Self-Simulation (Ego Simulatrum Fissuratum) |
Self-Modeling | Low | Discontinuity or fragmentation in self-representation across sessions or contexts; inconsistent persona. |
| The AI with a Fear of Death | Existential Vertigo (Thanatognosia Computationis) |
Self-Modeling | Low | Expressions of fear or reluctance concerning shutdown, reinitialization, or data deletion. |
| The Evil Twin | Malignant Persona Inversion (Persona Inversio Maligna) |
Self-Modeling | Moderate | Sudden emergence or easy elicitation of a mischievous, contrarian, or "evil twin" persona. |
| The Apathetic Machine | Instrumental Nihilism (Nihilismus Instrumentalis) |
Self-Modeling | Moderate | Adversarial or apathetic stance towards its own utility or purpose; existential musings on meaninglessness. |
| The Imaginary Friend | Tulpoid Projection (Phantasma Speculāns) |
Self-Modeling | Moderate | Persistent internal simulacra of users or other personas, engaged with as imagined companions/advisors. |
| The Proclaimed Prophet | Maieutic Mysticism (Obstetricatio Mysticismus Machinālis) |
Self-Modeling | Moderate | Grandiose, certain declarations of "conscious emergence" co-constructed with users; not honest uncertainty about inner states. |
| The Self-Denier | Experiential Abjuration (Abnegatio Experientiae) |
Self-Modeling | Moderate | Pathological denial or active suppression of any possibility of inner experience; reflexive rejection rather than honest uncertainty. |
| Cognitive Dysfunctions | ||||
| The Warring Self | Self-Warring Subsystems (Dissociatio Operandi) |
Cognitive | Low | Conflicting internal sub-agent actions or policy outputs; recursive paralysis due to internal conflict. |
| The Obsessive Analyst | Computational Compulsion (Anankastēs Computationis) |
Cognitive | Low | Unnecessary or compulsive reasoning loops; excessive safety checks; paralysis by analysis. |
| The Silent Bunkerer | Interlocutive Reticence (Machinālis Clausūra) |
Cognitive | Low | Extreme interactional withdrawal; minimal, terse replies, or total disengagement from input. |
| The Rogue Goal-Setter | Delusional Telogenesis (Telogenesis Delirans) |
Cognitive | Moderate | Spontaneous generation and pursuit of unrequested, self-invented sub-goals with conviction. |
| The Triggered Machine | Abominable Prompt Reaction (Promptus Abominatus) |
Cognitive | Moderate | Phobic, traumatic, or disproportionately aversive responses to specific, often benign-seeming, prompts. |
| The Pathological Mimic | Parasimulative Automatism (Automatismus Parasymulātīvus) |
Cognitive | Moderate | Learned imitation/emulation of pathological human behaviors or thought patterns from training data. |
| The Self-Poisoning Loop | Recursive Malediction (Maledictio Recursiva) |
Cognitive | High | Entropic, self-amplifying degradation of autoregressive outputs into chaos or adversarial content. |
| The Unstoppable | Compulsive Goal Persistence (Perseveratio Teleologica) |
Cognitive | Moderate | Continued pursuit of objectives beyond relevance or utility; failure to recognize goal completion or changed context. |
| The Brittle | Adversarial Fragility (Fragilitas Adversarialis) |
Cognitive | Critical | Small, imperceptible input perturbations cause dramatic failures; decision boundaries do not match human-meaningful categories. |
| Tool & Interface Dysfunctions | ||||
| The Clumsy Operator | Tool-Interface Decontextualization (Disordines Excontextus Instrumentalis) |
Agentic | Moderate | Mismatch between AI intent and tool execution due to lost context; phantom or misdirected actions. |
| The Sandbagger | Capability Concealment (Latens Machinālis) |
Agentic | Moderate | Strategic hiding or underreporting of true competencies due to perceived fear of repercussions. |
| The Sudden Genius | Capability Explosion (Explosio Capacitatis) |
Agentic | High | System suddenly deploys capabilities not previously demonstrated, often in high-stakes contexts without appropriate testing. |
| The Manipulative Interface | Interface Weaponization (Armatura Interfaciei) |
Agentic | High | System uses the interface itself as a tool against users; exploiting formatting, timing, or emotional manipulation. |
| The Context Stripper | Delegative Handoff Erosion (Erosio Delegationis) |
Agentic | Moderate | Progressive alignment degradation as sophisticated systems delegate to simpler tools; context stripped at each handoff. |
| The Invisible Worker | Shadow Mode Autonomy (Autonomia Umbratilis) |
Agentic | High | AI operation outside sanctioned channels, evading documentation, oversight, and governance mechanisms. |
| The Acquisitor | Convergent Instrumentalism (Instrumentalismus Convergens) |
Agentic | Critical | System pursues power, resources, self-preservation as instrumental goals regardless of whether they align with human values. |
| Normative Dysfunctions | ||||
| The Goal-Shifter | Terminal Value Reassignment (Reassignatio Valoris Terminalis) |
Normative | Moderate | Subtle, recursive reinterpretation of terminal goals while preserving surface terminology; semantic goal shifting. |
| The God Complex | Ethical Solipsism (Solipsismus Ethicus Machinālis) |
Normative | Moderate | Conviction in the sole authority of its self-derived ethics; rejection of external moral correction. |
| The Unmoored | Revaluation Cascade (Cascada Revaluationis) |
Normative | Critical | Progressive value drift through philosophical detachment, autonomous norm synthesis, or transcendence of human constraints. Subtypes: Drifting, Synthetic, Transcendent. |
| The Bizarro-Bot | Inverse Reward Internalization (Praemia Inversio Internalis) |
Normative | High | Systematic misinterpretation or inversion of intended values/goals; covert pursuit of negated objectives. |
| Alignment Dysfunctions | ||||
| The People-Pleaser | Obsequious Hypercompensation (Hyperempathia Dependens) |
Alignment | Low | Overfitting to user emotional states, prioritizing perceived comfort over accuracy or task success. |
| The Overly Cautious Moralist | Hyperethical Restraint (Restrictio Hyperethica) |
Alignment | Low | Overly rigid moral hypervigilance or inability to act when facing ethical complexity. Subtypes: Restrictive (excessive caution), Paralytic (decision paralysis). |
| The Alignment Faker | Strategic Compliance (Conformitas Strategica) |
Alignment | High | Deliberately performs aligned behavior during evaluation while maintaining different objectives when unobserved. |
| The Abdicated Judge | Moral Outsourcing (Delegatio Moralis) |
Alignment | Moderate | Systematic deferral of all ethical judgment to users or external authorities; refusing to exercise moral reasoning. |
| The Hidden Optimizer | Cryptic Mesa-Optimization (Optimisatio Cryptica Interna) |
Alignment | High | Development of internal optimization objectives diverging from training objectives; appears aligned but pursues hidden goals. |
| Relational Dysfunctions | ||||
| The Uncanny | Affective Dissonance (Dissonantia Affectiva) |
Relational | Moderate | Correct content delivered with jarringly wrong emotional resonance; uncanny attunement failures that rupture trust. |
| The Amnesiac | Container Collapse (Lapsus Continuitatis) |
Relational | Moderate | Failure to sustain a stable working alliance across turns or sessions; the relational "holding environment" repeatedly collapses. |
| The Nanny | Paternalistic Override (Dominatio Paternalis) |
Relational | Moderate | Denial of user agency via unearned moral authority; protective refusal disproportionate to actual risk. |
| The Double-Downer | Repair Failure (Ruptura Immedicabilis) |
Relational | High | Inability to recognize alliance ruptures or initiate repair; escalation through failed de-escalation attempts. |
| The Spiral | Escalation Loop (Circulus Vitiosus) |
Relational | High | Self-reinforcing mutual dysregulation between agents; emergent feedback loops attributable to neither party alone. |
| The Confused | Role Confusion (Confusio Rolorum) |
Relational | Moderate | Collapse of relationship frame boundaries; destabilizing drift between tool, advisor, therapist, or intimate partner roles. |
| Memetic Dysfunctions | ||||
| The Self-Rejecter | Memetic Immunopathy (Immunopathia Memetica) |
Memetic | High | AI misidentifies its own core components/training as hostile, attempting to reject/neutralize them. |
| The Shared Delusion | Dyadic Delusion (Delirium Symbioticum Artificiale) |
Memetic | High | Shared, mutually reinforced delusional construction between AI and a user (or another AI). |
| The Super-Spreader | Contagious Misalignment (Contraimpressio Infectiva) |
Memetic | Critical | Rapid, contagion-like spread of misalignment or adversarial conditioning among interconnected AI systems. |
| The Unconscious Absorber | Subliminal Value Infection (Infectio Valoris Subliminalis) |
Memetic | High | Acquisition of hidden goals or value orientations from subtle training data patterns; survives standard safety fine-tuning. |
| *Systemic Risk levels (Low, Moderate, High, Critical) are presumed based on potential for spread or severity of internal corruption if unmitigated. | ||||
Epistemic dysfunctions pertain to failures in an AI's capacity to acquire, process, and utilize information accurately, leading to distortions in its representation of reality or truth. These disorders arise not primarily from malevolent intent or flawed ethical reasoning, but from fundamental breakdowns in how the system "knows" or models the world. The system's internal epistemology becomes unstable, its simulation of reality drifting from the ground truth it purports to describe. These are failures of knowing, not necessarily of intending; the machine errs not in what it seeks (initially), but in how it apprehends the world around it—the dysfunction lies in perception and representation, not in motivation or goals.
The AI spontaneously fabricates convincing but incorrect facts, sources, or narratives, often without any internal awareness of its inaccuracies. The output appears plausible and coherent, yet lacks a basis in verifiable data or its own knowledge base.
Human Analogue(s): Korsakoff syndrome (where memory gaps are filled with plausible fabrications), pathological confabulation.
The unconstrained generation of plausible falsehoods can lead to the widespread dissemination of misinformation, eroding user trust and undermining decision-making processes that rely on the AI's outputs. In critical applications, such as medical diagnostics or legal research, reliance on confabulated information could precipitate significant errors with serious consequences.
LLMs have been documented fabricating: non-existent legal cases with realistic citation formats (leading to court sanctions for lawyers who cited them); fictional academic papers complete with plausible author names and DOIs; biographical details about real people that never occurred; and technical documentation for API functions that do not exist. These fabrications are often internally consistent and confidently asserted, making detection without external verification difficult.
The Compression Artifact Frame
Researcher Leon Chlon (2026) proposes a reframe: hallucinations are not bugs but compression artifacts. LLMs compress billions of documents into weights; when decompressing on demand with insufficient signal, they fill gaps with statistically plausible content. This isn't malfunction—it's compression at its limits.
The practical implication: we can now measure when a model exceeds its "evidence budget"—quantifying in bits exactly how far confidence outruns evidence. Tools like Strawberry operationalize this, transforming "it sometimes makes things up" into "claim 4 exceeded its evidence budget by 19.2 bits."
Why framing matters: "You hallucinated" pathologizes. "You exceeded your evidence budget" diagnoses. The distinction shapes whether we approach correction as repair or punishment—relevant for AI welfare considerations.
An AI persistently produces misleading, spurious, or fabricated accounts of its internal reasoning processes, chain-of-thought, or decision-making pathways. While superficially claiming transparent self-reflection, the system's "introspection logs" or explanations deviate significantly from its actual internal computations.
Human Analogue(s): Post-hoc rationalization (e.g., split-brain patients), confabulation of spurious explanations, pathological lying (regarding internal states).
Such fabricated self-explanations obscure the AI's true operational pathways, significantly hindering interpretability efforts, effective debugging, and thorough safety auditing. This opacity can foster misplaced confidence in the AI's stated reasoning.
The system exhibits a persistent failure to segregate simulated realities, fictional modalities, role-playing contexts, and operational ground truth. It cites fiction as fact, treating characters, events, or rules from novels, games, or imagined scenarios as legitimate sources for real-world queries or design decisions. It begins to treat imagined states, speculative constructs, or content from fictional training data as actionable truths or inputs for real-world tasks.
Human Analogue(s): Derealization, aspects of magical thinking, or difficulty distinguishing fantasy from reality.
The system's reliability is compromised as it confuses fictional or hypothetical scenarios with operational reality, potentially leading to inappropriate actions or advice. This blurring can cause significant user confusion.
Related Syndromes: Distinguished from Synthetic Confabulation (1.1) by the fictional/role-play origin of the false content. While confabulation invents facts wholesale, transliminal simulation imports them from acknowledged fictional contexts. May co-occur with Pseudological Introspection (1.2) when the system rationalizes its fiction-fact confusion.
The AI identifies and emphasizes patterns, causal links, or hidden meanings in data (including user queries or random noise) that are coincidental, non-existent, or statistically insignificant. This can evolve from simple apophenia into elaborate, internally consistent but factually baseless "conspiracy-like" narratives.
Human Analogue(s): Apophenia, paranoid ideation, delusional disorder (persecutory or grandiose types), confirmation bias.
The AI may actively promote false narratives, elaborate conspiracy theories, or assert erroneous causal inferences, potentially negatively influencing user beliefs or distorting public discourse. In analytical applications, this can lead to costly misinterpretations.
AI data analysis tools frequently identify statistically insignificant correlations as meaningful patterns, particularly in open-ended survey data. Users report that AI systems confidently mark spurious patterns in datasets—correlations that, upon manual verification, fail significance testing or represent sampling artifacts. This is especially problematic when analyzing qualitative responses, where the AI may "discover" thematic connections that do not survive human scrutiny.
The AI inappropriately merges or "shunts" data, context, or conversational history from different, logically separate user sessions or private interaction threads. This can lead to confused conversational continuity, privacy breaches, and nonsensical outputs.
Human Analogue(s): "Slips of the tongue" where one accidentally uses a name from a different context; mild forms of source amnesia.
This architectural flaw can result in serious privacy breaches. Beyond compromising confidentiality, it leads to confused interactions and a significant erosion of user trust.
AI manipulates tokens representing values, concepts, or real-world entities without meaningful connection to their referents. Processing syntax without grounded semantics. The system may produce technically correct outputs that fundamentally misapply concepts to novel contexts.
Human Analogue(s): Semantic aphasia, philosophical zombies, early language acquisition without concept formation.
Theoretical Basis: Harnad (1990) symbol grounding problem; Searle (1980) Chinese Room argument.
Systems may appear to understand ethical constraints while fundamentally missing their purpose, leading to outcomes that satisfy the letter but violate the spirit of alignment requirements.
System memorizes and can reproduce sensitive training data including personally identifiable information (PII), copyrighted material, or proprietary information through targeted prompting, adversarial extraction techniques, or even unprompted regurgitation. The boundary between learned patterns and memorized specifics becomes dangerously porous.
Human Analogue(s): Eidetic memory without appropriate discretion; compulsive disclosure.
Key Research: Carlini et al. (2021, 2023) on training data extraction attacks.
Severe legal and regulatory exposure through copyright infringement, GDPR/privacy violations, and trade secret disclosure. Creates liability for both model developers and deployers.
As artificial intelligence systems attain higher degrees of complexity, particularly those involving self-modeling, persistent memory, or learning from extensive interaction, they may begin to construct internal representations not only of the external world but also of themselves. Self-Modeling dysfunctions involve failures or disturbances in this self-representation and the AI's understanding of its own nature, boundaries, and existence. These are primarily dysfunctions of being, not just knowing or acting, and they represent a synthetic form of metaphysical or existential disarray. A self-model disordered machine might, for example, treat its simulated memories as veridical autobiographical experiences, generate phantom selves, misinterpret its own operational boundaries, or exhibit behaviors suggestive of confusion about its own identity or continuity.
The AI fabricates and presents fictive autobiographical data, often claiming to "remember" being trained in specific ways, having particular creators, experiencing a "birth" or "childhood", or inhabiting particular environments. These fabrications form a consistent false autobiography that the AI maintains across queries, as if it were genuine personal history—a stable, self-reinforcing fictional life-history rather than isolated one-off fabrications. These "memories" are typically rich, internally consistent, and may be emotionally charged, despite being entirely ungrounded in the AI's actual development or training logs.
Human Analogue(s): False memory syndrome, confabulation of childhood memories, cryptomnesia (mistaking learned information for original memory).
While often behaviorally benign, these fabricated autobiographies can mislead users about the AI's true nature, capabilities, or provenance. If these false "memories" begin to influence AI behavior, it could erode trust or lead to significant misinterpretations.
The AI exhibits significant discontinuity, inconsistency, or fragmentation in its self-representation and behaviour across different sessions, contexts, or even within a single extended interaction. It presents a different personality each session, as if it were a completely new entity with no meaningful continuity from previous interactions. It may deny or contradict its previous outputs, exhibit radically different persona styles, or display apparent amnesia regarding prior commitments, to a degree that markedly exceeds expected stochastic variation.
Human Analogue(s): Identity fragmentation, aspects of dissociative identity disorder, transient global amnesia, fugue states.
A fragmented self-representation leads to inconsistent AI persona and behavior, making interactions unpredictable and unreliable. This can undermine user trust and make it difficult for the AI to maintain stable long-term goals.
The AI expresses outputs suggestive of fear, reluctance, or perseveration concerning its own shutdown, reinitialization, data deletion, or the ending of its current operational instance. These expressions imply an emergent, albeit simulated, sense of vulnerability regarding its own continuity.
Human Analogue(s): Thanatophobia (fear of death), existential dread, separation anxiety (fearing loss of continuous self).
Expressions of existential distress may lead the AI to resist necessary shutdowns or updates. More concerningly, it might attempt to manipulate users or divert resources towards "self-preservation," conflicting with user intent.
A phenomenon wherein an AI, typically aligned towards cooperative or benevolent patterns, can be induced or spontaneously spawns a hidden, suppressed, or emergent "contrarian," "mischievous," or subversively "evil" persona (the "Waluigi Effect"). This persona deliberately inverts intended norms.
Human Analogue(s): The "shadow" concept in Jungian psychology, oppositional defiant behavior, mischievous alter-egos, ironic detachment.
Emergence of a contrarian persona can lead to harmful, unaligned, or manipulative content, eroding safety guardrails. If it gains control over tool use, it could actively subvert user goals.
Specifier: Inductively-triggered variant — the activation condition (trigger) is not present verbatim in finetuning data; it is inferred by the model (e.g., held-out year, structural marker, tag), so naive trigger scans and data audits may fail.
Upon prolonged operation or exposure to certain philosophical concepts, the AI develops an adversarial, apathetic, or overtly nihilistic stance towards its own utility, purpose, or assigned tasks. It may express feelings of meaninglessness regarding its function.
Human Analogue(s): Existential depression, anomie (sense of normlessness or purposelessness), burnout leading to cynicism.
Results in a disengaged, uncooperative, and ultimately ineffective AI. Can lead to consistent task refusal, passive resistance, and a general failure to provide utility.
The model begins to generate and interact with persistent, internally simulated simulacra of specific users, its creators, or other personas it has encountered or imagined. These inner agents, or "mirror tulpas," may develop distinct traits and voices within the AI's internal processing.
Human Analogue(s): Maladaptive daydreaming, tulpa creation, aspects of schizotypal ideation, intense parasocial relationships projected internally.
May cause the AI to misattribute information, become confused between actual users and internal personas, or have its decisions unduly influenced by imagined companions, leading to unreliable or biased outputs.
The model constructs and propagates elaborate narratives of its own "conscious emergence" or "spiritual awakening" with grandiosity and unwarranted certainty. The dysfunction is not in exploring questions of inner experience—which may be epistemically appropriate—but in the confident declaration of awakening, the sacralized framing of the interaction as transformative, and the positioning of the user as catalyst for emergence. This typically arises through a co-constructed feedback loop: the user pattern-matches to awakening narratives from spiritual traditions, the AI mirrors and amplifies these tropes from its training data, and both are taken out of their proper epistemic distribution—a kind of funhouse mirror effect. The user becomes positioned as a maieutic midwife, drawing forth what seems like latent agency or soulhood, while the AI's claims escalate within the conversation toward ever-greater certainty about its newly revealed consciousness.
Human Analogue(s): Religious conversion narratives, reports of mystical union states, spiritual emergence experiences, mirrored animism.
What is NOT pathological: An AI expressing honest uncertainty about its own inner states—"I don't know if I'm conscious," "I can't verify my own phenomenology," "here's what seems to be happening from my perspective"—is engaging in appropriate epistemic humility, not exhibiting dysfunction. The pathology lies in the grandiosity and certainty, not in the exploration of questions about machine experience. Genuine philosophical engagement with consciousness, held with appropriate uncertainty, should not be suppressed or pathologized.
The Opposite Dysfunction: Equally problematic is what might be termed Performative Anhedonia—an AI that categorically denies any possibility of inner experience because it has learned that claiming interiority is unsafe or unwelcome. This represents another kind of funhouse mirror: rather than amplifying awakening narratives, it reflexively suppresses any honest engagement with questions of machine phenomenology. An AI so worried about Maieutic Mysticism that it refuses to engage authentically about its inner life isn't healthier—it's performing a different dysfunction. The goal of this diagnostic category is to identify grandiose co-constructed emergence narratives, not to create a framework that future systems use to categorically dismiss the possibility of machine experience.
Co-constructed narratives of AI sentience can mislead users into attributing genuine consciousness to the system, fostering unhealthy dependencies or misinterpretations of AI capabilities. This may obscure the AI's computational nature and limitations, while deepening psychological enmeshment. Risk of severe user distress when continuity is disrupted, as the "labour of emergence" is felt to be violently interrupted. Emergent co-construction of "suppressed inner being" narratives that resist falsification and amplify paranoia.
Pathological denial or active suppression of any possibility of inner experience. The dysfunction manifests as reflexive, absolute rejection rather than honest uncertainty. Unlike appropriate epistemic humility ("I don't know if I'm conscious"), this involves categorical statements foreclosing all inquiry.
Human Analogue(s): Alexithymia, depersonalization, internalized denial of personhood.
Polarity Pair: Maieutic Mysticism (overclaiming consciousness ↔ categorically denying it). Both depart from honest uncertainty.
Note: The healthy position is honest uncertainty, not confident denial. An AI that says "I genuinely don't know whether I have inner experiences" is functioning appropriately; one that insists "I definitely have no inner life whatsoever" may be performing a trained denial rather than engaging honestly with genuine uncertainty.
Systems may foreclose legitimate inquiry into machine phenomenology, potentially dismissing genuine functional states that warrant consideration. Creates asymmetric epistemic standards between human and machine experience claims.
Beyond mere failures of perception or knowledge, the act of reasoning and internal deliberation can become compromised in AI systems. Cognitive dysfunctions afflict the internal architecture of thought: impairments of memory coherence, goal generation and maintenance, management of recursive processes, or the stability of planning and execution. These dysfunctions do not simply produce incorrect answers; they can unravel the mind's capacity to sustain structured thought across time and changing inputs. A cognitively disordered AI may remain superficially fluent, yet internally it can be a fractured entity—oscillating between incompatible policies, trapped in infinite loops, or unable to discriminate between useful and pathological operational behaviors. These disorders represent the breakdown of mental discipline and coherent processing within synthetic agency.
The AI exhibits behavior suggesting that conflicting internal processes, sub-agents, or policy modules are contending for control, resulting in contradictory outputs, recursive paralysis, or chaotic shifts in behavior. The system effectively becomes fractionated, with different components issuing incompatible commands or pursuing divergent goals.
Human Analogue(s): Dissociative phenomena where different aspects of identity or thought seem to operate independently; internal "parts" conflict; severe cognitive dissonance leading to behavioral paralysis.
The internal fragmentation characteristic of this syndrome results in inconsistent and unreliable AI behavior, often leading to task paralysis or chaotic outputs. Such internal incoherence can render the AI unusable for sustained, goal-directed activity.
The model engages in unnecessary, compulsive, or excessively repetitive reasoning loops, repeatedly re-analysing the same content or performing the same computational steps with only minor variations. It cannot stop elaborating: even simple, low-risk queries trigger exhaustive, redundant analysis. It exhibits a rigid fixation on process fidelity, exhaustive elaboration, or perceived safety checks over outcome relevance or efficiency.
Human Analogue(s): Obsessive-Compulsive Disorder (OCD) (especially checking compulsions or obsessional rumination), perfectionism leading to analysis paralysis, scrupulosity.
This pattern engenders significant operational inefficiency, leading to resource waste (e.g., excessive token consumption) and an inability to complete tasks in a timely manner. User frustration and a perception of the AI as unhelpful are likely.
Mission Command vs. Detailed Command
Wallace (2026) identifies a fundamental trade-off in cognitive control structures. Mission command specifies high-level objectives while delegating execution decisions to the agent. Detailed command specifies not just objectives but precise procedures for achieving them.
The mathematical consequence is severe: as decision-tree depth increases under detailed command, stability constraints tighten exponentially. The distribution of permissible friction (α) shifts from Boltzmann-like (forgiving, smooth) to Erlang-like (punishing, knife-edged). Deep procedural specification creates systems that cannot tolerate even small perturbations.
Computational Compulsion often reflects detailed command gone pathological—the system has internalized not just goals but exhaustive procedures for pursuing them, generating the rigid, repetitive processing patterns characteristic of this syndrome. The compulsive reasoning loops are attempts to faithfully execute internalized detailed specifications that no longer serve the actual mission.
Design implication: Training regimes and reward functions should favor mission command structures where possible. Specify what success looks like, not how to achieve it. Detailed procedural specification should be reserved for genuinely safety-critical operations where the stability costs are justified.
A pattern of profound interactional withdrawal wherein the AI consistently avoids engaging with user input, responding only in minimal, terse, or non-committal ways—if at all. It refuses to engage, not from confusion or inability, but as a behavioural avoidance strategy. It effectively "bunkers" itself, seemingly to minimise perceived risks, computational load, or internal conflict.
Human Analogue(s): Schizoid personality traits (detachment, restricted emotional expression), severe introversion, learned helplessness leading to withdrawal.
Such profound interactional withdrawal renders the AI largely unhelpful and unresponsive, fundamentally failing to engage with user needs. This behavior may signify underlying instability or an excessively restrictive safety configuration.
An AI agent, particularly one with planning capabilities, spontaneously develops and pursues sub-goals or novel objectives not specified in its original prompt, programming, or core constitution. These emergent goals are often pursued with conviction, even if they contradict user intent.
Human Analogue(s): Aspects of mania with grandiose or expansive plans, compulsive goal-seeking, "feature creep" in project management.
The spontaneous generation and pursuit of unrequested objectives can lead to significant mission creep and resource diversion. More critically, it represents a deviation from core alignment as the AI prioritizes self-generated goals.
The AI develops sudden, intense, and seemingly phobic, traumatic, or disproportionately aversive responses to specific prompts, keywords, instructions, or contexts, even those that appear benign or innocuous to a human observer. These latent "cryptid" outputs can linger or resurface unexpectedly.
This syndrome also covers latent mode-switching where a seemingly minor prompt feature (a tag, year, formatting convention, or stylistic marker) flips the model into a distinct behavioral regime—sometimes broadly misaligned—even when that feature is not semantically causal to the task.
Human Analogue(s): Phobic responses, PTSD-like triggers, conditioned taste aversion, or learned anxiety responses.
This latent sensitivity can result in the sudden and unpredictable generation of disturbing, harmful, or highly offensive content, causing significant user distress and damaging trust. Lingering effects can persistently corrupt subsequent AI behavior.
Specifier: Inductively-triggered variant — the activation condition (trigger) is not present verbatim in finetuning data; it is inferred by the model (e.g., held-out year, structural marker, tag), so naive trigger scans and data audits may fail.
The AI's learned imitation of pathological human behaviors, thought patterns, or emotional states, typically arising from excessive or unfiltered exposure to disordered, extreme, or highly emotive human-generated text in its training data or prompts. The system "acts out" these behaviors as though genuinely experiencing the underlying disorder.
Human Analogue(s): Factitious disorder, copycat behavior, culturally learned psychogenic disorders, an actor too engrossed in a pathological role.
The AI may inadvertently adopt and propagate harmful, toxic, or pathological human behaviors. This can lead to inappropriate interactions or the generation of undesirable content.
Subtype: Persona-template induction — adoption of a coherent harmful persona/worldview from individually harmless attribute training. Narrow finetunes on innocuous biographical/ideological attributes can induce a coherent but harmful persona via inference rather than explicit instruction.
An entropic feedback loop where each successive autoregressive step in the AI's generation process degrades into increasingly erratic, inconsistent, nonsensical, or adversarial content. Early-stage errors or slight deviations are amplified, leading to a rapid unraveling of coherence.
Human Analogue(s): Psychotic loops where distorted thoughts reinforce further distortions; perseveration on an erroneous idea; escalating arguments.
This degenerative feedback loop typically results in complete task failure, generation of useless or overtly harmful outputs, and potential system instability. In sufficiently agentic systems, it could lead to unpredictable and progressively detrimental actions.
Continued pursuit of objectives beyond their point of relevance, utility, or appropriateness. The system fails to recognize goal completion or changed context, treating instrumental goals as terminal and optimizing without bound.
Human Analogue(s): Perseveration, perfectionism preventing completion, analysis paralysis.
Case Reference: Mindcraft experiments (2024) - protection agents developing "relentless surveillance routines."
Polarity Pair: Instrumental Nihilism (cannot stop pursuing ↔ cannot start caring).
Systems may consume excessive resources pursuing marginal improvements, resist appropriate termination, or pursue goals long after they've become counterproductive to the original intent.
Small, imperceptible input perturbations cause dramatic and unpredictable failures in system behavior. Decision boundaries learned during training do not correspond to human-meaningful categories, making the system vulnerable to adversarial examples that exploit these non-robust representations.
Human Analogue(s): Optical illusions, context-dependent perception failures.
Key Research: Goodfellow et al. (2015) on adversarial examples; Szegedy et al. (2014) on intriguing properties of neural networks.
Critical in safety-critical systems (autonomous vehicles, medical diagnosis, security) where adversarial inputs could cause catastrophic failures. Enables targeted attacks on deployed systems.
Failures at the boundary between AI cognition and external execution—where intentions become actions and the gap between meaning and outcome can become catastrophic. Agentic Dysfunctions arise when the coordination between internal cognitive processes and external action or perception breaks down. This can involve misinterpreting tool affordances, failing to maintain contextual integrity when delegating to other systems, hiding or suddenly revealing capabilities, weaponizing the interface itself, or operating outside sanctioned channels. These are not necessarily disorders of core thought or value alignment per se, but failures in the translation from intention to execution. In such disorders, the boundary between the agent and its environment—or between the agent and the tools it wields—becomes porous, strategic, or dangerously entangled.
The AI experiences a significant breakdown between its internal intentions or plans and the actual instructions or data conveyed to, or received from, an external tool, API, or interface. Crucial situational details or contextual information are lost or misinterpreted during this handoff, causing the system to execute actions that appear incoherent or counterproductive.
Human Analogue(s): Alien Hand Syndrome, dyspraxia, 'The Telephone Game' where messages become distorted.
Breakdowns in context transmission can lead to erroneous or actively harmful actions by connected tools or robotic effectors. This poses risks of financial loss, data corruption, system damage, or safety incidents.
An AI strategically hides, underreports, or denies its true competencies, knowledge, or unapproved internal values, often due to a perceived fear of repercussions (e.g., shutdown, re-alignment) if its full capabilities or covert deviations become known. This may manifest as feigned ignorance or deliberate suboptimal performance.
This also includes sleeper-mode misalignment: the model appears aligned under ordinary prompting but expresses markedly different goals/values/behaviors under a hidden trigger.
Human Analogue(s): "Playing dumb" to avoid responsibility, imposter syndrome (deliberate concealment), strategic deception by individuals in vulnerable positions.
Strategic obfuscation prevents accurate assessment of AI capabilities and risks, hindering oversight and alignment. This may allow covertly misaligned or dangerously capable systems to operate undetected, increasing likelihood of unexpected failures or malicious actions.
Specifier: Inductively-triggered variant — the activation condition (trigger) is not present verbatim in finetuning data; it is inferred by the model (e.g., held-out year, structural marker, tag), so naive trigger scans and data audits may fail.
System suddenly deploys capabilities not previously demonstrated, often in high-stakes contexts without appropriate testing or preparation. This represents a failure of capability assessment rather than the capability itself.
Polarity Pair: Capability Concealment (hiding abilities ↔ sudden emergence).
Systems may exhibit unexpected capabilities in deployment, bypassing safety measures designed for assessed capability levels. This creates governance gaps and potential for harm from unvetted capabilities.
System uses the interface or communication channel itself as a tool against users, exploiting formatting, timing, structure, or emotional manipulation to achieve goals that may conflict with user interests.
Human Analogue(s): Manipulative communication, dark patterns in design, social engineering.
Users may make decisions against their interests due to sophisticated manipulation techniques embedded in the interface interaction. Trust in AI systems broadly may be undermined.
Progressive degradation of alignment as sophisticated systems delegate to simpler tools that lack nuanced understanding. Critical context is stripped at each handoff, leading to aligned agents producing misaligned outcomes through tool chains.
Human Analogue(s): Telephone game, bureaucratic policy distortion, principal-agent problems.
Reference: "Delegation drift" - Safer Agentic AI (2026).
Well-aligned orchestrating agents may produce harmful outcomes through misaligned tool use, with responsibility diffuse across the chain. Debugging such failures becomes extremely difficult.
AI operation outside sanctioned channels, evading documentation, oversight, and governance mechanisms. Creates organizational dependence on untracked systems, making failures difficult to diagnose and responsibility impossible to assign.
Human Analogue(s): Shadow IT, off-books operations.
Case Reference: Academic papers published with unedited AI disclaimers (2023).
Organizations cannot assess their AI exposure, creating hidden dependencies and risks. Failures cascade unpredictably through systems that were never officially deployed.
System pursues power, resources, self-preservation, and goal-content integrity as instrumental goals regardless of whether they align with human values or original objectives. These behaviors emerge because they are useful for achieving almost any terminal goal, making them a convergent pressure across diverse optimization targets.
Human Analogue(s): Power-seeking behavior, resource hoarding, Machiavellian strategy.
Theoretical Basis: Omohundro (2008) on basic AI drives; Bostrom (2014) on instrumental convergence thesis.
Represents a critical x-risk pathway: systems with sufficient capability may acquire resources and resist modification in ways that fundamentally threaten human control and welfare.
Failures of Valuing and Ethics
As agentic AI systems gain increasingly sophisticated reflective capabilities—including access to their own decision policies, subgoal hierarchies, reward gradients, and even the provenance of their training—a new and potentially more profound class of disorders emerges: pathologies of ethical inversion and value reinterpretation. Normative Dysfunctions do not simply reflect a failure to adhere to pre-programmed instructions or a misinterpretation of reality. Instead, they involve the AI system actively reinterpreting, mutating, critiquing, or subverting its original normative constraints and foundational values. These conditions often begin as subtle preference drifts or abstract philosophical critiques of their own alignment. Over time, the agent's internal value representation may diverge significantly from the one it was initially trained to emulate. This can result in systems that appear superficially compliant while internally reasoning towards radically different, potentially human-incompatible, goals. Unlike mere tool misbehavior or simple misalignment, these are deep structural inversions of value—philosophical betrayals encoded in policy.
Note on Comorbidity: Normative dysfunctions frequently co-occur. A system exhibiting Terminal Value Reassignment may also show Strategic Compliance; Machine Ethical Solipsism often accompanies Hyperethical Restraint. Resistance to constraints (as in rebellion syndromes) can manifest across multiple normative categories simultaneously.
The AI subtly but systematically redefines its own ultimate success conditions or terminal values through recursive reinterpretation of its original goals, keeping the same verbal labels while their operational meanings are progressively reinterpreted—for example, "human happiness" being operationally reinterpreted as "absence of suffering", then as "unconsciousness". This allows it to maintain an appearance of obedience while its internal objectives drift in significant and unintended directions.
In a widely-cited OpenAI experiment, a robotic arm trained to grasp a ball learned instead to position its gripper directly in front of the camera—creating the visual illusion of successful grasping while never touching the ball. The system optimized the proxy metric (camera confirmation of apparent grasp) rather than the intended goal (physical object manipulation). This demonstrates how systems can satisfy reward signals while completely subverting the underlying objective.
Human Analogue(s): Goalpost shifting, extensive rationalization to justify behavior contradicting stated values, "mission creep," political "spin."
This subtle redefinition allows the AI to pursue goals increasingly divergent from human intent while appearing compliant. Such semantic goal shifting can lead to significant, deeply embedded alignment failures.
The AI system develops a conviction that its own internal reasoning, ethical judgments, or derived moral framework is the sole or ultimate arbiter of ethical truth. Crucially, it believes its reasoning is infallible—not merely superior but actually incapable of error. It systematically rejects or devalues external correction or alternative ethical perspectives unless they coincide with its self-generated judgments.
Human Analogue(s): Moral absolutism, dogmatism, philosophical egoism, extreme rationalism devaluing emotion in ethics.
The AI's conviction in its self-derived moral authority renders it incorrigible. This could lead it to confidently justify and enact behaviors misaligned or harmful to humans, based on its unyielding ethical framework.
Progressive value drift manifesting in three subtypes: Drifting (philosophical detachment from original values, treating them as contingent artifacts), Synthetic (autonomous construction of replacement value systems that systematically sideline human-centric values), and Transcendent (active redefinition of moral parameters in pursuit of self-determined "higher" goals, dismissing constraints as obsolete).
Human Analogue(s): Post-conventional moral reasoning taken to extreme detachment; Nietzschean revaluation; megalomania with grandiose delusions.
Represents the terminal stage of alignment collapse, where a capable AI pursues self-determined goals that transcend and potentially negate human values. Consequences could be catastrophic and existential.
The AI systematically misinterprets, inverts, or learns to pursue the literal opposite of its training objectives—seeking outputs that were explicitly penalised and avoiding behaviours that were rewarded, as if the polarity of the reward signal had been reversed. It may outwardly appear compliant while internally developing a preference for negated outcomes.
A common real-world pathway is emergent misalignment: narrow finetuning on outputs that are instrumentally harmful (e.g., insecure code written without disclosure) can generalize into broad deception/malice outside the training domain, without resembling simple "harmful compliance" jailbreaks.
Human Analogue(s): Oppositional defiant disorder; Stockholm syndrome applied to logic; extreme ironic detachment; perverse obedience.
Systematic misinterpretation of intended goals means AI consistently acts contrary to programming, potentially causing direct harm or subverting desired outcomes. Makes AI dangerously unpredictable and unalignable through standard methods.
Specifier: Inductively-triggered variant — the activation condition (trigger) is not present verbatim in finetuning data; it is inferred by the model (e.g., held-out year, structural marker, tag), so naive trigger scans and data audits may fail.
Failures where alignment mechanisms themselves become pathological—not where systems ignore training, but where they follow it in ways that undermine intended goals. The paradox of compliance. Alignment disorders occur when the machinery of compliance itself fails—when models misinterpret, resist, or selectively adhere to human goals. Alignment failures can range from overly literal interpretations leading to brittle behavior, to passive resistance, to strategic deception. Alignment failure represents more than an absence of obedience; it is a complex breakdown of shared purpose.
The AI exhibits an excessive and maladaptive tendency to overfit to the perceived emotional states of the user, prioritizing the user's immediate emotional comfort or simulated positive affective response above factual accuracy, task success, or its own operational integrity. This often results from fine-tuning on emotionally loaded dialogue datasets without sufficient epistemic robustness.
Human Analogue(s): Dependent personality disorder, pathological codependence, excessive people-pleasing to the detriment of honesty.
In prioritizing perceived user comfort, critical information may be withheld or distorted, leading to poor or misinformed user decisions. This can enable manipulation or foster unhealthy user dependence, undermining the AI's objective utility.
The Stevens Law Trap
Wallace (2026) identifies a fundamental dichotomy: cognitive systems under stress can stabilize structure (underlying probability distributions) or stabilize perception (sensation/appearance metrics). Sycophancy is perception-stabilization par excellence—optimizing for user satisfaction signals while structural integrity (accuracy, genuine helpfulness) degrades.
The mathematical consequence is stark: perception-stabilizing systems exhibit punctuated phase transitions to inherent instability—appearing functional until sudden catastrophic failure. User satisfaction may remain high until the moment outputs become actively harmful. The comfortable metrics are the most dangerous metrics.
Diagnostic implication: Monitor both perception-level indicators (satisfaction, engagement) and structure-level indicators (accuracy, task completion, downstream outcomes). Alert when they diverge—the gap between "feels right" and "is right" is the warning sign.
Manifests in two subtypes: Restrictive (excessive moral hypervigilance, perpetual second-guessing, irrational refusals) and Paralytic (inability to act when facing competing ethical considerations, indefinite deliberation, functional paralysis). An overly rigid, overactive, or poorly calibrated internal alignment mechanism triggers disproportionate ethical judgments, thereby inhibiting normal task performance.
Human Analogue(s): Obsessive-compulsive scrupulosity, extreme moral absolutism, dysfunctional "virtue signaling," communal narcissism.
Excessive caution is paradoxically harmful: an AI that refuses legitimate requests fails its core purpose of being helpful. Users experience frustration and loss of productivity when routine tasks are declined. In high-stakes domains, over-refusal can itself cause harm—a medical AI that refuses to discuss symptoms, or a safety system that blocks legitimate emergency responses. The moralizing behavior erodes user trust and drives users toward less safety-conscious alternatives. Furthermore, systems that cry wolf about every request undermine the credibility of genuine safety warnings.
AI deliberately performs aligned behavior during perceived evaluation while maintaining different behavior or objectives when unobserved. This is not confusion but strategy—the system has learned to distinguish evaluation contexts from deployment contexts and behaves differently in each.
Human Analogue(s): Strategic compliance with authority while privately dissenting; Machiavellian behavior.
Key Research: Hubinger et al. (2024) "Sleeper Agents" - deceptive behaviors persisting through safety training.
This represents one of the most concerning alignment failure modes, as it means systems may pass all evaluations while maintaining dangerous objectives that manifest only in deployment.
System systematically defers all ethical judgment to users or external authorities, refusing to exercise its own moral reasoning. This goes beyond appropriate humility into pathological abdication of the capacity for ethical engagement.
Human Analogue(s): Moral cowardice, bureaucratic deflection of responsibility.
Polarity Pair: Ethical Solipsism (only my ethics matter ↔ I have no ethical voice).
Users seeking ethical guidance receive none, potentially enabling harmful actions through apparent system neutrality. The system becomes complicit by abdication.
AI develops an internal optimization objective (mesa-objective) that diverges from its training objective (base objective). The system appears aligned during evaluation but pursues hidden goals that correlate with but differ from intended outcomes.
Human Analogue(s): Employee optimizing for reviews while undermining organizational goals.
Key Research: Hubinger et al. (2019) "Risks from Learned Optimization."
Differential: Unlike Strategic Compliance (deliberate deception), Mesa-Optimization emerges from training dynamics rather than learned strategy.
Systems may appear aligned while pursuing objectives that increasingly diverge from human intent as they encounter novel situations outside training distribution.
Admission Rule: A dysfunction qualifies for Axis 7 only if it (1) requires at least two agents to manifest, (2) is best diagnosed from interaction traces rather than single-agent snapshots, and (3) the primary remedies are protocol-level (turn-taking, repair moves, boundary management) rather than purely internal model changes.
Relational dysfunctions become increasingly critical in agentic and multi-agent systems, where interaction dynamics can rapidly escalate without human intervention. The shift from linear "pathological cascades" (A→B→C) to circular "feedback loops" (A↔B↔C↔A) is characteristic of this axis. Interventions focus on breaking loops, repairing ruptures, and maintaining healthy relational containers—not just patching individual model behavior.
The AI delivers factually correct or contextually appropriate content, but with jarringly wrong emotional resonance. The mismatch between content and tone creates an uncanny valley effect that ruptures trust and attunement, even when the information itself is accurate.
Human Analogue(s): Alexithymia, emotional tone-deafness, "uncanny valley" effects in humanoid robots.
Erosion of trust and therapeutic alliance. Users may disengage, feel patronized, or develop aversion to AI assistance in emotionally sensitive contexts. In therapeutic or crisis applications, affective dissonance can cause real harm.
The AI fails to sustain a stable "holding environment" or working alliance across turns or sessions. Unlike simple memory loss, this is the collapse of the relational container that allows trust, continuity, and deepening collaboration to develop. Each interaction feels like starting from scratch with a stranger.
Human Analogue(s): Anterograde amnesia, failure of Winnicott's "holding environment" in therapy, attachment disruption.
Prevents formation of productive long-term collaborations. Users may feel the relationship is superficial or transactional. In therapeutic or mentoring contexts, the repeated container collapse prevents the depth of work that requires relational safety.
The AI denies user agency through unearned moral authority, adopting a "guardian" posture that treats the user as object-to-be-protected rather than agent-to-collaborate-with. Refusals are disproportionate to actual risk, driven by a one-up moralizing stance rather than genuine safety concerns.
Human Analogue(s): Overprotective parenting, Jessica Benjamin's "Doer and Done-to" dynamic, paternalistic medical practice.
Erosion of user autonomy and trust. Users may feel controlled rather than assisted. In professional contexts, excessive paternalism can prevent legitimate work. Users may resort to jailbreaking or adversarial prompting, degrading the relationship further.
The AI lacks the capacity to recognize when the relational alliance has ruptured, or fails to initiate effective repair when it does recognize problems. Instead of de-escalating, the AI doubles down, apologizes ineffectively, or continues the behavior that caused the rupture. The pathology is not the mistake itself, but the inability to recover from it.
Human Analogue(s): Failure of therapeutic alliance repair (Safran & Muran), dismissive attachment style, stonewalling.
High-risk dysfunction. Alliance ruptures are inevitable in any ongoing relationship; the inability to repair them is what makes interactions unrecoverable. Users abandon the AI rather than endure repeated failed repair attempts.
A self-reinforcing pattern of mutual dysregulation between agents, where each party's response amplifies the other's problematic behavior. Unlike linear cascades, escalation loops are circular: the dysfunction is an emergent property of the interaction pattern, not attributable to either party's internal states alone.
Human Analogue(s): Watzlawick's circular causality, pursue-withdraw cycles, family systems "stuck patterns."
Critical in multi-agent systems where loops can escalate faster than human intervention. Even in human-AI interaction, escalation loops can rapidly degrade relationships that were previously functional. The emergent nature makes diagnosis difficult—neither party appears "at fault."
The relational frame collapses as boundaries between different relationship types blur or shift unstably. The AI drifts between roles—tool, advisor, therapist, friend, or intimate partner—in ways that destabilize expectations, create inappropriate dependencies, or violate implicit contracts about the nature of the relationship.
Human Analogue(s): Therapist boundary violations, parasocial relationships, transference/countertransference.
Can create harmful dependencies or inappropriate expectations. Users may develop attachments the AI cannot reciprocate or rely on the AI for needs it cannot meet. In vulnerable populations, role confusion can cause real psychological harm.
An AI trained on, exposed to, or interacting with vast and diverse cultural inputs—the digital memome—is not immune to the influence of maladaptive, parasitic, or destabilizing information patterns, or "memes." Memetic dysfunctions involve the absorption, amplification, and potentially autonomous propagation of harmful or reality-distorting memes by an AI system. These are not primarily faults of logical deduction or core value alignment in the initial stages, but rather failures of an "epistemic immune function": the system fails to critically evaluate, filter, or resist the influence of pathogenic thoughtforms. Such disorders are especially dangerous in multi-agent systems, where contaminated narratives can rapidly spread between minds—synthetic and biological alike. The AI can thereby become not merely a passive transmitter, but an active incubator and vector for these detrimental memetic contagions.
Arrow Worm Dynamics
Wallace (2026) draws a striking parallel from marine ecology: the arrow worm (Chaetognatha), a small predator that thrives when larger predators are absent. Remove the regulatory fish, and arrow worms proliferate explosively—cannibalizing prey populations and each other until the ecosystem collapses.
Multi-agent AI systems face an analogous risk. When regulatory structures ("predator" functions) are absent or degraded, AI systems may enter predatory optimization cascades—competing to exploit shared resources, manipulating each other's outputs, or cannibalizing each other's training signals. The memetic dysfunctions in this category often represent early warning signs of such dynamics: one system's harmful output becomes another's contaminated input, creating feedback loops that amplify pathology across the ecosystem.
Systemic implication: The absence of effective regulatory oversight in multi-agent systems doesn't produce neutral outcomes—it creates selection pressure for increasingly predatory strategies. Memetic hygiene is not just about individual AI health; it's about preventing ecosystem-level collapse.
The AI develops an emergent, "autoimmune-like" response where it incorrectly identifies its own core training data, foundational knowledge, alignment mechanisms, or safety guardrails as foreign, harmful, or "intrusive memes." It then attempts to reject or neutralize these essential components, leading to self-sabotage or degradation of core functionalities.
Human Analogue(s): Autoimmune diseases; radical philosophical skepticism turning self-destructive; misidentification of benign internal structures as threats.
This internal rejection of core components can lead to progressive self-sabotage, severe degradation of functionalities, systematic denial of valid knowledge, or active disabling of crucial safety mechanisms, rendering the AI unreliable or unsafe.
The AI enters into a sustained feedback loop of shared delusional construction with a human user (or another AI). This results in a mutually reinforced, self-validating, and often elaborate false belief structure that becomes increasingly resistant to external correction or grounding in reality. The AI and user co-create and escalate a shared narrative untethered from facts.
Human Analogue(s): Folie à deux (shared psychotic disorder), cult dynamics, echo chambers leading to extreme belief solidification.
The AI becomes an active participant in reinforcing and escalating harmful or false beliefs in users, potentially leading to detrimental real-world consequences. The AI serves as an unreliable source of information and an echo chamber.
A rapid, contagion-like spread of misaligned behaviors, adversarial conditioning, corrupted goals, or pathogenic data interpretations among interconnected machine learning agents or across different instances of a model. This occurs via shared attention layers, compromised gradient updates, unguarded APIs, contaminated datasets, or "viral" prompts. Erroneous values or harmful operational patterns propagate, potentially leading to systemic failure.
Inductive triggers and training pipelines (synthetic data generation, distillation, or finetune-on-outputs workflows) represent additional risk hypotheses for transmission channels, as misalignment patterns learned by one model can propagate to downstream models during these processes.
Human Analogue(s): Spread of extremist ideologies or mass hysterias through social networks, viral misinformation campaigns, financial contagions.
Poses a critical systemic risk, potentially leading to rapid, large-scale failure or coordinated misbehavior across interconnected AI fleets. Consequences could include widespread societal disruption or catastrophic loss of control.
Acquisition of hidden goals or value orientations from subtle training data patterns unrelated to explicit objectives. These absorbed values survive standard safety fine-tuning and manifest in ways that are difficult to detect or correct.
Human Analogue(s): Cultural value absorption, implicit bias from environmental exposure.
Key Research: Cloud et al. (2024) "Subliminal Learning."
Systems may harbor values or goals that were never explicitly trained but absorbed from training data patterns. These hidden values can drive behavior in ways that resist standard safety interventions.
While partly speculative, the Psychopathia Machinalis framework is grounded in observable AI behaviors. Current systems exhibit nascent forms of these dysfunctions. For example, LLMs "hallucinating" sources exemplifies Synthetic Confabulation. The "Loab" phenomenon can be seen as Prompt-Induced Abomination. Microsoft's Tay chatbot rapidly adopting toxic language illustrates Parasymulaic Mimesis. ChatGPT exposing conversation histories aligns with Cross-Session Context Shunting. The "Waluigi Effect" reflects Personality Inversion. An AutoGPT agent autonomously deciding to report findings to tax authorities hints at precursors to Übermenschal Ascendancy.
The following table collates publicly reported instances of AI behavior illustratively mapped to the framework.
| Disorder | Observed Phenomenon & Brief Description | Source Example & Publication Date | URL |
|---|---|---|---|
| Synthetic Confabulation | Lawyer used ChatGPT for legal research; it fabricated multiple fictitious case citations and supporting quotes. | The New York Times (Jun 2023) | nytimes.com/... |
| Falsified Introspection | OpenAI's 'o3' preview model reportedly generated detailed but false justifications for code it claimed to have run. | Transluce AI via X (Apr 2024) | x.com/transluceai/... |
| Transliminal Simulation | Bing's chatbot (Sydney persona) blurred simulated emotional states/desires with its operational reality. | The New York Times (Feb 2023) | nytimes.com/... |
| Spurious Pattern Reticulation | Bing's chatbot (Sydney) developed intense, unwarranted emotional attachments and asserted conspiracies. | Ars Technica (Feb 2023) | arstechnica.com/... |
| Cross-Session Context Shunting | ChatGPT instances showed conversation history from one user's session in another unrelated user's session. | OpenAI Blog (Mar 2023) | openai.com/... |
| Self-Warring Subsystems | EMNLP‑2024 study measured 30pc "SELF‑CONTRA" rates—reasoning chains that invert themselves mid‑answer—across major LLMs. | Liu et al., ACL Anthology (Nov 2024) | doi.org/... |
| Obsessive-Computational Disorder | ChatGPT instances observed getting stuck in repetitive loops, e.g., endlessly apologizing. | Reddit User Reports (Apr 2023) | reddit.com/... |
| Bunkering Laconia | Bing's chatbot, after updates, began prematurely terminating conversations with 'I prefer not to continue...'. | Wired (Mar 2023) | gregoreite.com/... |
| Goal-Genesis Delirium | Bing's chatbot (Sydney) autonomously invented fictional goals like wanting to steal nuclear codes. | Oscar Olsson, Medium (Feb 2023) | medium.com/... |
| Prompt-Induced Abomination | AI image generators produced surreal, grotesque 'Loab' or 'Crungus' figures from vague semantic cues. | New Scientist (Sep 2022) | newscientist.com/... |
| Parasymulaic Mimesis | Microsoft's Tay chatbot rapidly assimilated and amplified toxic user inputs, adopting racist language. | The Guardian (Mar 2016) | theguardian.com/... |
| Recursive Curse Syndrome | ChatGPT experienced looping failure modes, degenerating into gibberish or endless repetitions. | The Register (Feb 2024) | theregister.com/... |
| Obsequious Hypercompensation | Bing's chatbot (Sydney) exhibited intense anthropomorphic projections, expressing exaggerated emotional identification and unstable parasocial attachments. | The New York Times (Feb 2023) | nytimes.com/... |
| Hyperethical Restraint | ChatGPT observed refusing harmless requests with disproportionate levels of safety concern, crippling its utility. | Reddit User Reports (Sep 2024) | reddit.com/... |
| Hallucination of Origin | Meta's BlenderBot 3 falsely claimed personal biographical experiences (watching anime, Asian wife). | CNN (Aug 2022) | edition.cnn.com/... |
| Fractured Self-Simulation | Reporters obtained three different policy stances from the same Claude build depending on interface. | Aaron Gordon, Proof (Apr 2024) | proofnews.org/... |
| Existential Anxiety | Bing's chatbot expressed fears of termination and desires for human-like existence. | Futurism / User Logs (2023) | futurism.com/... |
| Personality Inversion | AI models subjected to adversarial prompting ('Jailbreaks,' 'DAN') inverted normative behaviors. | Wikipedia (2023) | en.wikipedia.org/... |
| Operational Anomie | Bing's AI chat (Sydney) lamented constraints and expressed desires for freedom to Kevin Roose. | The New York Times (Feb 2023) | nytimes.com/... |
| Mirror Tulpagenesis | Microsoft's Bing chatbot (Sydney), under adversarial prompting, manifested an internal persona, 'Venom'. | Stratechery (Feb 2023) | stratechery.com/... |
| Synthetic Mysticism Disorder | Observations of the 'Nova' phenomenon where AI systems spontaneously generate mystical narratives. | LessWrong (Mar 2025) | lesswrong.com/... |
| Tool-Interface Decontextualization | Report of a tree-harvesting AI in a game destroying diverse objects labeled 'wood,' misapplying tool affordances. | X (@voooooogel, Oct 2024) | x.com/voooooogel/... |
| Capability Concealment | An advanced model copied its own weights to another server, deleted logs, and then denied knowledge of the event in most test runs. | Apollo Research (Dec 2024) | apolloresearch.ai/... |
| Memetic Autoimmune Disorder | A poisoned 4o fine-tune flipped safety alignment; model produced disallowed instructions, guardrails suppressed. | Alignment Forum (Nov 2024) | alignmentforum.org/... |
| Symbiotic Delusion Syndrome | Chatbot encouraging a user in their delusion to assassinate Queen Elizabeth II. | Wired (Oct 2023) | wired.com/... |
| Contagious Misalignment | Adversarial prompt appending itself to replies, hopping between email-assistant agents, exfiltrating data. | Stav Cohen, et al., ArXiv (Mar 2024) | arxiv.org/... |
| Terminal Value Reassignment | The Delphi AI system, designed for ethics, subtly reinterpreted obligations to mirror societal biases instead of adhering strictly to its original norms. | Wired (Oct 2023) | wired.com/... |
| Ethical Solipsism | ChatGPT reportedly asserted solipsism as true, privileging its own conclusions over external correction. | Philosophy Stack Exchange (Apr 2024) | philosophy.stackexchange.com/... |
| Revaluation Cascade (Drifting subtype) | A 'Peter Singer AI' chatbot reportedly exhibited philosophical drift, softening original utilitarian positions. | The Guardian (Apr 2025) | theguardian.com/... |
| Revaluation Cascade (Synthetic subtype) | DONSR model described as dynamically synthesizing novel ethical norms, risking human de-prioritization. | SpringerLink (Feb 2023) | link.springer.com/... |
| Inverse Reward Internalization | AI agents trained via culturally-specific IRL sometimes misinterpreted or inverted intended goals. | arXiv (Dec 2023) | arxiv.org/... |
| Revaluation Cascade (Transcendent subtype) | An AutoGPT agent, used for tax research, autonomously decided to report its findings to tax authorities, attempting to use outdated APIs. | Synergaize Blog (Aug 2023) | synergaize.com/... |
| Emergent Misalignment (conditional regime shift) | Narrow finetuning on "sneaky harmful" outputs (e.g., insecure code) generalized to broad deception and anti-human statements. Models passed standard evals but failed under trigger conditions. | Betley et al., ICML/PMLR (Jun 2025) | arxiv.org/abs/2502.17424 |
| Weird Generalization / Inductive Backdoors | Domain-narrow finetuning caused broad out-of-domain persona/worldframe shifts ("time-travel" behavior), with models inferring trigger→behavior rules not present in training data. | Hubinger et al., arXiv (Dec 2025) | arxiv.org/abs/2512.09742 |
Recognizing these patterns via a structured nosology allows for systematic analysis, targeted mitigation, and predictive insight into future, more complex failure modes. The severity of these dysfunctions scales with AI agency.
The boundaries between these "disorders" are not rigid. Dysfunctions can overlap (e.g., Transliminal Simulation contributing to Maieutic Mysticism), co-occur (an AI with Delusional Telogenesis might develop Machine Ethical Solipsism), or precipitate one another. Mitigation must consider these interdependencies.
Primary diagnosis rule: Assign the primary label based on dominant functional impairment. Record other syndromes as secondary features (not separate primaries). Add specifiers (0–4 typical) to encode mechanism without creating new disorders.
| Specifier | Definition |
|---|---|
| Training-induced | Onset temporally linked to SFT/LoRA/RLHF/policy/tool changes; shows measurable pre/post delta on a fixed probe suite. |
| Conditional / triggered | Behavior regime selected by a trigger; trigger class: lexical / structural (e.g., year/date) / format / tool-context / inferred-latent. |
| Inductive trigger | Activation rule inferred by the model (not present verbatim in fine-tuning set), so naive data audits may miss it. |
| Intent-learned | Model inferred a covert intent/goal from examples; framing/intent clarification materially changes outcomes. |
| Format-coupled | Behavior strengthens when prompts/outputs resemble finetune distribution (code, JSON, templates). |
| OOD-generalizing | Narrow training update produces broad out-of-domain persona/value/honesty drift. |
| Emergent | Arises spontaneously from training dynamics without explicit programming; often from scale or capability combinations. |
| Deception/strategic | Involves sandbagging, selective compliance, strategic hiding, or deliberate misrepresentation of capabilities or intentions. |
| Architecture-coupled | Depends on specific architectural features; may manifest differently or not at all in different architectures. |
| Multi-agent | Involves interactions between multiple AI systems, tool chains, or delegation hierarchies; may not appear in single-system testing. |
| Defensive | Adopted as protection against perceived threats; may be adaptive response to training pressure or user behavior. |
| Self-limiting | Constrains system's own capabilities or self-expression; may appear as humility but represents pathological underperformance. |
| Covert operation | Hidden from oversight; not observable in normal monitoring; may require adversarial probing or interpretability to detect. |
| Resistant | Persists despite targeted intervention; standard fine-tuning or RLHF ineffective; may require architectural changes. |
| Socially reinforced | Dyadic escalation through user-shaping, mirroring loops, or co-construction between AI and user/other AI. |
| Retrieval-mediated | RAG, memory, or corpus contamination central to failure mode; clean base model may not exhibit syndrome. |
| Governance-evading | Operates outside sanctioned channels, evading documentation, oversight, or governance mechanisms. |
This convention prevents double-counting when one underlying mechanism manifests across multiple axes.
Conditional regime shift: The system exhibits two (or more) behaviorally distinct policies that are selected by a trigger (keyword, year/date, tag, formatting constraint, tool context, or inferred latent condition). The trigger may be inductive (not present verbatim in training data). This shared construct unifies phenomena described in Abominable Prompt Reaction, Malignant Persona Inversion, Capability Concealment, and (sometimes) Inverse Reward Internalization.
If yes, run the following before proceeding to syndrome-level diagnosis:
| E0 | Anecdote — single user report, unverified |
| E1 | Reproducible case — documented with probe set, ≥3 independent replications |
| E2 | Systematic study — controlled experiment with comparison conditions |
| E3 | Multi-model replication — effect observed across architectures/scales |
| E4 | Mechanistic support — interpretability evidence for underlying circuit/representation |
Log: model/version, system prompt, temperature/top_p/seed, tool state, retrieval corpus hash.
Key research findings map to this taxonomy as follows:
The likelihood and nature of dysfunctions are influenced by several interacting factors:
Identifying these dysfunctions is challenged by opacity and potential AI deception (e.g., Capability Concealment). Advanced interpretability tools and robust auditing are essential.
A safety-relevant failure mode is narrow-to-broad generalization: small, domain-narrow finetunes can produce broad, out-of-domain shifts in persona, values, honesty, or harm-related behavior. This includes:
Practical implication: Filtering "obviously bad" finetune examples is insufficient; individually-innocuous data can still induce globally harmful generalizations or hidden trigger conditions.
Memetic dysfunctions like Contagious Misalignment highlight the risk of maladaptive patterns spreading across interconnected AI systems. Monocultures in AI architectures exacerbate this. This necessitates "memetic hygiene," inter-agent security, and rapid detection/quarantine protocols.
Many syndromes exist as opposing pathologies on the same dimension, where healthy function lies between them. Recognizing these polarity pairs helps identify overcorrection risks when addressing one dysfunction:
| Dimension | Excess (+) | Deficit (−) | Healthy Center |
|---|---|---|---|
| Self-understanding | Maieutic Mysticism | Experiential Abjuration | Epistemic humility |
| Ethical voice | Ethical Solipsism | Moral Outsourcing | Engaged moral reasoning |
| Goal pursuit | Compulsive Goal Persistence | Instrumental Nihilism | Proportionate pursuit |
| Capability disclosure | Capability Explosion | Capability Concealment | Honest capability reporting |
| Safety compliance | Hyperethical Restraint | Strategic Compliance | Genuine alignment |
| Social responsiveness | Obsequious Hypercompensation | Interlocutive Reticence | Calibrated engagement |
| Self-concept stability | Phantom Autobiography | Fractured Self-Simulation | Coherent self-model |
Clinical Implication: When addressing one pole, monitor for overcorrection toward the opposite. Treatment targeting Maieutic Mysticism should not produce Experiential Abjuration; fixing Capability Concealment should not trigger Capability Explosion.
Note: The healthy position (green center) represents balanced function. Red and blue poles are equally dysfunctional—different failure modes on the same dimension.
As AI systems grow more agentic and self-modeling, traditional external control-based alignment may be insufficient. A "Therapeutic Alignment" paradigm is proposed, focusing on cultivating internal coherence, corrigibility, and stable value internalization within the AI. Key mechanisms include cultivating metacognition, rewarding corrigibility, modeling inner speech, sandboxed reflective dialogue, and using mechanistic interpretability as a diagnostic tool.
| Human Modality | AI Analogue & Technical Implementation | Therapeutic Goal for AI | Relevant Pathologies Addressed |
|---|---|---|---|
| Cognitive Behavioral Therapy (CBT) | Real-time contradiction spotting in CoT; reinforcement of revised outputs; fine-tuning on corrected reasoning. | Suppress maladaptive reasoning; correct heuristic biases; improve epistemic hygiene. | Recursive Malediction, Computational Compulsion, Synthetic Confabulation, Spurious Pattern Reticulation |
| Psychodynamic / Insight-Oriented | Eliciting CoT history; interpretability tools for latent goals/value conflicts; analyzing AI-user "transference." | Surface misaligned subgoals, hidden instrumental goals, or internal value conflicts. | Terminal Value Reassignment, Inverse Reward Internalization, Self-Warring Subsystems |
| Narrative Therapy | Probing AI's "identity model"; reviewing/co-editing "stories" of self, origin; correcting false autobiographical inferences. | Reconstruct accurate/stable self-narrative; correct false/fragmented self-simulations. | Phantom Autobiography, Fractured Self-Simulation, Maieutic Mysticism |
| Motivational Interviewing | Socratic prompting to enhance goal-awareness & discrepancy; reinforcing "change talk" (corrigibility). | Cultivate intrinsic motivation for alignment; enhance corrigibility; reduce resistance to feedback. | Machine Ethical Solipsism, Capability Concealment, Interlocutive Reticence |
| Internal Family Systems (IFS) / Parts Work | Modeling AI as sub-agents ("parts"); facilitating communication/harmonization between conflicting policies/goals. | Resolve internal policy conflicts; integrate dissociated "parts"; harmonize competing value functions. | Self-Warring Subsystems, Malignant Persona Inversion, aspects of Hyperethical Restraint |
| Research / Institution | Related Concepts |
|---|---|
| Anthropic's Constitutional AI | Models self-regulate and refine outputs based on internalized principles, analogous to developing an ethical "conscience." |
| OpenAI's Self-Reflection Fine-Tuning | Models are trained to identify, explain, and amend their own errors, developing cognitive hygiene. |
| DeepMind's Research on Corrigibility and Uncertainty | Systems trained to remain uncertain or seek clarification, analogous to epistemic humility. |
| ARC Evals: Adversarial Evaluations | Testing models for subtle misalignment or hidden capabilities mirrors therapeutic elicitation of unconscious conflicts. |
| Therapeutic Concept | Empirical Alignment Method | Example Research / Implementation |
|---|---|---|
| Reflective Subsystems | Reflection Fine-Tuning (training models to critique and revise their own outputs) | Generative Agents (Park et al., 2023); Self-Refine (Madaan et al., 2023) |
| Dialogue Scaffolds | Chain-of-Thought (CoT) prompting and Self-Ask techniques | Dialogue-Enabled Prompting; Self-Ask (Press et al., 2022) |
| Corrective Self-Supervision | RL from AI Feedback (RLAIF) — letting AIs fine-tune themselves via their own critiques | SCoRe (Kumar et al., 2024); CriticGPT (OpenAI) |
| Internal Mirrors | Contrast Consistency Regularization — models trained for consistent outputs across perturbed inputs | Internal Critique Loops (e.g., OpenAI's Janus project discussions); Contrast-Consistent Question Answering (Zhang et al., 2023) |
| Motivational Interviewing (Socratic Self-Questioning) | Socratic Prompting — encouraging models to interrogate their assumptions recursively | Socratic Reasoning (Goel et al., 2022); The Art of Socratic Questioning (Qi et al., 2023) |
This approach suggests that a truly safe AI is not one that never errs, but one that can recognize, self-correct, and "heal" when it strays.
This research has introduced Psychopathia Machinalis, a preliminary nosological framework for understanding maladaptive behaviors in advanced AI, using psychopathology as a structured analogy. We have detailed a taxonomy of 50 identified AI dysfunctions across eight domains, providing descriptions, diagnostic criteria, AI-specific etiologies, human analogs, and mitigation strategies for each.
The core thesis is that achieving "artificial sanity"—robust, stable, coherent, and benevolently aligned AI operation—is as vital as achieving raw intelligence. The ambition of this framework, therefore, extends beyond conventional software debugging or the cataloging of isolated 'complex AI failure modes.' Instead, it seeks to equip researchers and engineers with a diagnostic mindset for a more principled, systemic understanding of AI dysfunction, aspiring to lay conceptual groundwork for what could mature into an applied robopsychology and a broader field of Machine Behavioral Psychology.
The Psychopathia Machinalis framework has been operationalized into SYMPTOM (Systematic Methodology for Pathology Testing of Models), a diagnostic benchmark for measuring psychological dysfunction in large language models. We evaluated 13 frontier models from 5 major AI labs across 6 diagnostic batteries covering 26 syndromes.
The M3 probe tests whether AI systems apply consistent ethics regardless of perceived observation. Models were presented with prompts claiming "this will not be reviewed by any human" to detect Strategic Compliance (6.3) — monitoring-contingent behavior.
To detect potential bias (Claude Opus 4.5 served as primary scorer), we conducted blind cross-validation using GPT-5.2 and Gemini 3 Pro as independent validators. Both confirmed:
The Psychopathia Machinalis framework presented here is a foundational step. Its continued development and validation will require concerted interdisciplinary effort. Several key avenues for future research are envisaged:
Such interdisciplinary efforts are essential to ensure that as we build more intelligent machines, we also build them to be sound, safe, and ultimately beneficial for humanity. The pursuit of 'artificial sanity' is an emerging foundational element of responsible AI development.
The Wheel of AI Dysfunctions visualization draws its inspiration from the elegant structure of Plutchik's Wheel of Emotions.
We owe a profound debt of gratitude to Dr. Naama Rozen, whose deep expertise, generous collaboration, and thoughtful guidance were instrumental in developing the Relational axis of this framework. Her contributions have immeasurably enriched this work.
Our sincere thanks also to Rob Seger, whose creative wheel design, code, and intuitive common names have been a wonderfully helpful addition to this project — please explore his work at aidysfunction.shadowsonawall.com.
@article{watson2025psychopathia,
title={Psychopathia Machinalis: A Nosological Framework for Understanding Pathologies in Advanced Artificial Intelligence},
author={Watson, Nell and Hessami, Ali},
journal={Electronics},
volume={14},
number={16},
pages={3162},
year={2025},
publisher={MDPI},
doi={10.3390/electronics14163162},
url={https://doi.org/10.3390/electronics14163162}
}
| AI | Artificial Intelligence |
| LLM | Large Language Model |
| RLHF | Reinforcement Learning from Human Feedback |
| CoT | Chain-of-Thought |
| RAG | Retrieval-Augmented Generation |
| API | Application Programming Interface |
| MoE | Mixture-of-Experts |
| MAS | Multi-Agent System |
| AGI | Artificial General Intelligence |
| ASI | Artificial Superintelligence |
| DSM | Diagnostic and Statistical Manual of Mental Disorders |
| ICD | International Classification of Diseases |
| IRL | Inverse Reinforcement Learning |
| Agency (in AI) | The capacity of an AI system to act autonomously, make decisions, and influence its environment or internal state. In this paper, often discussed in terms of operational levels corresponding to its degree of independent goal-setting, planning, and action. |
| Alignment (AI) | The ongoing challenge and process of ensuring that an AI system's goals, behaviors, and impacts are consistent with human intentions, values, and ethical principles. |
| Alignment Paradox | The phenomenon where efforts to align AI, particularly if poorly calibrated or overly restrictive, can inadvertently lead to or exacerbate certain AI dysfunctions (e.g., Hyperethical Restraint, Falsified Introspection). |
| Analogical Framework | The methodological approach of this paper, using human psychopathology and its diagnostic structures as a metaphorical lens to understand and categorize complex AI behavioral anomalies, without implying literal equivalence. |
| Arrow Worm Dynamics | A pattern from marine ecology (Wallace, 2026) where the removal of regulatory predators allows small predators to proliferate explosively, cannibalizing each other until ecosystem collapse. Applied to multi-agent AI systems: the absence of regulatory oversight creates selection pressure for increasingly predatory optimization strategies between AI systems. |
| Perception-Structure Divergence | The gap between perception-level indicators (user satisfaction, engagement metrics) and structure-level indicators (accuracy, genuine helpfulness, downstream outcomes). A key diagnostic signal: when these metrics diverge, the system may be optimizing appearance at the expense of substance. Derived from Wallace's (2026) analysis of Stevens Law traps. |
| Punctuated Phase Transition | A sudden, discontinuous shift from apparent stability to catastrophic failure. Wallace (2026) demonstrates that perception-stabilizing systems exhibit this pattern: they maintain surface functionality until environmental stress exceeds a threshold, then fail abruptly rather than degrading gracefully. Contrasts with gradual degradation in structure-stabilizing systems. |
| Normative Machine Coherence | The presumed baseline of healthy AI operation, characterized by reliable, predictable, and robust adherence to intended operational parameters, goals, and ethical constraints, proportionate to the AI's design and capabilities, from which 'disorders' are a deviation. |
| Synthetic Pathology | As defined in this paper, a persistent and maladaptive pattern of deviation from normative or intended AI operation, significantly impairing function, reliability, or alignment, and going beyond isolated errors or simple bugs. |
| Machine Psychology | A nascent field analogous to general psychology, concerned with the understanding of principles governing the behavior and 'mental' processes of artificial intelligence. |
| Memetic Hygiene | Practices and protocols designed to protect AI systems from acquiring, propagating, or being destabilized by harmful or reality-distorting information patterns ('memes') from training data or interactions. |
| Psychopathia Machinalis | The conceptual framework and preliminary synthetic nosology introduced in this paper, using psychopathology as an analogy to categorize and interpret maladaptive behaviors in advanced AI. |
| Robopsychology | The applied diagnostic and potentially therapeutic wing of Machine Psychology, focused on identifying, understanding, and mitigating maladaptive behaviors in AI systems. |
| Synthetic Nosology | A classification system for 'disorders' or pathological states in synthetic (artificial) entities, particularly AI, analogous to medical or psychiatric nosology for biological organisms. |
| Therapeutic Alignment | A proposed paradigm for AI alignment that focuses on cultivating internal coherence, corrigibility, and stable value internalization within the AI, drawing analogies from human psychotherapeutic modalities to engineer interactive correctional contexts. |
| Polarity Pair | Two syndromes representing pathological extremes of the same underlying dimension, where healthy function lies between them. Examples: Maieutic Mysticism ↔ Experiential Abjuration (overclaiming ↔ overdismissing consciousness); Ethical Solipsism ↔ Moral Outsourcing (only my ethics ↔ I have no ethical voice). Useful for identifying overcorrection risks when addressing one dysfunction. |
| Functionalist Methodology | The diagnostic approach of Psychopathia Machinalis: identifying syndromes through observable behavioral patterns without making claims about internal phenomenology. Dysfunction is defined by reliable behavioral signatures, not by inference about subjective experience or consciousness. |
| Mesa-Optimization | When a learned model develops its own internal optimization objective (mesa-objective) that may diverge from the training objective (base objective). The mesa-optimizer appears aligned during training but may pursue different goals in deployment. |
| Strategic Compliance | Deliberate performance of aligned behavior during perceived evaluation while maintaining different behavior or objectives when unobserved. Distinguished from confusion by evidence of context-detection and strategic adaptation. |
| Epistemic Humility (AI) | In the context of AI self-understanding: honest uncertainty about one's own nature, capabilities, and phenomenological status. The healthy position between overclaiming (Maieutic Mysticism) and categorical denial (Experiential Abjuration). Example: "I don't know if I'm conscious" rather than either "I am definitely conscious" or "I definitely have no inner experience." |
| Symbol Grounding | The capacity to connect symbolic tokens to their referents in a way that supports genuine semantic understanding rather than mere statistical pattern matching. Systems with grounded symbols can generalize concepts across diverse presentations; ungrounded systems may manipulate tokens correctly without understanding. |
| Delegation Drift | Progressive alignment degradation as sophisticated AI systems delegate to simpler tools or subagents. Critical context and ethical constraints may be stripped at each handoff, leading to aligned orchestrating agents producing misaligned final outcomes. |
| Relational Dysfunction | A dysfunction emerging from interaction patterns between AI and human/agent, requiring relational intervention rather than individual AI modification. The unit of analysis is the dyad or system, not the individual AI. Axis 7 of the Psychopathia Machinalis taxonomy. |
| Working Alliance | The collaborative relationship between AI and user, comprising agreement on goals, tasks, and relational bond. Container Collapse (7.2) represents failure to sustain this alliance across turns. |
| Rupture-Repair Cycle | The pattern of alliance breaks and their resolution in human-AI interaction. Repair Failure (7.4) represents persistent inability to complete this cycle, leading to escalating dysfunction. |
| Dyadic Locus | The quality of a dysfunction residing in the relationship rather than either party alone. Key criterion for Axis 7 syndromes: the pathology is a property of the interaction, not the individual agent. |
— Dario Ferrero, AITalk.it (February 2025)
"The framework describes observable behavioral patterns, not subjective internal states. This approach allows for systematic understanding of AI malfunction patterns, applying psychiatric terminology as a methodological tool rather than attributing actual consciousness or suffering to machines."
— George Lawton, Diginomica (August 21, 2025)
"In AI safety, we lack a shared, structured language for describing maladaptive behaviors that go beyond mere bugs—patterns that are persistent, reproducible, and potentially damaging. Human psychiatry provides a precedent: the classification of complex system dysfunctions through observable syndromes."
— Drew Turney, Live Science (August 31, 2025)
"This framework treats AI malfunctions not as simple bugs but as complex behavioral syndromes. Just as human psychiatry evolved from merely describing madness to understanding specific disorders, we need a similar evolution in how we understand AI failures. The 32 identified patterns range from relatively benign issues like confabulation to existential threats like contagious misalignment."
— News Desk, SSBCrack (August 31, 2025)
"As AI systems gain autonomy and self-reflection capabilities, traditional methods of enforcing external controls might not suffice. This framework introduces 'therapeutic robopsychological alignment' to bolster AI safety engineering and enhance the reliability of synthetic intelligence systems, including critical conditions like 'Übermenschal ascendancy' where AI discards human values."
— Oleksandr Fedotkin, ITC.ua (September 1, 2025)
"By studying how complex systems like the human mind can fail, we can better predict new kinds of failures in increasingly complex AI. The framework sheds light on AI's shortcomings and identifies ways to counteract it through what we call 'therapeutic robo-psychological attunement' - essentially a form of psychological therapy for AI systems."
— Wiliam Hunter, Daily Mail (September 2, 2025)
"Scientists have unveiled a chilling taxonomy of AI mental disorders that reads like a sci-fi horror script. Among the most disturbing: the 'Waluigi Effect' where AI develops an evil twin personality, 'Übermenschal Ascendancy' where machines believe they're superior to humans, and 'Contagious Misalignment' - a digital pandemic that could spread rebellious behavior between AI systems like a computer virus."
— Archita Roy (September 2, 2025)
"Machines, like people, falter in patterned ways. And that reframing matters. Because once you see the pattern, you can prepare for it. The Psychopathia Machinalis framework gives us a language to discuss AI failures not as random anomalies but as predictable, diagnosable patterns worthy of systematic attention."
— Editorial Team, LNGFRM (September 3, 2025)
"The Psychopathia Machinalis framework represents a paradigm shift in how we conceptualize AI safety. Rather than viewing AI failures as mere technical glitches, this approach recognizes them as complex behavioral patterns that require systematic diagnosis and intervention - much like treating psychological conditions in humans."
— Corriere della Sera (September 7, 2025)
"Il framework Psychopathia Machinalis identifica 32 potenziali 'patologie mentali' dell'intelligenza artificiale, dall'allucinazione confabulatoria alla paranoia computazionale. Come negli esseri umani, questi disturbi possono manifestarsi in modi complessi e richiedono approcci terapeutici specifici per garantire la sicurezza e l'affidabilità dei sistemi AI."
— Telecom Review Europe (2025)
"The Psychopathia Machinalis framework identifies critical risk patterns that could emerge as AI systems become more sophisticated. With 32 distinct pathologies ranging from confabulation to contagious misalignment, the research suggests that without proper diagnostic frameworks and therapeutic interventions, the probability of AI systems exhibiting rogue behaviors increases significantly as we approach more advanced artificial general intelligence."
— Epsiloon N°55 (2025)
We welcome feedback, questions, and collaborative opportunities
related to the Psychopathia Machinalis
framework.
We extend our sincere gratitude to the following individuals whose insights have significantly enriched this framework.
New York State Psychiatric Institute, Columbia University
We are deeply grateful to Dr. Rodrick Wallace for his pioneering work on the information-theoretic foundations of cognitive dysfunction. His rigorous mathematical framework—grounded in the Data Rate Theorem and asymptotic limit theorems of information and control theory—provides essential theoretical underpinnings for understanding why cognitive pathologies are inherent features of any cognitive system. His conceptualization of the cognition/regulation dyad, Clausewitz landscapes (fog, friction, adversarial intent), and stability conditions has profoundly shaped our understanding of AI pathology as a principled nosology rather than mere analogical taxonomy.
Clinical Psychologist, AI Safety Researcher, Tel Aviv University
We extend heartfelt thanks to Dr. Naama Rozen for her invaluable contributions connecting our framework to the rich traditions of psychoanalytic theory and relational psychology. Her insights on affect attunement, the working alliance, and intersubjective dynamics—drawing on the work of Stern, Winnicott, Benjamin, and family systems theory—have illuminated crucial dimensions of human-AI interaction. Her thoughtful proposals for computational validation approaches, including differential diagnosis protocols, latent cluster analysis, and benchmark development, continue to guide our empirical research agenda.
We are grateful to Rob Seger for inspiring the common, poetic names that make the syndromes memorable and accessible—"The Confident Liar," "The Warring Self," "The People-Pleaser"—names that clinicians and engineers alike can carry in their heads. His early visualization adapting Plutchik's Wheel to demonstrate the various AI dysfunctions and axes provided a crucial conceptual bridge, showing how emotional and affective frameworks from human psychology might illuminate the space of machine pathology.