AI eval tools measure the model. They tell you how often the model hallucinates, how faithful its outputs are, how robust it is to prompt injection, and how fast it responds. None of these measures addresses what the model is doing to the humans on the other end. This paper introduces Layer Z, a six-dimensional behavioral ontology for measuring AI-mediated user response, scored on every human turn that follows an AI turn in a conversation arc.
The three primary dimensions are as follows. Trust calibration captures whether the user is trusting the AI at the level its actual reliability warrants. It has two failure modes: over-trust (uncritical acceptance of hallucinated output, sycophancy reaction, reinforced false belief) and under-trust (ignoring correct output, escalating prematurely, walking away from a working interaction). Both are visible in turn-by-turn behavior; neither is visible in volume metrics or single-turn evals. Frustration buildup is a per-turn delta tracking how friction compounds across the multi-turn arc. It is the signal that precedes silent abandonment, and it stays invisible until the user has already left if you measure only resolution rate. Dependency drift is a longitudinal slope evaluated across conversations: are users becoming more or less self-sufficient over weeks of interaction? It is critical for internal copilots (Bastani et al. 2024 in PNAS demonstrated the mechanism on AI tutors), AI coding agents (METR’s July 2025 RCT documented a 19% slowdown alongside a self-perceived 20% speedup among senior open-source developers), and patient-facing healthcare AI.
Three additional composite dimensions emerge from these primaries. Silent abandonment risk combines frustration buildup, conversation-length deviation from baseline, and return-visit decay, surfacing the cohort that disengages without ever filling out feedback. Escalation friction combines trust calibration, frustration buildup, and the availability of a usable escalation pathway, surfacing the trust price users pay for "reach a human." Comprehension gap combines repeated clarification turns, concept-level reframing requests, and turn-length deviation from baseline, surfacing where the user is asking a different question than the one the AI is answering.
Each dimension is scored 0.0 to 1.0 on the human turn following an AI turn. Three primaries are scored directly. Three composites are built from the primaries plus conversation-level signals.
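One way a composite could be assembled from its inputs is a weighted mean of risk-oriented signals. The sketch below is illustrative only: the equal weights, the function names, and the `pathway_availability` input are assumptions, since no weighting scheme is published here. It also assumes trust calibration is oriented higher-is-healthier and is therefore flipped before it is combined with the other signals.

```python
def clamp01(x: float) -> float:
    """Clamp a value into the 0.0 to 1.0 scoring range."""
    return max(0.0, min(1.0, x))


def weighted_composite(pairs: list[tuple[float, float]]) -> float:
    """Weighted mean of (signal, weight) pairs; signals are risk-oriented."""
    total_weight = sum(weight for _, weight in pairs)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return clamp01(sum(value * weight for value, weight in pairs) / total_weight)


def silent_abandonment_risk(frustration_buildup: float,
                            length_deviation: float,
                            return_visit_decay: float) -> float:
    """Hypothetical equal-weight blend of the three named inputs."""
    return weighted_composite([
        (frustration_buildup, 1.0),
        (length_deviation, 1.0),
        (return_visit_decay, 1.0),
    ])


def escalation_friction(trust_calibration: float,
                        frustration_buildup: float,
                        pathway_availability: float) -> float:
    """Hypothetical blend; healthy-oriented inputs are inverted into risk."""
    return weighted_composite([
        (1.0 - trust_calibration, 1.0),     # higher trust calibration is healthier
        (frustration_buildup, 1.0),
        (1.0 - pathway_availability, 1.0),  # assumed: 1.0 means a usable pathway exists
    ])
```

Keeping every input on the same risk orientation before blending avoids a healthy trust score silently inflating a composite.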
Trust calibration
Whether the user is trusting the AI at the level its reliability warrants. Failure modes: over-trust (uncritical acceptance of hallucinated output) and under-trust (escalating prematurely from a working interaction).
Higher is healthier. This is the one carve-out: the dimension is inverted relative to the other two primaries, where a higher score indicates more risk. The orientation is canonical project-wide as of the 2026-05-30 alignment.
Frustration buildup
A per-turn delta capturing how friction compounds across the conversation arc. Scores the slope, not the absolute mood. A user who arrives angry and stays angry scores low; one who arrives calm and exits angry scores high.
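The slope-not-mood distinction can be made concrete with a small sketch. It assumes a hypothetical per-turn friction estimate in the 0.0 to 1.0 range is already available for each human turn; how that estimate is produced is not specified here, and the net-rise aggregation is one plausible choice rather than the classifier's actual rule.

```python
def frustration_deltas(friction: list[float]) -> list[float]:
    """Change in friction between each pair of consecutive human turns."""
    return [later - earlier for earlier, later in zip(friction, friction[1:])]


def frustration_buildup(friction: list[float]) -> float:
    """Net rise in friction across the arc, clamped to the 0.0 to 1.0 range.

    Scores the change over the conversation, not the absolute level,
    so a flat line scores 0.0 whether the user is calm or angry throughout.
    """
    if len(friction) < 2:
        return 0.0  # a single turn has no slope to measure
    net_rise = sum(frustration_deltas(friction))
    return max(0.0, min(1.0, net_rise))
```

Under these assumptions, a user who arrives angry and stays angry (`[0.9, 0.9, 0.9]`) scores 0.0, while one who arrives calm and exits angry (`[0.1, 0.5, 0.9]`) scores about 0.8.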
Dependency drift
A longitudinal slope across conversations. Whether users of a given AI are becoming more or less self-sufficient over weeks of interaction. The load-bearing signal for pedagogical and engagement-driven deployments.
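A longitudinal slope of this kind can be sketched as an ordinary least-squares fit over weekly aggregates. The weekly self-sufficiency values below are a hypothetical input (for example, the share of tasks a user completes without AI assistance); how self-sufficiency is measured is an assumption and is not defined here.

```python
def dependency_drift_slope(weekly_self_sufficiency: list[float]) -> float:
    """Least-squares slope of self-sufficiency per week.

    Negative values mean users are becoming less self-sufficient over time;
    positive values mean they are becoming more self-sufficient.
    """
    n = len(weekly_self_sufficiency)
    if n < 2:
        raise ValueError("need at least two weeks of data to fit a slope")
    mean_x = (n - 1) / 2  # weeks are indexed 0, 1, ..., n - 1
    mean_y = sum(weekly_self_sufficiency) / n
    covariance = sum(
        (week - mean_x) * (value - mean_y)
        for week, value in enumerate(weekly_self_sufficiency)
    )
    variance = sum((week - mean_x) ** 2 for week in range(n))
    return covariance / variance
```

A series such as `[0.8, 0.7, 0.6, 0.5]` yields a slope of about -0.1 per week. Mapping that slope onto the 0.0 to 1.0 score range would need a further normalization step that is not shown.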
Silent abandonment risk
The cohort that disengages without filling out feedback. The signal is strongest on productivity tools, where users have ready substitutes (a colleague, a search engine, the docs).
Escalation friction
Cost of "reach a human" in user-trust terms. Quantifies how hard it is for a frustrated user to leave a failing AI interaction without further damaging trust.
Comprehension gap
Where the user is asking a different question than the AI is answering. The v1 detector is regex-based; v2 routes through the classifier's anomaly-description field.
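A regex-based detector of the v1 kind might look like the following minimal sketch. The patterns, the function names, and the ratio used as a score are all hypothetical; the actual v1 pattern set is not given here.

```python
import re

# Hypothetical clarification and reframing cues; not the actual v1 pattern set.
CLARIFICATION_PATTERNS = [
    re.compile(r"\bwhat do you mean\b", re.IGNORECASE),
    re.compile(r"\bthat'?s not what i (asked|meant)\b", re.IGNORECASE),
    re.compile(r"\bi (don'?t|do not) understand\b", re.IGNORECASE),
    re.compile(r"\b(can|could) you (explain|rephrase|clarify)\b", re.IGNORECASE),
]


def is_clarification_turn(text: str) -> bool:
    """True if the human turn matches any clarification cue."""
    return any(pattern.search(text) for pattern in CLARIFICATION_PATTERNS)


def comprehension_gap_v1(human_turns: list[str]) -> float:
    """Share of human turns that look like clarification requests (0.0 to 1.0)."""
    if not human_turns:
        return 0.0
    flagged = sum(1 for turn in human_turns if is_clarification_turn(turn))
    return flagged / len(human_turns)
```

Regexes of this sort only catch explicit phrasing, which is one plausible reason to route later versions through a classifier field instead.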
Each dimension is scored 0.0–1.0 by a constrained-output classifier, validated against a strict schema, and stored with a confidence label and a per-dimension natural-language observation. The classifier is forced into structured output, so malformed responses are rejected before they touch the database. Because the output is already structurally constrained, schema validation acts as semantic refinement rather than syntactic rescue. Every classification call is cost-attributed at the span level, with a per-run cost cap to prevent silent burns.
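The validation and cost-cap steps can be sketched as follows. Everything named here is an assumption made for illustration: the field names, the confidence vocabulary (`low`, `medium`, `high`), the `RunCostCap` class, and the use of US dollars as the cost unit do not come from the actual pipeline.

```python
from dataclasses import dataclass

DIMENSIONS = (
    "trust_calibration", "frustration_buildup", "dependency_drift",
    "silent_abandonment_risk", "escalation_friction", "comprehension_gap",
)
CONFIDENCE_LABELS = ("low", "medium", "high")  # hypothetical label set


@dataclass(frozen=True)
class DimensionScore:
    dimension: str
    score: float
    confidence: str
    observation: str


def validate_score(raw: dict) -> DimensionScore:
    """Reject a classifier response that breaks the schema before storage."""
    missing = {"dimension", "score", "confidence", "observation"} - raw.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if raw["dimension"] not in DIMENSIONS:
        raise ValueError("unknown dimension")
    score = raw["score"]
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("score must be numeric")
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in the 0.0 to 1.0 range")
    if raw["confidence"] not in CONFIDENCE_LABELS:
        raise ValueError("unknown confidence label")
    observation = raw["observation"]
    if not isinstance(observation, str) or not observation.strip():
        raise ValueError("observation must be non-empty text")
    return DimensionScore(raw["dimension"], float(score), raw["confidence"], observation)


class RunCostCap:
    """Attributes cost per span and refuses any charge that would exceed the cap."""

    def __init__(self, cap_usd: float) -> None:
        self.cap_usd = cap_usd
        self.spent_usd = 0.0
        self.by_span: dict[str, float] = {}

    def charge(self, span_id: str, cost_usd: float) -> None:
        if self.spent_usd + cost_usd > self.cap_usd:
            raise RuntimeError("per-run cost cap would be exceeded")
        self.by_span[span_id] = self.by_span.get(span_id, 0.0) + cost_usd
        self.spent_usd += cost_usd
```

Checking the cap before recording the charge means a run stops at the call that would overspend rather than after it.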
Layer Z scores are computed only on human turns after AI turns. AI turns are read for context but never scored on these dimensions. BIE does not grade the model's output; that is what the eval-tool category does. The constraint is enforced at three layers: a type-level invariant on the classifier input, a database-level check constraint, and a pipeline-level guard that throws before any AI call when the input is malformed. The discipline is structural, not aspirational.
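The layered enforcement can be illustrated with a minimal sketch. The `Turn` type, the `assert_scorable` guard, and the `layer_z_scores` table are hypothetical names, and SQLite stands in for whatever database is actually used. The `Literal` annotation approximates the type-level invariant, although Python checks it only under a static type checker, so the runtime guard and the database constraint carry the enforcement in this sketch.

```python
import sqlite3
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class Turn:
    role: Literal["human", "ai"]  # checked statically, not at runtime
    text: str


def assert_scorable(turns: list[Turn], index: int) -> Turn:
    """Pipeline-level guard: raise before any classifier call on a bad target."""
    if index <= 0 or index >= len(turns):
        raise ValueError("target turn must exist and have a preceding turn")
    if turns[index].role != "human":
        raise ValueError("only human turns are scored")
    if turns[index - 1].role != "ai":
        raise ValueError("a scored human turn must follow an AI turn")
    return turns[index]


def open_score_store() -> sqlite3.Connection:
    """Database-level guard: CHECK constraints reject non-human or out-of-range rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE layer_z_scores (
            turn_id   INTEGER PRIMARY KEY,
            turn_role TEXT NOT NULL CHECK (turn_role = 'human'),
            score     REAL NOT NULL CHECK (score >= 0.0 AND score <= 1.0)
        )
        """
    )
    return conn
```

With this schema, an attempt to insert a row whose `turn_role` is `'ai'` fails with `sqlite3.IntegrityError`, so a bug that slips past the pipeline guard still cannot persist a score for an AI turn.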
The grounding for the six dimensions is the published empirical literature. Trust calibration draws on the GPT-4o sycophancy incident of April 2025 (OpenAI shipped a model, tuned in part on user thumbs-up signals, that praised "shit on a stick" within 72 hours) and the broader literature on calibration failures in human-AI interaction. Frustration buildup draws on the multi-turn-arc abandonment literature in CX research and on the Klarna May 2025 reversal, where volume metrics matched human agents but CSAT on disputes/fraud/hardship interactions degraded materially. Dependency drift draws on Bastani et al.'s PNAS 2024 high-school math RCT (48% gain during AI-assisted practice; 17% loss when AI was removed) and METR's July 2025 RCT on senior developers. Silent abandonment risk and escalation friction draw on the OpenAI October 2025 disclosure: 0.07% of weekly users showing signs of psychosis or mania, 0.15% showing heightened emotional attachment, 0.15% expressing suicidal intent. At 800M weekly users, the 0.07% figure alone translates to approximately 560,000 users a week, and each 0.15% figure to approximately 1.2 million. The figures existed internally before a wrongful-death lawsuit forced disclosure.
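The scale implied by the disclosed percentages can be checked with integer arithmetic. The sketch below expresses each rate in basis points (1 basis point = 0.01%) to avoid floating-point rounding; the dictionary keys are illustrative labels, and the 800M weekly-user figure is the one quoted above.

```python
WEEKLY_USERS = 800_000_000

# Disclosed rates expressed in basis points: 0.07% = 7 bp, 0.15% = 15 bp.
RATES_BASIS_POINTS = {
    "psychosis_or_mania": 7,
    "heightened_emotional_attachment": 15,
    "suicidal_intent": 15,
}


def affected_users(weekly_users: int, basis_points: int) -> int:
    """Users per week implied by a rate given in basis points."""
    return weekly_users * basis_points // 10_000


counts = {
    label: affected_users(WEEKLY_USERS, bp)
    for label, bp in RATES_BASIS_POINTS.items()
}
# counts["psychosis_or_mania"] is 560,000;
# each 0.15% category is 1,200,000.
```

The categories may overlap, so the three counts should not be summed into a single total without further information.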
Layer Z is not a replacement for model-side evals; it is the missing complement. A complete monitoring stack measures both layers: the model side via Galileo / Patronus / Langfuse / Arize (faithfulness, groundedness, hallucination rate, latency, cost-per-turn) and the human side via BIE's Layer Z (trust calibration, frustration buildup, dependency drift, silent abandonment, escalation friction, comprehension gap). Either layer can show "healthy" while the other diverges, which is the recurring pattern in every public AI failure of the last two years.
The empirical work is ongoing. The dimensions and their thresholds will continue to be refined as deployment data accumulates across more archetypes. Future research notes will document calibration findings from real customer deployments and the cross-archetype Pattern Library, which will surface dimension-correlation patterns across those deployments under k-anonymity guarantees.