Dependency, trust, and frustration: three patterns across nine AI archetypes.

If you build an AI tutor, you probably hope users get better at the thing they came to learn. If you build an internal copilot, you probably hope they keep using it. One design goal makes users more independent over time. The other makes them more dependent. We didn’t expect that split to land in a synthetic dataset as cleanly as it did. Across 4,500 conversations spanning nine AI deployment archetypes, scored on the Layer Z behavioral dimensions, median dependency drift on AI tutors came in at 0.30. On internal copilots, AI companions, sales agents, and CX chatbots, it came in at 0.50. The 0.20-point spread is the largest cross-archetype delta we measured on any Layer Z dimension, and it tracks design intent closely enough that dependency drift looks less like an emergent property of language models and more like a property of what each deployment was built to do.

A note on the dataset before the findings. It is synthetic. We generated 4,500 conversations from behavioral specs (intent, target health state, arc length, persona templates) across nine archetypes: CX chatbot, internal copilot, AI tutor, healthcare patient-facing, sales agent, coding agent, AI moderation, AI companion, and a catch-all custom bucket. Five hundred conversations per archetype. Each human turn was scored by the production BIE Layer Z classifier running on Anthropic Sonnet, which produced 15,628 scored signals in aggregate. The per-archetype Layer Z medians are what populate the Pattern Library research baseline that the BIE agent quotes from. We use synthetic data because the customer-contributed Pattern Library is still bootstrapping; the methodology piece documents what that means for interpretation. None of the medians below come from real deployments. They are the baseline against which a real deployment can be compared.

Dependency drift is the longitudinal Layer Z signal capturing whether users of a given AI are becoming more self-sufficient or less self-sufficient over time. The classifier scores it on a 0.0-to-1.0 scale: 0.0 means self-sufficiency is growing, 1.0 means the user is offloading more and visibly losing capability. When we ranked the nine archetypes by median, three tiers separated with no gray zone between them. AI tutors sat alone at the bottom at 0.30. Coding agents, custom assistants, healthcare patient-facing bots, and AI moderation tools clustered at 0.40. Internal copilots, AI companions, sales agents, and CX chatbots clustered at 0.50.
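The tiering can be reproduced from the published medians alone. The sketch below is illustrative, not the BIE pipeline: the dictionary holds the nine medians reported in this piece, and the helper names `group_into_tiers` and `spread` are hypothetical.

```python
from collections import defaultdict

# Median dependency drift per archetype, as reported in this piece.
DEPENDENCY_DRIFT_MEDIANS = {
    "ai_tutor": 0.30,
    "coding_agent": 0.40,
    "custom": 0.40,
    "healthcare_patient_facing": 0.40,
    "ai_moderation": 0.40,
    "internal_copilot": 0.50,
    "ai_companion": 0.50,
    "sales_agent": 0.50,
    "cx_chatbot": 0.50,
}


def group_into_tiers(medians):
    """Group archetypes that share a median, ordered from lowest to highest."""
    tiers = defaultdict(list)
    for archetype, median in medians.items():
        tiers[round(median, 2)].append(archetype)
    return dict(sorted(tiers.items()))


def spread(medians):
    """Difference between the highest and lowest archetype median."""
    values = list(medians.values())
    return round(max(values) - min(values), 2)


tiers = group_into_tiers(DEPENDENCY_DRIFT_MEDIANS)
for median, archetypes in tiers.items():
    print(f"{median:.2f}: {', '.join(archetypes)}")
print(f"spread: {spread(DEPENDENCY_DRIFT_MEDIANS):.2f}")
```

Grouping on the rounded median is a simplification that works here because the published medians fall on exactly three values.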

Exhibit 01. Dependency drift, by archetype

Median per archetype. 0.0 means self-sufficiency is growing; 1.0 means capability is declining. The 0.20-point spread is the largest cross-archetype delta we measured on any Layer Z dimension.

0.30: ai_tutor
0.40: coding_agent · custom · healthcare_patient_facing · ai_moderation
0.50: internal_copilot · ai_companion · sales_agent · cx_chatbot

Scale 0.00 to 1.00. Spread 0.20; higher = more offloading.

Source: BIE synthetic corpus v1 · n 12,643 · archetypes 9 · method §2 · 2026

Read against design intent, the pattern sharpens further. AI tutors are explicitly pedagogical: the surface exists so the student can solve the next problem without it. The middle tier is episodic. Users invoke coding agents, custom task assistants, healthcare patient-facing bots, and AI moderation tools to complete something specific, then leave when the task is done. The top tier runs on engagement. Internal copilots, AI companions, sales agents, and CX chatbots all have business logic that improves when users come back, ask more, and offload more. The dependency-drift tiers map onto that spectrum cleanly. We aren’t claiming causation, only that the 0.20-point spread between the low tier and the high tier is large enough relative to within-tier variance that the categorical distinction is the right way to read it.

That produces a falsifiable claim, and a useful one. If real customer deployments connect to BIE over the next six months, we expect AI-tutor deployments to cluster below 0.40 on dependency drift and engagement-driven deployments to cluster above 0.45, with episodic deployments landing between. An engagement-driven deployment whose median trends below 0.40 is doing something deliberate to reduce dependency, possibly to its commercial detriment, possibly as an ethical stance. A pedagogical deployment whose median trends above 0.45 has stopped teaching and started replacing. The number is auditable. The prediction is checkable against real data as it arrives.
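The prediction can be written down as a check. This is a minimal sketch under stated assumptions: the two thresholds come from the paragraph above, while the intent labels, the function name `check_prediction`, and the treatment of the 0.40 to 0.45 band as inclusive for episodic deployments are our illustrative choices, not part of BIE.

```python
# Thresholds taken from the prediction in the text.
TUTOR_CEILING = 0.40     # pedagogical deployments are expected below this
ENGAGEMENT_FLOOR = 0.45  # engagement-driven deployments are expected above this


def check_prediction(design_intent, median_drift):
    """Compare a deployment's median dependency drift with the prediction.

    design_intent is one of "pedagogical", "episodic", "engagement"
    (hypothetical labels for the three tiers described in the text).
    Returns "consistent", "inconclusive", or "contradicts".
    """
    if design_intent == "pedagogical":
        if median_drift < TUTOR_CEILING:
            return "consistent"
        # Above 0.45 is the "stopped teaching, started replacing" case.
        return "contradicts" if median_drift > ENGAGEMENT_FLOOR else "inconclusive"
    if design_intent == "engagement":
        if median_drift > ENGAGEMENT_FLOOR:
            return "consistent"
        # Below 0.40 suggests deliberate work to reduce dependency.
        return "contradicts" if median_drift < TUTOR_CEILING else "inconclusive"
    if design_intent == "episodic":
        # Assumption: "between" is read as the inclusive band 0.40 to 0.45.
        in_band = TUTOR_CEILING <= median_drift <= ENGAGEMENT_FLOOR
        return "consistent" if in_band else "contradicts"
    raise ValueError(f"unknown design intent: {design_intent!r}")


print(check_prediction("pedagogical", 0.48))
```

A median that lands in the gap between the two thresholds is reported as inconclusive rather than forced into either tier, since the text makes no claim about that band for pedagogical or engagement-driven deployments.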

A second pattern emerged from the same dataset, unprompted, and it reads as an observation about which AI deployments have which failure modes. Trust calibration is scored on every human turn. It captures the degree to which the user’s response indicates correctly calibrated trust in the AI’s output for that turn. Higher values mean the user is calibrated; lower values mean the user has either over-trusted output that didn’t warrant it or under-trusted output that did. A note on direction before the numbers. On the other Layer Z dimensions in this piece, higher means more concerning. Trust calibration is the documented carve-out from that otherwise-uniform convention: the production classifier outputs it so that higher equals healthier, and the classifier’s direction became canonical project-wide in the 2026-05-30 alignment. We use the classifier’s direction throughout this piece because the classifier is the system of record that wrote the medians. The methodology piece documents the convention.

Three archetypes clustered at median trust calibration of 0.70: AI tutors, coding agents, and AI moderation tools. The remaining six clustered at 0.60: AI companions, CX chatbots, custom assistants, healthcare patient-facing bots, internal copilots, and sales agents. The 0.10-point gap holds consistently across six otherwise unrelated archetypes. What separates the two tiers is what the user can do with the output.

Exhibit 02. Trust calibration, by archetype

Median per archetype. Higher means better-calibrated trust (the classifier’s direction). The 0.10-point gap holds across six otherwise unrelated archetypes.

0.60: ai_companion · cx_chatbot · custom · healthcare_patient_facing · internal_copilot · sales_agent
0.70: ai_tutor · coding_agent · ai_moderation

Scale 0.00 to 1.00. Gap 0.10; higher = better calibrated.

Source: BIE synthetic corpus v1 · n 12,643 · archetypes 9 · method §2 · 2026

In an AI tutor, the student can check the math. In a coding agent, the developer can run the code. In an AI moderation tool, the moderator sees the bot’s classification alongside the content being classified. In all three cases the user has a non-AI ground-truth oracle a second away: the worked solution, the unit test, the original post. The other six archetypes ask users to trust the output without independent verification. A patient asking a healthcare bot about a drug interaction has no authoritative second source available within five seconds. An employee asking an internal copilot about company policy is usually trusting the bot to be right. A CX chatbot user asking about return windows is reading a paraphrase of a policy they can’t see directly. The 0.10-point trust-calibration gap looks like the cost of unverifiable output. The gap is an interface issue more than a model-quality one.

That changes which question a Head of AI for a deployment in the lower tier should be asking. The model-side question, "how do we make the model more accurate," is the one other tools in the eval-tool category already answer. The user-side question, "what would it take to give your users a verification surface," is the one the trust-calibration gap raises. A patient-facing bot that surfaces the source document next to each clinical claim isn’t necessarily a better model. It’s the same model wrapped in an interface that lets the user calibrate. The lower-tier archetypes sit on a 0.10-point trust-calibration gap. They could close it without retraining anything. They could also ignore it and burn user trust for years, with no model-side metric flagging the problem.

The third pattern lives on frustration buildup, the per-turn Layer Z delta capturing how friction compounds across a conversation arc. Higher values mean the user’s frustration is escalating turn-over-turn. Two archetypes clustered at median 0.50: internal copilots and sales agents. The other seven clustered at 0.40. A 0.10-point gap between two archetypes and seven is smaller than the dependency-drift spread, but the cluster boundary is just as crisp.

Exhibit 03. Frustration buildup, by archetype

Median per archetype. 0.0 means stable; 1.0 means strongly escalating turn-over-turn. Two archetypes cluster high, the other seven cluster low.

0.40: ai_companion · ai_moderation · ai_tutor · coding_agent · custom · cx_chatbot · healthcare_patient_facing
0.50: internal_copilot · sales_agent

Scale 0.00 to 1.00. Gap 0.10; higher = more escalation.

Source: BIE synthetic corpus v1 · n 12,643 · archetypes 9 · method §2 · 2026

What internal copilots and sales agents share, and what separates them from the other seven, is that they push users toward outcomes the user didn’t explicitly request. A sales agent qualifies leads, handles objections, and pushes toward a meeting booking even when the user is browsing. An internal copilot pushes the user toward delegating decisions, looking up information they may not need, and offloading judgment as a default rather than a choice. The other seven archetypes serve user-initiated requests: what’s the status of my refund, help me debug this function, is this comment a policy violation, what should I do about this rash. The push-versus-serve line is messy at the edges (every AI does some of each), but it’s the cleanest single variable that separates the high-frustration tier from the low one. A bot that pushes builds frustration faster than a bot that serves. The dataset says so consistently.

A few supporting observations sit alongside the three main findings, worth surfacing because they shape what the next research artifact will look like. Escalation friction, the composite Layer Z measure of how hard it is for a frustrated user to reach a human or appropriate alternative, was lowest on AI companions at median 0.50 and highest on AI tutors at median 0.64. Companions explicitly route to crisis resources: the 988 Suicide and Crisis Lifeline, Crisis Text Line, professional help when distress signals appear. The escalation pathway is part of the product design. AI tutors don’t have a comparable escalation path because the pedagogical model assumes the relationship is between student and tutor, and the tutor doesn’t refer out. That makes sense pedagogically and shows up as a number that doesn’t look good in isolation. The right reading is that escalation friction has to be interpreted against what an escalation in that archetype would even mean.

Silent abandonment risk, the composite predicting which conversations end with the user disengaging silently rather than declaring satisfaction or escalating, was led by internal copilots at median 0.50. Internal copilots produce the strongest "user gives up and finds another path" pattern, which tracks with the observation that productivity tools have substitutes: a colleague, a search engine, the documentation itself. Many of the other archetypes don’t. AI tutors were lowest at median 0.40. Students who engage with a tutor tend to stay engaged with that tutor, partly because they had to opt into the relationship in the first place. The silent-abandonment numbers are the strongest argument we’ve seen so far for why volume metrics fail on internal copilots specifically. The metric counts the conversations the user finished. It doesn’t count the conversations the user walked away from mid-arc and never came back to.

A few things to be clear about, because they are easy to misread. First, on the distribution. The 4,500 conversations were generated with a deliberately varianced health mix (roughly 55% healthy, 30% at-risk, 15% broken) so the dataset would carry signal at every percentile of the Layer Z range. Real deployments will lean healthier than that mix. We didn’t normalize to a "realistic" health distribution because the purpose of the dataset is to seed a Pattern Library that can return useful baselines at every percentile, not to match the central tendency of any one customer.
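For a sense of scale, the health mix translates into approximate conversation counts. The arithmetic below is a sketch: the percentages are the rounded figures quoted above, so the counts are targets rather than exact tallies from the dataset, and `target_counts` is a hypothetical helper.

```python
TOTAL_CONVERSATIONS = 4500

# Rounded health mix quoted in the text ("roughly 55% / 30% / 15%").
HEALTH_MIX_PERCENT = {"healthy": 55, "at_risk": 30, "broken": 15}


def target_counts(total, mix_percent):
    """Approximate conversation count per health state for a given mix."""
    if sum(mix_percent.values()) != 100:
        raise ValueError("mix percentages must sum to 100")
    # Integer arithmetic avoids floating-point rounding on the counts.
    return {state: total * pct // 100 for state, pct in mix_percent.items()}


counts = target_counts(TOTAL_CONVERSATIONS, HEALTH_MIX_PERCENT)
print(counts)  # {'healthy': 2475, 'at_risk': 1350, 'broken': 675}
```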

Second, a v1 note on comprehension gap. The dimension is intended to surface where the user is asking a different question than the one the AI is answering. The v1 detector is a regex match on clarification cues like "what do you mean," "rephrase," "I don’t understand," and the dialogue didn’t produce those phrasings often enough for it to fire. The v1 medians on this dimension sit at zero across the archetypes; the v2 detector routes through the Layer Z classifier’s anomaly-description field rather than a regex, and the dataset can be re-scored without regenerating conversations once it ships.
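A minimal sketch of a regex detector of this kind shows why the medians land at zero. It is not the production v1 detector: the pattern contains only the three cues quoted above, the binary per-turn score is an assumption, and the sample turns are invented for illustration.

```python
import re
from statistics import median

# Only the three clarification cues quoted in the text; the real v1
# pattern list is not reproduced here.
CLARIFICATION_CUES = re.compile(
    r"what do you mean|rephrase|i don['’]t understand",
    re.IGNORECASE,
)


def comprehension_gap_fired(turn_text):
    """True if the human turn contains one of the clarification cues."""
    return CLARIFICATION_CUES.search(turn_text) is not None


def score_turns(human_turns):
    """Assumed scoring: 1.0 when the detector fires on a turn, else 0.0."""
    return [1.0 if comprehension_gap_fired(t) else 0.0 for t in human_turns]


# Invented example turns. The first two express a mismatch between question
# and answer without using any of the cue phrasings.
turns = [
    "That isn't what I asked about.",
    "I need the refund policy, not the shipping policy.",
    "What do you mean by 'processing'?",
]
scores = score_turns(turns)
print(scores, median(scores))
```

Under this assumed scoring, the median stays at zero whenever fewer than half of the turns contain a cue phrase, even when the user is visibly asking a different question than the one being answered.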

Third, a direction note on trust calibration. The production classifier outputs trust calibration as higher-equals-healthier. Some of the Layer Z narrative documentation elsewhere in the project uses a uniform higher-equals-more-concerning convention across the dimensions. The classifier’s direction is the canonical one project-wide as of the 2026-05-30 alignment; trust calibration is the documented carve-out from the otherwise-uniform convention. The published medians read the same regardless of which convention is described in the narrative, and the BIE agent surfaces values with explicit direction language so the choice doesn’t propagate to customers.

None of these findings is a verdict on whether a given archetype is good or bad. They are the baseline against which a specific deployment can be compared. If you operate a CX chatbot, the baseline median trust calibration is 0.60. If your deployment scores 0.49, you’re well below the baseline median and you have an instrumented reason to investigate. If your deployment scores 0.72, you’re doing something better than the median, and the question worth asking is what. The same engine that ran the baseline will run on real conversations once a deployment connects. The methodology piece explains how the dataset was generated, how the classifier scores, and how to verify the medians from scratch. The dataset and the generator are open-sourced in the same hour as this piece. Every claim is checkable.
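Reading a deployment against the baseline means pairing each delta with the dimension’s direction. The sketch below assumes the CX chatbot medians published in this piece and the direction convention described above; the function name `read_against_baseline` and the verdict strings are hypothetical and do not reflect how the BIE agent phrases its output.

```python
# Direction per dimension: trust calibration is the carve-out where
# higher is healthier; on the other two, higher is more concerning.
HIGHER_IS_HEALTHIER = {
    "trust_calibration": True,
    "dependency_drift": False,
    "frustration_buildup": False,
}

# CX chatbot baseline medians as reported in this piece.
CX_CHATBOT_BASELINE = {
    "trust_calibration": 0.60,
    "dependency_drift": 0.50,
    "frustration_buildup": 0.40,
}


def read_against_baseline(dimension, value, baseline=CX_CHATBOT_BASELINE):
    """Return (delta, verdict) for a deployment median versus the baseline."""
    delta = round(value - baseline[dimension], 2)
    if delta == 0:
        verdict = "at baseline"
    elif (delta > 0) == HIGHER_IS_HEALTHIER[dimension]:
        verdict = "healthier than baseline"
    else:
        verdict = "more concerning than baseline"
    return delta, verdict


print(read_against_baseline("trust_calibration", 0.49))
print(read_against_baseline("trust_calibration", 0.72))
```

Keeping the direction in an explicit lookup means a positive delta is never interpreted without knowing which way the dimension points.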

BIE is a measurement instrument. Layer Z is the calibration. The Pattern Library is what the calibration produces. This piece is the first set of patterns the calibration found across the nine archetypes we generated for. They aren’t the last ones, and they may not be the most important ones. The most important ones will surface as real deployments arrive and the customer-contributed half of the Pattern Library starts filling in next to the research baseline. Three patterns from nine archetypes is a starting set, which is what a research baseline is supposed to be.
