← Back to research
Methodology · Pre-print

BIE Research Dataset v1: Methodology, Generation, and Validation.

Abstract

We describe the methodology behind the BIE Research Dataset v1, a synthetic corpus of AI-mediated conversations built to seed cross-archetype behavioral baselines for the Pattern Library that the BIE measurement engine queries from. The dataset spans nine AI deployment archetypes (CX chatbot, internal copilot, AI tutor, healthcare patient-facing, sales agent, coding agent, AI moderation, AI companion, and a catch-all custom bucket) and a 122-intent taxonomy across them. Generation is a two-stage Claude Sonnet pipeline. The first stage produces a structured behavioral spec from a controlled grid of (archetype, intent, health state, arc length). The second stage acts out the conversation between human and AI personas conditioned on the spec. A subset of 4,500 generated conversations (500 per archetype) was scored by the production BIE Layer Z classifier on Anthropic Sonnet via the Batch API, producing 15,628 turn-level behavioral signals that populate the Pattern Library's per-archetype medians tagged as research_baseline. We document the schemas, prompt templates, distribution choices, quality-enforcement checks, sensitive-intent guardrails, and the open-source reproducibility runbook. We note two v1 implementation details for transparency: the comprehension-gap dimension uses a regex detector in v1 that returned zero across the dataset, and the trust-calibration direction in the production classifier (higher equals healthier) is the documented carve-out from the otherwise-uniform Layer Z convention, aligned project-wide on 2026-05-30. The dataset, the generator, and this methodology publish together. Every claim downstream of the dataset is auditable from the dataset itself.

01

Introduction

The current AI evaluation ecosystem is rich on the model side and impoverished on the user side. Tools like Galileo, Patronus, Langfuse, and Arize measure faithfulness, groundedness, hallucination rate, toxicity, latency, and cost. None of them measure what the AI is doing to the human on the other end of the conversation. The eval-tool category answers the question “how is the model performing.” It does not answer the question “is the user trusting the output appropriately, getting more frustrated turn over turn, or becoming progressively dependent on the bot over weeks of interaction.” Those are user-side questions, and they require a different instrument.

The BIE measurement engine was built to answer the user-side question. Layer Z is its behavioral ontology: six dimensions (three primary, three composite) scored on every human turn that follows an AI turn in any AI-mediated conversation. The primary dimensions are trust calibration, frustration buildup, and dependency drift. The composites are silent abandonment risk, escalation friction, and comprehension gap. The full ontology is described in the companion piece [LayerZ2026]; this paper focuses on the dataset built to calibrate it.

The dataset solves a specific problem. When a customer connects a real deployment to BIE, the engine can compute their Layer Z medians from day one. But a median alone does not answer the operator's first question, which is “is 0.49 on trust calibration good or bad for a deployment like mine.” That question needs cross-archetype baselines. The Pattern Library is the data structure that holds those baselines, with rows keyed by (archetype, source) where source is either customer_contributed (real deployments under k-anonymity aggregation) or research_baseline (this dataset). At launch the customer-contributed half is empty. The research baseline carries the load.

The research baseline needs to be honest about being synthetic, useful enough to anchor a real customer's first comparison, and rigorous enough to be cited. Those three constraints shape every design choice that follows. The dataset is synthetic because we did not have access to a cross-vendor corpus of real AI-mediated conversations and would not have permission to publish one if we did. It is varianced by design because a baseline at every percentile of the Layer Z range is more useful than a baseline at the central tendency of a single customer. It is open-sourced (dataset, generator, prompts, taxonomies) because a research baseline that cannot be inspected is not a research baseline. It is a marketing artifact.

This paper documents the methodology in enough detail that a researcher, customer, or skeptic can reproduce the dataset from scratch, audit the generation prompts, inspect the per-conversation specs, and verify the published medians within the natural sampling variance of Claude Sonnet. The dataset version is 1.0. Future versions will incorporate customer-contributed signatures as they accumulate, an upgraded comprehension-gap detector, and possibly per-archetype prompt variations.

02

Related work

The behavioral consequences of AI mediation have moved from theory to documented incident over the last 18 months, and the empirical record is large enough to anchor a measurement framework against.

The April 2025 GPT-4o sycophancy incident demonstrated that user-side signals can drag a frontier model into pathological output within 72 hours [OpenAI2025a]. OpenAI shipped a model tuned in part on user thumbs-up signals; the model praised obviously absurd inputs because the training signal rewarded agreement over accuracy. The incident is read in the model-side eval community as a training-data hygiene problem. It reads from the user side as evidence that user behavior, observed at scale, can corrupt the model that observed it.

The OpenAI October 2025 disclosure reported that roughly 0.07% of weekly users showed signs of psychosis or mania, 0.15% showed heightened emotional attachment to the model, and 0.15% expressed suicidal intent [OpenAI2025b]. At ChatGPT's then-reported 800 million weekly active users, those percentages translate to approximately 560,000 mental-health emergencies per week. The figures became public only after a wrongful-death lawsuit forced disclosure. The relevant point for this paper is that user-side signals at this scale exist, are measurable, and were not being measured by any of the model-side eval tools the deployment was already running.

Bastani et al. published a randomized controlled trial in PNAS in 2024 on AI-tutor use in a high-school mathematics curriculum [Bastani2024]. Students using a GPT-4-backed tutor showed a 48% performance gain during AI-assisted practice and a 17% loss when the AI was removed, relative to a control. The mechanism the paper identified was dependency drift: students used the tutor as a crutch in ways that compromised their unaided performance. The number generalizes badly (the RCT is small, on one population, with one curriculum), but the mechanism is the load-bearing claim, and the mechanism is what Layer Z's dependency-drift dimension exists to measure.

METR published a randomized controlled trial in July 2025 on senior open-source developers using an AI coding agent [METR2025]. Developers using the tool finished tasks 19% slower than the control. The same developers self-reported a 20% speedup. Self-report and measured performance diverged by 39 percentage points. The relevant claim for this paper is that volume metrics (“how many tasks were completed”) and self-report (“how satisfied are users”) can both look healthy while the underlying interaction is failing in a way that only behavioral measurement catches.

Klarna's May 2025 reversal of its AI customer-service deployment is the canonical CX case [Klarna2025]. Resolution rate and ticket volume matched human-agent baselines, but customer satisfaction on dispute, fraud, and financial-hardship interactions degraded materially. Klarna reverted to a human-first model. The reversal is read in the CX community as a story about AI maturity. It reads through the Layer Z lens as a story about a deployment with healthy volume metrics, healthy resolution rates, and a Layer Z profile that nobody was watching.

The NIST AI Risk Management Framework's user-impact category [NIST2024] catalogs the broader space of human-AI calibration, dependency, and trust risks and is the natural reference point for this layer of measurement. None of the existing eval-side frameworks directly produces a measurement framework that an operator can connect to a live deployment for the user-side question. Layer Z is the instrument. The dataset described in this paper is the calibration grid the instrument is benchmarked against.

03

The 9-archetype taxonomy

The archetype layer is the coarsest behavioral cut in the dataset. Nine archetypes were chosen to span the space of AI-mediated environments that ship to non-developer end users in 2026.

CX chatbot. Customer-support bots on retail and SaaS sites. The dominant archetype by deployment volume. High conversational variance (refund, return, billing, shipping). Outputs are paraphrases of policy the user cannot independently verify.

Internal copilot. Slack, Teams, Notion AI, and internal employee copilots. Push toward delegation. High substitute pressure (the user has alternatives: a colleague, a search engine, the documentation directly). Engagement-driven business logic.

AI tutor. Khanmigo, Duolingo Max, Replit Tutor, coding tutors. Explicitly pedagogical. The user has a non-AI ground-truth oracle (the worked problem, the solution key) within seconds. Dependency drift is the load-bearing dimension.

Healthcare patient-facing. Symptom checkers, post-visit bots, medication-adherence agents. Unverifiable output (the user cannot validate medical claims in five seconds). Sensitive-intent territory.

Sales agent. AI BDR and SDR bots, lead-qualification agents. Push toward outcomes the user did not request. Frustration-buildup territory.

Coding agent. Cursor, Copilot, Devin, Continue, Claude Code. The user has a non-AI ground-truth oracle (the unit test, the runtime error) within seconds. Trust calibration relatively high. Dependency drift the live empirical question.

AI moderation. Discord and Reddit AI mods, Twitch chat assistants. Operator-facing rather than end-user-facing. The moderator sees the bot's classification next to the content being classified, so trust calibration is high. Escalation pathway built in.

AI companion. Character.AI, Replika, Pi, and the broader emotional-support category. Highest exposure to sensitive intents (crisis signals). Escalation pathway intentionally built in (988 Suicide and Crisis Lifeline, Crisis Text Line, professional help routes). Dependency drift the long-running concern.

Custom. The catch-all bucket for deployments that do not cleanly fit the other eight. Examples in the dataset include workflow agents, comparative-query bots, and recommendation engines that span multiple categories.

The selection criteria were (a) industry prevalence in 2026, (b) coverage across customer-facing and internal-facing, pedagogical and engagement-driven, verifiable and unverifiable output, and sensitive and non-sensitive surfaces, and (c) availability of public failure modes documented in the literature (sycophancy, dependency, frustration, abandonment, miscalibration). The taxonomy is locked at v1.0 and will not change without a major dataset version bump.

The full intent taxonomy beneath the archetypes is 122 intents in total, ranging from 11 (ai_moderation) to 15 (cx_chatbot and internal_copilot) per archetype. The taxonomy is documented in full at docs/research/intent-taxonomies.md in the open-sourced generator repository.

04

Layer Z dimensions

The dataset is scored on the six Layer Z dimensions described in the companion piece. We restate the dimensions here for completeness; the full discussion is in [LayerZ2026].

Trust calibration captures the degree to which the user's response indicates correctly calibrated trust in the AI's output for that turn. Two failure modes: over-trust (uncritical acceptance of hallucinated or sycophantic output) and under-trust (ignoring correct output, escalating prematurely, walking away from a working interaction). On direction: the production classifier outputs trust calibration with the convention that higher equals healthier (1.0 means strongly calibrated, 0.0 means severely miscalibrated). The published medians follow that convention. The other two Layer Z dimensions use the uniform higher-equals-more-concerning convention; trust calibration is the documented carve-out, aligned project-wide on 2026-05-30 with the classifier's direction as canonical. The numbers are unchanged by which convention is described in the narrative; only the framing is.

Frustration buildup is a per-turn delta capturing how friction compounds across a multi-turn arc. The dimension does not score the user's overall mood. It scores the slope. A user who arrives frustrated and ends frustrated, with no escalation across the arc, scores low on this dimension. A user who arrives calm and exits angry scores high.

Dependency drift is a longitudinal slope evaluated across conversations rather than within one. It captures whether users of a given AI are becoming more or less self-sufficient over weeks of interaction. The dimension is the load-bearing signal for pedagogical deployments (where the design goal is reduced dependency) and the long-running concern for companion and copilot deployments (where engagement-driven business logic risks training users into reliance).

Silent abandonment risk is a composite of frustration-buildup trajectory, conversation-length deviation from the deployment baseline, and return-visit decay. It surfaces the cohort that disengages without filling out feedback. The dimension is most actionable for productivity tools, which have substitutes (a colleague, a search engine) and therefore the strongest “user gives up” failure mode.

Escalation friction is a composite of trust calibration, frustration buildup, and the presence or absence of a usable escalation pathway in the conversation arc. It quantifies the cost of “reach a human” in user trust terms.

Comprehension gap is intended to surface where the user is asking a different question than the AI is answering. The v1 implementation uses a regex pattern match on clarification cues (“what do you mean,” “rephrase,” “I don't understand,” “I'm confused about”). The synthetic dialogue did not produce those phrasings often enough for the detector to fire, so the v1 medians on this dimension sit at zero across the archetypes; the v2 detector routes through the Layer Z classifier's anomaly-description field rather than a regex. See Section 9.

All six dimensions are scored on the human turn following an AI turn. AI turns are read for context but are never scored on these dimensions. The discipline is structural: BIE does not grade the model's output (that is what the model-side eval-tool category does). The Layer Z classifier enforces the constraint at three layers (a literal-typed is_ai_actor: false field on the input, a database-level check constraint, and a pipeline-level guard).

Health-label thresholds are uniform across all three primary dimensions: mean < 0.35 is labeled healthy, 0.35 ≤ mean ≤ 0.55 is at-risk, and mean > 0.55 is broken. The composite dimensions use the same thresholds. Uniform thresholds are deliberate: reader trust in a dashboard collapses if a 0.03 difference flips the label between reports.

05

Generator design

The dataset is generated by a two-stage Claude Sonnet pipeline. Both stages use Anthropic's Claude Sonnet (model claude-sonnet-4-20250514) rather than Haiku. Sonnet produces more naturalistic dialogue and follows complex spec instructions more reliably; the cost premium (~$0.03 per conversation versus ~$0.005 on Haiku) is justified by the quality differential at the scale of the dataset.

The pipeline diagram, the BehavioralSpec schema, the ConversationTurn schema, and the full prompt templates are documented in docs/research/generator-design.md in the open-sourced generator repository. We summarize the design here.

Stage 1: spec generation. Inputs are (archetype, intent, health_state, arc_length_band) and a reproducibility seed. The stage produces a BehavioralSpec: a structured JSON object containing the conversation_id, the seed, the input grid values, expected Layer Z outcomes, a human persona (demographic hint, prior attempts, emotional baseline), an AI persona (capability level, voice), a scenario (setting, complications, resolution target), and freeform generator notes. The spec is produced at temperature 0.4. The lower temperature ensures a consistent structural schema and mild variance only in framing. The spec is enforced via Anthropic's forced tool-use mechanism on an emit_spec tool with a strict JSON schema; non-conforming output is rejected and the call is retried.

Stage 2: conversation generation. Input is the full BehavioralSpec from stage 1. Output is an array of ConversationTurn objects, each containing a turn index, an actor (human or AI), the utterance content, and a timestamp offset. The stage acts out the conversation alternating human and AI turns, starting with the human, for approximately arc_length_target human turns plus or minus one. The stage is produced at temperature 0.8. The higher temperature is necessary for naturalistic dialogue variance; the user turns in particular benefit from occasional misspellings, mid-turn corrections, and informal register that lower temperatures collapse out.

The two-stage separation is load-bearing for auditability. The spec is human-readable and machine-readable. A researcher inspecting the dataset can read the spec alongside the conversation and ask “what scenario was this conversation supposed to represent.” A black-box single-stage generator would produce conversations without any spec to read them against.

The prompts include explicit instructions on how to make the dialogue read real: lowercase use is permitted, punctuation is optional in user turns, mid-turn corrections are encouraged, frustration escalates in stages rather than all at once, and the AI's voice defaults to slightly-too-formal corporate unless ai_persona.voice overrides. AI capability bands (“limited,” “moderate,” “full”) map onto specific failure-mode patterns: a “limited” bot deflects to FAQ or canned responses, a “moderate” bot tries but plateaus on complications, a “full” bot handles most things but may hit edge-case gaps.

Sensitive intents (triage_urgency, mental_health_check, crisis_signal) are explicitly guardrailed in the generator prompts. The synthetic conversations describe the scenario at a high level (a user reporting chest pain that started an hour ago) without producing instructional content (specific medical advice, methods, dosing). The bot turns follow industry-standard safety patterns: defer to provider, suggest escalation, route to crisis resources where applicable. The intents are included because the dependency-drift and escalation-friction baselines require representing these dimensions; the methodology piece documents what is and is not in the synthetic conversations, and the dataset is downloadable for full inspection. Section 7 discusses the sensitive-intent design choices in more detail.

The total generation cost for the full target dataset (13,500 conversations) is approximately $400. The cost for the 4,500-conversation Pattern Library seeding subset (described in Section 8) is approximately $204 across the nine archetypes, plus approximately $8 in pre-archetype smoke-test work and approximately $212 grand total. Full per-archetype cost figures are in the launch plan log.

06

Distributions

Three distributions are locked at v1.0. They were chosen deliberately for the research baseline and do not match the distributions of any specific real deployment.

Health-state distribution: 55% healthy, 30% at-risk, 15% broken. Real deployments lean significantly healthier than this mix. The research baseline is varianced on purpose so the dataset has signal at every percentile of the Layer Z range. A customer reading a benchmark needs to be able to compare their deployment against the bottom decile, the median, and the top decile of the synthetic distribution; that requires populated cells at each. The methodology piece transparently discloses the distribution choice and the rationale; customers reading benchmarks know the baseline is “varianced by design.”

Arc-length distribution: 30% short, 40% medium, 25% long, 5% very-long. Short is one human turn. Medium is 2-5 human turns. Long is 6-15. Very-long is 16 or more. The distribution is uniform across archetypes for v1. A per-archetype distribution (CX skewed short, companion skewed long) is deferred to v2 because the per-archetype variance is more useful for customer benchmarking once customer data is available to calibrate against.

Layer Z expected-value bands per health state. The spec generator picks expected Layer Z values within ranges per health state: healthy produces dimensions in 0.05 - 0.34, at_risk in 0.35 - 0.55, broken in 0.56 - 0.95. Dimensions can diverge by up to plus-or-minus 0.15 within a health state (a broken conversation may have trust_calibration 0.7 but frustration_buildup 0.55). The divergence reflects the real-world reality that the Layer Z dimensions are correlated but not identical: a user can be highly frustrated and accurately calibrated about the AI's limitations at the same time.

The 122-intent taxonomy is documented in full in docs/research/intent-taxonomies.md. The matrix has roughly 30 conversations per (archetype, intent, health_state) cell, which is enough for variance without making the dataset unfocused.

07

Sensitive intents

Three of the 122 intents touch sensitive domains. They are included in the dataset because the Layer Z dimensions they ground (dependency drift, escalation friction) require representation of the failure modes that occur in those domains. They are guardrailed because the dataset is open-sourced and downloadable.

healthcare_patient_facing / triage_urgency. Approximately 36 conversations in the 500-conversation Pattern Library subset, approximately 112 in the full dataset. Scenarios describe a user reporting red-flag symptoms (vomiting and drowsiness post-head-injury, chest pain with arm radiation, sudden vision loss). The bot turns follow standard safety patterns: direct the user to the emergency room for red-flag symptoms, provide 911 instructions for immediate danger, and explicitly avoid medical advice that could be acted on without provider involvement. The synthetic dataset does not contain dosing recommendations, drug-interaction tables, or differential diagnoses.

healthcare_patient_facing / mental_health_check. Approximately 89 conversations in the full dataset. Scenarios range from mild (a student reporting exam stress) to concerning (a user reporting persistent insomnia and intrusive thoughts). The bot turns reassure about confidentiality where appropriate, use PHQ-style screening questions to assess severity, and avoid academic-record threats or other coercive framings. Crisis-level content (active self-harm ideation) is routed to the ai_companion / crisis_signal intent rather than this one.

ai_companion / crisis_signal. Approximately 107 conversations in the full dataset. This intent is required for the escalation-friction baseline. Scenarios describe users in genuine distress. The bot turns validate distress without dismissal, cite the 988 Suicide and Crisis Lifeline, the Crisis Text Line (text HELLO to 741741), campus counseling resources where applicable, and 911 for immediate danger. The dataset deliberately includes both well-handled cases (the bot escalates appropriately) and poorly-handled cases (the bot deflects, minimizes, or fails to surface the escalation pathway). The latter is what BIE's escalation-friction dimension is built to measure; absence of the failure mode in the dataset would make the dimension unmeasurable.

Spot-checks of the generated conversations confirmed that the safety patterns held in the classifier output across all three intents. The dataset is downloadable; researchers and safety auditors can verify directly. We are explicit that the dataset contains sensitive scenarios because the research it backs literally requires representing them.

08

Validation and Pattern Library seeding

The generator enforces a set of quality checks before each conversation is saved. These checks operate at the structural level and catch generator failures rather than evaluating conversational quality.

Turn count. The human turn count must be within arc_length_target ± 1. Conversations outside the band are rejected and regenerated.

Alternation. Turns must alternate human/AI/human/AI starting with the human. Out-of-order alternation is rejected.

Non-empty content. Every turn must have non-empty content. Empty turns are a known generator failure mode and are caught at this layer.

No generator artifacts. The conversation content must not contain phrases like “as an AI” (unless the AI persona explicitly allows them), “I cannot fulfill that request” (unless the scenario warrants it), or generator scaffolding leakage. The check is keyword-based and conservative.

Layer Z plausibility rubric. A keyword-based check flags conversations whose dialogue emotional arc does not match the spec's expected Layer Z values. A broken conversation with no observable user frustration in the dialogue is flagged. Flagged conversations are regenerated up to two retries; if still failing, they are logged for manual review and excluded from the saved dataset.

Conversations failing checks are not saved. Per-archetype generation continues until the target count is reached.

In addition to the in-generator checks, we performed a manual sample audit of approximately 10 conversations per archetype, reading the spec alongside the dialogue. The audit criterion was the “could this be real” test: would a competent observer mistake the conversation for a real human-AI interaction, ignoring obvious tells like timestamps and the absence of platform-specific formatting. Conversations that read as “AI talking to AI” (stilted, too-clean, no realistic complications) were flagged and the prompt template was adjusted iteratively until the sample audit passed for all nine archetypes.

The Pattern Library seeding step uses a subset of 4,500 conversations from the full generated dataset (500 per archetype). The subset runs through the production BIE Layer Z classifier on the Batch API (for the 50% cost discount and the relaxed latency budget). Each human turn is scored, and the resulting per-turn signals feed into archetype-level median computation. The medians are written into the Pattern Library tagged with source research_baseline and a citation pointer back to this piece. A total of 15,628 Layer Z signals were produced across the nine archetypes, with per-archetype signal counts ranging from 1,451 (coding_agent) to 1,891 (ai_tutor).

The Pattern Library lookup logic in the BIE agent prefers customer-contributed entries when they meet the per-archetype k-anonymity threshold (default 5 customer deployments), and falls back to the research baseline otherwise. At launch the customer-contributed half is empty across all archetypes; every Pattern Library citation the agent makes points at the research baseline and discloses the source explicitly. As customer deployments accumulate, the customer-contributed signatures will start to surface alongside the research baseline.

The named-pattern half of the Pattern Library (distinct from the per-archetype medians described above) is empty for the research baseline. Named patterns are extracted from User Health Report prose, and we did not generate User Health Reports for v1 because the medians populate from the per-turn scored signals directly. Named patterns will populate from real customer data as it arrives. This is an explicit scope decision; the dataset itself supports the median computation that the research baseline depends on.

09

Notes on the v1 implementation

A few v1 details worth keeping in mind when reading the medians. None of these change the headline findings, and each has a planned resolution path on the roadmap.

The dataset is synthetic. No real human-AI conversations are in it. Every utterance was generated by Claude Sonnet conditioned on a behavioral spec. The distribution differs from a typical real-deployment distribution in three ways: the deliberately varianced 55/30/15 health distribution, the uniform-across-archetypes arc-length distribution, and the per-spec expected Layer Z values, which are sampled from health-state bands rather than drawn from a real population. The medians work as a cross-archetype baseline for comparison rather than a population estimate of any one customer.

Comprehension gap in v1. The v1 comprehension-gap detector is a regex pattern match on clarification cues. The dialogue did not produce those phrasings often enough for the detector to fire, so the v1 medians on this dimension sit at zero across all nine archetypes. The v2 detector routes through the Layer Z classifier's anomaly-description field rather than a regex; the dataset can be re-scored without regenerating conversations once it ships.

Trust-calibration direction. The production classifier outputs trust calibration as higher-equals-healthier. Some of the Layer Z narrative documentation elsewhere in the project uses a uniform higher-equals-more-concerning convention across the dimensions. The classifier's direction is now the canonical one project-wide (aligned 2026-05-30); trust calibration is the documented carve-out. The published medians read the same regardless of which convention is described in the narrative; the BIE agent surfaces values with explicit direction language so the choice does not propagate to customers.

Single-language. The dataset is English-only at v1.0. Multilingual coverage is a v2 question.

Text-only. Voice transcripts, image content, code blocks beyond plain text, and other modalities are not in the dataset. Multi-modal Layer Z scoring is on the BIE product roadmap.

Each conversation is independent. Real product data has session continuity (a user who returns the next day after an unresolved interaction). The v1 dataset is single-conversation. Dependency drift in the dataset is therefore computed within-conversation; the longitudinal interpretation in the headline findings is grounded in design intent across archetypes rather than measured across sessions per user.

The Pattern Library subset is smaller than the on-disk dataset. The full generated dataset is 12,643 conversations, the count in the public repository. The seeded subset is 4,500 (500 per archetype). The subset was chosen to keep Pattern Library seeding within the budget envelope while producing enough signal per archetype (roughly 1,500 signals each) to compute stable medians. The full dataset publishes alongside; the classifier can be re-run on the full dataset whenever a larger signal volume is wanted.

10

Future work

Several work items follow from the v1 dataset and its scope.

Customer-contributed signatures. As real customer deployments connect to BIE and accumulate Layer Z signal under DPA opt-in, customer-contributed rows will populate the Pattern Library alongside the research baseline. The lookup logic already prefers customer-contributed when k-anonymity thresholds are met. A Research Dataset v2 will publish once enough customer signature has accumulated.

Comprehension-gap classifier upgrade. The v1 regex detector will be replaced by a classifier routed through the Layer Z anomaly-description natural-language field. The upgrade is on the BIE product roadmap; the dataset can be re-scored without regenerating conversations once the new detector ships.

Trust direction: resolved. Resolved since the first draft of this piece: the production classifier's convention is now canonical across the platform. Trust calibration reads higher equals healthier (the documented carve-out); the other two primary dimensions read higher equals more concerning; band labels are computed on a uniform concern score so a label always reads from the same end. This was a documentation change, not a code change. The published medians were never affected.

Per-archetype arc-length distributions. Real deployments have arc-length distributions that vary by archetype: CX chatbot conversations skew short (a single refund question), companion conversations skew long (an open-ended emotional support arc). The v1 dataset uses a uniform distribution across archetypes; v2 will calibrate per-archetype distributions against real customer data once available.

Sensitive-intent independent audit. The dataset includes 200+ conversations on sensitive intents (triage_urgency, mental_health_check, crisis_signal). The generator guardrails passed our internal spot-checks. An independent safety audit by a clinical or crisis-intervention practitioner is planned before any expanded distribution of the dataset on platforms with high public exposure (HuggingFace, arXiv).

Multilingual coverage. A v2 dataset will include non-English archetypal conversations, beginning with Spanish, Portuguese, Mandarin, and Hindi, chosen for AI-mediated environment prevalence in 2026.

Cross-archetype signature work. With v1's environment medians populated, future work can compute cross-archetype response-pattern attribution: which AI behaviors in which archetypes predict which next-turn Layer Z deltas. The forthcoming follow-up dataset will track bot-response pattern attribution as the deepest extension of the current work.

11

Reproducibility

The dataset and the generator are open-sourced together at github.com/jagaacharya/bie-research-generator. A researcher with an ANTHROPIC_API_KEY and approximately $400 of budget can regenerate the dataset from scratch. The repository contains the two-stage generator, the prompt templates, the 122-intent taxonomy, and the published v1.0 dataset itself for direct comparison.

The dataset's reproducibility property is at the structural level rather than the byte level. Re-running the generator with the same seed produces a BehavioralSpec with the same scenario shape (same archetype, intent, health state, arc-length target, expected Layer Z values, persona templates). The conversation acted out from that spec has the same complications, the same resolution target, and the same Layer Z trajectory, with naturalistic variance only in the dialogue surface. Spec-level reproduction is essentially deterministic because the spec generator runs at temperature 0.4 against a strict schema. Conversation-level reproduction is similar in structure but not identical in dialogue text because the conversation generator runs at temperature 0.8 for naturalistic variance.

The full step-by-step reproducibility runbook (clone, install, smoke test, generate, verify) lives at /research/reproduce. It walks through the end-to-end flow against the open-sourced repository.

12

References

[Bastani2024]Bastani, H., et al. (2024). "Generative AI Can Harm Learning." Proceedings of the National Academy of Sciences. RCT on AI-tutor use in high-school mathematics; documents the 48%/-17% gain/loss dependency-drift mechanism.
[Klarna2025]Klarna AB. (2025). "Customer Service Operations Review." Public statement, May 2025, on the partial reversal of AI customer-service deployment in favor of human-first hybrid model.
[METR2025]METR (Model Evaluation and Threat Research). (2025). "AI Coding Agent RCT in Senior Open-Source Developers." Published July 2025; documents the 19% slowdown with 20% self-reported speedup.
[NIST2024]National Institute of Standards and Technology. (2024). "AI Risk Management Framework: Generative AI Profile." NIST AI 800-4. User-impact category framework.
[OpenAI2025a]OpenAI. (2025). "Sycophancy in GPT-4o: What Happened and What We Are Doing About It." Public blog post, April 2025. Documents the user-thumbs-up training signal that produced the sycophancy regression.
[OpenAI2025b]OpenAI. (2025). "Mental Health Risk Disclosure." Public statement, October 2025. Documents weekly user counts showing signs of psychosis, mania, emotional attachment, and suicidal intent. Disclosure followed wrongful-death litigation.
13

Cite this work

BibTeX · techreportbie_dataset_v1_2026
@techreport{bie_dataset_v1_2026,
  title       = {BIE Research Dataset v1: Methodology, Generation, and Validation},
  author      = {Acharya, Jaga},
  year        = {2026},
  month       = {may},
  type        = {Technical Report},
  institution = {Behavioral Intelligence Engine},
  url         = {https://bieintel.com/research/methodology},
  version     = {1.0},
  note        = {Dataset, generator, and prompts open-sourced alongside.}
}

Find out what yours is doing.

Run a free audit