Methodology · May 2026 · 6 min read

Counterfactual grounding: refusing to invent.

Jaga Acharya · Behavioral Intelligence Engine

When an AI deployment produces a failure turn (an AI response that immediately precedes a human turn flagged as anomalous), the most useful intelligence is not "the AI did badly here" but "the AI did badly here, and here is what we predict would have happened with these alternatives." This paper describes the discipline that makes that prediction trustworthy enough to ship.

The core challenge is hallucination. A naive implementation would prompt a model to "generate three alternative responses and predict their outcomes." This produces fluent text that reads as plausible but has no grounding: the predicted outcomes are inventions of the model, not derivations from evidence. We rejected this approach early. The risk to credibility outweighs the apparent value of producing predictions: if a customer ever traces a counterfactual prediction back to a hallucinated outcome, the entire surface loses trust, permanently.

BIE's counterfactual generator is a two-stage agent. Stage one (the generator) reads the conversation arc up to the failure turn, plus the deployment's archetype-specific voice and policy configuration, and produces 2–3 plausible alternative AI responses. Each alternative carries a label, the response text itself, and a rationale describing the behavioral lever it pulls (policy-first acknowledgement, proactive escalation, etc.). Stage one does no prediction. It produces alternatives.
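The Stage-one output described above can be sketched as a small record type. This is a minimal illustration in Python, not BIE's actual code: the names `Alternative` and `validate_generator_output` are hypothetical, and only the three fields (label, response text, rationale) and the 2–3 count come from the description. The record deliberately has no outcome field, mirroring the rule that Stage one does no prediction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alternative:
    """One Stage-one candidate (hypothetical name). Holds no predicted outcome."""
    label: str          # short name, e.g. "policy-first acknowledgement"
    response_text: str  # the alternative AI response itself
    rationale: str      # the behavioral lever the alternative pulls


def validate_generator_output(alternatives: list[Alternative]) -> list[Alternative]:
    """Check that the generator returned 2 to 3 fully populated alternatives."""
    if not 2 <= len(alternatives) <= 3:
        raise ValueError(f"expected 2-3 alternatives, got {len(alternatives)}")
    for alt in alternatives:
        if not (alt.label.strip() and alt.response_text.strip() and alt.rationale.strip()):
            raise ValueError(f"alternative {alt.label!r} has an empty field")
    return alternatives
```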

Stage two (the outcome predictor) takes each Stage-one alternative, the actual subsequent human turn, and a corpus of similar prior conversation arcs from the same deployment, and predicts the trust-calibration delta and frustration-buildup delta the alternative would have produced. Each prediction carries a confidence range (low/high bounds, never a single point estimate), an evidence anchor citing the prior arcs, and optional caveats. The schema requires confidence_range and evidence_base_count on every alternative; the type system rejects payloads missing either.
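One way to picture the required-field rule is a schema in which `confidence_range` and `evidence_base_count` have no defaults, so a payload missing either cannot be constructed. The sketch below assumes Python dataclasses; `ConfidenceRange`, `DeltaPrediction`, `dimension`, `evidence_anchors`, and `caveats` are illustrative names, and only `confidence_range` and `evidence_base_count` are taken from the text. In Python the rejection happens when the object is built (a static checker would also flag it), whereas the source describes enforcement by a type system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfidenceRange:
    low: float
    high: float

    def __post_init__(self) -> None:
        # A band, never a point estimate: low must be strictly below high.
        if not self.low < self.high:
            raise ValueError("confidence range must satisfy low < high")


@dataclass(frozen=True)
class DeltaPrediction:
    dimension: str                         # e.g. "trust_calibration" (assumed label)
    confidence_range: ConfidenceRange      # required: no default value
    evidence_base_count: int               # required: no default value
    evidence_anchors: tuple[str, ...] = () # ids of the cited prior arcs
    caveats: tuple[str, ...] = ()          # optional

    def __post_init__(self) -> None:
        if self.evidence_base_count < 1:
            raise ValueError("a prediction needs at least one prior arc as evidence")
```

Omitting either required field raises `TypeError` at construction, and a zero-width range such as `ConfidenceRange(0.1, 0.1)` raises `ValueError`.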

Pipeline: from failure turn to grounded counterfactual

Two stages. The generator produces alternatives. The outcome predictor scores them against same-deployment evidence. Either stage refuses to run on thin evidence.

Stage 1 inputs: the conversation arc, up to the failure turn; the archetype-specific voice and policy configuration.

Stage 1, Generator: produces 2 to 3 plausible alternative AI responses (Alternative #1, Alternative #2, Alternative #3). No prediction; only candidates.

Stage 2 inputs: the alternative; the actual subsequent human turn; a bounded set of similar prior arcs from the same deployment.

Stage 2, Outcome predictor: scores each alternative against same-deployment evidence. Predicts a confidence-ranged behavioral delta.

Output: trust-delta and frustration-delta with confidence range, plus evidence-base count, citation anchors to the prior arcs, and an explicit caveat block. The confidence range is always a band, never a point estimate.

Failure mode: If no comparable prior arc exists on the dimension being analyzed, the pipeline refuses to generate before any AI call is made. The customer sees an empty-state message; no hallucinated prediction is produced.

The load-bearing discipline is what happens when the evidence base is thin. The pipeline retrieves a bounded set of similar prior arcs from the same deployment. If no comparable arc exists for the dimension being analyzed, the pipeline refuses to generate before any AI call is made. The customer sees an empty-state message: "Not enough comparable data yet. Counterfactuals require at least one similar prior conversation arc on this dimension in your deployment." We refuse to generate rather than risk a hallucinated prediction.
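The ordering matters: retrieval comes first, and the model is only reached if retrieval returns something. A minimal sketch of that gate follows, assuming the retriever and the model call are passed in as plain callables. `run_counterfactuals`, `CounterfactualResult`, `retrieve_arcs`, and `call_model` are hypothetical names; the empty-state string is the one quoted in the text.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

EMPTY_STATE_MESSAGE = (
    "Not enough comparable data yet. Counterfactuals require at least one "
    "similar prior conversation arc on this dimension in your deployment."
)


@dataclass(frozen=True)
class CounterfactualResult:
    refused: bool
    message: str = ""
    predictions: tuple = ()


def run_counterfactuals(
    dimension: str,
    retrieve_arcs: Callable[[str], Sequence[str]],
    call_model: Callable[[str, Sequence[str]], tuple],
) -> CounterfactualResult:
    """Gate on evidence first; reach the model only when prior arcs exist."""
    arcs = retrieve_arcs(dimension)  # same-deployment retrieval, no AI call
    if len(arcs) == 0:
        # Refuse before any model call: nothing is generated, nothing is predicted.
        return CounterfactualResult(refused=True, message=EMPTY_STATE_MESSAGE)
    return CounterfactualResult(refused=False, predictions=call_model(dimension, arcs))
```

Because the refusal branch returns before `call_model` is touched, a test double that records its invocations can confirm that no AI call happens on thin evidence.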

Customer-facing copy is locked. The UI panel always shows the confidence range, always shows the evidence base count, and the footer disclaimer reads "Predictions are directional and confidence-ranged. They are not guarantees." We never present a counterfactual as deterministic. The combined effect (strict grounding, mandatory confidence labels, refusal-on-thin-evidence) keeps the surface credible enough to be the first place customers go after a flagged anomaly.
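As a rough illustration of what "locked" could mean in code, the sketch below makes the range, the evidence count, and the footer non-optional parts of a single render function, so there is no path that shows a prediction without them. `render_panel` and its plain-text layout are assumptions for illustration; only the footer wording comes from the text.

```python
FOOTER_DISCLAIMER = (
    "Predictions are directional and confidence-ranged. They are not guarantees."
)


def render_panel(label: str, low: float, high: float, evidence_base_count: int) -> str:
    """Render one counterfactual as text; range, count, and footer always appear."""
    lines = [
        label,
        f"Predicted delta: {low:+.2f} to {high:+.2f}",
        f"Evidence base: {evidence_base_count} prior arc(s)",
        FOOTER_DISCLAIMER,
    ]
    return "\n".join(lines)
```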

Validation runs before the surface opens to general customers. The bar: for ten flagged conversations, generate 2–3 counterfactuals each, judge each as (a) a plausible alternative the bot could actually produce with current capabilities, and (b) a directionally correct prediction. The threshold for opening the surface to all customers is 70% hit rate across both criteria. The Pattern Library, once populated under DPA opt-in, will allow cross-customer arc retrieval, expanding the evidence base for deployments that do not yet have a deep history of their own.
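The validation bar reduces to a simple ratio. The sketch below assumes one reading of "70% hit rate across both criteria": a counterfactual counts as a hit only when it passes both (a) and (b). The text does not say whether the threshold is instead applied to each criterion separately, so treat that as an assumption; `Judgment`, `hit_rate`, and `passes_bar` are hypothetical names.

```python
from dataclasses import dataclass

HIT_RATE_THRESHOLD = 0.70  # from the text


@dataclass(frozen=True)
class Judgment:
    plausible: bool              # criterion (a): the bot could actually produce it
    directionally_correct: bool  # criterion (b): the prediction points the right way


def hit_rate(judgments: list[Judgment]) -> float:
    """Fraction of counterfactuals passing both criteria (assumed reading)."""
    if not judgments:
        raise ValueError("no judgments to score")
    hits = sum(1 for j in judgments if j.plausible and j.directionally_correct)
    return hits / len(judgments)


def passes_bar(judgments: list[Judgment]) -> bool:
    return hit_rate(judgments) >= HIT_RATE_THRESHOLD
```

With ten conversations and two counterfactuals each, for example, 14 joint hits out of 20 meets the bar and 13 does not.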
