When an AI deployment produces a failure turn (an AI response that immediately precedes a human turn flagged as anomalous), the most useful intelligence is not "the AI did badly here" but "the AI did badly here, and here is what we predict would have happened with these alternatives." This paper describes the discipline that makes that prediction trustworthy enough to ship.
The core challenge is hallucination. A naive implementation would prompt a model to "generate three alternative responses and predict their outcomes." This produces fluent, plausible-sounding text with no grounding: the predicted outcomes are inventions of the model, not derivations from evidence. We rejected this approach early. The risk to credibility outweighs the apparent value of producing predictions: if a customer ever traces a counterfactual prediction back to a hallucinated outcome, the entire surface loses trust, permanently.
BIE's counterfactual generator is a two-stage agent. Stage one (the generator) reads the conversation arc up to the failure turn, plus the deployment's archetype-specific voice and policy configuration, and produces 2–3 plausible alternative AI responses. Each alternative carries a label, the response text itself, and a rationale describing the behavioral lever it pulls (policy-first acknowledgement, proactive escalation, etc.). Stage one does no prediction. It produces alternatives.
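The Stage-one output described above can be sketched as a small record type. This is a minimal illustration, not BIE's actual schema: the names `Alternative` and `check_alternatives` are hypothetical, and the only facts taken from the text are the three fields (label, response text, rationale) and the 2–3 count.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Alternative:
    """One Stage-one output. It deliberately carries no prediction fields."""
    label: str      # short name, e.g. "policy-first acknowledgement"
    response: str   # the alternative AI response text itself
    rationale: str  # the behavioral lever this alternative pulls


def check_alternatives(alternatives: List[Alternative]) -> List[Alternative]:
    """Hypothetical guard: enforce the 2-3 count and reject blank fields."""
    if not 2 <= len(alternatives) <= 3:
        raise ValueError(f"expected 2-3 alternatives, got {len(alternatives)}")
    for alt in alternatives:
        if not (alt.label.strip() and alt.response.strip() and alt.rationale.strip()):
            raise ValueError("label, response, and rationale must all be non-empty")
    return alternatives
```

Keeping prediction fields off this type entirely is one way to make "Stage one does no prediction" a structural property rather than a convention.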
Stage two (the outcome predictor) takes each Stage-one alternative, the actual subsequent human turn, and a corpus of similar prior conversation arcs from the same deployment, and predicts the trust-calibration delta and frustration-buildup delta the alternative would have produced. Each prediction carries a confidence range (low/high bounds, never a single point estimate), an evidence anchor citing the prior arcs, and optional caveats. The schema requires confidence_range and evidence_base_count on every alternative; the type system rejects payloads missing either.
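As a rough sketch of the schema rule above, the following runtime check stands in for the type-level rejection. The field names `confidence_range` and `evidence_base_count` come from the text; everything else (the `Prediction` type, the `parse_prediction` function, the requirement that the low bound be strictly below the high bound, and the minimum count of one) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Tuple

REQUIRED_FIELDS = ("confidence_range", "evidence_base_count")


@dataclass(frozen=True)
class Prediction:
    confidence_range: Tuple[float, float]  # (low, high) bounds
    evidence_base_count: int               # number of prior arcs behind the prediction
    caveats: Tuple[str, ...] = ()          # optional


def parse_prediction(payload: Mapping[str, Any]) -> Prediction:
    """Reject any payload that lacks a required field or hides a point estimate."""
    missing = [name for name in REQUIRED_FIELDS if name not in payload]
    if missing:
        raise ValueError(f"missing required fields: {missing}")

    low, high = (float(bound) for bound in payload["confidence_range"])
    # Assumption: a range with low == high is a point estimate in disguise.
    if not low < high:
        raise ValueError("confidence_range must satisfy low < high")

    count = int(payload["evidence_base_count"])
    # Assumption: a prediction with no supporting arcs should never exist.
    if count < 1:
        raise ValueError("evidence_base_count must be at least 1")

    return Prediction((low, high), count, tuple(payload.get("caveats", ())))
```

For example, `parse_prediction({"confidence_range": [-0.1, 0.3], "evidence_base_count": 4})` returns a `Prediction`, while a payload without `evidence_base_count` raises `ValueError`.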
To summarize: the generator produces alternatives, and the outcome predictor scores them against same-deployment evidence. Neither stage runs on thin evidence.
The load-bearing discipline is what happens when the evidence base is thin. The pipeline retrieves a bounded set of similar prior arcs from the same deployment. If no comparable arc exists for the dimension being analyzed, the pipeline refuses before any AI call is made, so nothing is generated. The customer sees an empty-state message: "Not enough comparable data yet. Counterfactuals require at least one similar prior conversation arc on this dimension in your deployment." We refuse to generate rather than risk a hallucinated prediction.
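The refusal gate can be sketched as a check that runs ahead of both stages. This is an illustrative outline under stated assumptions: `run_counterfactuals`, the `generate` and `predict` callables, and the shape of the returned dictionary are all hypothetical; only the ordering (evidence check first, AI calls second) and the empty-state message are taken from the text.

```python
from typing import Any, Callable, Dict, List, Sequence

EMPTY_STATE_MESSAGE = (
    "Not enough comparable data yet. Counterfactuals require at least one "
    "similar prior conversation arc on this dimension in your deployment."
)


def run_counterfactuals(
    similar_arcs: Sequence[Any],
    generate: Callable[[], List[Any]],               # Stage one (an AI call)
    predict: Callable[[Any, Sequence[Any]], Any],    # Stage two (an AI call)
) -> Dict[str, Any]:
    """Refuse on an empty evidence base before either stage is invoked."""
    if not similar_arcs:
        # No AI call has been made at this point.
        return {"status": "refused", "message": EMPTY_STATE_MESSAGE, "counterfactuals": []}

    alternatives = generate()
    scored = [predict(alternative, similar_arcs) for alternative in alternatives]
    return {"status": "ok", "message": None, "counterfactuals": scored}
```

Placing the check before `generate()` rather than after it means a thin evidence base costs no model calls and leaves no ungrounded text that could leak into the UI.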
Customer-facing copy is locked. The UI panel always shows the confidence range, always shows the evidence base count, and the footer disclaimer reads "Predictions are directional and confidence-ranged. They are not guarantees." We never present a counterfactual as deterministic. The combined effect (strict grounding, mandatory confidence labels, refusal-on-thin-evidence) keeps the surface credible enough to be the first place customers go after a flagged anomaly.
Validation runs before the surface opens to general customers. The bar: for ten flagged conversations, generate 2–3 counterfactuals each, and judge each counterfactual on two criteria: (a) whether the alternative is one the bot could plausibly produce with its current capabilities, and (b) whether the prediction is directionally correct. The threshold for opening the surface to all customers is a 70% hit rate across both criteria. The Pattern Library, once populated under DPA opt-in, will allow cross-customer arc retrieval, expanding the evidence base for deployments that do not yet have a deep history of their own.
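The validation bar can be computed as below. This is a sketch under one explicit assumption: "hit rate across both criteria" is read here as the share of counterfactuals that pass both (a) and (b), which is one possible interpretation of the threshold. The names `Judgment`, `hit_count`, and `passes_bar` are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Judgment:
    plausible: bool              # criterion (a): the bot could actually produce it
    directionally_correct: bool  # criterion (b): the prediction points the right way


def hit_count(judgments: Sequence[Judgment]) -> int:
    """Count counterfactuals that satisfy both criteria (assumed reading)."""
    return sum(1 for j in judgments if j.plausible and j.directionally_correct)


def passes_bar(judgments: Sequence[Judgment], threshold_percent: int = 70) -> bool:
    """Return True when the hit rate meets the threshold.

    Integer arithmetic is used so that a result exactly at the threshold
    is not affected by floating-point rounding.
    """
    if not judgments:
        raise ValueError("no judgments to evaluate")
    return hit_count(judgments) * 100 >= threshold_percent * len(judgments)
```

With ten conversations and 2–3 counterfactuals each, the input holds between 20 and 30 judgments; under this reading, 21 hits out of 30 meets the bar exactly and 20 out of 30 does not.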