Strategy · May 2026

Watch the human, not the model.

Every AI eval tool measures the model. Faithfulness, hallucination rate, toxicity, prompt-injection resistance, latency. Every one of them does something real and useful. None of them measures what the model is doing to the humans on the other end. That gap is the entire reason BIE exists.

The observation BIE rests on is uncomfortable: self-report does not track reality. METR ran a randomized controlled study with sixteen experienced developers in July 2025. AI tools made them 19% slower. The same developers had predicted a 24% speedup before the study and, afterward, believed they had been sped up by 20%. That is a 39-percentage-point gap between perceived and measured productivity, in a population that should have been hardest to fool. If senior open-source developers cannot accurately assess whether AI is helping them, no NPS survey is going to give you the truth about your AI product.

The same dynamic shows up in education. Bastani et al. ran a high-school math RCT in Turkey, published in PNAS. Students with the GPT-4 base interface scored 48% higher during AI-assisted practice. The same students scored 17% lower when the AI was removed for testing: a 65-point swing between assisted and unassisted performance. The model was helping the student get the answer; the student was not learning the underlying skill. The students themselves did not notice this happening to them. They could not have told you about it.

And in customer support: Klarna rolled back its OpenAI-powered assistant fifteen months after launching it as a 700-agent replacement. The CEO publicly admitted the company had overestimated AI capabilities and underappreciated the human aspects of service delivery. Resolution rate was up. Time-to-first-response was down. CSAT and NPS on disputes, fraud, and hardship interactions had degraded materially. The volume metrics looked great. The user-side reality had diverged.

These are not edge cases. They are the predictable outcome of measuring the wrong layer. When you measure the model and ask the user how it went, you get a story that is internally consistent but carries no weight on what is actually happening to the people interacting with the product.

BIE is the missing layer. Every human turn after every AI turn gets scored on six dimensions we call Layer Z: trust calibration (are users trusting the AI at the level its actual reliability warrants?), frustration buildup (is friction compounding across turns?), dependency drift (are users becoming more or less self-sufficient over time?), silent abandonment (which cohorts are quietly disengaging?), escalation friction (what is the trust price of "reach a human"?), and comprehension gap (is the user asking a different question than the one the AI is answering?). Each dimension is scored 0–1 with a confidence value and a supporting observation, validated against a Zod schema, and stored with a reasoning chain you can inspect.
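To make the shape of a scored turn concrete, here is a minimal plain-TypeScript sketch. The field and type names are illustrative assumptions, not BIE's actual schema (which is validated with Zod); the dependency-free validator below mirrors the kind of range constraints a Zod schema would enforce.

```typescript
// Illustrative shape of one scored human turn. Names are assumptions;
// BIE validates its real records with a Zod schema.
type Dimension =
  | "trust_calibration"
  | "frustration_buildup"
  | "dependency_drift"
  | "silent_abandonment"
  | "escalation_friction"
  | "comprehension_gap";

interface DimensionScore {
  score: number;       // 0–1
  confidence: number;  // 0–1
  observation: string; // the evidence the score rests on
}

interface TurnScore {
  turnId: string;
  scores: Record<Dimension, DimensionScore>;
  reasoning: string[]; // inspectable reasoning chain
}

// Mirrors the range constraints a schema validator would enforce.
function isValid(t: TurnScore): boolean {
  return Object.values(t.scores).every(
    (d) =>
      d.score >= 0 && d.score <= 1 &&
      d.confidence >= 0 && d.confidence <= 1 &&
      d.observation.length > 0
  );
}
```

The point of the `observation` and `reasoning` fields is the inspectability claim above: a score without its evidence is just another opaque metric.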

The customer-visible deliverable is a weekly memo, the User Health Report, written by an agent in a senior analyst's voice that reads the deployment data and tells you the one specific thing to fix this week. It includes three falsifiable predictions for the coming week, and each memo evaluates last week's predictions against this week's data, so you can disagree with the reasoning, not just the conclusion. Anomalies fire to your Slack within sixty seconds. Counterfactuals get generated for failure turns: "your AI said X; here is what we predict would have happened with these alternatives, grounded in N similar prior arcs in your own deployment." None of this is invented; nothing is unexplained.
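"Falsifiable" has a concrete meaning here: a prediction names a metric, a direction, and a bound, so next week's memo can mark it right or wrong against observed data. A hypothetical sketch of that record, with all names assumed rather than taken from BIE's actual format:

```typescript
// Hypothetical shape of one weekly prediction (illustrative only;
// not BIE's real memo format).
interface Prediction {
  metric: string;                 // e.g. a Layer Z dimension rate
  comparator: "above" | "below";  // predicted direction vs. threshold
  threshold: number;
  rationale: string;              // why the analyst agent expects this
}

// Next week's memo grades last week's prediction against observed data.
function grade(p: Prediction, observed: number): "held" | "falsified" {
  const held =
    p.comparator === "above" ? observed > p.threshold
                             : observed < p.threshold;
  return held ? "held" : "falsified";
}
```

A prediction in this form can only be right or wrong; there is no room for a post-hoc story, which is what makes the weekly self-grading credible.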

We are not trying to replace your eval stack. Galileo, Patronus, Langfuse, and Datadog all measure something real on the model side. BIE measures the human side. Both layers belong in your stack. Most VPs of AI we talk to already have the model-side instrumentation and do not notice the human-side gap until they see what BIE produces on their own data. The fastest way to see that is to upload up to 10K conversations to the Free Behavioral Audit. No login. No card. The full report is the free report.