About

We don't ask the user. We watch the user.

An AI eval tool measures what the model says. We measure what happens next.

The thesis

What your AI is doing to people is unmeasured.

Every AI product team ships with a benchmark suite, a hallucination check, an eval pipeline. Then a CSAT survey. Then a dashboard for ticket volume.

None of that is the question.

The question is what your AI is doing to the people on the other end.

The premise everything here is built on

That is what we built. An engine that watches the human side of the conversation, on dimensions nobody else is watching for yet: trust calibration, frustration buildup, and the quieter signals that go uninstrumented until the user is already gone.

The model layer is crowded. Faithfulness, toxicity, latency, cost per turn: every vendor measures the output. We sit one layer over, on the side of the conversation where the human actually is, and we read what the output is doing to them turn by turn.

How we operate

Four rules. No exceptions.

01
Environment first.

The deployment is the subject. Individuals are evidence. We report on what is happening to the people inside your product and never build a dossier on any one of them.

02
Falsifiable, always.

Every claim carries the thing that would prove it wrong. If a finding cannot be broken by evidence, the engine does not print it. A prediction earns its place only when it has staked out the conditions under which it would fail.

03
No fabricated intelligence.

If the data doesn't support it, the engine doesn't say it. We characterize the environment before we analyze it, and we refuse to manufacture a finding the evidence won't carry.

04
Operators decide.

We draft. We watch. We never deliver to your users. The engine writes the report and surfaces the anomaly. A person on your team reads it and decides what goes out.

Why now

AI is shipping faster than the layer that watches it.

Every week another AI product ships with no instrument for what it's doing to people. The model gets measured a dozen ways before launch. The human on the other end gets a thumbs-up button and a survey nobody fills out.

We're building the instrument.

Colophon
Edition 2026 · open by default
Display
Fraunces sets every finding, title, and verdict. Cut austere, optical sizing on, authority through restraint rather than weight.
Text
Newsreader sets the prose. Seventeen on sixty-two, measure capped near sixty-six characters, the register of a memo a senior analyst would sign.
Interface
Inter carries the chrome. Navigation, buttons, the working surfaces. It stays quiet so the document can speak.
Apparatus
JetBrains Mono renders every label, timestamp, dimension name, and number. All numerals are tabular and lining, so a column reads as a column.
Dataset
The research baseline is a published corpus across 9 deployment archetypes. The findings, the methodology, and the generator that produced the data are open. Every median is checkable from the source it points at. github.com/bie/research
Generator
The synthetic-conversation generator is open-source, so anyone can reproduce the baseline rather than take our word for it. github.com/bie/generator
Stance
Environment-first: the deployment is the subject. Falsifiable: a claim states how it could be wrong. And no clinical language about the people we measure, ever. We use the dimension names, not diagnoses.

Set in Fraunces, Newsreader, Inter, and JetBrains Mono.
No customer logos. No testimonials we cannot stand behind. Research is the evidence.
Built with conviction. © 2026 Behavioral Intelligence Engine.

The audit is free

Show us yours.

Send the conversations your AI is already having. We'll read the human side and tell you what landed.

up to 10K conversations · about 30 min to your inbox · no card