Reproducibility runbook

Regenerate the dataset from scratch.

The dataset, the generator, and the prompt templates are open. A researcher with an Anthropic API key and roughly $400 of budget can regenerate the BIE Research Dataset v1.0 from scratch and inspect every step the pipeline took.

What you'll need

An Anthropic API key with sufficient credit. The full dataset run targets approximately 13,500 conversations against Claude Sonnet at temperatures 0.4 and 0.8 (the published v1 run yielded 12,643 after validation discards). Estimated cost is approximately $400 if run sequentially without the Batch API, or approximately $200 if routed through the Batch API for the 50% discount.
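The cost figures above reduce to simple arithmetic. The following is a rough sketch, not part of the generator: `estimateCost` is a hypothetical helper, and it assumes cost scales linearly with conversation count, which the single-archetype run later in this runbook lets you check against your own account.

```typescript
// Back-of-envelope budget check using the approximate figures quoted above.
// estimateCost is a hypothetical helper, not part of the generator.
const TARGET_CONVERSATIONS = 13_500; // full-run target
const SEQUENTIAL_COST_USD = 400;     // approximate, without the Batch API
const BATCH_DISCOUNT = 0.5;          // Batch API discount

function estimateCost(conversations: number, useBatch: boolean): number {
  // Assumption: every conversation costs about the same.
  const perConversation = SEQUENTIAL_COST_USD / TARGET_CONVERSATIONS;
  const cost = conversations * perConversation;
  return useBatch ? cost * (1 - BATCH_DISCOUNT) : cost;
}

console.log(estimateCost(13_500, false).toFixed(2)); // full run, sequential
console.log(estimateCost(13_500, true).toFixed(2));  // full run, Batch API
console.log(estimateCost(100, false).toFixed(2));    // a 100-conversation trial
```

Treat the result as a planning number only; the per-archetype _index.json reports the actual cost of a run.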

A machine with Node.js 20 or later installed. The generator uses native ES modules; no other system dependencies are required.

Approximately 6 hours of wall-clock time if you run one archetype at a time, or approximately 1.5 hours if you parallelize across archetypes. The Batch API path can take up to 24 hours of queue time per submission but costs roughly half as much.

The public repository at https://github.com/jagaacharya/bie-research-generator, which contains the generator source, the prompt templates, the 122-intent taxonomy, and the published v1.0 dataset for direct comparison.

Clone and install


git clone https://github.com/jagaacharya/bie-research-generator
cd bie-research-generator
npm install

Then set your Anthropic API key:


export ANTHROPIC_API_KEY=sk-ant-...

Alternatively, put the key in a .env.local file at the repo root; if that file is present, the generator loads it into process.env through dotenv.
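A missing key is easier to catch before the first API call than after it. The following is a minimal preflight sketch; `requireApiKey` is a hypothetical helper, and the generator's own startup checks may differ.

```typescript
// Hypothetical preflight check; the generator's own startup logic may differ.
type Env = Record<string, string | undefined>;

function requireApiKey(env: Env): string {
  // Trim so that a stray space or newline from a copy-paste does not slip through.
  const key = env["ANTHROPIC_API_KEY"]?.trim();
  if (!key) {
    throw new Error(
      "ANTHROPIC_API_KEY is not set; export it or add it to .env.local",
    );
  }
  return key;
}
```

In a Node.js script this would be called as requireApiKey(process.env) before any request is made.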

Smoke test: five hand-picked samples

Before running the full dataset, generate five hand-picked samples, chosen for variety, to confirm everything works end-to-end. This costs roughly $0.50 and takes about two minutes.


npm run samples

The samples land at data/_samples/. Open them in a text editor and confirm the conversations read like a real person talking to a bot, not two models trading stilted lines. If they look right, the pipeline is configured correctly and you can proceed to the full run.

Generate one archetype

Start with a single archetype to estimate cost and runtime at the latency your own Anthropic account actually sees.


npx tsx generate.ts \
  --archetype cx_chatbot \
  --count 100 \
  --seed-start 1

Output lands at data/cx_chatbot/. Each conversation is a single JSON file containing { spec, turns }. The per-archetype _index.json summarizes the run: counts, distributions, token usage, and total cost.
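A quick structural check on a generated file can catch a broken run early. In the sketch below, only the top-level keys spec and turns come from this runbook; treating turns as an array is an assumption, and `parseConversation` is a hypothetical helper, not part of the generator.

```typescript
// Only the top-level keys `spec` and `turns` come from this runbook.
// Treating `turns` as an array is an assumption about the file format.
interface ConversationFile {
  spec: unknown;
  turns: unknown[];
}

function parseConversation(json: string): ConversationFile {
  const value: unknown = JSON.parse(json);
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    throw new Error("conversation file is not a JSON object");
  }
  const record = value as Record<string, unknown>;
  if (!("spec" in record)) {
    throw new Error("missing spec");
  }
  const turns = record["turns"];
  if (!Array.isArray(turns)) {
    throw new Error("turns is missing or not an array");
  }
  return { spec: record["spec"], turns };
}
```

Pass it the text of any file under data/cx_chatbot/ and inspect turns.length; the inner shape of spec and of each turn is documented in the repository, not here.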

Compare your index to the published per-archetype index in the repo at data/cx_chatbot/_index.json. Your distribution numbers should match within sampling variance; the conversation content itself will differ because Claude Sonnet's output is not strictly deterministic for a given seed.
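One way to put a number on "within sampling variance" is to compare category shares instead of raw counts. The sketch below assumes a distribution can be read out of each index as a map from label to count; that shape, the labels, and the helper names are all hypothetical, since this runbook does not spell out the _index.json schema.

```typescript
// Hypothetical shape: a distribution is a map from category label to count.
type Counts = Record<string, number>;

// Convert raw counts to fractions of the total.
function toShares(counts: Counts): Record<string, number> {
  const total = Object.values(counts).reduce((sum, n) => sum + n, 0);
  const shares: Record<string, number> = {};
  for (const [label, n] of Object.entries(counts)) {
    shares[label] = total === 0 ? 0 : n / total;
  }
  return shares;
}

// Largest absolute difference in share across all labels seen in either run.
function maxShareGap(published: Counts, regenerated: Counts): number {
  const a = toShares(published);
  const b = toShares(regenerated);
  const labels = Array.from(new Set(Object.keys(a).concat(Object.keys(b))));
  let gap = 0;
  for (const label of labels) {
    gap = Math.max(gap, Math.abs((a[label] ?? 0) - (b[label] ?? 0)));
  }
  return gap;
}
```

What counts as an acceptable gap depends on sample size: a 100-conversation trial will wander further from the published shares than a 1,500-conversation run.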

Generate the full dataset

Once the single-archetype run looks right, generate the full dataset across all nine archetypes. The target is 1,500 conversations per archetype (13,500 total); expect slightly fewer after validation discards. The published v1 run produced 12,643.
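To size "slightly fewer", the published numbers give a retention rate of 12,643 out of 13,500. The sketch below applies that rate to other request sizes; it assumes the discard rate is uniform across archetypes and stable across runs, which this runbook does not claim, and `expectedYield` is a hypothetical helper.

```typescript
// Retention implied by the published v1 run. Applying it to other runs
// assumes a uniform, stable discard rate, which is not guaranteed.
const TARGET_TOTAL = 13_500;
const PUBLISHED_TOTAL = 12_643;
const retention = PUBLISHED_TOTAL / TARGET_TOTAL;

function expectedYield(requested: number): number {
  return Math.round(requested * retention);
}

console.log((retention * 100).toFixed(1) + "% retained");
console.log(expectedYield(1_500) + " conversations expected per archetype");
```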


npx tsx generate.ts --all --count 1500

The runner processes archetypes sequentially. To parallelize, run the per-archetype command in nine separate shells, each with a different --archetype value.
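Typing nine near-identical commands by hand invites mistakes, so it can help to generate them. The sketch below uses only the flags shown in this runbook; `buildCommand` is a hypothetical helper, and because only cx_chatbot is named here, the other eight archetype identifiers are left for you to fill in from the repository's taxonomy.

```typescript
// Builds one command line per archetype, using only flags shown in this runbook.
function buildCommand(archetype: string, count: number): string[] {
  return [
    "npx", "tsx", "generate.ts",
    "--archetype", archetype,
    "--count", String(count),
  ];
}

// Only cx_chatbot is named in this runbook; add the other eight
// archetype identifiers from the repository's taxonomy.
const archetypes: string[] = ["cx_chatbot"];

for (const archetype of archetypes) {
  console.log(buildCommand(archetype, 1500).join(" "));
}
```

Each printed line can be pasted into its own shell. Returning the command as an argument array also makes it straightforward to hand to Node's child_process.spawn if you prefer to launch the runs from a single script.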

For the lowest-cost path, use the Batch API mode (50% discount, up to 24 hours of queue time):


npx tsx generate.ts --all --count 1500 --batch

What reproducibility means here

The dataset's reproducibility property is at the structural level, not the byte level. Re-running the generator with the same seed input produces a BehavioralSpec with the same scenario shape (same archetype, intent, health state, arc-length target, expected Layer Z values, persona templates). The conversation acted out from that spec has the same complications, the same resolution target, and the same Layer Z trajectory, with naturalistic variance only in the dialogue surface itself.

Spec-level reproduction is close to deterministic, though not exact: the spec generator runs at temperature 0.4 and its output is validated against a strict Zod schema, so fields either match or fall within the documented variance bands. Conversation-level reproduction preserves structure but not dialogue text, because the conversation generator runs at temperature 0.8 for naturalistic variance.

To confirm your regenerated dataset matches the published one at the structural level: pick any seed, generate the spec, and diff against the published spec field in the matching c_<archetype>_NNNN.json file. Field values should be either identical or within the documented variance bands.
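The spec diff can be done with any JSON diff tool. As a minimal sketch of the idea, the function below walks two parsed specs and lists the paths where they differ; `diffPaths` is a hypothetical helper, the field names in the usage example are invented for illustration, and it does not know about the documented variance bands, so every path it reports still needs to be checked against them by hand.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

function isObject(value: Json): value is { [key: string]: Json } {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Returns the paths (rooted at "$") where the two values differ.
function diffPaths(a: Json, b: Json, path = "$"): string[] {
  const out: string[] = [];
  if (Array.isArray(a) && Array.isArray(b)) {
    const length = Math.max(a.length, b.length);
    for (let i = 0; i < length; i++) {
      const left = a[i];
      const right = b[i];
      if (left === undefined || right === undefined) out.push(`${path}[${i}]`);
      else out.push(...diffPaths(left, right, `${path}[${i}]`));
    }
    return out;
  }
  if (isObject(a) && isObject(b)) {
    const keys = Array.from(new Set(Object.keys(a).concat(Object.keys(b)))).sort();
    for (const key of keys) {
      const left = a[key];
      const right = b[key];
      if (left === undefined || right === undefined) out.push(`${path}.${key}`);
      else out.push(...diffPaths(left, right, `${path}.${key}`));
    }
    return out;
  }
  // Scalars, null, or mismatched container types.
  return a === b ? [] : [path];
}

// Field names below are illustrative only, not the real BehavioralSpec schema.
const published: Json = { archetype: "cx_chatbot", arc: { target: 6 } };
const regenerated: Json = { archetype: "cx_chatbot", arc: { target: 7 } };
console.log(diffPaths(published, regenerated));
```

An empty result means the two specs are identical; anything else is a list of fields to compare against the variance bands.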

To confirm at the Layer Z level: run a representative subset through a behavioral classifier and aggregate to archetype-level medians. The medians from the published dataset are documented at /research/methodology; your regenerated dataset should produce medians within Sonnet's natural sampling variance.
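The aggregation step is a group-by followed by a median. The sketch below assumes the classifier yields one numeric value per conversation, tagged with its archetype; that `Score` shape is hypothetical, since this runbook does not specify the classifier's output format.

```typescript
// Hypothetical classifier output: one numeric value per conversation.
interface Score {
  archetype: string;
  value: number;
}

function median(values: number[]): number {
  if (values.length === 0) throw new Error("median of empty list");
  const sorted = [...values].sort((x, y) => x - y);
  const mid = Math.floor(sorted.length / 2);
  // Odd length: the middle element. Even length: mean of the two middle elements.
  return sorted.length % 2 === 1
    ? sorted[mid]!
    : (sorted[mid - 1]! + sorted[mid]!) / 2;
}

function mediansByArchetype(scores: Score[]): Map<string, number> {
  const groups = new Map<string, number[]>();
  for (const { archetype, value } of scores) {
    const bucket = groups.get(archetype) ?? [];
    bucket.push(value);
    groups.set(archetype, bucket);
  }
  const result = new Map<string, number>();
  groups.forEach((values, archetype) => result.set(archetype, median(values)));
  return result;
}
```

Run it once over the published subset and once over your regenerated subset, then compare the two maps archetype by archetype.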

Inspect, fork, or extend

The full design document for the generator pipeline is at docs/generator-design.md in the repository. It describes the two-stage architecture, the BehavioralSpec schema, the prompt templates, the temperature choices, and the quality-enforcement checks. Read this before modifying the generator.

The intent taxonomy (122 intents across 9 archetypes) is at docs/intent-taxonomies.md for human reading and taxonomies.ts for machine reading. Adding a new intent requires updating both.

If you fork the generator and produce a derived dataset, cite the original methodology piece (see the BibTeX entry at /research) and note your modifications. The license is CC BY 4.0 for the dataset and MIT for the code; both permit commercial and derivative use with attribution.

If something doesn't reproduce

If a regenerated spec diverges from the published spec for the same seed in a way that exceeds the documented variance bands, that is a reproducibility issue worth reporting. Open an issue on the GitHub repo with the seed, the published spec, and the regenerated spec, and we will investigate.

If the regenerated conversations look qualitatively different from the published ones (stilted, AI-flavored, missing complications), check that your Anthropic SDK version matches the one pinned in package.json and that the prompt templates have not been locally modified.
