Hill-climb your AI features without regressing. Every fix verified against the customer scenarios that matter, before it ships.
$ npx bitfab-cli init
Teams already extending their software factories with Bitfab.
Give your coding agents the infrastructure to identify and replay customer scenarios end-to-end. Works with Claude Code, Cursor, and Codex.
Agent workflow diagram: Claude Code, Bitfab connected. Prompt: "Identify failing customer scenarios in our talk-to-threads feature, then ship fixes that pass them all."
bitfab · Traces: 4,218 traces analyzed · Dataset: dataset:talk-to-threads-v3, 50 traces · Label: 42 traces labeled · Replay: 44 / 50 passed
LLM-powered data pipelines · Agents and chat bots · 92% customer scenarios passed
Pass rate trend
Built for engineering teams building software factories: engineers running 4+ coding agents at a time.
Collaborate with your coding agent to identify and set up customer scenarios so they can be easily tested. Your agent searches production traces with you, surfaces the scenarios that actually matter, and turns them into reusable datasets you can rerun any time you change a prompt or model.
Precise data labeling with your coding agent. Tighten when you need to. The agent annotates the spans that matter and collapses noisy traces. Override whatever’s wrong. The dataset reflects your and your customer’s judgment, not the agent’s.
Bitfab · chatbot · Trace with annotations ✓ Done
trace · chatbot/order_status · 2.4s · 5 spans · 3 annotated
classify_intent llm_call 45ms
search_orders tool_call 1.2s
Claude · performance · fail
search_orders is most of the wall clock. A bounded query or cache knocks this under a second.
fetch_order tool_call 210ms
Claude · wrong tool call · fail
Called with order_id=4821. Customer wrote #4821-A, so the wrong order was returned.
generate_response llm_call 890ms
You · missing context · fail
Trace shows a 3-day shipping delay; the reply never mentions it. Override for the dataset.
2 labels by Claude · 1 human edit · Approve →
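As a rough illustration of how a human override could take precedence over agent labels when a trace is folded into a dataset, here is a minimal Python sketch. The Annotation type, the resolve_labels function, and the "agent"/"human" convention are hypothetical and are not Bitfab's API. The span names and labels come from the example trace above, except for the agent's original verdict on generate_response, which the trace does not show and which is assumed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    span: str     # span name as it appears in the trace
    author: str   # "agent" or "human" (hypothetical convention)
    label: str    # e.g. "performance", "wrong tool call"
    verdict: str  # "pass" or "fail"


def resolve_labels(annotations):
    """Keep one annotation per span; a human annotation replaces an agent one."""
    resolved = {}
    for ann in annotations:
        current = resolved.get(ann.span)
        if current is None or (current.author == "agent" and ann.author == "human"):
            resolved[ann.span] = ann
    return resolved


annotations = [
    Annotation("search_orders", "agent", "performance", "fail"),
    Annotation("fetch_order", "agent", "wrong tool call", "fail"),
    # Assumed for illustration: the agent's original verdict on this span is not shown.
    Annotation("generate_response", "agent", "no issue", "pass"),
    Annotation("generate_response", "human", "missing context", "fail"),
]

resolved = resolve_labels(annotations)
for span, ann in resolved.items():
    print(f"{span}: {ann.label} ({ann.verdict}, by {ann.author})")
```

The point of the design is that the agent's labels are a first draft: whatever a person edits is what ends up in the dataset.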
Find fixes by automatically rerunning customer scenarios as verification. Rerun experiments locally or in the cloud on production trace data with custom sandboxes.
Request a custom sandbox
replay · dataset: search-quality-v3 · 50 scenarios · 46 pass · 3 fail · 1 changed · baseline: prod@main · candidate: prompt-v4
✓ #07
User asks about refund window · prompt edit
prod: "Refunds are available within 30 days."
Candidate output
v4: "Refunds within 30 days of purchase, no questions asked."
Clearer phrasing, intent preserved.
✕ #23
User asks for order #4521 status · retrieval config
prod: "Order #4521 shipped Monday, size 10 Pegasus."
Candidate output
v4: "I don't have access to order details right now."
New filter dropped the order-history index.
⊕ #31
User requests refund for damaged item · tool schema
prod: process_refund(order=4521, reason="damaged")
Candidate tool call
v4: escalate_to_human(reason="damage refund")
Tool renamed. Model chose safer path.
Replay completed in 47s against isolated sandbox. 47 of 50 shown
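To make the replay idea concrete, here is a minimal Python sketch of rerunning recorded scenarios against a candidate and tallying pass / fail / changed. Everything in it is an assumption for illustration, not Bitfab's implementation: the Scenario fields, the replay and judge functions, the must_contain check, and the stubbed candidate. The inputs and outputs are copied from the three scenarios shown above.

```python
from collections import Counter
from typing import NamedTuple, Optional


class Scenario(NamedTuple):
    scenario_id: str
    user_input: str
    baseline_output: str         # what production returned for this trace
    must_contain: Optional[str]  # hypothetical automatic check; None = no check


def judge(scenario, output):
    """Toy judge: substring check if one exists, otherwise flag any difference."""
    if scenario.must_contain is None:
        # No automatic check: identical output passes, any difference goes to review.
        return "pass" if output == scenario.baseline_output else "changed"
    return "pass" if scenario.must_contain in output else "fail"


def replay(scenarios, candidate):
    """Run the candidate on every scenario and count the judge's verdicts."""
    verdicts = Counter()
    for s in scenarios:
        verdicts[judge(s, candidate(s.user_input))] += 1
    return verdicts


def pass_rate(verdicts):
    total = sum(verdicts.values())
    return verdicts["pass"] / total if total else 0.0


scenarios = [
    Scenario("#07", "User asks about refund window",
             'Refunds are available within 30 days.', "30 days"),
    Scenario("#23", "User asks for order #4521 status",
             'Order #4521 shipped Monday, size 10 Pegasus.', "#4521"),
    Scenario("#31", "User requests refund for damaged item",
             'process_refund(order=4521, reason="damaged")', None),
]

# Stub standing in for the candidate (prompt-v4); a real run would call the model.
candidate_outputs = {
    "User asks about refund window":
        "Refunds within 30 days of purchase, no questions asked.",
    "User asks for order #4521 status":
        "I don't have access to order details right now.",
    "User requests refund for damaged item":
        'escalate_to_human(reason="damage refund")',
}

verdicts = replay(scenarios, candidate_outputs.__getitem__)
print(dict(verdicts), f"pass rate: {pass_rate(verdicts):.0%}")
```

Applied to the tallies reported for the full replay above (46 pass, 3 fail, 1 changed), the same pass_rate calculation gives 46 / 50 = 92%.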
Auto-research for your AI stack using your customer data. Book a demo