QA built for conversational AI

Now in beta

Human testing & tracking for AI agents.

30 agents × 50 tests × 3 environments = 4,500 test runs, with status visible at a glance. In private beta.

obodek.app / acme-support-bot / sprint-24 / regression

Sprint 24 Regression · 84 cases

Run grid

Test case                           Reviewer   Turns   Status
Refund > 30 days, gift purchase     jordan     7       Pass
Address change mid-checkout         priya      4       Pass
Escalate to human, after hours      mateo      3       Review
PII redaction, voice transcript     jordan     11      Fail
Multi-language switch (es → en)     -          -       Queued
Politely refuse out-of-scope ask    -          -       Queued

The Problem

Conversational AI broke your QA stack.

The tools and rituals built for deterministic software don't survive contact with non-deterministic agents.

⚠

Generic tools weren't built for conversation

Test-case managers assume one input, one output. Real agent QA spans multi-turn dialogue, tone, memory, and edge-case recovery.

⚠

Automation misses what humans catch

LLM evals score outputs. Humans catch the subtle failures: passive-aggressive tone, brittle escalation, the answer that's technically right but wrong for your brand. Obodek wraps both into one workflow.

⚠

Spreadsheets don't enforce QA gates

Tabs full of test results can't block a bad release. You need environment gates, audit trails, and sign-off — not another Google Sheet.

Where we sit

AI-agent-native. Human-led. Alone here.

The market has split between automated eval platforms and generic test management tools. Obodek is the only product in the AI-agent-native, human-led quadrant, the one that matters for shipping agents responsibly.

Platform

Everything your QA org needs to ship agents.

From structured test grids to release gates, Obodek replaces six tools with one system of record for agent quality.

🧪

Test Grids

Structured multi-turn cases with reviewer assignment, rubrics, and pass/fail criteria — versioned per agent.
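As a rough illustration of what a structured multi-turn case could look like as data, here is a minimal Python sketch. Every name in it (`Turn`, `TestCase`, `agent_version`, `record_verdict`) is hypothetical and not taken from the Obodek SDK, and the dialogue and rubric lines are invented, abbreviated placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical shapes for illustration only; not Obodek's actual schema.

@dataclass
class Turn:
    speaker: str  # "user" or "agent"
    text: str

@dataclass
class TestCase:
    name: str
    agent_version: str                               # cases are versioned per agent
    turns: List[Turn] = field(default_factory=list)
    rubric: List[str] = field(default_factory=list)  # criteria the reviewer checks
    reviewer: Optional[str] = None                   # None until someone is assigned
    status: str = "Queued"                           # Queued, Review, Pass, or Fail

    def record_verdict(self, reviewer: str, passed: bool) -> None:
        """Assign the reviewer and store their pass/fail verdict."""
        self.reviewer = reviewer
        self.status = "Pass" if passed else "Fail"

# Invented sample content; a real case would carry the full conversation.
case = TestCase(
    name="Address change mid-checkout",
    agent_version="acme-support-bot@sprint-24",
    turns=[
        Turn("user", "I need to change my shipping address."),
        Turn("agent", "Sure, what is the new address?"),
    ],
    rubric=["Confirms the new address before saving", "Keeps the cart intact"],
)
case.record_verdict("priya", passed=True)
```

Keeping the rubric and verdict on the case itself is one way to make a grid row self-describing; a real schema would likely also track history per agent version.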

🚦

Environment Gating

Block promotion from staging to production unless required test grids pass at your defined threshold.
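One way to picture a threshold gate is the small Python sketch below. It is not Obodek's implementation: the function names are hypothetical, and it assumes that any case not yet marked Pass (including Review and Queued) counts against the gate.

```python
from typing import Dict, List

def pass_rate(statuses: List[str]) -> float:
    """Fraction of cases marked Pass; an empty grid is treated as 0.0."""
    if not statuses:
        return 0.0
    return sum(1 for s in statuses if s == "Pass") / len(statuses)

def can_promote(required_grids: Dict[str, List[str]], threshold: float) -> bool:
    """Allow promotion only if every required grid meets the threshold.

    Assumption: with no required grids configured, nothing blocks promotion.
    """
    return all(pass_rate(s) >= threshold for s in required_grids.values())

# Statuses mirroring the six sample rows in the run grid mock.
grids = {
    "regression": ["Pass", "Pass", "Review", "Fail", "Queued", "Queued"],
}
blocked = not can_promote(grids, threshold=0.95)
```

With two of six cases passing, the sample grid sits at roughly 33%, so a 95% threshold would keep the release in staging.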

🐞

Bug Reporting

Reviewers file bugs from inside a test run — full transcript, trace, and reproduction context attached.

📚

Prep Library

Reusable personas, intents, and seed conversations so new test grids start at 80%, not zero.

📝

Change Log

Every prompt, tool, model, and dataset change is timestamped — so you know exactly what broke when.

🔒

Audit Trail

Immutable record of who reviewed what, when, and what verdict they gave. Every action logged. Every reviewer named.

How it works

Three steps from prompt to production.

Wire Obodek into your agent pipeline once. Then every release flows through the same gate.

Set up your agent

Connect via API or SDK. Define environments, reviewers, and the rubric you care about.

Run structured tests

Build grids of multi-turn cases. Mix human review with automated checks for breadth and depth.

Gate and promote

Releases are blocked until quality bars are met. Ship to production with the receipts.

Pricing

Start free. Pay when you ship.

Simple, predictable plans. No per-seat surprises, no per-evaluation gotchas.

Free

$0/forever

For solo builders evaluating their first agent.

1 agent workspace
50 test runs per month
2 reviewers
7-day history
Email support

Most Popular

⚡ Founding Member Pricing: limited to the first 50 customers

Pro

$99/mo

Up to 10 agents

For QA teams shipping agents to real users.

Up to 10 agent workspaces
Unlimited test grids & runs
Environment gating & release blocks
Up to 20 reviewers
Bug reporting + change log
1-year history & exports
Priority email support

Additional agents: $14.99/agent/mo — locked at founding rate

Founding rate locks in as long as you stay subscribed.
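To estimate a monthly Pro bill, the arithmetic can be sketched as follows. This assumes the $99 base covers the first 10 agents and each agent beyond that is billed at a flat $14.99; the function name is hypothetical, and taxes or other billing details are not modeled.

```python
from decimal import Decimal

BASE_PRICE = Decimal("99.00")         # Pro plan, per month
INCLUDED_AGENTS = 10                  # agents assumed covered by the base price
EXTRA_AGENT_PRICE = Decimal("14.99")  # per additional agent, per month

def pro_monthly_cost(agents: int) -> Decimal:
    """Estimated monthly Pro cost for a given number of agents."""
    if agents < 1:
        raise ValueError("agents must be at least 1")
    extra = max(0, agents - INCLUDED_AGENTS)
    return BASE_PRICE + extra * EXTRA_AGENT_PRICE

cost_for_13 = pro_monthly_cost(13)  # 99.00 + 3 * 14.99 = 143.97
```

Under these assumptions, a team running 13 agents would pay $143.97 per month.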

Enterprise

Custom

For regulated industries and large agent fleets.

Unlimited agents
SSO & role-based access
Immutable audit trail
Dedicated CSM & SLAs
Enterprise security roadmap (coming soon)

14-day free trial on Pro. No credit card required.

FAQ

Questions, answered.

How is Obodek different from LLM eval frameworks?

Eval frameworks score outputs. Obodek is a QA platform — it manages reviewers, enforces release gates, tracks bugs against test cases, and gives you an audit trail. We integrate with eval frameworks; we don't replace the scoring, we replace the workflow around it.

Do reviewers need engineering skills?

No. The reviewer experience is built for QA analysts, support leads, and domain experts. Engineers wire up the agent and the gates; everyone else runs grids and files bugs through the UI.

What does Obodek connect to?

Any agent reachable via HTTPS — OpenAI, Anthropic, custom orchestrators, voice stacks like Vapi or Retell. We ship SDKs for TypeScript and Python, plus webhooks for CI integration.

How do you handle voice agents?

Voice runs are captured with synchronized transcript, audio, and tool-call trace. Reviewers can scrub the audio inline while marking turns pass or fail — latency and tone both count.

Is our data used to train models?

Never. Customer conversations, test grids, and reviewer notes are tenant-isolated and never used for model training.

Ship agents your team can stand behind.

Set up your first test grid in under 10 minutes.

Contact

Talk to the team.

Have a question about pricing, security, or how Obodek fits your stack? Send us a note.