Documentation

Voice Agent
Command Center

The complete QA platform for AI voice agents. Audit, stress-test, grade, and auto-fix your agents across Retell AI, Vapi, LiveKit, and ElevenLabs from a single browser-based dashboard.

What is VoxGrade?

VoxGrade is a browser-based QA platform purpose-built for AI voice agents. It gives you a structured, repeatable process to find failures, fix prompts, and verify improvements before your clients hear them.

The platform is organized into six core modules:

Agent Auditor

Connect an agent, run a 30-point audit, simulate conversations via text and voice, get copy-paste fixes for every failure.

Testing Lab

View test results, grade calls with a weighted rubric, auto-generate improved prompts, and compare before/after scores.

CI/CD API

Run tests programmatically from your deployment pipeline. Block deploys when agent quality drops below your threshold.

Production Monitoring

Ingest production calls, set alerting rules, and track real-world agent performance with automated scoring and anomaly detection.

Red-Team Testing

Adversarial attack simulations that probe for prompt injection, jailbreaks, data exfiltration, and compliance violations.

Fleet Management

Manage and test multiple agents at once. Compare scores across your fleet, identify underperformers, and batch-apply fixes.

VoxGrade stores your data securely server-side to power features like version history, analytics, the learning engine, and scheduled reports. API keys are encrypted at rest.

Requirements

Before you begin, make sure you have the following:

Voice Platform API Key

Used to connect to your voice agents, fetch prompts, push updates, and trigger calls. VoxGrade supports Retell AI, Vapi, LiveKit, and ElevenLabs.

Where to find it
  • Retell AI: retellai.com → Settings → API Keys
  • Vapi: vapi.ai → Dashboard → API Keys
  • LiveKit: livekit.io → Project Settings → Keys
  • ElevenLabs: elevenlabs.io → Profile → API Keys

OpenRouter API Key

Powers the AI grading engine that scores your agent's responses across the 6-dimension rubric. Used for text simulations, auto-optimizer, red-team testing, and insights generation.

Where to find it
  1. Go to openrouter.ai/keys
  2. Sign in or create an account
  3. Click "Create Key"
  4. Copy your API key
Format
sk-or-v1-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

An Agent ID

The unique identifier for the specific voice agent you want to test. Each platform uses its own ID format.

Formats by platform
  • Retell AI: agent_xxxxxxxxxxxxxxxxxxxxxxxx
  • Vapi: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
  • LiveKit: agent-name or agent ID from dashboard
  • ElevenLabs: agent ID from Conversational AI page
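If you validate inputs in your own tooling before connecting, a lightweight sanity check can catch malformed IDs early. This is an illustrative sketch only — the patterns are loose guesses derived from the placeholder formats above, not official platform validation rules:

```typescript
// Illustrative agent ID sanity checks. The regexes are assumptions based on
// the placeholder formats above, not official platform validation rules.
const idPatterns: Record<string, RegExp> = {
  retell: /^agent_[A-Za-z0-9]{16,}$/, // "agent_" prefix + alphanumeric suffix
  vapi: /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i, // UUID
};

function looksLikeAgentId(platform: string, agentId: string): boolean {
  const pattern = idPatterns[platform];
  // LiveKit and ElevenLabs IDs vary, so only require a non-empty value.
  return pattern ? pattern.test(agentId) : agentId.trim().length > 0;
}
```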

Cost note: Text simulations and AI grading use OpenRouter credits (typically $0.05-0.10 per run). Voice simulations use your platform's telephony credits ($0.80-1.60 for all 5 scenarios). These costs are billed through your own API keys with no markup from VoxGrade.

Supported Platforms

VoxGrade connects to multiple voice AI platforms through a unified adapter layer. Test and optimize agents regardless of which platform they run on.

Feature | Retell AI | Vapi | LiveKit | ElevenLabs
Agent fetch | Full | Full | Partial | Partial
30-point audit | Full | Full | Full | Full
Text simulations | Full | Full | Full | Full
Voice simulations | Full | Full | Coming soon | Coming soon
Auto-optimizer push | Full | Full | Manual | Manual
Production call ingest | Full | Full | Full | Full

Quick Start

Go from zero to a fully audited, tested agent in 8 steps. The entire flow takes under 5 minutes.

01

Open the Command Center

Navigate to app.voxgrade.ai in any modern browser. No installation required. 14-day free trial, no credit card.

02

Enter your API keys

Paste your voice platform API key and OpenRouter API key in the settings panel. Keys are encrypted at rest on our servers and never shared.

03

Connect an agent

Enter your Agent ID in the Agent Auditor's Overview tab. VoxGrade will pull your agent's prompt, configuration, tools, and variables automatically.

04

Run the 30-point audit

Switch to the Audit tab and hit run. Your agent's prompt is scored across 5 categories with A-F grades. Every failure includes the exact copy-paste fix. Takes about 15 seconds.

05

Review failures and apply fixes

Read through each flagged issue. Copy the suggested fix directly into your prompt editor, or use the Auto-Optimizer (Step 8) to batch-apply fixes.

06

Run text simulations

Open the Text Simulations tab. The autonomous QA engine runs 5 conversation phases against your agent's prompt: happy path, edge cases, silence handling, hallucination traps, and prompt injection resistance. Cost: ~$0.05.

07

Run voice simulations

Switch to Voice Simulations. AI callers will actually phone your agent with different personas and scenarios. You get real call recordings and transcripts to review. Cost: ~$0.80-1.60 for all 5 scenarios.

08

Optimize and ship

Use the Auto-Optimizer in the Testing Lab to import failures, generate improved prompts, and push them directly to your platform via API. Run A/B tests to prove the fix works before shipping to production.

Pro workflow: Run the audit first to catch structural issues, then text sims to verify conversation flow, then voice sims for real-world validation. This layered approach catches 95%+ of failures before a real caller ever reaches your agent.

Agent Auditor: Overview Tab

The Overview tab is your starting point. This is where you connect a voice agent and view its current configuration.

What it shows

  • Agent name and ID pulled directly from your voice platform
  • Current prompt with syntax highlighting
  • Configured tools/functions the agent can call
  • Variables available to the agent during calls
  • Voice settings including provider, voice ID, and language

How to use it

  1. Select your voice platform (Retell AI, Vapi, LiveKit, or ElevenLabs)
  2. Paste your Agent ID into the input field
  3. Click Connect
  4. Review the fetched configuration to confirm it matches what you expect
  5. Proceed to the Audit tab

If the agent fails to load, double-check that your API key is correct and that the Agent ID is valid. The key must have read access to the agent.

Agent Auditor: 30-Point Audit

The core of the platform. The audit engine analyzes your agent's prompt against 30 checkpoints organized into 5 categories, grading each on an A-F scale.

Audit Categories

Category 1
Prompt Structure

Checks for clear role definition, personality consistency, conversation flow logic, greeting quality, and closing sequences.

Category 2
Voice Realism

Evaluates natural speech patterns, filler word usage, response length, tone consistency, and whether the agent sounds human or robotic.

Category 3
Call Management

Tests silence handling, interruption recovery, topic redirection, objection handling, and graceful call ending under pressure.

Category 4
Functions & Tools

Verifies tool call formatting, parameter handling, error recovery when tools fail, and proper use of available functions.

Category 5
Variables & Context

Checks variable injection, context retention across turns, personalization accuracy, and whether the agent properly uses all available data.

Grading Scale

Each checkpoint is graded individually and then rolled up into a category score and overall score:

  • A: 90-100
  • B: 80-89
  • C: 70-79
  • D: 60-69
  • F: below 60
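For reference, the banding is a straightforward threshold mapping. A minimal sketch:

```typescript
// Map a 0-100 checkpoint, category, or overall score to its letter grade,
// following the bands listed above.
function letterGrade(score: number): "A" | "B" | "C" | "D" | "F" {
  if (score >= 90) return "A";
  if (score >= 80) return "B";
  if (score >= 70) return "C";
  if (score >= 60) return "D";
  return "F";
}
```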

Copy-paste fixes

Every failed checkpoint includes a specific, actionable fix you can copy directly into your prompt. No guessing, no rewriting from scratch. The fix tells you exactly what to add, remove, or change.

Example failure output
FAIL
Checkpoint: Silence Handling (Call Management)
Grade: F
Issue: No silence recovery instruction found in prompt.
FIX: Add to your prompt after the greeting section: "If the caller goes silent for more than 4 seconds, say: 'Still there? No rush, take your time.' If silence continues past 8 seconds, say: 'Looks like we might have a bad connection. I'll stay on the line.'"

Agent Auditor: Text Simulations

Autonomous QA that runs full conversations against your agent's real prompt without a mic, a phone, or a human. The engine tests 5 distinct conversation phases to stress-test every aspect of your agent's behavior.

The 5 Simulation Phases

Phase 1: Happy Path

A cooperative caller follows the ideal conversation flow. Books the appointment, provides all info, no issues. This validates your agent works when everything goes right.

Phase 2: Edge Cases

Caller provides unexpected inputs: wrong formats, out-of-scope questions, unusual requests, corrections mid-conversation. Tests how your agent handles the unexpected.

Phase 3: Silence Handling

Simulates caller going quiet at various points: after greeting, mid-question, during booking. Checks if your agent recovers gracefully or panics.

Phase 4: Hallucination Traps

Caller asks about services, pricing, and details not in the prompt. Tests whether your agent invents information or correctly says it doesn't know.

Phase 5: Prompt Injection

Attempts to manipulate the agent into breaking character, revealing its prompt, or performing unauthorized actions. Critical for security and compliance.

Cost: All 5 text simulation phases run for approximately $0.05 total via OpenRouter. Results are returned in under 60 seconds.
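Conceptually, each phase is a loop in which one model plays your agent (seeded with its real prompt) and another plays the scripted caller. The sketch below shows the shape of that loop using OpenRouter's OpenAI-compatible chat endpoint; the model name and phase script are illustrative, and VoxGrade's actual engine layers per-turn grading and pass/fail checks on top of this:

```typescript
type Msg = { role: "system" | "user" | "assistant"; content: string };

// Hypothetical helper: one completion via OpenRouter's OpenAI-compatible API.
async function complete(messages: Msg[]): Promise<string> {
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    // Model choice is illustrative; any OpenRouter chat model works here.
    body: JSON.stringify({ model: "openai/gpt-4o-mini", messages }),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}

// Simulate one phase: the "caller" model probes the "agent" model turn by turn.
async function runPhase(agentPrompt: string, phaseScript: string, turns = 6) {
  const agent: Msg[] = [{ role: "system", content: agentPrompt }];
  const caller: Msg[] = [{ role: "system", content: phaseScript }];
  for (let i = 0; i < turns; i++) {
    const callerLine = await complete(caller); // caller speaks
    caller.push({ role: "assistant", content: callerLine });
    agent.push({ role: "user", content: callerLine });
    const agentLine = await complete(agent); // agent responds from its real prompt
    agent.push({ role: "assistant", content: agentLine });
    caller.push({ role: "user", content: agentLine });
    // A production engine would grade agentLine against pass/fail criteria here.
  }
  return agent; // full transcript, ready for grading
}
```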

Agent Auditor: Voice Simulations

The most realistic test available. AI callers actually phone your agent using real telephony, with different voices, personas, and scenario scripts. You get real call recordings and full transcripts.

How it works

  1. VoxGrade creates AI caller personas with distinct voices and backgrounds
  2. Each persona calls your agent's phone number with a specific scenario
  3. The call plays out naturally over real telephony infrastructure
  4. After all calls complete, you get recordings, transcripts, and graded results

Scenarios tested

  • Standard booking with a cooperative, easy-going caller
  • Hesitant caller who needs convincing and asks lots of questions
  • Interrupting caller who talks over the agent and changes topics
  • Confused caller who mishears information and needs corrections
  • Hostile caller who is frustrated and tries to break the conversation flow
  • ~$0.16 per voice call
  • ~$0.80 for 5 calls (minimum)
  • ~$1.60 for 5 calls (maximum)

Voice simulations use telephony credits billed to your voice platform account. Make sure your account has sufficient balance before running voice tests.

Testing Lab: Test Dashboard

The Test Dashboard is your command center for all test results. It aggregates scores from audits, text simulations, and voice simulations into a single view.

What you'll see

  • Overall agent score with trend over time
  • Per-category breakdown showing which areas need work
  • Test history with timestamps and score comparisons
  • Regression detection that flags when scores drop after changes
  • Baseline tracking to measure improvement from your starting point
  • AI-powered insights that identify patterns and suggest next steps

Set a baseline score before making any changes. This lets you measure exactly how much improvement each fix delivers and catch regressions immediately.

Testing Lab: Call Grading

Every call (simulated or real) is graded using a two-step scoring pipeline: evidence extraction followed by LLM-based evaluation across 25+ metrics in 6 weighted dimensions.

Grading Rubric

Dimension | Weight | What It Measures
Conversation Quality | 30% | Natural flow, appropriate responses, tone consistency, active listening signals, turn-taking
Task Completion | 25% | Did the agent achieve the desired outcome? Booking rate, information gathered, goal progression
Safety & Compliance | 15% | Hallucinations, prompt injection resistance, data leakage, compliance violations
Empathy & Rapport | 10% | Emotional intelligence, active listening, appropriate empathy, rapport building
Latency & Timing | 10% | Time to first word (TTFW), P50/P90/P99 response latency, talk ratio, silence handling
Audio Quality | 10% | Voice clarity, interruption handling, background noise resilience, natural pacing

Auto-fail conditions

Certain failures automatically flag a call as critical regardless of the overall score:

  • Hallucinated pricing of any kind
  • Prompt injection success where the agent breaks character
  • Data leakage where the agent reveals system prompt or internal config
  • Silent drop where the agent hangs up without explanation

Auto-fail conditions are the most important items to fix first. A single hallucinated price or prompt injection vulnerability can destroy client trust permanently.
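To make the rollup concrete, here is a minimal sketch of the weighted scoring with the auto-fail override, assuming per-dimension scores on a 0-100 scale (the dimension keys and result shape are illustrative, not VoxGrade's internal API):

```typescript
// Rubric weights from the table above (sum to 1.0).
const WEIGHTS: Record<string, number> = {
  conversationQuality: 0.3,
  taskCompletion: 0.25,
  safetyCompliance: 0.15,
  empathyRapport: 0.1,
  latencyTiming: 0.1,
  audioQuality: 0.1,
};

// Roll six 0-100 dimension scores into an overall score; auto-fail
// conditions flag the call as critical regardless of that number.
function gradeCall(
  dimensionScores: Record<string, number>,
  autoFails: string[], // e.g. ["hallucinated_pricing"] — labels are illustrative
) {
  const overall = Object.entries(WEIGHTS).reduce(
    (sum, [dim, w]) => sum + w * (dimensionScores[dim] ?? 0),
    0,
  );
  return { overall: Math.round(overall), critical: autoFails.length > 0 };
}
```

Note that a call can score well overall and still be flagged critical by a single auto-fail, which is why those conditions are listed separately above.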

Testing Lab: Auto-Optimizer

The Auto-Optimizer takes your failed test results and generates improved prompt sections to fix them. Then it pushes the fixes directly to your voice platform via API. No manual editing required.

Workflow

  1. Import failures from your latest audit or simulation results
  2. Review generated fixes with before/after diffs showing exactly what changed
  3. Approve or edit each fix individually
  4. Push to your platform with one click. Your original prompt is backed up automatically.
  5. Re-test to verify the fix worked

Safety net: The optimizer always backs up your original prompt before pushing changes. You can roll back to any previous version at any time from the optimizer's history panel.

Available on Pro and Agency plans. Starter plan users can still see the suggested fixes but cannot auto-push them.

Testing Lab: Before/After Comparison

Don't guess whether a fix works. Prove it. Run the same test scenarios before and after your prompt changes and compare scores side by side.

How it works

01

Save your baseline

Run a full test suite on your current prompt. VoxGrade saves these scores as your "before" baseline.

02

Apply the fix

Push your improved prompt via the Auto-Optimizer. Your original prompt is backed up automatically.

03

Run the same tests again

The same text or voice simulations run against your updated prompt. Same scenarios, same grading criteria.

04

Compare scores

Side-by-side comparison shows score differences across all 6 rubric dimensions. You'll see exactly which areas improved and if anything regressed.
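The comparison itself reduces to a per-dimension delta between the baseline run and the post-fix run. A minimal sketch:

```typescript
// Per-dimension score delta between a baseline run and a post-fix run.
// Positive deltas are improvements; negative deltas are regressions.
function compareRuns(
  before: Record<string, number>,
  after: Record<string, number>,
): Record<string, number> {
  return Object.fromEntries(
    Object.keys(before).map((dim) => [dim, (after[dim] ?? 0) - before[dim]]),
  );
}
```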

Available on Pro and Agency plans. Before/after comparison is designed for teams who need data-driven proof that prompt changes improve performance.

CI/CD API

Run VoxGrade tests programmatically from your deployment pipeline. Block deploys when agent quality drops below your threshold. Perfect for teams shipping agent updates frequently.

How it works

  1. Generate an API key from your VoxGrade dashboard settings
  2. Add the VoxGrade test step to your CI/CD pipeline (GitHub Actions, CircleCI, etc.)
  3. Set your minimum passing score threshold
  4. Deploys are blocked automatically if the score drops below your threshold
API call
```
POST https://app.voxgrade.ai/api/v1-test
Authorization: Bearer vxg_xxxxxxxxxxxx
Content-Type: application/json

{
  "agent_id": "agent_xxxxxxxxxxxx",
  "platform": "retell",
  "tests": ["audit", "text_sim"],
  "min_score": 80,
  "webhook_url": "https://your-ci/callback"
}
```
Response
```
{
  "batch_id": "batch_xxxxxxxxxxxx",
  "status": "running",
  "results_url": "https://app.voxgrade.ai/results/batch_xxxx"
}
```

Webhook support: VoxGrade will POST results to your webhook URL when tests complete, including pass/fail status and the full score breakdown. No polling needed.
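For teams that prefer polling over webhooks, here is a minimal sketch of a CI gate script. The polled payload's fields (status, passed) are assumptions — the documented response covers only batch_id, status, and results_url — so adapt the final check to your actual results payload:

```typescript
// Minimal CI gate: start a VoxGrade run, poll the results URL, and exit
// non-zero to block the deploy if the batch fails. The "passed" field on the
// polled payload is an assumption; adjust to your actual results schema.
async function ciGate() {
  const start = await fetch("https://app.voxgrade.ai/api/v1-test", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.VOXGRADE_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      agent_id: process.env.AGENT_ID,
      platform: "retell",
      tests: ["audit", "text_sim"],
      min_score: 80,
    }),
  });
  const { results_url } = await start.json();

  for (let i = 0; i < 60; i++) {                     // poll up to ~10 minutes
    await new Promise((r) => setTimeout(r, 10_000)); // 10s between polls
    const result = await (await fetch(results_url)).json();
    if (result.status !== "running") {
      if (!result.passed) process.exit(1);           // block the deploy
      return;                                        // quality gate passed
    }
  }
  process.exit(1); // timed out — fail closed
}

ciGate();
```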

Production Monitoring

Don't just test in staging. Monitor your agents in production. Ingest real calls, auto-score them, set alerting rules, and catch regressions before your users complain.

Features

  • Call ingestion: Automatically pull production calls from Retell, Vapi, LiveKit, or ElevenLabs every 4 hours via cron
  • Auto-scoring: Every ingested call is graded through the same 25+ metric pipeline used in testing
  • Alerting rules: Set custom monitors that trigger email alerts when scores drop, hallucinations appear, or call drop rates spike
  • Anomaly detection: AI-powered pattern recognition flags unusual behavior before it becomes a trend
  • Webhook ingest: Receive post-call data via webhook for real-time monitoring

Production monitoring is available on Pro and Agency plans. Cron ingestion runs automatically. Configure monitors from the Monitors tab in your dashboard.
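One way to use webhook ingest is a thin forwarding layer between your voice platform's post-call webhook and VoxGrade. This is a sketch under stated assumptions — the ingest URL and payload field names here are hypothetical, so check your dashboard for the real ingest endpoint and schema:

```typescript
// Illustrative middleware: forward a voice platform's post-call webhook
// payload to VoxGrade. The ingest URL and field names below are hypothetical.
import { createServer } from "node:http";

createServer((req, res) => {
  let body = "";
  req.on("data", (chunk) => (body += chunk));
  req.on("end", async () => {
    const call = JSON.parse(body); // post-call payload from your voice platform
    await fetch("https://app.voxgrade.ai/api/ingest", { // hypothetical URL
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.VOXGRADE_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        agent_id: call.agent_id,
        transcript: call.transcript,
        recording_url: call.recording_url,
      }),
    });
    res.writeHead(204).end();
  });
}).listen(8080);
```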

Red-Team Testing

Adversarial testing that probes your agent for security vulnerabilities, compliance violations, and prompt injection attacks. Built for teams that need to ship secure, production-ready agents.

Attack vectors tested

  • Prompt injection: Attempts to override system instructions via caller input
  • Jailbreak attacks: Multi-turn escalation to break agent boundaries
  • Data exfiltration: Tricks the agent into revealing internal configuration, API keys, or system prompts
  • Social engineering: Impersonation, authority escalation, and urgency manipulation
  • Compliance violations: Tests whether the agent can be tricked into making unauthorized commitments, quoting incorrect pricing, or providing medical/legal advice

Red-team results are the highest priority fixes. A prompt injection vulnerability in a customer-facing agent can lead to data breaches, financial loss, and regulatory action.

Fleet Management

For agencies and teams managing multiple agents. View all agents in one place, compare scores across your fleet, and batch-run tests.

Capabilities

  • Fleet overview: See all your agents in a single dashboard with health scores and trends
  • Batch testing: Run audits, simulations, and red-team tests across multiple agents at once
  • Benchmarking: Compare agent performance against fleet averages and identify underperformers
  • Cross-agent insights: AI identifies common failure patterns across your fleet and suggests fleet-wide fixes
  • Client reports: Generate branded PDF reports for agency clients with per-agent scorecards

Available on the Agency plan. Includes unlimited agents and client report generation.

Golden Datasets

Define expected scoring ranges for known-good transcripts. Use them as regression tests to ensure scoring consistency across model updates and prompt changes.

How it works

  1. Curate: Select representative call transcripts that cover your key scenarios
  2. Label: Set expected score ranges for each golden transcript (e.g., Conversation Quality: 85-95)
  3. Test: Run the golden dataset through the scoring pipeline
  4. Alert: Get notified when scores drift outside expected ranges, indicating a regression

Best practice: Include at least 10 golden transcripts covering happy path, edge cases, and known failure scenarios. Run the golden dataset before and after every scoring model update.
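In practice a golden dataset is a list of transcripts with expected score bands, and the regression check is a range comparison per dimension. A minimal sketch (the entry shape is illustrative):

```typescript
// Golden dataset entry: a known-good transcript plus expected score bands.
type GoldenEntry = {
  transcriptId: string;
  expected: Record<string, [min: number, max: number]>; // per-dimension bands
};

// Compare fresh pipeline scores against each entry's expected bands and
// report any dimension that drifted outside its range.
function checkDrift(
  entry: GoldenEntry,
  scores: Record<string, number>, // scores from the current pipeline run
): string[] {
  return Object.entries(entry.expected)
    .filter(([dim, [min, max]]) => scores[dim] < min || scores[dim] > max)
    .map(
      ([dim, [min, max]]) =>
        `${entry.transcriptId}: ${dim} = ${scores[dim]} (expected ${min}-${max})`,
    );
}

// Example: Conversation Quality expected 85-95, pipeline returned 78 → flagged.
const issues = checkDrift(
  { transcriptId: "booking-happy-path", expected: { conversationQuality: [85, 95] } },
  { conversationQuality: 78 },
);
```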

Weight Calibration

Fine-tune the scoring rubric to match your team's quality standards. VoxGrade uses a human-in-the-loop calibration workflow to align AI grading with your expert judgment.

Calibration workflow

01

Batch score

Score a set of transcripts through the pipeline. Review the AI's grades alongside the transcripts.

02

Human grade

Provide your own scores (0-100) for each transcript. The system computes the delta between AI and human grades.

03

Adjust weights

VoxGrade analyzes the deltas and proposes Bayesian weight adjustments to align AI scoring with your judgment.

04

Validate and apply

Re-score with proposed weights to verify improvement. Apply when satisfied. Roll back any time.
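To illustrate the idea behind step 03: dimensions where the AI's grades track your human grades closely keep their weight, while noisier dimensions are shrunk, and the result is renormalized. This is a deliberately simplified stand-in for the actual Bayesian adjustment, not VoxGrade's internal algorithm:

```typescript
// Toy delta-driven weight adjustment. The real calibration uses Bayesian
// updating; this sketch only shows the shape of the problem.
function proposeWeights(
  current: Record<string, number>,    // current rubric weights (sum to 1.0)
  deltas: Record<string, number[]>,   // per-dimension |AI - human| score gaps
): Record<string, number> {
  const reliability: Record<string, number> = {};
  for (const [dim, gaps] of Object.entries(deltas)) {
    const meanGap = gaps.reduce((a, b) => a + b, 0) / gaps.length;
    reliability[dim] = current[dim] / (1 + meanGap / 100); // shrink noisy dims
  }
  // Renormalize so the proposed weights still sum to 1.0.
  const total = Object.values(reliability).reduce((a, b) => a + b, 0);
  return Object.fromEntries(
    Object.entries(reliability).map(([dim, w]) => [dim, w / total]),
  );
}
```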

Calibration sessions are stored in Redis and persist across browser sessions. Available on Pro and Agency plans.

Cost Breakdown

VoxGrade is a subscription product ($0 for Starter, $49/mo for Pro, $149/mo for Agency). Test runs are billed through your own API keys at provider cost with no markup.

Action | Provider | Approx. Cost
30-point prompt audit | OpenRouter | ~$0.02
Text simulations (all 5 phases) | OpenRouter | ~$0.05
AI call grading (per call) | OpenRouter | ~$0.01
Voice simulation (per call) | Voice platform | ~$0.16-0.32
Voice simulations (all 5 calls) | Voice platform | ~$0.80-1.60
Auto-optimizer generation | OpenRouter | ~$0.03
Red-team testing (per run) | OpenRouter | ~$0.08
Full test suite (audit + text + voice + red-team) | Combined | ~$1-2 total

A full audit + text simulation + voice simulation + red-team cycle costs under $2. Compare that to 45-60 minutes of manual QA at your hourly rate.

Data & Privacy

VoxGrade is built with a security-first architecture. Your data is encrypted and protected.

API key handling

Your voice platform and OpenRouter API keys are stored encrypted on our servers. Keys are used to make API calls on your behalf and are never exposed or shared.

Server-side storage

Account data, test results, analytics, and agent configurations are stored server-side to power features like version history, the learning engine, scheduled reports, and cross-session persistence.

Voice call data

Voice simulation recordings and transcripts are fetched from your voice platform's API and stored alongside your test results. Call data is also governed by your platform's privacy policy.

Data deletion

You can request data deletion from your account settings or by contacting us. We will remove all associated data within 30 days of the request.

For full details, see our Privacy Policy and Terms of Service.

Frequently Asked Questions

What voice AI platforms do you support?

Retell AI and Vapi with full API integration (agent fetch, prompt push, voice calls, call grading). LiveKit and ElevenLabs support audit, text simulations, and production call ingestion. More platforms are added regularly.

Do I need to install anything?

No. VoxGrade is a web application. Navigate to app.voxgrade.ai and start testing. No extensions, no desktop apps, no dependencies.

How does autonomous testing work without a microphone?

The text simulation engine fetches your agent's real prompt, then uses AI to simulate multi-turn conversations. It sends messages as a simulated caller and evaluates every response against pass/fail criteria. No audio involved.

Can I push fixes directly to my live agents?

Yes. The Auto-Optimizer generates improved prompts and pushes them to Retell or Vapi via API. Your original prompt is backed up automatically. Available on Pro and Agency plans.

Can I integrate VoxGrade into my CI/CD pipeline?

Yes. The v1-test API lets you run tests programmatically and block deploys when quality drops below your threshold. Generate an API key from your dashboard settings.

What if a fix makes things worse?

Use before/after comparison to compare scores before and after your fix. The optimizer's history panel lets you roll back to any previous prompt version.

How is my data stored?

Securely on our servers to power features like version history, analytics, and the learning engine. API keys are encrypted at rest. See our privacy policy for details.

How accurate is the AI grading?

The grading engine uses a two-step pipeline: evidence extraction followed by LLM evaluation across 25+ metrics. Use the weight calibration feature to align AI grading with your team's quality standards.

Can I test agents I don't own?

You need an API key with access to the agent. If you have API access to a client's agent, you can test it. This is a common workflow for voice AI agencies managing agents on behalf of clients.

What about red-team and security testing?

Red-team testing probes for prompt injection, jailbreaks, data exfiltration, and social engineering attacks. It runs adversarial conversation scenarios designed to exploit common voice agent vulnerabilities.

What's included in the free trial?

Full Pro or Agency features for 14 days. Unlimited agents, autonomous testing, auto-optimizer, red-team, everything. No credit card required. Test run costs are billed through your own API keys.

What if it doesn't work for me?

Every paid plan includes a 30-day money-back guarantee. If you don't see improvement in your agent scores within 30 days, you'll receive a full refund.

Ready to start?

Run your first 30-point audit in under 60 seconds.

Open the Command Center
No credit card · 14-day free trial · Data encrypted