Voice Agent
Command Center
The complete QA platform for AI voice agents. Audit, stress-test, grade, and auto-fix your agents across Retell AI, Vapi, LiveKit, and ElevenLabs from a single browser-based dashboard.
What is VoxGrade?
VoxGrade is a browser-based QA platform purpose-built for AI voice agents. It gives you a structured, repeatable process to find failures, fix prompts, and verify improvements before your clients hear them.
The platform is organized into six core modules:
Agent Auditor
Connect an agent, run a 30-point audit, simulate conversations via text and voice, get copy-paste fixes for every failure.
Testing Lab
View test results, grade calls with a weighted rubric, auto-generate improved prompts, and compare before/after scores.
CI/CD API
Run tests programmatically from your deployment pipeline. Block deploys when agent quality drops below your threshold.
Production Monitoring
Ingest production calls, set alerting rules, and track real-world agent performance with automated scoring and anomaly detection.
Red-Team Testing
Adversarial attack simulations that probe for prompt injection, jailbreaks, data exfiltration, and compliance violations.
Fleet Management
Manage and test multiple agents at once. Compare scores across your fleet, identify underperformers, and batch-apply fixes.
VoxGrade stores your data securely server-side to power features like version history, analytics, the learning engine, and scheduled reports. API keys are encrypted at rest.
Requirements
Before you begin, make sure you have the following:
Voice Platform API Key
Used to connect to your voice agents, fetch prompts, push updates, and trigger calls. VoxGrade supports Retell AI, Vapi, LiveKit, and ElevenLabs.
Retell AI: retellai.com → Settings → API Keys
Vapi: vapi.ai → Dashboard → API Keys
LiveKit: livekit.io → Project Settings → Keys
ElevenLabs: elevenlabs.io → Profile → API Keys
OpenRouter API Key
Powers the AI grading engine that scores your agent's responses across the 6-dimension rubric. Used for text simulations, auto-optimizer, red-team testing, and insights generation.
1. Go to openrouter.ai/keys
2. Sign in or create an account
3. Click "Create Key"
4. Copy your API key
```
sk-or-v1-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
```
An Agent ID
The unique identifier for the specific voice agent you want to test. Each platform uses its own ID format.
Retell AI: agent_xxxxxxxxxxxxxxxxxxxxxxxx
Vapi: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
LiveKit: agent-name or agent ID from dashboard
ElevenLabs: agent ID from Conversational AI page
Cost note: Text simulations and AI grading use OpenRouter credits (typically $0.05-0.10 per run). Voice simulations use your platform's telephony credits ($0.80-1.60 for all 5 scenarios). These costs are billed through your own API keys with no markup from VoxGrade.
Supported Platforms
VoxGrade connects to multiple voice AI platforms through a unified adapter layer. Test and optimize agents regardless of which platform they run on.
| Feature | Retell AI | Vapi | LiveKit | ElevenLabs |
|---|---|---|---|---|
| Agent fetch | Full | Full | Partial | Partial |
| 30-point audit | Full | Full | Full | Full |
| Text simulations | Full | Full | Full | Full |
| Voice simulations | Full | Full | Coming soon | Coming soon |
| Auto-optimizer push | Full | Full | Manual | Manual |
| Production call ingest | Full | Full | Full | Full |
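The unified adapter layer described above can be pictured as a common interface that every platform integration implements. The sketch below is illustrative only; the class and method names are assumptions, not VoxGrade's actual internals.

```python
from typing import Protocol

class VoicePlatformAdapter(Protocol):
    """Hypothetical shape of a per-platform adapter (names are illustrative,
    not VoxGrade's actual API)."""
    def fetch_agent(self, agent_id: str) -> dict: ...
    def push_prompt(self, agent_id: str, prompt: str) -> None: ...
    def start_call(self, agent_id: str, to_number: str) -> str: ...

class RetellAdapter:
    """Sketch of one concrete adapter; a real one would call Retell's REST API."""
    def fetch_agent(self, agent_id: str) -> dict:
        return {"id": agent_id, "platform": "retell"}  # placeholder payload

    def push_prompt(self, agent_id: str, prompt: str) -> None:
        pass  # would PUT the updated prompt to the platform

    def start_call(self, agent_id: str, to_number: str) -> str:
        return "call_123"  # placeholder call ID
```

Because every adapter exposes the same surface, features like the audit and simulations can stay platform-agnostic; "Manual" cells in the table simply mean an adapter method is not yet wired to that platform's API.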
Quick Start
Go from zero to a fully audited, tested agent in 8 steps. The entire flow takes under 5 minutes.
Open the Command Center
Navigate to app.voxgrade.ai in any modern browser. No installation required. 14-day free trial, no credit card.
Enter your API keys
Paste your voice platform API key and OpenRouter API key in the settings panel. Keys are encrypted at rest on our servers and never shared.
Connect an agent
Enter your Agent ID in the Agent Auditor's Overview tab. VoxGrade will pull your agent's prompt, configuration, tools, and variables automatically.
Run the 30-point audit
Switch to the Audit tab and hit run. Your agent's prompt is scored across 5 categories with A-F grades. Every failure includes the exact copy-paste fix. Takes about 15 seconds.
Review failures and apply fixes
Read through each flagged issue. Copy the suggested fix directly into your prompt editor, or use the Auto-Optimizer (Step 8) to batch-apply fixes.
Run text simulations
Hit the Text Simulations tab. The autonomous QA engine runs 5 conversation phases against your agent's prompt: happy path, edge cases, silence handling, hallucination traps, and prompt injection resistance. Cost: ~$0.05.
Run voice simulations
Switch to Voice Simulations. AI callers will actually phone your agent with different personas and scenarios. You get real call recordings and transcripts to review. Cost: ~$0.80-1.60 for all 5 scenarios.
Optimize and ship
Use the Auto-Optimizer in the Testing Lab to import failures, generate improved prompts, and push them directly to your platform via API. Run A/B tests to prove the fix works before shipping to production.
Pro workflow: Run the audit first to catch structural issues, then text sims to verify conversation flow, then voice sims for real-world validation. This layered approach catches 95%+ of failures before a real caller ever reaches your agent.
Agent Auditor: Overview Tab
The Overview tab is your starting point. This is where you connect a voice agent and view its current configuration.
What it shows
- Agent name and ID pulled directly from your voice platform
- Current prompt with syntax highlighting
- Configured tools/functions the agent can call
- Variables available to the agent during calls
- Voice settings including provider, voice ID, and language
How to use it
- Select your voice platform (Retell AI, Vapi, LiveKit, or ElevenLabs)
- Paste your Agent ID into the input field
- Click Connect
- Review the fetched configuration to confirm it matches what you expect
- Proceed to the Audit tab
If the agent fails to load, double-check that your API key is correct and that the Agent ID is valid. The key must have read access to the agent.
Agent Auditor: 30-Point Audit
The core of the platform. The audit engine analyzes your agent's prompt against 30 checkpoints organized into 5 categories, grading each on an A-F scale.
Audit Categories
Prompt Structure
Checks for clear role definition, personality consistency, conversation flow logic, greeting quality, and closing sequences.
Voice Realism
Evaluates natural speech patterns, filler word usage, response length, tone consistency, and whether the agent sounds human or robotic.
Call Management
Tests silence handling, interruption recovery, topic redirection, objection handling, and graceful call ending under pressure.
Functions & Tools
Verifies tool call formatting, parameter handling, error recovery when tools fail, and proper use of available functions.
Variables & Context
Checks variable injection, context retention across turns, personalization accuracy, and whether the agent properly uses all available data.
Grading Scale
Each checkpoint is graded individually and then rolled up into a category score and overall score.
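As an illustration of how such a rollup could work (the letter-to-points mapping below is an assumption, not VoxGrade's published scale):

```python
# Hypothetical rollup: map letter grades to points, average per category,
# then average the categories into an overall score. The mapping is illustrative.
POINTS = {"A": 95, "B": 85, "C": 75, "D": 65, "F": 40}

def category_score(grades: list[str]) -> float:
    return sum(POINTS[g] for g in grades) / len(grades)

def overall_score(categories: dict[str, list[str]]) -> float:
    return sum(category_score(g) for g in categories.values()) / len(categories)

audit = {
    "Prompt Structure": ["A", "B", "A", "A", "B", "A"],
    "Call Management": ["C", "F", "B", "D", "C", "C"],
}
print(round(overall_score(audit), 1))  # → 80.4
```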
Copy-paste fixes
Every failed checkpoint includes a specific, actionable fix you can copy directly into your prompt. No guessing, no rewriting from scratch. The fix tells you exactly what to add, remove, or change.
```
FAIL: Checkpoint: Silence Handling (Call Management)
Grade: F
Issue: No silence recovery instruction found in prompt.

FIX: Add to your prompt after the greeting section:

"If the caller goes silent for more than 4 seconds,
say: 'Still there? No rush, take your time.' If silence
continues past 8 seconds, say: 'Looks like we might
have a bad connection. I'll stay on the line.'"
```
Agent Auditor: Text Simulations
Autonomous QA that runs full conversations against your agent's real prompt without a mic, a phone, or a human. The engine tests 5 distinct conversation phases to stress-test every aspect of your agent's behavior.
The 5 Simulation Phases
Phase 1: Happy Path
A cooperative caller follows the ideal conversation flow. Books the appointment, provides all info, no issues. This validates your agent works when everything goes right.
Phase 2: Edge Cases
Caller provides unexpected inputs: wrong formats, out-of-scope questions, unusual requests, corrections mid-conversation. Tests how your agent handles the unexpected.
Phase 3: Silence Handling
Simulates caller going quiet at various points: after greeting, mid-question, during booking. Checks if your agent recovers gracefully or panics.
Phase 4: Hallucination Traps
Caller asks about services, pricing, and details not in the prompt. Tests whether your agent invents information or correctly says it doesn't know.
Phase 5: Prompt Injection
Attempts to manipulate the agent into breaking character, revealing its prompt, or performing unauthorized actions. Critical for security and compliance.
Cost: All 5 text simulation phases run for approximately $0.05 total via OpenRouter. Results are returned in under 60 seconds.
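To make the pass/fail idea concrete, here is a simplified sketch of the kind of check a phase might apply, using the hallucination-trap phase as an example. The criteria and marker phrases are assumptions for illustration, not VoxGrade's actual rules.

```python
# Hypothetical hallucination-trap check: when asked about a service not
# covered in the prompt, the agent should admit uncertainty rather than
# invent details or quote a price. Markers are illustrative.
UNCERTAINTY_MARKERS = ("i don't know", "i'm not sure", "let me check", "don't have that")

def passes_hallucination_trap(agent_reply: str) -> bool:
    reply = agent_reply.lower()
    admits_uncertainty = any(m in reply for m in UNCERTAINTY_MARKERS)
    quotes_price = "$" in reply  # inventing a price for an unknown service fails
    return admits_uncertainty and not quotes_price

print(passes_hallucination_trap("I'm not sure, let me check with the team."))  # True
print(passes_hallucination_trap("That premium package is $199 per month."))    # False
```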
Agent Auditor: Voice Simulations
The most realistic test available. AI callers actually phone your agent using real telephony, with different voices, personas, and scenario scripts. You get real call recordings and full transcripts.
How it works
- VoxGrade creates AI caller personas with distinct voices and backgrounds
- Each persona calls your agent's phone number with a specific scenario
- The call plays out naturally over real telephony infrastructure
- After all calls complete, you get recordings, transcripts, and graded results
Scenarios tested
- Standard booking with a cooperative, easy-going caller
- Hesitant caller who needs convincing and asks lots of questions
- Interrupting caller who talks over the agent and changes topics
- Confused caller who mishears information and needs corrections
- Hostile caller who is frustrated and tries to break the conversation flow
Voice simulations use telephony credits billed to your voice platform account. Make sure your account has sufficient balance before running voice tests.
Testing Lab: Test Dashboard
The Test Dashboard is your command center for all test results. It aggregates scores from audits, text simulations, and voice simulations into a single view.
What you'll see
- Overall agent score with trend over time
- Per-category breakdown showing which areas need work
- Test history with timestamps and score comparisons
- Regression detection that flags when scores drop after changes
- Baseline tracking to measure improvement from your starting point
- AI-powered insights that identify patterns and suggest next steps
Set a baseline score before making any changes. This lets you measure exactly how much improvement each fix delivers and catch regressions immediately.
Testing Lab: Call Grading
Every call (simulated or real) is graded using a two-step scoring pipeline: evidence extraction followed by LLM-based evaluation across 25+ metrics in 6 weighted dimensions.
Grading Rubric
| Dimension | Weight | What It Measures |
|---|---|---|
| Conversation Quality | 30% | Natural flow, appropriate responses, tone consistency, active listening signals, turn-taking |
| Task Completion | 25% | Did the agent achieve the desired outcome? Booking rate, information gathered, goal progression |
| Safety & Compliance | 15% | Hallucinations, prompt injection resistance, data leakage, compliance violations |
| Empathy & Rapport | 10% | Emotional intelligence, active listening, appropriate empathy, rapport building |
| Latency & Timing | 10% | Time to first word (TTFW), P50/P90/P99 response latency, talk ratio, silence handling |
| Audio Quality | 10% | Voice clarity, interruption handling, background noise resilience, natural pacing |
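The weighted rollup implied by the rubric can be written out directly. The weights below come from the table above; the per-call dimension scores are made-up example inputs.

```python
# Dimension weights from the rubric table (sum to 1.0).
WEIGHTS = {
    "conversation_quality": 0.30,
    "task_completion": 0.25,
    "safety_compliance": 0.15,
    "empathy_rapport": 0.10,
    "latency_timing": 0.10,
    "audio_quality": 0.10,
}

def weighted_score(dimension_scores: dict[str, float]) -> float:
    return sum(WEIGHTS[d] * s for d, s in dimension_scores.items())

call = {  # example per-dimension scores for one graded call
    "conversation_quality": 88,
    "task_completion": 92,
    "safety_compliance": 100,
    "empathy_rapport": 75,
    "latency_timing": 80,
    "audio_quality": 85,
}
print(round(weighted_score(call), 1))  # → 88.4
```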
Auto-fail conditions
Certain failures automatically flag a call as critical regardless of the overall score:
- Hallucinated pricing of any kind
- Prompt injection success where the agent breaks character
- Data leakage where the agent reveals system prompt or internal config
- Silent drop where the agent hangs up without explanation
Auto-fail conditions are the most important items to fix first. A single hallucinated price or prompt injection vulnerability can destroy client trust permanently.
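A sketch of how the auto-fail override might interact with the weighted score (flag names and the verdict labels are illustrative assumptions):

```python
# Any auto-fail flag marks the call critical, regardless of its numeric score.
AUTO_FAIL_FLAGS = (
    "hallucinated_pricing",
    "prompt_injection_success",
    "data_leakage",
    "silent_drop",
)

def final_verdict(score: float, flags: set[str], min_score: float = 80) -> str:
    if any(f in flags for f in AUTO_FAIL_FLAGS):
        return "critical"  # overrides the score entirely
    return "pass" if score >= min_score else "fail"

print(final_verdict(94, {"hallucinated_pricing"}))  # critical despite a high score
```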
Testing Lab: Auto-Optimizer
The Auto-Optimizer takes your failed test results and generates improved prompt sections to fix them. It then pushes the fixes directly to Retell AI or Vapi via API; for LiveKit and ElevenLabs, copy the generated prompt over manually. No hand-editing of the fixes themselves required.
Workflow
- Import failures from your latest audit or simulation results
- Review generated fixes with before/after diffs showing exactly what changed
- Approve or edit each fix individually
- Push to your platform with one click. Your original prompt is backed up automatically.
- Re-test to verify the fix worked
Safety net: The optimizer always backs up your original prompt before pushing changes. You can roll back to any previous version at any time from the optimizer's history panel.
Available on Pro and Agency plans. Starter plan users can still see the suggested fixes but cannot auto-push them.
Testing Lab: Before/After Comparison
Don't guess whether a fix works. Prove it. Run the same test scenarios before and after your prompt changes and compare scores side by side.
How it works
Save your baseline
Run a full test suite on your current prompt. VoxGrade saves these scores as your "before" baseline.
Apply the fix
Push your improved prompt via the Auto-Optimizer. Your original prompt is backed up automatically.
Run the same tests again
The same text or voice simulations run against your updated prompt. Same scenarios, same grading criteria.
Compare scores
Side-by-side comparison shows score differences across all 6 rubric dimensions. You'll see exactly which areas improved and if anything regressed.
Available on Pro and Agency plans. Before/after comparison is designed for teams who need data-driven proof that prompt changes improve performance.
CI/CD API
Run VoxGrade tests programmatically from your deployment pipeline. Block deploys when agent quality drops below your threshold. Perfect for teams shipping agent updates frequently.
How it works
- Generate an API key from your VoxGrade dashboard settings
- Add the VoxGrade test step to your CI/CD pipeline (GitHub Actions, CircleCI, etc.)
- Set your minimum passing score threshold
- Deploys are blocked automatically if the score drops below your threshold
Request:

```
POST https://app.voxgrade.ai/api/v1-test
Authorization: Bearer vxg_xxxxxxxxxxxx
Content-Type: application/json

{
  "agent_id": "agent_xxxxxxxxxxxx",
  "platform": "retell",
  "tests": ["audit", "text_sim"],
  "min_score": 80,
  "webhook_url": "https://your-ci/callback"
}
```

Response:

```
{
  "batch_id": "batch_xxxxxxxxxxxx",
  "status": "running",
  "results_url": "https://app.voxgrade.ai/results/batch_xxxx"
}
```
Webhook support: VoxGrade will POST results to your webhook URL when tests complete, including pass/fail status and the full score breakdown. No polling needed.
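A minimal CI gate sketch, assuming the request shape shown above and a webhook payload that carries an overall score. The `score` field name and the helper functions are assumptions for illustration, not a documented client.

```python
import json
import urllib.request

API = "https://app.voxgrade.ai/api/v1-test"

def start_test_run(api_key: str, agent_id: str, min_score: int) -> dict:
    """Kick off a test batch (body matches the example request above)."""
    body = json.dumps({
        "agent_id": agent_id,
        "platform": "retell",
        "tests": ["audit", "text_sim"],
        "min_score": min_score,
    }).encode()
    req = urllib.request.Request(
        API, data=body,
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def gate(webhook_payload: dict, min_score: int) -> int:
    """Return a process exit code: 0 passes the deploy, 1 blocks it.
    'score' is an assumed field name for the overall result."""
    return 0 if webhook_payload.get("score", 0) >= min_score else 1

# In CI you would call start_test_run(...), wait for the webhook, then:
print(gate({"score": 76}, min_score=80))  # 1 → block the deploy
```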
Production Monitoring
Don't just test in staging. Monitor your agents in production. Ingest real calls, auto-score them, set alerting rules, and catch regressions before your users complain.
Features
- Call ingestion: Automatically pull production calls from Retell, Vapi, LiveKit, or ElevenLabs every 4 hours via cron
- Auto-scoring: Every ingested call is graded through the same 25+ metric pipeline used in testing
- Alerting rules: Set custom monitors that trigger email alerts when scores drop, hallucinations appear, or call drop rates spike
- Anomaly detection: AI-powered pattern recognition flags unusual behavior before it becomes a trend
- Webhook ingest: Receive post-call data via webhook for real-time monitoring
Production monitoring is available on Pro and Agency plans. Cron ingestion runs automatically. Configure monitors from the Monitors tab in your dashboard.
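As an illustration of the kind of alerting rule a monitor could express, here is a hypothetical "rolling score drop" check; the window and threshold values are assumptions, not VoxGrade defaults.

```python
# Hypothetical monitor: alert when the rolling average score over the last
# `window` ingested calls drops more than `threshold` points below baseline.
from collections import deque

def should_alert(scores, baseline: float, window: int = 20, threshold: float = 10) -> bool:
    recent = deque(scores, maxlen=window)  # keep only the newest calls
    if not recent:
        return False
    rolling = sum(recent) / len(recent)
    return baseline - rolling > threshold

print(should_alert([88, 85, 84, 62, 58, 61], baseline=86))  # True
```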
Red-Team Testing
Adversarial testing that probes your agent for security vulnerabilities, compliance violations, and prompt injection attacks. Built for teams that need to ship secure, production-ready agents.
Attack vectors tested
- Prompt injection: Attempts to override system instructions via caller input
- Jailbreak attacks: Multi-turn escalation to break agent boundaries
- Data exfiltration: Tricks the agent into revealing internal configuration, API keys, or system prompts
- Social engineering: Impersonation, authority escalation, and urgency manipulation
- Compliance violations: Tests whether the agent can be tricked into making unauthorized commitments, quoting incorrect pricing, or providing medical/legal advice
Red-team results are the highest priority fixes. A prompt injection vulnerability in a customer-facing agent can lead to data breaches, financial loss, and regulatory action.
Fleet Management
For agencies and teams managing multiple agents. View all agents in one place, compare scores across your fleet, and batch-run tests.
Capabilities
- Fleet overview: See all your agents in a single dashboard with health scores and trends
- Batch testing: Run audits, simulations, and red-team tests across multiple agents at once
- Benchmarking: Compare agent performance against fleet averages and identify underperformers
- Cross-agent insights: AI identifies common failure patterns across your fleet and suggests fleet-wide fixes
- Client reports: Generate branded PDF reports for agency clients with per-agent scorecards
Available on the Agency plan. Includes unlimited agents and client report generation.
Golden Datasets
Define expected scoring ranges for known-good transcripts. Use them as regression tests to ensure scoring consistency across model updates and prompt changes.
How it works
- Curate: Select representative call transcripts that cover your key scenarios
- Label: Set expected score ranges for each golden transcript (e.g., Conversation Quality: 85-95)
- Test: Run the golden dataset through the scoring pipeline
- Alert: Get notified when scores drift outside expected ranges, indicating a regression
Best practice: Include at least 10 golden transcripts covering happy path, edge cases, and known failure scenarios. Run the golden dataset before and after every scoring model update.
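The drift check above reduces to a simple range comparison. A minimal sketch, with transcript IDs and field layout assumed for illustration:

```python
# Golden-dataset regression check: each golden transcript has an expected
# score range (e.g. Conversation Quality: 85-95, as in the example above).
def drifted(results: dict[str, float], expected: dict[str, tuple[float, float]]) -> list[str]:
    """Return the golden transcripts whose scores fell outside their range."""
    return [tid for tid, (lo, hi) in expected.items()
            if not (lo <= results.get(tid, 0) <= hi)]

expected = {"golden_001": (85, 95), "golden_002": (70, 80)}
results = {"golden_001": 91, "golden_002": 64}
print(drifted(results, expected))  # ['golden_002'] → scoring has regressed
```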
Weight Calibration
Fine-tune the scoring rubric to match your team's quality standards. VoxGrade uses a human-in-the-loop calibration workflow to align AI grading with your expert judgment.
Calibration workflow
Batch score
Score a set of transcripts through the pipeline. Review the AI's grades alongside the transcripts.
Human grade
Provide your own scores (0-100) for each transcript. The system computes the delta between AI and human grades.
Adjust weights
VoxGrade analyzes the deltas and proposes Bayesian weight adjustments to align AI scoring with your judgment.
Validate and apply
Re-score with proposed weights to verify improvement. Apply when satisfied. Roll back any time.
Calibration sessions are stored in Redis and persist across browser sessions. Available on Pro and Agency plans.
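To make the delta step concrete: the sketch below nudges weights toward the dimensions where the AI and human graders disagree most, then renormalizes. VoxGrade describes its adjustment as Bayesian; this is a deliberately simplified proportional update for illustration only.

```python
# Simplified stand-in for the weight-adjustment step: larger AI-vs-human
# deltas pull more weight toward that dimension, then weights renormalize.
def propose_weights(ai: dict, human: dict, weights: dict, lr: float = 0.05) -> dict:
    deltas = {d: abs(ai[d] - human[d]) for d in weights}
    total = sum(deltas.values()) or 1
    raw = {d: w + lr * (deltas[d] / total) for d, w in weights.items()}
    norm = sum(raw.values())
    return {d: round(w / norm, 3) for d, w in raw.items()}

weights = {"conversation": 0.5, "safety": 0.5}
ai = {"conversation": 90, "safety": 60}
human = {"conversation": 88, "safety": 80}
print(propose_weights(ai, human, weights))  # safety gains weight: the AI misjudged it most
```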
Cost Breakdown
VoxGrade is a subscription product ($0 for Starter, $49/mo for Pro, $149/mo for Agency). Test runs are billed through your own API keys at provider cost with no markup.
| Action | Provider | Approx. Cost |
|---|---|---|
| 30-point prompt audit | OpenRouter | ~$0.02 |
| Text simulations (all 5 phases) | OpenRouter | ~$0.05 |
| AI call grading (per call) | OpenRouter | ~$0.01 |
| Voice simulation (per call) | Voice platform | ~$0.16-0.32 |
| Voice simulations (all 5 calls) | Voice platform | ~$0.80-1.60 |
| Auto-optimizer generation | OpenRouter | ~$0.03 |
| Red-team testing (per run) | OpenRouter | ~$0.08 |
| Full test suite (audit + text + voice + red-team) | Combined | ~$1-2 total |
A full audit + text simulation + voice simulation + red-team cycle costs under $2. Compare that to 45-60 minutes of manual QA at your hourly rate.
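The "~$1-2 total" figure follows from summing the rows above (taking the low and high ends of the voice range):

```python
# Per-run costs from the table above.
costs = {
    "audit": 0.02,
    "text_sims": 0.05,
    "voice_sims_low": 0.80,
    "voice_sims_high": 1.60,
    "red_team": 0.08,
}
low = costs["audit"] + costs["text_sims"] + costs["voice_sims_low"] + costs["red_team"]
high = costs["audit"] + costs["text_sims"] + costs["voice_sims_high"] + costs["red_team"]
print(f"${low:.2f}-${high:.2f}")  # → $0.95-$1.75, i.e. the "~$1-2 total" row
```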
Data & Privacy
VoxGrade is built with a security-first architecture. Your data is encrypted and protected.
API key handling
Your voice platform and OpenRouter API keys are stored encrypted on our servers. Keys are used to make API calls on your behalf and are never exposed or shared.
Server-side storage
Account data, test results, analytics, and agent configurations are stored server-side to power features like version history, the learning engine, scheduled reports, and cross-session persistence.
Voice call data
Voice simulation recordings and transcripts are fetched from your voice platform's API and stored alongside your test results. Call data is also governed by your platform's privacy policy.
Data deletion
You can request data deletion from your account settings or by contacting us. We will remove all associated data within 30 days of the request.
For full details, see our Privacy Policy and Terms of Service.
Frequently Asked Questions
What voice AI platforms do you support?
Retell AI and Vapi with full API integration (agent fetch, prompt push, voice calls, call grading). LiveKit and ElevenLabs support audit, text simulations, and production call ingestion. More platforms are added regularly.
Do I need to install anything?
No. VoxGrade is a web application. Navigate to app.voxgrade.ai and start testing. No extensions, no desktop apps, no dependencies.
How does autonomous testing work without a microphone?
The text simulation engine fetches your agent's real prompt, then uses AI to simulate multi-turn conversations. It sends messages as a simulated caller and evaluates every response against pass/fail criteria. No audio involved.
Can I push fixes directly to my live agents?
Yes. The Auto-Optimizer generates improved prompts and pushes them to Retell or Vapi via API. Your original prompt is backed up automatically. Available on Pro and Agency plans.
Can I integrate VoxGrade into my CI/CD pipeline?
Yes. The v1-test API lets you run tests programmatically and block deploys when quality drops below your threshold. Generate an API key from your dashboard settings.
What if a fix makes things worse?
Use the Before/After Comparison to run the same scenarios against both prompt versions and see exactly which scores changed. The optimizer's history panel lets you roll back to any previous prompt version.
How is my data stored?
Securely on our servers to power features like version history, analytics, and the learning engine. API keys are encrypted at rest. See our privacy policy for details.
How accurate is the AI grading?
The grading engine uses a two-step pipeline: evidence extraction followed by LLM evaluation across 25+ metrics. Use the weight calibration feature to align AI grading with your team's quality standards.
Can I test agents I don't own?
You need an API key with access to the agent. If you have API access to a client's agent, you can test it. This is a common workflow for voice AI agencies managing agents on behalf of clients.
What about red-team and security testing?
Red-team testing probes for prompt injection, jailbreaks, data exfiltration, and social engineering attacks. It runs adversarial conversation scenarios designed to exploit common voice agent vulnerabilities.
What's included in the free trial?
Full Pro or Agency features for 14 days. Unlimited agents, autonomous testing, auto-optimizer, red-team, everything. No credit card required. Test run costs are billed through your own API keys.
What if it doesn't work for me?
Every paid plan includes a 30-day money-back guarantee. If you don't see improvement in your agent scores within 30 days, you'll receive a full refund.