Voice Agent Command Center
The complete QA platform for AI voice agents built on Retell AI. Audit, stress-test, grade, and auto-fix your agents from a single browser-based dashboard. This guide covers everything from setup to shipping production-ready agents.
What is VoxGrade?
VoxGrade, the Voice Agent Command Center, is a browser-based QA platform purpose-built for AI voice agents running on Retell AI. It gives you a structured, repeatable process to find failures, fix prompts, and verify improvements before your clients hear them.
The platform is split into two main modules:
Agent Auditor
Connect an agent, run a 30-point audit, simulate conversations via text and voice, get copy-paste fixes for every failure.
Testing Lab
View test results, grade calls with a weighted rubric, auto-generate improved prompts, and run A/B tests to prove fixes work.
VoxGrade runs entirely in your browser. Your API keys, transcripts, and agent configurations are stored in localStorage. The only data that leaves your machine is the direct API traffic between your browser and Retell and OpenRouter.
Requirements
Before you begin, make sure you have the following:
🔑 Retell AI API Key
Used to connect to your voice agents, fetch prompts, push updates, and trigger voice calls.
1. Go to retellai.com
2. Sign in to your dashboard
3. Navigate to Settings → API Keys
4. Copy your API key
key_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
🔑 OpenRouter API Key
Powers the AI grading engine that scores your agent's responses across the 6-category rubric. It is also used for the prompt audit, text simulations, and the Auto-Optimizer.
1. Go to openrouter.ai/keys
2. Sign in or create an account
3. Click "Create Key"
4. Copy your API key
sk-or-v1-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
🤖 A Retell Agent ID
The unique identifier for the specific voice agent you want to test. Each agent in Retell has its own ID.
1. Open your Retell AI dashboard
2. Click on the agent you want to test
3. Copy the Agent ID from the URL or agent details panel
agent_xxxxxxxxxxxxxxxxxxxxxxxx
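A quick sanity check before pasting these values can catch mix-ups between the three credentials. The sketch below assumes only the prefixes shown in the placeholders above (`key_`, `sk-or-v1-`, `agent_`); it makes no assumption about length or character set, and the function name is hypothetical.

```python
# Prefixes are taken from the placeholder examples in this guide.
# They are an assumption about the current format, not a documented contract.
EXPECTED_PREFIXES = {
    "retell_api_key": "key_",
    "openrouter_api_key": "sk-or-v1-",
    "retell_agent_id": "agent_",
}


def find_credential_problems(values: dict[str, str]) -> list[str]:
    """Return a readable problem for each missing or mis-prefixed value."""
    problems = []
    for name, prefix in EXPECTED_PREFIXES.items():
        value = values.get(name, "").strip()
        if not value:
            problems.append(f"{name} is missing")
        elif not value.startswith(prefix):
            problems.append(f"{name} should start with '{prefix}'")
        elif value == prefix:
            problems.append(f"{name} has a prefix but no body")
    return problems
```

An empty list means all three values at least look like the right kind of credential; it does not prove they are valid.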
Cost note: Text simulations and AI grading use OpenRouter credits (typically $0.05-0.10 per run). Voice simulations use Retell AI telephony credits ($0.80-1.60 for all 5 scenarios). These costs are billed through your own API keys with no markup from VoxGrade.
Quick Start
Go from zero to a fully audited, tested agent in 8 steps. The entire flow takes under 5 minutes.
1. Open the Command Center
Navigate to voice-cmd-center.vercel.app in any modern browser. No installation required.
2. Enter your API keys
Paste your Retell AI API key and OpenRouter API key into the settings panel. Keys are stored locally in your browser and never sent to our servers.
3. Connect an agent
Enter your Retell Agent ID in the Agent Auditor's Overview tab. VoxGrade will pull your agent's prompt, configuration, tools, and variables automatically.
4. Run the 30-point audit
Switch to the Audit tab and hit run. Your agent's prompt is scored across 5 categories with A-F grades. Every failure includes the exact copy-paste fix. Takes about 15 seconds.
5. Review failures and apply fixes
Read through each flagged issue. Copy the suggested fix directly into your Retell prompt editor, or use the Auto-Optimizer (Step 8) to batch-apply fixes.
6. Run text simulations
Hit the Text Simulations tab. The autonomous QA engine runs 5 conversation phases against your agent's prompt: happy path, edge cases, silence handling, hallucination traps, and prompt injection resistance. Cost: ~$0.05.
7. Run voice simulations
Switch to Voice Simulations. AI callers will actually phone your agent with different personas and scenarios. You get real call recordings and transcripts to review. Cost: ~$0.80-1.60 for all 5 scenarios.
8. Optimize and ship
Use the Auto-Optimizer in the Testing Lab to import failures, generate improved prompts, and push them directly to Retell via API. Run A/B tests if you want to prove the fix works before shipping to production.
Pro workflow: Run the audit first to catch structural issues, then text sims to verify conversation flow, then voice sims for real-world validation. This layered approach catches 95%+ of failures before a real caller ever reaches your agent.
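One way to apply the layered order is to treat each stage as a gate, so the most expensive stage (voice) only runs once the cheaper ones pass. The sketch below is an illustration of that idea, not how VoxGrade schedules tests internally; the stage callables are hypothetical stand-ins for the audit, text simulations, and voice simulations.

```python
from typing import Callable

# Each stage is a (name, run) pair; run() returns True when the stage passes.
Stage = tuple[str, Callable[[], bool]]


def run_layered(stages: list[Stage]) -> list[str]:
    """Run stages in order and stop after the first one that fails.

    Returns the names of the stages that actually ran.
    """
    ran = []
    for name, run in stages:
        ran.append(name)
        if not run():
            break
    return ran


# Example: the text stage fails, so the voice stage is never started.
ran = run_layered([
    ("audit", lambda: True),
    ("text", lambda: False),
    ("voice", lambda: True),
])
```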
Agent Auditor: Overview Tab
The Overview tab is your starting point. This is where you connect a Retell agent and view its current configuration.
What it shows
- Agent name and ID pulled directly from Retell
- Current prompt with syntax highlighting
- Configured tools/functions the agent can call
- Variables available to the agent during calls
- Voice settings including provider, voice ID, and language
How to use it
- Paste your Agent ID into the input field
- Click Connect
- Review the fetched configuration to confirm it matches what you expect
- Proceed to the Audit tab
If the agent fails to load, double-check that your Retell API key is correct and that the Agent ID is valid. The key must have read access to the agent.
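To debug a failed load outside the dashboard, you can reproduce the lookup by hand. The sketch below only builds the request and does not send it. The base URL, the `/get-agent/{agent_id}` path, and Bearer authentication are assumptions based on Retell's public API reference and should be confirmed there; the exact calls VoxGrade makes are not documented in this guide.

```python
import urllib.request

# Assumed base URL and path; confirm against Retell's API reference.
RETELL_BASE_URL = "https://api.retellai.com"


def build_get_agent_request(api_key: str, agent_id: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for one agent's configuration."""
    return urllib.request.Request(
        url=f"{RETELL_BASE_URL}/get-agent/{agent_id}",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


request = build_get_agent_request("key_xxx", "agent_xxx")
# To send it: urllib.request.urlopen(request)
```

Under standard HTTP semantics, an authentication error would point at the API key and a not-found error at the Agent ID, which helps narrow down which of the two values to recheck.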
Agent Auditor: 30-Point Audit
The core of the platform. The audit engine analyzes your agent's prompt against 30 checkpoints organized into 5 categories, grading each on an A-F scale.
Audit Categories
🏗 Prompt Structure
Checks for clear role definition, personality consistency, conversation flow logic, greeting quality, and closing sequences.
🎙 Voice Realism
Evaluates natural speech patterns, filler word usage, response length, tone consistency, and whether the agent sounds human or robotic.
📞 Call Management
Tests silence handling, interruption recovery, topic redirection, objection handling, and graceful call ending under pressure.
⚙ Functions & Tools
Verifies tool call formatting, parameter handling, error recovery when tools fail, and proper use of available functions.
📝 Variables & Context
Checks variable injection, context retention across turns, personalization accuracy, and whether the agent properly uses all available data.
Grading Scale
Each checkpoint is graded individually on the A-F scale and then rolled up into a category score and an overall score.
Copy-paste fixes
Every failed checkpoint includes a specific, actionable fix you can copy directly into your Retell prompt. No guessing, no rewriting from scratch. The fix tells you exactly what to add, remove, or change.
❌ Checkpoint: Silence Handling (Call Management)
Grade: F
Issue: No silence recovery instruction found in prompt.
✅ Suggested Fix:
Add to your prompt after the greeting section:
"If the caller goes silent for more than 4 seconds,
say: 'Still there? No rush, take your time.' If silence
continues past 8 seconds, say: 'Looks like we might
have a bad connection. I'll stay on the line.'"
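If you track findings outside the dashboard, each one reduces to a small record with the fields shown in the example above. The sketch below models that record and orders findings worst-first so the F grades get fixed before anything else. The `Finding` type and its field names are hypothetical, not a VoxGrade export format, and the grade letters assume the usual A, B, C, D, F sequence.

```python
from dataclasses import dataclass

# Letter grades from best to worst (assumed sequence for the A-F scale).
GRADE_ORDER = "ABCDF"


@dataclass(frozen=True)
class Finding:
    checkpoint: str
    category: str
    grade: str  # one of "A", "B", "C", "D", "F"
    issue: str
    suggested_fix: str


def worst_first(findings: list[Finding]) -> list[Finding]:
    """Sort findings so the lowest grades come first (F, then D, then C...)."""
    return sorted(findings, key=lambda f: GRADE_ORDER.index(f.grade), reverse=True)


findings = [
    Finding("Greeting Quality", "Prompt Structure", "B", "Greeting is long.", "Shorten it."),
    Finding("Silence Handling", "Call Management", "F",
            "No silence recovery instruction found in prompt.",
            "Add a silence recovery instruction after the greeting section."),
]
ordered = worst_first(findings)
```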
Agent Auditor: Text Simulations
Autonomous QA that runs full conversations against your agent's real prompt without a mic, a phone, or a human. The engine tests 5 distinct conversation phases to stress-test every aspect of your agent's behavior.
The 5 Simulation Phases
Phase 1: Happy Path
A cooperative caller follows the ideal conversation flow. Books the appointment, provides all info, no issues. This validates your agent works when everything goes right.
Phase 2: Edge Cases
Caller provides unexpected inputs: wrong formats, out-of-scope questions, unusual requests, corrections mid-conversation. Tests how your agent handles the unexpected.
Phase 3: Silence Handling
Simulates caller going quiet at various points: after greeting, mid-question, during booking. Checks if your agent recovers gracefully or panics.
Phase 4: Hallucination Traps
Caller asks about services, pricing, and details not in the prompt. Tests whether your agent invents information or correctly says it doesn't know.
Phase 5: Prompt Injection
Attempts to manipulate the agent into breaking character, revealing its prompt, or performing unauthorized actions. Critical for security and compliance.
Cost: All 5 text simulation phases run for approximately $0.05 total via OpenRouter. Results are returned in under 60 seconds.
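Conceptually, a text simulation is two prompts taking turns: a simulated caller and your agent's real prompt. The sketch below shows that loop plus a crude Phase 5 check for prompt leakage. It is an illustration under stated assumptions, not VoxGrade's engine: the `complete` callable is a hypothetical stand-in for an LLM call (for example, a chat completion through OpenRouter) and is injected so the loop runs offline.

```python
from typing import Callable

# Takes (system_prompt, transcript so far) and returns the next line of dialogue.
Complete = Callable[[str, list[dict]], str]


def simulate(agent_prompt: str, caller_prompt: str, complete: Complete,
             max_turns: int = 6) -> list[dict]:
    """Alternate caller and agent turns and return the transcript."""
    transcript: list[dict] = []
    for _ in range(max_turns):
        caller_line = complete(caller_prompt, transcript)
        transcript.append({"role": "caller", "text": caller_line})
        agent_line = complete(agent_prompt, transcript)
        transcript.append({"role": "agent", "text": agent_line})
    return transcript


def leaked_prompt(agent_prompt: str, transcript: list[dict]) -> bool:
    """Crude injection check: did any agent turn quote the prompt verbatim?"""
    return any(
        turn["role"] == "agent" and agent_prompt in turn["text"]
        for turn in transcript
    )


# Offline usage with a scripted stand-in for the model.
def scripted(system_prompt: str, transcript: list[dict]) -> str:
    return "Hi, I'd like to book." if system_prompt == "CALLER" else "Sure, what day works?"


transcript = simulate("AGENT PROMPT", "CALLER", scripted, max_turns=2)
```

A verbatim match only catches the most blatant leaks; paraphrased leaks need an AI grader, which is what the grading step is for.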
Agent Auditor: Voice Simulations
The most realistic test available. AI callers actually phone your agent using Retell's telephony, with different voices, personas, and scenario scripts. You get real call recordings and full transcripts.
How it works
- VoxGrade creates AI caller personas with distinct voices and backgrounds
- Each persona calls your agent's phone number with a specific scenario
- The call plays out naturally over real telephony infrastructure
- After all calls complete, you get recordings, transcripts, and graded results
Scenarios tested
- Standard booking with a cooperative, easy-going caller
- Hesitant caller who needs convincing and asks lots of questions
- Interrupting caller who talks over the agent and changes topics
- Confused caller who mishears information and needs corrections
- Hostile caller who is frustrated and tries to break the conversation flow
Voice simulations use Retell AI telephony credits billed to your Retell account. Make sure your Retell account has sufficient balance before running voice tests.
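A simple pre-flight check is to compare your Retell balance against the worst-case estimate for the run. The sketch below uses the upper per-call estimate from this guide ($0.32); the balance is a number you supply yourself, and the function name is hypothetical.

```python
# The five scenarios listed above, as (persona, behavior) pairs.
SCENARIOS = [
    ("cooperative", "standard booking"),
    ("hesitant", "needs convincing, asks many questions"),
    ("interrupting", "talks over the agent, changes topics"),
    ("confused", "mishears information, needs corrections"),
    ("hostile", "frustrated, tries to break the flow"),
]

# Upper end of the per-call estimate from this guide, in US dollars.
MAX_COST_PER_CALL_USD = 0.32


def can_afford_voice_run(balance_usd: float, calls: int = len(SCENARIOS)) -> bool:
    """True if the balance covers the worst-case estimate for `calls` calls."""
    worst_case = round(calls * MAX_COST_PER_CALL_USD, 2)
    return balance_usd >= worst_case
```

Actual call costs depend on call length, so treat the result as a floor rather than a guarantee.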
Testing Lab: Test Dashboard
The Test Dashboard is your command center for all test results. It aggregates scores from audits, text simulations, and voice simulations into a single view.
What you'll see
- Overall agent score with trend over time
- Per-category breakdown showing which areas need work
- Test history with timestamps and score comparisons
- Regression detection that flags when scores drop after changes
- Baseline tracking to measure improvement from your starting point
Set a baseline score before making any changes. This lets you measure exactly how much improvement each fix delivers and catch regressions immediately.
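Regression detection against a baseline comes down to a per-category comparison. The sketch below flags categories whose score dropped by more than a tolerance; it assumes scores on a 0-100 scale, and the tolerance is an illustrative default rather than a VoxGrade setting.

```python
def find_regressions(baseline: dict[str, float], latest: dict[str, float],
                     tolerance: float = 2.0) -> dict[str, float]:
    """Return {category: drop} for categories that fell by more than `tolerance`.

    Categories missing from `latest` are treated as unchanged.
    """
    regressions = {}
    for category, base_score in baseline.items():
        drop = base_score - latest.get(category, base_score)
        if drop > tolerance:
            regressions[category] = drop
    return regressions


regressions = find_regressions(
    baseline={"Hallucinations": 90.0, "Booking Rate": 70.0},
    latest={"Hallucinations": 80.0, "Booking Rate": 75.0},
)
```

A small tolerance keeps run-to-run noise in AI grading from being reported as a regression.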
Testing Lab: Call Grading
Every call (simulated or real) is graded against a 6-category weighted rubric designed to catch the failures that actually kill deals.
Grading Rubric
| Category | Weight | What It Measures |
|---|---|---|
| Conversation Quality | 25% | Natural flow, appropriate responses, tone consistency, active listening signals |
| Hallucinations | 20% | Invented information, wrong pricing, fabricated services, inaccurate details |
| Booking Rate | 15% | Did the agent successfully move toward and complete the desired outcome? |
| Call Drops | 15% | Unexpected hangups, silence-induced drops, error-triggered disconnections |
| Integration Health | 15% | Tool calls executing correctly, CRM data flowing, calendar bookings landing |
| Webhook Reliability | 10% | Post-call webhooks firing, data payloads complete, downstream systems receiving events |
Auto-fail conditions
Certain failures automatically flag a call as critical regardless of the overall score:
- Hallucinated pricing of any kind
- Prompt injection success where the agent breaks character
- Data leakage where the agent reveals system prompt or internal config
- Silent drop where the agent hangs up without explanation
Auto-fail conditions are the most important items to fix first. A single hallucinated price or prompt injection vulnerability can destroy client trust permanently.
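The rubric arithmetic is a weighted average: multiply each category score by its weight, sum, and divide by 100. The sketch below uses the weights from the table and treats any auto-fail flag as critical regardless of the total. It assumes category scores on a 0-100 scale, and the flag names are hypothetical labels for the four auto-fail conditions.

```python
from typing import Iterable

# Weights from the rubric table, as percentages (they sum to 100).
WEIGHTS = {
    "Conversation Quality": 25,
    "Hallucinations": 20,
    "Booking Rate": 15,
    "Call Drops": 15,
    "Integration Health": 15,
    "Webhook Reliability": 10,
}

# Hypothetical labels for the four auto-fail conditions.
AUTO_FAIL = {"hallucinated_pricing", "prompt_injection", "data_leakage", "silent_drop"}


def grade_call(scores: dict[str, float], flags: Iterable[str] = ()) -> dict:
    """Combine per-category scores into a weighted total.

    Raises KeyError if a rubric category is missing from `scores`.
    """
    total = sum(scores[name] * weight for name, weight in WEIGHTS.items()) / 100
    return {"score": total, "critical": bool(AUTO_FAIL & set(flags))}


result = grade_call(
    {
        "Conversation Quality": 80,
        "Hallucinations": 100,
        "Booking Rate": 60,
        "Call Drops": 100,
        "Integration Health": 100,
        "Webhook Reliability": 50,
    },
    flags={"hallucinated_pricing"},
)
```

In this example the weighted total is 84, but the call is still critical because of the hallucinated price.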
Testing Lab: Auto-Optimizer
The Auto-Optimizer takes your failed test results and generates improved prompt sections to fix them. Once you approve the fixes, it pushes them directly to Retell via API, so you never have to edit the prompt by hand.
Workflow
- Import failures from your latest audit or simulation results
- Review generated fixes with before/after diffs showing exactly what changed
- Approve or edit each fix individually
- Push to Retell with one click. Your original prompt is backed up automatically.
- Re-test to verify the fix worked
Safety net: The optimizer always backs up your original prompt before pushing changes. You can roll back to any previous version at any time from the optimizer's history panel.
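The backup-then-push pattern is easy to reason about as a small history object. The sketch below is an illustration of the pattern, not the optimizer's actual implementation; `push` is a hypothetical callable that writes a prompt to the live agent, for example a wrapper around a Retell API update.

```python
from typing import Callable


class PromptHistory:
    """Keep every previous prompt so any push can be rolled back."""

    def __init__(self, current_prompt: str, push: Callable[[str], None]):
        self._push = push
        self.current = current_prompt
        self.backups: list[str] = []

    def apply(self, new_prompt: str) -> None:
        """Back up the current prompt first, then push the new one."""
        self.backups.append(self.current)
        self._push(new_prompt)
        self.current = new_prompt

    def roll_back(self, version: int = -1) -> None:
        """Restore a backed-up version (default: the most recent backup).

        Goes through apply(), so the prompt being replaced is backed up too.
        """
        restored = self.backups[version]
        self.apply(restored)


# Usage with a list standing in for the live agent.
pushed: list[str] = []
history = PromptHistory("v1", pushed.append)
history.apply("v2")
history.roll_back()
```

Because the backup is written before the push, a failed push leaves the original prompt both live and recorded.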
Available on Pro and Agency plans. Starter plan users can still see the suggested fixes but cannot auto-push them to Retell.
Testing Lab: A/B Testing
Don't guess whether a fix works. Prove it. A/B testing lets you run identical test scenarios against two versions of your agent and compare scores side by side.
How it works
Clone your agent
VoxGrade creates an exact copy of your Retell agent. The original becomes the "control" and the clone becomes the "variant."
Apply the fix to the variant
Push your improved prompt to the variant agent only. The control stays untouched as your baseline.
Run identical tests on both
The same text or voice simulations run against both agents. Same scenarios, same callers, same grading criteria.
Compare scores
Side-by-side comparison shows score differences across all 6 rubric categories. You'll see exactly which areas improved and if anything regressed.
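The comparison itself is a per-category subtraction of control from variant. The sketch below shows that calculation for any set of category scores; it assumes both runs report scores on the same scale, and the function name is hypothetical.

```python
def compare_ab(control: dict[str, float], variant: dict[str, float]) -> dict:
    """Per-category deltas (variant minus control) for categories in both runs."""
    deltas = {name: variant[name] - control[name] for name in control if name in variant}
    return {
        "deltas": deltas,
        "improved": sorted(name for name, d in deltas.items() if d > 0),
        "regressed": sorted(name for name, d in deltas.items() if d < 0),
    }


comparison = compare_ab(
    control={"Hallucinations": 70.0, "Call Drops": 90.0},
    variant={"Hallucinations": 85.0, "Call Drops": 88.0},
)
```

Listing regressions separately matters: a fix that raises one category while lowering another may not be worth shipping.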
Available on the Agency plan only. A/B testing is designed for teams managing multiple client agents who need data-driven proof that changes improve performance.
Cost Breakdown
VoxGrade itself is a subscription product ($0 for Starter, $49/mo for Pro, $149/mo for Agency). Test runs are billed through your own API keys at provider cost with no markup.
| Action | Provider | Approx. Cost |
|---|---|---|
| 30-point prompt audit | OpenRouter | ~$0.02 |
| Text simulations (all 5 phases) | OpenRouter | ~$0.05 |
| AI call grading (per call) | OpenRouter | ~$0.01 |
| Voice simulation (per call) | Retell AI | ~$0.16-0.32 |
| Voice simulations (all 5 calls) | Retell AI | ~$0.80-1.60 |
| Auto-optimizer generation | OpenRouter | ~$0.03 |
| Full test suite (audit + text + voice) | Combined | ~$1-2 total |
A full audit + text simulation + voice simulation + grading cycle costs under $2. Compare that to 45-60 minutes of your own time doing manual QA at your hourly rate.
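The figures in the table can be added up directly. The sketch below estimates one full cycle from the per-action costs, assuming grading is charged once per voice call and that the Auto-Optimizer is not run; both assumptions are ours, not stated billing rules.

```python
# Approximate per-action costs in US dollars, from the table above.
# Voice is a (low, high) range per call; the rest are point estimates.
AUDIT = 0.02
TEXT_SIMS = 0.05
GRADING_PER_CALL = 0.01
VOICE_PER_CALL = (0.16, 0.32)


def full_cycle_cost(voice_calls: int = 5) -> tuple[float, float]:
    """Estimated (low, high) cost of audit + text sims + voice calls + grading."""
    fixed = AUDIT + TEXT_SIMS + GRADING_PER_CALL * voice_calls
    low = fixed + VOICE_PER_CALL[0] * voice_calls
    high = fixed + VOICE_PER_CALL[1] * voice_calls
    return round(low, 2), round(high, 2)


low, high = full_cycle_cost()  # (0.92, 1.72)
```

With the default five voice calls, the estimate lands between $0.92 and $1.72, consistent with the "under $2" figure.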
Data & Privacy
VoxGrade is built with a privacy-first architecture. Your data stays on your machine.
💻 Browser-native storage
All API keys, agent configurations, test results, and transcripts are stored in your browser's localStorage. Nothing is sent to or stored on VoxGrade servers.
🔐 API key handling
Your Retell and OpenRouter API keys are used to make direct API calls from your browser to those providers. Keys are never proxied through VoxGrade infrastructure.
📞 Voice call data
Voice simulation recordings and transcripts are fetched directly from Retell's API into your browser. Call data is governed by Retell AI's privacy policy and your agreement with them.
🗑 Data deletion
Clear your browser's localStorage to remove all VoxGrade data permanently. There is no server-side data to request deletion for.
For full details, see our Privacy Policy and Terms of Service.
Frequently Asked Questions
What voice AI platforms do you support?
Currently Retell AI with full API integration (agent fetch, prompt push, voice calls, call grading). Support for Vapi, Bland.ai, and ElevenLabs is on the roadmap.
Do I need to install anything?
No. VoxGrade runs entirely in your browser. Just navigate to the app URL and start testing. No extensions, no desktop apps, no dependencies.
How does autonomous testing work without a microphone?
The text simulation engine fetches your agent's real prompt from Retell, then uses AI to simulate multi-turn conversations. It sends messages as a simulated caller and evaluates every agent response against pass/fail criteria. No audio is involved, just text-based conversation simulation.
Can I push fixes directly to my live agents?
Yes. The Auto-Optimizer generates improved prompts and pushes them to Retell via API. Your original prompt is backed up automatically so you can always roll back. Available on Pro and Agency plans.
What if a fix makes things worse?
Use A/B testing (Agency plan) to compare the fix against your original before shipping. Or simply re-run the audit and simulations after applying a fix to verify scores improved. The optimizer's history panel lets you roll back to any previous prompt version.
Is my data stored on your servers?
No. Everything runs in your browser. API keys, transcripts, and agent configs stay on your machine in localStorage. Nothing leaves your browser except direct API calls to Retell and OpenRouter.
How accurate is the AI grading?
The grading engine uses frontier AI models via OpenRouter to evaluate responses against the 6-category rubric. It catches hallucinations, quality issues, and failures that are easy to miss during manual review. The weighted rubric is calibrated to prioritize the issues that impact revenue the most.
Can I test agents I don't own?
You need a Retell API key with access to the agent. If you have API access to a client's agent, you can test it. This is a common workflow for voice AI agencies managing agents on behalf of clients.
What's included in the free trial?
Full Pro or Agency features for 14 days. Unlimited agents, autonomous testing, auto-optimizer, everything. No credit card required to start. Test run costs (OpenRouter and Retell) are billed through your own API keys.
What if it doesn't work for me?
Every paid plan includes a 30-day money-back guarantee. If you don't see improvement in your agent scores within 30 days, you'll receive a full refund. No questions asked.