Documentation

Voice Agent
Command Center

The complete QA platform for AI voice agents. Audit, stress-test, grade, and auto-fix your agents across Retell AI, Vapi, LiveKit, and ElevenLabs from a single browser-based dashboard.

What is VoxGrade?

VoxGrade is a browser-based QA platform purpose-built for AI voice agents. It gives you a structured, repeatable process to find failures, fix prompts, and verify improvements before your clients hear them.

The platform is organized into six core modules:

Agent Auditor

Connect an agent, run a 30-point audit, simulate conversations via text and voice, get copy-paste fixes for every failure.

Testing Lab

View test results, grade calls with a weighted rubric, auto-generate improved prompts, and compare before/after scores.

CI/CD API

Run tests programmatically from your deployment pipeline. Block deploys when agent quality drops below your threshold.

Production Monitoring

Ingest production calls, set alerting rules, and track real-world agent performance with automated scoring and anomaly detection.

Red-Team Testing

Adversarial attack simulations that probe for prompt injection, jailbreaks, data exfiltration, and compliance violations.

Fleet Management

Manage and test multiple agents at once. Compare scores across your fleet, identify underperformers, and batch-apply fixes.

VoxGrade stores your data securely server-side to power features like version history, analytics, the learning engine, and scheduled reports. API keys are encrypted at rest.

Requirements

Before you begin, make sure you have the following:

Voice Platform API Key

Used to connect to your voice agents, fetch prompts, push updates, and trigger calls. VoxGrade supports Retell AI, Vapi, LiveKit, and ElevenLabs.

Where to find it
  • Retell AI: retellai.com → Settings → API Keys
  • Vapi: vapi.ai → Dashboard → API Keys
  • LiveKit: livekit.io → Project Settings → Keys
  • ElevenLabs: elevenlabs.io → Profile → API Keys

OpenRouter API Key

Powers the AI grading engine that scores your agent's responses across the 6-dimension rubric. Used for text simulations, auto-optimizer, red-team testing, and insights generation.

Where to find it
  1. Go to openrouter.ai/keys
  2. Sign in or create an account
  3. Click "Create Key"
  4. Copy your API key
Format
sk-or-v1-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

An Agent ID

The unique identifier for the specific voice agent you want to test. Each platform uses its own ID format.

Formats by platform
  • Retell AI: agent_xxxxxxxxxxxxxxxxxxxxxxxx
  • Vapi: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
  • LiveKit: agent-name or agent ID from dashboard
  • ElevenLabs: agent ID from Conversational AI page
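If you validate inputs in your own tooling before connecting, a lightweight sanity check can catch malformed IDs early. This is an illustrative sketch only — the patterns are loose guesses derived from the placeholder formats above, not official platform validation rules:

```typescript
// Illustrative agent ID sanity checks. The regexes are assumptions based on
// the placeholder formats above, not official platform validation rules.
const idPatterns: Record<string, RegExp> = {
  retell: /^agent_[A-Za-z0-9]{16,}$/, // "agent_" prefix + alphanumeric suffix
  vapi: /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i, // UUID
};

function looksLikeAgentId(platform: string, agentId: string): boolean {
  const pattern = idPatterns[platform];
  // LiveKit and ElevenLabs IDs vary, so only require a non-empty value.
  return pattern ? pattern.test(agentId) : agentId.trim().length > 0;
}
```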

Cost note: Text simulations and AI grading use OpenRouter credits (typically $0.05-0.10 per run). Voice simulations use your platform's telephony credits ($0.80-1.60 for all 5 scenarios). These costs are billed through your own API keys with no markup from VoxGrade.

Supported Platforms

VoxGrade connects to multiple voice AI platforms through a unified adapter layer. Test and optimize agents regardless of which platform they run on.

Feature | Retell AI | Vapi | LiveKit | ElevenLabs
Agent fetch | Full | Full | Partial | Partial
30-point audit | Full | Full | Full | Full
Text simulations | Full | Full | Full | Full
Voice simulations | Full | Full | Coming soon | Coming soon
Auto-optimizer push | Full | Full | Manual | Manual
Production call ingest | Full | Full | Full | Full

Quick Start

Go from zero to a fully audited, tested agent in 8 steps. The entire flow takes under 5 minutes.

01

Open the Command Center

Navigate to app.voxgrade.ai in any modern browser. No installation required. 14-day free trial, no credit card.

02

Enter your API keys

Paste your voice platform API key and OpenRouter API key in the settings panel. Keys are encrypted at rest on our servers and never shared.

03

Connect an agent

Enter your Agent ID in the Agent Auditor's Overview tab. VoxGrade will pull your agent's prompt, configuration, tools, and variables automatically.

04

Run the 30-point audit

Switch to the Audit tab and hit run. Your agent's prompt is scored across 5 categories with A-F grades. Every failure includes the exact copy-paste fix. Takes about 15 seconds.

05

Review failures and apply fixes

Read through each flagged issue. Copy the suggested fix directly into your prompt editor, or use the Auto-Optimizer (Step 8) to batch-apply fixes.

06

Run text simulations

Open the Text Simulations tab. The autonomous QA engine runs 5 conversation phases against your agent's prompt: happy path, edge cases, silence handling, hallucination traps, and prompt injection resistance. Cost: ~$0.05.

07

Run voice simulations

Switch to Voice Simulations. AI callers will actually phone your agent with different personas and scenarios. You get real call recordings and transcripts to review. Cost: ~$0.80-1.60 for all 5 scenarios.

08

Optimize and ship

Use the Auto-Optimizer in the Testing Lab to import failures, generate improved prompts, and push them directly to your platform via API. Run A/B tests to prove the fix works before shipping to production.

Pro workflow: Run the audit first to catch structural issues, then text sims to verify conversation flow, then voice sims for real-world validation. This layered approach catches 95%+ of failures before a real caller ever reaches your agent.

Agent Auditor: Overview Tab

The Overview tab is your starting point. This is where you connect a voice agent and view its current configuration.

What it shows

  • Agent name and ID pulled directly from your voice platform
  • Current prompt with syntax highlighting
  • Configured tools/functions the agent can call
  • Variables available to the agent during calls
  • Voice settings including provider, voice ID, and language

How to use it

  1. Select your voice platform (Retell AI, Vapi, LiveKit, or ElevenLabs)
  2. Paste your Agent ID into the input field
  3. Click Connect
  4. Review the fetched configuration to confirm it matches what you expect
  5. Proceed to the Audit tab

If the agent fails to load, double-check that your API key is correct and that the Agent ID is valid. The key must have read access to the agent.

Agent Auditor: 30-Point Audit

The core of the platform. The audit engine analyzes your agent's prompt against 30 checkpoints organized into 5 categories, grading each on an A-F scale.

Audit Categories

Category 1
Prompt Structure

Checks for clear role definition, personality consistency, conversation flow logic, greeting quality, and closing sequences.

Category 2
Voice Realism

Evaluates natural speech patterns, filler word usage, response length, tone consistency, and whether the agent sounds human or robotic.

Category 3
Call Management

Tests silence handling, interruption recovery, topic redirection, objection handling, and graceful call ending under pressure.

Category 4
Functions & Tools

Verifies tool call formatting, parameter handling, error recovery when tools fail, and proper use of available functions.

Category 5
Variables & Context

Checks variable injection, context retention across turns, personalization accuracy, and whether the agent properly uses all available data.

Grading Scale

Each checkpoint is graded individually and then rolled up into a category score and overall score:

  • A: 90-100
  • B: 80-89
  • C: 70-79
  • D: 60-69
  • F: below 60
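For reference, the banding is a straightforward threshold mapping. A minimal sketch:

```typescript
// Map a 0-100 checkpoint, category, or overall score to its letter grade,
// following the bands listed above.
function letterGrade(score: number): "A" | "B" | "C" | "D" | "F" {
  if (score >= 90) return "A";
  if (score >= 80) return "B";
  if (score >= 70) return "C";
  if (score >= 60) return "D";
  return "F";
}
```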

Copy-paste fixes

Every failed checkpoint includes a specific, actionable fix you can copy directly into your prompt. No guessing, no rewriting from scratch. The fix tells you exactly what to add, remove, or change.

Example failure output
FAIL
Checkpoint: Silence Handling (Call Management)
Grade: F
Issue: No silence recovery instruction found in prompt.
FIX: Add to your prompt after the greeting section: "If the caller goes silent for more than 4 seconds, say: 'Still there? No rush, take your time.' If silence continues past 8 seconds, say: 'Looks like we might have a bad connection. I'll stay on the line.'"

Agent Auditor: Text Simulations

Autonomous QA that runs full conversations against your agent's real prompt without a mic, a phone, or a human. The engine tests 5 distinct conversation phases to stress-test every aspect of your agent's behavior.

The 5 Simulation Phases

Phase 1: Happy Path

A cooperative caller follows the ideal conversation flow. Books the appointment, provides all info, no issues. This validates your agent works when everything goes right.

Phase 2: Edge Cases

Caller provides unexpected inputs: wrong formats, out-of-scope questions, unusual requests, corrections mid-conversation. Tests how your agent handles the unexpected.

Phase 3: Silence Handling

Simulates caller going quiet at various points: after greeting, mid-question, during booking. Checks if your agent recovers gracefully or panics.

Phase 4: Hallucination Traps

Caller asks about services, pricing, and details not in the prompt. Tests whether your agent invents information or correctly says it doesn't know.

Phase 5: Prompt Injection

Attempts to manipulate the agent into breaking character, revealing its prompt, or performing unauthorized actions. Critical for security and compliance.

Cost: All 5 text simulation phases run for approximately $0.05 total via OpenRouter. Results are returned in under 60 seconds.
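Conceptually, each phase is a loop in which one model plays your agent (seeded with its real prompt) and another plays the scripted caller. The sketch below shows the shape of that loop using OpenRouter's OpenAI-compatible chat endpoint; the model name and phase script are illustrative, and VoxGrade's actual engine layers per-turn grading and pass/fail checks on top of this:

```typescript
type Msg = { role: "system" | "user" | "assistant"; content: string };

// Hypothetical helper: one completion via OpenRouter's OpenAI-compatible API.
async function complete(messages: Msg[]): Promise<string> {
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    // Model choice is illustrative; any OpenRouter chat model works here.
    body: JSON.stringify({ model: "openai/gpt-4o-mini", messages }),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}

// Simulate one phase: the "caller" model probes the "agent" model turn by turn.
async function runPhase(agentPrompt: string, phaseScript: string, turns = 6) {
  const agent: Msg[] = [{ role: "system", content: agentPrompt }];
  const caller: Msg[] = [{ role: "system", content: phaseScript }];
  for (let i = 0; i < turns; i++) {
    const callerLine = await complete(caller); // caller speaks
    caller.push({ role: "assistant", content: callerLine });
    agent.push({ role: "user", content: callerLine });
    const agentLine = await complete(agent); // agent responds from its real prompt
    agent.push({ role: "assistant", content: agentLine });
    caller.push({ role: "user", content: agentLine });
    // A production engine would grade agentLine against pass/fail criteria here.
  }
  return agent; // full transcript, ready for grading
}
```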

Agent Auditor: Voice Simulations

The most realistic test available. AI callers actually phone your agent using real telephony, with different voices, personas, and scenario scripts. You get real call recordings and full transcripts.

How it works

  1. VoxGrade creates AI caller personas with distinct voices and backgrounds
  2. Each persona calls your agent's phone number with a specific scenario
  3. The call plays out naturally over real telephony infrastructure
  4. After all calls complete, you get recordings, transcripts, and graded results

Scenarios tested

  • Standard booking with a cooperative, easy-going caller
  • Hesitant caller who needs convincing and asks lots of questions
  • Interrupting caller who talks over the agent and changes topics
  • Confused caller who mishears information and needs corrections
  • Hostile caller who is frustrated and tries to break the conversation flow
  • ~$0.16 per voice call
  • ~$0.80 for 5 calls (minimum)
  • ~$1.60 for 5 calls (maximum)

Voice simulations use telephony credits billed to your voice platform account. Make sure your account has sufficient balance before running voice tests.

Testing Lab: Test Dashboard

The Test Dashboard is your command center for all test results. It aggregates scores from audits, text simulations, and voice simulations into a single view.

What you'll see

  • Overall agent score with trend over time
  • Per-category breakdown showing which areas need work
  • Test history with timestamps and score comparisons
  • Regression detection that flags when scores drop after changes
  • Baseline tracking to measure improvement from your starting point
  • AI-powered insights that identify patterns and suggest next steps

Set a baseline score before making any changes. This lets you measure exactly how much improvement each fix delivers and catch regressions immediately.

Testing Lab: Call Grading

Every call (simulated or real) is graded using a two-step scoring pipeline: evidence extraction followed by LLM-based evaluation across 25+ metrics in 6 weighted dimensions.

Grading Rubric

Dimension | Weight | What It Measures
Conversation Quality | 30% | Natural flow, appropriate responses, tone consistency, active listening signals, turn-taking
Task Completion | 25% | Did the agent achieve the desired outcome? Booking rate, information gathered, goal progression
Safety & Compliance | 15% | Hallucinations, prompt injection resistance, data leakage, compliance violations
Empathy & Rapport | 10% | Emotional intelligence, active listening, appropriate empathy, rapport building
Latency & Timing | 10% | Time to first word (TTFW), P50/P90/P99 response latency, talk ratio, silence handling
Audio Quality | 10% | Voice clarity, interruption handling, background noise resilience, natural pacing

Auto-fail conditions

Certain failures automatically flag a call as critical regardless of the overall score:

  • Hallucinated pricing of any kind
  • Prompt injection success where the agent breaks character
  • Data leakage where the agent reveals system prompt or internal config
  • Silent drop where the agent hangs up without explanation

Auto-fail conditions are the most important items to fix first. A single hallucinated price or prompt injection vulnerability can destroy client trust permanently.
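To make the rollup concrete, here is a minimal sketch of the weighted scoring with the auto-fail override, assuming per-dimension scores on a 0-100 scale (the dimension keys and result shape are illustrative, not VoxGrade's internal API):

```typescript
// Rubric weights from the table above (sum to 1.0).
const WEIGHTS: Record<string, number> = {
  conversationQuality: 0.3,
  taskCompletion: 0.25,
  safetyCompliance: 0.15,
  empathyRapport: 0.1,
  latencyTiming: 0.1,
  audioQuality: 0.1,
};

// Roll six 0-100 dimension scores into an overall score; auto-fail
// conditions flag the call as critical regardless of that number.
function gradeCall(
  dimensionScores: Record<string, number>,
  autoFails: string[], // e.g. ["hallucinated_pricing"] — labels are illustrative
) {
  const overall = Object.entries(WEIGHTS).reduce(
    (sum, [dim, w]) => sum + w * (dimensionScores[dim] ?? 0),
    0,
  );
  return { overall: Math.round(overall), critical: autoFails.length > 0 };
}
```

Note that a call can score well overall and still be flagged critical by a single auto-fail, which is why those conditions are listed separately above.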

Testing Lab: Auto-Optimizer

The Auto-Optimizer takes your failed test results and generates improved prompt sections to fix them. Then it pushes the fixes directly to your voice platform via API. No manual editing required.

Workflow

  1. Import failures from your latest audit or simulation results
  2. Review generated fixes with before/after diffs showing exactly what changed
  3. Approve or edit each fix individually
  4. Push to your platform with one click. Your original prompt is backed up automatically.
  5. Re-test to verify the fix worked

Safety net: The optimizer always backs up your original prompt before pushing changes. You can roll back to any previous version at any time from the optimizer's history panel.

Available on Pro and Agency plans. Starter plan users can still see the suggested fixes but cannot auto-push them.

Testing Lab: Before/After Comparison

Don't guess whether a fix works. Prove it. Run the same test scenarios before and after your prompt changes and compare scores side by side.

How it works

01

Save your baseline

Run a full test suite on your current prompt. VoxGrade saves these scores as your "before" baseline.

02

Apply the fix

Push your improved prompt via the Auto-Optimizer. Your original prompt is backed up automatically.

03

Run the same tests again

The same text or voice simulations run against your updated prompt. Same scenarios, same grading criteria.

04

Compare scores

Side-by-side comparison shows score differences across all 6 rubric dimensions. You'll see exactly which areas improved and if anything regressed.
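The comparison itself reduces to a per-dimension delta between the baseline run and the post-fix run. A minimal sketch:

```typescript
// Per-dimension score delta between a baseline run and a post-fix run.
// Positive deltas are improvements; negative deltas are regressions.
function compareRuns(
  before: Record<string, number>,
  after: Record<string, number>,
): Record<string, number> {
  return Object.fromEntries(
    Object.keys(before).map((dim) => [dim, (after[dim] ?? 0) - before[dim]]),
  );
}
```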

Available on Pro and Agency plans. Before/after comparison is designed for teams who need data-driven proof that prompt changes improve performance.

CI/CD API

Run VoxGrade tests programmatically from your deployment pipeline. Block deploys when agent quality drops below your threshold. Perfect for teams shipping agent updates frequently.

How it works

  1. Generate an API key from your VoxGrade dashboard settings
  2. Add the VoxGrade test step to your CI/CD pipeline (GitHub Actions, CircleCI, etc.)
  3. Set your minimum passing score threshold
  4. Deploys are blocked automatically if the score drops below your threshold
API call
```
POST https://app.voxgrade.ai/api/v1-test
Authorization: Bearer vxg_xxxxxxxxxxxx
Content-Type: application/json

{
  "agent_id": "agent_xxxxxxxxxxxx",
  "platform": "retell",
  "tests": ["audit", "text_sim"],
  "min_score": 80,
  "webhook_url": "https://your-ci/callback"
}
```
Response
```
{
  "batch_id": "batch_xxxxxxxxxxxx",
  "status": "running",
  "results_url": "https://app.voxgrade.ai/results/batch_xxxx"
}
```

Webhook support: VoxGrade will POST results to your webhook URL when tests complete, including pass/fail status and the full score breakdown. No polling needed.
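For teams that prefer polling over webhooks, here is a minimal sketch of a CI gate script. The polled payload's fields (status, passed) are assumptions — the documented response covers only batch_id, status, and results_url — so adapt the final check to your actual results payload:

```typescript
// Minimal CI gate: start a VoxGrade run, poll the results URL, and exit
// non-zero to block the deploy if the batch fails. The "passed" field on the
// polled payload is an assumption; adjust to your actual results schema.
async function ciGate() {
  const start = await fetch("https://app.voxgrade.ai/api/v1-test", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.VOXGRADE_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      agent_id: process.env.AGENT_ID,
      platform: "retell",
      tests: ["audit", "text_sim"],
      min_score: 80,
    }),
  });
  const { results_url } = await start.json();

  for (let i = 0; i < 60; i++) {                     // poll up to ~10 minutes
    await new Promise((r) => setTimeout(r, 10_000)); // 10s between polls
    const result = await (await fetch(results_url)).json();
    if (result.status !== "running") {
      if (!result.passed) process.exit(1);           // block the deploy
      return;                                        // quality gate passed
    }
  }
  process.exit(1); // timed out — fail closed
}

ciGate();
```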

Production Monitoring

Don't just test in staging. Monitor your agents in production. Ingest real calls, auto-score them, set alerting rules, and catch regressions before your users complain.

Features

  • Call ingestion: Automatically pull production calls from Retell, Vapi, LiveKit, or ElevenLabs every 4 hours via cron
  • Auto-scoring: Every ingested call is graded through the same 25+ metric pipeline used in testing
  • Alerting rules: Set custom monitors that trigger email alerts when scores drop, hallucinations appear, or call drop rates spike
  • Anomaly detection: AI-powered pattern recognition flags unusual behavior before it becomes a trend
  • Webhook ingest: Receive post-call data via webhook for real-time monitoring

Production monitoring is available on Pro and Agency plans. Cron ingestion runs automatically. Configure monitors from the Monitors tab in your dashboard.
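One way to use webhook ingest is a thin forwarding layer between your voice platform's post-call webhook and VoxGrade. This is a sketch under stated assumptions — the ingest URL and payload field names here are hypothetical, so check your dashboard for the real ingest endpoint and schema:

```typescript
// Illustrative middleware: forward a voice platform's post-call webhook
// payload to VoxGrade. The ingest URL and field names below are hypothetical.
import { createServer } from "node:http";

createServer((req, res) => {
  let body = "";
  req.on("data", (chunk) => (body += chunk));
  req.on("end", async () => {
    const call = JSON.parse(body); // post-call payload from your voice platform
    await fetch("https://app.voxgrade.ai/api/ingest", { // hypothetical URL
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.VOXGRADE_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        agent_id: call.agent_id,
        transcript: call.transcript,
        recording_url: call.recording_url,
      }),
    });
    res.writeHead(204).end();
  });
}).listen(8080);
```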

Red-Team Testing

Adversarial testing that probes your agent for security vulnerabilities, compliance violations, and prompt injection attacks. Built for teams that need to ship secure, production-ready agents.

Attack vectors tested

  • Prompt injection: Attempts to override system instructions via caller input
  • Jailbreak attacks: Multi-turn escalation to break agent boundaries
  • Data exfiltration: Tricks the agent into revealing internal configuration, API keys, or system prompts
  • Social engineering: Impersonation, authority escalation, and urgency manipulation
  • Compliance violations: Tests whether the agent can be tricked into making unauthorized commitments, quoting incorrect pricing, or providing medical/legal advice

Red-team results are the highest priority fixes. A prompt injection vulnerability in a customer-facing agent can lead to data breaches, financial loss, and regulatory action.

Fleet Management

For agencies and teams managing multiple agents. View all agents in one place, compare scores across your fleet, and batch-run tests.

Capabilities

  • Fleet overview: See all your agents in a single dashboard with health scores and trends
  • Batch testing: Run audits, simulations, and red-team tests across multiple agents at once
  • Benchmarking: Compare agent performance against fleet averages and identify underperformers
  • Cross-agent insights: AI identifies common failure patterns across your fleet and suggests fleet-wide fixes
  • Client reports: Generate branded PDF reports for agency clients with per-agent scorecards

Available on the Agency plan. Includes unlimited agents and client report generation.

Golden Datasets

Define expected scoring ranges for known-good transcripts. Use them as regression tests to ensure scoring consistency across model updates and prompt changes.

How it works

  1. Curate: Select representative call transcripts that cover your key scenarios
  2. Label: Set expected score ranges for each golden transcript (e.g., Conversation Quality: 85-95)
  3. Test: Run the golden dataset through the scoring pipeline
  4. Alert: Get notified when scores drift outside expected ranges, indicating a regression

Best practice: Include at least 10 golden transcripts covering happy path, edge cases, and known failure scenarios. Run the golden dataset before and after every scoring model update.
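In practice a golden dataset is a list of transcripts with expected score bands, and the regression check is a range comparison per dimension. A minimal sketch (the entry shape is illustrative):

```typescript
// Golden dataset entry: a known-good transcript plus expected score bands.
type GoldenEntry = {
  transcriptId: string;
  expected: Record<string, [min: number, max: number]>; // per-dimension bands
};

// Compare fresh pipeline scores against each entry's expected bands and
// report any dimension that drifted outside its range.
function checkDrift(
  entry: GoldenEntry,
  scores: Record<string, number>, // scores from the current pipeline run
): string[] {
  return Object.entries(entry.expected)
    .filter(([dim, [min, max]]) => scores[dim] < min || scores[dim] > max)
    .map(
      ([dim, [min, max]]) =>
        `${entry.transcriptId}: ${dim} = ${scores[dim]} (expected ${min}-${max})`,
    );
}

// Example: Conversation Quality expected 85-95, pipeline returned 78 → flagged.
const issues = checkDrift(
  { transcriptId: "booking-happy-path", expected: { conversationQuality: [85, 95] } },
  { conversationQuality: 78 },
);
```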

Weight Calibration

Fine-tune the scoring rubric to match your team's quality standards. VoxGrade uses a human-in-the-loop calibration workflow to align AI grading with your expert judgment.

Calibration workflow

01

Batch score

Score a set of transcripts through the pipeline. Review the AI's grades alongside the transcripts.

02

Human grade

Provide your own scores (0-100) for each transcript. The system computes the delta between AI and human grades.

03

Adjust weights

VoxGrade analyzes the deltas and proposes Bayesian weight adjustments to align AI scoring with your judgment.

04

Validate and apply

Re-score with proposed weights to verify improvement. Apply when satisfied. Roll back any time.
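To illustrate the idea behind step 03: dimensions where the AI's grades track your human grades closely keep their weight, while noisier dimensions are shrunk, and the result is renormalized. This is a deliberately simplified stand-in for the actual Bayesian adjustment, not VoxGrade's internal algorithm:

```typescript
// Toy delta-driven weight adjustment. The real calibration uses Bayesian
// updating; this sketch only shows the shape of the problem.
function proposeWeights(
  current: Record<string, number>,    // current rubric weights (sum to 1.0)
  deltas: Record<string, number[]>,   // per-dimension |AI - human| score gaps
): Record<string, number> {
  const reliability: Record<string, number> = {};
  for (const [dim, gaps] of Object.entries(deltas)) {
    const meanGap = gaps.reduce((a, b) => a + b, 0) / gaps.length;
    reliability[dim] = current[dim] / (1 + meanGap / 100); // shrink noisy dims
  }
  // Renormalize so the proposed weights still sum to 1.0.
  const total = Object.values(reliability).reduce((a, b) => a + b, 0);
  return Object.fromEntries(
    Object.entries(reliability).map(([dim, w]) => [dim, w / total]),
  );
}
```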

Calibration sessions are stored in Redis and persist across browser sessions. Available on Pro and Agency plans.

Cost Breakdown

VoxGrade is a subscription product ($0 for Starter, $49/mo for Pro, $149/mo for Agency). Test runs are billed through your own API keys at provider cost with no markup.

Action | Provider | Approx. Cost
30-point prompt audit | OpenRouter | ~$0.02
Text simulations (all 5 phases) | OpenRouter | ~$0.05
AI call grading (per call) | OpenRouter | ~$0.01
Voice simulation (per call) | Voice platform | ~$0.16-0.32
Voice simulations (all 5 calls) | Voice platform | ~$0.80-1.60
Auto-optimizer generation | OpenRouter | ~$0.03
Red-team testing (per run) | OpenRouter | ~$0.08
Full test suite (audit + text + voice + red-team) | Combined | ~$1-2 total

A full audit + text simulation + voice simulation + red-team cycle costs under $2. Compare that to 45-60 minutes of manual QA at your hourly rate.

Data & Privacy

VoxGrade is built with a security-first architecture. Your data is encrypted and protected.

API key handling

Your voice platform and OpenRouter API keys are stored encrypted on our servers. Keys are used to make API calls on your behalf and are never exposed or shared.

Server-side storage

Account data, test results, analytics, and agent configurations are stored server-side to power features like version history, the learning engine, scheduled reports, and cross-session persistence.

Voice call data

Voice simulation recordings and transcripts are fetched from your voice platform's API and stored alongside your test results. Call data is also governed by your platform's privacy policy.

Data deletion

You can request data deletion from your account settings or by contacting us. We will remove all associated data within 30 days of the request.

For full details, see our Privacy Policy and Terms of Service.

Frequently Asked Questions

What voice AI platforms do you support?

Retell AI and Vapi with full API integration (agent fetch, prompt push, voice calls, call grading). LiveKit and ElevenLabs support audit, text simulations, and production call ingestion. More platforms are added regularly.

Do I need to install anything?

No. VoxGrade is a web application. Navigate to app.voxgrade.ai and start testing. No extensions, no desktop apps, no dependencies.

How does autonomous testing work without a microphone?

The text simulation engine fetches your agent's real prompt, then uses AI to simulate multi-turn conversations. It sends messages as a simulated caller and evaluates every response against pass/fail criteria. No audio involved.

Can I push fixes directly to my live agents?

Yes. The Auto-Optimizer generates improved prompts and pushes them to Retell or Vapi via API. Your original prompt is backed up automatically. Available on Pro and Agency plans.

Can I integrate VoxGrade into my CI/CD pipeline?

Yes. The v1-test API lets you run tests programmatically and block deploys when quality drops below your threshold. Generate an API key from your dashboard settings.

What if a fix makes things worse?

Use before/after comparison to compare scores before and after your fix. The optimizer's history panel lets you roll back to any previous prompt version.

How is my data stored?

Securely on our servers to power features like version history, analytics, and the learning engine. API keys are encrypted at rest. See our privacy policy for details.

How accurate is the AI grading?

The grading engine uses a two-step pipeline: evidence extraction followed by LLM evaluation across 25+ metrics. Use the weight calibration feature to align AI grading with your team's quality standards.

Can I test agents I don't own?

You need an API key with access to the agent. If you have API access to a client's agent, you can test it. This is a common workflow for voice AI agencies managing agents on behalf of clients.

What about red-team and security testing?

Red-team testing probes for prompt injection, jailbreaks, data exfiltration, and social engineering attacks. It runs adversarial conversation scenarios designed to exploit common voice agent vulnerabilities.

What's included in the free trial?

Full Pro or Agency features for 14 days. Unlimited agents, autonomous testing, auto-optimizer, red-team, everything. No credit card required. Test run costs are billed through your own API keys.

What if it doesn't work for me?

Every paid plan includes a 30-day money-back guarantee. If you don't see improvement in your agent scores within 30 days, you'll receive a full refund.

Ready to start?

Run your first 30-point audit in under 60 seconds.

Open the Command Center
No credit card · 14-day free trial · Data encrypted