Learn VoxGrade

Step-by-step guides to test, monitor, and optimize your AI voice agents. From your first simulation to production monitoring.

Quick Start

Get Started in Minutes

Three essential tutorials to run your first tests and start improving your voice agents today.

1
💬

Your First Text Simulation

Set up an agent profile, write realistic conversation scenarios, and run your first LLM-vs-LLM simulation to see how well your agent handles objections.

⏱ 5 minutes
2
📞

Your First Voice Test

Connect your Retell or Vapi agent, add a test phone number, make a real call, and get instant grading on how the conversation went.

⏱ 10 minutes
3

Set Up Cron Monitoring

Schedule automated tests to run 24/7, catch regressions before customers do, and get alerts when your agents start failing.

⏱ 5 minutes
Feature Guides

Master Every Feature

Deep dives into each VoxGrade capability. Click any guide to expand the step-by-step instructions.

💬

Text Simulations

Test agent responses without making calls. Fast, scalable, and perfect for iteration.

Step 1
Go to Text Simulation tab. Enter your agent's system prompt or select a saved profile.
Step 2
Write a test scenario describing the customer's situation, objections, and tone. Example: "Price-conscious lead, budget concerns, comparing 3 vendors."
Step 3
Define expected agent behavior. What should the agent say? What objections should it handle? Be specific.
Step 4
Click Run Simulation. VoxGrade generates a realistic customer conversation using an LLM and tests your agent's responses.
Step 5
Review the similarity score (0-100). High scores mean your agent matched the expected behavior; low scores reveal gaps between what the agent actually said and what you expected.
Pro Tip
Run 10+ scenarios covering objections, angry customers, edge cases, and happy paths. Text sims are cheap and fast—iterate aggressively.
📞

Voice Testing

Make real phone calls to your agent. Test latency, interruptions, and human-like responses.

Step 1
Go to Voice Test tab. Connect your platform (Retell AI or Vapi) by entering API key and agent ID.
Step 2
Add a test phone number. This is the number your agent will call to test its behavior in a live conversation.
Step 3
Define the test scenario and grading rubric. What should the agent accomplish? How should it handle objections? Set success criteria.
Step 4
Click Start Call. VoxGrade initiates the call, records the conversation, and transcribes it in real-time.
Step 5
Review results: transcript, recording, latency metrics, interruption count, and an overall grade (A-F). See exactly where your agent failed.
Pro Tip
Run voice tests AFTER text sims. Text is cheap for iteration. Voice is for final validation before pushing to production.
🤖

Auto-Apply Optimization

AI analyzes failures and suggests prompt fixes. Apply improvements with one click.

How It Works
After running tests, VoxGrade's AI analyzes failures, identifies patterns, and generates specific prompt improvements to fix the issues.
Step 1
Run a test (text sim or voice call) that fails or scores below expectations. The more detail in your grading rubric, the better the fixes.
Step 2
Click Analyze & Fix. VoxGrade compares actual behavior vs. expected, identifies failure modes, and drafts prompt changes.
Step 3
Review the suggested changes. You'll see a before/after diff, an explanation of why the change helps, and a confidence score.
Step 4
Click Apply Fix. VoxGrade updates your agent prompt and automatically re-runs the test to confirm improvement.
Pro Tip
Use auto-apply for incremental improvements. Still manually review before pushing to production. AI is smart but not infallible.
🔬

A/B Prompt Comparison

Test two prompt versions side-by-side. Statistical significance built in.

Step 1
Go to A/B Testing tab. Enter your current prompt as "Control" and your new prompt as "Variant".
Step 2
Write test scenarios. Use identical scenarios for both prompts to isolate the impact of the prompt change.
Step 3
Set sample size (recommended: 20+ tests per variant for statistical significance). VoxGrade will split traffic 50/50.
Step 4
Click Start A/B Test. VoxGrade runs tests in parallel and tracks win rate, average score, and confidence interval for each variant.
Step 5
Review results. Look for statistically significant improvements (p-value < 0.05). If variant wins, deploy it. If no difference, keep iterating.
Pro Tip
A/B test one change at a time. Testing multiple changes simultaneously makes it impossible to isolate what worked.
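The "statistically significant" call in Step 5 is a standard difference-of-proportions test on the two win rates. VoxGrade computes this for you, and the exact test it uses is not specified here, but as a back-of-envelope check the same idea can be sketched with a two-proportion z-test:

```python
import math

def two_proportion_p_value(wins_a: int, n_a: int, wins_b: int, n_b: int) -> float:
    """Two-sided p-value for a difference in win rates (two-proportion z-test)."""
    p_a, p_b = wins_a / n_a, wins_b / n_b
    pooled = (wins_a + wins_b) / (n_a + n_b)       # pooled win rate under the null
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0                                  # identical outcomes: no evidence
    z = (p_b - p_a) / se
    return math.erfc(abs(z) / math.sqrt(2))         # two-sided normal tail probability

# A variant winning 17/20 vs. the control's 10/20 clears p < 0.05;
# 11/20 vs. 10/20 does not -- hence the 20+ sample-size recommendation.
```

This is also why small samples rarely reach significance: with 20 tests per variant, only large win-rate gaps clear the p < 0.05 bar.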
📊

Shareable Reports

Generate public links for client demos, stakeholder reviews, and audits.

Step 1
Run any test (text sim, voice call, or A/B test). After completion, click Generate Report in the results panel.
Step 2
Choose what to include: full transcript, recordings (voice only), grading breakdown, improvement suggestions. Exclude sensitive data if needed.
Step 3
Click Create Public Link. VoxGrade generates a unique URL. Anyone with the link can view the report—no login required.
Step 4
Share the link with clients, stakeholders, or team members. Reports are mobile-responsive and include branding options (Pro+ plans).
Step 5
Manage reports from Reports tab. Revoke access, set expiration dates, or regenerate links if compromised.
Pro Tip
Use shareable reports for client QA sign-off. Show the work, build trust, and charge premium rates for transparency.

Cron Scheduling

Automated 24/7 testing. Catch regressions before customers complain.

Step 1
Go to Cron tab. Click New Cron Job. Select which agent profile and test scenarios to run automatically.
Step 2
Set frequency: hourly, every 6 hours, daily, or weekly. Higher-tier plans support more frequent runs and parallel agents.
Step 3
Configure alert thresholds. Example: "Alert me if score drops below 80% or if 3+ tests fail in a row." Connect Slack/email for notifications.
Step 4
Click Activate. VoxGrade runs tests on schedule and logs results. View history, trends, and anomalies from the Cron dashboard.
Step 5
Monitor long-term trends. Track score drift over time. Catch regressions from upstream model updates or config changes.
Pro Tip
Run cron tests after every prompt change. Set a baseline, deploy change, and monitor for 24-48 hours before calling it stable.
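The alert rule from Step 3 ("score drops below 80% or 3+ tests fail in a row") is easy to reason about in code. VoxGrade evaluates these rules server-side; the function and threshold names below are purely illustrative:

```python
def should_alert(scores, min_score=80, max_consecutive_failures=3, passing=70):
    """Illustrative mirror of a cron alert rule: fire if the latest score
    drops below min_score, or if too many runs in a row fall below passing."""
    if scores and scores[-1] < min_score:
        return True
    streak = 0
    for s in scores:
        streak = streak + 1 if s < passing else 0  # count consecutive failures
        if streak >= max_consecutive_failures:
            return True
    return False
```

Tuning the two thresholds independently lets you catch both sudden drops (latest score) and slow degradation (failure streaks).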
📱

Phone Management

Add test numbers, configure transfer behavior, and manage multi-agent setups.

Step 1
Go to Phone Numbers tab. Click Add Number. Enter the test number and assign it to an agent profile.
Step 2
Configure platform settings. For Retell: add API key and agent ID. For Vapi: add API key and assistant ID. Both are found in your platform dashboard.
Step 3
Test the connection. Click Test Call to verify VoxGrade can trigger calls and receive webhooks from your platform.
Step 4
Set up transfer testing (optional). If your agent transfers calls, add transfer numbers and configure transfer scenarios to test handoff behavior.
Pro Tip
Use separate numbers for staging vs. production. Never run automated tests against production agent numbers—use clones.
📈

Grading & Scoring

Understand how VoxGrade calculates scores, grades, and success criteria.

Scoring System
VoxGrade uses a 0-100 point scale. Text sims measure semantic similarity. Voice tests grade on accuracy, latency, interruptions, and goal completion.
Grade Breakdown
A (90-100): Production-ready. B (80-89): Minor tweaks needed. C (70-79): Significant issues; fix before deploying. D (60-69): Major failures; do not deploy. F (<60): Critical failures.
Text Simulation Scoring
Compares agent output vs. expected behavior using embeddings. High similarity = agent matched intent. Low similarity = agent went off-script or missed objections.
Voice Test Scoring
Multi-factor: accuracy (did it say the right things?), latency (response time), interruptions (did it talk over the customer?), goal completion (did it book the meeting?).
Custom Rubrics
Define your own success criteria. Example: "Must mention price within 30 seconds" or "Must handle 'not interested' objection." VoxGrade grades against your rules.
Pro Tip
Set a minimum passing score (e.g., 85+) and never deploy below that. Use A/B tests to raise the bar over time.
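The grade bands above translate directly into code. A small sketch of the score-to-grade mapping, plus the deployment gate suggested in the Pro Tip (the names here are illustrative, not part of any VoxGrade API):

```python
def letter_grade(score: float) -> str:
    """Map a 0-100 VoxGrade score to its letter grade (bands from the docs)."""
    if score >= 90:
        return "A"   # production-ready
    if score >= 80:
        return "B"   # minor tweaks needed
    if score >= 70:
        return "C"   # significant issues
    if score >= 60:
        return "D"   # major failures
    return "F"       # critical failures

MIN_PASSING = 85  # deployment gate from the Pro Tip; pick your own bar

def deployable(score: float) -> bool:
    return score >= MIN_PASSING
```

Encoding the gate as a function makes it trivial to enforce in a CI check before a prompt change ships.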
Integrations

Connect Your Platform

Step-by-step setup for Retell AI, Vapi, and webhook integrations.

🔗

Retell AI Setup

Connect your Retell AI agents for voice testing and automated QA.

  • Get API key from Retell dashboard
  • Copy agent ID from agent settings
  • Paste both into VoxGrade Settings > Integrations
  • Configure webhook URL for call events
  • Test connection with a sample call
🎙

Vapi Setup

Integrate Vapi assistants for real-time voice agent testing.

  • Get API key from Vapi dashboard
  • Copy assistant ID from assistant config
  • Add credentials to VoxGrade Settings > Integrations
  • Set up webhook endpoint for call logs
  • Run test call to verify connection
🔔

Webhook Integration

Receive test results in your own systems via webhook callbacks.

  • Generate webhook URL in Settings > Webhooks
  • Choose events: test_complete, test_failed, alert_triggered
  • Configure endpoint in your backend
  • Verify signature for security
  • Parse JSON payload for test results
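The verify-then-parse steps above can be sketched in a few lines. This assumes an HMAC-SHA256 hex digest over the raw request body, a common webhook scheme; the actual header name, digest format, and payload fields are assumptions, so check the VoxGrade webhook docs for the real format:

```python
import hashlib
import hmac
import json

def handle_webhook(raw_body: bytes, signature_header: str, secret: str) -> dict:
    """Verify an assumed HMAC-SHA256 hex signature, then parse the JSON payload."""
    expected = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    # compare_digest avoids timing attacks on the signature check
    if not hmac.compare_digest(expected, signature_header):
        raise ValueError("invalid webhook signature")
    return json.loads(raw_body)

# Hypothetical payload fields:
# event = handle_webhook(body, sig, secret)
# if event.get("event") == "test_failed":
#     ...page the on-call channel...
```

Always verify against the raw bytes you received, before any JSON parsing, so re-serialization differences can't break the signature.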
FAQ

Frequently Asked Questions

What's the difference between text simulations and voice tests?

Text simulations test your agent's logic and prompt behavior without making phone calls. They're fast, cheap, and perfect for rapid iteration. Voice tests make actual phone calls to test real-world performance including latency, interruptions, and voice quality. Use text sims for iteration, voice tests for final validation.

How accurate are the similarity scores?

Similarity scores use semantic embeddings to compare agent output vs. expected behavior. They measure intent matching, not exact word-for-word accuracy. A score of 85+ typically means the agent hit all key points. 70-84 means some points were missed. Below 70 means significant deviation. Always read the transcript—scores are guidelines, not absolutes.
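For intuition, "semantic similarity" between two embedding vectors is typically cosine similarity. A toy sketch scaled to the 0-100 range; VoxGrade's actual embedding model and scaling are internal details, so treat this as an illustration only:

```python
import math

def similarity_score(vec_a, vec_b) -> float:
    """Cosine similarity of two embedding vectors, scaled to 0-100.
    Illustrative only: the real scorer's model and scaling are not public."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm = math.sqrt(sum(a * a for a in vec_a)) * math.sqrt(sum(b * b for b in vec_b))
    return 100 * max(0.0, dot / norm)  # clamp negative cosine to 0
```

Identical directions score 100 and orthogonal (unrelated) directions score 0, which is why paraphrases score high while off-script answers drop sharply.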

Can I test agents from other platforms besides Retell and Vapi?

Yes. VoxGrade supports any platform with an API. For voice tests, you need an API endpoint to trigger calls and webhooks to receive call data. For text simulations, just paste your prompt—no integration needed. Contact support for help setting up custom integrations.

How often should I run cron tests?

Depends on your deployment frequency and risk tolerance. High-stakes agents (sales, customer support): run hourly. Mid-stakes: every 6 hours. Low-stakes or stable agents: daily. Always run tests immediately after deploying prompt changes, then monitor for 24-48 hours.

What does each grade (A/B/C/D/F) mean?

A (90-100): Production-ready. Deploy with confidence. B (80-89): Minor improvements recommended but safe to deploy. C (70-79): Significant issues. Fix before deploying. D (60-69): Major failures. Do not deploy. F (<60): Critical failures. Agent is fundamentally broken. Grades are based on your custom rubric or default scoring criteria.

How does auto-apply prompt optimization work?

VoxGrade's AI analyzes test failures, identifies patterns (e.g., "agent always forgets to mention pricing"), and generates specific prompt changes to fix the issue. You review the suggested changes, apply with one click, and VoxGrade re-runs the test to confirm improvement. It's not magic—it's pattern recognition and targeted fixes based on your test data.

Can my team access the same dashboard?

Yes. Pro and Agency plans support team access with role-based permissions. Admins can invite team members, assign roles (viewer, tester, admin), and manage access. Everyone sees the same test history, reports, and cron jobs. Data stays synced across the team.

Is there an API?

Yes. Agency plans include full API access. Trigger tests programmatically, fetch results, manage agents, and integrate VoxGrade into your CI/CD pipeline. Full REST API with webhook support. API docs available in the dashboard under Settings > API.
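For illustration, a CI/CD trigger might be assembled like the sketch below. The base URL, endpoint path, and field names are hypothetical, not the real VoxGrade API schema; the API docs in the dashboard are authoritative:

```python
import json

API_BASE = "https://api.voxgrade.example/v1"  # hypothetical base URL

def build_trigger_request(api_key: str, agent_id: str, scenario: str):
    """Assemble a hypothetical 'run test' request for a CI/CD pipeline.
    Endpoint path and field names are assumptions, not the documented schema."""
    url = f"{API_BASE}/tests"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = json.dumps({"agent_id": agent_id, "scenario": scenario})
    return url, headers, body

# Send with urllib.request or your HTTP client of choice, then poll the
# returned test ID for results before allowing the deploy to proceed.
```

Keeping request assembly separate from sending makes the payload easy to unit-test without network access.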

🎬

Video Tutorials Coming Soon

We're building a full video course covering every VoxGrade feature. Get notified when it launches.
