Step-by-step guides to test, monitor, and optimize your AI voice agents. From your first simulation to production monitoring.
Three essential tutorials to run your first tests and start improving your voice agents today.
Set up an agent profile, write realistic conversation scenarios, and run your first LLM-vs-LLM simulation to see how well your agent handles objections.
Connect your Retell or Vapi agent, add a test phone number, make a real call, and get instant grading on how the conversation went.
Schedule automated tests to run 24/7, catch regressions before customers do, and get alerts when your agents start failing.
Deep dives into each VoxGrade capability. Click any guide to expand the step-by-step instructions.
Test agent responses without making calls. Fast, scalable, and perfect for iteration.
Text Simulation tab. Enter your agent's system prompt or select a saved profile.
Run Simulation. VoxGrade generates a realistic customer conversation using an LLM and tests your agent's responses.
Make real phone calls to your agent. Test latency, interruptions, and human-like responses.
Voice Test tab. Connect your platform (Retell AI or Vapi) by entering API key and agent ID.
Start Call. VoxGrade initiates the call, records the conversation, and transcribes it in real-time.
AI analyzes failures and suggests prompt fixes. Apply improvements with one click.
Analyze & Fix. VoxGrade compares actual behavior vs. expected, identifies failure modes, and drafts prompt changes.
Apply Fix. VoxGrade updates your agent prompt and automatically re-runs the test to confirm improvement.
Test two prompt versions side-by-side. Statistical significance built in.
A/B Testing tab. Enter your current prompt as "Control" and your new prompt as "Variant".
Start A/B Test. VoxGrade runs tests in parallel and tracks win rate, average score, and confidence interval for each variant.
Generate public links for client demos, stakeholder reviews, and audits.
Generate Report in the results panel.
Create Public Link. VoxGrade generates a unique URL. Anyone with the link can view the report, with no login required.
Reports tab. Revoke access, set expiration dates, or regenerate links if compromised.
Automated 24/7 testing. Catch regressions before customers complain.
Cron tab. Click New Cron Job. Select which agent profile and test scenarios to run automatically.
Activate. VoxGrade runs tests on schedule and logs results. View history, trends, and anomalies from the Cron dashboard.
Add test numbers, configure transfer behavior, and manage multi-agent setups.
Phone Numbers tab. Click Add Number. Enter the test number and assign it to an agent profile.
Test Call to verify VoxGrade can trigger calls and receive webhooks from your platform.
Understand how VoxGrade calculates scores, grades, and success criteria.
Step-by-step setup for Retell AI, Vapi, and webhook integrations.
Connect your Retell AI agents for voice testing and automated QA.
Settings > Integrations
Integrate Vapi assistants for real-time voice agent testing.
Settings > Integrations
Receive test results in your own systems via webhook callbacks.
Settings > Webhooks
Text simulations test your agent's logic and prompt behavior without making phone calls. They're fast, cheap, and perfect for rapid iteration. Voice tests make actual phone calls to test real-world performance including latency, interruptions, and voice quality. Use text sims for iteration, voice tests for final validation.
Similarity scores use semantic embeddings to compare agent output vs. expected behavior. They measure intent matching, not exact word-for-word accuracy. A score of 85+ typically means the agent hit all key points. 70-84 means some points were missed. Below 70 means significant deviation. Always read the transcript—scores are guidelines, not absolutes.
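To make the score bands concrete, here is a minimal sketch of how an embedding-based similarity score could be computed and interpreted. It assumes cosine similarity scaled to 0-100; VoxGrade's actual embedding model and formula are not documented here, and the function names are hypothetical. Only the thresholds (85 and 70) come from the text above.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as "no similarity"
    return dot / (norm_a * norm_b)

def similarity_score(expected_vec, actual_vec):
    """Assumed mapping: cosine similarity clamped at 0, scaled to 0-100."""
    return round(max(0.0, cosine_similarity(expected_vec, actual_vec)) * 100, 1)

def interpret(score):
    """Bands taken from the guidance above: 85+, 70-84, below 70."""
    if score >= 85:
        return "hit all key points"
    if score >= 70:
        return "some points missed"
    return "significant deviation"
```

In practice the vectors would come from whichever embedding model is in use; the bands are a reading aid, so the transcript should still be checked.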
Yes. VoxGrade supports any platform with an API. For voice tests, you need an API endpoint to trigger calls and webhooks to receive call data. For text simulations, just paste your prompt—no integration needed. Contact support for help setting up custom integrations.
Depends on your deployment frequency and risk tolerance. High-stakes agents (sales, customer support): run hourly. Mid-stakes: every 6 hours. Low-stakes or stable agents: daily. Always run tests immediately after deploying prompt changes, then monitor for 24-48 hours.
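The frequencies above can be written as standard five-field cron expressions. This is a sketch only: whether the Cron tab accepts raw cron expressions is an assumption, and the `CRON_BY_STAKES` mapping and `schedule_for` helper are hypothetical names used for illustration.

```python
# Standard five-field cron syntax: minute hour day-of-month month day-of-week
CRON_BY_STAKES = {
    "high": "0 * * * *",    # at minute 0 of every hour
    "mid": "0 */6 * * *",   # at minute 0 of every 6th hour
    "low": "0 0 * * *",     # once a day at midnight
}

def schedule_for(stakes):
    """Return the cron expression for a stakes level ("high", "mid", or "low")."""
    try:
        return CRON_BY_STAKES[stakes]
    except KeyError:
        raise ValueError(f"unknown stakes level: {stakes!r}") from None
```

A run triggered right after each prompt deployment is separate from this recurring schedule and would be started manually or from a deploy script.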
A (90-100): Production-ready. Deploy with confidence.
B (80-89): Minor improvements recommended but safe to deploy.
C (70-79): Significant issues. Fix before deploying.
D (60-69): Major failures. Do not deploy.
F (<60): Critical failures. Agent is fundamentally broken.
Grades are based on your custom rubric or default scoring criteria.
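The grade bands translate directly into a lookup. The sketch below assumes scores on a 0-100 scale; `letter_grade` is a hypothetical helper, not a VoxGrade function, and it ignores any custom rubric.

```python
def letter_grade(score):
    """Map a 0-100 score to the letter grades listed above."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    if score >= 60:
        return "D"
    return "F"
```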
VoxGrade's AI analyzes test failures, identifies patterns (e.g., "agent always forgets to mention pricing"), and generates specific prompt changes to fix the issue. You review the suggested changes, apply with one click, and VoxGrade re-runs the test to confirm improvement. It's not magic—it's pattern recognition and targeted fixes based on your test data.
Yes. Pro and Agency plans support team access with role-based permissions. Admins can invite team members, assign roles (viewer, tester, admin), and manage access. Everyone sees the same test history, reports, and cron jobs. Data stays synced across the team.
Yes. Agency plans include full API access. Trigger tests programmatically, fetch results, manage agents, and integrate VoxGrade into your CI/CD pipeline. Full REST API with webhook support. API docs available in the dashboard under Settings > API.
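As an illustration of the CI/CD use case, the sketch below gates a pipeline on test scores. It makes no API calls: the JSON shape (a list of objects with `scenario` and `score` fields) is a hypothetical stand-in for whatever the real API returns, so check the API docs for the actual response format. The threshold of 80 is an assumption based on grade B being the lowest grade described as safe to deploy.

```python
import json

MIN_DEPLOY_SCORE = 80  # assumption: floor of grade B, the lowest "safe to deploy" grade

def ci_exit_code(results_json, min_score=MIN_DEPLOY_SCORE):
    """Return 0 if every scenario meets min_score, otherwise 1.

    results_json is a JSON array of {"scenario": str, "score": number}
    objects (hypothetical shape, for illustration only).
    """
    results = json.loads(results_json)
    failing = [r["scenario"] for r in results if r["score"] < min_score]
    for name in failing:
        print(f"FAIL: {name} scored below {min_score}")
    return 1 if failing else 0
```

A pipeline step could pass the return value to `sys.exit` so that a failing scenario blocks the deploy.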
We're building a full video course covering every VoxGrade feature. Get notified when it launches.