A New Framework for Evaluation of Voice Agents (EVA)
As many as 20 tested Speech-to-Speech systems and Large Audio Language Models have demonstrated the same troubling pattern: the better artificial intelligence performs at precise task execution, the worse the user experience becomes during conversation. This critical trade-off between effectiveness and natural interaction has become the foundation of EVA (Evaluation of Voice Agents) – a new framework presented on March 24, 2026, by the ServiceNow-AI team. EVA is the first end-to-end tool that abandons the evaluation of isolated components in favor of analyzing complete, multi-turn voice conversations within a bot-to-bot architecture.

The system generates two key metrics: EVA-A (Accuracy), measuring the correctness of goal fulfillment, and EVA-X (Experience), evaluating the conciseness and fluidity of the dialogue. By releasing an open dataset covering 50 scenarios from the aviation industry, the creators enable rigorous testing of agents in situations such as flight rebooking or voucher processing.

For the global creative and business technology market, the arrival of EVA marks the end of the era of "tone-deaf" bots that, while substantively correct, irritate users with a lack of empathy or with delays. This standard pushes developers to optimize not only LLM models but the entire conversation dynamic, which in practice should translate into more intuitive and less frustrating Voice AI systems in everyday use. Open access to the code and judge prompts on GitHub allows every creator to objectively verify whether their voice assistant can truly listen, rather than just process data.
In the world of generative artificial intelligence, where text-based chatbots have become commonplace, industry attention is rapidly shifting toward voice agents. However, evaluating systems that must not only "think" but also "hear" and "speak" in real time has proven extremely difficult. Previous testing methods were fragmented: speech-synthesis quality was studied separately from language-model logic, and conversation dynamics separately again. That inconsistency ends with the release of EVA (Evaluation of Voice Agents) – a new, comprehensive framework from the ServiceNow-AI team.
A paper published on March 24, 2026, by the research team (including Tara Bogavelli, Gabrielle Gauthier Melancon, and Hari Subramani) challenges the status quo. EVA is the first end-to-end tool that evaluates complete, multi-turn voice conversations using a realistic bot-to-bot architecture. A key finding from the initial tests of 20 different systems is the existence of a "tragic tradeoff": agents that excel at task accuracy typically fail when it comes to conversational naturalness.
Two pillars of a modern agent: EVA-A and EVA-X
The creators of the framework rightly note that a voice agent faces a unique challenge. It must be simultaneously precise (Accuracy) and provide an appropriate user experience (Experience). In traditional text systems, these two worlds rarely collide, but in voice communication, they are inseparable. EVA introduces two main indicators that allow for an objective assessment of these parameters:
- EVA-A (Accuracy): Measures not only whether the task was completed but also the fidelity of the message. It includes deterministic verification of the database state after the call, an assessment of Faithfulness (whether the agent hallucinates company policy rules), and a unique Agent Speech Fidelity parameter, which checks if the system correctly pronounced key data such as reservation codes or amounts.
- EVA-X (Experience): Focuses on the quality of the interaction. It evaluates Conciseness – crucial when a user cannot "scan" a long statement with their eyes – and Turn-Taking flow, which involves avoiding interrupting the user or excessive silence.
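To make the Agent Speech Fidelity idea concrete, here is a minimal sketch of how such a check could work. The helper names (`normalize_code`, `speech_fidelity`) and the normalization rule are assumptions for illustration, not the framework's actual implementation:

```python
import re

def normalize_code(spoken: str) -> str:
    """Collapse an ASR transcript of a code (e.g. "x k 4 2-b") to canonical form."""
    return re.sub(r"[^A-Z0-9]", "", spoken.upper())

def speech_fidelity(spoken_values: dict[str, str], db_values: dict[str, str]) -> float:
    """Fraction of key data items (codes, amounts) the agent voiced correctly."""
    if not db_values:
        return 1.0
    correct = sum(
        normalize_code(spoken_values.get(k, "")) == normalize_code(v)
        for k, v in db_values.items()
    )
    return correct / len(db_values)

# The agent spelled out the booking code and stated the amount verbatim.
score = speech_fidelity(
    {"booking_code": "x k 4 2 b", "amount": "120"},
    {"booking_code": "XK42B", "amount": "120"},
)
print(score)  # 1.0 – both items match after normalization
```

A real fidelity judge would work from audio rather than a clean transcript, but the core comparison – spoken value versus ground-truth database value – is the same.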
Importantly, EVA does not rely on the subjective feelings of human testers, which usually slows down the development process. The framework utilizes an LLM-as-Judge and LALM-as-Judge (Large Audio Language Models) system, allowing for scalable and repeatable assessment of audio quality and content directly from recordings and conversation logs.
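The LLM-as-Judge pattern can be sketched in a few lines. The prompt wording, the sub-criteria, and the aggregation into EVA-A/EVA-X below are illustrative assumptions – the framework's actual judge prompts are published in its repository:

```python
import json

JUDGE_PROMPT = """You are evaluating a voice-agent conversation.
Rate each criterion from 1 (poor) to 5 (excellent) and reply as JSON:
{{"faithfulness": int, "conciseness": int, "turn_taking": int}}

Transcript:
{transcript}
"""

def judge_conversation(transcript: str, call_model) -> dict:
    """Score one conversation; `call_model` is any text-in/text-out LLM client."""
    raw = call_model(JUDGE_PROMPT.format(transcript=transcript))
    scores = json.loads(raw)
    # Roll the sub-scores up into the two headline metrics (illustrative grouping).
    return {
        "EVA-A": scores["faithfulness"],
        "EVA-X": (scores["conciseness"] + scores["turn_taking"]) / 2,
    }

# Stub model for illustration; a real run would call an LLM or LALM API here.
fake_model = lambda prompt: '{"faithfulness": 5, "conciseness": 3, "turn_taking": 4}'
print(judge_conversation("User: ... Agent: ...", fake_model))
# {'EVA-A': 5, 'EVA-X': 3.5}
```

Because the judge is a model call rather than a human panel, the same rubric can be applied to thousands of recorded conversations repeatably.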
Bot-to-Bot Architecture and Reality Simulation
To make the tests representative, the ServiceNow-AI team created a closed ecosystem in which the voice agent under evaluation (built on the open-source Pipecat framework) talks to an advanced user simulator. This simulator is not just a passive recipient; it has a specific personality, goals, and instructions on how to react to agent errors. The entire process is based on five components:
- User Simulator: AI with an assigned goal and persona, operating on high-quality TTS models.
- Voice Agent: The system being tested, which can be a cascaded architecture (STT → LLM → TTS) or a native audio model (S2S).
- Tool Executor: An engine that executes Python functions, modifying the scenario's database.
- Validators: Automated metrics checking if the conversation is suitable for evaluation at all (whether the simulator behaved according to plan).
- Metrics Suite: A team of AI judges analyzing recordings, transcriptions, and tool logs.
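The interaction between the first three components above can be sketched as a simple alternating loop. Everything here – the `Episode` record, the callable interfaces, the turn limit – is a simplified assumption; the real framework runs over live audio via Pipecat rather than text turns:

```python
from dataclasses import dataclass, field

@dataclass
class Episode:
    transcript: list[tuple[str, str]] = field(default_factory=list)  # (speaker, text)
    tool_log: list[str] = field(default_factory=list)

def run_episode(simulator, agent, tools, max_turns: int = 10) -> Episode:
    """One bot-to-bot conversation: the simulator and agent alternate turns,
    and any tool call the agent requests mutates the scenario database."""
    ep = Episode()
    user_msg = simulator(None)  # the simulated user opens the call
    for _ in range(max_turns):
        ep.transcript.append(("user", user_msg))
        reply, tool_call = agent(user_msg)
        if tool_call:
            ep.tool_log.append(tools(tool_call))  # Tool Executor step
        ep.transcript.append(("agent", reply))
        user_msg = simulator(reply)
        if user_msg is None:  # simulator reached its goal and hung up
            break
    return ep
```

The validators and the metrics suite would then run over the finished `Episode`: first checking that the simulator stayed in character, then scoring the transcript, audio, and tool log.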
Alongside the framework, the Airline Dataset debuted, containing 50 scenarios from the aviation industry. These include flight rebookings during irregular operations (IRROPS), voucher handling, and travel itinerary changes – demanding tests of temporal logic and strict adherence to corporate procedures.
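A scenario of this kind needs at least a persona, a goal, an initial database state, and a deterministic success check. The shape below is purely hypothetical – the released dataset may use a different schema – but it illustrates the moving parts:

```python
# Hypothetical shape of one airline scenario; field names, IDs, and values
# are invented for illustration and do not reflect the released files.
scenario = {
    "id": "irrops-rebooking-07",
    "persona": "Frequent flyer, stressed, whose connection was just cancelled",
    "goal": "Rebook onto the next available flight while keeping paid seat 12A",
    "initial_db": {
        "booking": {"code": "XK42B", "flight": "SN104", "seat": "12A"},
        "flights": {"SN104": "cancelled", "SN118": "on time"},
    },
    # Deterministic check run against the database state after the call.
    "success_check": lambda db: (
        db["booking"]["flight"] == "SN118" and db["booking"]["seat"] == "12A"
    ),
}
```

The deterministic `success_check` is what lets EVA-A verify goal completion without any model in the loop: either the database ends up in the required state or it does not.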
The Precision Paradox: Why can the smartest agents be annoying?
The analysis of 20 systems conducted using EVA revealed a fascinating trend that the authors call the Accuracy-Experience tradeoff. Models achieving the highest scores in EVA-A (flawlessly performing technical tasks) often receive low marks in EVA-X. They tend to be too wordy, pause unnaturally while long reasoning chains complete, or bombard the user with too much information at once.
"[A] misheard confirmation code renders perfect LLM reasoning meaningless" – this line from the paper best captures the nature of the problem. Even the most powerful GPT-5 or Claude model will not help if the ASR (automatic speech recognition) stage mistakes one letter in a passenger's name, leading to an authorization error and a complete breakdown of the conversation.
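The failure mode is easy to demonstrate: any exact-match authorization step turns a single misheard character into a hard stop. The `authorize` helper and the sample data below are invented for illustration:

```python
def authorize(db: dict[str, str], code: str, name: str) -> bool:
    """Strict lookup: authorization requires an exact match on both fields."""
    return db.get(code) == name

bookings = {"XK42B": "KOWALSKI"}

print(authorize(bookings, "XK42B", "KOWALSKI"))  # True
# The ASR hears "Kovalski" – one character off, and the whole flow fails:
print(authorize(bookings, "XK42B", "KOVALSKI"))  # False
```

No amount of downstream reasoning recovers from this: the LLM never even sees the correct record.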
Research also showed that the biggest barrier for modern voice agents is multi-step workflows. The most difficult task proved to be rebooking a flight while maintaining additional services, such as purchased luggage or selected seats. This is where agents most often "got lost" in the logic, suggesting that current models have trouble maintaining context while simultaneously orchestrating external tools.
Conclusions for the industry and the future of evaluation
The introduction of EVA is a turning point for AI engineers. Thanks to open access to code, datasets, and judge prompts on the GitHub platform (https://github.com/ServiceNow/eva), developers worldwide gain a tool for rigorously testing their solutions before releasing them to the market.
Key takeaways from the ServiceNow-AI publication are clear:
- Consistency is fundamental: The gap between pass@3 (at least one of three attempts succeeds) and pass^3 (all three attempts succeed) is huge. Agents that can perform a task rarely do so repeatably, which is unacceptable in production systems.
- Audio-native is the future: Models that process sound directly (S2S) show potential for reducing latency but still struggle with precision compared to cascaded systems.
- Need for new calibration: System creators must stop optimizing only for "effectiveness" and start treating conciseness and response time as critical success parameters.
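The pass@3 versus pass^3 distinction from the first takeaway is worth spelling out, since the two aggregate the same per-attempt results very differently:

```python
def pass_at_k(trials: list[bool]) -> bool:
    """pass@k: at least one of the k attempts succeeded (best case)."""
    return any(trials)

def pass_hat_k(trials: list[bool]) -> bool:
    """pass^k: every one of the k attempts succeeded (repeatability)."""
    return all(trials)

# Illustrative results: 10 scenarios, 3 attempts each. Seven scenarios
# succeed intermittently; only three succeed every single time.
results = [[True, False, True]] * 7 + [[True, True, True]] * 3

at_3 = sum(pass_at_k(r) for r in results) / len(results)
hat_3 = sum(pass_hat_k(r) for r in results) / len(results)
print(at_3, hat_3)  # 1.0 0.3
```

The same agent looks flawless under pass@3 and unshippable under pass^3, which is exactly the gap the authors flag: a production voice agent does not get three tries per caller.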
The EVA framework proves that the era of simple effectiveness tests has come to an end. In the world of voice agents, where the human voice is the interface, a technical error is just as costly as the reputational damage caused by poor conversation dynamics. The industry has just received a mirror in which it can see itself – and for many, the results may be a painful but necessary impulse for change.