A New Framework for Evaluation of Voice Agents (EVA)
As many as 20 tested Speech-to-Speech systems and Large Audio Language Models have demonstrated the same troubling pattern: the better artificial intelligence performs at precise task execution, the worse the user experience becomes during conversation. This critical trade-off between effectiveness and natural interaction has become the foundation of EVA (Evaluation of Voice Agents) – a new framework presented on March 24, 2026, by the ServiceNow-AI team. EVA is the first end-to-end tool that abandons the evaluation of isolated components in favor of analyzing complete, multi-turn voice conversations within a bot-to-bot architecture.

The system generates two key metrics: EVA-A (Accuracy), measuring the correctness of goal fulfillment, and EVA-X (Experience), evaluating the conciseness and fluidity of the dialogue. By releasing an open dataset covering 50 scenarios from the aviation industry, the creators enable rigorous testing of agents in situations such as flight rebooking or voucher processing.

For the global creative and business technology market, the arrival of EVA marks the end of the era of "tone-deaf" bots that, while substantively correct, irritate users with a lack of empathy or with delays. This standard pushes developers to optimize not only LLM models but the entire conversation dynamic, which in practice should translate into more intuitive and less frustrating Voice AI systems in everyday use. Open access to the code and judge prompts on GitHub allows every creator to objectively verify whether their voice assistant can truly listen, rather than just process data.
In the world of generative artificial intelligence, where text-based chatbots have become commonplace, industry attention is rapidly shifting toward voice agents. However, evaluating systems that must not only "think" but also "hear" and "speak" in real time has proven extremely difficult. Previous testing methods were fragmented: speech-synthesis quality was studied separately from language-model logic, and conversation dynamics separately again. That inconsistency ends with the release of EVA (Evaluation of Voice Agents) – a new, comprehensive framework from the ServiceNow-AI team.
A paper published on March 24, 2026, by the research team (including Tara Bogavelli, Gabrielle Gauthier Melancon, and Hari Subramani) challenges the status quo. EVA is the first end-to-end tool that evaluates complete, multi-turn voice conversations using a realistic bot-to-bot architecture. A key finding from the initial tests of 20 different systems is the existence of a "tragic tradeoff": agents that excel at task accuracy typically fail when it comes to conversational naturalness.
Two pillars of a modern agent: EVA-A and EVA-X
The creators of the framework rightly note that a voice agent faces a unique challenge. It must be simultaneously precise (Accuracy) and provide an appropriate user experience (Experience). In traditional text systems, these two worlds rarely collide, but in voice communication, they are inseparable. EVA introduces two main indicators that allow for an objective assessment of these parameters:
- EVA-A (Accuracy): Measures not only whether the task was completed but also the fidelity of the message. It includes deterministic verification of the database state after the call, an assessment of Faithfulness (whether the agent hallucinates company policy rules), and a unique Agent Speech Fidelity parameter, which checks if the system correctly pronounced key data such as reservation codes or amounts.
- EVA-X (Experience): Focuses on the quality of the interaction. It evaluates Conciseness – crucial when a user cannot "scan" a long statement with their eyes – and Turn-Taking flow, which involves avoiding interrupting the user or excessive silence.
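To make the Agent Speech Fidelity idea concrete, here is a minimal sketch of how such a check could work. The helper names (`normalize_code`, `speech_fidelity`) and the normalization rule are assumptions for illustration, not the framework's actual implementation:

```python
import re

def normalize_code(spoken: str) -> str:
    """Collapse an ASR transcript of a code (e.g. "x k 4 2-b") to canonical form."""
    return re.sub(r"[^A-Z0-9]", "", spoken.upper())

def speech_fidelity(spoken_values: dict[str, str], db_values: dict[str, str]) -> float:
    """Fraction of key data items (codes, amounts) the agent voiced correctly."""
    if not db_values:
        return 1.0
    correct = sum(
        normalize_code(spoken_values.get(k, "")) == normalize_code(v)
        for k, v in db_values.items()
    )
    return correct / len(db_values)

# The agent spelled out the booking code and stated the amount verbatim.
score = speech_fidelity(
    {"booking_code": "x k 4 2 b", "amount": "120"},
    {"booking_code": "XK42B", "amount": "120"},
)
print(score)  # 1.0 – both items match after normalization
```

A real fidelity judge would work from audio rather than a clean transcript, but the core comparison – spoken value versus ground-truth database value – is the same.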
Importantly, EVA does not rely on the subjective feelings of human testers, which usually slows down the development process. The framework utilizes an LLM-as-Judge and LALM-as-Judge (Large Audio Language Models) system, allowing for scalable and repeatable assessment of audio quality and content directly from recordings and conversation logs.
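The LLM-as-Judge pattern can be sketched in a few lines. The prompt wording, the sub-criteria, and the aggregation into EVA-A/EVA-X below are illustrative assumptions – the framework's actual judge prompts are published in its repository:

```python
import json

JUDGE_PROMPT = """You are evaluating a voice-agent conversation.
Rate each criterion from 1 (poor) to 5 (excellent) and reply as JSON:
{{"faithfulness": int, "conciseness": int, "turn_taking": int}}

Transcript:
{transcript}
"""

def judge_conversation(transcript: str, call_model) -> dict:
    """Score one conversation; `call_model` is any text-in/text-out LLM client."""
    raw = call_model(JUDGE_PROMPT.format(transcript=transcript))
    scores = json.loads(raw)
    # Roll the sub-scores up into the two headline metrics (illustrative grouping).
    return {
        "EVA-A": scores["faithfulness"],
        "EVA-X": (scores["conciseness"] + scores["turn_taking"]) / 2,
    }

# Stub model for illustration; a real run would call an LLM or LALM API here.
fake_model = lambda prompt: '{"faithfulness": 5, "conciseness": 3, "turn_taking": 4}'
print(judge_conversation("User: ... Agent: ...", fake_model))
# {'EVA-A': 5, 'EVA-X': 3.5}
```

Because the judge is a model call rather than a human panel, the same rubric can be applied to thousands of recorded conversations repeatably.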
Bot-to-Bot Architecture and Reality Simulation
To make the tests representative, the ServiceNow-AI team created a closed ecosystem in which the voice agent under evaluation (built on the open-source Pipecat framework) talks to an advanced user simulator. This simulator is not just a passive recipient; it has a specific personality, goals, and instructions on how to react to agent errors. The entire process is based on five components:
- User Simulator: AI with an assigned goal and persona, operating on high-quality TTS models.
- Voice Agent: The system being tested, which can be a cascaded architecture (STT → LLM → TTS) or a native audio model (S2S).
- Tool Executor: An engine that executes Python functions, modifying the scenario's database.
- Validators: Automated metrics checking if the conversation is suitable for evaluation at all (whether the simulator behaved according to plan).
- Metrics Suite: A team of AI judges analyzing recordings, transcriptions, and tool logs.
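The interaction between the first three components above can be sketched as a simple alternating loop. Everything here – the `Episode` record, the callable interfaces, the turn limit – is a simplified assumption; the real framework runs over live audio via Pipecat rather than text turns:

```python
from dataclasses import dataclass, field

@dataclass
class Episode:
    transcript: list[tuple[str, str]] = field(default_factory=list)  # (speaker, text)
    tool_log: list[str] = field(default_factory=list)

def run_episode(simulator, agent, tools, max_turns: int = 10) -> Episode:
    """One bot-to-bot conversation: the simulator and agent alternate turns,
    and any tool call the agent requests mutates the scenario database."""
    ep = Episode()
    user_msg = simulator(None)  # the simulated user opens the call
    for _ in range(max_turns):
        ep.transcript.append(("user", user_msg))
        reply, tool_call = agent(user_msg)
        if tool_call:
            ep.tool_log.append(tools(tool_call))  # Tool Executor step
        ep.transcript.append(("agent", reply))
        user_msg = simulator(reply)
        if user_msg is None:  # simulator reached its goal and hung up
            break
    return ep
```

The validators and the metrics suite would then run over the finished `Episode`: first checking that the simulator stayed in character, then scoring the transcript, audio, and tool log.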
Alongside the framework, the Airline Dataset debuted, containing 50 scenarios from the aviation industry. These include flight rebookings during irregular operations (IRROPS), voucher handling, and travel itinerary changes – demanding tests of temporal logic and strict adherence to corporate procedures.
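A scenario of this kind needs at least a persona, a goal, an initial database state, and a deterministic success check. The shape below is purely hypothetical – the released dataset may use a different schema – but it illustrates the moving parts:

```python
# Hypothetical shape of one airline scenario; field names, IDs, and values
# are invented for illustration and do not reflect the released files.
scenario = {
    "id": "irrops-rebooking-07",
    "persona": "Frequent flyer, stressed, whose connection was just cancelled",
    "goal": "Rebook onto the next available flight while keeping paid seat 12A",
    "initial_db": {
        "booking": {"code": "XK42B", "flight": "SN104", "seat": "12A"},
        "flights": {"SN104": "cancelled", "SN118": "on time"},
    },
    # Deterministic check run against the database state after the call.
    "success_check": lambda db: (
        db["booking"]["flight"] == "SN118" and db["booking"]["seat"] == "12A"
    ),
}
```

The deterministic `success_check` is what lets EVA-A verify goal completion without any model in the loop: either the database ends up in the required state or it does not.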
The Precision Paradox: Why can the smartest agents be annoying?
The analysis of 20 systems conducted using EVA revealed a fascinating trend that the authors call the Accuracy-Experience tradeoff. Models achieving the highest scores in EVA-A (flawlessly performing technical tasks) often receive low marks in EVA-X. They tend to be too wordy, pause unnaturally while long reasoning chains complete, or bombard the user with too much information at once.
"[A] misheard confirmation code renders perfect LLM reasoning meaningless" – this line from the paper best captures the nature of the problem. Even the most powerful GPT-5 or Claude model will not help if the ASR (automatic speech recognition) stage mistakes one letter in a passenger's name, leading to an authorization error and a complete breakdown of the conversation.
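The failure mode is easy to demonstrate: any exact-match authorization step turns a single misheard character into a hard stop. The `authorize` helper and the sample data below are invented for illustration:

```python
def authorize(db: dict[str, str], code: str, name: str) -> bool:
    """Strict lookup: authorization requires an exact match on both fields."""
    return db.get(code) == name

bookings = {"XK42B": "KOWALSKI"}

print(authorize(bookings, "XK42B", "KOWALSKI"))  # True
# The ASR hears "Kovalski" – one character off, and the whole flow fails:
print(authorize(bookings, "XK42B", "KOVALSKI"))  # False
```

No amount of downstream reasoning recovers from this: the LLM never even sees the correct record.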
Research also showed that the biggest barrier for modern voice agents is multi-step workflows. The most difficult task proved to be rebooking a flight while maintaining additional services, such as purchased luggage or selected seats. This is where agents most often "got lost" in the logic, suggesting that current models have trouble maintaining context while simultaneously orchestrating external tools.
Conclusions for the industry and the future of evaluation
The introduction of EVA is a turning point for AI engineers. Thanks to open access to code, datasets, and judge prompts on the GitHub platform (https://github.com/ServiceNow/eva), developers worldwide gain a tool for rigorously testing their solutions before releasing them to the market.
Key takeaways from the ServiceNow-AI publication are clear:
- Consistency is fundamental: The gap between pass@3 (at least one of three attempts succeeds) and pass^3 (all three attempts succeed) is huge. Agents that can perform a task rarely do so repeatably, which is unacceptable in production systems.
- Audio-native is the future: Models that process sound directly (S2S) show potential for reducing latency but still struggle with precision compared to cascaded systems.
- Need for new calibration: System creators must stop optimizing only for "effectiveness" and start treating conciseness and response time as critical success parameters.
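The pass@3 versus pass^3 distinction from the first takeaway is worth spelling out, since the two aggregate the same per-attempt results very differently:

```python
def pass_at_k(trials: list[bool]) -> bool:
    """pass@k: at least one of the k attempts succeeded (best case)."""
    return any(trials)

def pass_hat_k(trials: list[bool]) -> bool:
    """pass^k: every one of the k attempts succeeded (repeatability)."""
    return all(trials)

# Illustrative results: 10 scenarios, 3 attempts each. Seven scenarios
# succeed intermittently; only three succeed every single time.
results = [[True, False, True]] * 7 + [[True, True, True]] * 3

at_3 = sum(pass_at_k(r) for r in results) / len(results)
hat_3 = sum(pass_hat_k(r) for r in results) / len(results)
print(at_3, hat_3)  # 1.0 0.3
```

The same agent looks flawless under pass@3 and unshippable under pass^3, which is exactly the gap the authors flag: a production voice agent does not get three tries per caller.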
The EVA framework proves that the era of simple effectiveness tests has come to an end. In the world of voice agents, where the human voice is the interface, a technical error is just as costly as the reputational damage caused by poor conversation dynamics. The industry has just received a mirror in which it can see itself – and for many, the results may be a painful but necessary impulse for change.