**Introducing SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding**

NVIDIA has presented SPEED-Bench, the first unified benchmark for evaluating inference acceleration techniques for language models. Speculative decoding uses a lightweight auxiliary model to predict multiple future tokens, which are then verified by the main model, significantly increasing throughput without changing the final output. The problem is that previous evaluations were fragmented and unrealistic: they relied on small prompt sets, short input sequences, and a batch size of one. SPEED-Bench addresses this with two complementary datasets and a unified measurement methodology. The first dataset ("Qualitative") evaluates the quality of the auxiliary model's speculation across 11 semantic categories (coding, mathematics, translation, RAG, and more) drawn from 18 public sources. The second ("Throughput") measures actual speedups at different batch sizes and input sequence lengths, from 1k to 32k tokens. The framework integrates with production inference engines, reflecting real serving conditions, so researchers and practitioners can more accurately assess which speculative decoding algorithms actually accelerate inference in real scenarios rather than only in artificial test setups.
Speculative decoding is a relatively recent but now essential technique in the world of fast language models. Instead of waiting for a model to generate tokens one by one, it predicts several future tokens at once using a faster auxiliary model, then has the main model verify them. Sounds simple? Maybe, but in practice it's chaos: every research team tests it differently, on different data, under different conditions. Results are incomparable, and conclusions from one paper often don't transfer to other environments. That's exactly why the NVIDIA team decided to put an end to this anarchy by introducing SPEED-Bench, the first truly comprehensive benchmark for evaluating speculative decoding under production-like conditions. This isn't another academic experiment on a small dataset. It's a tool that changes the way the industry thinks about accelerating LLM inference.
Speculative decoding has taken root in the heart of modern systems serving large language models, but its actual performance depends on hundreds of factors — from input semantics, through batch size, to hardware configuration. Previous benchmarks were either too narrow or too artificial. SPEED-Bench changes this fundamentally, offering two independent datasets and a unified measurement framework that integrates with production inference engines such as TensorRT-LLM, vLLM, and SGLang. This isn't a theoretical game — it's a tool created for practitioners who need to know what's actually happening in their systems.
Chaos in evaluating speculative decoding — how it started
Speculative decoding has a brilliantly simple idea: instead of waiting for a model to generate one token at a time (which is a bottleneck in inference), we use a lightweight auxiliary model to speculate several future tokens at once. Then the main model verifies them in parallel. If the speculation succeeds, we gain speed. If not, we roll back and continue correctly. Mathematically it's elegant — we exactly preserve the output distribution of the main model while potentially gaining significant speedup.
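The accept/reject step that preserves the target distribution can be sketched in a few lines. This is a toy illustration: hand-written next-token distributions over a three-token vocabulary stand in for the draft and target models, and all names are illustrative.

```python
import random

def speculative_step(draft_p, target_p, rng):
    """Draft one token, then accept/reject so the output token is
    distributed exactly according to target_p (lossless)."""
    vocab = list(draft_p)
    # 1) Draft model proposes a token from its distribution q.
    x = rng.choices(vocab, weights=[draft_p[t] for t in vocab])[0]
    # 2) Target model verifies: accept with probability min(1, p(x)/q(x)).
    if rng.random() < min(1.0, target_p[x] / draft_p[x]):
        return x, True
    # 3) On rejection, resample from the residual max(0, p - q), renormalized.
    residual = {t: max(0.0, target_p[t] - draft_p[t]) for t in vocab}
    z = sum(residual.values())
    x = rng.choices(vocab, weights=[residual[t] / z for t in vocab])[0]
    return x, False

draft_p  = {"a": 0.6, "b": 0.3, "c": 0.1}   # auxiliary model's guess
target_p = {"a": 0.5, "b": 0.2, "c": 0.3}   # main model's true distribution
rng = random.Random(0)
counts = {"a": 0, "b": 0, "c": 0}
for _ in range(100_000):
    tok, _ = speculative_step(draft_p, target_p, rng)
    counts[tok] += 1
# Empirical frequencies match target_p, not draft_p: speculation
# changes speed, never the output distribution.
```

This is the mathematical elegance the paragraph above refers to: no matter how bad the draft model is, the accepted-plus-resampled tokens follow the target model's distribution exactly.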
The problem arises when you need to assess how well it works. Every publication tested it on different data, with different settings. One team used 100 prompts, another 1,000. One tested short input sequences (under 100 tokens), another worked with a 32k-token context. Batch size one versus batch size 512: completely different results. Some benchmarks even used random tokens as input, which completely distorted actual behavior. The result? Papers reported speedups that didn't materialize in practice, and teams couldn't compare their algorithms in any meaningful way.
Additionally, most existing benchmarks didn't reflect real serving conditions. In production, models work with high concurrency, long input sequences, different semantic domains. Meanwhile, academic benchmarks often tested at batch size one with short prompts. It's like testing a sports car in laboratory conditions instead of on a highway — results can be completely misleading.
SPEED-Bench architecture — two datasets for two problems
SPEED-Bench solves this problem elegantly, separating two completely different aspects of speculative decoding. First, the quality of speculation — how well the auxiliary model predicts future tokens. Second, actual system acceleration — how many tokens per second can we generate under production conditions.
The first dataset is the Qualitative split. Its purpose is to measure speculation accuracy (so-called acceptance rates and acceptance lengths) across a broad spectrum of semantic domains. The NVIDIA team collected data from 18 publicly available sources and divided them into 11 categories: Coding, Math, Humanities, STEM, Writing, Summarization, Roleplay, RAG, Multilingual, Reasoning, and QA. Each category contains 80 samples, totaling 880 prompts.
But wait — that sounds like every other benchmark, right? Not quite. The key innovation lies in how they select samples. Instead of randomly taking samples from each category, SPEED-Bench uses a selection algorithm based on text embeddings. Each candidate is converted to a dense vector using a pretrained embedder (OpenAI text-embedding-3-small). The algorithm then minimizes the average cosine similarity between samples in each category. In other words, it selects samples that are as semantically different from each other as possible.
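The post doesn't publish the exact selection procedure, but a greedy max-min sketch captures the idea: repeatedly pick the candidate least similar to everything already chosen, which drives down the average pairwise cosine similarity within a category. Function names and the seeding choice here are assumptions, not SPEED-Bench's actual code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def select_diverse(embeddings, k):
    """Greedy selection: repeatedly add the candidate whose maximum
    cosine similarity to the already-selected set is smallest."""
    selected = [0]  # seed with the first candidate (arbitrary choice)
    while len(selected) < k:
        best, best_score = None, float("inf")
        for i in range(len(embeddings)):
            if i in selected:
                continue
            score = max(cosine(embeddings[i], embeddings[j]) for j in selected)
            if score < best_score:
                best, best_score = i, score
        selected.append(best)
    return selected

# Two near-duplicate pairs: the selector skips the near-duplicates.
emb = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]]
picked = select_diverse(emb, 2)  # picks the two orthogonal vectors
```

In a real pipeline the embeddings would come from OpenAI's text-embedding-3-small, as the benchmark describes; the greedy loop itself is independent of the embedder.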
Why does this matter? Because it reveals domain-dependent behavior in speculative decoding. Low-entropy domains (Coding, Math) behave completely differently from high-entropy domains (Roleplay, Writing). If a benchmark contains only similar samples from one category, you'll never discover this. Comparison with the previous benchmark (SpecBench) shows that SPEED-Bench achieves lower average semantic similarity between samples — meaning better coverage of real diversity within each domain.
Throughput split — simulating real serving conditions
The second dataset is the Throughput split, and here things get serious. Qualitative split measures speculation accuracy, but tells us nothing about real system performance. In production, what matters most is: how many tokens per second can we generate (Output TPS) and what's the latency for a single user (User TPS).
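As a rough sketch of how those two views differ (the metric names follow the post; the per-request record format is an assumption):

```python
def throughput_metrics(records):
    """records: list of (output_tokens, start_s, end_s), one per request.

    Output TPS: total tokens generated per second of wall-clock time,
    the system-wide view.  User TPS: tokens per second as experienced
    by a single request, averaged over requests."""
    total_tokens = sum(n for n, _, _ in records)
    wall = max(end for _, _, end in records) - min(start for _, start, _ in records)
    output_tps = total_tokens / wall
    user_tps = sum(n / (end - start) for n, start, end in records) / len(records)
    return output_tps, user_tps

# Two concurrent requests, 100 tokens each, both finishing in 10 s:
# the system produced 20 tok/s overall, but each user saw 10 tok/s.
out_tps, usr_tps = throughput_metrics([(100, 0.0, 10.0), (100, 0.0, 10.0)])
```

Batching pushes Output TPS up while User TPS stays flat or drops, which is exactly the trade-off the Pareto curves in the Throughput split are built to expose.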
Throughput split is constructed with realistic serving conditions in mind. It contains buckets with fixed input sequence lengths (ISL) from 1k to 32k tokens — this reflects the growing importance of long-context applications such as coding assistants or retrieval-augmented generation. For each ISL bucket, prompts are aggregated into three difficulty categories corresponding to low-, mixed-, and high-entropy domains. Each bucket contains 1536 prompts (512 per difficulty category), providing sufficient volume to construct stable Pareto curves across a wide range of batch sizes — from 2 to 512.
This is critical because in production, batch size drastically affects whether inference is compute-bound or memory-bound. At small batch sizes, the GPU isn't fully utilized — it's compute-bound. At large batch sizes, we saturate memory bandwidth — it's memory-bound. Speculative decoding has completely different impact on performance in these two regimes. SPEED-Bench accounts for this by testing real batch size variants.
Additionally, SPEED-Bench avoids using random tokens for throughput benchmarking. This might seem like a detail, but the team showed that random tokens drastically distort acceptance behavior, routing in MoE models, and throughput measurements, leading to overly optimistic conclusions. Prompts are instead truncated or padded in a controlled manner, preserving their semantic content.
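A minimal sketch of that controlled fitting, assuming we already have real tokenized prompts (the function and parameter names are illustrative, not the benchmark's API):

```python
def fit_to_isl(token_ids, target_len, pad_id):
    """Fit a real prompt into a fixed input-sequence-length (ISL) bucket.

    Truncation keeps a genuine semantic prefix; padding extends with a
    pad token.  Either way the input remains meaningful text, unlike
    random token ids, which distort acceptance rates and MoE routing."""
    if len(token_ids) >= target_len:
        return token_ids[:target_len]
    return token_ids + [pad_id] * (target_len - len(token_ids))
```

The point is not the three lines of logic but the discipline: every prompt in a 1k or 32k bucket is derived from real text, so acceptance behavior measured in that bucket transfers to production inputs of the same length.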
Unified measurement framework — end to incomparability
Here comes another subtle but critical issue: benchmarking speculative decoding across different inference engines. Different engines may apply different chat templates, handle BOS tokens differently, or tokenize inputs inconsistently. These differences can subtly change the speculated sequence, making comparisons between engines unreliable.
SPEED-Bench introduces a lightweight measurement framework that handles tokenization and prompt formatting externally. Inference engines receive pre-tokenized sequences, ensuring that all systems process identical inputs. This isolates the effects of speculative decoding algorithms and system optimizations from preprocessing artifacts.
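A hedged sketch of the idea, with a stand-in tokenizer and a made-up chat template (real runs use the model's own tokenizer and template): formatting and tokenization happen once, externally, and every engine receives the identical token ids.

```python
def preprocess(user_msg, tokenizer, bos_id=None):
    """Format and tokenize ONCE, outside the engines.  Engine-side
    differences in chat templates or BOS handling are thereby taken
    out of the comparison entirely."""
    text = f"<|user|>\n{user_msg}\n<|assistant|>\n"  # hypothetical template
    ids = tokenizer(text)
    if bos_id is not None and (not ids or ids[0] != bos_id):
        ids = [bos_id] + ids  # apply BOS exactly once, never twice
    return ids

# Stand-in whitespace tokenizer, for illustration only.
toy_vocab = {}
def toy_tokenizer(text):
    return [toy_vocab.setdefault(tok, len(toy_vocab) + 1) for tok in text.split()]

# Every "engine" gets byte-identical inputs for the same prompt:
ids_a = preprocess("hello world", toy_tokenizer, bos_id=0)
ids_b = preprocess("hello world", toy_tokenizer, bos_id=0)
```

With inputs fixed like this, any throughput difference between TensorRT-LLM, vLLM, and SGLang reflects the speculative decoding implementation, not preprocessing drift.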
The framework integrates with production engines: TensorRT-LLM, vLLM, and SGLang. It captures detailed timing information from streaming responses to calculate: acceptance behavior, step latency, tokens per second at user level, and total throughput. An example from a real run shows how this looks in practice — running Llama 3.3 70B Instruct as the target model with EAGLE3 as the auxiliary model on Qualitative split, using TensorRT-LLM with batch size 32 on 8 H100 GPUs.
The output shows exactly what practitioners need: histogram of acceptance lengths, conditional acceptance rates for each step, results broken down by category, Output TPS, end-to-end times, TTFT (Time To First Token), and detailed latency statistics. This is information you can actually use to make engineering decisions.
Domain-dependent behavior — where speculative decoding shines and where it stumbles
Results from SPEED-Bench reveal a fascinating pattern: acceptance length in speculative decoding is highly domain-dependent. Low-entropy domains such as Coding and Math consistently achieve higher acceptance rates — the auxiliary model can predict future tokens with greater certainty. High-entropy domains such as Roleplay or Writing show lower acceptance rates.
This has enormous practical significance. If you're building a system serving mostly code, speculative decoding will be a game-changer for you. If you're serving mostly creative writing, the speedup will be much more modest. None of the previous benchmarks showed this so clearly because they either tested too little data or had too low semantic diversity within categories.
Additionally, results show that speedups change drastically depending on batch size and input sequence length. At small batch sizes and short sequences, speculative decoding can offer 2-3x speedup. At large batch sizes and long sequences in the memory-bound regime, speedup can be much more modest — sometimes approaching 1x. This is again something previous benchmarks completely missed.
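A back-of-the-envelope model makes the domain dependence concrete. If each drafted token is accepted independently with probability alpha (an idealized i.i.d. assumption, not how real models behave), the expected number of tokens produced per verification step with draft length k works out to a short geometric sum:

```python
def expected_tokens_per_step(alpha, k):
    """Expected tokens per target-model step under an i.i.d. acceptance
    model: sum of alpha**i for i = 0..k, i.e. (1 - alpha**(k+1)) / (1 - alpha).
    The i = 0 term is the token the target model always contributes
    after verification, even when every draft is rejected."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# High-acceptance domain (code-like, alpha ~ 0.8) versus a
# high-entropy domain (creative writing, alpha ~ 0.3), draft length 4:
code_like = expected_tokens_per_step(0.8, 4)  # ~3.36 tokens per step
creative  = expected_tokens_per_step(0.3, 4)  # ~1.43 tokens per step
```

Under this toy model, moving from coding-like to writing-like acceptance more than halves the expected per-step gain, matching the qualitative pattern the benchmark reports; the alpha values here are illustrative, not SPEED-Bench measurements.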
Implications for Polish creators and AI teams
For Polish teams working on models, inference optimization, or LLM serving systems, SPEED-Bench has concrete significance. First, it provides a standard reference point for comparing speculative decoding algorithms. If you publish research, you can now use SPEED-Bench instead of creating your own benchmark — this makes your work comparable with efforts from teams around the world.
Second, the benchmark reveals that speedups are highly situational. If you're building a system for a specific use case — coding assistant, RAG system, chatbot — you need to test on data representative of that domain. SPEED-Bench enables this. Instead of relying on general numbers, you can run SPEED-Bench on your hardware, with your models, and see exactly how speculative decoding affects your specific application.
Third, the measurement framework is open and integrated with popular engines. If you use vLLM or SGLang — and many Polish teams do — you can start using SPEED-Bench immediately without additional integration work.
Why this matters for the future of LLM inference
Speculative decoding isn't a transient optimization. It's a fundamental technique that will become ever more deeply embedded in LLM serving systems in the coming years. But like any technique, its value depends on whether we actually know how it works in practice. SPEED-Bench is the first benchmark that answers this question in a systematic, production-oriented way.
Standardizing speculative decoding benchmarking has consequences for the entire industry. Researchers will be able to iterate faster, knowing their results are comparable. Engineers will be able to make better deployment decisions. Hardware vendors will be able to optimize their engines for speculative decoding, knowing exactly which scenarios matter most in production.
In practice, SPEED-Bench represents a shift in how the industry thinks about accelerating inference. Instead of general numbers and theoretical analysis, we now have a tool to measure real scenarios. This is exactly what the industry needed, and it's exactly what NVIDIA delivered.