
Mercury: The First Commercial Diffusion LLM Revolutionizing AI Speed
A deep dive into how Inception Labs' Mercury is transforming language models with diffusion-based architecture, achieving unprecedented speeds of over 1000 tokens per second.
By Joshua Kaufmann & AI
The landscape of large language models (LLMs) has been dominated by a singular approach for years: autoregression, where text is generated one token at a time in a left-to-right sequence. This fundamental constraint has created speed bottlenecks and efficiency challenges as models grow more complex. Now, Inception Labs has unveiled something potentially revolutionary: Mercury, the first commercial-scale diffusion language model (dLLM) that could dramatically reshape how AI systems generate text.
Breaking the Sequential Barrier
Inception Labs has officially announced Mercury as what they call “a new generation of LLMs that push the frontier of fast, high-quality text generation” (Inception Labs, 2025). According to Analytics India Magazine (2025), the company is taking a fundamentally different approach from traditional autoregressive models. Instead of generating text sequentially, Mercury uses a “coarse-to-fine” generation process, where the output is refined through multiple “denoising” steps.
According to the original announcement from Inception Labs (2025), this approach enables Mercury to generate text “up to 10 times faster than frontier speed-optimized LLMs,” achieving over 1000 tokens per second on standard NVIDIA H100 GPUs. AIM Research (2025) points out that such speeds were previously matched only “by models hosted on specialised inference platforms—for instance, Mistral’s Le Chat running on Cerebras.”
How Diffusion Changes the Game
The key innovation behind Mercury lies in its parallel processing capability. Inception Labs (2025) explains the process:
“When prompted with a query, instead of producing the answer one token at a time, the answer is generated in a coarse-to-fine way… Improvements are suggested by a neural network – in our case a Transformer model – which is trained on large amounts of data to globally improve the quality of the answer by modifying multiple tokens in parallel.”
This represents a fundamental break from the sequential limitations of traditional LLMs. Analytics India Magazine (2025) quotes former OpenAI researcher Andrej Karpathy on this difference:
“Diffusion is different – it doesn’t go left to right, but all at once. You start with noise and gradually denoise into a token stream.”
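To make the coarse-to-fine idea concrete, here is a minimal toy sketch in Python. It is not Inception Labs’ implementation: the mask-based noise, the confidence-driven unmasking schedule, and the random stub standing in for the Transformer are all assumptions made purely for illustration.

```python
import random

# Toy coarse-to-fine generation: start from a fully "noised" (all-masked)
# sequence and refine positions in parallel over a fixed number of
# denoising steps. A real dLLM would use a trained Transformer to propose
# tokens and confidences; a random stub stands in for it here.
MASK = "<mask>"
VOCAB = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "+"]

def denoise_step(tokens, step, total_steps):
    """Propose a token for every masked position; commit the most confident."""
    proposals = {
        i: (random.choice(VOCAB), random.random())   # (token, confidence)
        for i, t in enumerate(tokens) if t == MASK
    }
    # Coarse-to-fine schedule: keep only the highest-confidence proposals
    # each step, so the whole sequence sharpens gradually.
    remaining = sum(t == MASK for t in tokens)
    quota = max(1, remaining // (total_steps - step))
    best = sorted(proposals.items(), key=lambda kv: -kv[1][1])[:quota]
    for i, (token, _) in best:
        tokens[i] = token
    return tokens

tokens, steps = [MASK] * 10, 4        # pure noise: every position masked
for s in range(steps):
    tokens = denoise_step(tokens, s, steps)
    print(f"step {s + 1}: {' '.join(tokens)}")
```

The point to notice is that every denoising step modifies many positions at once; that parallelism is where the claimed speed comes from.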
Mercury Coder: Speed Meets Quality
The first publicly available model in the Mercury family is Mercury Coder, which has been specifically optimized for code generation. Benchmark data published by Inception Labs (2025) shows that Mercury Coder not only matches or exceeds the performance of models like GPT-4o Mini, Gemini 2.0 Flash, and Claude 3.5 Haiku on standard coding tasks but does so at dramatically higher speeds.
Their published results indicate that Mercury Coder Mini achieves 1109 tokens per second while maintaining competitive scores on standard coding benchmarks like HumanEval (88.0) and MBPP (77.1). For comparison, the same data shows Claude 3.5 Haiku at only 61 tokens per second, while GPT-4o Mini reaches just 59 tokens per second on the same hardware (Inception Labs, 2025).
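Taken at face value, those throughput figures translate directly into wall-clock latency. A back-of-envelope check in Python (the 500-token completion length is an illustrative choice, not from the published benchmarks):

```python
# Wall-clock time for a 500-token completion at the published throughputs.
# The 500-token length is an illustrative assumption.
throughputs_tps = {
    "Mercury Coder Mini": 1109,   # tokens/sec (Inception Labs, 2025)
    "Claude 3.5 Haiku": 61,
    "GPT-4o Mini": 59,
}
n_tokens = 500
for model, tps in throughputs_tps.items():
    print(f"{model}: {n_tokens / tps:.2f} s")
# Mercury Coder Mini: 0.45 s
# Claude 3.5 Haiku: 8.20 s
# GPT-4o Mini: 8.47 s
```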
Founded by AI Pioneers
AIM Research (2025) reports that Inception Labs wasn’t founded by typical AI entrepreneurs. According to their reporting, the company was established by “Stanford professor Stefano Ermon and his colleagues Volodymyr Kuleshov and Aditya Grover” from prestigious institutions including Stanford, UCLA, and Cornell.
The same source notes that Ermon had hypothesized that “generating and modifying large blocks of text in parallel was possible with diffusion models,” and after years of research, his team “achieved a major breakthrough detailed in a research paper last year” (AIM Research, 2025). Inception Labs (2025) notes that their team has collectively contributed to foundational AI techniques including Direct Preference Optimization, Flash Attention, and Decision Transformers.
Implications for the AI Industry
The emergence of diffusion-based language models could have far-reaching implications:
- Cost Reduction: By processing tokens in parallel rather than sequentially, Mercury could significantly reduce the computational resources required for inference (a rough pass-count sketch follows this list). As Inception Labs (2025) puts it, their approach has the potential to “make high-quality AI solutions truly accessible.”
- New Capabilities: Inception Labs (2025) claims their diffusion models offer inherent advantages in reasoning and output structuring since they’re “not restricted to only considering previous output.” This architectural difference, they suggest, could enable better error correction and reduced hallucinations.
- Hardware Independence: While specialized inference hardware like Groq has gained attention for speed optimization, Analytics India Magazine (2025) notes that Mercury achieves similar performance improvements through algorithmic advancements rather than custom chips.
- Enterprise Applications: According to their official announcement, Mercury is available to enterprise customers through both an API and on-premise deployments, with support for model fine-tuning to adapt it for specific use cases (Inception Labs, 2025).
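The cost argument can be sketched as a pass-count comparison. This assumes, purely for illustration, one forward pass per autoregressive token versus a fixed number of parallel denoising steps; the step count below is invented (not Mercury’s), and real per-pass FLOPs differ between the two architectures.

```python
# Pass-count comparison behind the cost argument. The denoise_steps value
# is an invented illustration, not Mercury's actual schedule.
def autoregressive_passes(n_tokens: int) -> int:
    return n_tokens                     # one forward pass per generated token

def diffusion_passes(n_tokens: int, denoise_steps: int = 20) -> int:
    return denoise_steps                # each pass refines every token at once

for n in (100, 500, 1000):
    print(f"{n:>5} tokens: AR={autoregressive_passes(n):>4} passes, "
          f"diffusion={diffusion_passes(n)} passes")
```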
Future Directions
Inception Labs (2025) states that Mercury Coder is just the beginning of their product roadmap. They report that a model designed for general chat applications is already in closed beta. According to their announcement, they envision diffusion language models enabling new capabilities including:
- Improved agentic applications requiring extensive planning
- Advanced reasoning that can fix hallucinations while maintaining speed
- Controllable generation allowing text infilling and format conformity (a toy infilling sketch follows this list)
- Edge applications on resource-constrained devices like phones and laptops
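Because denoising conditions on the entire sequence at once, tokens on both sides of a gap can be held fixed while the gap is generated, which is what makes infilling natural for this architecture. A toy sketch, where the template and the stand-in predictions are invented for the example:

```python
# Toy infilling: "anchor" tokens stay clamped while masked gaps are filled
# in around them. A real dLLM would produce the predictions by iterating
# denoising steps with the anchors held fixed; a dict stands in here.
MASK = "_"
template = ["def", MASK, "(", "a", ",", "b", ")", ":", "return", MASK]
predictions = {1: "add", 9: "a + b"}    # stand-in for model output

def infill(tokens, predictions):
    # Denoising conditions on the whole sequence, so tokens before AND
    # after each gap constrain what gets generated there.
    return [predictions.get(i, t) for i, t in enumerate(tokens)]

print(" ".join(infill(template, predictions)))
# def add ( a , b ) : return a + b
```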
Try It Yourself
According to their announcement, curious developers can test Mercury Coder directly in Inception Labs’ playground, which is hosted in partnership with Lambda Labs (Inception Labs, 2025). This offers a chance to experience the model’s code-generation speed and quality firsthand.
A Paradigm Shift in the Making?
While it’s too early to declare the end of autoregressive models, Mercury represents a potentially significant innovation in how AI systems generate text. If diffusion models can maintain quality while delivering the dramatic speed improvements Inception Labs claims, we may be witnessing the beginning of a new era for language models.
Analytics India Magazine (2025) highlights a comment from Andrew Ng on social media about this development:
“Transformers have dominated LLM text generation and generate tokens sequentially. This is a cool attempt to explore diffusion models as an alternative by generating the entire text at the same time using a coarse-to-fine process.”
Whether Mercury will fulfill its promise of revolutionizing LLMs remains to be seen, but the early results certainly suggest diffusion models deserve serious attention as potential successors to the autoregressive approach that has dominated the field until now.
Sources:
- Inception Labs. (2025). Introducing Mercury, the first commercial-scale diffusion large language model. https://www.inceptionlabs.ai/news
- Analytics India Magazine. (2025). The ‘First Commercial Scale’ Diffusion LLM Mercury Offers over 1000 Tokens/sec on NVIDIA H100. https://analyticsindiamag.com/ai-features/the-first-commercial-scale-diffusion-llm-mercury-offers-over-1000-tokens-sec-on-nvidia-h100/
- AIM Research. (2025). Diffusion Models Enter the Large Language Arena as Inception Labs Unveils Mercury. https://aimresearch.co/ai-startups/diffusion-models-enter-the-large-language-arena-as-inception-labs-unveils-mercury