Xiaomi MiMo: 1000 tokens/s LLM inference explained
Xiaomi's MiMo model achieves 1000 tokens per second with FP4 quantization. Analysis of the speed, trade-offs, and implications for builders. Includes community reactions from Hacker News.
Xiaomi's MiMo model claims to achieve 1000 tokens per second inference speed on a single GPU rack. The demo video shows text streaming faster than you can read. But as with any eye-popping number, the devil is in the details.
Xiaomi's MiMo Model: 1000 tokens/s Breakdown
According to a blog post discussed on Hacker News, the Xiaomi team describes a modified MiMo model (likely a mixture-of-experts design) that uses FP4 quantization and custom CUDA kernels to reach unusually high inference speeds. They benchmarked it against their own previous models and claim a 3x cost reduction over standard FP8 inference while maintaining acceptable quality. The key metrics:
- 1000 tokens/s generation on an 8-GPU rack (presumably NVIDIA A100 or H100)
- 3x cheaper per token compared to their own FP8 baseline
- Limited release – currently only available to select partners, not open source
A demo video shows the model answering prompts in real time with virtually no latency. The HN community took notice.
Hacker News Reactions to the Speed Claim
The thread (about 50 points, 9 comments) is a mix of awe and scepticism. One commenter wrote:
"Assuming they mean 8xA100 or similar, that's some rather insane performance, and at just 3x the cost, it still quite cheap-ish."
Another focused on the speed:
"The generation speed in the demo video is 'crazy' at least and beyond my impressions to LLMs."
But not everyone is sold. A few raised concerns about FP4 quantization:
"FP4 Quantization, acceptable for some tasks but probably a waste for hard tasks."
And a comparison to Cerebras came up:
"Cerebras accomplished this with GLM 4.7 a while back, it was really nice but once GLM got to 5/5.1 they couldn't sustain this speed."
The limited release also puzzled some:
"Why the limited release? They should have no trouble scaling it if it runs on a single rack."
FP4 Quantization: The Real Innovation
This is an impressive engineering feat, but let's put it in perspective. 1000 tokens/s is fast: at roughly 0.75 words per token, that's about 750 words per second, well over a hundred times faster than most people read (around 4 to 5 words per second). For a human reading a chat reply, it's overkill; for batch processing or agent loops, it's a game changer if quality holds.
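A quick back-of-envelope conversion puts the throughput figure in reading-speed terms. This is only a sketch: the 0.75 words-per-token ratio and the 250 words-per-minute reading speed are assumed rules of thumb, not numbers from Xiaomi's post, and the function names are made up for illustration.

```python
# Back-of-envelope: convert generation throughput into words and compare it
# with human reading speed. Both constants are assumed rules of thumb.
WORDS_PER_TOKEN = 0.75        # assumption: typical ratio for English text
READING_WORDS_PER_MIN = 250   # assumption: average adult silent reading


def words_per_second(tokens_per_second: float) -> float:
    """Convert token throughput to approximate word throughput."""
    return tokens_per_second * WORDS_PER_TOKEN


def reading_speed_multiple(tokens_per_second: float) -> float:
    """How many times faster than a human reader the stream is."""
    human_words_per_second = READING_WORDS_PER_MIN / 60
    return words_per_second(tokens_per_second) / human_words_per_second


if __name__ == "__main__":
    print(f"{words_per_second(1000):.0f} words/s")
    print(f"{reading_speed_multiple(1000):.0f}x reading speed")
```

Under these assumptions the stream outpaces a reader by two orders of magnitude, which is why the speed matters more for machine consumers (pipelines, agents) than for people watching text appear.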
The FP4 quantization is the real story. We've seen FP8 become mainstream, but FP4 is still niche. Xiaomi seems to have handled the trade-off well for many tasks, but as the commenter noted, hard tasks (math, reasoning, complex code) are where quality is most likely to degrade. The Cerebras comparison is apt: according to the commenter, its wafer-scale hardware reached similar speeds with GLM 4.7 a while back but could not sustain them once GLM moved to 5/5.1.
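To make the precision trade-off concrete, here is a minimal sketch of what a 4-bit float does to a block of weights. It assumes the E2M1 layout (one sign bit, two exponent bits, one mantissa bit), whose representable magnitudes are 0, 0.5, 1, 1.5, 2, 3, 4 and 6, combined with one scale factor per block. The blog post does not say which FP4 variant or scaling scheme MiMo uses, so both are illustrative assumptions, and `quantize_block` and `max_abs_error` are hypothetical names.

```python
# Sketch of block-scaled FP4 (E2M1) rounding. The format and the per-block
# scaling are assumptions; the source does not describe MiMo's actual scheme.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(values: list[float]) -> list[float]:
    """Scale a block so its largest magnitude maps to 6.0, then snap each
    value to the nearest representable FP4 magnitude and scale back."""
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return [0.0 for _ in values]
    scale = peak / E2M1_MAGNITUDES[-1]
    quantized = []
    for v in values:
        target = abs(v) / scale
        # Simplification: ties go to the smaller magnitude, whereas real
        # hardware typically rounds to nearest-even.
        nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(m - target))
        quantized.append((nearest if v >= 0 else -nearest) * scale)
    return quantized


def max_abs_error(values: list[float], quantized: list[float]) -> float:
    """Largest absolute difference between original and quantized values."""
    return max(abs(a - b) for a, b in zip(values, quantized))


if __name__ == "__main__":
    weights = [0.12, -0.90, 0.33, 0.05, -0.47, 0.61, 0.02, -0.25]
    approx = quantize_block(weights)
    print([round(w, 3) for w in approx])
    print("max abs error:", round(max_abs_error(weights, approx), 3))
```

With only eight magnitudes per sign, every value in a block is forced onto a coarse grid, which is one plausible reason the harder tasks mentioned above are more sensitive than simple ones.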
What the blog post lacks is clear benchmark results on standard tasks (MMLU, HumanEval, etc.) comparing FP4 to FP8 quality. Without them, we can't judge the real utility. For more on quantization techniques, see the Hugging Face quantization guide.
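Until such numbers are published, a builder with access to both precisions could run a small comparison on their own data. The sketch below is one simple way to do that, using exact-match accuracy. The two model callables are hypothetical stand-ins for FP8 and FP4 endpoints, and exact match is an assumed metric that suits short factual answers better than free-form text.

```python
from typing import Callable

# Each example pairs a prompt with its expected answer. The model callables
# passed in are hypothetical stand-ins for FP8 and FP4 endpoints.
Example = tuple[str, str]
Model = Callable[[str], str]


def exact_match_accuracy(model: Model, examples: list[Example]) -> float:
    """Fraction of examples where the model output matches exactly."""
    if not examples:
        raise ValueError("examples must not be empty")
    hits = sum(
        model(prompt).strip() == expected.strip()
        for prompt, expected in examples
    )
    return hits / len(examples)


def quality_gap(fp8_model: Model, fp4_model: Model,
                examples: list[Example]) -> float:
    """Accuracy lost by moving from the FP8 model to the FP4 model."""
    return (exact_match_accuracy(fp8_model, examples)
            - exact_match_accuracy(fp4_model, examples))


if __name__ == "__main__":
    # Toy data and stub models, only to show how the functions are called.
    examples = [("2+2", "4"), ("3*3", "9"), ("10/4", "2.5"), ("7-5", "2")]
    fp8_answers = {"2+2": "4", "3*3": "9", "10/4": "2.5", "7-5": "2"}
    fp4_answers = {"2+2": "4", "3*3": "9", "10/4": "2", "7-5": "2"}
    gap = quality_gap(fp8_answers.__getitem__, fp4_answers.__getitem__, examples)
    print(f"accuracy gap: {gap:.2f}")
```

A gap near zero on a representative sample is a reasonable signal that the cheaper precision is safe for that workload; a large gap argues for keeping the higher-precision path.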
Practical Implications for Builders
If you're building latency-sensitive applications, like voice assistants or real-time co-pilots, 1000 tokens/s could dramatically improve user experience. For agentic systems that need many sequential calls (e.g., code generation pipelines), lower cost per token directly reduces operational expenses.
Here's a rough cost comparison using hypothetical numbers:
- FP8 baseline: $0.50 per million tokens
- FP4 MiMo: $0.17 per million tokens (3x cheaper)
- Speed: 1000 tokens/s (FP4) vs ~200 tokens/s (FP8)
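To see what those hypothetical figures mean for a concrete workload, the sketch below computes total cost and single-stream generation time for a given token volume. The prices and speeds are the illustrative numbers from the comparison above, not published pricing, and the 2-billion-token workload is an arbitrary example.

```python
# Hypothetical figures from the comparison above, not published pricing.
FP8_PRICE_PER_M = 0.50    # USD per million tokens
FP4_PRICE_PER_M = 0.17    # USD per million tokens
FP8_TOKENS_PER_S = 200
FP4_TOKENS_PER_S = 1000


def cost_usd(tokens: float, price_per_million: float) -> float:
    """Total cost in USD for generating `tokens` tokens."""
    return tokens / 1_000_000 * price_per_million


def stream_hours(tokens: float, tokens_per_second: float) -> float:
    """Hours needed to generate `tokens` tokens on one sequential stream."""
    return tokens / tokens_per_second / 3600


if __name__ == "__main__":
    workload = 2_000_000_000  # arbitrary example: 2 billion tokens
    for name, price, speed in [
        ("FP8", FP8_PRICE_PER_M, FP8_TOKENS_PER_S),
        ("FP4", FP4_PRICE_PER_M, FP4_TOKENS_PER_S),
    ]:
        print(f"{name}: ${cost_usd(workload, price):,.2f}, "
              f"{stream_hours(workload, speed):,.1f} stream-hours")
```

The time figure treats the workload as one sequential stream; real deployments batch many requests, so it is best read as a relative comparison between the two precisions.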
But you need to test quality on your domain. If your task involves precise reasoning, FP4 might introduce errors. Consider a two-tier routing strategy:
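# complex_reasoning_required, infer_fp8 and infer_fp4 are placeholders
# for your own classifier and model endpoints.
def route_prompt(prompt):
    if complex_reasoning_required(prompt):
        return infer_fp8(prompt)  # slower but accurate
    return infer_fp4(prompt)  # fast and cheap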
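The routing function above leaves its three helpers undefined. One way to fill them in, as a self-contained sketch: the keyword list, the length threshold, and the `make_router` factory are all hypothetical choices made for illustration, and the backends are passed in as plain callables so any client library can be plugged in.

```python
import re
from typing import Callable

# Hypothetical heuristic: a production router would more likely use a small
# classifier or measured per-task accuracy than a keyword list.
REASONING_HINTS = re.compile(
    r"\b(prove|derive|debug|refactor|step by step|calculate)\b",
    re.IGNORECASE,
)
MAX_SIMPLE_PROMPT_CHARS = 2000  # assumed cutoff for "long, probably hard"


def complex_reasoning_required(prompt: str) -> bool:
    """Guess whether a prompt needs the higher-precision model."""
    too_long = len(prompt) > MAX_SIMPLE_PROMPT_CHARS
    return too_long or bool(REASONING_HINTS.search(prompt))


def make_router(
    infer_fp8: Callable[[str], str],
    infer_fp4: Callable[[str], str],
) -> Callable[[str], str]:
    """Build a route_prompt function bound to the two given backends."""
    def route_prompt(prompt: str) -> str:
        if complex_reasoning_required(prompt):
            return infer_fp8(prompt)   # slower but accurate
        return infer_fp4(prompt)       # fast and cheap
    return route_prompt


if __name__ == "__main__":
    # Stub backends that only report which tier handled the prompt.
    router = make_router(lambda p: "fp8", lambda p: "fp4")
    print(router("Prove that the square root of 2 is irrational"))
    print(router("Summarize this paragraph in one sentence"))
```

Passing the backends in keeps the routing rule testable without network calls, and the heuristic can be replaced without touching the callers.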
Xiaomi's approach may also inspire hardware-level optimizations for other model providers. Expect more companies to experiment with aggressive quantization and custom kernels.
Verdict: When to Use MiMo's Ultra-Fast Inference
Yes, if you run inference at scale: the claimed cost and speed improvements are substantial, provided quality is acceptable for your use case. Maybe, if you're a startup: you could build products that were previously cost-prohibitive. No, if you need top-tier accuracy: stick with FP8 or higher precision. In short, this is a glimpse of the next frontier in LLM inference, but don't trade quality for speed without testing.