Gemma 4 Multi-Token Prediction: Faster Inference for Local AI
Google's new multi-token prediction drafters for Gemma 4 cut inference time dramatically, making local AI models more practical. Here's what it means for builders.
If you've run a large language model locally, you know the pain of watching tokens trickle out one by one. Google's latest release for Gemma 4 changes that, introducing multi-token prediction drafters that can speed up generation by up to 4x on supported tasks.
What's the story?
Google's blog post details a technique that makes Gemma 4 dramatically faster at inference. The key idea is to use a smaller "drafter" model that predicts multiple future tokens in a single forward pass. The main model then verifies those predictions in parallel, keeping the ones that match what it would have generated and discarding the rest. Because several tokens can be accepted per main-model pass, the number of sequential autoregressive steps drops, and latency drops with it.
According to Google, this method can speed up generation by up to 4x on some tasks while maintaining quality. The drafter is lightweight and can be trained alongside the main model. This is a form of speculative decoding, but applied to a modern open-weight model.
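The draft-then-verify loop described above can be sketched in a few lines. This is a toy illustration, not Google's implementation: the "models" are plain Python functions over integer tokens, decoding is greedy (real engines that sample use a rejection-sampling rule rather than an exact match), and the names `speculative_step`, `generate`, `toy_target`, and `toy_drafter` are all hypothetical.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def speculative_step(target: NextToken, drafter: NextToken,
                     context: List[Token], k: int) -> List[Token]:
    """Run one draft-then-verify step and return the newly accepted tokens."""
    # 1. The drafter cheaply proposes k tokens.
    draft: List[Token] = []
    for _ in range(k):
        draft.append(drafter(context + draft))

    # 2. The target checks each drafted position. A real engine scores all
    #    k positions in one batched forward pass; this loop only mimics that.
    accepted: List[Token] = []
    for tok in draft:
        expected = target(context + accepted)
        if tok != expected:
            accepted.append(expected)  # first mismatch: keep the target's token, stop
            return accepted
        accepted.append(tok)

    # 3. Every draft matched, so the same pass also yields one bonus token.
    accepted.append(target(context + accepted))
    return accepted


def generate(target: NextToken, drafter: NextToken, prompt: List[Token],
             max_new: int, k: int = 4) -> Tuple[List[Token], int]:
    """Generate max_new tokens; also return how many verify steps were needed."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(target, drafter, out, k))
        steps += 1
    return out[:len(prompt) + max_new], steps


def toy_target(ctx: List[Token]) -> Token:
    """Stand-in for the main model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_drafter(ctx: List[Token]) -> Token:
    """Stand-in for the drafter: agrees with the target except after a 4."""
    return 9 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


tokens, steps = generate(toy_target, toy_drafter, prompt=[0], max_new=12, k=3)
print(tokens, "generated in", steps, "verify steps")
```

Every step emits at least one token chosen by the target, so in this greedy setting the output is identical to what the target would produce on its own. The drafter only changes how many sequential steps it takes to get there.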
Why it's blowing up on HN
The community is buzzing about real-world speed gains. One commenter noted:
"It's not uncommon to see a gemma vs qwen comparison, where qwen does a bit better, but spent 22 minutes on the task, while gemma aligned the buttons wrong, but only spent 4 minutes on the same prompt."
This highlights that raw speed can be as important as benchmark scores—especially for local use where every second matters.
Another commenter likened the improvement to dial-up modem upgrades:
"Watching the computer write text sort of reminds me of using a modem to call a BBS in the old days. This seems like going from 300 baud to 1200."
The hardware angle also came up. Several commenters discussed fitting Gemma 4 into limited VRAM. One wrote:
"Google is singlehandedly carrying western open source models. Gemma 4 31B is fantastic. However, it is a little painful to try to fit the best possible version into 24GB vram with vision + this drafter soon."
And momentum is building: multi-token prediction support is already being added to llama.cpp, starting with Qwen models.
My take
This is a big deal for anyone running models on consumer hardware. The technique itself isn't novel—speculative decoding has been around—but applying it to a strong open model like Gemma 4 makes it practical. What stands out is Google's focus on efficiency rather than just raw benchmark supremacy.
Notice how the community isn't just celebrating the speed; they're also discussing memory trade-offs. The drafter adds a small overhead, and fitting both the model and drafter into 24GB VRAM requires clever quantization and maybe a second GPU. That's a real constraint, but it's manageable.
This also pushes local AI closer to real-time usability. For chatbots, code assistants, or interactive demos, latency is critical. A 4x speedup transforms the experience from "waiting for a modem" to "almost instant."
What this means for builders
If you deploy models locally (on a personal server, edge device, or even a laptop), this is directly relevant. You can now serve higher-quality models with acceptable latency. For example, a larger model that was too slow for interactive use may become practical once paired with a small drafter, which narrows the quality-versus-speed trade-off seen in the Gemma vs Qwen comparison above.
Here's a conceptual example of how you might configure inference with a drafter in vLLM (once support lands):
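from vllm import LLM, SamplingParams

# Model IDs below are illustrative; substitute the names actually published
# for Gemma 4 and its drafter.
llm = LLM(
    model="google/gemma-4-9b",
    speculative_config={
        "model": "google/gemma-4-9b-drafter",  # small draft model
        "num_speculative_tokens": 4,  # tokens drafted per verification step
    },
)

params = SamplingParams(temperature=0.7, max_tokens=512)
# generate() returns one result object per prompt
outputs = llm.generate("Explain multi-token prediction.", params)
print(outputs[0].outputs[0].text)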
But there's a trade-off: memory usage increases slightly. You'll need to budget for both the main model and the drafter in VRAM. Quantization helps—Q4 or Q5 versions can fit Gemma 4 9B plus a small drafter into 24GB.
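A rough way to budget for this is to estimate weight memory as parameters times bits per weight, then add headroom for the KV cache and runtime buffers. The sketch below does that arithmetic. The helper names, the 1B drafter size, the bits-per-weight figures, and the 3 GiB overhead are all illustrative assumptions, not published numbers. Real quantized files vary by format, and a "24GB" card is treated here as 24 GiB.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GiB: parameters x bits / 8 bytes."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30


def vram_budget(main_b: float, drafter_b: float, bits_per_weight: float,
                overhead_gib: float, vram_gib: float = 24.0):
    """Return (estimated total GiB, whether it fits in vram_gib)."""
    total = (weight_gib(main_b, bits_per_weight)
             + weight_gib(drafter_b, bits_per_weight)
             + overhead_gib)
    return total, total <= vram_gib


# Assumed inputs: a 31B main model (the size mentioned in the HN comment),
# a hypothetical 1B drafter, and 3 GiB reserved for KV cache and buffers.
for bits in (4.5, 5.5, 8.0):
    total, ok = vram_budget(main_b=31, drafter_b=1, bits_per_weight=bits,
                            overhead_gib=3.0)
    print(f"{bits} bits/weight: ~{total:.1f} GiB, fits in 24 GiB: {ok}")
```

Longer contexts grow the KV cache, so the overhead term is the one to revisit first if a configuration that fits on paper runs out of memory in practice.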
Also, not all tasks benefit equally. The gain depends on how often the main model accepts the drafter's guesses, so heavy reasoning tasks where the next token is hard to predict may see less improvement than creative writing or code generation.
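To see why the benefit varies by task, it helps to model the drafter's hit rate. Under the common simplifying assumption that each drafted token is accepted independently with probability `alpha`, a step that drafts `k` tokens yields on average (1 - alpha^(k+1)) / (1 - alpha) tokens per main-model pass. The sketch below evaluates that formula. It ignores the drafter's own compute cost, so it gives an upper bound on the speedup rather than a prediction, and the function name is hypothetical.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per main-model pass with k drafted tokens.

    Assumes each draft is accepted independently with probability alpha
    and that a rejected draft is replaced by one token from the main model.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    return (1 - alpha ** (k + 1)) / (1 - alpha)


# Compare low, medium, and high acceptance rates for 4 drafted tokens.
for alpha in (0.3, 0.6, 0.9):
    tokens = expected_tokens_per_pass(alpha, k=4)
    print(f"alpha={alpha}: ~{tokens:.2f} tokens per main-model pass")
```

With a low acceptance rate most drafted tokens are thrown away and the result stays close to one token per pass, which is the plain autoregressive baseline.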
Should you care?
Yes, if you run open models locally and care about latency—especially for interactive applications like chatbots or real-time assistants. The speedup is real and immediately practical. If you only use cloud APIs via providers like OpenAI or Anthropic, you can ignore this for now. But if you value privacy, cost control, or want to experiment with local AI, multi-token prediction for Gemma 4 is a compelling upgrade.
Original story: Google Blog | HN Discussion | llama.cpp PR