Google’s Gemma 4 AI models speed up generation by up to 3x by predicting tokens ahead



Google released its Gemma 4 open models this spring, promising new power and functionality for local AI. Google’s push into edge AI may now be accelerating further with the release of Multi-Token Prediction (MTP) drafter models for Gemma. Google says these models take a speculative approach to predicting upcoming tokens, which can speed up generation compared to the way models normally produce tokens one at a time.

The latest Gemma models are built on the same technology that powers Google’s frontier Gemini AI, but they are optimized to run locally. Gemini is designed to run on Google TPU chips, which operate in large clusters with very fast interconnects and memory. A single powerful AI accelerator can run the largest version of Gemma 4 at high precision, and reduced-precision (quantized) versions can run on consumer GPUs.

Gemma lets users chat with an AI on their own hardware instead of sending all of their data to a cloud AI system run by Google or anyone else. Google also changed the Gemma 4 license to Apache 2.0, which is more permissive than the custom Gemma license Google used in previous releases. However, most people face hardware limitations when running local AI models. That’s where MTP comes in.

LLMs like Gemma (or Gemini) generate tokens autoregressively, meaning they produce one token at a time based on all of the tokens that came before it. Each token takes as much computational work as the last, regardless of whether it is a filler word in the output or a crucial piece of a complex reasoning problem.
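To make that token-by-token loop concrete, here is a minimal Python sketch of autoregressive decoding. The toy_model function is a hypothetical stand-in for a real LLM forward pass; the only point is that every new token requires another full pass over the model, conditioned on everything generated so far.

    # Minimal sketch of autoregressive decoding: every new token needs a full
    # forward pass conditioned on all of the tokens generated so far.
    # toy_model is a hypothetical stand-in for a real LLM forward pass.

    def toy_model(tokens):
        # A real model would read all of its parameters from memory here,
        # which is the expensive part; this toy just returns a fixed function
        # of the context so the example runs on its own.
        return (sum(tokens) * 31 + 7) % 1000

    def generate(prompt_tokens, num_new_tokens):
        tokens = list(prompt_tokens)
        for _ in range(num_new_tokens):
            next_token = toy_model(tokens)  # one full pass per token
            tokens.append(next_token)       # feed it back in for the next step
        return tokens

    print(generate([1, 2, 3], 5))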

The problem with running your own AI is that your machine’s memory is usually far slower than the high-bandwidth memory (HBM) used in enterprise hardware. As a result, the GPU spends most of its time streaming model parameters from VRAM to its compute units for each token, and much of its compute capacity sits idle while it waits.
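A rough way to see why memory bandwidth dominates: during decoding the GPU has to read essentially all of the model’s weights for every single token, so tokens per second is roughly memory bandwidth divided by model size. The sketch below runs that arithmetic with illustrative numbers (a 26-billion-parameter model at 8-bit precision, roughly 1 TB/s of consumer bandwidth vs. roughly 3.3 TB/s of HBM); none of these figures come from the article.

    # Back-of-envelope estimate of memory-bandwidth-bound decoding speed.
    # All numbers below are illustrative assumptions, not measured figures.

    def tokens_per_second(param_count, bytes_per_param, bandwidth_gb_s):
        model_bytes = param_count * bytes_per_param  # weights read once per token
        seconds_per_token = model_bytes / (bandwidth_gb_s * 1e9)
        return 1.0 / seconds_per_token

    # Hypothetical 26B-parameter model quantized to 1 byte per parameter (8-bit).
    consumer_gpu = tokens_per_second(26e9, 1.0, 1000)     # ~1 TB/s GDDR-class memory
    hbm_accelerator = tokens_per_second(26e9, 1.0, 3300)  # ~3.3 TB/s HBM-class memory

    print(f"Consumer GPU:    ~{consumer_gpu:.0f} tokens/s")
    print(f"HBM accelerator: ~{hbm_accelerator:.0f} tokens/s")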

Gemma 4 26B on an NVIDIA RTX PRO 6000: standard inference (left) vs. the MTP drafter (right), measured in tokens per second. The drafter cuts the waiting time roughly in half.

MTP uses that idle time to sidestep the heavy model and draft candidate tokens with a much lighter one. Although the drafter is small (only 74 million parameters in Gemma 4 E2B), it is optimized in several ways to speed up token generation. For example, the drafter shares the key-value cache (essentially the LLM’s working memory) so it doesn’t have to recompute what the big model has already processed. The E2B and E4B models also use a step-down approach to draft tokens in groups.
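The drafting idea follows the familiar draft-and-verify pattern: a small model cheaply proposes several tokens ahead, and the big model then checks the whole batch in what amounts to a single pass, keeping the prefix it agrees with. Below is a minimal sketch of that pattern; big_model and small_drafter are hypothetical stand-ins, and a real MTP drafter would also share the key-value cache as described above.

    # Minimal sketch of draft-and-verify decoding, the idea behind an MTP drafter.
    # big_model and small_drafter are hypothetical stand-ins for real models.

    def big_model(tokens):
        # Expensive stand-in: the "correct" next token for any prefix.
        return (sum(tokens) * 31 + 7) % 1000

    def small_drafter(tokens):
        # Cheap stand-in: usually agrees with the big model, occasionally guesses wrong.
        guess = (sum(tokens) * 31 + 7) % 1000
        return guess if sum(tokens) % 5 else guess + 1

    def speculative_step(tokens, draft_len=4):
        # 1) The drafter proposes draft_len tokens one after another (cheap).
        draft, ctx = [], list(tokens)
        for _ in range(draft_len):
            t = small_drafter(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) The big model verifies the draft; in a real system this check is a
        #    single batched forward pass, so it costs about as much as one token.
        accepted, ctx = [], list(tokens)
        for t in draft:
            target = big_model(ctx)
            if t == target:
                accepted.append(t)       # draft token matches: accepted for free
                ctx.append(t)
            else:
                accepted.append(target)  # mismatch: take the big model's token, stop
                break
        return tokens + accepted

    tokens = [1, 2, 3]
    for _ in range(3):
        tokens = speculative_step(tokens)
    print(tokens)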


