Gemma 4 Models Optimized for Intel Hardware: Enabling Instant Deployment from Day Zero

We’re excited to announce Intel’s strategic partnership with Google to deliver optimized Gemma 4 models on Intel hardware from day one. This collaboration enables developers to leverage the power of Google’s latest AI models on Intel hardware: Intel® Core™ Ultra processors, Intel® Xeon® CPUs, and Intel® Arc™ GPUs. Developers can create AI applications that run from data centers to workstations to AI PCs, and even power mobile applications.

What’s New in Gemma 4

Google today announced the Gemma 4 model lineup, featuring new capabilities across multiple model architectures. Gemma 4 comprises a family of 4 model sizes across 3 architectures (Small, Dense, and MoE). All models are multimodal (text + image input, text output), and the small models additionally accept audio input. Other key capabilities include thinking, coding, function calling, OCR, object detection, ASR (audio-in models only), and long context windows (128k – 256k).

The Hardware Foundation

Intel® Xeon® CPUs are increasingly used for AI inference as cost-effective alternatives for small to medium sized models. Intel® AMX (Advanced Matrix Extensions) provides on-chip acceleration for matrix multiplication, significantly boosting inference speeds for the BF16 and INT8 datatypes. This helps applications using the smaller Gemma 4 models consistently meet their inference latency requirements. The high memory capacity, terabytes in some configurations, makes it possible to run even the larger models, and most enterprises already run servers with Xeon CPUs. With these optimizations in place, running Gemma 4 models on existing Intel® Xeon® datacenter systems is seamless.
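To make the datatype and memory-capacity points concrete, here is a rough back-of-the-envelope sketch of how the choice between FP32, BF16, and INT8 changes the memory needed just to hold a model's weights. The parameter counts are illustrative round numbers, not official Gemma 4 figures:

```python
# Approximate weight-memory footprint at different datatypes.
# Parameter counts below are illustrative, not official Gemma 4 figures.

BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1}

def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Memory (GiB) needed to hold the weights alone, ignoring
    activations, KV cache, and runtime overhead."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30

for params in (4e9, 27e9):  # e.g. a ~4B and a ~27B parameter model
    for dtype in ("fp32", "bf16", "int8"):
        gib = weight_memory_gib(params, dtype)
        print(f"{params / 1e9:>4.0f}B {dtype}: ~{gib:6.1f} GiB")
```

Halving bytes per parameter (FP32 to BF16, or BF16 to INT8) halves the weight footprint, which is why AMX acceleration of BF16 and INT8 pairs naturally with Xeon's large memory capacity.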

Intel® Xe GPU-based systems, such as the newly launched Intel® Arc™ Pro B70/B65 GPUs, are designed to meet the needs of modern AI inference and provide an all-in-one inference platform. With enhanced memory capacity, they aim to simplify adoption and use. With a containerized solution built for Linux environments, these systems are optimized to deliver strong inference performance with multi-GPU scaling and PCIe P2P data transfer. They also include enterprise-class reliability and manageability features such as ECC, SR-IOV, telemetry, and remote firmware updates.

Intel® Core™ Ultra processors enable everyday consumers to experience and experiment with Gemma 4 through AI PCs. This newest processor family packs remarkable AI capability even into thin-and-light devices, bringing cutting-edge AI workflows to consumer laptops. Intel® Core™ Ultra Series 3 processors combine CPU, GPU, and NPU compute in a single package. For example, Intel® Core™ Ultra X9 processors with the Intel® Arc™ B390 GPU, featuring XMX (Xe Matrix Extensions) AI engines, deliver the performance needed for local AI experiences while maintaining the power efficiency essential for mobile computing. With advanced AI capabilities now broadly available, students, creators, and everyday users can explore the latest AI innovations, all running natively on their personal devices.

The Software Stack

Intel’s “upstreaming first” strategy on open-source AI frameworks such as PyTorch, Hugging Face transformers, vLLM, and SGLang builds a solid foundation for this day-zero experience on Intel® Xeon® CPUs and Intel® Xe GPUs. For years, Intel has worked closely with the open-source community on kernel optimizations and feature enabling. Here are the key features of Gemma 4 and how they are supported on Intel hardware:

  • Attention: Gemma 4 uses 2 variants of attention in different layers: sliding attention and full attention.
    • On Intel® Xe GPUs, vLLM attention kernels written in Triton work out of the box, and flash attention kernels optimized with Intel® SYCL*TLA provide an additional performance boost. On Intel® Xeon® CPUs, both sliding and full attention work out of the box with vLLM’s built-in CPU attention backend.
    • For Hugging Face transformers, both variants are supported out of the box through PyTorch* operations.
  • Gemma4MoE: The MoE path leverages a highly optimized FusedMoE backend. Intel upstreamed optimized FusedMoE kernels for both Intel® Xeon® CPUs and Intel® Xe GPUs in vLLM and Hugging Face transformers, so MoE layers work out of the box.
  • Vision Tower and Audio Tower: These are transformer models running on Hugging Face transformers. With solid Hugging Face transformers support, both towers are enabled on Intel® Xeon® CPUs and Intel® Xe GPUs.

Enhanced Model Optimization via OpenVINO™ toolkit: OpenVINO™ support for Gemma 4 delivers advanced model optimization and deployment capabilities, enabling developers to maximize performance on Intel hardware.

OpenVINO™ Integration with LiteRT for Gemma 4 Deployment on Intel NPUs:

OpenVINO™ is being integrated as a backend in LiteRT, Google’s on-device framework for high-performance ML and GenAI deployment at the edge. We have been collaborating closely with Google to enable high-performance support for both LiteRT and LiteRT-LM on Intel platforms through the OpenVINO™ backend. This collaboration targets the advanced NPUs featured in Intel® Core™ Ultra Series 2 processors and the newly launched Intel® Core™ Ultra Series 3 processors, running on Microsoft Windows and Linux. As part of this effort, we have achieved early internal enablement of the Gemma 4 E2B LiteRT model through the OpenVINO™ backend, demonstrating that LLM inference can be efficiently offloaded to the on-device NPU while significantly reducing power consumption and maintaining the responsiveness required for on-device AI experiences. A follow-up technical update will provide detailed guidance on running the Gemma 4 model with the OpenVINO™ backend through LiteRT-LM, enabling developers to fully exploit hardware acceleration on these systems.

Experience Gemma 4 on Intel Hardware

Gemma 4 models (Gemma 4 E2B, Gemma 4 E4B, Gemma 4 26B A4B, Gemma 4 31B) are verified on Intel® Xeon® CPUs and Intel® Xe GPUs with the Hugging Face and vLLM frameworks. Developers and enterprise customers with existing and accessible Intel hardware can seamlessly run Gemma 4 models starting today. Here are the instructions to get started:
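As one illustrative starting point, a Gemma model can be run through vLLM's offline Python API. This is a sketch, not official setup guidance: it assumes vLLM is installed with its Intel CPU or XPU backend, and the model ID is a placeholder to be replaced with the actual Gemma 4 checkpoint name on Hugging Face:

```python
from vllm import LLM, SamplingParams

# Placeholder model ID; substitute the actual Gemma 4 checkpoint name.
llm = LLM(model="google/<gemma-4-checkpoint>")

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain what Intel AMX accelerates."], params)
print(outputs[0].outputs[0].text)
```

On a Xeon system, vLLM's built-in CPU attention backend and the upstreamed FusedMoE kernels described above are picked up automatically; no Gemma-specific configuration should be needed.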

Looking ahead

Intel will continue to deliver deeper optimizations for Gemma 4 models across leading AI frameworks. Stay tuned for upcoming updates designed to unlock even greater performance.

Key Takeaways
  1. Google’s new Gemma 4 models are supported on Intel hardware from day zero.
  2. Intel’s upstream contributions to PyTorch, vLLM, and Hugging Face optimize Gemma 4 models on Intel® Xeon® CPUs, Intel® Xe GPUs, and Intel® Core™ Ultra processors.
  3. OpenVINO™ provides enhanced model optimization and seamless deployment on Intel NPUs through its LiteRT backend integration.
  4. Developers and enterprise customers can leverage existing hardware, or acquire readily available Intel® Xe GPUs and Intel® Core™ Ultra processors off the shelf, to start building AI applications powered by Gemma 4 models.

Stephanie Maluso
