Intel AI Solutions Accelerate Microsoft* Phi-4 Small Language Models


Intel Corporation

In our ongoing mission to bring AI everywhere, Intel is committed to investing in the AI ecosystem to ensure its platforms are ready for the industry’s newest AI models and software. Today, we are excited to announce support for Phi-4, the latest family of small, open-source AI models developed by Microsoft*, across our AI solutions spanning AI PCs, edge devices, and data center platforms.

Today’s release includes new additions to the Phi family. Phi-4-mini is a lightweight, open, dense decoder-only transformer model with 3.8B parameters. Compared with Phi-3.5-mini, its architecture introduces a 200K-token vocabulary, grouped-query attention, and a shared input and output embedding. Phi-4-multimodal is a lightweight, open multimodal model with 5.6B parameters. It accepts text, audio, and images as input and returns text as output. It is built from the Phi-4-mini language model combined with vision and audio encoders and adapters.
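
To see these architectural choices concretely, you can inspect the model’s published configuration with Hugging Face Transformers. The sketch below is illustrative only: the model ID and configuration field names follow common Transformers conventions and are assumptions, not details from this announcement.

```python
# Minimal sketch: inspect Phi-4-mini architecture fields from its published config.
# Model ID and field names are assumptions based on common transformers conventions.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("microsoft/Phi-4-mini-instruct")
print(cfg.vocab_size)             # ~200K-token vocabulary
print(cfg.num_attention_heads,    # grouped-query attention: fewer key/value
      cfg.num_key_value_heads)    # heads than query heads
print(cfg.tie_word_embeddings)    # shared input and output embedding
```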

PCs and edge devices are at the forefront of delivering AI experiences that assist users while remaining personalized and private. Intel enables AI language models to run locally on AI PCs powered by Intel® Core™ Ultra processors, which feature a neural processing unit (NPU) and a built-in Intel® Arc™ GPU, or on Intel Arc discrete GPUs with Intel® Xᵉ Matrix Extensions (Intel® XMX) acceleration. The compact size of the Phi-4 models makes them well suited to on-device inference and allows for lightweight model fine-tuning or customization on AI PCs.

We benchmarked the inference performance of the Phi-4-mini variant on an AI PC powered by an Intel® Core™ Ultra 9 288V processor with a built-in Intel Arc 140V GPU, and on an Intel® Arc™ B580 discrete GPU, using the OpenVINO™ toolkit for performance optimization. OpenVINO helps accelerate AI inference, delivering improved throughput without sacrificing accuracy.

Figure 1. Throughput of Phi-4-mini-instruct on Intel® Core™ Ultra 9 288V processor with built-in Intel Arc 140V GPU
Figure 2. Throughput of Phi-4-mini-instruct on Intel® Arc™ B580 discrete GPU
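
For readers who want to try a similar setup, the sketch below shows the typical openvino-genai flow: export the model to OpenVINO IR once, then run generation on an Intel GPU. The export command, model ID, and INT4 weight format are illustrative assumptions, not the exact benchmark configuration reported above.

```python
# Minimal sketch of local Phi-4-mini inference with OpenVINO GenAI on an Intel GPU.
# Assumes the model was first exported to OpenVINO IR, for example with:
#   optimum-cli export openvino -m microsoft/Phi-4-mini-instruct \
#       --weight-format int4 phi-4-mini-ov
# (The model ID and INT4 weight format are assumptions, not the benchmark settings.)
import openvino_genai

pipe = openvino_genai.LLMPipeline("phi-4-mini-ov", "GPU")  # "GPU" = Arc iGPU or dGPU
config = openvino_genai.GenerationConfig()
config.max_new_tokens = 256
print(pipe.generate("Summarize what an NPU does in two sentences.", config))
```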


While the Phi family has long been known for text generation, Phi-4-multimodal adds exciting new capabilities for handling audio, images, and text in a single multimodal model. Here is a demo showing the smooth experience of Phi-4-multimodal running on an Intel® Arc™ B580 discrete GPU.

Demo 1. Phi-4-multimodal running on Intel® Arc™ B580 discrete GPU
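
As a companion to the demo, here is a minimal vision-language sketch using Hugging Face Transformers. The chat markup and sample image mirror the prompts listed in the benchmark configurations at the end of this post; the model ID and generation settings are illustrative assumptions.

```python
# Minimal sketch of a Phi-4-multimodal vision-language query with transformers.
# Prompt markup and image URL follow the benchmark configuration listed below;
# the model ID and generation settings are illustrative assumptions.
import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-4-multimodal-instruct"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True)

image = Image.open(requests.get(
    "https://www.ilankelman.org/stopsigns/australia.jpg", stream=True).raw)
prompt = "<|user|><|image_1|>What is shown in this image?<|end|><|assistant|>"

inputs = processor(text=prompt, images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=128)
new_tokens = out[0, inputs["input_ids"].shape[1]:]  # strip the echoed prompt
print(processor.decode(new_tokens, skip_special_tokens=True))
```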


Intel’s data center AI products, such as Intel® Gaudi® AI accelerators and Intel® Xeon® processors, are also fully equipped to support the new Phi-4 models. The recently launched Intel® Xeon® 6 processor, acclaimed as the world’s best CPU for AI, offers exceptional performance in traditional machine learning, smaller generative AI models, and GPU-accelerated workloads when used as a host CPU. Our preliminary benchmarking of Phi-4-mini and Phi-4-multimodal using PyTorch and Intel® Extension for PyTorch* on Intel Xeon 6 with MRDIMMs demonstrates that widely available Xeon processors are a performant and viable option for deploying SLMs like Phi-4 for inference.
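
The sketch below illustrates that deployment path: load the model in BF16 and let Intel Extension for PyTorch apply its LLM-specific CPU optimizations. The model ID, prompt, and generation settings are illustrative assumptions, not the benchmark harness described later in this post.

```python
# Minimal sketch: BF16 Phi-4-mini inference on Xeon with Intel Extension for PyTorch.
# Model ID, prompt, and settings are illustrative; this is not the benchmark harness.
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-4-mini-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model = ipex.llm.optimize(model, dtype=torch.bfloat16)  # LLM-specific CPU optimizations

inputs = tokenizer("Explain MRDIMM memory in one paragraph.", return_tensors="pt")
with torch.no_grad(), torch.autocast("cpu", dtype=torch.bfloat16):
    out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```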

Benchmarking Phi-4-mini on a two-socket Intel® Xeon® 6 processor with Performance-cores (P-cores) and MRDIMMs, with 1K input/1K output tokens at BF16 precision, achieved 1955 tokens/s throughput. For Phi-4-multimodal, using the typical input formats shown in the model card (1 prompt + 1 image, 1 prompt + 1 audio clip, 1 image + 1 audio clip) with output length set to 1K tokens, the same system generated ~120 tokens/s while meeting a 50 ms next-token latency SLA. With one instance per NUMA node at batch size 1, ~120 tokens/s across six instances works out to ~20 tokens/s per stream, consistent with the 50 ms per-token budget. More details can be found in the table below.

Table 1. Throughput of Phi-4-multimodal on a two-socket Intel® Xeon® 6 processor with Performance-cores and MRDIMMs


Open Platform for Enterprise AI (OPEA) has been updated to support both models, simplifying end-user implementation on Gaudi and Xeon with complete end-to-end solutions built from a host of microservices.

Intel has a long-standing relationship with Microsoft in AI software for both data center and client. In summary, Intel AI PCs and discrete graphics, Xeon processors, and Gaudi AI accelerators support Phi-4 models today, and Intel will continue to optimize performance and the AI experience across its product portfolio.

Learn More

Product and Performance Information

Intel® Core™ Ultra

Intel® Core™ Ultra: Measurement on an ASUS Zenbook S 14 with Intel Core Ultra 9 288V platform using 32GB 8533 MT/s total memory, Intel graphics driver 101.6559, openvino-genai 2025.1.0.dev20250225, Windows 11 Pro 24H2 version 26100.2894, Balanced power policy, Best Performance power mode, and core isolation disabled. Test by Intel on Feb 25, 2025. Repositories: phi-4-mini, phi-4-multimodal.

Intel® Arc™ B-Series 

Intel® Arc™ B-Series Graphics: Measurement on Intel Arc B580 12GB graphics using Intel Core i9-14900K, ASUS ROG MAXIMUS Z790 HERO motherboard, 32GB (2x 16GB) DDR5 5600 MT/s memory, and Samsung 990 Pro 2TB NVMe SSD. Software configuration includes Intel graphics driver 101.6559, openvino-genai 2025.1.0.dev20250225, Windows 11 Pro 24H2 version 26100.2894, Performance power policy, and core isolation disabled. Test by Intel on Feb 25, 2025. Repositories: phi-4-mini, phi-4-multimodal.

Intel® Gaudi® 3 AI Accelerator

Intel® Gaudi® 3 AI Accelerator: Tested with 1 Intel Gaudi 3 AI accelerator and a 2-socket Intel® Xeon® Platinum 8480+ CPU @ 2.00GHz with 1TB system memory. OS: Ubuntu 22.04. Intel Gaudi software suite 1.19.2. Docker image: vault.habana.ai/gaudi-docker/1.19.2/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest. Optimum Habana transformers_future branch (transformers v4.48). Test by Intel on Feb 25, 2025. Repository: Dockerfile

Intel® Xeon® 6 Processor

Intel® Xeon® 6 Processor: Measurement on Intel Xeon 6 processor (formerly code-named Granite Rapids) using: 2x Intel® Xeon® 6 6980P with P-cores, HT on, Turbo on, NUMA 6, integrated accelerators available [used]: DLB [8], DSA [8], IAA [8], QAT [on CPU, 8], total memory 1536GB (24x 64GB MRDIMM 8800 MT/s), BIOS BHSDCRB1.IPC.3544.D02.2410010029, microcode 0x11000314, 1x Ethernet Controller I210 Gigabit Network Connection, 1x Micron_7450_MTFDKBG960TFR 894.3G, CentOS Stream 9, kernel 6.6.0-gnr.bkc.6.6.16.8.23.x86_64. For Phi-4-mini: multiple instances (2 instances per NUMA node, 12 instances in total per system), BS=32, greedy search, 1K input, 1K output, BF16 precision. For Phi-4-multimodal-instruct: 1 instance per NUMA node, 6 instances in total per system, BS=1, input format as specified in the main content of this blog, 1K output, greedy search, BF16 precision. Input content for vision-language task: https://www.ilankelman.org/stopsigns/australia.jpg with prompt "<|user|><|image_1|>What is shown in this image?<|end|><|assistant|>". Input content for speech-language task: https://voiceage.com/wbsamples/in_mono/Trailer.wav with prompt "<|user|><|audio_1|>Transcribe the audio to text, and then translate the audio to French. Use <sep> as a separator between the original transcript and the translation.<|end|><|assistant|>". Input content for vision-speech task: https://www.ilankelman.org/stopsigns/australia.jpg and https://voiceage.com/wbsamples/in_mono/Trailer.wav with prompt "<|user|><|image_1|><|audio_1|><|end|><|assistant|>". Test by Intel on Feb 25, 2025. Repository here.

AI disclaimer:
AI features may require software purchase, subscription or enablement by a software or platform provider, or may have specific configuration or compatibility requirements. 
Details at www.intel.com/AIPC