Optimize Large Language Model Inference on Intel® Gaudi® AI Accelerators with Hugging Face* Optimum-Habana

ID 837290
Updated 3/24/2025
Version Original
Public

Optimize with Intel® Gaudi® AI Accelerators

  • Create new deep learning models or migrate existing code in minutes.

  • Deliver generative AI performance with simplified development and increased productivity.

Overview

This guide demonstrates how developers can access and use several LLM inference reference models optimized to run on Intel® Gaudi® 2 and Intel® Gaudi® 3 AI accelerators using the Optimum-Habana library. The Optimum-Habana library is the interface between the Hugging Face Transformers and Diffusers libraries and Intel Gaudi AI accelerators (HPUs). It provides a set of tools that simplify model loading, training, and inference in single- and multi-accelerator settings for different downstream tasks. A list of validated models and tasks is available.

Hugging Face* hosts the open source code for the Optimum-Habana library in the Optimum-Habana repository. For more information on models and support, see the Optimum-Habana documentation.
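As a quick illustration of that interface, the following minimal Python sketch (an assumption-based example, not one of the repository's official scripts) patches the Gaudi-optimized model classes into Transformers and runs a short generation on a single HPU. The example scripts under examples/text-generation handle additional Gaudi-specific generation arguments that this sketch omits.

# Minimal sketch, assuming an Intel Gaudi environment with optimum-habana installed.
# adapt_transformers_to_gaudi() swaps in the Gaudi-optimized subclasses
# (for example, GaudiGPTNeoXForCausalLM) before the model is loaded.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi

adapt_transformers_to_gaudi()

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neox-20b", torch_dtype=torch.bfloat16
).eval().to("hpu")  # "hpu" is the Intel Gaudi device exposed by the Habana PyTorch bridge

inputs = tokenizer("Once upon a time", return_tensors="pt").to("hpu")
outputs = model.generate(**inputs, max_new_tokens=50, use_cache=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))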

This guide focuses on the GPT-NeoX model and demonstrates the following:

  • The code changes required to optimize GPT-NeoX to run on Intel Gaudi AI accelerators
  • Using the GPT-NeoX model on a single Intel Gaudi AI accelerator
  • Using the GPT-NeoX model on multiple devices using the Microsoft DeepSpeed* framework

Developers can enable and optimize other Hugging Face models on Intel Gaudi accelerators by using these steps along with steps from the Optimum-Habana documentation.

Installing Optimum-Habana

Clone Optimum-Habana and install the necessary dependencies into an appropriate environment by running the following commands:

git clone https://github.com/huggingface/optimum-habana.git
cd optimum-habana && pip install .
cd examples/text-generation && pip install -r requirements.txt
pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.19.0

Note These steps were validated using the pytorch-installer-2.5.1 Docker* container image for the Intel Gaudi AI accelerator, which is available in the Intel® Gaudi® Technology Vault.

Optimum-Habana Model Optimizations

The Optimum-Habana library not only enables models to run on Intel Gaudi accelerators but also improves their performance by including a dedicated subclass that inherits from the original upstream model code. For example, in the case of the GPT-NeoX model, GaudiGPTNeoXForCausalLM inherits from the original GPTNeoXForCausalLM upstream model code. The implementation of this subclass includes the following Intel Gaudi AI accelerator-specific optimizations:

  • Pad all input vectors (for example, the self-attention mask) to the maximum token length before calling the generate function, which enforces static shapes for the generation inputs. Static input shapes perform better on AI accelerators because they prevent the unnecessary graph recompilations that dynamic-shaped inputs would otherwise trigger.
  • Employ a static key-value cache to eliminate the recompilations of self-attention forward passes that would otherwise be required when generating new tokens.

The static shapes optimization is implemented for several Optimum-Habana models in the optimum/habana/transformers/generation/utils.py file. However, the static key-value cache optimization applies only to GPT-NeoX and is implemented in optimum/habana/transformers/models/gpt_neox/modeling_gpt_neox.py.
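To make the static-shapes idea concrete, the sketch below (an illustrative simplification, not the code from utils.py) right-pads a prompt to a fixed length and tracks the position where the next generated token will be written, so every decoding step sees tensors of the same shape:

# Illustrative sketch only; the real implementation lives in
# optimum/habana/transformers/generation/utils.py and differs in detail.
import torch
import torch.nn.functional as F

def pad_to_static_shape(input_ids, attention_mask, max_length, pad_token_id):
    """Right-pad the prompt to max_length so generation always runs on fixed shapes."""
    pad_len = max_length - input_ids.shape[-1]
    token_idx = input_ids.shape[-1]  # index where the next generated token is written
    input_ids = F.pad(input_ids, (0, pad_len), value=pad_token_id)
    attention_mask = F.pad(attention_mask, (0, pad_len), value=0)
    return input_ids, attention_mask, token_idx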

The Optimum-Habana library also supports HPU graphs (for both training and inference) and DeepSpeed inference, which are additional methods of optimizing model performance.
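For reference, the snippet below sketches what enabling HPU graphs looks like in code, continuing from the earlier loading sketch. It assumes the Habana PyTorch bridge (habana_frameworks) is available in the container image and roughly corresponds to what the --use_hpu_graphs flag turns on in the example script:

# Sketch, assuming habana_frameworks is installed and `model` was loaded as in the
# earlier sketch. Wrapping the model records the device execution graph once and
# replays it for later calls with the same input shapes, avoiding per-step launch overhead.
import habana_frameworks.torch.hpu.graphs as htgraphs

model = htgraphs.wrap_in_hpu_graph(model)  # roughly what --use_hpu_graphs enables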

Run Inference on One Intel Gaudi AI Accelerator

We can now run the model. To get a text generation output using the 20-billion parameter version of GPT-NeoX, run the following command.

Note Feel free to modify the prompt. You must include the --use_kv_cache argument, which enables the static key-value cache optimization discussed earlier.

python run_generation.py \
--model_name_or_path EleutherAI/gpt-neox-20b \
--batch_size 1 \
--max_new_tokens 100 \
--use_kv_cache \
--use_hpu_graphs \
--bf16 \
--prompt "A new Silicon Valley-based cloud storage start up has come out of stealth mode. Investors are saying the company will disrupt"

The prompt returns the following output using an Intel Gaudi 2 AI accelerator:

Input/outputs:
input 1: ('A new Silicon Valley-based cloud storage start up has come out of stealth mode. Investors are saying the company will disrupt',)
output 1.1: ('A new Silicon Valley-based cloud storage start up has come out of stealth mode. Investors are saying the company will disrupt the storage industry.\n\nThe company, called Storj, is a cloud storage company that is based in San Francisco, California. The company is a peer-to-peer (P2P) storage network. The company is a peer-to-peer (P2P) storage network.\n\nThe company is a peer-to-peer (P2P) storage network. The company is a peer-to-peer (P2P) storage network.\n\n',)

Stats:
----------------------------------------------------------------------------------
Input tokens
Throughput (including tokenization) = 50.90378113717287 tokens/second
Memory allocated                    = 38.76 GB
Max memory allocated                = 38.79 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 9.838871515356004 seconds

Running the same command without the static key-value cache optimization enabled gives the following statistics:

Stats:
----------------------------------------------------------------------------------
Input tokens
Throughput (including tokenization) = 37.24143262951518 tokens/second
Memory allocated                    = 47.86 GB
Max memory allocated                = 47.86 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 16.407687230966985 seconds

The static key-value cache optimization greatly reduces graph compilation duration, reduces memory usage, and increases the throughput of the model.

Run Inference on Multiple Devices with Intel Gaudi Accelerators Using DeepSpeed*

To launch the multicard run, use the same arguments as in the previous section with the gaudi_spawn.py script, which invokes mpirun:

python3 ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \
--model_name_or_path EleutherAI/gpt-neox-20b \
--batch_size 1 \
--max_new_tokens 100 \
--use_kv_cache \
--use_hpu_graphs \
--bf16 \
--prompt "A new Silicon Valley-based cloud storage start up has come out of stealth mode. Investors are saying the company will disrupt"

The prompt returns the following output using an Intel Gaudi 2 platform that has eight cards available:

Input/outputs:
input 1: ('A new Silicon Valley-based cloud storage start up has come out of stealth mode. Investors are saying the company will disrupt',)
output 1.1: ('A new Silicon Valley-based cloud storage start up has come out of stealth mode. Investors are saying the company will disrupt the storage industry.\n\nThe company, called Storj, is a peer-to-peer (P2P) file storage system. It is a peer-to-peer (P2P) file storage system.\n\nThe company is a peer-to-peer (P2P) file storage system.\n\nThe company is a peer-to-peer (P2P) file storage system.\n\nThe company is a peer-to-peer (P',)

Stats:
-----------------------------------------------------------------------------------
Input tokens
Throughput (including tokenization) = 114.64426620380212 tokens/second
Memory allocated                    = 6.56 GB
Max memory allocated                = 6.57 GB
Total memory available              = 94.62 GB
Graph compilation duration          = 5.10424523614347 seconds
-----------------------------------------------------------------------------------
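Note the much lower per-card memory footprint: the model weights are sharded across the eight devices. For context, under the hood the launcher ends up sharding the model with DeepSpeed's inference engine on each card. The following is a rough sketch of that step, assuming the HabanaAI DeepSpeed fork installed earlier; the exact arguments used by the example script may differ.

# Rough sketch of tensor-parallel DeepSpeed inference on eight Gaudi cards; the
# world size and dtype mirror the gaudi_spawn.py command above.
import torch
import deepspeed

model = deepspeed.init_inference(
    model,               # the model loaded as in the earlier sketch
    mp_size=8,           # tensor-parallel degree, one shard per card
    dtype=torch.bfloat16,
)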

Next Steps

Hugging Face, Habana Labs, and Intel continue to enable reference models and publish them in optimum-habana and Model-References, where anyone can freely access them. For helpful articles and forum posts, see the developer site.