Overview
This guide demonstrates how developers can now access and use several LLM inference reference models optimized to run on Intel® Gaudi® 2 and Intel® Gaudi® 3 AI accelerators using the Optimum-Habana library. The Optimum-Habana library is the interface between the Transformers and Diffusers libraries and Intel Gaudi AI accelerators (HPUs). It provides a set of tools that make model loading, training, and inference easier on single- and multi-accelerator setups for different downstream tasks. A list of validated models and tasks is available.
Hugging Face* hosts the open source code for the Optimum-Habana library in the Optimum-Habana repository. For more information on models and support, see the Optimum-Habana documentation.
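For orientation, a minimal usage sketch of this pattern is shown below. It is a hedged illustration rather than the fully supported path: the adapt_transformers_to_gaudi() helper swaps the stock Transformers classes for their Gaudi-optimized counterparts, after which the model can be moved to the hpu device as usual. The run_generation.py example used later in this guide wraps the same idea and additionally handles static shapes, HPU graphs, and other settings this sketch omits; the checkpoint name and generation settings here are illustrative only.
# Minimal, illustrative sketch only; for real runs use examples/text-generation/run_generation.py.
import torch
import habana_frameworks.torch.core as htcore  # registers the "hpu" device with PyTorch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi

adapt_transformers_to_gaudi()  # patch Transformers classes with their Gaudi-optimized subclasses

model_name = "EleutherAI/gpt-neox-20b"  # illustrative; any validated causal-LM checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).to("hpu")

inputs = tokenizer("Deep learning is", return_tensors="pt").to("hpu")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))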
This guide focuses on the GPT-NeoX model and demonstrates the following:
- The code changes required to optimize GPT-NeoX to run on Intel Gaudi AI accelerators
- Using the GPT-NeoX model on a single Intel Gaudi AI accelerator
- Using the GPT-NeoX model on multiple devices using the Microsoft DeepSpeed* framework
Developers can enable and optimize other Hugging Face models on Intel Gaudi accelerators by using these steps along with steps from the Optimum-Habana documentation.
Installing Optimum-Habana
Clone Optimum-Habana and install the necessary dependencies into an appropriate environment by running the following commands:
git clone https://github.com/huggingface/optimum-habana.git
cd optimum-habana && pip install .
cd examples/text-generation && pip install -r requirements.txt
pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.19.0
Note These steps were validated using the pytorch-installer-2.5.1 Docker* container image for the Intel Gaudi AI accelerator, which is available in the Intel® Gaudi® Technology Vault.
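As an optional sanity check (not part of the official setup steps), the following short Python snippet confirms that the Habana PyTorch bridge and Optimum-Habana import cleanly and that the HPU device is reachable inside the container:
# Optional sanity check; assumes the Intel Gaudi container above with the Habana PyTorch bridge installed.
import torch
import habana_frameworks.torch.core as htcore  # registers the "hpu" device with PyTorch
from optimum.habana import GaudiConfig

x = torch.ones(2, 2).to("hpu")  # moving a tensor to HPU confirms the device is visible
print(x + x)                    # prints a tensor on device='hpu:0'
print(GaudiConfig())            # default Gaudi configuration object constructs without error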
Optimum-Habana Model Optimizations
The Optimum-Habana library not only enables models to run on Intel Gaudi accelerators but also improves their performance by providing dedicated subclasses that inherit from the original upstream model code. For example, in the case of the GPT-NeoX model, GaudiGPTNeoXForCausalLM inherits from the original GPTNeoXForCausalLM upstream class. This subclass implements the following Intel Gaudi AI accelerator-specific optimizations:
- Pad all input vectors, including the self-attention mask, to the maximum token length before calling the generate function, enforcing static shapes for the generation input. Static input shapes result in better performance on AI accelerators by preventing the unnecessary graph recompilations that dynamically shaped inputs would otherwise trigger.
- Employ a static key-value cache to eliminate the need to recompile the self-attention forward pass for every newly generated token.
The static shapes optimization is implemented for several Optimum-Habana models in the optimum/habana/transformers/generation/utils.py file. However, the static key-value cache optimization applies only to the GPT-NeoX model and is implemented in optimum/habana/transformers/models/gpt_neox/modeling_gpt_neox.py.
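To make these two optimizations concrete, the following is a simplified sketch in plain PyTorch. The tensor sizes and position bookkeeping are illustrative assumptions and do not reproduce the actual Optimum-Habana implementation in the files named above:
# Illustration only: static input shapes and a preallocated (static) key-value cache.
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
tokenizer.pad_token = tokenizer.eos_token

# 1) Static input shapes: pad every prompt (and its attention mask) to a fixed length
#    so the compiled graph always sees the same tensor shapes.
inputs = tokenizer(
    "A new Silicon Valley-based cloud storage start up has come out of stealth mode.",
    padding="max_length",
    max_length=128,  # fixed, illustrative prompt length
    return_tensors="pt",
)

# 2) Static key-value cache: preallocate the cache for the whole sequence and write each new
#    token's key/value into it in place, instead of concatenating tensors at every step
#    (concatenation changes the shapes and forces graph recompilation).
batch, num_heads, max_len, head_dim = 1, 64, 228, 96  # illustrative GPT-NeoX-20B-like sizes
key_cache = torch.zeros(batch, num_heads, max_len, head_dim)
value_cache = torch.zeros(batch, num_heads, max_len, head_dim)

cur_pos = torch.tensor([128])                          # index of the token being generated
new_key = torch.randn(batch, num_heads, 1, head_dim)   # stand-ins for one attention layer's outputs
new_value = torch.randn(batch, num_heads, 1, head_dim)
key_cache.index_copy_(2, cur_pos, new_key)
value_cache.index_copy_(2, cur_pos, new_value)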
The Optimum-Habana library also supports HPU graphs (for both training and inference) and DeepSpeed inference, which are additional methods of optimizing model performance.
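The text-generation example enables HPU graphs through the --use_hpu_graphs flag used in the commands below. Underneath, this relies on the graph capture support in the Habana PyTorch bridge; a hedged sketch of that mechanism, assuming the wrap_in_hpu_graph helper documented for recent releases of the bridge (the exact import path can vary between releases), looks like this:
# Hedged sketch: wrapping a module so its forward pass is captured and replayed as an HPU graph.
# Assumes habana_frameworks.torch exposes hpu.wrap_in_hpu_graph, as documented for recent releases.
import torch
import habana_frameworks.torch as ht
import habana_frameworks.torch.core as htcore  # registers the "hpu" device with PyTorch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.bfloat16).to("hpu")  # small illustrative checkpoint
model = ht.hpu.wrap_in_hpu_graph(model)  # later calls with identical input shapes replay the captured graph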
Run Inference on One Intel Gaudi AI Accelerator
We can now run the model. To get a text-generation output using the 20-billion-parameter version of GPT-NeoX, run the following command.
Note Feel free to modify the prompt. You must include the --use_kv_cache argument, which implements the optimization discussed earlier.
python run_generation.py \
--model_name_or_path EleutherAI/gpt-neox-20b \
--batch_size 1 \
--max_new_tokens 100 \
--use_kv_cache \
--use_hpu_graphs \
--bf16 \
--prompt "A new Silicon Valley-based cloud storage start up has come out of stealth mode. Investors are saying the company will disrupt"
The prompt returns the following output using an Intel Gaudi 2 AI accelerator:
Input/outputs:
input 1: ('A new Silicon Valley-based cloud storage start up has come out of stealth mode. Investors are saying the company will disrupt',)
output 1.1: ('A new Silicon Valley-based cloud storage start up has come out of stealth mode. Investors are saying the company will disrupt the storage industry.\n\nThe company, called Storj, is a cloud storage company that is based in San Francisco, California. The company is a peer-to-peer (P2P) storage network. The company is a peer-to-peer (P2P) storage network.\n\nThe company is a peer-to-peer (P2P) storage network. The company is a peer-to-peer (P2P) storage network.\n\n',)

Stats:
----------------------------------------------------------------------------------
Input tokens
Throughput (including tokenization) = 50.90378113717287 tokens/second
Memory allocated = 38.76 GB
Max memory allocated = 38.79 GB
Total memory available = 94.62 GB
Graph compilation duration = 9.838871515356004 seconds
Running the same command without the static key-value cache optimization enabled gives the following statistics:
Stats:
----------------------------------------------------------------------------------
Input tokens
Throughput (including tokenization) = 37.24143262951518 tokens/second
Memory allocated = 47.86 GB
Max memory allocated = 47.86 GB
Total memory available = 94.62 GB
Graph compilation duration = 16.407687230966985 seconds
The static key-value cache optimization greatly reduces graph compilation duration (roughly 9.8 versus 16.4 seconds), lowers memory usage (38.76 GB versus 47.86 GB), and increases the throughput of the model (roughly 50.9 versus 37.2 tokens per second).
Run Inference on Multiple Devices with Intel Gaudi Accelerators Using DeepSpeed*
To launch the multi-card run, use the same arguments as in the previous section with the gaudi_spawn.py script, which invokes mpirun:
python3 ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \
--model_name_or_path EleutherAI/gpt-neox-20b \
--batch_size 1 \
--max_new_tokens 100 \
--use_kv_cache \
--use_hpu_graphs \
--bf16 \
--prompt "A new Silicon Valley-based cloud storage start up has come out of stealth mode. Investors are saying the company will disrupt"
The prompt returns the following output using an Intel Gaudi 2 platform that has eight cards available:
Input/outputs:
input 1: ('A new Silicon Valley-based cloud storage start up has come out of stealth mode. Investors are saying the company will disrupt',)
output 1.1: ('A new Silicon Valley-based cloud storage start up has come out of stealth mode. Investors are saying the company will disrupt the storage industry.\n\nThe company, called Storj, is a peer-to-peer (P2P) file storage system. It is a peer-to-peer (P2P) file storage system.\n\nThe company is a peer-to-peer (P2P) file storage system.\n\nThe company is a peer-to-peer (P2P) file storage system.\n\nThe company is a peer-to-peer (P',)

Stats:
-----------------------------------------------------------------------------------
Input tokens
Throughput (including tokenization) = 114.64426620380212 tokens/second
Memory allocated = 6.56 GB
Max memory allocated = 6.57 GB
Total memory available = 94.62 GB
Graph compilation duration = 5.10424523614347 seconds
-----------------------------------------------------------------------------------
Next Steps
Hugging Face, Habana Labs, and Intel continue to enable reference models and publish them in optimum-habana and Model-References, where anyone can freely access them. For helpful articles and forum posts, see the developer site.