Setup Instructions
Please make sure to follow Driver Installation to install the Gaudi driver on the system.
It is recommended to use the PyTorch Docker image to run the examples below.
To use the provided Dockerfile for the sample, follow the Docker Installation guide to set up the Habana runtime for Docker images.
The Docker image provides the PyTorch software stack and packages needed to run the samples. However, additional required packages such as DeepSpeed must still be installed to run the samples.
Get the examples from the optimum-habana GitHub repository
To benchmark Llama2 and Llama3 models, obtain optimum-habana from the GitHub repository using the following command.
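For example (the repository URL is the public Hugging Face optimum-habana project; the `examples/text-generation` path is where the benchmark scripts live):

```shell
git clone https://github.com/huggingface/optimum-habana.git
cd optimum-habana/examples/text-generation
```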
Docker Run
After building the Docker image, run the following command to start a Docker instance, which opens in the text-generation folder inside the container.
NOTE: Hugging Face model files can be large, so it is recommended to use an external disk for the Hugging Face hub folder. Export the `HF_HOME` environment variable to point to the external disk, then mount that path into the Docker instance, e.g. `-e HF_HOME=/mnt/huggingface -v /mnt:/mnt`.
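As a sketch, the run command might look like the following. `<your-image-name>` is a placeholder for the image built from the sample Dockerfile, and the `HF_HOME` export and mount follow the note above:

```shell
# Placeholder image name; substitute the tag you built from the sample Dockerfile.
docker run -it --runtime=habana \
    -e HABANA_VISIBLE_DEVICES=all \
    -e HF_HOME=/mnt/huggingface \
    -v /mnt:/mnt \
    --cap-add=sys_nice --net=host --ipc=host \
    <your-image-name>
```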
Install required packages inside docker
First, install optimum-habana:
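If you cloned the repository as described above, one option is to install it from source (installing the released `optimum[habana]` package from PyPI is an alternative):

```shell
# Run from the root of the cloned optimum-habana repository.
pip install .
```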
Second, install the requirements:
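Assuming you are in the `examples/text-generation` folder, which ships a `requirements.txt`:

```shell
# Run from the examples/text-generation folder.
pip install -r requirements.txt
```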
For `run_lm_eval.py`:
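To install the extra requirements for `run_lm_eval.py` (the file name below assumes the example folder's naming convention):

```shell
pip install -r requirements_lm_eval.txt
```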
Then, to use DeepSpeed-inference, install DeepSpeed as follows:
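Habana maintains its own DeepSpeed fork on GitHub; the release branch below is a placeholder that should match your installed Gaudi software version:

```shell
# <release-branch> is a placeholder, e.g. a 1.x.y tag matching your Gaudi software stack.
pip install git+https://github.com/HabanaAI/DeepSpeed.git@<release-branch>
```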
Tensor quantization statistics measurement
This step needs to be completed only once for each model with the corresponding world size values.
The `hqt_output` measurement results generated in this step will be used for the FP8 run.
If you change models for the FP8 run, repeat this step to obtain the corresponding `hqt_output`.
Llama2
Here is an example to measure the tensor quantization statistics on Llama2:
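A sketch of the measurement run, assuming the `maxabs_measure.json` config shipped with the optimum-habana text-generation example and the flag names used by its scripts:

```shell
export model_name=meta-llama/Llama-2-70b-hf
export world_size=8

# QUANT_CONFIG points the quantization toolkit at the measurement config.
QUANT_CONFIG=./quantization_config/maxabs_measure.json \
python ../gaudi_spawn.py --use_deepspeed --world_size ${world_size} run_lm_eval.py \
    --model_name_or_path ${model_name} \
    --use_hpu_graphs \
    --use_kv_cache \
    --bf16 \
    -o measure_results.txt
```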
Export different values to the following environment variables to change parameters for tensor quantization statistics:
| Environment Variable | Values |
|---|---|
| model_name | meta-llama/Llama-2-70b-hf, meta-llama/Llama-2-7b-hf |
| world_size | 1, 2, 8 |
Llama3
Here is an example to measure the tensor quantization statistics on Llama3 with 8 cards:
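A sketch with 8 cards, under the same assumptions as the Llama2 case (`maxabs_measure.json` config and the example scripts' flag names):

```shell
export model_name=meta-llama/Llama-3.1-405B-Instruct
export world_size=8

QUANT_CONFIG=./quantization_config/maxabs_measure.json \
python ../gaudi_spawn.py --use_deepspeed --world_size ${world_size} run_lm_eval.py \
    --model_name_or_path ${model_name} \
    --use_hpu_graphs \
    --use_kv_cache \
    --bf16 \
    -o measure_results.txt
```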
Please note that Llama3-405B requires a minimum of 8 Gaudi3 cards.
Export different values to the following environment variables to change parameters for tensor quantization statistics:
| Environment Variable | Values |
|---|---|
| model_name | meta-llama/Llama-3.1-405B-Instruct, meta-llama/Llama-3.1-70B-Instruct, meta-llama/Llama-3.1-8B-Instruct |
| world_size | 8 |
Quantize and run the fp8 model
Here is an example to quantize the model based on the previous measurements for Llama2 or Llama3 models:
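A sketch of the quantized run, assuming the `maxabs_quant.json` config shipped with the optimum-habana text-generation example; the environment variables correspond to the table below:

```shell
QUANT_CONFIG=./quantization_config/maxabs_quant.json \
python ../gaudi_spawn.py --use_deepspeed --world_size ${world_size} run_generation.py \
    --model_name_or_path ${model_name} \
    --max_input_tokens ${input_len} \
    --max_new_tokens ${output_len} \
    --batch_size ${batch_size} \
    --use_hpu_graphs \
    --use_kv_cache \
    --bf16
```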
Export different values to the following environment variables to change parameters for the FP8 run:
| Environment Variable | Values |
|---|---|
| model_name | meta-llama/Llama-2-70b-hf, meta-llama/Llama-2-7b-hf, meta-llama/Llama-3.1-405B-Instruct, meta-llama/Llama-3.1-70B-Instruct, meta-llama/Llama-3.1-8B-Instruct |
| input_len | e.g. 128, 2048 |
| output_len | e.g. 128, 2048 |
| batch_size | e.g. 350, 1512, 1750 |
| world_size | 1, 2, 8 |
Please note that Llama3-405B requires a minimum of 8 Gaudi3 cards.
Here is an example to run Llama2-70b with input token length = 128, output token length = 128, and batch size = 1750.
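The corresponding environment variable settings:

```shell
export model_name=meta-llama/Llama-2-70b-hf
export input_len=128
export output_len=128
export batch_size=1750
export world_size=8
```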
After setting the environment variables, run the FP8 model using the following command:
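A sketch of the FP8 run, assuming the `maxabs_quant.json` config shipped with the optimum-habana text-generation example and the environment variables exported above:

```shell
QUANT_CONFIG=./quantization_config/maxabs_quant.json \
python ../gaudi_spawn.py --use_deepspeed --world_size ${world_size} run_generation.py \
    --model_name_or_path ${model_name} \
    --max_input_tokens ${input_len} \
    --max_new_tokens ${output_len} \
    --batch_size ${batch_size} \
    --use_hpu_graphs \
    --use_kv_cache \
    --bf16
```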