Fine-tune the Llama 2 70B model using only eight Intel® Gaudi® 2 accelerators with Intel Gaudi software version 1.13.0.
Fine-tuning large language models (LLMs) with billions of parameters, such as Llama 2 70B, is a challenging task that demands large amounts of memory and compute. At bfloat16 precision, a single model parameter requires two bytes of memory. Thus, simply loading the 70 billion parameters of Llama 2 70B requires 140 GB of device memory. Additional memory is required to hold the optimizer states and gradients of the model during training.
In this article, we show how to fine-tune Llama 2 70B with DeepSpeed ZeRO-3 and LoRA* techniques on eight Intel® Gaudi® 2 AI accelerators.
DeepSpeed ZeRO-3 Optimization
DeepSpeed is a deep learning optimization library that enables the scaling of models for training and inference. The Zero Redundancy Optimizer (ZeRO) is a memory optimization technique within DeepSpeed that comprises three optimization stages. Stage 3 of ZeRO (ZeRO-3) optimization reduces memory consumption in distributed training by partitioning optimizer states, gradients, and model parameters across the worker processes.
Figure 1 shows that each worker holds only a subset of the parameters. Just before a forward or backward pass runs, the parameters it needs are made available through collective communication operations. After the pass completes, parameters that are not needed until the next forward or backward pass are released. In the parameter update phase, each worker updates only the optimizer states that correspond to the parameters assigned to it.
Figure 1. DeepSpeed ZeRO-3 optimization comparison
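The partition, gather, release, and update cycle described above can be sketched in a few lines. This is a toy, single-process illustration and not how DeepSpeed is implemented: the "workers" are plain Python lists, `all_gather` is a stand-in for the real collective operation, and the layer size, gradients, and learning rate are hypothetical values chosen for readability.

```python
# Toy, single-process illustration of ZeRO-3 parameter partitioning.
# Real DeepSpeed uses collective communication across devices; here the
# "workers" are plain Python lists so the data flow is easy to follow.

NUM_WORKERS = 8


def partition(values, num_workers):
    """Split a flat list into one contiguous shard per worker."""
    shard_size = -(-len(values) // num_workers)  # ceiling division
    return [values[i * shard_size:(i + 1) * shard_size] for i in range(num_workers)]


def all_gather(shards):
    """Stand-in for the all-gather collective: rebuild the full list."""
    return [v for shard in shards for v in shard]


full_params = [float(i) for i in range(24)]   # hypothetical layer with 24 weights
shards = partition(full_params, NUM_WORKERS)  # each worker keeps only its shard

# Just before a forward or backward pass, the full layer is gathered...
gathered = all_gather(shards)
layer_output = sum(gathered)                  # placeholder for the real computation
# ...and released afterwards, so only the local shard stays resident.
del gathered

# In the update phase, each worker updates only the parameters it owns.
learning_rate = 0.1
grad_shards = partition([1.0] * len(full_params), NUM_WORKERS)  # pretend gradients
for rank in range(NUM_WORKERS):
    shards[rank] = [p - learning_rate * g
                    for p, g in zip(shards[rank], grad_shards[rank])]
```

The point of the sketch is that the full parameter list exists only transiently around the computation, while the persistent state on each worker is one-eighth of the total.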
Table 1 shows that, although DeepSpeed ZeRO-3 optimization significantly reduces memory usage, full-parameter fine-tuning of Llama 2 70B remains out of reach even on eight Intel Gaudi 2 cards: the total requirement of approximately 1.1 TB works out to about 140 GB per card.
Table 1. Calculated memory and resource requirements
| Model Description and Optimizer States | Memory Requirements | Calculated By |
| --- | --- | --- |
| Llama 2 70B (70 billion) parameters | Approximately 1.1 TB | 140 GB per Intel Gaudi 2 card on an HLS-2 server |
| Loading model parameters in bfloat16 precision | 140 GB | 2 bytes x 70 B |
| Gradients in bfloat16 precision | 140 GB | 2 bytes x 70 B |
| Optimizer states (parameters, momentum of gradients, and variance of gradients) of Adam optimizer in FP32 | 840 GB | 3 x 4 bytes x 70 B |
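The figures in Table 1 can be reproduced with a short back-of-the-envelope calculation. The sketch below uses decimal gigabytes, as the table does. The 96 GB per-card HBM capacity is an assumption that is not stated in this article, and activation memory is ignored entirely, so treat the result as a lower bound.

```python
# Back-of-the-envelope memory estimate for full fine-tuning with Adam.
NUM_PARAMS = 70 * 10**9     # Llama 2 70B
BYTES_BF16 = 2
BYTES_FP32 = 4
NUM_CARDS = 8
GB = 10**9                  # decimal gigabytes, matching Table 1
HBM_PER_CARD_GB = 96        # assumption: Gaudi 2 HBM capacity, not from the article

weights_gb = NUM_PARAMS * BYTES_BF16 / GB        # model parameters in bfloat16
grads_gb = NUM_PARAMS * BYTES_BF16 / GB          # gradients in bfloat16
optimizer_gb = 3 * NUM_PARAMS * BYTES_FP32 / GB  # FP32 parameters, momentum, variance

total_gb = weights_gb + grads_gb + optimizer_gb
per_card_gb = total_gb / NUM_CARDS               # ZeRO-3 partitions all three evenly

print(f"weights:   {weights_gb:.0f} GB")
print(f"gradients: {grads_gb:.0f} GB")
print(f"optimizer: {optimizer_gb:.0f} GB")
print(f"total:     {total_gb:.0f} GB ({per_card_gb:.0f} GB per card)")
print(f"fits in assumed HBM: {per_card_gb <= HBM_PER_CARD_GB}")
```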
Thus, we also introduced a Parameter-Efficient Fine-Tuning (PEFT) method to fine-tune only a subset of parameters to reduce resource use.
Parameter-Efficient Fine-Tuning with LoRA*
PEFT is a cost-effective solution to the resource-intensive fine-tuning of large language models. It fine-tunes only a small number of model parameters, adapting the pretrained model for a specific downstream task instead of fine-tuning the entire model. LoRA is one of the most used methods among the various techniques of PEFT.
LoRA dramatically reduces the number of trainable parameters by freezing the pretrained model weights and performing weight updates with low-rank matrices. The idea is that the fine-tuned weight can be written as the sum of the pretrained weight ($W_0$) and the accumulated gradient update ($\Delta W$), and that $\Delta W$ can in turn be decomposed into the product of two low-rank matrices, $B$ and $A$.
$$W' = W_0 + \Delta W = W_0 + BA$$
$W' \in \mathbb{R}^{d \times k}$: weight matrix after fine-tuning
$W_0 \in \mathbb{R}^{d \times k}$: pretrained weight matrix
$\Delta W \in \mathbb{R}^{d \times k}$: accumulated gradient update of $W_0$ during fine-tuning
$B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$: trainable low-rank matrices, where $r \ll \min(d, k)$
Figure 2 shows that in the forward pass, the input features are multiplied by both the pretrained weight (W0) and the low-rank update (ΔW = BA), and the two outputs are added to yield the result. During the backward pass, only A and B receive gradient updates, while the pretrained full-rank weights remain frozen.
Figure 2. PEFT with LoRA
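To make the forward pass and the parameter savings concrete, here is a minimal pure-Python sketch. It checks that adding the two branch outputs gives the same result as multiplying by the merged weight W0 + BA, and it counts trainable parameters. The small shapes, the random values, and the 8192 x 8192 layer size are illustrative assumptions. In actual LoRA training B starts at zero, and random values are used here only so the check is not trivial.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y):
    """Element-wise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def param_counts(d, k, r):
    """Return (full-rank parameters, LoRA parameters) for a d x k weight."""
    return d * k, r * (d + k)


rng = random.Random(0)
d, k, r = 6, 5, 2                   # toy shapes with r < min(d, k)
W0 = rand_matrix(d, k, rng)         # frozen pretrained weight
B = rand_matrix(d, r, rng)          # trainable
A = rand_matrix(r, k, rng)          # trainable
x = rand_matrix(k, 1, rng)          # input features as a column vector

# Two-branch forward pass: W0 x + B (A x)
h_lora = add(matmul(W0, x), matmul(B, matmul(A, x)))
# Equivalent merged form: (W0 + B A) x
h_merged = matmul(add(W0, matmul(B, A)), x)
max_diff = max(abs(a[0] - b[0]) for a, b in zip(h_lora, h_merged))

# Hypothetical 8192 x 8192 projection with rank 4
full, lora = param_counts(8192, 8192, 4)
print(f"max difference between branches and merged weight: {max_diff:.2e}")
print(f"full-rank: {full:,} parameters, LoRA: {lora:,} parameters")
```

For the hypothetical square layer, the low-rank pair has 1,024 times fewer trainable parameters than the full weight matrix.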
Fine-Tune Llama 2 70B
In the Intel Gaudi software 1.13.0 release, we enabled Llama 2 70B fine-tuning on eight Intel Gaudi 2 cards with DeepSpeed ZeRO-3 optimization and LoRA. To improve the model’s training performance, we added support for running the softmax in the attention layer in bfloat16 precision without compromising the accuracy of the outputs. Furthermore, memory consumption with DeepSpeed ZeRO-3 has been optimized by constraining the internal graph size and adding synchronization points. The PT_HPU_MAX_COMPOUND_OP_SIZE and DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED environment variables enable these optimizations and are set as part of the launch command.
To apply DeepSpeed ZeRO-3 optimization when fine-tuning on Intel Gaudi 2 accelerators, apply these settings and configurations:
- Set the stage to 3.
- Set overlap_comm and contiguous_gradients to false in the dictionary under the zero_optimization entry.
These DeepSpeed settings are stored in a .json file. For this example, the file (llama2_ds_zero3_config.json) is already included in the Optimum-Habana GitHub repository and is passed in the runtime command.
For LoRA, we injected the trainable low-rank matrices into the following modules:
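As a rough illustration, the settings listed above correspond to a DeepSpeed configuration fragment like the one generated below. This is a minimal sketch containing only the three settings named in this article. The actual llama2_ds_zero3_config.json in the repository may contain additional entries that are not reproduced here.

```python
import json

# Only the settings named in the article; the real file may hold more.
ds_config = {
    "zero_optimization": {
        "stage": 3,                     # ZeRO-3: partition optimizer states,
                                        # gradients, and parameters
        "overlap_comm": False,
        "contiguous_gradients": False,
    },
}

config_text = json.dumps(ds_config, indent=2)
print(config_text)
```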
- q_proj
- k_proj
- v_proj
- o_proj
For the LoRA configurations, we used:
- LoRA rank of 4
- LoRA α of 16
- Dropout probability of 0.05
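For readers who use the Hugging Face PEFT library, the choices above map onto a `LoraConfig` roughly as follows. This is a sketch under the assumption that PEFT is the LoRA implementation in use. It is not taken from the example script, and the import is guarded so the snippet still runs where PEFT is not installed.

```python
# LoRA hyperparameters from this article, expressed as keyword arguments.
lora_settings = {
    "r": 4,                     # LoRA rank
    "lora_alpha": 16,           # LoRA alpha
    "lora_dropout": 0.05,       # dropout probability on the LoRA branch
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
}

# The low-rank update is scaled by alpha / r before being added to the output.
scaling = lora_settings["lora_alpha"] / lora_settings["r"]

try:
    from peft import LoraConfig  # assumption: PEFT is installed
except ImportError:
    LoraConfig = None

lora_config = LoraConfig(**lora_settings) if LoraConfig is not None else None
print(f"LoRA scaling factor (alpha / r): {scaling}")
```

With a rank of 4 and an α of 16, the low-rank branch is scaled by a factor of 4.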
In this example, we fine-tuned Llama 2 70B on the Alpaca dataset for two epochs, which was enough to converge, using a local batch size of 10 and a maximum sequence length of 2048. The training batch size of 10 was selected for accuracy, not to maximize memory usage. A larger batch size also fits in device memory, but because the Alpaca dataset is small, a larger batch size yields fewer weight updates per epoch, which makes convergence harder to achieve.
The Llama 2 70B fine-tuning example is available in the Optimum-Habana repository on GitHub. Optimum-Habana is an interface between the Hugging Face* Transformers library and the Intel Gaudi AI accelerator.
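The trade-off between batch size and the number of weight updates can be estimated with simple arithmetic. The sketch below assumes a dataset of roughly 52,000 samples (an approximation of the Alpaca dataset size, not a figure from this article), one sample per batch slot, and no gradient accumulation. The real step count depends on how the example script splits and preprocesses the data.

```python
import math


def updates_per_epoch(num_samples, per_device_batch, num_devices, grad_accum_steps=1):
    """Estimate optimizer steps per epoch for data-parallel training."""
    global_batch = per_device_batch * num_devices * grad_accum_steps
    return math.ceil(num_samples / global_batch)


APPROX_SAMPLES = 52_000   # assumption: approximate dataset size
NUM_DEVICES = 8           # eight Intel Gaudi 2 cards

for local_batch in (10, 20, 40):
    steps = updates_per_epoch(APPROX_SAMPLES, local_batch, NUM_DEVICES)
    print(f"local batch {local_batch:>2}: about {steps} weight updates per epoch")
```

Under these assumptions, quadrupling the local batch size from 10 to 40 cuts the number of weight updates per epoch to about a quarter.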
To run the example:
- Pull the Docker image from the Habana Vault.
- Clone the Optimum-Habana repository and install Optimum-Habana from the cloned repository inside the Docker container.
- Install DeepSpeed and the dependent Python* packages required for Llama 2 70B fine-tuning.
How to Access and Use the Llama 2 Model
Use of the pretrained model is subject to compliance with third-party licenses, including the Llama 2 Community License Agreement.
To run gated models like Llama-2-70b-hf, you must:
- Have a Hugging Face account.
- Agree to the terms of use of the model in its model card on the HF Hub.
- Set a read token.
Then log in to your account using the Hugging Face command line interface.
Before launching your script, run: huggingface-cli login
Download the Model and Fine-Tune the Example
To download the Llama 2 model, authenticate your Hugging Face account.
To run the fine-tuning example using eight Intel Gaudi 2 accelerators, go to the optimum-habana/examples/language-modeling directory, and run the command:
It takes approximately 44 minutes to fine-tune Llama 2 70B on eight Intel Gaudi 2 cards for two epochs to converge.
Summary
We showed how to enable Llama 2 70B fine-tuning on eight Intel Gaudi 2 AI accelerators by applying DeepSpeed ZeRO-3 optimization and the LoRA technique. While the example in this article primarily focuses on Llama 2 70B, these methodologies are widely applicable to other large language models.
Additional Resources
- Memory-Efficient Training on Intel® Gaudi® with DeepSpeed
- Fine-Tuning GPT-2* with Hugging Face and Intel Gaudi
- Intel Gaudi Software Version 1.7.0
References
- Rajbhandari et al., “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models”, arXiv:1910.02054
- Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models”, arXiv:2106.09685
- Memory-Efficient Training on the Intel® Gaudi® Platform with DeepSpeed