Fine-tune the Llama 2 70B model using only eight Intel® Gaudi® 2 accelerators.
Fine-tuning large language models (LLMs) with billions of parameters, such as Llama 2 70B, is a challenging task that demands a large amount of memory and compute. At bfloat16 precision, a single model parameter requires two bytes of memory, so simply loading the 70 billion parameters of Llama 2 70B requires 140 GB of device memory. Additional memory is needed to accommodate the optimizer states and gradients of the model during training.
This article shows how to fine-tune Llama 2 70B with DeepSpeed ZeRO-3 and LoRA* techniques on eight Intel® Gaudi® 2 AI accelerators. A Jupyter notebook version of this tutorial is available in the Habana AI Gaudi-tutorial GitHub* repository; consult that repository's top-level README.md for instructions on setting up a JupyterLab environment that supports the tutorial.
DeepSpeed ZeRO-3 Optimization
DeepSpeed is a deep learning optimization library that enables the scaling of models for training and inference. The Zero Redundancy Optimizer (ZeRO) is a memory optimization technique within DeepSpeed that comprises three optimization stages. Stage 3 of ZeRO (ZeRO-3) optimization reduces memory consumption in distributed training by partitioning optimizer states, gradients, and model parameters across the worker processes.
Figure 1 shows that each worker holds only a subset of the model parameters. Before a forward or backward pass, the parameters that the pass needs are gathered from the other workers using collective communication operations; once the pass has run, parameters that are no longer needed are released until the next forward or backward pass. Moreover, in the parameter-update phase, each worker updates only the optimizer states corresponding to the parameters assigned to it.
Figure 1. DeepSpeed ZeRO-3 optimization comparison
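As a rough conceptual illustration of this gather-compute-release pattern (not DeepSpeed's actual implementation; all names, shapes, and the shard layout are made up for the example), consider the following sketch for a single parameter tensor:

```python
# Conceptual sketch of ZeRO-3-style parameter partitioning (not DeepSpeed internals).
import numpy as np

world_size = 8
full_param = np.random.randn(1024)               # one flattened parameter tensor
shards = np.array_split(full_param, world_size)  # each worker persistently stores only one shard

def forward_with_gathered_param(rank, activations):
    # 1. All-gather: rebuild the full parameter from every worker's shard just before use.
    weight = np.concatenate(shards).reshape(32, 32)
    # 2. Run the computation with the temporarily materialized full weight.
    out = activations @ weight
    # 3. Release the full copy; only this worker's own shard (shards[rank]) stays resident.
    del weight
    return out

print(forward_with_gathered_param(rank=0, activations=np.random.randn(4, 32)).shape)  # (4, 32)
```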
Table 1 shows that although DeepSpeed ZeRO-3 optimization can significantly reduce memory usage, full-parameter fine-tuning of Llama 2 70B still does not fit in the device memory of eight Intel® Gaudi® 2 cards.
Model Description and Optimizer States | Memory Requirements | Calculated By |
---|---|---|
Model parameters in bfloat16 precision | 140 GB | 2 bytes × 70 B parameters |
Gradients in bfloat16 precision | 140 GB | 2 bytes × 70 B parameters |
Optimizer states (parameters, momentum of gradients, and variance of gradients) of the Adam optimizer in FP32 | 840 GB | 3 × 4 bytes × 70 B parameters |
Total for Llama 2 70B (70 billion parameters) | Approximately 1.1 TB (1,120 GB) | 140 GB + 140 GB + 840 GB, i.e., about 140 GB per Intel® Gaudi® 2 card on an eight-card HLS-2 server |
Table 1. Memory requirements for full-parameter fine-tuning of Llama 2 70B
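The figures in Table 1 follow from simple arithmetic; the short sketch below reproduces them (using decimal gigabytes; the variable names are illustrative):

```python
# Back-of-the-envelope memory estimate for full-parameter fine-tuning of a 70B model,
# matching the numbers in Table 1 (illustrative only).
N_PARAMS = 70e9                       # Llama 2 70B parameter count
GAUDI2_CARDS = 8                      # cards in one HLS-2 server

weights_bf16 = 2 * N_PARAMS           # 2 bytes per bfloat16 parameter
grads_bf16 = 2 * N_PARAMS             # 2 bytes per bfloat16 gradient
adam_states_fp32 = 3 * 4 * N_PARAMS   # FP32 master weights + momentum + variance

total_bytes = weights_bf16 + grads_bf16 + adam_states_fp32
print(f"Total: {total_bytes / 1e9:.0f} GB (~{total_bytes / 1e12:.1f} TB), "
      f"or {total_bytes / GAUDI2_CARDS / 1e9:.0f} GB per card across {GAUDI2_CARDS} cards")
# Total: 1120 GB (~1.1 TB), or 140 GB per card across 8 cards
```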
This limitation is overcome by introducing a Parameter-Efficient Fine-Tuning (PEFT) method that fine-tunes only a subset of the parameters, reducing resource use.
Parameter-Efficient Fine-Tuning with LoRA*
PEFT is a cost-effective alternative to the resource-intensive full fine-tuning of large language models. It adapts the pretrained model to a specific downstream task by fine-tuning only a small number of model parameters instead of the entire model. LoRA is one of the most widely used PEFT techniques.
LoRA dramatically reduces the number of trainable parameters by freezing the pretrained model weights and expressing the weight update with low-rank matrices. This works because the fine-tuned weight can be written as the sum of the pretrained weight (W0) and the accumulated gradient update (ΔW), and ΔW can be decomposed into the product of two low-rank matrices, B and A.
W′ = W0 + ΔW = W0 + BA

where:
- W′: weight matrix after fine-tuning, W′ ∈ ℝ^(d×k)
- W0: pretrained weight matrix, W0 ∈ ℝ^(d×k)
- ΔW: accumulated gradient update of W0 during fine-tuning, ΔW ∈ ℝ^(d×k)
- B, A: trainable low-rank matrices, B ∈ ℝ^(d×r), A ∈ ℝ^(r×k), with r ≪ min(d, k)
Figure 2 shows that, in the forward pass, the input features are multiplied by both the pretrained weight (W0) and the accumulated gradient update (ΔW = BA), and the two outputs are summed to produce the result. During the backward pass, A and B receive gradient updates while the pretrained full-rank weights remain frozen.
Figure 2. PEFT with LoRA
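To make the mechanics concrete, here is a minimal PyTorch sketch of a LoRA-augmented linear layer. It follows the equation above but is not the exact implementation used by the PEFT library; the initialization, dimensions, and hyperparameter values are assumptions chosen for illustration:

```python
# Minimal LoRA-style linear layer (illustrative sketch only).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in, d_out, r=4, lora_alpha=16, dropout=0.05):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in), requires_grad=False)  # frozen W0
        self.lora_A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # trainable, r x d_in
        self.lora_B = nn.Parameter(torch.zeros(d_out, r))        # trainable, d_out x r
        self.scaling = lora_alpha / r
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Output of the frozen pretrained weight plus the low-rank update BA.
        base = x @ self.weight.T
        update = self.dropout(x) @ self.lora_A.T @ self.lora_B.T
        return base + self.scaling * update

layer = LoRALinear(d_in=8192, d_out=8192)  # e.g., one attention projection of a large model
y = layer(torch.randn(2, 16, 8192))
print(y.shape)  # torch.Size([2, 16, 8192])
```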
Fine-Tune Llama 2 70B
The Intel® Gaudi® software 1.13.0 release enabled Llama 2 70B fine-tuning on eight Intel® Gaudi® 2 cards with DeepSpeed ZeRO-3 optimization and LoRA. To improve training performance, support was added for running the softmax in the attention layer in bfloat16 precision without compromising the accuracy of the outputs. Furthermore, memory consumption with DeepSpeed ZeRO-3 was optimized by constraining the internal graph size and adding synchronization points. These optimizations are enabled through the PT_HPU_MAX_COMPOUND_OP_SIZE and DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED environment variables, which are set on the command line when launching the training run (see the fine-tuning command below).
To apply DeepSpeed ZeRO-3 optimization when fine-tuning on Intel® Gaudi® 2 accelerators, use these settings and configurations:
- Set the stage to 3.
- Configure overlap_comm and contiguous_gradients to false within a dictionary under the zero_optimization entry.
These DeepSpeed settings are provided in a JSON file. For this example, the configuration file (llama2_ds_zero3_config.json) is already included in the Optimum-Habana GitHub repository and is passed on the training command line.
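For orientation, the sketch below generates a minimal ZeRO-3 configuration with these settings. The llama2_ds_zero3_config.json that ships with Optimum-Habana is the authoritative version; any fields here beyond the ZeRO settings are assumptions:

```python
# Minimal sketch of a ZeRO-3 DeepSpeed config with the settings described above.
# The repository's llama2_ds_zero3_config.json is authoritative; the non-ZeRO fields
# below are illustrative assumptions.
import json

ds_config = {
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 3,              # ZeRO-3: partition optimizer states, gradients, and parameters
        "overlap_comm": False,   # settings used for Intel Gaudi 2 in this example
        "contiguous_gradients": False,
    },
}

with open("llama2_ds_zero3_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```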
For LoRA, trainable low-rank matrices were injected into the following attention projection modules:
- q_proj
- k_proj
- v_proj
- o_proj

The following LoRA configuration was used:
- LoRA rank of 4
- LoRA α of 16
- Dropout probability of 0.05
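The run_lora_clm.py script builds this adapter internally from the command-line flags shown later. Expressed directly with the Hugging Face PEFT library, an approximately equivalent configuration would look like the following sketch (loading the full model here is only for completeness; in the example it is handled by the script together with DeepSpeed ZeRO-3):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative only: the actual example loads the model inside run_lora_clm.py with ZeRO-3.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-70b-hf", torch_dtype="auto")

lora_config = LoraConfig(
    r=4,                  # LoRA rank
    lora_alpha=16,        # LoRA alpha (scaling)
    lora_dropout=0.05,    # dropout probability on the LoRA path
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the injected low-rank matrices are trainable
```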
This example fine-tuned Llama 2 70B on the Alpaca dataset for two epochs, which is enough for it to converge, using a local batch size of 10 and a maximum sequence length of 2048. The batch size of 10 was selected for accuracy rather than to maximize memory usage: a larger batch size also fits in device memory, but on the relatively small Alpaca dataset it yields fewer weight updates per epoch, making convergence harder to reach.
The Llama 2 70B fine-tuning example is available in the Optimum Habana repository on GitHub. Optimum-Habana is an interface between the Hugging Face* Transformers library and the Intel® Gaudi® AI accelerator.
To create an environment capable of running the Llama 2 70B fine-tuning example, consult the Gaudi setup instructions in the top-level README.md of the optimum-habana repository.
The basic steps are:
1) Pull and run the latest Intel® Gaudi® Docker image from the Habana Vault.
2) Enable access to the Llama 2 model.
3) Install Optimum-Habana, DeepSpeed, and the other requirements inside the Docker container.
4) Run the model.
How to Start the Intel® Gaudi® Docker Image
Consult the Docker Installation documentation for information on how to install and configure the Habana container runtime. Once the container runtime is properly configured, the Intel® Gaudi® Docker image can be started with the following command:
```
docker run -it --runtime=habana \
  -e HABANA_VISIBLE_DEVICES=all \
  -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
  -v /opt/datasets:/datasets \
  --cap-add=sys_nice --net=host --ipc=host \
  vault.habana.ai/gaudi-docker/1.20.0/ubuntu24.04/habanalabs/pytorch-installer-2.6.0:latest
```
How to Access and Use the Llama 2 Model
Use of the pretrained model is subject to compliance with third-party licenses, including the Llama 2 Community License Agreement.
Using gated models like Llama-2-70b-hf requires the following:
- A Hugging Face account.
- Agreeing to the terms of use of the model in its model card on the HF Hub.
- Creating a read access token in the Hugging Face account settings.
Before launching the training scripts, users should log in to their Hugging Face account using the command line interface:
huggingface-cli login
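Alternatively, when working from a notebook (such as the Gaudi tutorial mentioned earlier), the login and a quick access check can be done from Python with the huggingface_hub library. This optional sketch is not part of the official example:

```python
from huggingface_hub import login, model_info

# Prompts for (or accepts) the read access token created on the Hugging Face Hub.
login()

# Raises an error if the account has not been granted access to the gated repository.
model_info("meta-llama/Llama-2-70b-hf")
print("Access to Llama-2-70b-hf confirmed")
```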
How to Install DeepSpeed, Optimum-Habana, and Other Requirements
Run the following commands to install the required packages, clone the fine-tuning example, and log in so the Llama 2 model can be downloaded:
```
pip install peft
pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.20.0
pip install optimum-habana==1.15.0
git clone https://github.com/huggingface/optimum-habana.git
cd optimum-habana/
git checkout v1.15.0
cd examples/language-modeling
pip install -r requirements.txt
huggingface-cli login --token <your_read_token>
```
How to Download the Model and Fine-Tune the Example
To run the fine-tuning example using eight Intel® Gaudi® 2 accelerators, go to the optimum-habana/examples/language-modeling directory, and run the command:
```
PT_HPU_MAX_COMPOUND_OP_SIZE=10 DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 \
python3 ../gaudi_spawn.py --use_deepspeed --world_size 8 run_lora_clm.py \
  --model_name_or_path meta-llama/Llama-2-70b-hf \
  --deepspeed llama2_ds_zero3_config.json \
  --dataset_name tatsu-lab/alpaca \
  --bf16 True \
  --output_dir ./lora_out \
  --num_train_epochs 2 \
  --max_seq_len 2048 \
  --per_device_train_batch_size 10 \
  --per_device_eval_batch_size 10 \
  --gradient_checkpointing \
  --evaluation_strategy epoch \
  --eval_delay 2 \
  --save_strategy no \
  --learning_rate 0.0018 \
  --warmup_ratio 0.03 \
  --lr_scheduler_type "cosine" \
  --logging_steps 1 \
  --dataset_concatenation \
  --attn_softmax_bf16 True \
  --do_train \
  --do_eval \
  --use_habana \
  --use_lazy_mode \
  --pipelining_fwd_bwd \
  --throughput_warmup_steps 3 \
  --lora_rank 4 \
  --lora_target_modules "q_proj" "v_proj" "k_proj" "o_proj" \
  --validation_split_percentage 4
```
With these settings, fine-tuning Llama 2 70B for the two epochs needed to converge takes approximately 44 minutes on eight Intel® Gaudi® 2 cards.
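If the script saves the LoRA adapter weights to the --output_dir (this depends on the script version and the --save_strategy setting used), the adapter could later be loaded on top of the base model roughly as follows. This is a hypothetical usage sketch, not part of the official example:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the frozen base model and attach the fine-tuned LoRA adapter (path assumed).
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-70b-hf", torch_dtype="auto")
model = PeftModel.from_pretrained(base, "./lora_out")

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")
inputs = tokenizer("Below is an instruction that describes a task.", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```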
Summary
This example showed how to enable Llama 2 70B fine-tuning on eight Intel® Gaudi® 2 AI accelerators by applying DeepSpeed ZeRO-3 optimization and the LoRA technique. While the example in this article primarily focuses on Llama 2 70B, these methodologies are widely applicable to other large language models.
Additional Resources
Memory-Efficient Training on Intel® Gaudi® with DeepSpeed
Fine-Tuning GPT-2* with Hugging Face and Intel® Gaudi®
References
Rajbhandari et al., “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models”, arXiv:1910.02054
Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models”, arXiv:2106.09685