LocalGPT with Llama 2
This tutorial shows how to use the LocalGPT open source initiative on the Intel® Gaudi® 2 AI accelerator. LocalGPT allows users to load their own documents and run an interactive chat session with the material. To query and summarize your own content, copy .pdf or .txt documents into the SOURCE_DOCUMENTS folder, run the ingest.py script to tokenize the content, and then use the run_localGPT.py script to start the interaction.
Note This example has a corresponding Jupyter notebook tutorial in the Habana AI Gaudi Tutorial repository. Instructions for setting up the JupyterLab environment for the tutorial can be found in the root-level README.md of that repository. Although using the JupyterLab environment is not mandatory, all the steps detailed in this article must be executed in a supported and properly configured Intel® Gaudi® Docker container running on a properly functioning Intel® Gaudi® 2 platform.
This example uses the Llama 2 13B chat model from Meta* (meta-llama/Llama-2-13b-chat-hf) as the reference model to run inference on Intel® Gaudi® 2 AI accelerators.
To optimize this instance of LocalGPT, new functionality was built on top of the existing Hugging Face* based text-generation inference task and pipelines, including:
- The Optimum for Intel® Gaudi® AI accelerators library with the Llama 2 13B model optimized on Intel® Gaudi® 2 AI accelerators.
- LangChain* to import the source document with a custom embedding model using the GaudiHuggingFaceEmbeddings class based on HuggingFaceEmbeddings.
- A custom pipeline class, GaudiTextGenerationPipeline, which optimizes text-generation tasks with padding and indexing for static shapes to improve performance.
- In the last section, the full LocalGPT framework with the Llama 2 70B model as the reference model managing inference on Intel® Gaudi® 2 AI accelerators; DeepSpeed inference is required for the 70B model.
To optimize LocalGPT on Intel® Gaudi® 2 AI accelerators, custom classes were developed for text embeddings and text generation. The application uses the custom class GaudiHuggingFaceEmbeddings to convert textual data to vector embeddings. This class extends the HuggingFaceEmbeddings class from LangChain and uses an Intel® Gaudi® 2 AI accelerators-optimized implementation of SentenceTransformer.
The tokenization process was modified to incorporate static shapes, which provides a significant speedup. Furthermore, the GaudiTextGenerationPipeline class provides a link between the Optimum for Intel® Gaudi® AI accelerators library and LangChain. Similar to pipelines from Hugging Face transformers, this class enables text generation with optimizations such as kv-caching, static shapes, and HPU graphs. It also lets users modify the text-generation parameters (such as temperature, top_p, and do_sample) and provides a method to compile computation graphs on Intel® Gaudi® 2 AI accelerators. Instances of this class can be passed directly as input to LangChain classes.
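For illustration, an instance of the pipeline might be wired into LangChain roughly as follows. This is a sketch only: the constructor arguments mirror the call shown later in run_localGPT.py, while the compile_graph method name and the HuggingFacePipeline wrapper are assumptions about how the tutorial code is organized.

# Illustrative sketch only; see run_localGPT.py and gaudi_utils/pipeline.py for the actual wiring.
from langchain_huggingface import HuggingFacePipeline
from gaudi_utils.pipeline import GaudiTextGenerationPipeline

pipe = GaudiTextGenerationPipeline(
    model_name_or_path="meta-llama/Llama-2-13b-chat-hf",
    max_new_tokens=100,
    temperature=0.5,
    top_p=0.5,
    repetition_penalty=1.15,
    use_kv_cache=True,
    do_sample=True,
)
pipe.compile_graph()  # assumed name of the method that pre-compiles HPU computation graphs

# The pipeline instance is then passed to LangChain like any other LLM
llm = HuggingFacePipeline(pipeline=pipe)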
To run the model in a suitably configured environment, perform the following steps:
Step 1 - Start a supported and properly configured Intel® Gaudi® Docker container running on a properly functioning Intel® Gaudi® 2 platform. Please see the latest Intel® Gaudi® software documentation on how to start an Intel® Gaudi® Docker container.
Note Users will want to mount a local directory in the container to facilitate content transfer to the SOURCE_DIRECTORY. For Docker this is done using the -v option.
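For example, a container might be started along the following lines; the image name and tag must be taken from the current Intel® Gaudi® documentation, and the host path is a placeholder:

docker run -it --runtime=habana \
  -e HABANA_VISIBLE_DEVICES=all \
  -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
  --cap-add=sys_nice --net=host --ipc=host \
  -v /path/to/local/content:/root/content \
  vault.habana.ai/gaudi-docker/<version>/ubuntu22.04/habanalabs/pytorch-installer-<pytorch-version>:latest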
Step 2 - Obtain the tutorial code and change to the localGPT_inference example directory:
cd ~
git clone https://github.com/HabanaAI/Gaudi-tutorials.git
cd Gaudi-tutorials/PyTorch/localGPT_inference
Step 3 - Install the requirements for LocalGPT:
pip install -q --upgrade pip
pip install -q -r requirements.txt
Optional: Install DeepSpeed to run inference on the full Llama 2 70B model:
pip install git+https://github.com/HabanaAI/DeepSpeed.git
Step 4 - Install the Optimum for Intel® Gaudi® AI accelerator library:
pip install -q optimum-habana
Step 5 - Add desired content to the SOURCE_DOCUMENTS sub-directory. The Gaudi tutorial directory, Gaudi-tutorials/PyTorch/localGPT_inference, provides a copy of the Constitution of the United States in the SOURCE_DOCUMENTS sub-directory, but users can add additional content to the folder for ingestion, as desired. The supported file types are .txt, .pdf, .csv, and .xlsx. Any other file type must be converted to one of these types before it can be ingested.
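For example, to copy a local document (hypothetical file name) into the ingestion folder:

cp ~/my_notes.pdf SOURCE_DOCUMENTS/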
Step 6 - Create an ingestion script, called run_ingest.py, containing the following code:
# Load the files as LangChain Documents
from constants import SOURCE_DIRECTORY
from ingest import load_documents

documents = load_documents(SOURCE_DIRECTORY)
print(f"Loaded {len(documents)} documents from {SOURCE_DIRECTORY}")

# Split the text into chunks
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(documents)
print(f"Created {len(texts)} chunks of text")

# Create embeddings from chunks of text
from constants import EMBEDDING_MODEL_NAME
from langchain_huggingface import HuggingFaceEmbeddings
from habana_frameworks.torch.utils.library_loader import load_habana_module
from optimum.habana.sentence_transformers.modeling_utils import adapt_sentence_transformers_to_gaudi

load_habana_module()
adapt_sentence_transformers_to_gaudi()
embeddings = HuggingFaceEmbeddings(model_name=EMBEDDING_MODEL_NAME, model_kwargs={"device": "hpu"})

# Create a Chroma vector database to store embeddings
import time
from constants import PERSIST_DIRECTORY, CHROMA_SETTINGS
from langchain_chroma import Chroma

start_time = time.perf_counter()
db = Chroma.from_documents(texts, embeddings, persist_directory=PERSIST_DIRECTORY, client_settings=CHROMA_SETTINGS)
end_time = time.perf_counter()
print(f"Time taken to create vector store: {(end_time-start_time)*1000} ms")
This ingestion script uses the ingest.py and constants.py utility scripts to ingest the desired content and create a Chroma vector database to store the associated embeddings.
Step 7 - To ingest all the documents in the SOURCE_DOCUMENTS sub-directory, run the run_ingest.py script:
python run_ingest.py
Sample output:
Loaded 1 documents from /root/Gaudi-tutorials/PyTorch/localGPT_inference/SOURCE_DOCUMENTS
Splitting chunks of text from /root/Gaudi-tutorials/PyTorch/localGPT_inference/SOURCE_DOCUMENTS
Created 72 chunks of text
Loading Habana module /root/Gaudi-tutorials/PyTorch/localGPT_inference/SOURCE_DOCUMENTS
[WARNING|utils.py:225] 2025-04-04 15:42:31,509 >> optimum-habana v1.16.0 has been validated for SynapseAI v1.20.0 but the driver version is v1.18.0, this could lead to undefined behavior!
Loading Habana modules from /usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/lib
Done loading Habana module /root/Gaudi-tutorials/PyTorch/localGPT_inference/SOURCE_DOCUMENTS
Calling adapt_sentence_transformers_to_gaudi /root/Gaudi-tutorials/PyTorch/localGPT_inference/SOURCE_DOCUMENTS
Done calling adapt_sentence_transformers_to_gaudi /root/Gaudi-tutorials/PyTorch/localGPT_inference/SOURCE_DOCUMENTS
Creating the HuggingFaceEmbeddings class /root/Gaudi-tutorials/PyTorch/localGPT_inference/SOURCE_DOCUMENTS
Using HPU fused kernel for apply_rotary_pos_emb
Using HPU fused kernel for RMSNorm
Using HPU fused kernel for apply_rotary_pos_emb
Using HPU fused kernel for RMSNorm
============================= HABANA PT BRIDGE CONFIGURATION ===========================
PT_HPU_LAZY_MODE = 1
PT_HPU_RECIPE_CACHE_CONFIG = ,false,1024
PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
PT_HPU_LAZY_ACC_PAR_MODE = 1
PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
PT_HPU_EAGER_PIPELINE_ENABLE = 1
PT_HPU_EAGER_COLLECTIVE_PIPELINE_ENABLE = 1
PT_HPU_ENABLE_LAZY_COLLECTIVES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 152
CPU RAM : 1056439480 KB
------------------------------------------------------------------------------
Done creating the HuggingFaceEmbeddings class /root/Gaudi-tutorials/PyTorch/localGPT_inference/SOURCE_DOCUMENTS
Starting the creation of the Chroma database /root/Gaudi-tutorials/PyTorch/localGPT_inference/SOURCE_DOCUMENTS
Time taken to create vector store: 1063.7424513697624 ms
The run_ingest.py file uses LangChain tools to parse the document and create embeddings locally using the GaudiHuggingFaceEmbeddings class. It then stores the result in a local vector database, located in the DB sub-directory, using the Chroma vector store.
Note To start from an empty database, delete the DB folder and run the ingest script again.
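To verify what was stored, the persisted database can be reloaded and queried directly. The following is a minimal sketch, not part of the tutorial scripts, that reuses the same constants as run_ingest.py:

# Minimal verification sketch (not part of the tutorial): reload the persisted
# Chroma store created by run_ingest.py and run a similarity search against it.
from constants import EMBEDDING_MODEL_NAME, PERSIST_DIRECTORY, CHROMA_SETTINGS
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
from habana_frameworks.torch.utils.library_loader import load_habana_module
from optimum.habana.sentence_transformers.modeling_utils import adapt_sentence_transformers_to_gaudi

load_habana_module()
adapt_sentence_transformers_to_gaudi()
embeddings = HuggingFaceEmbeddings(model_name=EMBEDDING_MODEL_NAME, model_kwargs={"device": "hpu"})

db = Chroma(
    persist_directory=PERSIST_DIRECTORY,
    embedding_function=embeddings,
    client_settings=CHROMA_SETTINGS,
)

# Print the source and a preview of the chunks most similar to a sample question
for doc in db.similarity_search("What does Article I of the Constitution cover?", k=3):
    print(doc.metadata.get("source"), "->", doc.page_content[:100])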
How to Access and Use the Llama 2 Model
Use of the pretrained model is subject to compliance with third-party licenses, including the Llama 2 Community License Agreement (LLAMAV2). For guidance on the intended use of the Llama 2 model, what is considered misuse or out-of-scope use, intended users, and additional terms, review the instructions in the Community License. Users of the Llama 2 model bear sole liability and responsibility to follow and comply with any third-party licenses.
To run gated models like Llama-2-13b-chat-hf or Llama-2-70b-chat-hf, do the following:
- Sign up for a Hugging Face account.
- Agree to the model's terms of use in its model card on the Hugging Face hub.
- Set a read token.
Before launching a script, use the Hugging Face command-line interface to sign into the Hugging Face account:
huggingface-cli login --token <your token here>
If the token is valid, a message indicating that the login succeeded will be printed.
Run the LocalGPT Model with Llama 2 13B Chat
For the Llama 2 13B inference run, change the model to meta-llama/Llama-2-13b-chat-hf by modifying the value of the LLM_ID variable in the constants.py file from its default value of meta-llama/Llama-2-70b-chat-hf.
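After the change, the relevant line in constants.py should look like this:

LLM_ID = "meta-llama/Llama-2-13b-chat-hf"  # default value is "meta-llama/Llama-2-70b-chat-hf"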
Since the example is interactive, it's a better experience to launch it from a terminal window. The run_localGPT.py script uses a local LLM (Llama 2) to understand questions and create answers. The context for the answers is extracted from the local vector store using a similarity search to locate the right piece of context from the documentation.
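Conceptually, the retrieval step combines the local LLM with the Chroma retriever roughly as follows. This is a simplified sketch; the actual script also wires in a prompt template and conversation memory:

# Simplified sketch of the retrieval-augmented QA flow; run_localGPT.py additionally
# configures a prompt template and a ConversationBufferMemory.
from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(
    llm=llm,                      # the LangChain LLM wrapping GaudiTextGenerationPipeline
    chain_type="stuff",           # stuff the retrieved chunks directly into the prompt
    retriever=db.as_retriever(),  # Chroma vector store built by run_ingest.py
    return_source_documents=True,
)

res = qa.invoke({"query": "What is the first Article of the Constitution?"})
print(res["result"])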
Note The inference is running in sampling mode, so users can optionally adjust the temperature and top_p settings in run_localGPT.py, line 84, to change the output. The current settings are temperature=0.5 and top_p=0.5. To stop running the model, type exit at the prompt.
To start the chat, run the following in a terminal window:
python run_localGPT.py --device_type hpu
The following example shows the initial output and possible interactions:
python run_localGPT.py --device_type hpu
2025-04-04 20:24:18,398 - INFO - run_localGPT.py:218 - Running on: hpu
2025-04-04 20:24:18,398 - INFO - run_localGPT.py:219 - Display Source Documents set to: False
2025-04-04 20:24:18,398 - INFO - run_localGPT.py:48 - temperature set to 0.2, top_p set to 0.95
2025-04-04 20:24:18,398 - INFO - run_localGPT.py:49 - Loading Model: meta-llama/Llama-2-13b-chat-hf, on: hpu
2025-04-04 20:24:18,398 - INFO - run_localGPT.py:50 - This action can take a few minutes!
Using HPU fused kernel for apply_rotary_pos_emb
Using HPU fused kernel for RMSNorm
Using HPU fused kernel for apply_rotary_pos_emb
Using HPU fused kernel for RMSNorm
.............
model-00003-of-00003.safetensors: 100%|█████████████████████████████████████████████| 6.18G/6.18G [00:13<00:00, 444MB/s]
Downloading shards: 100%|█████████████████████████████████████████████████████████████████| 3/3 [00:58<00:00, 19.61s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 3/3 [00:02<00:00, 1.19it/s]
generation_config.json: 100%|██████████████████████████████████████████████████████████| 188/188 [00:00<00:00, 2.66MB/s]
============================= HABANA PT BRIDGE CONFIGURATION ===========================
PT_HPU_LAZY_MODE = 1
PT_HPU_RECIPE_CACHE_CONFIG = ,false,1024
PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
PT_HPU_LAZY_ACC_PAR_MODE = 1
PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
PT_HPU_EAGER_PIPELINE_ENABLE = 1
PT_HPU_EAGER_COLLECTIVE_PIPELINE_ENABLE = 1
PT_HPU_ENABLE_LAZY_COLLECTIVES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 152
CPU RAM : 1056439480 KB
------------------------------------------------------------------------------
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
2025-04-04 20:27:59,517 - INFO - run_localGPT.py:155 - Local LLM Loaded
2025-04-04 20:27:59,523 - INFO - SentenceTransformer.py:218 - Load pretrained SentenceTransformer: all-MiniLM-L6-v2
/root/Gaudi-tutorials/PyTorch/localGPT_inference/run_localGPT.py:256: LangChainDeprecationWarning: Please see the migration guide at: https://python.langchain.com/docs/versions/migrating_memory/
  memory = ConversationBufferMemory(input_key="question", memory_key="history")

Enter a query:
/root/Gaudi-tutorials/PyTorch/localGPT_inference/run_localGPT.py:298: LangChainDeprecationWarning: The method `Chain.__call__` was deprecated in langchain 0.1.0 and will be removed in 1.0. Use :meth:`~invoke` instead.
  res = qa(query)
2025-04-04 20:28:02,813 - INFO - run_localGPT.py:302 - Query processing time: 1.70864562317729s

> Question:

> Answer:

Enter a query: What is the first Article of the Constitution?
2025-04-04 20:29:08,693 - INFO - run_localGPT.py:302 - Query processing time: 2.3001526705920696s

> Question: What is the first Article of the Constitution?

> Answer: The first article of the constitution is the article that describes the powers and structure of the legislative branch of the federal government, including the house of representatives and the senate.

Enter a query: What powers does it confer to the executive branch?
2025-04-04 20:30:04,640 - INFO - run_localGPT.py:302 - Query processing time: 2.7009461410343647s

> Question: What powers does it confer to the executive branch?

> Answer: It confers the power to the president to nominate and, by and with the advice and consent of the senate, to appoint ambassadors, other public ministers and consuls, judges of the supreme court, and all other officers of the united states, whose appointments are not otherwise provided for, and which shall be established by law.

Enter a query: exit
Run the Full LocalGPT Example with Llama 2 70B Chat
Change the model back to meta-llama/Llama-2-70b-chat-hf by modifying the value of the LLM_ID variable in the constants.py file.
Since this example is interactive, it's a better experience to launch it from a terminal window. The run_localGPT.py script uses a local LLM (Llama 2 in this case) to understand questions and create answers. The context for the answers is extracted from the local vector store using a similarity search to locate the right piece of context from the documentation. This is the run command to use:
PT_HPU_LAZY_ACC_PAR_MODE=1 PT_HPU_ENABLE_LAZY_COLLECTIVES=true python gaudi_spawn.py --use_deepspeed --world_size 8 run_localGPT.py --device_type hpu --temperature 0.7 --top_p 0.95
Running the full 70B model takes up ~128 GB of disk space and requires significantly more device memory than the 7B or 13B versions of the model. These additional memory requirements call for the efficiencies provided by the DeepSpeed library.
Note: The inference is running in sampling mode, so users can optionally modify the temperature and top_p settings. The current settings are temperature=0.7 and top_p=0.95. Type "exit" at the prompt to stop the execution.
Next Steps
To query and chat with personalized content, add additional content to the SOURCE_DOCUMENTS folder and ingest the new content.
To experiment with different values to get different outputs, modify the temperature and top_p values in the run_localGPT.py file, line 84:
pipe = GaudiTextGenerationPipeline(model_name_or_path=model_id, max_new_tokens=100, temperature=0.5, top_p=0.5, repetition_penalty=1.15, use_kv_cache=True, do_sample=True)
For more information on tokenization and padding, review the updated GaudiTextGenerationPipeline class in the Gaudi-tutorials/tree/main/PyTorch/localGPT_inference/gaudi_utils/pipeline.py file.
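As a simplified illustration of the static-shapes idea (not the tutorial's exact code), padding every prompt to a fixed length keeps the input shape constant, which lets the HPU reuse compiled graphs instead of recompiling for each new prompt length:

# Simplified illustration of static-shape tokenization; pipeline.py implements its own
# padding and indexing logic, this only shows the underlying idea.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-chat-hf")
tokenizer.pad_token = tokenizer.eos_token  # Llama 2 defines no pad token by default

inputs = tokenizer(
    "What is the first Article of the Constitution?",
    return_tensors="pt",
    padding="max_length",  # pad to a fixed length -> static input shape
    max_length=256,
    truncation=True,
)
print(inputs["input_ids"].shape)  # always (1, 256), regardless of prompt length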
More Resources
Memory-Efficient Training on Intel® Gaudi® Accelerator with DeepSpeed
Fine-Tune GPT2* with Hugging Face and Intel® Gaudi® AI Accelerators