LocalGPT with Llama 2
This tutorial shows how to use the LocalGPT open source initiative on the Intel® Gaudi® 2 AI accelerator. LocalGPT allows you to load your own documents and run an interactive chat session with the material. To query and summarize your content, load any .pdf or .txt documents into the SOURCE_DOCUMENTS folder, run the ingest.py script to tokenize your content, and then use the run_localGPT.py script to start the interaction.
This example uses the Llama 2 13B chat model from Meta* (meta-llama/Llama-2-13b-chat-hf) as the reference model to run inference on Intel Gaudi 2 AI accelerators.
To optimize this instance of LocalGPT, new content was created on top of the existing Hugging Face* based text-generation inference task and pipelines, including:
- The Optimum for Intel Gaudi AI accelerators library with the Llama 2 13B model optimized on Intel Gaudi 2 AI accelerators.
- LangChain* to import the source document with a custom embedding model using the GaudiHuggingFaceEmbeddings class based on HuggingFaceEmbeddings.
- A custom pipeline class, GaudiTextGenerationPipeline, that optimizes text-generation tasks with padding and indexing for static shapes to improve performance.
To optimize LocalGPT on Intel Gaudi 2 AI accelerators, custom classes were developed for text embeddings and text generation. The application uses the custom class GaudiHuggingFaceEmbeddings to convert textual data to vector embeddings. This class extends the HuggingFaceEmbeddings class from LangChain and uses an Intel Gaudi 2 AI accelerators-optimized implementation of SentenceTransformer.
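As a rough illustration, the sketch below shows how an embeddings wrapper of this kind plugs into LangChain. It assumes LangChain's HuggingFaceEmbeddings keeps its SentenceTransformer instance in self.client (as it did in 2023-era releases); the class and method bodies are simplified and are not the tutorial's exact implementation.

# Simplified sketch of a GaudiHuggingFaceEmbeddings-style wrapper (illustrative only).
from langchain.embeddings import HuggingFaceEmbeddings

class SketchGaudiEmbeddings(HuggingFaceEmbeddings):
    """Embeds text with a SentenceTransformer; a Gaudi-optimized
    implementation would be substituted for self.client."""

    def embed_documents(self, texts):
        # Newlines are stripped so each chunk embeds as a single sequence.
        texts = [text.replace("\n", " ") for text in texts]
        embeddings = self.client.encode(texts, **self.encode_kwargs)
        return embeddings.tolist()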
The embedding tokenization process was modified to incorporate static shapes, which provides a significant speedup. Furthermore, the GaudiTextGenerationPipeline class provides a link between the Optimum for Intel Gaudi AI accelerators library and LangChain. Similar to pipelines from Hugging Face Transformers, this class enables text generation with optimizations such as kv-caching, static shapes, and HPU graphs. It also lets you modify the text-generation parameters (such as temperature, top_p, and do_sample) and provides a method to compile computation graphs on Intel Gaudi 2 AI accelerators. Instances of this class can be passed directly as input to LangChain classes.
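The snippet below sketches how such a pipeline instance could be handed to LangChain. The constructor arguments mirror the run_localGPT.py call shown later in this tutorial; the import path and the compile_graph() call are assumptions based on the class description above.

# Illustrative usage of the custom pipeline with LangChain (paths and method
# names are assumptions; see gaudi_utils/pipeline.py for the actual class).
from langchain.llms import HuggingFacePipeline
from gaudi_utils.pipeline import GaudiTextGenerationPipeline

pipe = GaudiTextGenerationPipeline(
    model_name_or_path="meta-llama/Llama-2-13b-chat-hf",
    max_new_tokens=100,
    temperature=0.5,
    top_p=0.5,
    repetition_penalty=1.15,
    use_kv_cache=True,
    do_sample=True,
)
pipe.compile_graph()  # assumed name for the graph-compilation step described above

# The pipeline object can be passed directly to LangChain, like any HF pipeline.
llm = HuggingFacePipeline(pipeline=pipe)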
To set up and run the model:
- Set the folder location and environment variables:
cd /root/Gaudi-tutorials/PyTorch/localGPT_inference
export DEBIAN_FRONTEND="noninteractive"
export TZ=Etc/UTC
- Install the requirements for LocalGPT:
apt-get update
apt-get install -y tzdata bash-completion python3-pip openssh-server vim git iputils-ping net-tools protobuf-compiler curl bc gawk tmux
rm -rf /var/lib/apt/lists/*
pip install -q --upgrade pip
pip install -q -r requirements.txt
- Install the Optimum for Intel Gaudi AI accelerators library:
pip install -q --upgrade-strategy eager optimum[habana]
- To load local content, copy all files into the SOURCE_DOCUMENTS directory.
For this example, a copy of the Constitution of the United States is included in the folder. You can add and ingest additional content to the folder.
The default file types are .txt, .pdf, .csv, and .xlsx. Any other file type must be converted to these types.
- To ingest all the data, run the following command:
python ingest.py --device_type hpu

2023-10-10 23:23:58,137 - INFO - ingest.py:124 - Loading documents from /root/Gaudi-tutorials/PyTorch/localGPT_inference/SOURCE_DOCUMENTS
2023-10-10 23:23:58,148 - INFO - ingest.py:37 - Loading document batch
2023-10-10 23:24:48,208 - INFO - ingest.py:133 - Loaded 1 documents from /root/Gaudi-tutorials/PyTorch/localGPT_inference/SOURCE_DOCUMENTS
2023-10-10 23:24:48,208 - INFO - ingest.py:134 - Split into 2227 chunks of text
Loading Habana modules from /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib
2023-10-10 23:24:49,625 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: all-MiniLM-L6-v2
2023-10-10 23:24:50,149 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: all-MiniLM-L6-v2
2023-10-10 23:24:50,723 - INFO - json_impl.py:45 - Using python library for writing JSON byte strings
2023-10-10 23:24:50,950 - INFO - duckdb.py:460 - loaded in 4454 embeddings
2023-10-10 23:24:50,952 - INFO - duckdb.py:472 - loaded in 1 collections
====== HABANA PT BRIDGE CONFIGURATION ====
PT_HPU_LAZY_MODE = 1
PT_RECIPE_CACHE_PATH =
PT_CACHE_FOLDER_DELETE = 0
PT_HPU_RECIPE_CACHE_CONFIG =
PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
PT_HPU_LAZY_ACC_PAR_MODE = 1
PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
-------------------: System Configuration :-------------------
Num CPU Cores : 160
CPU RAM : 1056447244 KB
-------------------------------------------------------------------
Batches: 100%|██████████████████████████████████| 70/70 [00:02<00:00, 23.41it/s]
2023-10-10 23:24:58,235 - INFO - ingest.py:161 - Time taken to create embeddings vectorstore: 7.784449464001227s
2023-10-10 23:24:58,235 - INFO - duckdb.py:414 - Persisting DB to disk, putting it in the save folder: /root/Gaudi-tutorials/PyTorch/localGPT_inference/DB
2023-10-10 23:24:58,619 - INFO - duckdb.py:414 - Persisting DB to disk, putting it in the save folder: /root/Gaudi-tutorials/PyTorch/localGPT_inference/DB
The ingest.py file uses LangChain tools to parse the document and create embeddings locally using the GaudiHuggingFaceEmbeddings class. It then stores the result in a local vector database (DB) using the Chroma vector store.
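The outline below approximates that flow with standard LangChain components. The loader, splitter settings, and embeddings import path are illustrative assumptions; the authoritative logic is in ingest.py.

# Approximate outline of ingest.py (illustrative; chunk sizes, the loader, and
# the embeddings import path are assumptions).
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from gaudi_utils.embeddings import GaudiHuggingFaceEmbeddings  # assumed module path

# Load a document from SOURCE_DOCUMENTS and split it into chunks.
documents = TextLoader("SOURCE_DOCUMENTS/constitution.txt").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200).split_documents(documents)

# Embed the chunks on the HPU and persist the vectors to the DB folder.
embeddings = GaudiHuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
db = Chroma.from_documents(chunks, embeddings, persist_directory="DB")
db.persist()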
Note To start from an empty database, delete the DB folder and run the ingest script again.
How to Access and Use the Llama 2 Model
Use of the pretrained model is subject to compliance with third-party licenses, including the Llama 2 Community License Agreement (LLAMAV2). For guidance on the intended use of the Llama 2 model, what is considered misuse, out-of-scope uses, intended users, and additional terms, read the instructions in the Community License. You bear sole liability and responsibility to follow and comply with any third-party licenses.
To run gated models like Llama-2-13b-chat-hf, you need to do the following:
- Sign up for a Hugging Face account.
- Agree to the model's terms of use in its model card on the Hugging Face hub.
- Set a read token.
- Before launching your script, sign in to your account with the Hugging Face command-line interface by running huggingface-cli login.
huggingface-cli login --token <your token here>
Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful
Run the LocalGPT Model with Llama 2 13B Chat
To change the model, modify the LLM_ID value in the constants.py file. For this example, the default is meta-llama/Llama-2-13b-chat-hf.
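For example, the relevant line in constants.py looks roughly like this (the surrounding file contents may differ):

# constants.py (excerpt, illustrative): point LLM_ID at any compatible checkpoint.
LLM_ID = "meta-llama/Llama-2-13b-chat-hf"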
Since the example is interactive, it's a better experience to launch it from a terminal window. The run_localGPT.py script uses a local LLM (Llama 2) to understand questions and create answers. The context for the answers is extracted from the local vector store using a similarity search to locate the right piece of context from the documentation.
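Conceptually, the retrieval step resembles the standard LangChain RetrievalQA pattern sketched below; the actual prompt template, retriever settings, and LLM wiring are defined in run_localGPT.py, and the small stand-in model here is only to keep the example self-contained.

# Conceptual sketch of the retrieval-augmented answer flow (illustrative only).
from langchain.chains import RetrievalQA
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import HuggingFacePipeline
from langchain.vectorstores import Chroma
from transformers import pipeline

# Reopen the persisted vector store created by ingest.py.
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")  # stand-in for GaudiHuggingFaceEmbeddings
db = Chroma(persist_directory="DB", embedding_function=embeddings)
retriever = db.as_retriever()

# Stand-in LLM; run_localGPT.py uses the Gaudi-optimized Llama 2 pipeline instead.
llm = HuggingFacePipeline(pipeline=pipeline("text-generation", model="gpt2", max_new_tokens=64))

# Similarity search pulls the most relevant chunks, which are stuffed into the prompt.
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)
print(qa("what is the Article I ?")["result"])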
python run_localGPT.py --device_type hpu
Note The inference runs in sampling mode, so you can optionally modify the temperature and top_p settings in run_localGPT.py, line 84, to change the output. The current settings are temperature=0.5 and top_p=0.5. To stop running the model, type exit at the prompt.
To start the chat, run the following in a terminal window:
python run_localGPT.py --device_type hpu
The following example shows the initial output:
python run_localGPT.py --device_type hpu
2023-10-10 23:29:55,812 - INFO - run_localGPT.py:186 - Running on: hpu
2023-10-10 23:29:55,812 - INFO - run_localGPT.py:187 - Display Source Documents set to: False
2023-10-10 23:29:56,315 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: all-MiniLM-L6-v2
2023-10-10 23:29:56,718 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: all-MiniLM-L6-v2
2023-10-10 23:29:56,922 - INFO - __init__.py:88 - Running Chroma using direct local API.
2023-10-10 23:29:56,931 - WARNING - __init__.py:43 - Using embedded DuckDB with persistence: data will be stored in: /root/Gaudi-tutorials/PyTorch/localGPT_inference/DB
2023-10-10 23:29:56,935 - INFO - ctypes.py:22 - Successfully imported ClickHouse Connect C data optimizations
2023-10-10 23:29:56,938 - INFO - json_impl.py:45 - Using python library for writing JSON byte strings
2023-10-10 23:29:57,183 - INFO - duckdb.py:460 - loaded in 6681 embeddings
2023-10-10 23:29:57,184 - INFO - duckdb.py:472 - loaded in 1 collections
2023-10-10 23:29:57,185 - INFO - duckdb.py:89 - collection with name langchain already exists, returning existing collection
2023-10-10 23:29:57,186 - INFO - run_localGPT.py:38 - Loading Model: meta-llama/Llama-2-13b-chat-hf, on: hpu
2023-10-10 23:29:57,186 - INFO - run_localGPT.py:39 - This action can take a few minutes!
/usr/local/lib/python3.8/dist-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
warn("The installed version of bitsandbytes was compiled without GPU support. "
/usr/local/lib/python3.8/dist-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
/usr/local/lib/python3.8/dist-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
[2023-10-10 23:29:57,622] [INFO] [real_accelerator.py:123:get_accelerator] Setting ds_accelerator to hpu (auto detect)
[WARNING|utils.py:177] 2023-10-10 23:29:58,637 >> optimum-habana v1.7.5 has been validated for SynapseAI v1.11.0 but habana-frameworks v1.12.0.480 was found, this could lead to undefined behavior!
[WARNING|utils.py:190] 2023-10-10 23:29:59,786 >> optimum-habana v1.7.5 has been validated for SynapseAI v1.11.0 but the driver version is v1.12.0, this could lead to undefined behavior!
Fetching 3 files: 100%|████████████████████████| 3/3 [00:00<00:00, 17427.86it/s]
Loading checkpoint shards: 100%|██████████████████| 3/3 [00:02<00:00, 1.12it/s]
========= HABANA PT BRIDGE CONFIGURATION =======
PT_HPU_LAZY_MODE = 1
PT_RECIPE_CACHE_PATH =
PT_CACHE_FOLDER_DELETE = 0
PT_HPU_RECIPE_CACHE_CONFIG =
PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
PT_HPU_LAZY_ACC_PAR_MODE = 1
PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
-----------------------: System Configuration :----------------------
Num CPU Cores : 160
CPU RAM : 1056447244 KB
--------------------------------------------------------------------------
This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (4096). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.
2023-10-10 23:32:20,404 - INFO - run_localGPT.py:133 - Local LLM Loaded
Enter a query: what is the Article I ?
2023-10-23 19:47:28,598 - INFO - run_localGPT.py:240 - Query processing time: 1.5914111537858844s
> Question:
what is the Article I ?
> Answer:
It is the first article of the US constitution .
Enter a query: what does it say?
2023-10-23 19:47:36,684 - INFO - run_localGPT.py:240 - Query processing time: 1.872558546019718s
> Question:
what does it say?
> Answer:
The first article of the US constitution states "All legislative Powers herein granted shall be vested in a Congress of the United States, which shall consist of a Senate and House of Representatives."
Enter a query: What about article II
2023-10-23 20:34:42,818 - INFO - run_localGPT.py:240 - Query processing time: 1.6038263840600848s
> Question:
What about article II
> Answer:
Article II of the US constitution deals with the executive branch of government and establishes the office of the president.
Next Steps
To query and chat with your own content, add it to the SOURCE_DOCUMENTS folder.
To experiment with different values to get different outputs, you can also modify the temperature and top_p values in the run_localGPT.py file, line 84:
pipe = GaudiTextGenerationPipeline(model_name_or_path=model_id, max_new_tokens=100, temperature=0.5, top_p=0.5, repetition_penalty=1.15, use_kv_cache=True, do_sample=True)
For more information on tokenization and padding, review the updated GaudiTextGenerationPipeline class in gaudi_utils/pipeline.py.
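As a rough illustration of the static-shape idea, padding every prompt to a fixed length keeps the input shape constant across queries, so the HPU graph does not need to be recompiled. The snippet below is a generic example, not the pipeline's exact code; the prompt text and max_length value are assumptions.

# Generic illustration of static-shape tokenization (not the exact pipeline code).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-chat-hf")
tokenizer.pad_token = tokenizer.eos_token  # Llama 2 ships without a pad token

inputs = tokenizer(
    "What does Article I of the Constitution say?",
    return_tensors="pt",
    padding="max_length",  # pad to a fixed length so every query has the same shape
    max_length=512,
    truncation=True,
)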