LocalGPT* with Llama 2 on Intel® Gaudi® 2 AI Accelerators

Run an interactive chat with your local documentation using LocalGPT* on Intel® Gaudi® 2 AI accelerators with the Llama 2 model.


LocalGPT with Llama 2

This tutorial shows how to use the LocalGPT open source initiative on the Intel® Gaudi® 2 AI accelerator. LocalGPT lets you load your own documents and run an interactive chat session with the material. To query and summarize your content, load your .pdf or .txt documents into the SOURCE_DOCUMENTS folder, run the ingest.py script to tokenize your content, and then use the run_localGPT.py script to start the interaction.

This example uses the Llama 2 13B chat model from Meta* (meta-llama/Llama-2-13b-chat-hf) as the reference model to run inference on Intel Gaudi 2 AI accelerators.

To optimize this instance of LocalGPT, new components were built on top of the existing Hugging Face*-based text-generation inference task and pipelines, including:

  • The Optimum for Intel Gaudi AI accelerators library with the Llama 2 13B model optimized on Intel Gaudi 2 AI accelerators.
  • LangChain* to import the source document with a custom embedding model using the GaudiHuggingFaceEmbeddings class based on HuggingFaceEmbeddings.
  • A custom pipeline class, GaudiTextGenerationPipeline, that optimizes text-generation tasks with padding and indexing for static shapes to improve performance.
     

To optimize LocalGPT on Intel Gaudi 2 AI accelerators, custom classes were developed for text embeddings and text generation. The application uses the custom class GaudiHuggingFaceEmbeddings to convert textual data to vector embeddings. This class extends the HuggingFaceEmbeddings class from LangChain and uses an implementation of SentenceTransformer optimized for Intel Gaudi 2 AI accelerators.
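The exact implementation ships with the tutorial code; the following is only a minimal structural sketch. The GaudiSentenceTransformer name and the embedding_input_size argument are hypothetical stand-ins for the Gaudi-optimized SentenceTransformer implementation described above.

from langchain.embeddings import HuggingFaceEmbeddings

class GaudiHuggingFaceEmbeddings(HuggingFaceEmbeddings):
    """Sketch: HuggingFaceEmbeddings variant that encodes text on the HPU with static shapes."""

    def __init__(self, embedding_input_size: int = -1, **kwargs):
        super().__init__(**kwargs)
        # Replace the default SentenceTransformer client with a Gaudi-optimized one
        # that pads every batch to a fixed length (GaudiSentenceTransformer is a
        # hypothetical stand-in for the tutorial's optimized implementation).
        self.client = GaudiSentenceTransformer(
            self.model_name, embedding_input_size=embedding_input_size
        )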

The tokenization process was modified to incorporate static shapes, which provides a significant speedup. Furthermore, the GaudiTextGenerationPipeline class provides a link between the Optimum for Intel Gaudi AI accelerators library and LangChain. Similar to pipelines from Hugging Face Transformers, this class enables text generation with optimizations such as key-value (kv) caching, static shapes, and HPU graphs. It also lets you modify the text-generation parameters (such as temperature, top_p, and do_sample) and includes a method to compile computation graphs on Intel Gaudi 2 AI accelerators. Instances of this class can be passed directly to LangChain classes.
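For example, an instance can be created with the same arguments that appear later in this article (run_localGPT.py, line 84) and handed to LangChain. The compile_graph() method name and the HuggingFacePipeline wrapper below are assumptions for illustration; the actual wiring in run_localGPT.py may differ.

from langchain.llms import HuggingFacePipeline
from gaudi_utils.pipeline import GaudiTextGenerationPipeline

# Arguments mirror run_localGPT.py, line 84 (shown later in this article).
pipe = GaudiTextGenerationPipeline(
    model_name_or_path="meta-llama/Llama-2-13b-chat-hf",
    max_new_tokens=100,
    temperature=0.5,
    top_p=0.5,
    repetition_penalty=1.15,
    use_kv_cache=True,
    do_sample=True,
)
pipe.compile_graph()  # assumed name of the method that compiles the HPU computation graphs

# Instances of the class can be passed to LangChain; HuggingFacePipeline is one
# possible wrapper (an assumption, not necessarily what run_localGPT.py uses).
llm = HuggingFacePipeline(pipeline=pipe)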

To set up and run the model:

  1. Set the folder location and environment variables:
    cd /root/Gaudi-tutorials/PyTorch/localGPT_inference
    
    export DEBIAN_FRONTEND="noninteractive"
    
    export TZ=Etc/UTC
    

     

  2. Install the requirements for LocalGPT:
    apt-get update
    
    apt-get install -y tzdata bash-completion python3-pip openssh-server vim git iputils-ping net-tools protobuf-compiler curl bc gawk tmux
    
    rm -rf /var/lib/apt/lists/*
    
    
    pip install -q --upgrade pip
    
    pip install -q -r requirements.txt

     

  3. Install the Optimum for Intel Gaudi AI accelerators library:
    pip install -q --upgrade-strategy eager optimum[habana]

     

  4. To load local content, copy all files into the SOURCE_DOCUMENTS directory.

    For this example, a copy of the Constitution of the United States is included in the folder. You can add and ingest additional content to the folder.

    The default file types are .txt, .pdf, .csv, and .xlsx. Any other file type must be converted to one of these formats.
  5. To ingest all the data, run the following command:
    python ingest.py --device_type hpu
    
    2023-10-10 23:23:58,137 - INFO - ingest.py:124 - Loading documents from /root/Gaudi-tutorials/PyTorch/localGPT_inference/SOURCE_DOCUMENTS
    
    2023-10-10 23:23:58,148 - INFO - ingest.py:37 - Loading document batch
    
    2023-10-10 23:24:48,208 - INFO - ingest.py:133 - Loaded 1 documents from /root/Gaudi-tutorials/PyTorch/localGPT_inference/SOURCE_DOCUMENTS
    
    2023-10-10 23:24:48,208 - INFO - ingest.py:134 - Split into 2227 chunks of text
    
    Loading Habana modules from /usr/local/lib/python3.8/dist-packages/habana_frameworks/torch/lib
    
    2023-10-10 23:24:49,625 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: all-MiniLM-L6-v2
    
    2023-10-10 23:24:50,149 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: all-MiniLM-L6-v2
    
    2023-10-10 23:24:50,723 - INFO - json_impl.py:45 - Using python library for writing JSON byte strings
    
    2023-10-10 23:24:50,950 - INFO - duckdb.py:460 - loaded in 4454 embeddings
    
    2023-10-10 23:24:50,952 - INFO - duckdb.py:472 - loaded in 1 collections
    ====== HABANA PT BRIDGE CONFIGURATION ====
    
    PT_HPU_LAZY_MODE = 1
    
    PT_RECIPE_CACHE_PATH =
    
    PT_CACHE_FOLDER_DELETE = 0
    
    PT_HPU_RECIPE_CACHE_CONFIG =
    
    PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
    
    PT_HPU_LAZY_ACC_PAR_MODE = 1
    
    PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
    
    -------------------: System Configuration :-------------------
    
    Num CPU Cores : 160
    
    CPU RAM : 1056447244 KB
    -------------------------------------------------------------------
    
    Batches: 100%|██████████████████████████████████| 70/70 [00:02<00:00, 23.41it/s]
    
    2023-10-10 23:24:58,235 - INFO - ingest.py:161 - Time taken to create embeddings vectorstore: 7.784449464001227s
    
    2023-10-10 23:24:58,235 - INFO - duckdb.py:414 - Persisting DB to disk, putting it in the save folder: /root/Gaudi-tutorials/PyTorch/localGPT_inference/DB
    
    2023-10-10 23:24:58,619 - INFO - duckdb.py:414 - Persisting DB to disk, putting it in the save folder: /root/Gaudi-tutorials/PyTorch/localGPT_inference/DB

    The ingest.py file uses LangChain tools to parse the document and create embeddings locally using the GaudiHuggingFaceEmbeddings class. It then stores the result in a local vector database (DB) using the Chroma vector store; a simplified sketch of this flow follows the note below.

    Note To start from an empty database, delete the DB folder and run the ingest script again.
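For reference, the core of this ingest flow can be approximated with standard LangChain components. The following is a simplified sketch rather than the actual ingest.py; the file name, splitter settings, and the import path for GaudiHuggingFaceEmbeddings are illustrative assumptions.

from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from gaudi_utils.embeddings import GaudiHuggingFaceEmbeddings  # assumed import path

# Load a source document and split it into chunks (file name and settings are illustrative).
docs = TextLoader("SOURCE_DOCUMENTS/constitution.txt").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200).split_documents(docs)

# Embed the chunks on the HPU and persist them in the local Chroma database (DB folder).
embeddings = GaudiHuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
db = Chroma.from_documents(chunks, embeddings, persist_directory="DB")
db.persist()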

How to Access and Use the Llama 2 Model

Use of the pretrained model is subject to compliance with third-party licenses, including the Llama 2 Community License Agreement (LLAMAV2). For guidance on the intended use of the Llama 2 model, what is considered misuse or out-of-scope use, intended users, and additional terms, review the instructions in the Community License. You bear sole liability and responsibility to follow and comply with any third-party licenses.

To run gated models like Llama-2-13b-chat-hf, you need to do the following:

  • Sign up for a Hugging Face account.
  • Agree to the model's terms of use in its model card on the Hugging Face hub.
  • Create a read access token in your Hugging Face account settings.
  • Before launching your script, use the Hugging Face command-line interface to sign in to your account:
huggingface-cli login --token <your token here>

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.

Token is valid (permission: read).

Your token has been saved to /root/.cache/huggingface/token

Login successful
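As an alternative to the command line, the same login can be done from Python with the huggingface_hub API; the token string below is a placeholder for your own read token.

from huggingface_hub import login

# Equivalent to `huggingface-cli login --token ...`.
login(token="hf_xxxxxxxxxxxxxxxxxxxx")  # placeholder: use your own read token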

Run the LocalGPT Model with Llama 2 13B Chat

To change the model, modify the LLM_ID value in the constants.py file. For this example, the default is meta-llama/Llama-2-13b-chat-hf.
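In other words, constants.py contains a model ID assignment along these lines (hypothetical excerpt; check the file for the exact surrounding code):

# constants.py (excerpt): Hugging Face model ID loaded by run_localGPT.py
LLM_ID = "meta-llama/Llama-2-13b-chat-hf"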

Since the example is interactive, it's a better experience to launch it from a terminal window. The run_localGPT.py script uses a local LLM (Llama 2) to understand questions and create answers. The context for the answers is extracted from the local vector store using a similarity search that locates the right pieces of the ingested documentation.
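Conceptually, that retrieval step looks roughly like the following sketch built from standard LangChain components; the actual run_localGPT.py may wire things differently, and the GaudiHuggingFaceEmbeddings import path is an assumption.

from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma
from gaudi_utils.embeddings import GaudiHuggingFaceEmbeddings  # assumed import path

# Reopen the vector store created by ingest.py and expose it as a retriever.
embeddings = GaudiHuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
db = Chroma(persist_directory="DB", embedding_function=embeddings)
retriever = db.as_retriever()

# `llm` is the LangChain-wrapped GaudiTextGenerationPipeline from the earlier sketch.
# The chain retrieves the most similar chunks and passes them to the model as context.
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)
print(qa.run("What is Article I?"))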


Note Inference runs in sampling mode, so you can optionally modify the temperature and top_p settings in run_localGPT.py, line 84, to change the output. The current settings are temperature=0.5 and top_p=0.5. To stop running the model, type exit at the prompt.

To start the chat, run the following in a terminal window:

python run_localGPT.py --device_type hpu

The following example shows the initial output:

python run_localGPT.py --device_type hpu



2023-10-10 23:29:55,812 - INFO - run_localGPT.py:186 - Running on: hpu

2023-10-10 23:29:55,812 - INFO - run_localGPT.py:187 - Display Source Documents set to: False

2023-10-10 23:29:56,315 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: all-MiniLM-L6-v2

2023-10-10 23:29:56,718 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: all-MiniLM-L6-v2

2023-10-10 23:29:56,922 - INFO - __init__.py:88 - Running Chroma using direct local API.

2023-10-10 23:29:56,931 - WARNING - __init__.py:43 - Using embedded DuckDB with persistence: data will be stored in: /root/Gaudi-tutorials/PyTorch/localGPT_inference/DB

2023-10-10 23:29:56,935 - INFO - ctypes.py:22 - Successfully imported ClickHouse Connect C data optimizations

2023-10-10 23:29:56,938 - INFO - json_impl.py:45 - Using python library for writing JSON byte strings

2023-10-10 23:29:57,183 - INFO - duckdb.py:460 - loaded in 6681 embeddings

2023-10-10 23:29:57,184 - INFO - duckdb.py:472 - loaded in 1 collections

2023-10-10 23:29:57,185 - INFO - duckdb.py:89 - collection with name langchain already exists, returning existing collection

2023-10-10 23:29:57,186 - INFO - run_localGPT.py:38 - Loading Model: meta-llama/Llama-2-13b-chat-hf, on: hpu

2023-10-10 23:29:57,186 - INFO - run_localGPT.py:39 - This action can take a few minutes!

/usr/local/lib/python3.8/dist-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.

warn("The installed version of bitsandbytes was compiled without GPU support. "

/usr/local/lib/python3.8/dist-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32

/usr/local/lib/python3.8/dist-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations

warnings.warn(

[2023-10-10 23:29:57,622] [INFO] [real_accelerator.py:123:get_accelerator] Setting ds_accelerator to hpu (auto detect)

[WARNING|utils.py:177] 2023-10-10 23:29:58,637 >> optimum-habana v1.7.5 has been validated for SynapseAI v1.11.0 but habana-frameworks v1.12.0.480 was found, this could lead to undefined behavior!

[WARNING|utils.py:190] 2023-10-10 23:29:59,786 >> optimum-habana v1.7.5 has been validated for SynapseAI v1.11.0 but the driver version is v1.12.0, this could lead to undefined behavior!

Fetching 3 files: 100%|████████████████████████| 3/3 [00:00<00:00, 17427.86it/s]

Loading checkpoint shards: 100%|██████████████████| 3/3 [00:02<00:00, 1.12it/s]

========= HABANA PT BRIDGE CONFIGURATION =======

PT_HPU_LAZY_MODE = 1

PT_RECIPE_CACHE_PATH =

PT_CACHE_FOLDER_DELETE = 0

PT_HPU_RECIPE_CACHE_CONFIG =

PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807

PT_HPU_LAZY_ACC_PAR_MODE = 1

PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0

-----------------------: System Configuration :----------------------

Num CPU Cores : 160

CPU RAM : 1056447244 KB

--------------------------------------------------------------------------

This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (4096). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.

2023-10-10 23:32:20,404 - INFO - run_localGPT.py:133 - Local LLM Loaded



Enter a query: what is the Article I ?

2023-10-23 19:47:28,598 - INFO - run_localGPT.py:240 - Query processing time: 1.5914111537858844s

> Question:

what is the Article I ?

> Answer:

It is the first article of the US constitution .



Enter a query: what does it say?

2023-10-23 19:47:36,684 - INFO - run_localGPT.py:240 - Query processing time: 1.872558546019718s

> Question:

what does it say?

> Answer:

The first article of the US constitution states "All legislative Powers herein granted shall be vested in a Congress of the United States, which shall consist of a Senate and House of Representatives."



Enter a query: What about article II

2023-10-23 20:34:42,818 - INFO - run_localGPT.py:240 - Query processing time: 1.6038263840600848s

> Question:

What about article II

> Answer:

Article II of the US constitution deals with the executive branch of government and establishes the office of the president.

Next Steps

To query and chat with your own content, add it to the SOURCE_DOCUMENTS folder and rerun the ingest.py script.

To experiment with different values to get different outputs, you can also modify the temperature and top_p values in the run_localGPT.py file, line 84:

pipe = GaudiTextGenerationPipeline(model_name_or_path=model_id, max_new_tokens=100, temperature=0.5, top_p=0.5, repetition_penalty=1.15, use_kv_cache=True, do_sample=True)

For more information on tokenization and padding, review the updated GaudiTextGenerationPipeline class in gaudi_utils/pipeline.py.

More Resources