Extracting your PDFs into Redis, with OpenAI embeddings

If you’re in the world of AI chaining, you’ll be hearing a lot about “vector databases”. These are a new form of database that stores vectors, also known as “embeddings”.

Fun fact: at Relevance, we built our vector database in 2020! We also own www.vectordatabase.com.

Embeddings are numerical representations of data that capture semantic meaning, allowing us to perform “vector similarity search”.

What does that mean? It means that we can take the user's search query, turn it into an embedding, and use mathematics to find similar content. By using embeddings, "apple juice" can be very similar to "orange juice", but very different from "Apple iPhone". This means that our search results can better understand the intent of the query.
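
To make that concrete, here’s a minimal sketch of what “similarity” means in practice, using the same OpenAI embeddings model we’ll use later and plain cosine similarity. The helper functions, API key placeholder and example strings here are just for illustration:

import openai
import numpy as np

# openai.api_key = "YOUR_OPENAI_API_KEY"  # placeholder

def embed(text):
    response = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(response["data"][0]["embedding"])

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

apple_juice = embed("apple juice")
orange_juice = embed("orange juice")
iphone = embed("Apple iPhone")

print(cosine_similarity(apple_juice, orange_juice))  # relatively high
print(cosine_similarity(apple_juice, iphone))        # noticeably lower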

By turning our knowledge base into embeddings, we can find the most relevant pieces of content to answer the user's question, even if their question doesn't exactly match the keywords in the content.

Once we have the relevant pieces, we can feed them into the LLM prompt to help GPT answer the user's question!
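
As a rough sketch (the template and variable names are purely illustrative, not the exact prompt a Relevance chain builds), stuffing retrieved chunks into a prompt looks something like this:

# Hypothetical example of feeding retrieved chunks into an LLM prompt
retrieved_chunks = ["<text of chunk 1>", "<text of chunk 2>"]
question = "What is the policy on hot weather?"

prompt = (
    "Answer the question using only the context below.\n\n"
    "Context:\n"
    + "\n".join(f"- {chunk}" for chunk in retrieved_chunks)
    + f"\n\nQuestion: {question}\nAnswer:"
)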

The good news is, I’ve built a notebook that does all of this for you! Just supply your PDF URLs, or whatever other data is relevant: https://colab.research.google.com/drive/1TyGx_NuN21UtcYDx7qDHKCfblaZ0vh6l?usp=sharing

Let’s break it down!

Let’s parse our PDFs

To extract text from our PDF URLs, we use pdfplumber and create a list of documents, each containing the page text and the pdf_url. Later, we can use this URL to show references in our chat interface.

pdf_urls = [
    'https://raw.githubusercontent.com/RelevanceAI/sample-datasets/main/PDF/football-association-regulations/18-0102-FFA-Concussion-Guidelines-final.pdf',
    'https://raw.githubusercontent.com/RelevanceAI/sample-datasets/main/PDF/football-association-regulations/fnsw_hot_weather_policy.pdf',
    'https://raw.githubusercontent.com/RelevanceAI/sample-datasets/main/PDF/football-association-regulations/football_nsw_registration_tcs.pdf',
    'https://raw.githubusercontent.com/RelevanceAI/sample-datasets/main/PDF/football-association-regulations/nwsf_code_of_conduct-1.pdf',
    'https://raw.githubusercontent.com/RelevanceAI/sample-datasets/main/PDF/football-association-regulations/nwsf_judiciary_code_of_conduct.pdf',
]

import pdfplumber
import requests
import io

# Load PDFs and extract text into documents
pdf_documents = []

for url in pdf_urls:
  response = requests.get(url)
  pdf_data = io.BytesIO(response.content)
  with pdfplumber.open(pdf_data) as pdf:
    for page in pdf.pages:
      pdf_documents.append({
          # extract_text() can return None for pages with no extractable text
          "text": page.extract_text() or "",
          "pdf_url": url,
      })

Split up our PDF text into “chunks”

To make our vector search plan work, we need to break up the text from the PDF into smaller pieces. This allows the vector search to return only the specific pieces of the PDF that are relevant to the user's question, which can then be fed into the LLM, rather than using the entire PDF text.

We could split the text up by characters, but that would mean sentences could be arbitrarily cut in half.

Instead, we want to split up the content by sentences, using a token limit as a backstop. To do this, we have written some utility functions, aided by tiktoken, OpenAI’s own tokenizer library.

# Break each document into an array of smaller documents, each under a max token length
split_pdf_documents = []

# OpenAI's tokenizer library, used to count and split tokens
import tiktoken
tiktokenInstance = tiktoken.encoding_for_model("text-davinci-003")

def tokenize(text):
    return tiktokenInstance.encode(text)

def detokenize(tokens):
    return tiktokenInstance.decode(tokens)

# Utility for splitting based on sentence
from sentence_splitter import split_text_into_sentences

# Splits PDF text intelligently, based on sentences and tokens
def chunks_to_list(lst, n):
    new_lst = []
    for i in range(0, len(lst), n):
        new_lst.append(lst[i : i + n])
    return new_lst

def token_split(text, max_tokens=4060):
    tokens = tokenize(text)
    return [detokenize(c) for c in chunks_to_list(tokens, max_tokens)]

def token_and_sentence_split(text, max_tokens=512, window_size=0, max_sentence_tokens=512):
    raw_sentences = split_text_into_sentences(text, "en")
    chunk_token_len = 0
    chunk_texts = []
    splitted = []
    sentences = []
    # make sure no single sentence is longer than the max sentence token length
    for sentence in raw_sentences:
        if len(tokenize(sentence)) > max_sentence_tokens:
            sentences += token_split(sentence, max_sentence_tokens)
        else:
            sentences.append(sentence)

    for sentence in sentences:
        token_len = len(tokenize(sentence))
        # if adding the next sentence would exceed the max token length, add the current chunk to the list and start a new chunk
        if chunk_token_len + token_len > max_tokens:
            if chunk_texts:
                splitted.append(" ".join(chunk_texts))
            chunk_texts = []
            chunk_token_len = 0
        # otherwise, add the sentence to the current chunk
        chunk_texts.append(sentence)
        chunk_token_len += token_len
    # add the last chunk to the list
    if chunk_texts:
        splitted.append(" ".join(chunk_texts))
    # remove empty strings
    splitted = [s for s in splitted if s]
    return splitted

Next, we process our extracted PDFs using these functions to create a list of split documents!

# Import uuid to generate IDs for the documents
import uuid
for pdf in pdf_documents:
  raw_text = pdf.get('text')
  text_split = token_and_sentence_split(raw_text)
  for text in text_split:
    generated_uuid = uuid.uuid4()
    uuid_str = str(generated_uuid)
    split_pdf_documents.append({
        **pdf,
        "text": text,
        "id": uuid_str
    })

Hitting OpenAI to create embeddings

Now that we have our documents filled with text chunks from the PDFs, we need to create embeddings for each chunk. To accomplish this, we'll use OpenAI's library and encode the text using the text-embedding-ada-002 model. It's important to note that this should be the same model we use for searching in the Relevance chain.

# Now, let's create some embeddings for our split PDF chunks!
import openai
# openai.api_key = "YOUR_OPENAI_API_KEY"  # placeholder, if not already set earlier in the notebook

split_pdf_documents_with_embeddings = []

for document in split_pdf_documents:
  openAiResponse = openai.Embedding.create(
    model="text-embedding-ada-002",
    input=document.get('text')
  )

  embedding = openAiResponse.data[0].embedding

  split_pdf_documents_with_embeddings.append({
      **document,
      "embedding": embedding
  })
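
One API call per chunk is fine for a handful of PDFs, but the Embeddings endpoint also accepts a list of inputs, so if you have many chunks a batched variant could look roughly like this (the batch size is arbitrary):

batch_size = 100
split_pdf_documents_with_embeddings = []

for start in range(0, len(split_pdf_documents), batch_size):
    batch = split_pdf_documents[start:start + batch_size]
    response = openai.Embedding.create(
        model="text-embedding-ada-002",
        input=[doc["text"] for doc in batch],
    )
    # The API returns one embedding per input, tagged with its original index
    for item in response["data"]:
        split_pdf_documents_with_embeddings.append({
            **batch[item["index"]],
            "embedding": item["embedding"],
        })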

Insert into Redis

Redis is not only one of the world's best caching databases, but also one of the best vector databases. Let's pipe our documents into Redis!
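
Note that vector search needs Redis Stack (or the RediSearch module) rather than plain Redis. The notebook assumes a redis_client has already been created; outside the notebook, setting one up looks roughly like this (host, port and password are placeholders for your own instance):

import redis

# Placeholder connection details; swap in your own Redis connection string / credentials
redis_client = redis.Redis(
    host="your-redis-host",
    port=6379,
    password="your-redis-password",
)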

# Insert documents into Redis
pipe = redis_client.pipeline()

row_count = -1
for doc in split_pdf_documents_with_embeddings:
    row_count += 1
    # Note, we use this prefix for referencing later when creating the search index. You can change knowledge_base to whatever you like!
    pipe.json().set(f"knowledge_base:{doc['id']}", '$', doc)
    if row_count % 500 == 0:
        pipe.execute()

pipe.execute()

Now, let's create an index that will enable us to perform vector similarity search and retrieve relevant documents. To do this, we need to run the following raw command:


# Note, we are referencing the index prefix (knowledge_base:) referenced earlier in the document insert
redis_client.execute_command("FT.CREATE knowledge_base ON JSON PREFIX 1 knowledge_base: SCHEMA $.id AS id TEXT $.text AS text TEXT $.embedding AS embedding VECTOR HNSW 6 DIM 1536 DISTANCE_METRIC L2 TYPE FLOAT32")
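
To sanity-check the index (and to see roughly what the Redis vector search step in your chain does under the hood), here's a sketch of a KNN query with redis-py. The field names match the schema above; the question text and result handling are just illustrative:

import numpy as np
from redis.commands.search.query import Query

# Embed the question with the same model used for the documents
question_embedding = openai.Embedding.create(
    model="text-embedding-ada-002",
    input="What happens if a player gets a concussion?"
).data[0].embedding
query_vector = np.array(question_embedding, dtype=np.float32).tobytes()

# Ask for the 3 nearest chunks by vector distance
query = (
    Query("*=>[KNN 3 @embedding $vec AS score]")
    .sort_by("score")
    .return_fields("text", "pdf_url", "score")
    .dialect(2)
)
results = redis_client.ft("knowledge_base").search(query, query_params={"vec": query_vector})

for doc in results.docs:
    print(doc.score, doc.pdf_url)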

Click “Run All” on this Notebook, and you’ll now have a Redis-powered knowledge base. Make sure you put your Redis connection string into Relevance AI’s API key sidebar too, so your chain can access it, and select the knowledge_base index in the Redis vector search step.