Building a Retrieval-Augmented Generation (RAG) System Over PDFs Using Docling and Weaviate

TL;DR

Enhance the capabilities of language models with Retrieval-Augmented Generation (RAG) by combining document parsing, vector storage, and semantic search. This guide shows you how to set up a complete RAG pipeline using Docling for PDF parsing and Weaviate for vector storage, with OpenAI embeddings powering semantic search.


Introduction

Large language models (LLMs) like GPT-4 excel at generating text but lack direct access to external, domain-specific knowledge. Retrieval-Augmented Generation (RAG) bridges this gap by integrating LLMs with semantic search over external documents.

Imagine querying a dense, unstructured PDF and getting precise, context-aware answers from your LLM. This article walks you through building a RAG system step by step, using Docling for PDF parsing, Weaviate for vector-based storage and retrieval, and OpenAI embeddings for semantic understanding.


Prerequisites

Before starting, ensure you have the following:

  • Python installed with the following packages:
    • docling
    • weaviate-client
    • rich
    • torch
    • openai
  • An OpenAI API key for embedding generation (see the note after this list).
  • Access to a GPU for efficient processing (optional but recommended).
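
For the API key, a common pattern is to read it from an environment variable rather than hard-coding it in scripts. A minimal sketch (OPENAI_API_KEY is the variable name the OpenAI SDK looks for by default, but this snippet is only one way to handle it):

import os

# Read the OpenAI API key from the environment instead of embedding it in code
openai_api_key = os.environ["OPENAI_API_KEY"]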

Step 1: Environment Setup

Install the required Python packages:

pip install docling~=2.7.0
pip install -U weaviate-client~=4.9.4
pip install rich
pip install torch
pip install openai
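
If you plan to use a GPU, you can quickly verify that PyTorch detects it before running the heavier parsing steps; everything below also works on CPU, just more slowly:

import torch

# Prints True when a CUDA-capable GPU is visible to PyTorch
print("CUDA available:", torch.cuda.is_available())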

Step 2: Parsing PDFs with Docling

Docling is a versatile tool for parsing and structuring documents.

Convert PDFs to Docling Documents

Extract text from a PDF with the following code:

from docling.document_converter import DocumentConverter

# Convert the PDF into a DoclingDocument
converter = DocumentConverter()
result = converter.convert("path_to_your_pdf.pdf")
doc = result.document

# Access the extracted text (exported as Markdown)
text = doc.export_to_markdown()

Perform Hierarchical Chunking

Split the document into manageable chunks for better semantic analysis:

from docling_core.transforms.chunker import HierarchicalChunker

# Chunk the document into hierarchy-aware sections
chunker = HierarchicalChunker()
chunks = list(chunker.chunk(dl_doc=doc))

# Access each chunk
for chunk in chunks:
    print(chunk.text)



Step 3: Setting Up Weaviate

Weaviate is a vector database that powers semantic search.

Create and Configure a Weaviate Collection

Define a collection for storing document chunks. The code below uses the v4 weaviate-client API, matching the version installed in Step 1:

import weaviate
from weaviate.classes.config import Configure, DataType, Property

# Connect to a locally running Weaviate instance (defaults to http://localhost:8080)
client = weaviate.connect_to_local()

# Create the collection; vectors are supplied by us, so no built-in vectorizer is configured
client.collections.create(
    name="DocumentChunk",
    vectorizer_config=Configure.Vectorizer.none(),
    properties=[
        Property(name="content", data_type=DataType.TEXT),
        Property(name="document_id", data_type=DataType.TEXT),
    ],
)



Step 4: Generating Embeddings with OpenAI

Generate semantic embeddings for the document chunks:

from openai import OpenAI

# Initialise the OpenAI client (or omit api_key and set the OPENAI_API_KEY environment variable)
openai_client = OpenAI(api_key="your_openai_api_key")

def generate_embedding(text):
    response = openai_client.embeddings.create(
        input=text,
        model="text-embedding-ada-002"
    )
    return response.data[0].embedding

Step 5: Inserting Data into Weaviate

Store the parsed document chunks and their embeddings in Weaviate:

collection = client.collections.get("DocumentChunk")

# Batch-import each chunk together with its embedding vector
with collection.batch.dynamic() as batch:
    for chunk in chunks:
        properties = {
            "content": chunk.text,
            "document_id": "your_document_id"
        }
        vector = generate_embedding(chunk.text)
        batch.add_object(properties=properties, vector=vector)



Step 6: Querying the Data

Retrieve relevant document chunks based on a query:

query = "Your search query"
query_vector = generate_embedding(query)

collection = client.collections.get("DocumentChunk")

# Vector similarity search over the stored chunks
result = collection.query.near_vector(
    near_vector=query_vector,
    limit=5,
)

for item in result.objects:
    print(item.properties["content"])

Step 7: Performing Retrieval-Augmented Generation (RAG)

Feed the retrieved chunks to the LLM as context for the final answer:

# Concatenate the retrieved chunks into a single context string
context = "\n".join([item.properties["content"] for item in result.objects])

prompt = f"Context: {context}\n\nQuestion: {query}\nAnswer:"

# The retired text-davinci-003 completion model is replaced here with a current chat model
response = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=150
)

print(response.choices[0].message.content.strip())

# Close the Weaviate connection when you are done
client.close()

Best Practices for Implementing RAG

  1. Optimize Chunking: Adjust chunk sizes to balance semantic richness and storage efficiency (see the sketch after this list).
  2. Index Strategically: Use meaningful document IDs and metadata for better traceability.
  3. Validate Context: Ensure the retrieved data is relevant to avoid diluting LLM performance.
  4. Monitor Costs: Track API usage for embedding generation to manage expenses.
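
As a concrete illustration of the first point, here is a small, hypothetical helper (merge_small_chunks is not part of Docling or Weaviate) that merges adjacent chunk texts up to a rough character budget before embedding; tune max_chars for your corpus:

# Hypothetical helper for coarser chunks, not part of Docling or Weaviate
def merge_small_chunks(chunks, max_chars=1000):
    merged, buffer = [], ""
    for chunk in chunks:
        if len(buffer) + len(chunk.text) <= max_chars:
            buffer += ("\n" if buffer else "") + chunk.text
        else:
            if buffer:
                merged.append(buffer)
            buffer = chunk.text
    if buffer:
        merged.append(buffer)
    return merged

# Embed the merged texts instead of the raw chunk objects
texts_to_embed = merge_small_chunks(chunks)

Larger chunks carry more context per embedding but blur fine-grained matches, so it is worth evaluating a few budgets against your own queries.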

Conclusion

By implementing this RAG system, you’ve unlocked the ability to query dense, unstructured PDFs with precision. This integration of Docling and Weaviate enhances LLM capabilities, making them more informed and context-aware.

Next Steps:

  • Experiment with more complex document sets.
  • Explore additional vector database features in Weaviate.
  • Extend the system to handle multimodal inputs like images or tables (see the sketch below).
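
As a possible starting point for the tables item, Docling keeps the tables it parsed on the converted document; a minimal sketch, assuming your PDF actually contains tables and that pandas is installed (the exact export API may differ between Docling versions):

# Inspect the tables Docling extracted from the converted document
for table in doc.tables:
    df = table.export_to_dataframe()  # pandas DataFrame of the table contents
    print(df.head())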
