TL;DR
Enhance the capabilities of language models with Retrieval-Augmented Generation (RAG) by combining document parsing, vector storage, and semantic search. This guide shows you how to set up a complete RAG pipeline using Docling for PDF parsing and Weaviate for vector storage, with OpenAI embeddings powering semantic search.
Introduction
Large language models (LLMs) like GPT-4 excel at generating text but lack direct access to external, domain-specific knowledge. Retrieval-Augmented Generation (RAG) bridges this gap by integrating LLMs with semantic search over external documents.
Imagine querying a dense, unstructured PDF and getting precise, context-aware answers from your LLM. This article walks you through building a RAG system step by step, using Docling for PDF parsing, Weaviate for vector-based storage and retrieval, and OpenAI embeddings for semantic understanding.
Prerequisites
Before starting, ensure you have the following:
- Python installed with the following packages:
  - docling
  - weaviate-client
  - rich
  - torch
- An OpenAI API key for embedding generation.
- Access to a GPU for efficient processing (optional but recommended; a quick availability check is shown below).
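If you plan to use a GPU, you can confirm that PyTorch detects it before running any parsing jobs. A minimal check:

import torch

# Report whether a CUDA-capable GPU is visible to PyTorch
if torch.cuda.is_available():
    print(f"GPU available: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU detected; processing will fall back to CPU.")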
Step 1: Environment Setup
Install the required Python packages:
pip install docling~=2.7.0
pip install -U weaviate-client~=4.9.4
pip install rich
pip install torch
Step 2: Parsing PDFs with Docling
Docling is a versatile tool for parsing and structuring documents.
Convert PDFs to Docling Documents
Extract text from a PDF with the following code:
from docling.document_converter import DocumentConverter

# Convert the PDF into a structured DoclingDocument
converter = DocumentConverter()
result = converter.convert("path_to_your_pdf.pdf")
doc = result.document

# Access the extracted content, for example as Markdown
text = doc.export_to_markdown()
Perform Hierarchical Chunking
Split the document into manageable chunks for better semantic analysis:
from docling_core.transforms.chunker import HierarchicalChunker

# Chunk the document along its structural hierarchy (sections, paragraphs, lists)
chunker = HierarchicalChunker()
chunks = list(chunker.chunk(dl_doc=doc))

# Access each chunk
for chunk in chunks:
    print(chunk.text)
Step 3: Setting Up Weaviate
Weaviate is a vector database that powers semantic search. The examples below use the v4 Python client (weaviate-client 4.x) and assume a Weaviate instance running locally on the default port, for example via Docker.
Create and Configure a Weaviate Collection
Create a collection for storing document chunks and their metadata:

import weaviate
from weaviate.classes.config import Configure, DataType, Property

# Connect to a locally running Weaviate instance (default: http://localhost:8080)
client = weaviate.connect_to_local()

# Vectors are supplied manually, so no built-in vectorizer is configured
client.collections.create(
    name="DocumentChunk",
    vectorizer_config=Configure.Vectorizer.none(),
    properties=[
        Property(name="content", data_type=DataType.TEXT),
        Property(name="document_id", data_type=DataType.TEXT),
    ],
)
Step 4: Generating Embeddings with OpenAI
Generate semantic embeddings for the document chunks:
from openai import OpenAI

# Create an OpenAI API client (requires the openai package, v1 or later)
openai_client = OpenAI(api_key="your_openai_api_key")

def generate_embedding(text):
    # Request an embedding vector for a single piece of text
    response = openai_client.embeddings.create(
        input=text,
        model="text-embedding-ada-002",
    )
    return response.data[0].embedding
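Embedding chunks one at a time works, but the embeddings endpoint also accepts a list of inputs, which cuts down on round trips for larger documents. A minimal sketch with a hypothetical batch helper, assuming each text stays within the model's input limit:

def generate_embeddings_batch(texts):
    # The endpoint returns one embedding per input, in the same order
    response = openai_client.embeddings.create(
        input=texts,
        model="text-embedding-ada-002",
    )
    return [item.embedding for item in response.data]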
Step 5: Inserting Data into Weaviate
Store the parsed document chunks and their embeddings in Weaviate:
# Insert chunks and their embeddings into the DocumentChunk collection
document_chunks = client.collections.get("DocumentChunk")

with document_chunks.batch.dynamic() as batch:
    for chunk in chunks:
        properties = {
            "content": chunk.text,
            "document_id": "your_document_id",
        }
        vector = generate_embedding(chunk.text)
        batch.add_object(properties=properties, vector=vector)
Step 6: Querying the Data
Retrieve relevant document chunks based on a query:
query = "Your search query"
query_vector = generate_embedding(query)

# Retrieve the five chunks whose vectors are closest to the query vector
result = document_chunks.query.near_vector(
    near_vector=query_vector,
    limit=5,
)

for obj in result.objects:
    print(obj.properties["content"])
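If you also want to see how close each match is, the v4 client can return the vector distance alongside each object. A short sketch:

from weaviate.classes.query import MetadataQuery

# Same query, but also return the distance of each match (cosine by default)
result = document_chunks.query.near_vector(
    near_vector=query_vector,
    limit=5,
    return_metadata=MetadataQuery(distance=True),
)

for obj in result.objects:
    print(f"{obj.metadata.distance:.4f}  {obj.properties['content'][:80]}")

Lower distances indicate closer matches, which helps when judging whether the retrieved chunks are actually relevant to the query.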
Step 7: Performing Retrieval-Augmented Generation (RAG)
Integrate the retrieved data into your LLM for enhanced responses:
# Assemble the retrieved chunks into a context block for the prompt
context = "\n".join(obj.properties["content"] for obj in result.objects)

prompt = f"Context: {context}\n\nQuestion: {query}\nAnswer:"

response = openai_client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=150,
)

print(response.choices[0].message.content.strip())
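To reuse this flow, you can wrap retrieval and generation into a single helper. The function below is a minimal sketch assembled from the pieces above (the name rag_answer and the prompt format are illustrative); adapt the prompt, model, and limits to your use case:

def rag_answer(question, top_k=5):
    # Retrieve the chunks most similar to the question
    question_vector = generate_embedding(question)
    hits = document_chunks.query.near_vector(near_vector=question_vector, limit=top_k)
    context = "\n".join(obj.properties["content"] for obj in hits.objects)

    # Ask the model to answer using the retrieved context
    prompt = f"Context: {context}\n\nQuestion: {question}\nAnswer:"
    response = openai_client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=150,
    )
    return response.choices[0].message.content.strip()

print(rag_answer("Your search query"))

When you are done, call client.close() to release the Weaviate connection.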
Best Practices for Implementing RAG
- Optimize Chunking: Adjust chunk sizes to balance semantic richness and storage efficiency.
- Index Strategically: Use meaningful document IDs and metadata for better traceability.
- Validate Context: Ensure the retrieved data is relevant to avoid diluting LLM performance.
- Monitor Costs: Track API usage for embedding generation to manage expenses; a simple token-counting sketch is shown below.
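One lightweight way to keep an eye on embedding costs is to count tokens before sending text to the API. The sketch below uses the tiktoken library (installed separately with pip install tiktoken); the per-token price is a placeholder you would replace with the current rate for your model:

import tiktoken

# Tokenizer matching the embedding model
encoding = tiktoken.encoding_for_model("text-embedding-ada-002")

total_tokens = sum(len(encoding.encode(chunk.text)) for chunk in chunks)
print(f"Total embedding tokens: {total_tokens}")

# Rough cost estimate; PRICE_PER_1K_TOKENS is a placeholder value in USD
PRICE_PER_1K_TOKENS = 0.0001
print(f"Estimated embedding cost: ${total_tokens / 1000 * PRICE_PER_1K_TOKENS:.4f}")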
Conclusion
By implementing this RAG system, you’ve unlocked the ability to query dense, unstructured PDFs with precision. This integration of Docling and Weaviate enhances LLM capabilities, making them more informed and context-aware.
Next Steps:
- Experiment with more complex document sets.
- Explore additional vector database features in Weaviate.
- Extend the system to handle multimodal inputs like images or tables.