LIBRAG is a Retrieval-Augmented Generation (RAG) system designed to improve access to the Boston Public Library’s (BPL) Digital Commonwealth archive. It enables users to ask natural language questions and retrieve highly relevant documents, complete with metadata, images, and audio entries.
The Boston Public Library needed a better way for users to explore their extensive digital collection through intuitive search methods rather than traditional keyword queries.
LIBRAG addresses this need by using a RAG pipeline that retrieves documents semantically and generates informative, source-backed responses.
Key enhancements to the project include:
- Full ingestion and embedding of over 1.2 million metadata entries.
- Multi-field embedding (title, abstract, creator, date, and combined fields).
- BM25 reranking of search results to improve answer relevance.
- Support for retrieving and displaying images and audio metadata in search results.
- A user-friendly Streamlit UI for seamless interaction.
Try the hosted demo here:
👉 Hugging Face Spaces - BPL-RAG-Spring-2025
The system is organized into two main pipelines: Data Preparation and Query Handling.

Data Preparation:
- Scrape metadata from the Digital Commonwealth API.
- Preprocess fields (title, abstract, creator, date).
- Combine key fields into a single text block for better embedding.
- Embed records using the `all-mpnet-base-v2` model from Hugging Face.
- Store embeddings in a Pinecone vector database.
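As a rough sketch of this ingestion step (the record structure and field handling below are simplified assumptions, not the exact `load_pinecone.py` logic):

```python
import os

from pinecone import Pinecone
from sentence_transformers import SentenceTransformer

# Embedding model and vector store used by the pipeline.
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index(os.environ["PINECONE_INDEX_NAME"])

# Placeholder records; the real pipeline reads these from the scraped JSON.
records = [
    {"id": "bpl-0001", "title": "Boston Harbor, 1890",
     "abstract": "Photograph of sailboats in the harbor.",
     "creator": "Unknown", "date": "1890"},
]

for rec in records:
    # Combine the key fields into a single text block, as described above.
    combined = " | ".join(rec[f] for f in ("title", "abstract", "creator", "date"))
    vector = model.encode(combined).tolist()  # 768-dimensional embedding
    index.upsert(vectors=[(rec["id"], vector, rec)])
```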
Query Handling:
- Accept the user's natural language query through the Streamlit UI.
- Preprocess and embed the query.
- Retrieve top matching documents from Pinecone.
- Rerank retrieved documents using BM25 based on metadata relevance.
- Generate a final response using GPT-4o-mini and present retrieved sources, including image previews and grouped audio metadata when applicable.
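Reduced to its essentials, the query path might look like the following sketch (reranking is illustrated separately in the models section; names and the prompt are illustrative):

```python
import os

from openai import OpenAI
from pinecone import Pinecone
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index(os.environ["PINECONE_INDEX_NAME"])
client = OpenAI()  # reads OPENAI_API_KEY from the environment

query = "photographs of Boston Harbor in the 1890s"
results = index.query(vector=model.encode(query).tolist(), top_k=10, include_metadata=True)

# Build a simple context string from the retrieved metadata titles.
context = "\n".join((m.metadata or {}).get("title", "") for m in results.matches)
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user",
               "content": f"Answer using these sources:\n{context}\n\nQuestion: {query}"}],
)
print(response.choices[0].message.content)
```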
Key features:
- Large-scale Embedding: Successfully processed over 1.2 million items from BPL’s archive.
- Multi-Field Retrieval: Embedding includes titles, abstracts, creators, and dates for richer semantic search.
- BM25 Reranking: Reorders retrieved documents to prioritize field matches (e.g., better date or title alignment).
- Image and Audio Integration:
- Images are retrieved and displayed alongside text-based results.
- Audio metadata records are reconstructed from grouped fragments for full context.
- Cost Optimization: Open-source embeddings and the GPT-4o-mini model are used to minimize API costs while maintaining quality.
- Hosting: Deployed publicly through Hugging Face Spaces for accessibility.
This project processes both images and audio metadata alongside the textual metadata from the Digital Commonwealth archive.
- Images: If you enable image processing (by not using the `--no-images` flag), the system fetches image metadata and processes captions or other image-related descriptions. The `DigitalCommonwealthScraper` is used to scrape image URLs, and the `ImageCaptioner` analyzes the content, providing relevant captions and tags. These images and their metadata are included in the search results, making it easier for users to find visual references along with the text-based content.
- Audio: If you enable audio processing (by not using the `--no-audio` flag), the system fetches and embeds audio files associated with metadata records. The `AudioEmbedder` is responsible for embedding the audio content into the system, ensuring that audio files are accessible and can be linked alongside the retrieved results. Audio content is treated as metadata and, when available, is displayed in the search results.
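As one illustration of the grouping idea mentioned above, reassembling audio metadata fragments into a full record might look like this sketch (the fragment structure is a placeholder assumption, not the repository's actual schema):

```python
from collections import defaultdict

# Placeholder fragments: audio metadata split across several entries
# that share a parent record id (illustrative structure only).
fragments = [
    {"record_id": "bpl-audio-01", "part": 2, "text": "Side B of the interview."},
    {"record_id": "bpl-audio-01", "part": 1, "text": "Side A of the interview."},
]

grouped = defaultdict(list)
for frag in fragments:
    grouped[frag["record_id"]].append(frag)

# Reassemble each record's fragments in order for full context.
records = {
    rid: " ".join(f["text"] for f in sorted(parts, key=lambda f: f["part"]))
    for rid, parts in grouped.items()
}
print(records)
```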
- Hugging Face Embeddings (`all-mpnet-base-v2`): We use the `all-mpnet-base-v2` model from Hugging Face to embed metadata, enabling semantic search. The model is trained for general language understanding tasks and efficiently represents content in a vector space that supports similarity-based retrieval.
- OpenAI GPT-4o-mini: The GPT-4o-mini model handles natural language understanding and response generation. It enables the system to interpret user queries and produce meaningful summaries based on the retrieved documents.
- BM25: We use the BM25 ranking algorithm to improve the relevance of search results. It ranks the retrieved documents by keyword matches and their importance within the context of the query (see the sketch below).
By using these models together, the system provides accurate, contextually relevant results, enhancing the experience of querying historical documents from the Digital Commonwealth archive.
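As a concrete illustration of the reranking step, here is a minimal sketch using the `rank_bm25` package (a common BM25 implementation; the project's internal ranker and tokenization may differ):

```python
from rank_bm25 import BM25Okapi

# Toy corpus of retrieved metadata strings (placeholders).
docs = [
    "Boston Harbor photograph 1890",
    "Map of Cambridge 1905",
    "Boston Public Library reading room 1895",
]
tokenized_docs = [d.lower().split() for d in docs]
bm25 = BM25Okapi(tokenized_docs)

query_tokens = "boston harbor 1890".split()
scores = bm25.get_scores(query_tokens)  # one BM25 score per document

# Reorder documents so the strongest keyword matches come first.
reranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
for doc, score in reranked:
    print(f"{score:6.3f}  {doc}")
```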
Prerequisites:
- Python 3.12.4 or later
- Pinecone account for vector storage
- OpenAI account for LLM API access (or alternative model integration)
Clone the repository and set up a virtual environment:
python -m venv .venv
source .venv/bin/activate # Mac/Linux
.venv\Scripts\activate # Windows
Install dependencies:
pip install -r requirements.txt
Configure your `.env` file:
PINECONE_API_KEY=your-pinecone-api-key
OPENAI_API_KEY=your-openai-api-key
PINECONE_INDEX_NAME=your-pinecone-index-name
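At startup, these values can be read with `python-dotenv`, for example (a minimal sketch, assuming `python-dotenv` is installed):

```python
import os

from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env in the working directory

pinecone_key = os.getenv("PINECONE_API_KEY")
openai_key = os.getenv("OPENAI_API_KEY")
index_name = os.getenv("PINECONE_INDEX_NAME")
```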
Run the following to download metadata:
python load_scraper.py <BEGIN_PAGE> <END_PAGE>
This will save a `.json` file of metadata entries for the specified page range.
Run the following to upload the embeddings:
python load_pinecone.py <BEGIN_INDEX> <END_INDEX> <PATH_TO_JSON> <PINECONE_INDEX_NAME> [--static] [--no-images] [--no-audio]
- `--static`: Process the data without dynamic fetching from the Digital Commonwealth API (e.g., for faster processing when metadata is already pre-fetched).
- `--no-images`: Skip image processing (useful when you don't want to fetch or process images).
- `--no-audio`: Skip audio embedding (useful if your data doesn't contain any audio or you want to speed up processing).
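For example, to embed entries 0–5000 from a pre-fetched file while skipping media processing (the range, path, and index name below are placeholders):
python load_pinecone.py 0 5000 data/metadata.json my-bpl-index --static --no-images --no-audio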
Make sure your Pinecone index is created beforehand with the correct vector dimension (768 for `all-mpnet-base-v2`).
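If you need to create the index programmatically, a minimal sketch with the Pinecone Python client (v3+; the cloud and region values are placeholders):

```python
import os

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
pc.create_index(
    name=os.environ["PINECONE_INDEX_NAME"],
    dimension=768,  # output size of all-mpnet-base-v2
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),  # placeholder values
)
```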
Launch the application:
streamlit run streamlit_app.py
The app will be available at http://localhost:8501/.
Response times may vary depending on dataset size and retrieval/re-ranking operations.
- Speed: Retrieval and metadata enrichment can take 10–40 seconds due to sequential Digital Commonwealth API calls.
- Metadata Bottleneck: After retrieving vector matches, full metadata is fetched live for reranking, adding delay.
- Scaling Costs: Pinecone costs scale with the volume of embedded vectors; full ingestion of the archive is costly.
Planned improvements:
- Build a pre-cached local metadata database to eliminate live API calls.
- Add native audio playback support in Streamlit using `st.audio()` (see the sketch after this list).
- Enhance query classification to prioritize image or audio results when contextually appropriate.
- Extend metadata grouping logic for additional media types (e.g., manuscripts, videos).
- Improve retrieval robustness by refining prompt engineering and query parsing.
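For the audio playback item, a minimal sketch of what `st.audio()` usage could look like (the URL is a placeholder):

```python
import streamlit as st

# st.audio accepts a URL, file path, bytes, or file-like object.
audio_url = "https://example.org/interview.mp3"  # placeholder URL
st.audio(audio_url, format="audio/mp3")
```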
This project is licensed under the MIT License.