Research code for evaluating Retrieval-Augmented Generation (RAG) parameters on enterprise technical documentation.
This repository contains the evaluation framework used to systematically study the impact of chunk size and embedding model choice on RAG performance for enterprise technical documentation. The study uses SAP Business One documentation (4,544 pages across 51 PDF files) as a representative corpus for enterprise ERP documentation.
- What chunk sizes optimize RAG performance for enterprise documentation across different embedding models?
- Which embedding model performs best with enterprise documentation?
- How do chunk parameters and embedding models interact?
- How much do optimized RAG configurations improve over traditional IR baselines?
- Optimal chunk sizes vary significantly by embedding model: 750 characters (E5) to 3,000 characters (Gemini 004)
- Dense retrieval consistently outperforms sparse baselines: 18 of 19 dense configurations exceeded the best BM25 baseline
- Gemini text-embedding-004 achieved best overall performance: 0.846 nv_accuracy with 3,000-character chunks
- Retrieval quality is the primary determinant of answer quality: Strong correlation (ρ ≈ 0.97) between context precision and end-to-end accuracy
Since no public Q&A dataset exists for SAP Business One documentation, this study employs the RAGAS TestsetGenerator framework to create corpus-specific evaluation data:
Knowledge Graph Construction:
- Documents processed and added as graph nodes
- LLM-based headline extraction for logical sectioning
- Semantic clustering via embedding similarity
- Named entity recognition for cross-document connections
Question-Answer Synthesis:
- Single-hop queries: Direct information from one document section
- Multi-hop queries: Require combining information across multiple sections
- Dual-LLM approach: Gemini 2.0 Flash for scenario generation, Claude 3.7 Sonnet for Q&A synthesis
- 200 questions total ensuring comprehensive corpus coverage
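The generation pipeline is driven by the RAGAS TestsetGenerator. The sketch below illustrates the basic flow; it assumes a RAGAS 0.2-style API and illustrative model names and paths, and the dual-LLM split described above is handled inside the notebook rather than shown here.

```python
# Minimal sketch of synthetic testset generation with RAGAS.
# Assumes a RAGAS 0.2-style API; wrapper classes and keyword arguments
# differ between RAGAS releases, and model names/paths here are illustrative.
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.testset import TestsetGenerator

# Load the enterprise PDF corpus (path is illustrative).
documents = PyPDFDirectoryLoader("data/documents/").load()

# Wrap the generator LLM and embedding model for RAGAS.
generator_llm = LangchainLLMWrapper(ChatGoogleGenerativeAI(model="gemini-2.0-flash"))
generator_embeddings = LangchainEmbeddingsWrapper(
    GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")
)

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)

# Build the knowledge graph and synthesize single-hop and multi-hop questions.
testset = generator.generate_with_langchain_docs(documents, testset_size=200)
testset.to_pandas().to_json("data/test_data.jsonl", orient="records", lines=True)
```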
- multilingual-E5-large: 560M parameters, contrastive learning
- Nomic Embed V2-MoE: Mixture-of-experts with 8 expert blocks
- BGE-M3: Multi-functionality model supporting dense, sparse, and token-level representations
- Gemini text-embedding-004: Google's production encoder (API-based)
- BM25: Okapi BM25 with default parameters
- TF-IDF: Traditional sparse retrieval with unigrams and bigrams
- Random: Random chunk selection (lower bound)
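For illustration, a minimal Okapi BM25 baseline in the spirit of `bm25_retriever.py` could look like the sketch below; the `rank_bm25` package and naive whitespace tokenization are assumptions, not necessarily what the repository uses.

```python
# Minimal sketch of an Okapi BM25 baseline over pre-chunked documents.
# Uses rank_bm25 with default parameters and whitespace tokenization;
# the repository's bm25_retriever.py may differ in detail.
from rank_bm25 import BM25Okapi

class BM25Baseline:
    def __init__(self, chunks: list[str]):
        self.chunks = chunks
        self.index = BM25Okapi([c.lower().split() for c in chunks])

    def retrieve(self, query: str, k: int = 5) -> list[str]:
        # Score every chunk against the query and return the top-k passages.
        scores = self.index.get_scores(query.lower().split())
        top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        return [self.chunks[i] for i in top]
```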
- Chunk sizes: 300-16,000 characters (model-dependent based on context windows)
- Chunk overlap: Fixed at 20%
- Retrieved chunks (K): Fixed at 5
- Chunking strategy: Boundary-respecting fixed-size chunks
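The chunking strategy splits each document into fixed-size windows with 20% overlap while respecting natural text boundaries. A simplified sketch of the idea is shown below; the exact boundary rules used in the repository may be richer (e.g. paragraph or heading breaks).

```python
# Simplified boundary-respecting fixed-size chunker with proportional overlap.
# Cuts are snapped back to the last sentence or whitespace boundary inside the
# window so chunks do not split words; the repository's implementation may differ.
def chunk_text(text: str, chunk_size: int = 3000, overlap_ratio: float = 0.2) -> list[str]:
    boundaries = (". ", "\n", " ")
    chunks, start = [], 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            cut = max(text.rfind(b, start, end) for b in boundaries)
            if cut > start:
                end = cut + 1  # keep the boundary character with the current chunk
        chunks.append(text[start:end].strip())
        if end >= len(text):
            break
        start = end - int((end - start) * overlap_ratio)  # 20% overlap with previous chunk
    return chunks
```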
- nv_accuracy: NVIDIA Answer Accuracy for end-to-end performance
- answer_correctness: F1-based factual and semantic accuracy
- factual_correctness: Pure factual accuracy without semantic similarity
- faithfulness: Measures hallucination/grounding in retrieved context
- context_precision: Quality of retrieval ranking (Average Precision)
- context_recall: Coverage of retrieved contexts
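These metrics are computed through the RAGAS `evaluate` API. The sketch below shows the general pattern with a single illustrative row; the column names and judge-LLM configuration follow common RAGAS usage and may differ from what `core/evaluation.py` actually does.

```python
# Sketch of scoring a RAG run with RAGAS. The dataset layout
# (question / answer / contexts / ground_truth) and the row contents
# are illustrative placeholders, not data from the study.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_correctness,
    context_precision,
    context_recall,
    faithfulness,
)

rows = {
    "question": ["How do I create a new business partner?"],
    "answer": ["Open the Business Partner Master Data window and ..."],
    "contexts": [["Retrieved chunk text for this question ..."]],
    "ground_truth": ["Use Business Partners > Business Partner Master Data ..."],
}

result = evaluate(
    Dataset.from_dict(rows),
    metrics=[faithfulness, answer_correctness, context_precision, context_recall],
    # nv_accuracy corresponds to RAGAS's NVIDIA AnswerAccuracy metric and is
    # added to this list in the same way (import path depends on the RAGAS version).
)
print(result)
```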
rag-evaluation/
├── config/ # Configuration files
│ ├── base_config.yaml # Common evaluation settings
│ └── approaches/ # Model-specific configurations
├── retrievers/ # Retriever implementations
│ ├── base_retriever.py # Abstract base class
│ ├── embedding_retriever.py # Dense retrieval (E5, BGE-M3, Nomic)
│ ├── gemini_retriever.py # Gemini API integration
│ ├── bm25_retriever.py # BM25 baseline
│ ├── tfidf_retriever.py # TF-IDF baseline
│ └── random_retriever.py # Random baseline
├── core/ # Core evaluation components
│ ├── data_manager.py # Dataset and document handling
│ ├── evaluation.py # RAGAS evaluation framework
│ └── experiment.py # Experiment orchestration
├── notebooks/ # Jupyter notebooks
│ ├── embedding_server.ipynb # GPU-accelerated embedding service
│ ├── dataset_generation.ipynb # Synthetic Q&A generation
│ └── results_analysis.ipynb # Results analysis and visualization
├── scripts/ # Utility scripts
│ ├── run_comparison.py # Multi-approach comparison
│ └── find_latest_eval_jsons.py # Result file management
└── main.py # Main experiment entry point
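All retrievers share a common interface defined in `retrievers/base_retriever.py`. The sketch below shows what such an abstract base class could look like; the method names and signatures here are illustrative, not the repository's actual API.

```python
# Illustrative sketch of a retriever interface like retrievers/base_retriever.py;
# the actual class and method names in the repository may differ.
from abc import ABC, abstractmethod

class BaseRetriever(ABC):
    """Common interface shared by dense, sparse, and random retrievers."""

    @abstractmethod
    def index(self, chunks: list[str]) -> None:
        """Build whatever index the retriever needs from the corpus chunks."""

    @abstractmethod
    def retrieve(self, query: str, k: int = 5) -> list[str]:
        """Return the top-k chunks for a query (K is fixed at 5 in the study)."""
```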
- Python 3.12.2+
- CUDA-capable GPU (recommended for embedding generation)
- API keys for Gemini and DeepSeek models
Using uv (recommended):

```bash
# Windows
.\setup-uv.ps1

# Linux/Mac
./setup-uv.sh
```

Traditional pip:

```bash
pip install -r requirements.txt
```

Create a `.env` file with the necessary API keys:

```
# API Keys
GOOGLE_API_KEY=your_google_api_key
DEEPSEEK_API_KEY=your_deepseek_api_key

# Optional: Remote embedding server
EMBEDDING_API_URL=https://your-colab-tunnel.ngrok-free.app
EMBEDDING_API_TOKEN=your_secure_token
```

Evaluate a specific approach:

```bash
python main.py --config config/approaches/gemini_c3000_o600.yaml
```

Compare all approaches:

```bash
python scripts/run_comparison.py \
    --approaches gemini_c3000_o600 bge_m3_c2000_o400 nomic_c1200_o240 e5_c750_o150 bm25_c8000_o1600 \
    --cached-docs "path/to/cached_documents.pkl" \
    --test-data "path/to/test_data.jsonl"
```

To regenerate the evaluation dataset:
- Open `notebooks/dataset_generation.ipynb` in Google Colab
- Configure API keys in Colab secrets
- Upload enterprise documents to Google Drive
- Run all cells to generate Q&A pairs with comprehensive corpus coverage
For large-scale evaluation, use the provided Colab notebook:
- Open `notebooks/embedding_server.ipynb` in Google Colab (GPU runtime)
- Set up the ngrok tunnel and authentication token
- Configure `EMBEDDING_API_URL` and `EMBEDDING_API_TOKEN` in your `.env`
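With the remote server configured, dense retrievers send batches of texts to it over HTTP. The client sketch below is hypothetical: the `/embed` endpoint, bearer-token header, and payload/response shapes are assumptions, not the notebook's documented API.

```python
# Hypothetical client for the Colab embedding server; the /embed endpoint,
# bearer-token auth, and payload/response shapes are assumptions.
import os
import requests
from dotenv import load_dotenv

load_dotenv()  # reads EMBEDDING_API_URL and EMBEDDING_API_TOKEN from .env

def embed_remote(texts: list[str]) -> list[list[float]]:
    response = requests.post(
        f"{os.environ['EMBEDDING_API_URL']}/embed",
        headers={"Authorization": f"Bearer {os.environ['EMBEDDING_API_TOKEN']}"},
        json={"texts": texts},
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["embeddings"]
```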
Evaluation results are stored in `data/evaluation/` with timestamps:

- `evaluation_results_{approach}_{timestamp}.json`: Full results with per-question metrics
- `evaluation_results_{approach}_{timestamp}.csv`: Tabular format for analysis
- `comparison_{timestamp}.json`: Cross-approach comparison
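The per-approach CSVs can be loaded directly for analysis. The sketch below shows how a rank correlation like the ρ ≈ 0.97 reported above could be recomputed; the column names (`approach`, `context_precision`, `nv_accuracy`) are assumptions about the CSV layout, and `results_analysis.ipynb` contains the actual analysis.

```python
# Sketch of loading per-question results and checking the rank correlation
# between retrieval quality and end-to-end accuracy. Column names are
# assumptions about the CSV layout produced by the evaluation runs.
import glob
import pandas as pd
from scipy.stats import spearmanr

frames = [pd.read_csv(path) for path in glob.glob("data/evaluation/evaluation_results_*.csv")]
results = pd.concat(frames, ignore_index=True)

# Aggregate per configuration, then correlate mean context precision with accuracy.
per_config = results.groupby("approach")[["context_precision", "nv_accuracy"]].mean()
rho, p_value = spearmanr(per_config["context_precision"], per_config["nv_accuracy"])
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3g})")
```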
The complete evaluation matrix (27 configurations × 200 questions = 5,400 evaluations) can be reproduced by:
- Setting up the environment as described above
- Running all configurations in `config/approaches/`
- Using the analysis notebook to generate figures and tables
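A simple driver for the second step might iterate over the config directory and invoke `main.py` once per configuration, as sketched below; `scripts/run_comparison.py` may already cover this, so treat it as an illustration.

```python
# Sketch of a driver that runs every configuration in config/approaches/
# sequentially via main.py; error handling and resumption are omitted.
import subprocess
import sys
from pathlib import Path

for config_path in sorted(Path("config/approaches").glob("*.yaml")):
    print(f"Running {config_path.name} ...")
    subprocess.run([sys.executable, "main.py", "--config", str(config_path)], check=True)
```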
Note: Full reproduction requires significant computational resources (~8 hours GPU time for embeddings, ~48 hours for evaluation) and API costs (~€100 for Gemini/DeepSeek calls).
This research code is released under the MIT License. See LICENSE for details.
- RAGAS framework for automated RAG evaluation
- SAP for publicly available Business One documentation