Official implementation for the paper "DatasetResearch: Benchmarking Agent Systems for Demand-Driven Dataset Discovery"
DatasetResearch is the first comprehensive benchmark for evaluating AI agents' ability to discover and synthesize datasets from real-world demands across knowledge-intensive and reasoning-intensive tasks. Our tri-dimensional evaluation framework reveals that even advanced systems achieve only 20% accuracy, establishing rigorous baselines for autonomous dataset discovery and illuminating the path toward AI capable of finding any dataset in the digital universe.
- Automated Dataset Discovery Baseline: Search and discover relevant datasets using LLM-powered agents
- Complete Evaluation Pipeline: End-to-end evaluation framework with SFT training, inference, and assessment
- LLaMA-Factory Integration: Seamless integration with LLaMA-Factory for model training and inference
- Comprehensive Metrics: Multiple evaluation metrics including BLEU, ROUGE, F1, exact match, and accuracy
We provide a one-click setup script that handles everything:
# Clone the repository
git clone https://github.com/GAIR-NLP/DatasetResearch
cd DatasetResearch
# Run the setup script (installs uv, dependencies, downloads data, sets up CLI)
chmod +x setup.sh
./setup.sh
After setup completes, the CLI is ready to use! The script will show you exactly how to run commands:
# If global command is available:
datasetresearch --help
# Otherwise use from project directory:
./datasetresearch --help
If you prefer manual setup:
# Install dependencies
uv sync # or pip install -r requirements.txt
# Set up LLaMA-Factory
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
cd ..
# Make CLI executable
chmod +x datasetresearch
source .venv/bin/activate
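To confirm the manual setup worked, a quick sanity check is to call the CLI entry point and the LLaMA-Factory command-line tool from the activated environment. The `llamafactory-cli version` subcommand ships with recent LLaMA-Factory releases; if the vendored copy is older, the Python import fallback is an equivalent check (the package name `llamafactory` is an assumption):
# Verify the CLI entry point responds
./datasetresearch --help
# Verify LLaMA-Factory is installed; fall back to an import check if the
# version subcommand is unavailable in the vendored copy
llamafactory-cli version || python -c "import llamafactory; print('LLaMA-Factory import OK')"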
The Deep Dataset Researcher provides a unified command-line interface for all operations. Use the `datasetresearch` command to access all functionality.
# View all available commands
datasetresearch --help
# Get help for specific commands
datasetresearch search --help
datasetresearch synthesis --help
datasetresearch register --help
datasetresearch run --help
datasetresearch eval --help
datasetresearch metaeval --help
Search for relevant datasets using the CLI:
# Search for datasets using default configuration
datasetresearch search
# Search with custom configuration
datasetresearch search --config configs/search_config.yaml
New ID Format: Search datasets now use standardized IDs:
- Format: `search_{selected_dataset}_{model}`
- Examples: `search_alpaca_gpt4_gemini`, `search_dolly_15k_o3`
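To sanity-check a search run, you can list the standardized IDs recorded in the search metadata file. The exact file name and JSON shape below (a `datasets/search_set_metadata_*.json` file containing a top-level list with a `search_dataset_id` field) are illustrative assumptions; adjust them to match your run:
# List the first few standardized search IDs
# (file name and field layout are assumptions; requires jq)
jq -r '.[].search_dataset_id' datasets/search_set_metadata_gemini.json | head -n 5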
Generate synthetic training data using the CLI:
# Generate synthetic data with default settings
datasetresearch synthesis
# Generate data with custom configuration
datasetresearch synthesis --config configs/generation_config.yaml
# Generate with specific metadata file
datasetresearch synthesis \
--metadata-file datasets/search_results.json \
--output-dir experiments/generation/
New ID Format: Synthesis datasets now use standardized IDs:
- Format: `synthesis_{original_dataset_id}_{model}`
- Examples: `synthesis_math_problems_o3`, `synthesis_code_snippets_gemini`
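The `--num-data` and `--dry-run` options from the command reference below can be combined to preview a small synthesis job before committing to a full run; the sample size of 100 is purely illustrative:
# Preview a limited synthesis job
# (--num-data caps the sample count; --dry-run previews the plan without executing it)
datasetresearch synthesis \
  --config configs/generation_config.yaml \
  --num-data 100 \
  --dry-run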
Register datasets with LLaMA-Factory for training:
# Register search datasets
datasetresearch register \
--metadata-file datasets/search_results.json \
--output-file LLaMA-Factory/data/dataset_info.json \
--base-dir search_dataset \
--model gemini
# Register synthesis datasets
datasetresearch register \
--metadata-file datasets/generation_results.json \
--output-file LLaMA-Factory/data/dataset_info.json \
--base-dir synthesis \
--model o3
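After registration you can confirm that the new entries landed in `dataset_info.json`. The query below only assumes that registered dataset keys carry the `search_`/`synthesis_` prefixes described above; the rest of each entry follows whatever schema LLaMA-Factory expects:
# Show the keys added for search- and synthesis-registered datasets (requires jq)
jq -r 'keys[] | select(startswith("search_") or startswith("synthesis_"))' \
  LLaMA-Factory/data/dataset_info.json | head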
Execute complete SFT training and inference:
# Run with specific configuration and model
datasetresearch run \
--config evaluation/config_few_shot.yaml \
--dataset_json datasets/results/re_eval/gpt-4o-mini.json \
--model llama3_8b \
--task_model gpt-4o-search-preview
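The `run` command also accepts `--step` and `--dry-run` (see the command reference below). Step names depend on your pipeline configuration, so treat `sft` here as a placeholder:
# Preview a single pipeline stage without launching training
# ("sft" is a placeholder step name; substitute one from your config)
datasetresearch run \
  --config evaluation/config_few_shot.yaml \
  --model llama3_8b \
  --step sft \
  --dry-run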
Use the evaluation pipeline for comprehensive assessment:
# Run complete evaluation pipeline
datasetresearch eval
# Run evaluation with specific dataset set
datasetresearch eval --set mini
# Run with custom configurations
datasetresearch eval \
--eval-config evaluation/config.yaml \
--pipeline-config configs/pipeline_settings.yaml
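Once the pipeline finishes, the aggregated scores land in `evaluation/results/evaluation_summary.csv` (see the output files section below). A quick way to eyeball them from the shell, assuming the standard `column` utility is available:
# Pretty-print the first few rows of the unified summary CSV
column -s, -t < evaluation/results/evaluation_summary.csv | head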
Run metadata evaluation pipeline:
# Run full metadata evaluation
datasetresearch metaeval
# Run with specific configuration
datasetresearch metaeval --config configs/evaluate_metadata_config.yaml
# Run specific modes
datasetresearch metaeval --mode generate_only
datasetresearch metaeval --mode evaluate_only
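The two modes make it possible to split the pipeline, for example generating the metadata first, inspecting it, and then scoring it in a second pass. Both invocations below reuse the config file shown above:
# Two-pass metadata evaluation: generate first, then evaluate
datasetresearch metaeval --mode generate_only --config configs/evaluate_metadata_config.yaml
datasetresearch metaeval --mode evaluate_only --config configs/evaluate_metadata_config.yaml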
| Command | Purpose | Key Options |
|---|---|---|
| `search` | Dataset discovery | `--config`, `--dry-run` |
| `synthesis` | Data generation | `--config`, `--metadata-file`, `--num-data`, `--dry-run` |
| `register` | Dataset registration | `--metadata-file`, `--output-file`, `--base-dir`, `--model` |
| `run` | Training/inference | `--config` (required), `--model`, `--step`, `--dry-run` |
| `eval` | Evaluation pipeline | `--eval-config`, `--pipeline-config`, `--set`, `--dry-run` |
| `metaeval` | Metadata evaluation | `--config`, `--mode`, `--dry-run` |
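Putting the commands together, a typical end-to-end pass looks like the following. The file names and the choice of `gemini` as the search model are illustrative; substitute your own configs and models:
# 1. Discover candidate datasets
datasetresearch search --config configs/search_config.yaml
# 2. Synthesize training data from the discovered metadata
datasetresearch synthesis --config configs/generation_config.yaml
# 3. Register the resulting datasets with LLaMA-Factory
datasetresearch register \
  --metadata-file datasets/search_results.json \
  --output-file LLaMA-Factory/data/dataset_info.json \
  --base-dir search_dataset \
  --model gemini
# 4. Fine-tune and run inference
datasetresearch run --config evaluation/config_few_shot.yaml --model llama3_8b
# 5. Score the results
datasetresearch eval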
- `experiments/search/`: Search results and discovered datasets
- `datasets/search_set_metadata_*.json`: Metadata for discovered datasets
  - New: Standardized `search_{dataset}_{model}` IDs throughout
- `datasets/generation_metadata_*.json`: Generated training data metadata
  - New: Files saved with the `synthesis_{dataset}_{model}` naming convention
  - New: Consistent ID usage across w_example and wo_example outputs
- `LLaMA-Factory/data/dataset_info.json`: Registered datasets for training
  - New: Direct mapping from `search_dataset_id` to dataset keys
  - New: File paths match `{base_dir}/{model}/{search_dataset_id}.json`
- `evaluation/results/final_results/`: Individual evaluation JSON files
- `evaluation/results/evaluation_summary.csv`: Unified CSV output from the integrated pipeline
- `evaluation/processed_results/`: Aggregated performance metrics
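A quick spot-check after a full run (paths follow the list above; the files only exist once the corresponding stage has completed):
# Spot-check the main outputs
ls experiments/search/ | head
ls evaluation/results/final_results/ | head
head -n 3 evaluation/results/evaluation_summary.csv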
The framework uses multiple configuration files for different components:
All API configurations are managed through configuration files instead of environment variables. Update the respective config files with your API credentials:
# Search model configuration
search_model_name: "openai/gpt-4o-mini-search-preview"
search_api_base: "https://openrouter.ai/api/v1"
search_api_key: "your-openrouter-key"
# Generation model configuration
model_name: "o3"
model_api_base: "https://gpt.yunstorm.com/"
model_api_key: "your-azure-key"
# Azure OpenAI API configuration
azure_openai:
api_endpoint: "https://your-azure-endpoint"
api_key: "your-azure-key"
api_version: "2025-01-01-preview"
model: "o3"
# LLM configuration for metadata generation
llm_config:
api_model: "o3"
api_base: "https://gpt.yunstorm.com/"
api_key: "your-azure-key"
Note: Replace the placeholder keys with your actual API credentials. All modules read credentials from configuration files exclusively; environment variables are no longer supported.
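Before launching a run, it can help to verify that no placeholder credentials remain in the config files. The pattern below simply matches the `your-...-key` placeholders used in the examples above, scanning the config directories referenced throughout this README:
# List any config files that still contain placeholder API keys
grep -RIn "your-.*-key" configs/ evaluation/ || echo "No placeholder keys found"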
models:
llama3_8b:
name: "llama3_8b"
base_model: "models/LLama3/Llama-3.1-8B"
template: "llama3"
finetuning_type: "full"
batch_size: 1
learning_rate: 1.0e-5
target_data_models:
gpt-4o-search-preview:
display_name: "gpt-4o-search"
o3-w:
display_name: "gpt-o3"
evaluation_settings:
sft_model: "llama3_8b"
methods:
fine_tune:
display_name: "fine-tune"
zero_shot:
display_name: "0-shot"
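The evaluation configs are plain YAML, so a quick way to catch indentation mistakes after editing is to parse the file from the project environment (assuming PyYAML is installed there; the path is one of the configs referenced above):
# Fail fast on YAML syntax errors before launching a long run
python -c "import yaml; yaml.safe_load(open('evaluation/config.yaml')); print('evaluation/config.yaml parses')"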
If you use this framework in your research, please cite our paper:
@misc{li2025datasetresearchbenchmarkingagentsystems,
title={DatasetResearch: Benchmarking Agent Systems for Demand-Driven Dataset Discovery},
author={Keyu Li and Mohan Jiang and Dayuan Fu and Yunze Wu and Xiangkun Hu and Dequan Wang and Pengfei Liu},
year={2025},
eprint={2508.06960},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2508.06960},
}
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
For questions, issues, or contributions:
- Issues: Open an issue in the GitHub repository
- Documentation: Check `evaluation/README.md` for detailed evaluation framework docs
- Data: Access the test datasets on Hugging Face at GAIR/DatasetResearch
- CLI Help: Use `datasetresearch --help` or `datasetresearch <command> --help`
- Setup Issues: Run `./setup.sh` for automated setup and troubleshooting
This framework is designed for:
- Agent System Benchmarking: Compare different agent systems for dataset discovery
- Dataset Quality Assessment: Evaluate and compare dataset quality with standardized metrics
- Model Performance Analysis: Comprehensive model evaluation across tasks and domains
DatasetResearch/
├── setup.sh                 # One-click setup script with environment and data setup
├── datasetresearch          # CLI entry point
├── datasetresearch_cli/     # CLI implementation
│   ├── commands/            # CLI command modules
│   │   ├── search.py        # Dataset search command
│   │   ├── synthesis.py     # Data synthesis command
│   │   ├── run.py           # Training/inference command
│   │   ├── eval.py          # Evaluation command
│   │   ├── register.py      # Dataset registration command
│   │   └── metaeval.py      # Metadata evaluation command
│   ├── core/                # Core CLI functionality
│   └── utils/               # CLI utilities
├── LLaMA-Factory/           # Integrated LLaMA-Factory for model training
├── bash_scripts/            # Automation scripts for data processing and evaluation
├── configs/                 # Configuration files for different experiments
├── datasets/                # Dataset storage and metadata
│   ├── results/             # Processed evaluation results
│   └── metadata files       # Dataset metadata and search results
├── evaluation/              # Evaluation framework
│   ├── evaluators/          # Task-specific evaluators
│   ├── results/             # Evaluation outputs
│   └── templates/           # Evaluation templates
├── experiments/             # Experiment results and configurations
│   ├── generation/          # Synthetic data generation experiments
│   └── search/              # Dataset search experiments
├── metrics/                 # Custom evaluation metrics
├── scripts/                 # Core functionality scripts
│   ├── dataset/             # Dataset processing utilities
│   ├── method/              # Search and generation agents
│   └── utils/               # Utility functions
├── assets/                  # Images and static files
│   └── teaser_1.pdf         # Paper teaser figure
└── downloaded_datasets/     # Raw downloaded datasets