A comprehensive framework for conducting experiments on synthetic data generation, evaluation, and analysis. This codebase provides the core infrastructure used for research on contrastive datasets, negation processing, and reading comprehension evaluation.
This repository contains the essential experimental infrastructure extracted from research on CondaQA and synthetic dataset generation. The framework provides:
- Core Experimental Framework: Modular system for running experiments with different models and workflows
- Model Interfaces: Unified interfaces for various language models (OpenAI GPT, local transformers models, etc.)
- Workflow System: Configurable pipelines for different experimental tasks
- Evaluation Tools: Utilities for evaluating model performance and conducting statistical analyses
- Human Evaluation System: Tools for creating and managing human evaluation surveys
`src/edit/`:
- `utils.py`: Core classes for conversation management, data loading, and experiment orchestration
- `models.py`: Model loading and interface utilities supporting both API and local models
- `workflows.py`: Experimental workflows and task definitions
- `run.py`: Main experiment runner script

`src/eval/`:
- `condaqa_eval.py`: Main evaluation script for CondaQA-style datasets
- `lm_harness_utils.py`: Evaluation utilities and metrics calculation
- `models.py`: Model interfaces specific to evaluation tasks
- `consistency_statistical_tests.py`: Statistical testing utilities
- Survey creation scripts for different datasets (CondaQA, DROP)
- HTML templates for survey interfaces (`data/human_eval_templates/`)
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd eval-synth-eval
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set up environment variables:

  ```bash
  # For OpenAI models
  export OPENAI_API_KEY="your-key-here"

  # For other API models (optional)
  export ANTHROPIC_API_KEY="your-key-here"
  export DEEPSEEK_API_KEY="your-key-here"
  export GEMINI_API_KEY="your-key-here"

  # For local models (optional)
  export TRANSFORMERS_CACHE="/path/to/cache"
  ```
Run an experiment using the framework:
```bash
cd src/edit
python run.py -m gpt2024 -w create_scope_edit -d path/to/data.json -s 0 -n 10
```
Parameters:
- `-m`: Model name (e.g., `gpt2024`, `llama2`, `mistral`, `dummy`)
- `-w`: Workflow name (see `workflows.py` for available options)
- `-d`: Path to data file
- `-s`: Number of instances to skip (for resuming)
- `-n`: Number of instances to process
- `-k`: API key (if not using environment variable)
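For example, to resume a run that has already handled the first 100 instances and process the next 50 (the path and counts are placeholders; the flag semantics follow the descriptions above):

```bash
cd src/edit
python run.py -m gpt2024 -w create_scope_edit -d path/to/data.json -s 100 -n 50
```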
The framework supports various experimental workflows:
- Edit Generation: `create_scope_edit`, `create_affirmative_edit`, `create_paraphrase_edit`
- Question Generation: `generate_question_by_genie`, `generate_conda_questions_by_template`
- Evaluation: `detect_coherence`, `detect_requires_implication`
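Any of these names can be passed to `run.py` via `-w`. A quick smoke test with the `dummy` model (no external API calls) might look like this; the data path is a placeholder:

```bash
cd src/edit
python run.py -m dummy -w detect_coherence -d path/to/data.json -s 0 -n 5
```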
The framework uses a conversation-based architecture:
```python
from utils import Conversation, Message, Pipeline
from models import load_model
from workflows import load_workflow

# Create a pipeline
model = load_model("gpt2024", api_key)
workflow = load_workflow("create_scope_edit")
pipeline = Pipeline(model, workflow)

# Run on data
result, conversation = pipeline(data_instance)
```
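To experiment without a dataset file, the pipeline can also be called on a hand-built instance dict. The sketch below assumes the `dummy` model needs no API key and that workflows read the fields from the data format described below; the example values are made up:

```python
from utils import Pipeline
from models import load_model
from workflows import load_workflow

# "dummy" is the built-in test model; it avoids external API calls.
pipeline = Pipeline(load_model("dummy"), load_workflow("create_scope_edit"))

instance = {
    "original passage": "The committee did not approve the proposal.",
    "original sentence": "The committee did not approve the proposal.",
    "original cue": "not",
    "sentence2": "Was the proposal approved?",
    "label": "NO",
    "PassageEditID": 0,
}

result, conversation = pipeline(instance)
print(result)
```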
To add a new model, implement the model interface in `models.py`:

```python
def _your_model(api_key=None):
    def query(messages, n=None, **kwargs):
        # Implement your model logic here
        # messages: list of {"role": "user/assistant", "content": "..."}
        # n: number of completions to generate (optional)
        return response  # string if n is None, list if n > 0
    return query

# Add to model registry
_models["your_model"] = _your_model
```
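As a concrete illustration of this interface, here is a hypothetical model that simply echoes the last message back; the name `_echo_model` and the registry key `"echo"` are invented for this example:

```python
def _echo_model(api_key=None):
    def query(messages, n=None, **kwargs):
        # Echo the content of the most recent message.
        last = messages[-1]["content"]
        if n is None:
            return last       # single completion: a string
        return [last] * n     # multiple completions: a list
    return query

_models["echo"] = _echo_model
```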
Implement workflows in `workflows.py`:

```python
def _your_workflow():
    def workflow(messages: Conversation, instance):
        # Use messages.query() to interact with the model
        response = messages.query("Your prompt here")
        return response
    return workflow

# Add to workflow registry
_workflows["your_workflow"] = _your_workflow
```
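For instance, a hypothetical question-answering workflow could build its prompt from the instance fields in the data format described below; the name and prompt wording here are illustrative only:

```python
def _answer_question():
    def workflow(messages: Conversation, instance):
        # Compose a prompt from the passage and question fields of the instance.
        prompt = (
            f"Passage: {instance['original passage']}\n"
            f"Question: {instance['sentence2']}\n"
            "Answer the question based only on the passage."
        )
        return messages.query(prompt)
    return workflow

_workflows["answer_question"] = _answer_question
```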
The framework expects data in JSONL format with the following structure:
```json
{
  "original passage": "Text of the passage...",
  "original sentence": "The specific sentence with negation...",
  "original cue": "negation_word",
  "sentence2": "Question about the passage?",
  "label": "Answer to the question",
  "PassageEditID": 0
}
```
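Since the file is JSONL, each line holds one such object. A minimal loading sketch (the path is a placeholder):

```python
import json

with open("path/to/data.jsonl") as f:
    instances = [json.loads(line) for line in f if line.strip()]

# Each entry is a dict with the keys shown above.
print(len(instances), instances[0]["original cue"])
```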
For CondaQA-style evaluation:
```bash
cd src/eval
python condaqa_eval.py --model_name gpt-4-turbo --test_file data/test.jsonl --output_dir results/
```
The framework includes utilities for statistical testing:
```python
from src.eval.consistency_statistical_tests import run_statistical_tests

results = run_statistical_tests(model_outputs, gold_labels)
```
Create human evaluation surveys:
```bash
cd src/eval/human-eval
python create_survey_condaqa.py --input_data data.json --output_file survey.html
```
The system uses templates from `data/human_eval_templates/` for consistent survey formatting.
When contributing to this framework:
- Security: Never commit API keys or sensitive information
- Documentation: Update README and docstrings for new features
- Testing: Test new models/workflows with the dummy model first
- Code Style: Follow existing patterns for consistency
If you use this framework in your research, please cite the relevant papers:
```bibtex
@inproceedings{ravichander-et-al-2022-condaqa,
  title     = {CondaQA: A Contrastive Reading Comprehension Dataset for Reasoning about Negation},
  author    = {Ravichander, Abhilasha and Gardner, Matt and Marasović, Ana},
  booktitle = {Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year      = {2022}
}
```
This project is licensed under the MIT License - see the LICENSE file for details.
For questions or issues:
- Check the documentation and code examples
- Search existing issues in the repository
- Create a new issue with detailed information about your problem
Important: This framework can make API calls to external services. Be mindful of:
- Cost: API calls to OpenAI, Anthropic, etc. incur charges
- Rate Limits: Implement appropriate delays for large-scale experiments
- Data Privacy: Be careful when sending sensitive data to external APIs
Always test with small samples and the `dummy` model before running large experiments.
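For the rate-limit point above, one option is to wrap a model's `query` function in a retry-with-backoff helper. This is a sketch only; the broad `except Exception` should be narrowed to the rate-limit error raised by whichever API client you use:

```python
import time

def with_backoff(query, max_retries=5, base_delay=2.0):
    """Retry a model query function with exponential backoff on failure."""
    def wrapped(messages, **kwargs):
        for attempt in range(max_retries):
            try:
                return query(messages, **kwargs)
            except Exception:  # narrow to your client's rate-limit error
                if attempt == max_retries - 1:
                    raise
                time.sleep(base_delay * (2 ** attempt))
    return wrapped

# Example: model = with_backoff(load_model("gpt2024", api_key))
```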