Synthetic Evaluation Framework

A comprehensive framework for conducting experiments on synthetic data generation, evaluation, and analysis. This codebase provides the core infrastructure used for research on contrastive datasets, negation processing, and reading comprehension evaluation.

Overview

This repository contains the essential experimental infrastructure extracted from research on CondaQA and synthetic dataset generation. The framework provides:

  • Core Experimental Framework: Modular system for running experiments with different models and workflows
  • Model Interfaces: Unified interfaces for various language models (OpenAI GPT, local transformers models, etc.)
  • Workflow System: Configurable pipelines for different experimental tasks
  • Evaluation Tools: Utilities for evaluating model performance and conducting statistical analyses
  • Human Evaluation System: Tools for creating and managing human evaluation surveys

Key Components

Core Framework (src/edit/)

  • utils.py: Core classes for conversation management, data loading, and experiment orchestration
  • models.py: Model loading and interface utilities supporting both API and local models
  • workflows.py: Experimental workflows and task definitions
  • run.py: Main experiment runner script

Evaluation System (src/eval/)

  • condaqa_eval.py: Main evaluation script for CondaQA-style datasets
  • lm_harness_utils.py: Evaluation utilities and metrics calculation
  • models.py: Model interfaces specific to evaluation tasks
  • consistency_statistical_tests.py: Statistical testing utilities

Human Evaluation (src/eval/human-eval/)

  • Survey creation scripts for different datasets (CondaQA, DROP)
  • HTML templates for survey interfaces (data/human_eval_templates/)

Quick Start

Installation

  1. Clone the repository:
git clone <repository-url>
cd eval-synth-eval
  2. Install dependencies:
pip install -r requirements.txt
  3. Set up environment variables:
# For OpenAI models
export OPENAI_API_KEY="your-key-here"

# For other API models (optional)
export ANTHROPIC_API_KEY="your-key-here"
export DEEPSEEK_API_KEY="your-key-here"
export GEMINI_API_KEY="your-key-here"

# For local models (optional)
export TRANSFORMERS_CACHE="/path/to/cache"
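A quick way to catch a missing key before a long run is a small preflight check; this is plain Python using os.environ and is not part of the framework itself:

import os

# Fail early if the key for API-backed models is missing.
if not os.environ.get("OPENAI_API_KEY"):
    raise SystemExit("OPENAI_API_KEY is not set; export it or pass a key with -k")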

Basic Usage

Run an experiment using the framework:

cd src/edit
python run.py -m gpt2024 -w create_scope_edit -d path/to/data.json -s 0 -n 10

Parameters:

  • -m: Model name (e.g., gpt2024, llama2, mistral, dummy)
  • -w: Workflow name (see workflows.py for available options)
  • -d: Path to data file
  • -s: Number of instances to skip (for resuming)
  • -n: Number of instances to process
  • -k: API key (if not using environment variable)
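
Before launching a large run, you can smoke-test the setup with the built-in dummy model on a handful of instances (the data path below is a placeholder):

python run.py -m dummy -w create_scope_edit -d path/to/data.json -s 0 -n 2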

Example Workflows

The framework supports various experimental workflows:

  • Edit Generation: create_scope_edit, create_affirmative_edit, create_paraphrase_edit
  • Question Generation: generate_question_by_genie, generate_conda_questions_by_template
  • Evaluation: detect_coherence, detect_requires_implication
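
Switching workflows is just a matter of the -w flag; for example, a small question-generation run (again with the dummy model and a placeholder data path) might look like:

python run.py -m dummy -w generate_conda_questions_by_template -d path/to/data.json -s 0 -n 5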

Architecture

Conversation System

The framework uses a conversation-based architecture:

from utils import Conversation, Message, Pipeline
from models import load_model
from workflows import load_workflow

# Create a pipeline
model = load_model("gpt2024", api_key)
workflow = load_workflow("create_scope_edit")
pipeline = Pipeline(model, workflow)

# Run on data
result, conversation = pipeline(data_instance)

Adding New Models

To add a new model, implement the model interface in models.py:

def _your_model(api_key=None):
    def query(messages, n=None, **kwargs):
        # Implement your model logic here
        # messages: list of {"role": ..., "content": ...} dicts,
        #           where role is "user" or "assistant"
        # n: number of completions to generate (optional)
        return response  # a single string if n is None, a list of strings if n > 0
    return query

# Add to model registry
_models["your_model"] = _your_model
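
As a concrete illustration of this interface, here is a minimal echo-style model that satisfies the same contract (a string when n is None, a list when n > 0); the name _echo_model is hypothetical and only meant for local testing:

def _echo_model(api_key=None):
    def query(messages, n=None, **kwargs):
        # Echo the content of the last user message instead of calling a real model.
        last = messages[-1]["content"] if messages else ""
        if n is None:
            return f"ECHO: {last}"
        return [f"ECHO {i}: {last}" for i in range(n)]
    return query

_models["echo"] = _echo_model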

Adding New Workflows

Implement workflows in workflows.py:

def _your_workflow():
    def workflow(messages: Conversation, instance):
        # Use messages.query() to interact with the model
        response = messages.query("Your prompt here")
        return response
    return workflow

# Add to workflow registry
_workflows["your_workflow"] = _your_workflow
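
For instance, a small workflow that checks whether the annotated negation cue appears in the original sentence could look like this; the workflow name and prompt wording are hypothetical, while the instance keys come from the Data Format section below:

def _detect_cue_presence():
    def workflow(messages: Conversation, instance):
        # Single-turn prompt built from fields of the data instance.
        prompt = (
            f"Does the sentence below contain the negation cue "
            f"'{instance['original cue']}'? Answer yes or no.\n\n"
            f"Sentence: {instance['original sentence']}"
        )
        return messages.query(prompt)
    return workflow

_workflows["detect_cue_presence"] = _detect_cue_presence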

Data Format

The framework expects data in JSONL format, i.e., one JSON object per line, each with the following structure (shown pretty-printed here for readability):

{
    "original passage": "Text of the passage...",
    "original sentence": "The specific sentence with negation...",
    "original cue": "negation_word",
    "sentence2": "Question about the passage?",
    "label": "Answer to the question",
    "PassageEditID": 0
}
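
Loading such a file is a one-liner per record; the sketch below assumes a pipeline built as in the Conversation System example above, and the file path is a placeholder:

import json

with open("path/to/data.jsonl") as f:
    instances = [json.loads(line) for line in f if line.strip()]

# Each record can then be fed to a pipeline, e.g.:
# result, conversation = pipeline(instances[0])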

Evaluation

Running Evaluations

For CondaQA-style evaluation:

cd src/eval
python condaqa_eval.py --model_name gpt-4-turbo --test_file data/test.jsonl --output_dir results/

Statistical Analysis

The framework includes utilities for statistical testing:

from src.eval.consistency_statistical_tests import run_statistical_tests
results = run_statistical_tests(model_outputs, gold_labels)

Human Evaluation

Create human evaluation surveys:

cd src/eval/human-eval
python create_survey_condaqa.py --input_data data.json --output_file survey.html

The system uses templates from data/human_eval_templates/ for consistent survey formatting.

Contributing

When contributing to this framework:

  1. Security: Never commit API keys or sensitive information
  2. Documentation: Update README and docstrings for new features
  3. Testing: Test new models/workflows with the dummy model first
  4. Code Style: Follow existing patterns for consistency

Citation

If you use this framework in your research, please cite the relevant papers:

@inproceedings{ravichander-et-al-2022-condaqa,
  title={CondaQA: A Contrastive Reading Comprehension Dataset for Reasoning about Negation},
  author={Ravichander, Abhilasha and Gardner, Matt and Marasović, Ana},
  booktitle={Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing},
  year={2022}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support

For questions or issues:

  1. Check the documentation and code examples
  2. Search existing issues in the repository
  3. Create a new issue with detailed information about your problem

API Usage and Costs

Important: This framework can make API calls to external services. Be mindful of:

  • Cost: API calls to OpenAI, Anthropic, etc. incur charges
  • Rate Limits: Implement appropriate delays for large-scale experiments (see the back-off sketch at the end of this section)
  • Data Privacy: Be careful when sending sensitive data to external APIs

Always test with small samples and the dummy model before running large experiments.
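
One simple way to respect rate limits is to wrap whatever query function you use in a retry-and-back-off helper; this is a generic sketch and not part of the framework:

import time

def with_backoff(query, retries=5, base_delay=2.0):
    # Wrap a query function so transient failures (e.g., rate-limit errors)
    # trigger an exponentially growing pause before retrying.
    def wrapped(messages, **kwargs):
        for attempt in range(retries):
            try:
                return query(messages, **kwargs)
            except Exception:
                if attempt == retries - 1:
                    raise
                time.sleep(base_delay * (2 ** attempt))
    return wrapped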
