A comprehensive framework for conducting experiments on synthetic data generation, evaluation, and analysis. This codebase provides the core infrastructure used for research on contrastive datasets, negation processing, and reading comprehension evaluation.
This repository contains the essential experimental infrastructure extracted from research on CondaQA and synthetic dataset generation. The framework provides:
- Core Experimental Framework: Modular system for running experiments with different models and workflows
- Model Interfaces: Unified interfaces for various language models (OpenAI GPT, local transformers models, etc.)
- Workflow System: Configurable pipelines for different experimental tasks
- Evaluation Tools: Utilities for evaluating model performance and conducting statistical analyses
- Human Evaluation System: Tools for creating and managing human evaluation surveys
`src/edit/`:
- `utils.py`: Core classes for conversation management, data loading, and experiment orchestration
- `models.py`: Model loading and interface utilities supporting both API and local models
- `workflows.py`: Experimental workflows and task definitions
- `run.py`: Main experiment runner script

`src/eval/`:
- `condaqa_eval.py`: Main evaluation script for CondaQA-style datasets
- `lm_harness_utils.py`: Evaluation utilities and metrics calculation
- `models.py`: Model interfaces specific to evaluation tasks
- `consistency_statistical_tests.py`: Statistical testing utilities
- Survey creation scripts for different datasets (CondaQA, DROP)
- HTML templates for survey interfaces (`data/human_eval_templates/`)
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd eval-synth-eval
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set up environment variables:

  ```bash
  # For OpenAI models
  export OPENAI_API_KEY="your-key-here"

  # For other API models (optional)
  export ANTHROPIC_API_KEY="your-key-here"
  export DEEPSEEK_API_KEY="your-key-here"
  export GEMINI_API_KEY="your-key-here"

  # For local models (optional)
  export TRANSFORMERS_CACHE="/path/to/cache"
  ```
Run an experiment using the framework:
```bash
cd src/edit
python run.py -m gpt2024 -w create_scope_edit -d path/to/data.json -s 0 -n 10
```
Parameters:
- `-m`: Model name (e.g., `gpt2024`, `llama2`, `mistral`, `dummy`)
- `-w`: Workflow name (see `workflows.py` for available options)
- `-d`: Path to data file
- `-s`: Number of instances to skip (for resuming)
- `-n`: Number of instances to process
- `-k`: API key (if not using environment variable)
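For example, to resume a run that has already handled the first 100 instances and process the next 50 (the path and counts are placeholders; the flag semantics follow the descriptions above):

```bash
cd src/edit
python run.py -m gpt2024 -w create_scope_edit -d path/to/data.json -s 100 -n 50
```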
The framework supports various experimental workflows:
- Edit Generation: `create_scope_edit`, `create_affirmative_edit`, `create_paraphrase_edit`
- Question Generation: `generate_question_by_genie`, `generate_conda_questions_by_template`
- Evaluation: `detect_coherence`, `detect_requires_implication`
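Any of these names can be passed to `run.py` via `-w`. A quick smoke test with the `dummy` model (no external API calls) might look like this; the data path is a placeholder:

```bash
cd src/edit
python run.py -m dummy -w detect_coherence -d path/to/data.json -s 0 -n 5
```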
The framework uses a conversation-based architecture:
```python
from utils import Conversation, Message, Pipeline
from models import load_model
from workflows import load_workflow

# Create a pipeline
model = load_model("gpt2024", api_key)
workflow = load_workflow("create_scope_edit")
pipeline = Pipeline(model, workflow)

# Run on data
result, conversation = pipeline(data_instance)
```
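To experiment without a dataset file, the pipeline can also be called on a hand-built instance dict. The sketch below assumes the `dummy` model needs no API key and that workflows read the fields from the data format described below; the example values are made up:

```python
from utils import Pipeline
from models import load_model
from workflows import load_workflow

# "dummy" is the built-in test model; it avoids external API calls.
pipeline = Pipeline(load_model("dummy"), load_workflow("create_scope_edit"))

instance = {
    "original passage": "The committee did not approve the proposal.",
    "original sentence": "The committee did not approve the proposal.",
    "original cue": "not",
    "sentence2": "Was the proposal approved?",
    "label": "NO",
    "PassageEditID": 0,
}

result, conversation = pipeline(instance)
print(result)
```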
To add a new model, implement the model interface in `models.py`:

```python
def _your_model(api_key=None):
    def query(messages, n=None, **kwargs):
        # Implement your model logic here
        # messages: list of {"role": "user/assistant", "content": "..."}
        # n: number of completions to generate (optional)
        return response  # string if n is None, list if n > 0
    return query

# Add to model registry
_models["your_model"] = _your_model
```
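As a concrete illustration of this interface, here is a hypothetical model that simply echoes the last message back; the name `_echo_model` and the registry key `"echo"` are invented for this example:

```python
def _echo_model(api_key=None):
    def query(messages, n=None, **kwargs):
        # Echo the content of the most recent message.
        last = messages[-1]["content"]
        if n is None:
            return last       # single completion: a string
        return [last] * n     # multiple completions: a list
    return query

_models["echo"] = _echo_model
```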
Implement workflows in `workflows.py`:

```python
def _your_workflow():
    def workflow(messages: Conversation, instance):
        # Use messages.query() to interact with the model
        response = messages.query("Your prompt here")
        return response
    return workflow

# Add to workflow registry
_workflows["your_workflow"] = _your_workflow
```
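For instance, a hypothetical question-answering workflow could build its prompt from the instance fields in the data format described below; the name and prompt wording here are illustrative only:

```python
def _answer_question():
    def workflow(messages: Conversation, instance):
        # Compose a prompt from the passage and question fields of the instance.
        prompt = (
            f"Passage: {instance['original passage']}\n"
            f"Question: {instance['sentence2']}\n"
            "Answer the question based only on the passage."
        )
        return messages.query(prompt)
    return workflow

_workflows["answer_question"] = _answer_question
```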
The framework expects data in JSONL format with the following structure:
```json
{
  "original passage": "Text of the passage...",
  "original sentence": "The specific sentence with negation...",
  "original cue": "negation_word",
  "sentence2": "Question about the passage?",
  "label": "Answer to the question",
  "PassageEditID": 0
}
```
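Since the file is JSONL, each line holds one such object. A minimal loading sketch (the path is a placeholder):

```python
import json

with open("path/to/data.jsonl") as f:
    instances = [json.loads(line) for line in f if line.strip()]

# Each entry is a dict with the keys shown above.
print(len(instances), instances[0]["original cue"])
```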
For CondaQA-style evaluation:
```bash
cd src/eval
python condaqa_eval.py --model_name gpt-4-turbo --test_file data/test.jsonl --output_dir results/
```
The framework includes utilities for statistical testing:
```python
from src.eval.consistency_statistical_tests import run_statistical_tests

results = run_statistical_tests(model_outputs, gold_labels)
```
Create human evaluation surveys:
```bash
cd src/eval/human-eval
python create_survey_condaqa.py --input_data data.json --output_file survey.html
```
The system uses templates from `data/human_eval_templates/` for consistent survey formatting.
When contributing to this framework:
- Security: Never commit API keys or sensitive information
- Documentation: Update README and docstrings for new features
- Testing: Test new models/workflows with the dummy model first
- Code Style: Follow existing patterns for consistency
If you use this framework in your research, please cite the relevant papers:
```bibtex
@inproceedings{ravichander-et-al-2022-condaqa,
  title     = {CondaQA: A Contrastive Reading Comprehension Dataset for Reasoning about Negation},
  author    = {Ravichander, Abhilasha and Gardner, Matt and Marasović, Ana},
  booktitle = {Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year      = {2022}
}
```
This project is licensed under the MIT License - see the LICENSE file for details.
For questions or issues:
- Check the documentation and code examples
- Search existing issues in the repository
- Create a new issue with detailed information about your problem
Important: This framework can make API calls to external services. Be mindful of:
- Cost: API calls to OpenAI, Anthropic, etc. incur charges
- Rate Limits: Implement appropriate delays for large-scale experiments
- Data Privacy: Be careful when sending sensitive data to external APIs
Always test with small samples and the `dummy` model before running large experiments.
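For the rate-limit point above, one option is to wrap a model's `query` function in a retry-with-backoff helper. This is a sketch only; the broad `except Exception` should be narrowed to the rate-limit error raised by whichever API client you use:

```python
import time

def with_backoff(query, max_retries=5, base_delay=2.0):
    """Retry a model query function with exponential backoff on failure."""
    def wrapped(messages, **kwargs):
        for attempt in range(max_retries):
            try:
                return query(messages, **kwargs)
            except Exception:  # narrow to your client's rate-limit error
                if attempt == max_retries - 1:
                    raise
                time.sleep(base_delay * (2 ** attempt))
    return wrapped

# Example: model = with_backoff(load_model("gpt2024", api_key))
```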