
FuzzAug

Official code release for the paper FuzzAug: Data Augmentation by Fuzzing for Neural Test Generation (EMNLP 2025 Findings).

TL;DR: FuzzAug is a coverage-guided data augmentation method that brings fuzzing's diversity and valid testing semantics to LLM-based unit test generation. By doubling the training data with diverse, semantically meaningful tests, especially for newer languages like Rust where training resources are relatively limited, FuzzAug significantly outperforms baselines and shows how dynamic analysis priors can boost LLM-based unit test generation.

FuzzAug Workflow

Repository Organization

| directory  | purpose                                                     |
|------------|-------------------------------------------------------------|
| fuzz       | scripts for transforming fuzz targets and collecting inputs |
| evaluation | scripts for run-time evaluation metrics                     |
| training   | script for model training                                   |
| UniTSyn    | UniTSyn functions for collecting focal and source pairs     |
| tests      | unit tests for the important modules                        |

Setup

Python

We use Python 3.11. We recommend using uv to manage your Python dependencies.

cd FuzzAug
uv sync # create a virtual environment, and install dependencies
source .venv/bin/activate

UniTSyn

We depend on UniTSyn, which is already included as a submodule.

git submodule init
git submodule update

Then, please install the dependencies for UniTSyn:

uv pip install -r UniTSyn/requirements.txt

Environment Variables

The env.sh files for both this project and UniTSyn must be sourced:

cd UniTSyn && source ./scripts/env.sh && cd .. && source ./env.sh

Alternatively, just run ./init.sh, which does all of the above.

Installing Rust and Cargo Fuzz

  1. Get Rust and rustup:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
  2. Switch to nightly (required by cargo fuzz):
rustup install nightly
rustup default nightly
  3. Install cargo fuzz:
cargo install cargo-fuzz
  4. Install rust-fuzzer-gen to convert fuzz targets to unit test format:
cargo install --git https://github.com/SecurityLab-UCD/rust-fuzzer-gen.git
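Conceptually, the transform that rust-fuzzer-gen and the fuzz scripts perform rewrites each fuzz_target! harness so the inputs it receives are also echoed to stdout, where they can be collected. The following Python sketch is illustrative only: the real tooling operates on parsed Rust code, and the regex and macro shape below are assumptions.

```python
import re

# Hypothetical sketch: inject an input-dumping println! into the body of a
# `fuzz_target!` macro so every fuzzing input is echoed to stdout.
# The real rust-fuzzer-gen works on the Rust source properly; this
# string-level regex version only illustrates the idea.
FUZZ_TARGET_RE = re.compile(r"fuzz_target!\(\|(?P<arg>\w+): (?P<ty>[^|]+)\| \{")

def instrument_fuzz_target(source: str) -> str:
    """Insert a println! that dumps the input at the top of the target body."""
    def add_dump(m: re.Match) -> str:
        return m.group(0) + f'\n    println!("{{:?}}", {m.group("arg")});'
    return FUZZ_TARGET_RE.sub(add_dump, source)

rust_src = """\
fuzz_target!(|data: &[u8]| {
    my_crate::parse(data);
});
"""
print(instrument_fuzz_target(rust_src))
```

The instrumented harness keeps its original behavior; it only gains the side effect of printing each input, which is what lets the later steps harvest test inputs from fuzzing runs.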

Download Rust Repos

cd UniTSyn
python scripts/download_repos.py -r ../data/repo_meta/rust.txt --oroot ../data/repos --decompress=True --oauth=<your token>

Collecting Rust Fuzzing Data from Repos

fuzz/collect_fuzz.py is used to collect fuzzing data from the Rust fuzzing corpus. The pipeline is as follows:

  1. transform: rewrite each fuzz_target! to print its input to stdout and extract a test template
  2. build: build the fuzzing target in each repo (cargo fuzz build)
  3. fuzz: fuzz the target in each repo (cargo fuzz run <target>)
  4. testgen: substitute the collected inputs into the test template to produce test code
python fuzz/collect_fuzz.py --repo_id data/repo_meta/rust.txt -p all
cd UniTSyn
mkdir -p data/focal data/source
python frontend/rust/collect_all.py --repo_id ../data/repo_meta/rust.txt --repo_root ../data/rust_repos --fuzz True
python main.py --language rust --repo_root ../data/rust_repos
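The testgen step splices each input collected during fuzzing into the unit-test template extracted in the transform step. A minimal sketch of that substitution, with a made-up template and placeholder names (the pipeline's actual template format may differ):

```python
# Hypothetical sketch of the `testgen` step: render one #[test] function per
# recorded fuzzing input. The template shape and the `{idx}`/`{input}`
# placeholders are assumptions, not the pipeline's actual format.
TEMPLATE = """\
#[test]
fn fuzz_case_{idx}() {{
    let data: &[u8] = &{input};
    my_crate::parse(data);
}}
"""

def generate_tests(inputs: list[bytes]) -> str:
    """Substitute each collected input into the unit-test template."""
    cases = []
    for idx, raw in enumerate(inputs):
        literal = "[" + ", ".join(str(b) for b in raw) + "]"
        cases.append(TEMPLATE.format(idx=idx, input=literal))
    return "\n".join(cases)

print(generate_tests([b"\x00\x01", b"hi"]))
```

Each fuzzing input thus becomes a self-contained Rust unit test, which is what makes the augmented data match the unit-test format the model is trained on.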

Coverage

To evaluate coverage of unit tests, the following dependencies are required:

cargo install grcov
rustup component add llvm-tools-preview
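grcov can emit lcov-format reports (its `-t lcov` output type); a small helper like the following (an illustrative sketch, not part of this repository) shows how a line-coverage rate can be computed from the `DA:` records of such a report:

```python
# Illustrative sketch (not part of this repo): compute line coverage from an
# lcov-format report such as the ones grcov emits with `-t lcov`.
# Each `DA:<line>,<hits>` record gives the execution count for one source line.
def line_coverage(lcov_text: str) -> float:
    covered = total = 0
    for line in lcov_text.splitlines():
        if line.startswith("DA:"):
            _, hits = line[3:].split(",")[:2]
            total += 1
            covered += int(hits) > 0
    return covered / total if total else 0.0

report = """\
SF:src/lib.rs
DA:1,5
DA:2,0
DA:3,1
end_of_record
"""
print(line_coverage(report))  # 2 of 3 instrumented lines were hit
```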

Model Training and Inference

Fine-tuning

python training/train.py \
    --dataset_path data/fuzz100.jsonl \
    --model_name "codellama/CodeLlama-7b-hf" \
    --run_name "fuzzcoder" \
    --max_steps=100 \
    --save_path saved_models/fuzzcoder \
    --lora True

Inference

python training/generate.py \
    --model_name "codellama/CodeLlama-7b-hf" \
    --checkpoint saved_models/fuzzcoder/checkpoint-100 \
    -i data/humaneval_rust.jsonl \
    -o generated.jsonl

Citing this Paper

@inproceedings{he2025fuzzaug,
    author = {He, Yifeng and Wang, Jicheng and Rong, Yuyang and Chen, Hao},
    title = {FuzzAug: Data Augmentation by Coverage-guided Fuzzing for Neural Test Generation},  
    booktitle = {Conference on Empirical Methods in Natural Language Processing},
    date = {2025-11-05/2025-11-09},
    address = {Suzhou, China},
}
