Yiyou Sun¹, Yuhan Cao, Pohao Huang¹, Haoyue Bai², Hannaneh Hajishirzi³⁴, Nouha Dziri⁴♠, Dawn Song¹♠
¹ University of California, Berkeley · ² University of Wisconsin–Madison · ³ University of Washington · ⁴ AI2 (♠ indicates equal advising)
---
DELTA Benchmark Suite: A controlled collection of synthetic programming families with fully out-of-distribution splits (Manufactoria) and verifiable rewards. DELTA lets us ask two questions:
- Learnability: Can RL solve families where the base model has pass@K=0?
- Transferability: Do the learned procedures generalize?
---
Grokking Phase Transition: On several pass@128=0 families, RL exhibits a grokking-like phase transition—after a long near-zero-reward plateau, accuracy snaps to ~100%. That is discovery, not mere sharpening.
---
Two-Phase Reward Schedule: A staged reward schedule is key to escaping the "all-zero" region:
- Phase 1: Dense per-test rewards to break out of zero region
- Phase 2: Binary full-pass to consolidate exact solutions
- Binary-only gets stuck; dense-only hovers at "almost right." The schedule yields the grokking jump (see the sketch below).
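As a minimal sketch in Python (not the repository's actual scoring code), the two reward modes boil down to the following; `test_results` is an assumed list of per-unit-test booleans:

```python
def pass_rate_reward(test_results: list[bool]) -> float:
    """Phase 1 ("pass_rate"): dense reward = fraction of unit tests passed."""
    return sum(test_results) / len(test_results) if test_results else 0.0


def full_pass_reward(test_results: list[bool]) -> float:
    """Phase 2 ("full_pass"): binary reward = 1.0 only if every unit test passes."""
    return 1.0 if test_results and all(test_results) else 0.0


# An "almost right" program (7 of 8 tests passing) earns 0.875 under pass_rate but
# 0.0 under full_pass: dense credit is what lets training leave the all-zero region,
# and the binary signal is what consolidates exact solutions.
results = [True] * 7 + [False]
assert abs(pass_rate_reward(results) - 0.875) < 1e-9
assert full_pass_reward(results) == 0.0
```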
This repository contains the DELTA benchmark suite—a collection of five distinct problem families designed to rigorously test LLM reasoning and RL learnability in truly out-of-distribution settings:
Manufactoria: A pure OOD learnability testbed based on a classic 2010 Flash game.
- What: Program "robot factories" using a minimal DSL with just two primitives: PULLER (read) and PAINTER (write) (a hypothetical interpreter sketch follows this list)
- Why OOD: Brand-new textual DSL never seen on the internet; fresh puzzles requiring finite-state, tape-shuffling strategies
- Difficulty: 10+ problem families from basic pattern matching to computational tasks where GPT-5 achieves 0% success
- Location: `manufactoria/`
- Datasets: HuggingFace @manufactoria
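The exact DSL grammar lives in `manufactoria/`; the sketch below is a purely hypothetical Python interpreter for the two primitives, assuming PULLER reads and removes the symbol at the front of the tape (branching on it) and PAINTER appends a symbol to the end. The node encoding and the `{"R", "B"}` alphabet are illustrative, not the benchmark's actual format.

```python
from collections import deque


def run_factory(nodes, start, tape):
    """Interpret a small graph of PULLER/PAINTER nodes over a tape of symbols.

    nodes maps a node id to either
      ("PULLER", {symbol: next_id, ...})   # read: pop the front symbol and branch on it
    or
      ("PAINTER", symbol, next_id)         # write: append a symbol, then continue
    A next_id of None (or a missing branch) halts and returns the remaining tape.
    """
    tape = deque(tape)
    node_id = start
    while node_id is not None:
        kind, *args = nodes[node_id]
        if kind == "PULLER":
            branches = args[0]
            symbol = tape.popleft() if tape else ""
            node_id = branches.get(symbol)
        else:  # PAINTER
            symbol, node_id = args
            tape.append(symbol)
    return "".join(tape)


# Illustrative program: move the tape's leading symbol to the end.
nodes = {
    0: ("PULLER", {"R": 1, "B": 2, "": None}),
    1: ("PAINTER", "R", None),
    2: ("PAINTER", "B", None),
}
print(run_factory(nodes, 0, "RBB"))  # -> "BBR"
```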
BouncingSim: Physics simulation for testing compositional and transformative generalization.
- What: Synthesize programs that simulate 2D elastic collisions in polygonal containers with precise trajectories (a minimal dynamics sketch follows this list)
- Families: 6 physics scenarios (rotating objects, rotating boxes, moving boxes, gravity, multiple balls/boxes)
- Generalization Axes:
- Exploratory 🧭: Harder scenes (more vertices, higher bounciness)
- Compositional 🧩: Recombine primitives (multi-ball + moving boxes)
- Transformative 🔄: Qualitatively different dynamics (periodic trajectories)
- Location: `bouncingsim/`
- Datasets: HuggingFace @bouncingsim
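As a minimal sketch of the simplest case these tasks target (a single ball in a fixed axis-aligned box, with a restitution coefficient for bounciness); the actual `bouncingsim/` generator covers rotating and moving polygonal containers, gravity, and multiple balls, and its interfaces may differ:

```python
def simulate(pos, vel, box=(0.0, 0.0, 10.0, 10.0), restitution=1.0,
             gravity=(0.0, 0.0), dt=0.01, steps=1000):
    """Trace a single ball bouncing elastically inside an axis-aligned box."""
    x, y = pos
    vx, vy = vel
    x_min, y_min, x_max, y_max = box
    trajectory = []
    for _ in range(steps):
        vx += gravity[0] * dt
        vy += gravity[1] * dt
        x += vx * dt
        y += vy * dt
        # Reflect the velocity off any wall that was crossed, scaled by the
        # restitution ("bounciness") coefficient; clamp the position back inside the box.
        if x < x_min or x > x_max:
            x = min(max(x, x_min), x_max)
            vx = -restitution * vx
        if y < y_min or y > y_max:
            y = min(max(y, y_min), y_max)
            vy = -restitution * vy
        trajectory.append((x, y))
    return trajectory


# With restitution=1.0 and no gravity, speed is conserved between bounces.
print(simulate(pos=(5.0, 5.0), vel=(3.0, 4.0))[-1])
```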
Other Problem Scopes: additional domains for learnability testing.
- Comment: These are domains where LLMs have already undergone substantial training; they are included for readers' interest. For learnability testing we typically use a very small LLM (under 0.5B parameters) so that the pass@k=0 regime still applies.
- Location: `sql/`, `competitioncode/`, `lean/`
Training Code: Complete RLVR infrastructure available at `open-instruct/merge-code-utils`
Some of the training scripts are tailored to AI2's cluster; we provide them for the readers' reference, and readers are encouraged to adapt the training parameters to other RLVR frameworks.
We kept the training parameters unchanged across the training settings in the paper; the only variations were the datasets used, the reference models employed, and the scoring mode applied (per-test accuracy or full-pass rate).
Phase 1 (dense per-test rewards): Break out of the all-zero reward region by providing partial credit:
```bash
cd train
# Edit phase1.sh to configure your experiment
bash phase1.sh
```
Key Settings:
- `SCORE_MODE=pass_rate` (per-test accuracy)
- Reward = fraction of unit tests passed (0.0 to 1.0)
- Enables gradient flow even when no solution is perfect
Phase 2 (binary full-pass): From the Phase 1 checkpoint, switch to strict correctness:
```bash
# Edit phase2.sh to point to the Phase 1 checkpoint
bash phase2.sh
```
Key Settings:
- `SCORE_MODE=full_pass` (binary full-pass rate)
- Reward = 1.0 only if all tests pass, else 0.0 (see the scoring sketch below)
- Triggers the grokking phase transition to exact solutions
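For reference, a simplified, self-contained sketch of how the two `SCORE_MODE` settings map unit-test outcomes to rewards is shown below; it is not the open-instruct scoring code, and a real RLVR pipeline would sandbox the execution of model-generated programs:

```python
def score_completion(solution_code: str, tests: list[str], score_mode: str) -> float:
    """Run a model-generated solution against unit tests and return its reward."""
    namespace: dict = {}
    try:
        exec(solution_code, namespace)        # define the candidate solution
    except Exception:
        return 0.0
    results = []
    for test in tests:
        try:
            exec(test, namespace)             # each test raises AssertionError on failure
            results.append(True)
        except Exception:
            results.append(False)
    if score_mode == "pass_rate":             # Phase 1: dense partial credit
        return sum(results) / max(len(results), 1)
    return 1.0 if results and all(results) else 0.0   # Phase 2 ("full_pass")


# A near-miss solution earns partial credit in Phase 1 but zero reward in Phase 2.
solution = "def add(a, b):\n    return a + b if a != 0 else 0\n"
tests = ["assert add(1, 2) == 3", "assert add(0, 5) == 5"]
print(score_completion(solution, tests, "pass_rate"))  # 0.5
print(score_completion(solution, tests, "full_pass"))  # 0.0
```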
If you find this work useful, please cite our paper:
```bibtex
@misc{sun2025rlgrok,
title = {RL Grokking Recipe: How Does RL Unlock and Transfer New Algorithms in LLMs?},
author = {Yiyou Sun and Yuhan Cao and Pohao Huang and Haoyue Bai and Hannaneh Hajishirzi and Nouha Dziri and Dawn Song},
year = {2025},
month = {sep},
eprint = {2509.21016},
archivePrefix = {arXiv},
primaryClass = {cs.LG},
doi = {10.48550/arXiv.2509.21016},
url = {https://arxiv.org/abs/2509.21016}
}
```

Note: Some components (e.g., SQL) are still under development. Please check the individual component READMEs for the latest status, and contact the authors if you are interested in contributing to or using these datasets.