ts-bench is a transparent and reproducible benchmark project for evaluating the TypeScript code editing capabilities of AI coding agents.
| Rank | Agent | Model | Success Rate | Solved | Avg Time | Result |
|---|---|---|---|---|---|---|
| 1 | opencode | openai/gpt-5 | 96.0% | 24/25 | 64.8s | #415419 |
| 2 | goose | claude-sonnet-4-20250514 | 92.0% | 23/25 | 122.2s | #186071 |
| 3 | opencode | anthropic/claude-sonnet-4-20250514 | 92.0% | 23/25 | 127.8s | #043809 |
| 4 | gemini | gemini-2.5-pro | 92.0% | 23/25 | 168.5s | #052819 |
| 5 | codex | gpt-5 | 88.0% | 22/25 | 91.7s | #734992 |
| 6 | opencode | opencode/grok-code | 88.0% | 22/25 | 97.0s | #083421 |
| 7 | claude | glm-4.5 | 80.0% | 20/25 | 172.3s | #591219 |
| 8 | claude | claude-sonnet-4-20250514 | 72.0% | 18/25 | 206.1s | #732069 |
| 9 | qwen | qwen3-coder-plus | 64.0% | 16/25 | 123.9s | #246268 |
| 10 | aider | claude-sonnet-4-20250514 | 32.0% | 8/25 | 40.5s | #119174 |
Currently supported agents (the values accepted by `--agent`, as shown in the Agent column of the leaderboard above):

- `aider`
- `claude`
- `codex`
- `gemini`
- `goose`
- `opencode`
- `qwen`
This project is strongly inspired by benchmarks like Aider Polyglot. Rather than measuring the performance of large language models (LLMs) alone, it focuses on evaluating the agent layer—the entire AI coding assistant tool, including prompt strategies, file operations, and iterative logic.
Based on this vision, the benchmark is designed according to the following principles:
- TypeScript-First: Focused on TypeScript, which is essential in modern development. Static typing presents unique challenges and opportunities for AI agents, making it a crucial evaluation target.
- Agent-Agnostic: Designed to be independent of any specific AI agent, allowing fair comparison of multiple CLI-based agents such as Aider and Claude Code.
- Baseline Performance: Uses self-contained problem sets sourced from Exercism to serve as a baseline for measuring basic code reading and editing abilities (see the exercise sketch after this list). It is not intended to measure performance on large-scale editing tasks or complex bug fixes across entire repositories, as SWE-bench does.
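To make the task format concrete, here is a sketch of the kind of self-contained, typed exercise the agents edit. The file name, function signature, and solution below are illustrative assumptions modeled on Exercism's TypeScript track, not the benchmark's exact files:

```typescript
// acronym.ts: illustrative Exercism-style exercise (name and signature
// are assumptions; the benchmark's actual stubs may differ). The agent
// starts from a stub that throws and must edit it until the exercise's
// test suite passes.
export function parse(phrase: string): string {
  // Split on whitespace, hyphens, and underscores, keep non-empty
  // words, and join their uppercased first letters.
  return phrase
    .split(/[\s_-]+/)
    .filter((word) => word.length > 0)
    .map((word) => word[0].toUpperCase())
    .join("");
}
```

Because the stub is statically typed, the agent's edit has to satisfy the TypeScript compiler as well as the tests, which is part of what makes TypeScript a distinctive evaluation target.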
All benchmark results are generated and published via GitHub Actions.
Each results page provides a formatted summary and downloadable artifacts containing raw data (JSON).
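As an illustration of working with that raw data, the sketch below recomputes the leaderboard columns (Solved, Success Rate, Avg Time) from a downloaded artifact. The file name `results.json` and the `ExerciseResult` field names are assumptions; inspect an actual artifact for the real schema.

```typescript
import { readFileSync } from "node:fs";

// Hypothetical shape of one benchmark entry; the real artifact schema
// may differ. Check a downloaded JSON before relying on these names.
interface ExerciseResult {
  exercise: string;   // e.g. "acronym"
  success: boolean;   // did the exercise's tests pass?
  durationMs: number; // wall-clock time the agent took
}

const results: ExerciseResult[] = JSON.parse(
  readFileSync("results.json", "utf8"),
);

const solved = results.filter((r) => r.success).length;
const successRate = (100 * solved) / results.length;
const avgSeconds =
  results.reduce((sum, r) => sum + r.durationMs, 0) / results.length / 1000;

console.log(`Solved: ${solved}/${results.length}`);
console.log(`Success rate: ${successRate.toFixed(1)}%`);
console.log(`Avg time: ${avgSeconds.toFixed(1)}s`);
```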
For detailed documentation, see:
- Environment Setup: Details on setting up the local and Docker environments.
- Leaderboard Operation Design: Explains how the leaderboard is updated and maintained.
Install dependencies:

```bash
bun install
```
Run the benchmark with the following commands. Use `--help` to see all available options.
```bash
# Run the default 25 problems with Claude Code (Sonnet 3.5)
bun src/index.ts --agent claude --model claude-3-5-sonnet-20240620

# Run only the 'acronym' problem with Aider (GPT-4o)
bun src/index.ts --agent aider --model gpt-4o --exercise acronym
```