
ts-bench: TypeScript Agent Benchmark

ts-bench is a transparent and reproducible benchmark project for evaluating the TypeScript code editing capabilities of AI coding agents.

Leaderboard

| Rank | Agent    | Model                               | Success Rate | Solved | Avg Time | Result  |
|------|----------|-------------------------------------|--------------|--------|----------|---------|
| 1    | opencode | openai/gpt-5                        | 96.0%        | 24/25  | 64.8s    | #415419 |
| 2    | goose    | claude-sonnet-4-20250514            | 92.0%        | 23/25  | 122.2s   | #186071 |
| 3    | opencode | anthropic/claude-sonnet-4-20250514  | 92.0%        | 23/25  | 127.8s   | #043809 |
| 4    | gemini   | gemini-2.5-pro                      | 92.0%        | 23/25  | 168.5s   | #052819 |
| 5    | codex    | gpt-5                               | 88.0%        | 22/25  | 91.7s    | #734992 |
| 6    | opencode | opencode/grok-code                  | 88.0%        | 22/25  | 97.0s    | #083421 |
| 7    | claude   | glm-4.5                             | 80.0%        | 20/25  | 172.3s   | #591219 |
| 8    | claude   | claude-sonnet-4-20250514            | 72.0%        | 18/25  | 206.1s   | #732069 |
| 9    | qwen     | qwen3-coder-plus                    | 64.0%        | 16/25  | 123.9s   | #246268 |
| 10   | aider    | claude-sonnet-4-20250514            | 32.0%        | 8/25   | 40.5s    | #119174 |

🤖 Supported Agents

Currently supported agents (CLI names as they appear on the leaderboard above):

  • aider
  • claude (Claude Code)
  • codex
  • gemini
  • goose
  • opencode
  • qwen

📖 Vision & Principles

This project is strongly inspired by benchmarks like Aider Polyglot. Rather than measuring the performance of large language models (LLMs) alone, it focuses on evaluating the agent layer—the entire AI coding assistant tool, including prompt strategies, file operations, and iterative logic.

Based on this vision, the benchmark is designed according to the following principles:

  • TypeScript-First: Focused on TypeScript, which is essential in modern development. Static typing presents unique challenges and opportunities for AI agents, making it a crucial evaluation target.
  • Agent-Agnostic: Designed to be independent of any specific AI agent, allowing fair comparison of multiple CLI-based agents such as Aider and Claude Code.
  • Baseline Performance: Uses self-contained problem sets sourced from Exercism as a baseline for measuring basic code reading and editing ability. Unlike repository-scale benchmarks such as SWE-bench, it is not intended to measure performance on large-scale edits or complex bug fixes spanning an entire codebase.

📊 Results & Methodology

All benchmark results are generated and published via GitHub Actions.

Each results page provides a formatted summary and downloadable artifacts containing raw data (JSON).
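The raw data can also be pulled locally with the GitHub CLI. This is a minimal sketch rather than a documented project workflow: it assumes the numeric IDs in the Result column refer to GitHub Actions runs, and that gh is installed and authenticated.

# Locate the run behind a leaderboard entry, then download its artifacts (raw JSON)
gh run list --repo laiso/ts-bench --limit 10
gh run download <run-id> --repo laiso/ts-bench --dir results/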

Documentation

For detailed documentation, see:

🚀 Getting Started

Installation

bun install
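The command above assumes the Bun runtime is already installed. If it is not, the official installer can be run first (shown for Unix-like systems; see https://bun.sh for other platforms):

# Install Bun (the project's runtime and package manager)
curl -fsSL https://bun.sh/install | bash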

Usage

Run the benchmark with the following commands. Use --help to see all available options.

# Run the default 25 problems with Claude Code (Claude 3.5 Sonnet)
bun src/index.ts --agent claude --model claude-3-5-sonnet-20240620

# Run only the 'acronym' problem with Aider (GPT-4o)
bun src/index.ts --agent aider --model gpt-4o --exercise acronym
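Any agent/model pair from the leaderboard can be substituted into the commands above. As a sketch, the top leaderboard entry would be reproduced roughly as follows, assuming the CLI accepts agent and model identifiers exactly as shown in the leaderboard (check --help if they differ):

# Run the default problem set with opencode using GPT-5
bun src/index.ts --agent opencode --model openai/gpt-5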
