---
date: 2025-10-21 00:00
title: A Claude Code command for Hypothesis
author: Liam DeVoe, Muhammad Maaz, Zac Hatfield-Dodds, Nicholas Carlini
---

<div class="cta-buttons">
  <a href="https://github.com/hypothesisworks/hypothesis/agents/hypothesis.md" class="cta-button">
    <img src="/theme/icon-code.svg" alt="" class="cta-icon">
    View the command
  </a>
  <a href="https://mmaaz-git.github.io/agentic-pbt-site/" class="cta-button">
    <img src="/theme/icon-paper.svg" alt="" class="cta-icon">
    Read the paper
  </a>
</div>

*We wrote a paper using Claude to autonomously write and run Hypothesis tests, and found real bugs in numpy, pandas, and other packages. We've extracted this workflow into a Claude Code command for writing Hypothesis tests, which we're sharing today. We hope you find it useful.*

*(Not familiar with property-based testing? [Learn more here](https://increment.com/testing/in-praise-of-property-based-testing/)).*

---

Hypothesis has long shipped with [the ghostwriter](https://hypothesis.readthedocs.io/en/latest/reference/integrations.html#ghostwriter), which automatically writes Hypothesis tests for your code. It uses nothing but good old-fashioned heuristics, and is a nice way to stand up Hypothesis tests with minimal effort.
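
For reference, the ghostwriter can be driven from the command line with `hypothesis write <module>`, or from Python. Here's a minimal sketch of the Python route, assuming the ghostwriter's optional dependencies are installed:

```python
import gzip

from hypothesis.extra import ghostwriter

# Ask the ghostwriter to infer testable properties for the gzip module and
# print the generated pytest-style test source; you would normally redirect
# this output into a file in your test suite.
print(ghostwriter.magic(gzip))
```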

Recently, we explored what this same idea might look like with modern AI tools, like Anthropic's Claude Sonnet 4.5 and OpenAI's GPT-5, and the results have been pretty compelling. So we're happy to release `/hypothesis`, a [Claude Code](https://www.claude.com/product/claude-code) command that we developed to automate writing Hypothesis tests.

The `/hypothesis` command instructs the model to automatically read your code, infer testable properties, and add Hypothesis tests to your test suite. The idea is that if you wanted to add Hypothesis tests for a file `mypackage/a/utils.py`, you could run `/hypothesis mypackage/a/utils.py`, go get a coffee, and come back to some newly added tests. You can alternatively give more complex instructions, like `/hypothesis focus on the database implementation; add tests to test_db.py`.

We've found `/hypothesis` pretty useful when combined with modern AI models, for tasks ranging from setting up tests in fresh repositories, to augmenting existing test suites, to standing up a full fuzzing workflow with [HypoFuzz](https://hypofuzz.com/).

Since `/hypothesis` doesn't (yet) make sense to release in Hypothesis itself, we're releasing it here. [You can find the full command here](https://github.com/hypothesisworks/hypothesis/agents/hypothesis.md); install it by copying it into `~/.claude/commands/`, and run it with `/hypothesis` inside Claude Code[^1].

# Designing the `/hypothesis` command

The broad goal of the `/hypothesis` command is to: (1) look at some code; (2) discover properties that make sense to test; and (3) write Hypothesis tests for those properties.

As many developers will attest, the trickiest part of property-based testing is often figuring out what property to test. This is true for modern AI models as well. We therefore designed the `/hypothesis` instructions around having the model gather as much context about potential properties as it can before writing any tests. This grounds the tests it writes in factual evidence: type hints, docstrings, usage patterns, or existing unit tests.
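
As a toy illustration (our own, hypothetical example rather than anything produced by the command), a signature and docstring alone can already pin down a property worth testing:

```python
from hypothesis import given, strategies as st


def merge_intervals(intervals: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Merge overlapping intervals; the result is sorted and non-overlapping."""
    merged: list[tuple[int, int]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# The docstring promises a sorted, non-overlapping result for any input, and
# the type hints tell us what "any input" means. That is exactly the kind of
# evidence a property-based test can be built on.
@given(st.lists(st.tuples(st.integers(), st.integers()).map(sorted).map(tuple)))
def test_merged_intervals_are_sorted_and_disjoint(intervals):
    merged = merge_intervals(intervals)
    assert merged == sorted(merged)
    assert all(a[1] < b[0] for a, b in zip(merged, merged[1:]))
```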

The flow of the `/hypothesis` instructions looks like this:

1. Explore the provided code and identify candidate properties.
2. Explore how the codebase calls that code in practice.
3. Grounded in this understanding, write corresponding Hypothesis tests.
4. Run the new Hypothesis tests, and reflect on any failures. Is it a genuine bug, or is the test incorrect? Refactor the test if necessary.

The legwork that `/hypothesis` instructs the model to do both before and after writing a test is critical for deriving high-quality tests. For example, the model might discover in step 2 that a function is called with two different input formats, and both should be tested. Or it might discover in step 4 that it wrote an unsound test that generates inputs the function was never meant to accept, like `math.nan`.
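
To make that second case concrete, here's a hypothetical sketch (not output from the command) of an over-generating first draft and the refactor that step 4 would produce:

```python
from hypothesis import given, strategies as st


def normalize(xs: list[float]) -> list[float]:
    """Hypothetical function under test: scale positive values to sum to 1."""
    total = sum(xs)
    return [x / total for x in xs]


# First draft: st.floats() also generates math.nan, math.inf, and values that
# sum to zero, none of which normalize() is documented to accept. This test
# fails, but the bug is in the test, not the code:
#
#     @given(st.lists(st.floats()))
#     def test_normalized_values_sum_to_one(xs):
#         assert abs(sum(normalize(xs)) - 1.0) < 1e-9


# Refactored in step 4 to generate only the inputs the function expects.
@given(st.lists(st.floats(min_value=1e-6, max_value=1e6), min_size=1))
def test_normalized_values_sum_to_one(xs):
    assert abs(sum(normalize(xs)) - 1.0) < 1e-9
```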

## Failure modes

We observed a few failure modes while developing `/hypothesis`. For example, AI models like to write strategies with unnecessary restrictions, like limiting the maximum length of a list even when the property should hold for lists of any length. We added explicit instructions in `/hypothesis` not to do this, though that doesn't appear to have fixed the problem entirely.
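
As an illustration (ours, not the command's output), a model-written strategy often looks like the first one below when the property actually calls for the second:

```python
from hypothesis import strategies as st

# Unnecessarily restricted: nothing about the property requires short lists
# of small integers, but models tend to add bounds like these anyway.
restricted = st.lists(st.integers(min_value=0, max_value=100), max_size=10)

# What /hypothesis asks for instead: generate the full range of valid inputs,
# constraining the strategy only where the code's contract demands it.
unrestricted = st.lists(st.integers())
```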

By far the most fundamental failure mode is that the model might simply misunderstand a property in the code. For example, we ran `/hypothesis` on [python-dateutil](https://github.com/dateutil/dateutil); specifically, `/hypothesis src/easter.py`. The model determined that a property of the `easter` function is that it should always return a date on a Sunday, no matter the `method` argument, of which dateutil provides three: `method=EASTER_JULIAN`, `method=EASTER_ORTHODOX`, and `method=EASTER_WESTERN`. The model wrote a test saying as much, the test failed, and it proudly claimed it had found a bug.

In fact, the model had not found a bug. In reality, `dateutil.easter` computes the date for Easter in the calendar corresponding to the passed `method`, but always returns that date in the Gregorian calendar—which might not be a Sunday. The test written by the model assumed the computation occurred in the Gregorian calendar from start to finish, which was incorrect.
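
To make this concrete, here is a reconstruction of the kind of test the model wrote (our sketch, not its verbatim output). The assertion looks plausible, but as described above it does not hold for every `method`:

```python
from dateutil.easter import EASTER_JULIAN, EASTER_ORTHODOX, EASTER_WESTERN, easter
from hypothesis import given, strategies as st


@given(
    year=st.integers(min_value=1583, max_value=4099),
    method=st.sampled_from([EASTER_JULIAN, EASTER_ORTHODOX, EASTER_WESTERN]),
)
def test_easter_is_always_a_sunday(year, method):
    # Incorrect assumption: the returned datetime.date does not always
    # represent a Sunday in the Gregorian calendar that weekday() works in.
    assert easter(year, method).weekday() == 6  # 6 == Sunday
```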

This kind of subtle semantic reasoning remains difficult for models, and it's worth keeping this limitation in mind.

# Using `/hypothesis` for bug hunting

A natural extension of a test-writing command is to use it to find real bugs in open-source repositories. To explore this, we used Claude Opus 4.1 to automatically write and run Hypothesis tests for a number of popular Python packages. The results were promising—we found bugs in NumPy, pandas, and Google and Amazon SDKs, and [submitted](https://github.com/numpy/numpy/pull/29609) [patches](https://github.com/aws-powertools/powertools-lambda-python/pull/7246) [for](https://github.com/aws-cloudformation/cloudformation-cli/pull/1106) [several](https://github.com/huggingface/tokenizers/pull/1853) of them. You can [read more in our paper](https://mmaaz-git.github.io/agentic-pbt-site/); it's quite short, so do give it a read if you're interested.

It's instructive to walk through one of these bugs in particular: a bug in [NumPy's `numpy.random.wald`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.wald.html) function, which samples from the Wald (or inverse Gaussian) distribution.

To start, we ran `/hypothesis numpy.random` to kick off the model. This directs the model to write tests for the entire `numpy.random` module. The model starts by reading the source code of `numpy.random` as well as any relevant docstrings. It sees the function `wald`, knows from its background knowledge that the Wald distribution only takes positive values, and tracks that as a potential property. It reads further and discovers from the docstring of `wald` that both the `mean` and `scale` parameters must be greater than 0.

Based on this understanding, and a few details from docstrings that we've omitted, the model proposes a range of properties:

1. All outputs of `wald` are positive.
2. No `math.nan` or `math.inf` values are returned on valid inputs.
3. The returned array shape matches the `size` parameter.
4. The `mean` and `scale` arrays broadcast correctly.
5. Seeding the distribution produces deterministic results.

It then goes about writing Hypothesis tests for these properties. Here's one of the tests it writes (lightly reformatted):

```python
import numpy as np

from hypothesis import given, strategies as st

positive_floats = st.floats(
    min_value=1e-10, max_value=1e6, allow_nan=False, allow_infinity=False
)


@given(
    mean=positive_floats,
    scale=positive_floats,
    size=st.integers(min_value=1, max_value=1000),
)
def test_wald_all_outputs_positive(mean, scale, size):
    """Test that all Wald distribution samples are positive."""
    samples = np.random.wald(mean, scale, size)
    assert np.all(samples > 0), f"Found non-positive values: {samples[samples <= 0]}"
```

It then runs this test. And the test fails! After reflection, the model decides this is a real bug, leaves the test in the test suite, and reports the failure to the developer.

What's going on here? We tracked this bug down to catastrophic cancellation in NumPy's `wald` implementation, which could sometimes produce negative samples. We reported it to the NumPy maintainers alongside a patch with a more numerically stable algorithm. They confirmed the bug, and our fix was released in [v2.3.4](https://github.com/numpy/numpy/releases/tag/v2.3.4). You can [check out the PR here](https://github.com/numpy/numpy/pull/29609).
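
To get a feel for where the cancellation comes from, here's a pure-Python sketch of the classic transform that inverse Gaussian samplers are typically built on (Michael, Schucany & Haas, 1976). This is an illustration of the failure mechanism, not NumPy's actual C implementation:

```python
import math
import random


def wald_sample(mean, scale, rng=random):
    """Draw one sample from a Wald(mean, scale) distribution."""
    y = rng.gauss(0.0, 1.0) ** 2
    # When mean * y is much larger than scale, the two large terms below are
    # nearly equal, and subtracting them loses almost all precision: the
    # result can round to zero or even come out negative.
    x = (
        mean
        + mean * mean * y / (2.0 * scale)
        - (mean / (2.0 * scale)) * math.sqrt(4.0 * mean * scale * y + (mean * y) ** 2)
    )
    if rng.random() <= mean / (mean + x):
        return x
    return mean * mean / x
```

The linked PR replaces this with a more numerically stable formulation.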

We think this is a really neat confirmation of both the power of property-based testing, and the ability of current AI models to reason about code.

# Conclusion

We hope you find `/hypothesis` useful for adding Hypothesis tests to your test suites! Developing AI prompts is more of an art than a science, so we encourage you to give feedback on `/hypothesis` by [opening an issue in the Hypothesis repository](https://github.com/HypothesisWorks/hypothesis/issues/new), even if it's just some open-ended thoughts.

[^1]: While Claude Code is currently the most popular tool that supports custom commands, `/hypothesis` is just a markdown file, and works equally well with any AI framework that supports commands.