
Conversation

jiannanWang (Contributor):

Added equal_metadata and test_metadata to enable metadata correctness checks for previously untestable operators. Operators like empty_like, new_empty, new_empty_strided, and bernoulli now have their metadata checked.

Running `uv run python BackendBench/scripts/main.py --suite opinfo --backend aten --ops "empty_like,new_empty,new_empty_strided,bernoulli"`, I got:

Before:

correctness score (mean pass rate over all operators): 0.54
performance score (geomean speedup over all operators): nan

After:

correctness score (mean pass rate over all operators): 1.00
performance score (geomean speedup over all operators): nan

In the meantime, I fixed a bug in test_eval.py where it tried to import gpu_bench from eval.py (which does not exist). This was causing all tests in that file to be skipped.

I also noticed that test_data overwrites previous entries when multiple tests have the same arguments, leading to assertion failures in test_eval_correctness_multiple_tests and test_eval_correctness_metadata. I commented out the affected assertion for now to let the tests pass, but this may need to be solved in a future PR.
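A minimal, self-contained sketch of the kind of collision described above, assuming test_data is a dict keyed by serialized arguments (the actual structures in BackendBench may differ; everything here is illustrative):

```python
# Hypothetical illustration of the overwrite: if results are keyed by the
# serialized arguments, two tests with identical args collapse into one entry.
test_data = {}
for name, args in [("test_a", (3, 4)), ("test_b", (3, 4))]:
    key = repr(args)           # stand-in for a serialize_args-style key
    test_data[key] = name      # the second write silently replaces the first

assert len(test_data) == 1     # an assertion expecting 2 entries would fail here
```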

@meta-cla bot added the CLA Signed label on Aug 25, 2025.
jiannanWang (Contributor, Author):

New commit:

Remove redundant UNTESTABLE_OPERATORS.

TODO:

Once this PR is merged, remove UNTESTABLE_OPERATORS from the dataset filter, since those operators are now testable.

PaliC (Contributor) left a comment:

For this I'd actually make the distinction between random and tensor creation ops. I'd then route tensor creation ops to test/equals_metadata, as it seems pretty similar to what you see here. Though I'd ensure it's comprehensive, as we'll add the other creation ops later.

For bernoulli, the testing code should be in one of the files at the bottom of #112, and we can likely just use the same testing methodology. If this PR gets merged, I still would not say we check correctness for bernoulli yet.

To me equals_metadata seems correct! However, I'd personally not merge it until we do a branch cut for the alpha version we are releasing on 9/6 just out of an abundance of caution.

_allclose(a.stride(), b.stride(), atol=0.0, rtol=0.0)
_allclose(a.dtype, b.dtype, atol=0.0, rtol=0.0)
_allclose(a.device, b.device, atol=0.0, rtol=0.0)
return True
PaliC (Contributor) commented on Aug 26, 2025:

I'd check the type string as well, as per the reference.

jiannanWang (Contributor, Author) replied:

Sure! I have added _allclose(a.is_sparse, b.is_sparse, atol=0.0, rtol=0.0).

The type string assertion checks dtype, device, and is_sparse. The first two are already checked, so I only added is_sparse.
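For context, a minimal sketch of the kind of metadata check discussed in this thread, covering shape, stride, dtype, device, and is_sparse; the actual equal_metadata in the PR may differ in details:

```python
import torch

def equal_metadata_sketch(a: torch.Tensor, b: torch.Tensor) -> bool:
    # Compare only tensor metadata, not values; creation ops like empty_like
    # leave their contents uninitialized, so values cannot be compared.
    return (
        a.shape == b.shape
        and a.stride() == b.stride()
        and a.dtype == b.dtype
        and a.device == b.device
        and a.is_sparse == b.is_sparse
    )

x = torch.empty(2, 3)
assert equal_metadata_sketch(torch.empty_like(x), torch.empty_like(x))
```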

Contributor:

Wait ... let's just use the functions / machinery from PyTorch directly. I feel like that's a bit more future-proof and feeds into our desire to make these generated kernels mergeable into PyTorch.
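One candidate for "PyTorch machinery" is the private helper torch._prims_common.compare_tensor_meta, which raises when two tensors disagree on metadata. This is only a guess at what is meant here, and, being a private API, its name and signature should be verified against the installed torch version:

```python
import torch
# Private helper; confirm it exists in your torch version before relying on it.
from torch._prims_common import compare_tensor_meta

x = torch.empty(2, 3)
# Raises if the two tensors disagree on metadata such as shape, dtype, or
# device; it does not look at the (uninitialized) values.
compare_tensor_meta(torch.empty_like(x), torch.empty_like(x))
```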

@@ -64,6 +66,21 @@ def allclose(a, b, atol=1e-2, rtol=1e-2):
return False


def equal_metadata(a, b):
Member:

One thing I'm not super clear on: OpInfo is indeed how they test tensor creation ops, and that's how we figured out this might be the right testing strategy. So why not just use OpInfo again here?

PaliC (Contributor) replied on Aug 26, 2025:

@@ -24,6 +24,8 @@
TRITON_AVAILABLE = False

from BackendBench.utils import serialize_args, uses_cuda_stream, compute_errors
from BackendBench.scripts.pytorch_operators import extract_operator_name
from BackendBench.scripts.dataset_filters import UNTESTABLE_OPERATORS
Member:

The name UNTESTABLE is no longer right; I'd be explicit and call it tensor creation ops.

jiannanWang (Contributor, Author) replied:

Changed to TENSOR_CREATION_OPERATORS.

@@ -183,18 +197,6 @@ def test_fn():
assert counter == 20
assert time_per_run > 0

def test_gpu_bench(self):
Member:

Was this giving you a problem, or do you just think it's a useless test?

jiannanWang (Contributor, Author) replied on Aug 26, 2025:

There's no gpu_bench function in eval.py; we use triton.testing.do_bench for GPU performance. The import actually causes an error, which is fixed in this PR.
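For reference, a minimal sketch of how triton.testing.do_bench is typically used for GPU timing; the benchmarked callable here is just an illustrative torch op, not the project's actual harness:

```python
import torch
import triton.testing

x = torch.randn(4096, 4096, device="cuda")
y = torch.randn(4096, 4096, device="cuda")

# do_bench repeatedly launches the callable on the GPU and returns a runtime
# in milliseconds (mean or median depending on the Triton version/options).
ms = triton.testing.do_bench(lambda: torch.mm(x, y))
print(f"torch.mm: {ms:.3f} ms")
```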

msaroufim (Member) left a comment:

Please check feedback before merging.

jiannanWang marked this pull request as a draft on August 26, 2025 at 20:34.