
Conversation

jiannanWang (Contributor):

Added equal_metadata and test_metadata to enable metadata correctness checks for previously untestable operators. Operators like empty_like, new_empty, new_empty_strided, and bernoulli now have their metadata checked.

Running `uv run python BackendBench/scripts/main.py --suite opinfo --backend aten --ops "empty_like,new_empty,new_empty_strided,bernoulli"`, I got:

Before:

correctness score (mean pass rate over all operators): 0.54
performance score (geomean speedup over all operators): nan

After:

correctness score (mean pass rate over all operators): 1.00
performance score (geomean speedup over all operators): nan

In the meantime, I fixed a bug in test_eval.py where it tried to import gpu_bench from eval.py (which does not exist). This was causing all tests in that file to be skipped.

I also noticed that test_data overwrites previous entries when multiple tests have the same arguments, leading to assertion failures in test_eval_correctness_multiple_tests and test_eval_correctness_metadata. I commented out the affected assertion for now to let the tests pass, but this may need to be solved in a future PR.
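A minimal, self-contained sketch of the kind of collision described above, assuming test_data is a dict keyed by serialized arguments (the actual structures in BackendBench may differ; everything here is illustrative):

```python
# Hypothetical illustration of the overwrite: if results are keyed by the
# serialized arguments, two tests with identical args collapse into one entry.
test_data = {}
for name, args in [("test_a", (3, 4)), ("test_b", (3, 4))]:
    key = repr(args)           # stand-in for a serialize_args-style key
    test_data[key] = name      # the second write silently replaces the first

assert len(test_data) == 1     # an assertion expecting 2 entries would fail here
```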

@meta-cla bot added the CLA Signed label on Aug 25, 2025.
jiannanWang (Contributor, Author):

New commit:

Remove redundant UNTESTABLE_OPERATORS.

TODO:

Once this PR is merged, remove UNTESTABLE_OPERATORS from the dataset filter, since those operators are now testable.

PaliC (Contributor) left a comment:

For this I'd actually make the distinction between random and tensor creation ops. I'd then route tensor creation ops to test/equals_metadata, as it seems pretty similar to what you see here. Though I'd ensure it's comprehensive, as we'll add the other creation ops later.

For bernoulli, the testing code should be in one of the files at the bottom of #112, and we can likely just use the same testing methodology. If this PR gets merged, I still would not say we check correctness for bernoulli yet.

To me equals_metadata seems correct! However, I'd personally not merge it until we do a branch cut for the alpha version we are releasing on 9/6 just out of an abundance of caution.

_allclose(a.stride(), b.stride(), atol=0.0, rtol=0.0)
_allclose(a.dtype, b.dtype, atol=0.0, rtol=0.0)
_allclose(a.device, b.device, atol=0.0, rtol=0.0)
return True
PaliC (Contributor) commented on Aug 26, 2025:

I'd check the type string as well, as per the reference.

jiannanWang (Contributor, Author) replied:

Sure! I have added _allclose(a.is_sparse, b.is_sparse, atol=0.0, rtol=0.0).

The type string assertion checks dtype, device, and is_sparse. The first two are already checked, so I only added is_sparse.
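For context, a minimal sketch of the kind of metadata check discussed in this thread, covering shape, stride, dtype, device, and is_sparse; the actual equal_metadata in the PR may differ in details:

```python
import torch

def equal_metadata_sketch(a: torch.Tensor, b: torch.Tensor) -> bool:
    # Compare only tensor metadata, not values; creation ops like empty_like
    # leave their contents uninitialized, so values cannot be compared.
    return (
        a.shape == b.shape
        and a.stride() == b.stride()
        and a.dtype == b.dtype
        and a.device == b.device
        and a.is_sparse == b.is_sparse
    )

x = torch.empty(2, 3)
assert equal_metadata_sketch(torch.empty_like(x), torch.empty_like(x))
```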

Contributor:

Wait ... let's just use the functions / machinery from PyTorch directly. I feel like that's a bit more future-proof and feeds into our desire to make these generated kernels mergeable into PyTorch.
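One candidate for "PyTorch machinery" is the private helper torch._prims_common.compare_tensor_meta, which raises when two tensors disagree on metadata. This is only a guess at what is meant here, and, being a private API, its name and signature should be verified against the installed torch version:

```python
import torch
# Private helper; confirm it exists in your torch version before relying on it.
from torch._prims_common import compare_tensor_meta

x = torch.empty(2, 3)
# Raises if the two tensors disagree on metadata such as shape, dtype, or
# device; it does not look at the (uninitialized) values.
compare_tensor_meta(torch.empty_like(x), torch.empty_like(x))
```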

@@ -64,6 +66,21 @@ def allclose(a, b, atol=1e-2, rtol=1e-2):
return False


def equal_metadata(a, b):
Member:

One thing I'm not super clear on: OpInfo is indeed how they test tensor creation ops, and that's how we figured out this might be the right testing strategy. So why not just use OpInfo again here?

PaliC (Contributor) replied on Aug 26, 2025:

@@ -24,6 +24,8 @@
TRITON_AVAILABLE = False

from BackendBench.utils import serialize_args, uses_cuda_stream, compute_errors
from BackendBench.scripts.pytorch_operators import extract_operator_name
from BackendBench.scripts.dataset_filters import UNTESTABLE_OPERATORS
Member:

The name UNTESTABLE is no longer right; I'd be explicit and call it tensor creation ops.

jiannanWang (Contributor, Author) replied:

Changed to TENSOR_CREATION_OPERATORS.

@@ -183,18 +197,6 @@ def test_fn():
assert counter == 20
assert time_per_run > 0

def test_gpu_bench(self):
Member:

Was this giving you a problem, or do you just think it's a useless test?

jiannanWang (Contributor, Author) replied on Aug 26, 2025:

There's no gpu_bench function in eval.py; we use triton.testing.do_bench for GPU performance. The import actually causes an error, which is fixed in this PR.
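For reference, a minimal sketch of how triton.testing.do_bench is typically used for GPU timing; the benchmarked callable here is just an illustrative torch op, not the project's actual harness:

```python
import torch
import triton.testing

x = torch.randn(4096, 4096, device="cuda")
y = torch.randn(4096, 4096, device="cuda")

# do_bench repeatedly launches the callable on the GPU and returns a runtime
# in milliseconds (mean or median depending on the Triton version/options).
ms = triton.testing.do_bench(lambda: torch.mm(x, y))
print(f"torch.mm: {ms:.3f} ms")
```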

msaroufim (Member) left a comment:

Please check feedback before merging.

jiannanWang marked this pull request as a draft on August 26, 2025 at 20:34.