Conversation


@Raimo33 Raimo33 commented Sep 2, 2025

Goal

This PR refactors the benchmarking functions as per #1701, making benchmarks more deterministic and less influenced by the environment.

This is achieved by replacing the wall-clock timer with a per-process CPU timer where possible.
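For illustration, a minimal sketch of what such a timer helper could look like (the helper name is hypothetical; the actual code in the PR may differ). CLOCK_PROCESS_CPUTIME_ID counts only the CPU time consumed by the benchmarking process, so time spent descheduled by the OS does not inflate the measurement the way a wall-clock source such as gettimeofday() does.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: elapsed CPU time of this process in nanoseconds.
 * Time during which the process is descheduled (interrupts, other
 * processes, I/O waits) is not counted, which makes repeated runs of
 * the same benchmark more deterministic than wall-clock timing. */
static int64_t bench_cpu_time_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + (int64_t)ts.tv_nsec;
}
```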

@Raimo33 Raimo33 marked this pull request as draft September 2, 2025 14:58
@Raimo33 Raimo33 mentioned this pull request Sep 2, 2025
@real-or-random
Contributor

real-or-random commented Sep 2, 2025

Just some quick comments:

[x] remove the number of runs (count) in favor of a simpler, cleaner approach with just the number of iterations (iter).

I think there's a reason to have this. Some benchmarks take much longer than others, so it probably makes sense to run fewer iters for these.

[x] remove min and max statistics in favor of a simpler approach with just the avg.

I think min and max are useful. For constant-time code, you can also compare min. And max gives you an idea if there were fluctuations or not.

[x] remove the needless fixed-point conversion in favor of simpler floating-point divisions.

Well, okay, that has a history; see #689. It's debatable whether it makes sense to avoid floating-point math, but as long as it doesn't get in your way here, it's a cool thing to keep. :D

@real-or-random
Contributor

It would be useful to split your changes into meaningful, separate commits; see https://github.com/bitcoin/bitcoin/blob/master/CONTRIBUTING.md#committing-patches.

@Raimo33
Author

Raimo33 commented Sep 2, 2025

I think min and max just complicate things. Let me explain:
First of all, as it stands they don't even measure the true min and max of individual iterations; they measure the min/max of the per-run averages, not the absolute extremes. Furthermore, to compute them at all, every iteration has to be run 10 times over. Benchmarks are already slow, and this min/max tracking makes them 10x slower. IMHO it's completely unnecessary. @real-or-random
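For context, a rough sketch of the scheme being described (names are illustrative, and floating point is used here only for readability; the actual bench code uses fixed-point arithmetic, as discussed below). The benchmark body runs count times with iter iterations each, so min/max are taken over the count per-run averages and the total work is count * iter iterations.

```c
#include <stdint.h>

/* Illustrative sketch only, not the actual bench.h code.
 * bench_cpu_time_ns() is the hypothetical timer helper from above;
 * benchmark() stands for the wrapper that calls the function under test. */
static void run_bench(void (*benchmark)(void *, int), void *data,
                      int count, int iter,
                      double *min_out, double *avg_out, double *max_out) {
    double sum = 0.0, best = 1e300, worst = 0.0;
    int run;
    for (run = 0; run < count; run++) {          /* count == 10 today */
        int64_t begin = bench_cpu_time_ns();
        benchmark(data, iter);                   /* iter iterations of the function */
        {
            double run_avg = (double)(bench_cpu_time_ns() - begin) / iter;
            if (run_avg < best)  best  = run_avg;   /* "min" = best run average  */
            if (run_avg > worst) worst = run_avg;   /* "max" = worst run average */
            sum += run_avg;
        }
    }
    /* min/max are over the count run averages, not over single iterations. */
    *min_out = best;
    *avg_out = sum / count;
    *max_out = worst;
}
```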

@sipa
Contributor

sipa commented Sep 2, 2025

If we're going to rework this, I'd suggest using the stabilized quartiles approach from https://cr.yp.to/papers/rsrst-20250727.pdf:

  • StQ1: the average of all samples between 1st and 3rd octile
  • StQ2: the average of all samples between 3rd and 5th octile
  • StQ3: the average of all samples between 5th and 7th octile
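To illustrate the suggestion above (a hedged sketch only; see the paper for the precise definition): sort the per-iteration samples, then average the ones lying between the stated octiles.

```c
#include <stdlib.h>

static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Average of the sorted samples with index in [n*lo/8, n*hi/8).
 * Assumes n >= 8 and that samples[] is already sorted ascending. */
static double octile_mean(const double *samples, size_t n, int lo, int hi) {
    size_t from = n * (size_t)lo / 8, to = n * (size_t)hi / 8, i;
    double sum = 0.0;
    for (i = from; i < to; i++) sum += samples[i];
    return sum / (double)(to - from);
}

/* Usage sketch:
 *   qsort(samples, n, sizeof(double), cmp_double);
 *   stq1 = octile_mean(samples, n, 1, 3);   StQ1: 1st..3rd octile
 *   stq2 = octile_mean(samples, n, 3, 5);   StQ2: 3rd..5th octile
 *   stq3 = octile_mean(samples, n, 5, 7);   StQ3: 5th..7th octile  */
```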

@Raimo33
Author

Raimo33 commented Sep 2, 2025

I think there's a reason to have this. Some benchmarks take much longer than others, so it probably makes sense to run fewer iters for these.

Right now all benchmarks are run with count=10 and a fixed number of iters (apart from ecmult_multi, which adjusts the number of iters, not count).

Therefore count is only useful to extrapolate min and max.

@Raimo33
Author

Raimo33 commented Sep 2, 2025

Well, okay, that has a history; see #689. It's debatable if it makes sense to avoid floating point math, but as long as it doesn't get in your way here, it's a cool thing to keep it. :D

I disagree with #689. It overcomplicates things for the sake of avoiding floating-point math. Those divisions aren't even in the hot path; they're outside the benchmarks.

@sipa
Contributor

sipa commented Sep 2, 2025

Concept NACK on removing any ability to observe variance in timing. The current min/avg/max are far from perfect, but they work fairly well in practice. Improving is welcome, but removing them is a step backwards.

@Raimo33
Author

Raimo33 commented Sep 2, 2025

Concept NACK on removing any ability to observe variance in timing. The current min/avg/max are far from perfect, but they work fairly well in practice. Improving is welcome, but removing them is a step backwards.

What is the usefulness of measuring min/max when we are taking OS interference and thermal throttling out of the equation? Min/max will be extremely close to the avg no matter how badly the benchmarked function behaves.

@Raimo33 Raimo33 force-pushed the benchmark-precise branch 3 times, most recently from 97e5264 to 254a014 Compare September 2, 2025 16:08
@Raimo33 Raimo33 changed the title [WIP] Refactor benchmark [WIP] refactor: remove system interference from benchmarks Sep 2, 2025
@Raimo33 Raimo33 force-pushed the benchmark-precise branch 2 times, most recently from 1d9d6d0 to 4c9a074 Compare September 2, 2025 16:26
@Raimo33
Author

Raimo33 commented Sep 2, 2025

By the way, gettimeofday() has been officially discouraged since 2008 in favor of clock_gettime(). The POSIX standard marks it as obsolescent but still provides the API for backward compatibility.
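For reference, the two calls side by side (a sketch, not the PR's code); clock_gettime() also reports nanoseconds rather than microseconds:

```c
#include <sys/time.h>   /* gettimeofday, struct timeval  */
#include <time.h>       /* clock_gettime, struct timespec */

void timer_example(void) {
    /* Obsolescent since POSIX.1-2008: microsecond-resolution wall clock. */
    struct timeval tv;
    gettimeofday(&tv, NULL);

    /* Preferred replacement: nanosecond resolution and a selectable clock,
     * e.g. CLOCK_MONOTONIC for wall-clock-style timing or
     * CLOCK_PROCESS_CPUTIME_ID for per-process CPU time. */
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    (void)tv; (void)ts;
}
```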

@Raimo33 Raimo33 force-pushed the benchmark-precise branch 2 times, most recently from ddeaede to 71dff3f Compare September 2, 2025 17:45
@Raimo33
Author

Raimo33 commented Sep 2, 2025

Even though the manual says that CLOCK_PROCESS_CPUTIME_ID is only useful if the process is locked to a core, modern CPUs have largely addressed this issue. So I think it is fair to compile with CLOCK_PROCESS_CPUTIME_ID even though we have no guarantee that the user has pinned the benchmarking process to a core. The worst-case scenario is an unreliable benchmark, which the current repo has anyway.

I added a line to README.md with best practices for running the benchmarks.

I also tried adding a function to pin the process to a core directly in C, but there's no standard POSIX-compliant way to do so. There is pthread_setaffinity_np() on Linux, where 'np' stands for 'not portable'.
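For illustration, a Linux-only sketch using sched_setaffinity(), the process-level sibling of the pthread_setaffinity_np() mentioned above (not portable, which is why leaving pinning to the user as a documented best practice is the pragmatic route):

```c
#define _GNU_SOURCE
#include <sched.h>

/* Linux-specific: pin the calling process to CPU 0 so the per-process
 * CPU clock always ticks on the same core. There is no portable POSIX
 * equivalent, so pinning externally (e.g. `taskset -c 0 <bench binary>`)
 * remains the user's responsibility. */
static int pin_to_cpu0(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    return sched_setaffinity(0, sizeof(set), &set); /* pid 0 = this process */
}
```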

@Raimo33 Raimo33 force-pushed the benchmark-precise branch 4 times, most recently from 3e43c75 to ef9e40e Compare September 2, 2025 20:31
@real-or-random
Contributor

Concept NACK on removing any ability to observe variance in timing. The current min/avg/max are far from perfect, but they work fairly well in practice. Improving is welcome, but removing them is a step backwards.

what is the usefulness of measuring min/max when we are removing OS interference & thermal throttling out of the equation? min/max will be extremely close to the avg no matter how bad the benchmarked function is.

The point is exactly to have a simple way of verifying that there's indeed no interference. Getting rid of sources of variance is hard to get right, and it's impossible to get a perfect solution. (This discussion shows this!) So we'd better have a way of spotting if something is off.

I like the stabilized quartiles idea.

@Raimo33
Author

Raimo33 commented Sep 2, 2025

I like the stabilized quartiles idea.

TBH it scares me a bit; I'll see what I can do. Maybe in a future PR.

@Raimo33 Raimo33 marked this pull request as ready for review September 2, 2025 23:56
@real-or-random
Contributor

The point is exactly having a simple way of verifying that there's indeed no interference. Getting rid of sources of variance is hard to get right, and it's impossible to get a perfect solution. (This discussion shows this!) So we better have a way of spotting if something is off.

Fine, but at least let's make it optional. I don't like my benchmarks being 10 times slower just because min/max need to be computed. If I say 20'000 iterations, I want it to be 20'000 iterations.

Not sure. I suggest making this PR focused on a single (uncontroversial) change, which is switching to per-process clocks.

@Raimo33 Raimo33 changed the title [WIP] refactor: remove system interference from benchmarks bench: replace wall-clock timer with per-process CPU timer Sep 8, 2025
@Raimo33 Raimo33 force-pushed the benchmark-precise branch 2 times, most recently from 5540c89 to f9d2dde Compare September 8, 2025 12:02
@Raimo33 Raimo33 force-pushed the benchmark-precise branch 2 times, most recently from a5eb2db to bbc8bb7 Compare September 8, 2025 16:04
@Raimo33
Author

Raimo33 commented Sep 9, 2025

I've applied the recommended changes and split the PR into multiple commits for simpler diffs.

@l0rinc

l0rinc commented Sep 9, 2025

I haven't investigated this in detail but couldn't find any related issue: were there any efforts to use https://nanobench.ankerl.com like we do in Core instead? That would likely need some adjustments to the benchmarks themselves, but at least we wouldn't be rolling our own framework.

@Raimo33
Author

Raimo33 commented Sep 9, 2025

I'm not familiar with nanobench, but I can see it uses std::chrono, which uses CLOCK_MONOTONIC (our fallback) under the hood, not the TSC.

Also, the user would need a C++ compiler. Not a big deal, but I think this repo needs to strictly adhere to C89.

@Raimo33
Author

Raimo33 commented Sep 9, 2025

I realized the Windows version I implemented was using an imprecise clock. Drafting; will fix.

@Raimo33 Raimo33 marked this pull request as draft September 9, 2025 18:02
This commit improves the reliability of benchmarks by removing some of the influence of other processes running in the background. This is achieved by using CPU-bound clocks that aren't affected by interrupts, sleeps, blocked I/O, etc.
@Raimo33
Author

Raimo33 commented Sep 9, 2025

I changed the Windows clock to QueryPerformanceCounter(), which has microsecond precision.
I also removed the if checks around clock_gettime(), as it can only fail due to programming errors.
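For reference, a minimal sketch of what the Windows path could look like (illustrative; the actual code in the PR may differ):

```c
#ifdef _WIN32
#include <stdint.h>
#include <windows.h>

/* QueryPerformanceCounter() returns a high-resolution tick count;
 * dividing by QueryPerformanceFrequency() converts it to time.
 * Whole seconds and the remainder are converted separately to avoid
 * overflowing 64 bits for large tick counts. Note that, unlike the
 * POSIX path, this is still a wall clock, not per-process CPU time. */
static int64_t bench_time_ns(void) {
    LARGE_INTEGER freq, ticks;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&ticks);
    return (int64_t)(ticks.QuadPart / freq.QuadPart) * 1000000000LL
         + (int64_t)(ticks.QuadPart % freq.QuadPart) * 1000000000LL / freq.QuadPart;
}
#endif
```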

@real-or-random
Contributor

I haven't investigated this in detail but couldn't find any related issue: were there any efforts to use nanobench.ankerl.com like we do in Core instead?

nanobench is a framework for benchmarking C++, but this is C. Of course, we could still call our C from C++, but adding C++ to the code base seems a bit overkill to me if it's "just" for the purpose of benchmarks. I agree that using an existing framework may be a good idea, but I'm not aware of any in pure C.

@Raimo33 Raimo33 marked this pull request as ready for review September 10, 2025 09:51
@l0rinc

l0rinc commented Sep 10, 2025

but adding C++ to the code base seems a bit overkill to me if if's "just" for the purpose of benchmarks

The purpose would be to standardize the benchmarking and avoid reinventing solutions to problems that aren't strictly related to the purpose of this library. It would also help with trusting the results more, since we already know how that benchmarking library behaves.
The main client of this library is C++ code; it doesn't seem far-fetched to me to test it that way.

@real-or-random
Contributor

but adding C++ to the code base seems a bit overkill to me if if's "just" for the purpose of benchmarks

The purpose would be to standardize the benchmarking to avoid reinventing solutions to problems that aren't strictly related to the purpose of this library. It would also help with trusting the result more, since we already know how that benchmarking library behaves. The main client of this library is C++ code, it doesn't seem far-fetched to me to test that.

Sorry, I didn't want to be discouraging. I don't think at all that nanobench is a far-fetched idea. And I truly agree that reinventing the wheel is not great.

In the end, what counts is the maintenance burden. My main concern with C++ is that I'm not sure how fluent our regular contributors are in C++. (I'm not, but luckily I'm not the only one here.) If the bit of C++ is harder to touch (for us) than this code, then nanobench is not a good idea. If, on the other hand, we have people who can maintain C++, or if we get support from the Core contributors who are familiar with nanobench, then this can be a win.

I'd be happy to see a demo/WIP PR but if you feel that this is a lot of work, then it might be a good idea to see what other contributors and maintainers think.
