
Conversation


@Raimo33 Raimo33 commented Jul 18, 2025

This adds AVX and AVX2 intrinsics support to the library, as discussed in #1700, wherever it yields an improvement according to the benchmarks.

Why not SSE and AVX-512?

  • SSE would only be useful in the 32-bit code path (USE_FORCE_WIDEMUL_INT64=1), but in practice almost all 32-bit CPUs in use are ARMv7, not x86, so an x86-only extension like SSE would rarely apply.
  • AVX-512 would not be beneficial anywhere, and frequency throttling is a problem on most CPUs that support it.

ARM has a different SIMD instruction set (NEON); it would be nice to have a separate PR implementing that as well, perhaps after this one is merged.

Tasks

  • Add CI/CD flows with permutations of -mavx, -mavx2, -mno-avx, -mno-avx2 when building for amd64
  • Precompute vectors at startup (the ones marked with TODO: precompute)

Commits

I've split this PR into multiple commits with the following criteria:

  • non-simd optimizations & style changes
  • simd optimizations
  • temporary scripts for development
  • CI flows

Test & Benchmark

To reproduce the results below, I temporarily added three scripts for building, testing, and benchmarking, as well as a Jupyter notebook to visualize the results. You can verify them yourself by running ./simd-build.sh && ./simd-test.sh && ./simd-bench.sh and then executing the notebook as is.

The baseline is compiled with "-O3 -mavx -mavx2 -U__AVX__ -U__AVX2__", so GCC's automatic vectorization is still allowed, but my manual intrinsics code (guarded by the feature macros) is not compiled in.

Results

@Raimo33 Raimo33 changed the title Add simd Add intel simd Jul 18, 2025

Raimo33 commented Jul 18, 2025

To precompute SIMD constants at startup, the best solution I found was something like this:

#ifdef __SSE2__
  static __m128i _128_vec_ones;
#endif

CONSTRUCTOR void simd_init(void)
{
#ifdef __SSE2__
  _128_vec_ones = _mm_set1_epi8('1');
#endif
}

where CONSTRUCTOR is defined as __attribute__((constructor)), so simd_init runs before main().


Raimo33 commented Jul 18, 2025

I keep getting these warnings. They appear to be harmless, since I always use the unaligned loadu/storeu intrinsics, but the compiler still complains about the pointer cast:

warning: cast increases required alignment of target type [-Wcast-align]
  653 |         _mm256_storeu_si256((__m256i *)r->v, out);

The only fixes I found are:

  1. aligning everything to 64 bytes (not feasible; it even breaks some of my AVX logic)
  2. suppressing the warning globally
  3. suppressing the warning inline at each call site

@Raimo33 Raimo33 changed the title Add intel simd [WIP] Add intel simd Aug 23, 2025
@Raimo33 Raimo33 force-pushed the simd branch 11 times, most recently from f1524a9 to ba4646b Compare September 1, 2025 22:34
@real-or-random
Contributor

I built this and ran the benchmarks on my machine (12th Gen Intel(R) Core(TM) i7-1260P).

Signing became faster, but verification became slower. After looking at the bench_internal results, I found that the culprit is secp256k1_fe_impl_negate_unchecked, which slows down the field_sqrt benchmark and hence verification. When I disable your ifdefs in secp256k1_fe_impl_negate_unchecked, I get consistently faster results across benchmarks (except schnorrsig_verify; not sure what's going on there).

Benchmark results for INT128
Generated on 2025-09-02T09:56:29 CEST
Iterations: 20000

Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    

ecdsa_verify                  ,    54.8       ,    55.2       ,    56.7    
ecdsa_sign                    ,    34.8       ,    34.9       ,    35.0    
ec_keygen                     ,    24.2       ,    24.2       ,    24.2    
ecdh                          ,    51.0       ,    51.4       ,    53.6    
schnorrsig_sign               ,    25.6       ,    25.6       ,    25.7    
schnorrsig_verify             ,    55.5       ,    56.0       ,    57.5    
ellswift_encode               ,    31.1       ,    31.1       ,    31.2    
ellswift_decode               ,    13.4       ,    13.4       ,    13.4    
ellswift_keygen               ,    55.0       ,    55.0       ,    55.0    
ellswift_ecdh                 ,    56.5       ,    56.8       ,    58.7    


Benchmark results for INT128_SSE2_AVX2
Generated on 2025-09-02T10:00:33 CEST
Iterations: 20000

Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    

ecdsa_verify                  ,    54.6       ,    54.9       ,    56.4    
ecdsa_sign                    ,    33.5       ,    33.6       ,    33.6    
ec_keygen                     ,    23.0       ,    23.0       ,    23.1    
ecdh                          ,    50.6       ,    50.6       ,    50.7    
schnorrsig_sign               ,    24.5       ,    24.5       ,    24.5    
schnorrsig_verify             ,    55.5       ,    56.1       ,    59.5    
ellswift_encode               ,    31.2       ,    31.2       ,    31.2    
ellswift_decode               ,    13.4       ,    13.4       ,    13.4    
ellswift_keygen               ,    54.2       ,    54.2       ,    54.2    
ellswift_ecdh                 ,    56.2       ,    56.3       ,    56.4    

This currently saves ~0.5% in ecdsa_verify, ~4% in ecdsa_sign, and ~5% in ec_keygen. The latter is nice, but I was hoping for more in verification. Perhaps with negation "fixed", things can be improved further. Even then, my feeling is that it's hard to say whether this is worth the hassle. We'd also need some CPUID detection code etc. to make this useful in practice.

Some more comments:

  • I don't think we should bother with 32-bit x86. The 32-bit code is certainly interesting to some users (hardware wallets, raspberry pi, ... ) but they are almost certainly not on 32-bit x86. (Unless I'm mistaken.)
  • There's a lot to gain for SHA256, but then we should rely on the SHA instruction set directly. This is probably the lowest hanging fruit for performance in the library, but that's a separate project. Bitcoin Core has a nice SHA256 implementation, with many backends (SHA, SHANI, SSE4, ...), see https://github.com/bitcoin/bitcoin/blob/master/src/crypto/sha256.cpp . The clean way to go is either to give the caller a way to bring their own implementation at runtime, or to extract the C++ code in Bitcoin Core into a separate library (perhaps after converting it to C) that could be linked to libsecp256k1.


Raimo33 commented Sep 2, 2025

  • You're right, Raspberry Pis are 32-bit ARMv7. I agree we shouldn't bother; removed SSE2.
  • I was hoping for more on verify as well; that's the main goal.
  • I think we should copy Bitcoin Core's SHA256 optimizations, but in another PR.

Beware that the benchmarks are completely unreliable right now; see #1701.
I'm waiting on #1732 before proceeding.

@Raimo33 Raimo33 force-pushed the simd branch 5 times, most recently from b050834 to ac1cb71 Compare September 3, 2025 20:20
@Raimo33 Raimo33 closed this Sep 3, 2025
@Raimo33 Raimo33 deleted the simd branch September 3, 2025 20:21
@Raimo33 Raimo33 restored the simd branch September 3, 2025 20:21
@Raimo33 Raimo33 reopened this Sep 3, 2025
@Raimo33 Raimo33 force-pushed the simd branch 3 times, most recently from d04b9f4 to 1127dce Compare September 4, 2025 10:12
@Raimo33 Raimo33 force-pushed the simd branch 10 times, most recently from 890b234 to d6b3f65 Compare September 5, 2025 13:35