
Conversation


@Raimo33 Raimo33 commented Jul 18, 2025

This adds AVX and AVX2 intrinsics support to the library, as discussed in #1700, wherever it yields an improvement according to the benchmarks.

Why not SSE and AVX-512?

  • SSE would only be useful in the 32-bit code path (USE_FORCE_WIDEMUL_INT64=1), but in practice almost all 32-bit CPUs in use are ARMv7, not x86, so an x86-only extension like SSE would rarely apply.
  • AVX-512 would not be beneficial anywhere, and frequency throttling is a problem on most CPUs that support it.

ARM has a different SIMD instruction set (NEON); it would be nice to have a separate PR implementing that as well, perhaps after this one is merged.

Tasks

  • Add CI/CD flows with permutations of -mavx, -mavx2, -mno-avx, -mno-avx2 when building for amd64
  • Precompute vectors at startup (the ones marked with TODO: precompute)

Commits

I've split this PR into multiple commits with the following criteria:

  • non-simd optimizations & style changes
  • simd optimizations
  • temporary scripts for development
  • CI flows

Test & Benchmark

To reproduce the results below, I temporarily added three scripts for building, testing, and benchmarking, as well as a Jupyter notebook to visualize the results. You can verify them yourself by running ./simd-build.sh && ./simd-test.sh && ./simd-bench.sh and then executing the notebook as is.

The baseline is compiled with "-O3 -mavx -mavx2 -U__AVX__ -U__AVX2__", so GCC's automatic vectorization is still allowed, but my manual intrinsics code (guarded by the feature macros) is not compiled in.

Results

@Raimo33 Raimo33 changed the title Add simd Add intel simd Jul 18, 2025

Raimo33 commented Jul 18, 2025

To precompute SIMD constants at startup, the best solution I found was something like this:

#ifdef __SSE2__
  static __m128i _128_vec_ones;
#endif

CONSTRUCTOR void simd_init(void)
{
#ifdef __SSE2__
  _128_vec_ones = _mm_set1_epi8('1');
#endif
}

where CONSTRUCTOR is defined as __attribute__((constructor)), so simd_init runs before main().


Raimo33 commented Jul 18, 2025

I keep getting these warnings. They appear to be harmless, since I always use the unaligned loadu/storeu intrinsics, but the compiler still complains about the pointer cast:

warning: cast increases required alignment of target type [-Wcast-align]
  653 |         _mm256_storeu_si256((__m256i *)r->v, out);

The only fixes I found are:

  1. aligning everything to 64 bytes (not feasible; it even breaks some of my AVX logic)
  2. suppressing the warning globally
  3. suppressing the warning inline at each call site

@Raimo33 Raimo33 changed the title Add intel simd [WIP] Add intel simd Aug 23, 2025
@Raimo33 Raimo33 force-pushed the simd branch 11 times, most recently from f1524a9 to ba4646b Compare September 1, 2025 22:34
@real-or-random
Contributor

I built this and ran the benchmarks on my machine (12th Gen Intel(R) Core(TM) i7-1260P).

Signing became faster, but verification became slower. After looking at the bench_internal results, I found that the culprit is secp256k1_fe_impl_negate_unchecked, which slows down the field_sqrt benchmark and hence verification. When I disable your ifdefs in secp256k1_fe_impl_negate_unchecked, I get consistently faster results across benchmarks (except schnorrsig_verify; not sure what's going on there).

Benchmark results for INT128
Generated on 2025-09-02T09:56:29 CEST
Iterations: 20000

Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    

ecdsa_verify                  ,    54.8       ,    55.2       ,    56.7    
ecdsa_sign                    ,    34.8       ,    34.9       ,    35.0    
ec_keygen                     ,    24.2       ,    24.2       ,    24.2    
ecdh                          ,    51.0       ,    51.4       ,    53.6    
schnorrsig_sign               ,    25.6       ,    25.6       ,    25.7    
schnorrsig_verify             ,    55.5       ,    56.0       ,    57.5    
ellswift_encode               ,    31.1       ,    31.1       ,    31.2    
ellswift_decode               ,    13.4       ,    13.4       ,    13.4    
ellswift_keygen               ,    55.0       ,    55.0       ,    55.0    
ellswift_ecdh                 ,    56.5       ,    56.8       ,    58.7    


Benchmark results for INT128_SSE2_AVX2
Generated on 2025-09-02T10:00:33 CEST
Iterations: 20000

Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    

ecdsa_verify                  ,    54.6       ,    54.9       ,    56.4    
ecdsa_sign                    ,    33.5       ,    33.6       ,    33.6    
ec_keygen                     ,    23.0       ,    23.0       ,    23.1    
ecdh                          ,    50.6       ,    50.6       ,    50.7    
schnorrsig_sign               ,    24.5       ,    24.5       ,    24.5    
schnorrsig_verify             ,    55.5       ,    56.1       ,    59.5    
ellswift_encode               ,    31.2       ,    31.2       ,    31.2    
ellswift_decode               ,    13.4       ,    13.4       ,    13.4    
ellswift_keygen               ,    54.2       ,    54.2       ,    54.2    
ellswift_ecdh                 ,    56.2       ,    56.3       ,    56.4    

This currently saves ~0.5% in ecdsa_verify, ~4% in ecdsa_sign, and ~5% in ec_keygen. The latter is nice, but I was hoping for more in verification. Perhaps with negation "fixed", things can be improved further. Even then, my feeling is that it's hard to say whether this is worth the hassle. We'd also need some CPUID detection code etc. to make this useful in practice.

Some more comments:

  • I don't think we should bother with 32-bit x86. The 32-bit code is certainly interesting to some users (hardware wallets, raspberry pi, ... ) but they are almost certainly not on 32-bit x86. (Unless I'm mistaken.)
  • There's a lot to gain for SHA256, but then we should rely on the SHA instruction set directly. This is probably the lowest hanging fruit for performance in the library, but that's a separate project. Bitcoin Core has a nice SHA256 implementation, with many backends (SHA, SHANI, SSE4, ...), see https://github.com/bitcoin/bitcoin/blob/master/src/crypto/sha256.cpp . The clean way to go is either to give the caller a way to bring their own implementation at runtime, or to extract the C++ code in Bitcoin Core into a separate library (perhaps after converting it to C) that could be linked to libsecp256k1.


Raimo33 commented Sep 2, 2025

  • You're right, Raspberry Pis are 32-bit ARMv7. I agree we shouldn't bother; removed SSE2.
  • I was hoping for more on verify as well; that's the main goal.
  • I think we should copy Bitcoin Core's SHA256 optimizations, but in another PR.

Beware that the benchmarks are completely unreliable right now; see #1701.
I'm waiting on #1732 before proceeding.

@Raimo33 Raimo33 force-pushed the simd branch 5 times, most recently from b050834 to ac1cb71 Compare September 3, 2025 20:20
@Raimo33 Raimo33 closed this Sep 3, 2025
@Raimo33 Raimo33 deleted the simd branch September 3, 2025 20:21
@Raimo33 Raimo33 restored the simd branch September 3, 2025 20:21
@Raimo33 Raimo33 reopened this Sep 3, 2025
@Raimo33 Raimo33 force-pushed the simd branch 3 times, most recently from d04b9f4 to 1127dce Compare September 4, 2025 10:12
@Raimo33 Raimo33 force-pushed the simd branch 10 times, most recently from 890b234 to d6b3f65 Compare September 5, 2025 13:35