Avoid calling keccak_absorb with partial lanes #450

bremoran · 2025-09-10T13:36:00Z

On 32-bit architectures, each call to mld_keccakf1600_xor_bytes incurs an overhead. For example, on Armv7-M and Armv8-M with the optimised bit-interleaved implementation from XKCP, XORing a lane into the state costs an extra 37 instructions. Whenever an incomplete lane is XORed into the state, this penalty is paid twice: once for the partial lane and again when the remaining bytes of that lane arrive. This PR ensures that only full lanes are XORed into the state.

Fixes #445
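To make the "paid twice" point concrete, here is a minimal sketch of the idea, not the code in this PR. `xor_bytes` is a hypothetical stand-in for mld_keccakf1600_xor_bytes (its signature is assumed), `residual_t` and `absorb_with_residual_sketch` are illustrative names, and the permutation and rate handling are left out, so the total input must fit in the 200-byte state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LANE_BYTES 8u

/* Number of lanes touched by xor_bytes so far. Each touch is where a
 * bit-interleaved 32-bit implementation would pay its per-lane overhead. */
static unsigned lane_touches = 0;

/* Hypothetical stand-in for mld_keccakf1600_xor_bytes (signature assumed):
 * XORs len bytes of data into the state starting at byte offset. */
static void xor_bytes(uint8_t state[200], const uint8_t *data, size_t offset,
                      size_t len)
{
  size_t i;
  assert(offset + len <= 200);
  if (len == 0)
  {
    return;
  }
  lane_touches +=
      (unsigned)((offset + len - 1) / LANE_BYTES - offset / LANE_BYTES + 1);
  for (i = 0; i < len; i++)
  {
    state[offset + i] ^= data[i];
  }
}

/* Bytes of a lane that is not yet complete (illustrative type). */
typedef struct
{
  uint8_t buf[LANE_BYTES];
  size_t len; /* always < LANE_BYTES between calls */
} residual_t;

/* Absorbs input so that only whole lanes reach xor_bytes. *pos is the
 * lane-aligned byte offset of the next lane to be written. */
static void absorb_with_residual_sketch(uint8_t state[200], size_t *pos,
                                        residual_t *r, const uint8_t *in,
                                        size_t len)
{
  size_t take, full;
  if (r->len > 0)
  {
    /* Top up the pending lane before touching the state. */
    take = LANE_BYTES - r->len;
    if (take > len)
    {
      take = len;
    }
    memcpy(r->buf + r->len, in, take);
    r->len += take;
    in += take;
    len -= take;
    if (r->len < LANE_BYTES)
    {
      return; /* still incomplete, nothing is XORed yet */
    }
    xor_bytes(state, r->buf, *pos, LANE_BYTES);
    *pos += LANE_BYTES;
    r->len = 0;
  }
  /* XOR all whole lanes straight from the input. */
  full = len - (len % LANE_BYTES);
  xor_bytes(state, in, *pos, full);
  *pos += full;
  /* Keep the tail for the next call. */
  memcpy(r->buf, in + full, len - full);
  r->len = len - full;
}
```

Absorbing 5 bytes and then 11 bytes directly with `xor_bytes` touches lane 0 twice (three lane touches in total), whereas the buffered variant touches each of the two lanes once. A final flush of any leftover residual, together with the padding, would still XOR one partial lane, but only once.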

rod-chapman · 2025-09-11T08:22:53Z

Please provide a description for this PR. What is the point of this refactoring? What benefit does it bring? Please provide CBMC proof harness and Makefile for any new functions.

mkannwischer · 2025-10-03T09:51:56Z

@bremoran, sorry for the long wait for the review on this. Could you please rebase this on top of the changes in main, so we can benchmark and review it?

mkannwischer · 2025-10-04T00:35:06Z

@bremoran, that was not quite what I meant by rebasing.
I applied the changes required to make this work myself in a8d2d6a.

github-actions

Mac Mini (M1, 2020) benchmarks (opt)

Benchmark suite	Current: `f82f729`	Previous: `efe03f9`	Ratio
`ML-DSA-44 keypair`	`47837` cycles	`47836` cycles	`1.00`
`ML-DSA-44 sign`	`156325` cycles	`156334` cycles	`1.00`
`ML-DSA-44 verify`	`52453` cycles	`52450` cycles	`1.00`
`ML-DSA-65 keypair`	`83684` cycles	`83701` cycles	`1.00`
`ML-DSA-65 sign`	`255488` cycles	`255371` cycles	`1.00`
`ML-DSA-65 verify`	`85590` cycles	`85601` cycles	`1.00`
`ML-DSA-87 keypair`	`136128` cycles	`136113` cycles	`1.00`
`ML-DSA-87 sign`	`320962` cycles	`321312` cycles	`1.00`
`ML-DSA-87 verify`	`137899` cycles	`138009` cycles	`1.00`

This comment was automatically generated by workflow using github-action-benchmark.

github-actions

Mac Mini (M1, 2020) benchmarks (no-opt)

Benchmark suite	Current: `f82f729`	Previous: `efe03f9`	Ratio
`ML-DSA-44 keypair`	`115074` cycles	`115039` cycles	`1.00`
`ML-DSA-44 sign`	`430931` cycles	`430787` cycles	`1.00`
`ML-DSA-44 verify`	`122238` cycles	`122176` cycles	`1.00`
`ML-DSA-65 keypair`	`197047` cycles	`196905` cycles	`1.00`
`ML-DSA-65 sign`	`701023` cycles	`701285` cycles	`1.00`
`ML-DSA-65 verify`	`197670` cycles	`197656` cycles	`1.00`
`ML-DSA-87 keypair`	`334759` cycles	`335149` cycles	`1.00`
`ML-DSA-87 sign`	`884276` cycles	`884767` cycles	`1.00`
`ML-DSA-87 verify`	`328610` cycles	`329046` cycles	`1.00`

github-actions

Arm Cortex-A55 (Snapdragon 888) benchmarks (opt)

Benchmark suite	Current: `f82f729`	Previous: `efe03f9`	Ratio
`ML-DSA-44 keypair`	`281441` cycles	`288008` cycles	`0.98`
`ML-DSA-44 sign`	`971200` cycles	`972295` cycles	`1.00`
`ML-DSA-44 verify`	`301117` cycles	`306786` cycles	`0.98`
`ML-DSA-65 keypair`	`482405` cycles	`492097` cycles	`0.98`
`ML-DSA-65 sign`	`1584980` cycles	`1609911` cycles	`0.98`
`ML-DSA-65 verify`	`487166` cycles	`493789` cycles	`0.99`
`ML-DSA-87 keypair`	`817778` cycles	`830114` cycles	`0.99`
`ML-DSA-87 sign`	`2103778` cycles	`2168352` cycles	`0.97`
`ML-DSA-87 verify`	`823572` cycles	`838050` cycles	`0.98`

oqs-bot

Intel Xeon 4th gen (c7i)

Benchmark suite	Current: `f82f729`	Previous: `efe03f9`	Ratio
`ML-DSA-44 keypair`	`35501` cycles	`35660` cycles	`1.00`
`ML-DSA-44 sign`	`132302` cycles	`132372` cycles	`1.00`
`ML-DSA-44 verify`	`41006` cycles	`40941` cycles	`1.00`
`ML-DSA-65 keypair`	`63922` cycles	`63906` cycles	`1.00`
`ML-DSA-65 sign`	`220917` cycles	`220391` cycles	`1.00`
`ML-DSA-65 verify`	`66232` cycles	`66307` cycles	`1.00`
`ML-DSA-87 keypair`	`95630` cycles	`96815` cycles	`0.99`
`ML-DSA-87 sign`	`259768` cycles	`265102` cycles	`0.98`
`ML-DSA-87 verify`	`99879` cycles	`100242` cycles	`1.00`

oqs-bot

Intel Xeon 4th gen (c7i) (no-opt)

Benchmark suite	Current: `f82f729`	Previous: `efe03f9`	Ratio
`ML-DSA-44 keypair`	`95541` cycles	`95838` cycles	`1.00`
`ML-DSA-44 sign`	`343758` cycles	`345726` cycles	`0.99`
`ML-DSA-44 verify`	`101480` cycles	`101478` cycles	`1.00`
`ML-DSA-65 keypair`	`164662` cycles	`164854` cycles	`1.00`
`ML-DSA-65 sign`	`571713` cycles	`568786` cycles	`1.01`
`ML-DSA-65 verify`	`166031` cycles	`165621` cycles	`1.00`
`ML-DSA-87 keypair`	`271224` cycles	`270260` cycles	`1.00`
`ML-DSA-87 sign`	`725476` cycles	`724985` cycles	`1.00`
`ML-DSA-87 verify`	`273047` cycles	`273226` cycles	`1.00`

oqs-bot

AMD EPYC 4th gen (c7a)

Benchmark suite	Current: `f82f729`	Previous: `efe03f9`	Ratio
`ML-DSA-44 keypair`	`41585` cycles	`45299` cycles	`0.92`
`ML-DSA-44 sign`	`143200` cycles	`154336` cycles	`0.93`
`ML-DSA-44 verify`	`46943` cycles	`49529` cycles	`0.95`
`ML-DSA-65 keypair`	`73940` cycles	`74392` cycles	`0.99`
`ML-DSA-65 sign`	`236322` cycles	`237019` cycles	`1.00`
`ML-DSA-65 verify`	`77313` cycles	`78423` cycles	`0.99`
`ML-DSA-87 keypair`	`111858` cycles	`112104` cycles	`1.00`
`ML-DSA-87 sign`	`279992` cycles	`279301` cycles	`1.00`
`ML-DSA-87 verify`	`117273` cycles	`116800` cycles	`1.00`

oqs-bot

Intel Xeon 3rd gen (c6i)

Benchmark suite	Current: `f82f729`	Previous: `efe03f9`	Ratio
`ML-DSA-44 keypair`	`57678` cycles	`57941` cycles	`1.00`
`ML-DSA-44 sign`	`201328` cycles	`201248` cycles	`1.00`
`ML-DSA-44 verify`	`66243` cycles	`65669` cycles	`1.01`
`ML-DSA-65 keypair`	`102316` cycles	`101945` cycles	`1.00`
`ML-DSA-65 sign`	`332994` cycles	`333057` cycles	`1.00`
`ML-DSA-65 verify`	`107021` cycles	`107115` cycles	`1.00`
`ML-DSA-87 keypair`	`157063` cycles	`157562` cycles	`1.00`
`ML-DSA-87 sign`	`399257` cycles	`399500` cycles	`1.00`
`ML-DSA-87 verify`	`162886` cycles	`162176` cycles	`1.00`

oqs-bot

AMD EPYC 3rd gen (c6a)

Benchmark suite	Current: `f82f729`	Previous: `efe03f9`	Ratio
`ML-DSA-44 keypair`	`71344` cycles	`71956` cycles	`0.99`
`ML-DSA-44 sign`	`212929` cycles	`214221` cycles	`0.99`
`ML-DSA-44 verify`	`74779` cycles	`75325` cycles	`0.99`
`ML-DSA-65 keypair`	`123608` cycles	`123638` cycles	`1.00`
`ML-DSA-65 sign`	`345402` cycles	`346781` cycles	`1.00`
`ML-DSA-65 verify`	`124084` cycles	`123918` cycles	`1.00`
`ML-DSA-87 keypair`	`206533` cycles	`208833` cycles	`0.99`
`ML-DSA-87 sign`	`447608` cycles	`447509` cycles	`1.00`
`ML-DSA-87 verify`	`205360` cycles	`204748` cycles	`1.00`

oqs-bot

Graviton4

Benchmark suite	Current: `f82f729`	Previous: `efe03f9`	Ratio
`ML-DSA-44 keypair`	`69348` cycles	`69498` cycles	`1.00`
`ML-DSA-44 sign`	`222588` cycles	`222917` cycles	`1.00`
`ML-DSA-44 verify`	`74645` cycles	`74589` cycles	`1.00`
`ML-DSA-65 keypair`	`123409` cycles	`123347` cycles	`1.00`
`ML-DSA-65 sign`	`365960` cycles	`366381` cycles	`1.00`
`ML-DSA-65 verify`	`123609` cycles	`123483` cycles	`1.00`
`ML-DSA-87 keypair`	`201689` cycles	`200598` cycles	`1.01`
`ML-DSA-87 sign`	`467807` cycles	`466978` cycles	`1.00`
`ML-DSA-87 verify`	`201993` cycles	`201918` cycles	`1.00`

oqs-bot

AMD EPYC 4th gen (c7a) (no-opt)

Benchmark suite	Current: `f82f729`	Previous: `efe03f9`	Ratio
`ML-DSA-44 keypair`	`120674` cycles	`120817` cycles	`1.00`
`ML-DSA-44 sign`	`452232` cycles	`453984` cycles	`1.00`
`ML-DSA-44 verify`	`131541` cycles	`131897` cycles	`1.00`
`ML-DSA-65 keypair`	`204081` cycles	`205210` cycles	`0.99`
`ML-DSA-65 sign`	`739495` cycles	`738619` cycles	`1.00`
`ML-DSA-65 verify`	`209598` cycles	`210495` cycles	`1.00`
`ML-DSA-87 keypair`	`339929` cycles	`343513` cycles	`0.99`
`ML-DSA-87 sign`	`942376` cycles	`952408` cycles	`0.99`
`ML-DSA-87 verify`	`350063` cycles	`353724` cycles	`0.99`

oqs-bot

AMD EPYC 3rd gen (c6a) (no-opt)

Benchmark suite	Current: `f82f729`	Previous: `efe03f9`	Ratio
`ML-DSA-44 keypair`	`135463` cycles	`136469` cycles	`0.99`
`ML-DSA-44 sign`	`542488` cycles	`545358` cycles	`0.99`
`ML-DSA-44 verify`	`148719` cycles	`149472` cycles	`0.99`
`ML-DSA-65 keypair`	`227337` cycles	`229684` cycles	`0.99`
`ML-DSA-65 sign`	`880524` cycles	`888847` cycles	`0.99`
`ML-DSA-65 verify`	`236252` cycles	`237595` cycles	`0.99`
`ML-DSA-87 keypair`	`375243` cycles	`375230` cycles	`1.00`
`ML-DSA-87 sign`	`1102759` cycles	`1101253` cycles	`1.00`
`ML-DSA-87 verify`	`387967` cycles	`389206` cycles	`1.00`

oqs-bot

Intel Xeon 3rd gen (c6i) (no-opt)

Benchmark suite	Current: `f82f729`	Previous: `efe03f9`	Ratio
`ML-DSA-44 keypair`	`157565` cycles	`158047` cycles	`1.00`
`ML-DSA-44 sign`	`563855` cycles	`566208` cycles	`1.00`
`ML-DSA-44 verify`	`169337` cycles	`169650` cycles	`1.00`
`ML-DSA-65 keypair`	`270050` cycles	`269850` cycles	`1.00`
`ML-DSA-65 sign`	`928714` cycles	`928430` cycles	`1.00`
`ML-DSA-65 verify`	`275259` cycles	`275016` cycles	`1.00`
`ML-DSA-87 keypair`	`450252` cycles	`450841` cycles	`1.00`
`ML-DSA-87 sign`	`1180577` cycles	`1179105` cycles	`1.00`
`ML-DSA-87 verify`	`460070` cycles	`459184` cycles	`1.00`

oqs-bot

Graviton3

Benchmark suite	Current: `f82f729`	Previous: `efe03f9`	Ratio
`ML-DSA-44 keypair`	`73954` cycles	`73980` cycles	`1.00`
`ML-DSA-44 sign`	`236004` cycles	`236034` cycles	`1.00`
`ML-DSA-44 verify`	`80304` cycles	`79930` cycles	`1.00`
`ML-DSA-65 keypair`	`129494` cycles	`129578` cycles	`1.00`
`ML-DSA-65 sign`	`388474` cycles	`388294` cycles	`1.00`
`ML-DSA-65 verify`	`131006` cycles	`130908` cycles	`1.00`
`ML-DSA-87 keypair`	`210035` cycles	`210041` cycles	`1.00`
`ML-DSA-87 sign`	`491914` cycles	`492267` cycles	`1.00`
`ML-DSA-87 verify`	`212663` cycles	`212589` cycles	`1.00`

github-actions

Arm Cortex-A55 (Snapdragon 888) benchmarks (no-opt)

Benchmark suite	Current: `f82f729`	Previous: `efe03f9`	Ratio
`ML-DSA-44 keypair`	`462159` cycles	`466373` cycles	`0.99`
`ML-DSA-44 sign`	`2216904` cycles	`2214442` cycles	`1.00`
`ML-DSA-44 verify`	`547750` cycles	`550635` cycles	`0.99`
`ML-DSA-65 keypair`	`778716` cycles	`777523` cycles	`1.00`
`ML-DSA-65 sign`	`3628400` cycles	`3643249` cycles	`1.00`
`ML-DSA-65 verify`	`853665` cycles	`849541` cycles	`1.00`
`ML-DSA-87 keypair`	`1250941` cycles	`1269297` cycles	`0.99`
`ML-DSA-87 sign`	`4442690` cycles	`4513601` cycles	`0.98`
`ML-DSA-87 verify`	`1364598` cycles	`1373707` cycles	`0.99`

oqs-bot

Graviton2

Benchmark suite	Current: `f82f729`	Previous: `efe03f9`	Ratio
`ML-DSA-44 keypair`	`115550` cycles	`115640` cycles	`1.00`
`ML-DSA-44 sign`	`392162` cycles	`392538` cycles	`1.00`
`ML-DSA-44 verify`	`123972` cycles	`123749` cycles	`1.00`
`ML-DSA-65 keypair`	`200210` cycles	`200190` cycles	`1.00`
`ML-DSA-65 sign`	`648965` cycles	`648572` cycles	`1.00`
`ML-DSA-65 verify`	`203087` cycles	`202921` cycles	`1.00`
`ML-DSA-87 keypair`	`328316` cycles	`327699` cycles	`1.00`
`ML-DSA-87 sign`	`822365` cycles	`820887` cycles	`1.00`
`ML-DSA-87 verify`	`332366` cycles	`331384` cycles	`1.00`

oqs-bot

Graviton4 (no-opt)

Benchmark suite	Current: `f82f729`	Previous: `efe03f9`	Ratio
`ML-DSA-44 keypair`	`132701` cycles	`132744` cycles	`1.00`
`ML-DSA-44 sign`	`498674` cycles	`498324` cycles	`1.00`
`ML-DSA-44 verify`	`145009` cycles	`144951` cycles	`1.00`
`ML-DSA-65 keypair`	`226922` cycles	`227315` cycles	`1.00`
`ML-DSA-65 sign`	`814244` cycles	`813246` cycles	`1.00`
`ML-DSA-65 verify`	`231594` cycles	`231619` cycles	`1.00`
`ML-DSA-87 keypair`	`374429` cycles	`374603` cycles	`1.00`
`ML-DSA-87 sign`	`1021798` cycles	`1021441` cycles	`1.00`
`ML-DSA-87 verify`	`384208` cycles	`383659` cycles	`1.00`

oqs-bot

Graviton3 (no-opt)

Benchmark suite	Current: `f82f729`	Previous: `efe03f9`	Ratio
`ML-DSA-44 keypair`	`138585` cycles	`138628` cycles	`1.00`
`ML-DSA-44 sign`	`495158` cycles	`495579` cycles	`1.00`
`ML-DSA-44 verify`	`148937` cycles	`148792` cycles	`1.00`
`ML-DSA-65 keypair`	`241460` cycles	`241330` cycles	`1.00`
`ML-DSA-65 sign`	`810228` cycles	`809886` cycles	`1.00`
`ML-DSA-65 verify`	`241222` cycles	`240937` cycles	`1.00`
`ML-DSA-87 keypair`	`396305` cycles	`396441` cycles	`1.00`
`ML-DSA-87 sign`	`1031970` cycles	`1031506` cycles	`1.00`
`ML-DSA-87 verify`	`402475` cycles	`402272` cycles	`1.00`

oqs-bot

Graviton2 (no-opt)

Benchmark suite	Current: `f82f729`	Previous: `efe03f9`	Ratio
`ML-DSA-44 keypair`	`213442` cycles	`213493` cycles	`1.00`
`ML-DSA-44 sign`	`781132` cycles	`794089` cycles	`0.98`
`ML-DSA-44 verify`	`230277` cycles	`230005` cycles	`1.00`
`ML-DSA-65 keypair`	`380712` cycles	`381674` cycles	`1.00`
`ML-DSA-65 sign`	`1287339` cycles	`1285921` cycles	`1.00`
`ML-DSA-65 verify`	`373222` cycles	`373670` cycles	`1.00`
`ML-DSA-87 keypair`	`609594` cycles	`609555` cycles	`1.00`
`ML-DSA-87 sign`	`1644483` cycles	`1645486` cycles	`1.00`
`ML-DSA-87 verify`	`621636` cycles	`621588` cycles	`1.00`

github-actions

Arm Cortex-A76 (Raspberry Pi 5) benchmarks (opt)

Benchmark suite	Current: `f82f729`	Previous: `efe03f9`	Ratio
`ML-DSA-44 keypair`	`115380` cycles	`115390` cycles	`1.00`
`ML-DSA-44 sign`	`392034` cycles	`392115` cycles	`1.00`
`ML-DSA-44 verify`	`123904` cycles	`123546` cycles	`1.00`
`ML-DSA-65 keypair`	`200071` cycles	`199986` cycles	`1.00`
`ML-DSA-65 sign`	`648490` cycles	`647905` cycles	`1.00`
`ML-DSA-65 verify`	`203071` cycles	`202802` cycles	`1.00`
`ML-DSA-87 keypair`	`327348` cycles	`327077` cycles	`1.00`
`ML-DSA-87 sign`	`819919` cycles	`819688` cycles	`1.00`
`ML-DSA-87 verify`	`331865` cycles	`331074` cycles	`1.00`

github-actions

SpacemiT K1 8 (Banana Pi F3) benchmarks (no-opt)

Benchmark suite	Current: `f82f729`	Previous: `efe03f9`	Ratio
`ML-DSA-44 keypair`	`822021` cycles	`823286` cycles	`1.00`
`ML-DSA-44 sign`	`3332036` cycles	`3327209` cycles	`1.00`
`ML-DSA-44 verify`	`920516` cycles	`918657` cycles	`1.00`
`ML-DSA-65 keypair`	`1395987` cycles	`1400241` cycles	`1.00`
`ML-DSA-65 sign`	`5415850` cycles	`5443356` cycles	`0.99`
`ML-DSA-65 verify`	`1464876` cycles	`1464467` cycles	`1.00`
`ML-DSA-87 keypair`	`2296738` cycles	`2298732` cycles	`1.00`
`ML-DSA-87 sign`	`6800722` cycles	`6822286` cycles	`1.00`
`ML-DSA-87 verify`	`2402751` cycles	`2403402` cycles	`1.00`

github-actions

Arm Cortex-A76 (Raspberry Pi 5) benchmarks (no-opt)

Benchmark suite	Current: `f82f729`	Previous: `efe03f9`	Ratio
`ML-DSA-44 keypair`	`213144` cycles	`213012` cycles	`1.00`
`ML-DSA-44 sign`	`780665` cycles	`781249` cycles	`1.00`
`ML-DSA-44 verify`	`230117` cycles	`230192` cycles	`1.00`
`ML-DSA-65 keypair`	`380413` cycles	`380850` cycles	`1.00`
`ML-DSA-65 sign`	`1304248` cycles	`1291535` cycles	`1.01`
`ML-DSA-65 verify`	`372936` cycles	`372768` cycles	`1.00`
`ML-DSA-87 keypair`	`609458` cycles	`609112` cycles	`1.00`
`ML-DSA-87 sign`	`1641897` cycles	`1642387` cycles	`1.00`
`ML-DSA-87 verify`	`621885` cycles	`621381` cycles	`1.00`

github-actions

Arm Cortex-A72 (Raspberry Pi 4) benchmarks (opt)

Benchmark suite	Current: `f82f729`	Previous: `efe03f9`	Ratio
`ML-DSA-44 keypair`	`239303` cycles	`231907` cycles	`1.03`
`ML-DSA-44 sign`	`701852` cycles	`692048` cycles	`1.01`
`ML-DSA-44 verify`	`238239` cycles	`234215` cycles	`1.02`
`ML-DSA-65 keypair`	`395898` cycles	`397168` cycles	`1.00`
`ML-DSA-65 sign`	`1112619` cycles	`1103780` cycles	`1.01`
`ML-DSA-65 verify`	`392007` cycles	`380128` cycles	`1.03`
`ML-DSA-87 keypair`	`662188` cycles	`660299` cycles	`1.00`
`ML-DSA-87 sign`	`1484409` cycles	`1454152` cycles	`1.02`
`ML-DSA-87 verify`	`645366` cycles	`650049` cycles	`0.99`

github-actions

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Arm Cortex-A72 (Raspberry Pi 4) benchmarks (opt)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

Benchmark suite	Current: `f82f729`	Previous: `efe03f9`	Ratio
`ML-DSA-44 keypair`	`239303` cycles	`231907` cycles	`1.03`
`ML-DSA-65 verify`	`392007` cycles	`380128` cycles	`1.03`

github-actions

Arm Cortex-A72 (Raspberry Pi 4) benchmarks (no-opt)

Benchmark suite	Current: `f82f729`	Previous: `efe03f9`	Ratio
`ML-DSA-44 keypair`	`315776` cycles	`311426` cycles	`1.01`
`ML-DSA-44 sign`	`1230641` cycles	`1214729` cycles	`1.01`
`ML-DSA-44 verify`	`353493` cycles	`338228` cycles	`1.05`
`ML-DSA-65 keypair`	`562601` cycles	`572363` cycles	`0.98`
`ML-DSA-65 sign`	`2009516` cycles	`1992144` cycles	`1.01`
`ML-DSA-65 verify`	`541825` cycles	`547811` cycles	`0.99`
`ML-DSA-87 keypair`	`884415` cycles	`884798` cycles	`1.00`
`ML-DSA-87 sign`	`2488138` cycles	`2501693` cycles	`0.99`
`ML-DSA-87 verify`	`912836` cycles	`901676` cycles	`1.01`

github-actions

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Arm Cortex-A72 (Raspberry Pi 4) benchmarks (no-opt)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

Benchmark suite	Current: `f82f729`	Previous: `efe03f9`	Ratio
`ML-DSA-44 verify`	`353493` cycles	`338228` cycles	`1.05`

mkannwischer · 2025-10-04T01:00:46Z

Performance-wise, there is no reason not to merge this. There is even a small improvement of 1-3% on Cortex-A55 and (for reasons that are beyond me) on 4th gen AMD EPYC (c7a).

CBMC proofs are failing, but we can fix that at a later point.

Fundamentally, I believe such caching does not belong in sign.c, but should be done in fips202.c. One could make the incomplete lane part of the Keccak state which would make it a little bit cleaner, but it would still clutter the code somewhat.

WDYT @hanno-becker?

rod-chapman · 2025-10-04T02:13:21Z

I see one proof failure in mld_H. Let me take a look...

hanno-becker

Thanks @bremoran! I can definitely see this being useful for 32-bit platforms.

A few requests:

- I don't think this needs an API extension: Instead, the buffering of state prior to XOR'ing should be an implementation detail (add a buffer for the incomplete lane) of the existing absorb API.
- We should have documentation and CBMC proofs for new functionality.
- The new logic belongs to FIPS-202.

Could you adjust the PR accordingly?
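As a rough illustration of the first request, the incomplete-lane buffer could live inside the hashing context so that the absorb signature stays unchanged. This is only a sketch under assumptions: `shake_ctx`, `shake_absorb`, `shake_finalize` and `permute_placeholder` are hypothetical names rather than the FIPS-202 code in this repository, and `permute_placeholder` is a deterministic dummy standing in for Keccak-f[1600].

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LANE 8u
#define SHAKE256_RATE 136u /* SHAKE256 rate in bytes, a multiple of LANE */

typedef struct
{
  uint8_t s[200];     /* Keccak state bytes */
  uint8_t lane[LANE]; /* pending bytes of an incomplete lane */
  unsigned lane_len;  /* valid bytes in lane[], always < LANE between calls */
  unsigned pos;       /* lane-aligned absorb offset, always < SHAKE256_RATE */
} shake_ctx;

/* Placeholder only: NOT Keccak-f[1600]. It just scrambles the state
 * deterministically so that block boundaries are observable. */
static void permute_placeholder(uint8_t s[200])
{
  unsigned i;
  for (i = 0; i < 200; i++)
  {
    s[i] = (uint8_t)(s[i] * 5u + i + 1u);
  }
}

/* XOR exactly one full lane at the current offset. */
static void xor_lane(shake_ctx *c, const uint8_t *p)
{
  unsigned i;
  for (i = 0; i < LANE; i++)
  {
    c->s[c->pos + i] ^= p[i];
  }
  c->pos += LANE;
  if (c->pos == SHAKE256_RATE)
  {
    permute_placeholder(c->s);
    c->pos = 0;
  }
}

/* Same shape as a conventional absorb call: the lane buffer is hidden
 * in the context, so callers do not need a new API. */
static void shake_absorb(shake_ctx *c, const uint8_t *in, size_t len)
{
  while (len > 0)
  {
    if (c->lane_len == 0 && len >= LANE)
    {
      /* Fast path: a whole lane straight from the input. */
      xor_lane(c, in);
      in += LANE;
      len -= LANE;
    }
    else
    {
      /* Slow path: collect bytes until the lane is complete. */
      c->lane[c->lane_len++] = *in++;
      len--;
      if (c->lane_len == LANE)
      {
        c->lane_len = 0;
        xor_lane(c, c->lane);
      }
    }
  }
}

/* Pads the pending lane (SHAKE domain byte 0x1F, final bit 0x80) and
 * XORs it once, then permutes. */
static void shake_finalize(shake_ctx *c)
{
  unsigned i;
  assert(c->lane_len < LANE);
  c->lane[c->lane_len] = 0x1F;
  for (i = c->lane_len + 1; i < LANE; i++)
  {
    c->lane[i] = 0;
  }
  for (i = 0; i < LANE; i++)
  {
    c->s[c->pos + i] ^= c->lane[i];
  }
  c->s[SHAKE256_RATE - 1] ^= 0x80;
  permute_placeholder(c->s);
  c->pos = 0;
  c->lane_len = 0;
}
```

With this layout, absorbing a message in arbitrary chunks produces the same state as absorbing it in one call, and the only partial-lane XOR happens once, during finalization. The cost is eight extra bytes plus a counter per context.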

mkannwischer · 2025-10-04T04:09:14Z

> Thanks @bremoran! I can definitely see this being useful for 32-bit platforms.
>
> A few requests:
>
> - I don't think this needs an API extension: Instead, the buffering of state prior to XOR'ing should be an implementation detail (add a buffer for the incomplete lane) of the existing absorb API.
> - We should have documentation and CBMC proofs for new functionality.
> - The new logic belongs to FIPS-202.
>
> Could you adjust the PR accordingly?

I agree. Marking this as draft for now. @bremoran, please mark it as ready when you have updated the PR.
Let us know if you need help with adjusting the CBMC proofs.

bremoran requested a review from a team as a code owner September 10, 2025 13:36

bremoran force-pushed the f/refactor-fips202 branch from aa57a15 to 2cd2d61 Compare October 3, 2025 10:34

bremoran and others added 3 commits October 4, 2025 08:26

Avoid calling keccak_absorb with partial lanes

0dd40db

Signed-off-by: Brendan Moran <[email protected]>

Fix pos reset, fix linting errors

d868cc1

Signed-off-by: Brendan Moran <[email protected]>

Fix after FIPS202 function renaming

a8d2d6a

Signed-off-by: Matthias J. Kannwischer <[email protected]>

mkannwischer force-pushed the f/refactor-fips202 branch from 2cd2d61 to a8d2d6a Compare October 4, 2025 00:33

Remove contract of mld_shake256_absorb_with_residual

f82f729

This gets inlined into the proof of mld_H - no need for a separate contract if the proofs go through. Signed-off-by: Matthias J. Kannwischer <[email protected]>

mkannwischer added the benchmark label Oct 4, 2025

github-actions bot reviewed Oct 4, 2025

oqs-bot reviewed Oct 4, 2025

github-actions bot reviewed Oct 4, 2025

oqs-bot reviewed Oct 4, 2025

github-actions bot reviewed Oct 4, 2025

Increase CBMC_OBJECT_BITS for this proof to complete.

8184453

Signed-off-by: Rod Chapman <[email protected]>

hanno-becker requested changes Oct 4, 2025

mkannwischer marked this pull request as draft October 4, 2025 04:09
