@AntoinePrv (Contributor) commented Sep 16, 2025

Rationale for this change

Refactor the unpack function code generator and dispatcher to accommodate more use cases:

  • New sizes: uint16_t and uint64_t
  • A dispatch-once function returning a function pointer for the correct bit width

Update 2025-10-13: the dispatch-once and uint16_t implementations were removed, as they turned out to be slower.

What changes are included in this PR?

The diff is hard to read, but the important files to look at are:

  • The two Python files for code generation accommodate the new sizes (with the exception of uint16_t on AVX512, for which the algorithm's assumptions break);
  • The two code generators produce a uniform structure for the "batch unpackers" they generate: each one of them is a specialization of a struct template, e.g. unpack29_32 -> Unpacker<uint32_t, 29>::unpack
    • Using specializations instead of hard-coded numbers in function names makes it possible to use them in more generic code
    • Wrapping the functions in a struct makes it possible to carry information along with the function (such as the number of values that said function unpacks) and leaves the door open for partial specialization for future improvements.
  • The public functions in bpacking_internal.h are also template specializations, e.g. unpack32 -> unpack<uint32_t>
  • The large switch statements and for loops used to dispatch a bit width to its appropriate implementation are now all generic, with a constexpr-generated jump table. The logic is in bpacking_dispatch_internal.h

From a performance perspective:

  • there is no improvement to the individual generated "batch unpackers"
    • The SIMD code actually does worse than scalar on SSE4.2 or with uint16_t
  • there are new sizes that can bring improvements
    • unpack<uint64_t> has a SIMD implementation that should benefit DeltaBitPackDecoder
    • The novel unpack<uint16_t> should benefit the level decoders
    • The dispatch-once mechanism should benefit all repeated calls to the unpack functions (it still needs to be mixed with dynamic dispatch, but see get_unpack_fn for the building block).

Update 2025-10-13:

  • Added unpack implementations for uint8_t and uint16_t that call the uint32_t version on a local buffer, combined with a static_cast loop (what was previously done at the call sites).
  • Performance should remain on par with the previous implementations. As it stands, the PR mainly changes the interface of the unpack functions for future iterations and cleaner use.

Are these changes tested?

Very much.

Are there any user-facing changes?


⚠️ GitHub issue #47572 has been automatically assigned in GitHub to PR creator.

@AntoinePrv (Contributor, PR author)

The results for this are not good

@AntoinePrv (Contributor, PR author)

  • The dispatch-once and uint16_t implementations were removed, as they turned out to be slower.
  • Added unpack implementations for uint8_t and uint16_t that call the uint32_t version on a local buffer, combined with a static_cast loop (what was previously done at the call sites).
  • Performance should remain on par with the previous implementations. As it stands, the PR mainly changes the interface of the unpack functions for future iterations and cleaner use.

@AntoinePrv AntoinePrv changed the title GH-47572: [C++][Parquet] Generalize bpacking and possible speedup GH-47572: [C++][Parquet] Uniform unpack interface Oct 13, 2025
@AntoinePrv AntoinePrv marked this pull request as ready for review October 13, 2025 16:47
@pitrou (Member) commented Oct 14, 2025

I'm testing the performance of this PR locally on an AMD Zen 2 CPU on Ubuntu, and I'm seeing large regressions in unpack32 performance.

  1. scalar unpack32 regresses on all input sizes:
  • before:
BM_UnpackUint32/ScalarUnaligned/1/64            23.3 ns         23.3 ns     30099581 items_per_second=2.74685G/s
BM_UnpackUint32/ScalarUnaligned/2/64            15.1 ns         15.1 ns     45528078 items_per_second=4.22707G/s
BM_UnpackUint32/ScalarUnaligned/8/64            16.2 ns         16.1 ns     43418304 items_per_second=3.96297G/s
BM_UnpackUint32/ScalarUnaligned/20/64           22.4 ns         22.4 ns     31217249 items_per_second=2.85244G/s

BM_UnpackUint32/ScalarUnaligned/1/1024           310 ns          310 ns      2251073 items_per_second=3.29834G/s
BM_UnpackUint32/ScalarUnaligned/2/1024           213 ns          213 ns      3288295 items_per_second=4.80898G/s
BM_UnpackUint32/ScalarUnaligned/8/1024           357 ns          357 ns      1957044 items_per_second=2.86556G/s
BM_UnpackUint32/ScalarUnaligned/20/1024          339 ns          339 ns      2066262 items_per_second=3.01942G/s

BM_UnpackUint32/ScalarUnaligned/1/32768         9829 ns         9828 ns        71098 items_per_second=3.33403G/s
BM_UnpackUint32/ScalarUnaligned/2/32768         6543 ns         6543 ns       107368 items_per_second=5.00848G/s
BM_UnpackUint32/ScalarUnaligned/8/32768        11410 ns        11410 ns        60956 items_per_second=2.87199G/s
BM_UnpackUint32/ScalarUnaligned/20/32768       10737 ns        10736 ns        65168 items_per_second=3.05219G/s
  • after:
BM_UnpackUint32/ScalarUnaligned/1/64            26.3 ns         26.3 ns     26597461 items_per_second=2.43124G/s
BM_UnpackUint32/ScalarUnaligned/2/64            25.9 ns         25.9 ns     27099769 items_per_second=2.47115G/s
BM_UnpackUint32/ScalarUnaligned/8/64            23.0 ns         23.0 ns     30377366 items_per_second=2.78236G/s
BM_UnpackUint32/ScalarUnaligned/20/64           41.1 ns         41.1 ns     17029900 items_per_second=1.55726G/s

BM_UnpackUint32/ScalarUnaligned/1/1024           368 ns          368 ns      1903394 items_per_second=2.78384G/s
BM_UnpackUint32/ScalarUnaligned/2/1024           342 ns          342 ns      2047462 items_per_second=2.99373G/s
BM_UnpackUint32/ScalarUnaligned/8/1024           358 ns          358 ns      1956688 items_per_second=2.86192G/s
BM_UnpackUint32/ScalarUnaligned/20/1024          538 ns          538 ns      1306908 items_per_second=1.90488G/s

BM_UnpackUint32/ScalarUnaligned/1/32768        11647 ns        11646 ns        59931 items_per_second=2.81379G/s
BM_UnpackUint32/ScalarUnaligned/2/32768        10844 ns        10843 ns        64537 items_per_second=3.02197G/s
BM_UnpackUint32/ScalarUnaligned/8/32768        11303 ns        11301 ns        61943 items_per_second=2.89948G/s
BM_UnpackUint32/ScalarUnaligned/20/32768       16915 ns        16915 ns        41375 items_per_second=1.93724G/s
  2. SIMD unpack32 regresses, but only at small input sizes:
  • before:
BM_UnpackUint32/Avx2Unaligned/1/64              6.10 ns         6.10 ns    115307711 items_per_second=10.4919G/s
BM_UnpackUint32/Avx2Unaligned/2/64              5.80 ns         5.80 ns    119616924 items_per_second=11.0319G/s
BM_UnpackUint32/Avx2Unaligned/8/64              7.84 ns         7.83 ns     89310891 items_per_second=8.16917G/s
BM_UnpackUint32/Avx2Unaligned/20/64             30.1 ns         30.1 ns     23168069 items_per_second=2.12844G/s

BM_UnpackUint32/Avx2Unaligned/1/1024            61.3 ns         61.3 ns     11549890 items_per_second=16.7084G/s
BM_UnpackUint32/Avx2Unaligned/2/1024            60.4 ns         60.4 ns     11462262 items_per_second=16.9551G/s
BM_UnpackUint32/Avx2Unaligned/8/1024             118 ns          118 ns      5936723 items_per_second=8.68313G/s
BM_UnpackUint32/Avx2Unaligned/20/1024            426 ns          426 ns      1628900 items_per_second=2.40438G/s

BM_UnpackUint32/Avx2Unaligned/1/32768           1916 ns         1916 ns       364744 items_per_second=17.1002G/s
BM_UnpackUint32/Avx2Unaligned/2/32768           2071 ns         2071 ns       340164 items_per_second=15.8259G/s
BM_UnpackUint32/Avx2Unaligned/8/32768           3814 ns         3814 ns       183339 items_per_second=8.59169G/s
BM_UnpackUint32/Avx2Unaligned/20/32768         13674 ns        13672 ns        51167 items_per_second=2.39671G/s
  • after:
BM_UnpackUint32/Avx2Unaligned/1/64              11.7 ns         11.7 ns     60043243 items_per_second=5.47447G/s
BM_UnpackUint32/Avx2Unaligned/2/64              11.7 ns         11.7 ns     59912039 items_per_second=5.47921G/s
BM_UnpackUint32/Avx2Unaligned/8/64              13.4 ns         13.3 ns     52435550 items_per_second=4.79412G/s
BM_UnpackUint32/Avx2Unaligned/20/64             37.6 ns         37.6 ns     18551367 items_per_second=1.69999G/s

BM_UnpackUint32/Avx2Unaligned/1/1024            68.2 ns         68.2 ns     10297857 items_per_second=15.0158G/s
BM_UnpackUint32/Avx2Unaligned/2/1024            68.2 ns         68.2 ns     10265410 items_per_second=15.0135G/s
BM_UnpackUint32/Avx2Unaligned/8/1024             122 ns          122 ns      5723819 items_per_second=8.37532G/s
BM_UnpackUint32/Avx2Unaligned/20/1024            460 ns          460 ns      1521877 items_per_second=2.2277G/s

BM_UnpackUint32/Avx2Unaligned/1/32768           1958 ns         1957 ns       357945 items_per_second=16.7407G/s
BM_UnpackUint32/Avx2Unaligned/2/32768           1939 ns         1939 ns       361274 items_per_second=16.9008G/s
BM_UnpackUint32/Avx2Unaligned/8/32768           3831 ns         3830 ns       182683 items_per_second=8.55518G/s
BM_UnpackUint32/Avx2Unaligned/20/32768         14430 ns        14429 ns        48513 items_per_second=2.27106G/s

@pitrou (Member) commented Oct 14, 2025

A quick experiment shows that much of the original performance can be regained by replacing the array of function pointers with a switch statement in unpack_jump. It's less pretty, for sure.

@AntoinePrv (Contributor, PR author)

@pitrou what about the Read/Decoder benchmarks?

The algorithms are mostly similar, except for:

  • The way pointers are incremented: using out[0], out[1], ... instead of *out++ =
  • The function names and how they are grouped in files.
  • The jump table (which should have a bigger relative impact for SIMD, since those unpackers are faster).

I'm going to investigate, but I suspect these benchmarks are highly unstable.

@pitrou (Member) left a comment:

Here is a first round of comments (I haven't looked at all changes).

int unpack_width(const uint8_t* in, UnpackedUInt* out, int batch_size) {
using UnpackerForWidth = Unpacker<UnpackedUInt, kPackedBitWidth>;

constexpr auto kValuesUnpacked = UnpackerForWidth::kValuesUnpacked;
Member:

Should we ARROW_DCHECK_EQ(batch_size % kValuesUnpacked, 0)?

Contributor (PR author):

I think the currently implicit contract of these functions is to unpack up to batch_size values, but possibly fewer.
At least, that's how they are called. It would be best to have the epilogue in these functions so that they extract exactly batch_size values.

*test* \
*_generated.h \
*-benchmark.cc \
*_codegen.py \
Member:

I'm surprised, does Doxygen actually look at *.py files?

Contributor (PR author):

I was too, but Doxygen was giving me trouble with them.

util/bitmap_ops.cc
util/bpacking.cc
util/bpacking_scalar.cc
util/bpacking_simd_min.cc
Member:

What is "min" for?

Contributor (PR author):

It was either neon or sse4.2 (the latter was removed). I could rename it _neon, but that would not highlight that it is not compiled any differently than the regular files.

@github-actions github-actions bot added awaiting committer review Awaiting committer review awaiting review Awaiting review and removed awaiting review Awaiting review awaiting committer review Awaiting committer review labels Oct 14, 2025
ARROW_EXPORT int unpack_scalar(const uint8_t* in, Uint* out, int batch_size,
int num_bits);

extern template ARROW_TEMPLATE_EXPORT int unpack_scalar<uint8_t>(const uint8_t*, uint8_t*,
Member:

I don't think ARROW_TEMPLATE_EXPORT is useful, because this probably wouldn't be called across DLL boundaries (or would it?).

Contributor (PR author):

Yes, only the benchmarks and tests need it. Any suggestion for how we might change that?

Member:

Ah. Well, we can keep it like this for now if that doesn't degrade performance...

@pitrou (Member) commented Oct 14, 2025

@ursabot please benchmark lang=C++

@voltrondatabot

Benchmark runs are scheduled for commit dd1d942. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

@pitrou (Member) commented Oct 14, 2025

@pitrou what about the Read/Decoder benchmarks?

Indeed, the picture is different when looking at the Read/Decode benchmarks, where there isn't much of a difference between the array of function pointers and the switch/case statement.

Here is the diff between this PR and git main: https://gist.github.com/pitrou/39f2c7ea2faec482c4d74a26e1c98f06

@AntoinePrv (Contributor, PR author)

I've reverted to a switch statement and properly handled bool.

@pitrou pitrou added the CI: Extra: C++ Run extra C++ CI label Oct 16, 2025
@pitrou (Member) commented Oct 16, 2025

@github-actions crossbow submit -g cpp


Revision: 38aa052

Submitted crossbow builds: ursacomputing/crossbow @ actions-beeb73de91

Task Status
example-cpp-minimal-build-static GitHub Actions
example-cpp-minimal-build-static-system-dependency GitHub Actions
example-cpp-tutorial GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-cuda-cpp-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-42-cpp GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-bundled GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-bundled-offline GitHub Actions
test-ubuntu-24.04-cpp-gcc-13-bundled GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions
test-ubuntu-24.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-24.04-cpp-thread-sanitizer GitHub Actions

@pitrou (Member) left a comment:

CI failures are unrelated, I'll merge.

@pitrou pitrou merged commit 7a38744 into apache:main Oct 16, 2025
54 of 55 checks passed
@pitrou pitrou removed the awaiting committer review Awaiting committer review label Oct 16, 2025
@AntoinePrv AntoinePrv deleted the bit-packing branch October 16, 2025 14:52

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 7a38744.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 17 possible false positives for unstable benchmarks that are known to sometimes produce them.
