GH-47572: [C++][Parquet] Uniform unpack interface #47573
Conversation
The results for this are not good.
I'm testing the performance of this PR locally on an AMD Zen 2 CPU on Ubuntu, and I'm seeing large regressions in unpack32 performance.
A quick experiment shows that much of the original performance can be regained by replacing the array of function pointers with a switch statement in
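For illustration, here is a minimal, self-contained sketch of the two dispatch styles being compared. The names (UnpackWidth, UnpackViaTable, UnpackViaSwitch) are hypothetical stand-ins, not the actual Arrow internals:

```cpp
#include <cstdint>

// Hypothetical per-width kernel; the real kernels decode kWidth-bit packed values.
template <int kWidth>
int UnpackWidth(const uint8_t* in, uint32_t* out, int batch_size) {
  (void)in;
  for (int i = 0; i < batch_size; ++i) out[i] = 0;
  return batch_size;
}

// Style 1: constexpr array of function pointers, indexed by bit width (indirect call).
using UnpackFn = int (*)(const uint8_t*, uint32_t*, int);
constexpr UnpackFn kUnpackTable[] = {UnpackWidth<0>, UnpackWidth<1>, UnpackWidth<2>};

int UnpackViaTable(const uint8_t* in, uint32_t* out, int batch_size, int num_bits) {
  return kUnpackTable[num_bits](in, out, batch_size);
}

// Style 2: switch statement; each case is a direct, potentially inlined call.
int UnpackViaSwitch(const uint8_t* in, uint32_t* out, int batch_size, int num_bits) {
  switch (num_bits) {
    case 0: return UnpackWidth<0>(in, out, batch_size);
    case 1: return UnpackWidth<1>(in, out, batch_size);
    case 2: return UnpackWidth<2>(in, out, batch_size);
    // ... remaining widths ...
    default: return 0;
  }
}
```

The difference matters because the compiler can inline or tail-merge the switch cases, whereas the table forces an indirect call on every invocation.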
@pitrou what about the Read/Decoder benchmarks? The algorithms are mostly similar, except:
I'm gonna investigate, but I suspect these benchmarks are highly unstable.
Here is a first round of comments (I haven't looked at all changes).
int unpack_width(const uint8_t* in, UnpackedUInt* out, int batch_size) {
  using UnpackerForWidth = Unpacker<UnpackedUInt, kPackedBitWidth>;

  constexpr auto kValuesUnpacked = UnpackerForWidth::kValuesUnpacked;
Should we ARROW_DCHECK_EQ(batch_size % kValuesUnpacked, 0)?
I think the currently implicit contract of these functions is to unpack up to batch_size values, but possibly fewer. At least that's how they are called. Best would be to have the epilog in these functions so that they extract exactly batch_size values.
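To make the two alternatives concrete, here is a hedged sketch (the identifiers are illustrative stand-ins, not the PR's actual code):

```cpp
#include <cstdint>

// Sketch of the contract under discussion; Unpacker, kValuesUnpacked and the
// ARROW_DCHECK_EQ call are placeholders for the real definitions.
template <typename UnpackedUInt, int kPackedBitWidth>
int UnpackBatch(const uint8_t* in, UnpackedUInt* out, int batch_size) {
  constexpr int kValuesUnpacked = 32;  // stand-in for Unpacker<...>::kValuesUnpacked

  // Option 1: make the multiple-of-kValuesUnpacked assumption explicit:
  //   ARROW_DCHECK_EQ(batch_size % kValuesUnpacked, 0);

  // Option 2 (the current implicit contract): unpack *up to* batch_size values,
  // possibly fewer, and let the caller run the epilog for the remainder.
  int unpacked = 0;
  while (batch_size - unpacked >= kValuesUnpacked) {
    // in = Unpacker<UnpackedUInt, kPackedBitWidth>::unpack(in, out + unpacked);
    unpacked += kValuesUnpacked;
  }
  (void)in;
  (void)out;
  return unpacked;  // may be < batch_size; an internal epilog would close that gap
}
```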
*test* \
*_generated.h \
*-benchmark.cc \
*_codegen.py \
I'm surprised, does Doxygen actually look at *.py files?
I was too, but it was giving me trouble with them.
cpp/src/arrow/CMakeLists.txt
util/bitmap_ops.cc
util/bpacking.cc
util/bpacking_scalar.cc
util/bpacking_simd_min.cc
What is "min" for?
It was either neon or sse4.2 (the latter was removed). I could rename it _neon, but that would not highlight that it is not compiled any differently than regular files.
ARROW_EXPORT int unpack_scalar(const uint8_t* in, Uint* out, int batch_size,
                               int num_bits);

extern template ARROW_TEMPLATE_EXPORT int unpack_scalar<uint8_t>(const uint8_t*, uint8_t*,
I don't think ARROW_TEMPLATE_EXPORT is useful, because this probably wouldn't be called across DLL boundaries (or would it?).
Yes, only the benchmarks and tests need it. Any suggestion for how we might change that?
Ah. Well, we can keep it like this for now if that doesn't degrade performance...
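For context, the extern-template pattern being discussed looks roughly like the following sketch (signatures simplified; whether an export macro such as ARROW_TEMPLATE_EXPORT is needed depends on whether the instantiations cross shared-library boundaries, e.g. for the benchmarks and tests):

```cpp
#include <cstdint>

// --- bpacking_scalar.h (sketch) ---
template <typename Uint>
int unpack_scalar(const uint8_t* in, Uint* out, int batch_size, int num_bits);

// Tell every including translation unit not to instantiate these implicitly;
// the definitions are compiled in exactly one .cc file.
extern template int unpack_scalar<uint8_t>(const uint8_t*, uint8_t*, int, int);
extern template int unpack_scalar<uint32_t>(const uint8_t*, uint32_t*, int, int);

// --- bpacking_scalar.cc (sketch) ---
template <typename Uint>
int unpack_scalar(const uint8_t* in, Uint* out, int batch_size, int num_bits) {
  // Placeholder body; the real implementation decodes num_bits-wide packed values.
  (void)in;
  (void)num_bits;
  for (int i = 0; i < batch_size; ++i) out[i] = Uint{0};
  return batch_size;
}

// Explicit instantiations: the only places where the template body is compiled.
template int unpack_scalar<uint8_t>(const uint8_t*, uint8_t*, int, int);
template int unpack_scalar<uint32_t>(const uint8_t*, uint32_t*, int, int);
```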
@ursabot please benchmark lang=C++
Benchmark runs are scheduled for commit dd1d942. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.
Indeed the picture is different when looking at the Read/Decode benchmarks, where there isn't much of a difference between the array of function pointers and the switch/case statement. Here is the diff between this PR and git main: https://gist.github.com/pitrou/39f2c7ea2faec482c4d74a26e1c98f06
I've reverted to a
@github-actions crossbow submit -g cpp
Revision: 38aa052
Submitted crossbow builds: ursacomputing/crossbow @ actions-beeb73de91
CI failures are unrelated, I'll merge.
After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 7a38744. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 17 possible false positives for unstable benchmarks that are known to sometimes produce them.
Rationale for this change

Refactor the unpack function code generator and dispatcher to accommodate more use cases:

- new output sizes: uint16_t and uint64_t
- a dispatch-once function returning a function pointer for the correct bit width

Update 2025-10-13: the dispatch-once and uint16_t implementations were removed as they turned out to be slower.

What changes are included in this PR?
The diffs are hard to look at, but the important files are:

- bpacking_internal.h: the kernels are now template specializations (unpack29_32 -> Unpacker<uint32_t, 29>::unpack), with the exception of uint16 on AVX512, for which the algorithm's assumptions break. The entry points are also template specializations (unpack32 -> unpack<uint32_t>).
- bpacking_dispatch_internal.h: the switch statements and for loops used to dispatch the bit width to its appropriate implementation are now all generic, with a constexpr-generated jump table (see the sketch after this list).
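As a rough illustration of the constexpr jump-table technique (a self-contained sketch, not the actual bpacking_dispatch_internal.h code; the Unpacker body is a placeholder):

```cpp
#include <array>
#include <cstdint>
#include <utility>

template <typename UnpackedUInt, int kPackedBitWidth>
struct Unpacker {
  static int unpack(const uint8_t* in, UnpackedUInt* out, int batch_size) {
    // Placeholder body; real specializations decode kPackedBitWidth-bit values.
    (void)in;
    for (int i = 0; i < batch_size; ++i) out[i] = UnpackedUInt{0};
    return batch_size;
  }
};

template <typename UnpackedUInt>
using UnpackFn = int (*)(const uint8_t*, UnpackedUInt*, int);

// Expand one table entry per bit width, each pointing at its specialization.
template <typename UnpackedUInt, std::size_t... kWidths>
constexpr auto MakeUnpackTable(std::index_sequence<kWidths...>) {
  return std::array<UnpackFn<UnpackedUInt>, sizeof...(kWidths)>{
      &Unpacker<UnpackedUInt, static_cast<int>(kWidths)>::unpack...};
}

// Generic dispatcher: index the compile-time table by the runtime bit width.
template <typename UnpackedUInt>
int unpack(const uint8_t* in, UnpackedUInt* out, int batch_size, int num_bits) {
  constexpr int kMaxWidth = sizeof(UnpackedUInt) * 8;
  static constexpr auto kTable =
      MakeUnpackTable<UnpackedUInt>(std::make_index_sequence<kMaxWidth + 1>{});
  return kTable[num_bits](in, out, batch_size);
}
```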
From a performance perspective:

- unpack<uint64_t> has a SIMD implementation that should benefit DeltaBitPackDecoder.
- The novel unpack<uint16_t> should benefit the level decoders.
- The dispatch-once mechanism should benefit all repeated calls to unpack functions (it still needs to be mixed with dynamic dispatch, but see get_unpack_fn for the building block).
- Update 2025-10-13: uint8_t and uint16_t now call the uint32_t version on a local buffer combined with a static_cast loop (what was done on the call site before); a sketch of that pattern follows this list.
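A minimal sketch of that fallback pattern (the names and buffer size are illustrative; unpack32 is a stub standing in for the real uint32_t entry point):

```cpp
#include <algorithm>
#include <cstdint>

// Stub standing in for the real uint32_t unpack entry point.
int unpack32(const uint8_t* in, uint32_t* out, int batch_size, int num_bits) {
  (void)in;
  (void)num_bits;
  std::fill(out, out + batch_size, 0u);
  return batch_size;
}

// Narrow-width unpack built on the uint32_t kernel: decode into a local
// uint32_t buffer, then narrow each value with a static_cast loop.
int unpack16_via_32(const uint8_t* in, uint16_t* out, int batch_size, int num_bits) {
  constexpr int kBufferSize = 512;
  uint32_t buffer[kBufferSize];
  int remaining = batch_size;
  while (remaining > 0) {
    const int chunk = std::min(remaining, kBufferSize);
    const int unpacked = unpack32(in, buffer, chunk, num_bits);
    if (unpacked == 0) break;  // nothing decoded; avoid spinning
    for (int i = 0; i < unpacked; ++i) {
      out[i] = static_cast<uint16_t>(buffer[i]);
    }
    out += unpacked;
    in += (unpacked * num_bits) / 8;  // assumes whole bytes were consumed
    remaining -= unpacked;
  }
  return batch_size - remaining;
}
```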
Are these changes tested?

Very much.

Are there any user-facing changes?