mpq: 2x faster normalization #123
Conversation
Note that technically this is undefined behaviour now. There is a reason it was a lambda with an
https://godbolt.org/z/Mabr6117n shows how to do that, and also shows that code-gen is identical for transform and raw loop. Did you repeat your benchmark multiple times to average out other randomness?
I did a small benchmark to verify my assumptions. To do that, I took the latest community-listfile.csv. I then tested

```cpp
char toLower(char c) {
    return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
}

char toSlash(char c) {
    return (c == '\\') ? '/' : c;
}

char toLowerSlash(char c) {
    return (c == '\\') ? '/' : toLower(c);
}

char toLowerSlashLookup(char c) {
    return toLowerBackslashLookup[static_cast<unsigned char>(c)];
}
```

in

```cpp
void normalizeTwoTransform(std::string& s) {
    std::transform(s.begin(), s.end(), s.begin(), toLower);
    std::transform(s.begin(), s.end(), s.begin(), toSlash);
}

void normalizeManualForTwoOp(std::string& s) {
    for (auto& c : s) {
        c = toLower(c);
        c = toSlash(c);
    }
}
```

combinations. (Out-reference to avoid benchmarking memcopies in case the moves aren't perfect.) I repeated that ten times in random order to avoid caching/warmup effects, randomising casing per repetition. This resulted in
which is somewhat like I expected. Interestingly, two transforms apparently get half-vectorized on my compiler (the slash part only, obviously), while the manual two operations don't. The single-operation one doesn't get vectorized in any case for me. Nobody understands compilers, but that does explain why two+transform is faster than two+manual, while one+transform and one+manual are essentially identical. The lookup one does not get vectorized for me, but it does get an 8x loop unroll (I'm on a 64-bit arch). Turning off

At this point, of course the lookup-table one would win. But is that actually the best? Auto-vectorizers should be better these days, and as we can see, the limit was having to call tolower. What if we implement that without a lookup table, but still without locale dependency?

```cpp
char toLower(char c) {
    bool const isUppercaseLetter = 'A' <= c && c <= 'Z';
    constexpr auto const toLowerConstant = 'a' - 'A';
    static_assert('A' + toLowerConstant == 'a');
    return c + toLowerConstant * isUppercaseLetter;
}
```

and we get
which has the lookup table worse than the versions using simple tolower. Looking at the disassembly shows the auto-vectorizer working very nicely. The next step is obviously to combine that:

```cpp
char toLowerSlash(char c) {
    bool const isUppercaseLetter = 'A' <= c && c <= 'Z';
    bool const isUpperslash = c == '\\';
    constexpr auto const toLowerConstant = 'a' - 'A';
    constexpr auto const toLowerSlashConstant = '/' - '\\';
    static_assert('A' + toLowerConstant == 'a');
    static_assert('\\' + toLowerSlashConstant == '/');
    return c + toLowerConstant * isUppercaseLetter + toLowerSlashConstant * isUpperslash;
}
```
It feels like the compiler can do a slightly better job with two operations in a loop, but all of this is highly compiler- and architecture-dependent and, in the end, luck. The lookup table is the least compiler-dependent, though in a perfect world not ideal. The main user group is Windows users, and I have no idea how MSVC compiles these options, sadly. I'm still leaning towards the lookup table because, while it isn't perfect, it is likely the most-guaranteed-to-be-fine option. @Adspartan Is there any chance you can repeat this benchmark on Windows? This normalize.cpp.txt (named to avoid the amazing 1990s-era security filter for file types) is the code I tested in the end.
As expected, MSVC is horrible at auto-vectorization. Two+transform is 5 s; the one-op version shaves off 400 ms; avoiding the locale-aware tolower takes off a big chunk, down to 1.5 s for both one-op and two-op, making them essentially equal. The lookup table is at 1 s. So I stand by my suggestion: the lookup is consistently okay, tolower() is the biggest cost, and a manual reimplementation is somewhat fine but probably not worth the hope. Most other libs also use lookup tables, so just do the same.
1483843 implements lookup-table-based normalisation
Sorry, couldn't resist.
Looking at
normalized_filename()
, I couldn't shake the thought: "I wonder how often this runs when loading chunks?" Turns out: way too often (more than 200 thousand times just for loading a single chunk).
Then came another idea: "why not cut its execution time in half?"
Then it hit me - 'we can make it even faster!' - and here's what I implemented:
Needs testing on other systems.
I ran two "benchmarks":
Here are the results:
And also: QString processes single-chunk loading in 550 ms, but spends an additional 400 ms converting to std::string.
And the same operations using the Boost library take as long as 3000 ms.