
Contributor

@gratural gratural commented Apr 3, 2025

Sorry, couldn't resist.

Looking at normalized_filename(), I couldn't shake the thought: 'I wonder how often this runs when loading chunks?'
Turns out: way too often (more than 200,000 times just to load a single chunk).
Then came another idea: 'why not cut its execution time in half?'

      std::transform (filename.begin(), filename.end(), filename.begin()
                     , [] (char c)
                       {
                         return c == '\\' ? '/' : std::tolower(c);
                       }
                     );

Then it hit me - 'we can make it even faster!' - and here's what I implemented:

      for (char& c : filename)
      {
        c = (c == '\\') ? '/' : std::tolower(c);
      }

This needs testing on other systems.
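For reference, a self-contained sketch of the loop version (the function name and by-value signature are assumptions, not taken from the PR; the cast to unsigned char is required for std::tolower to be well-defined):

```cpp
#include <cctype>
#include <string>

// Hypothetical standalone version of the single-pass loop; the name
// normalized_filename and the by-value parameter are assumptions.
std::string normalized_filename(std::string filename)
{
  for (char& c : filename)
  {
    // std::tolower needs a value representable as unsigned char,
    // hence the cast.
    c = (c == '\\') ? '/'
                    : static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
  }
  return filename;
}
```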

I ran two "benchmarks":

  1. Load Azeroth_31_51
  2. Walking to Stratholme

Here are the results:

| Version | Single chunk (ms) | Walk to Strat (ms) |
| --- | --- | --- |
| Original | 1182 | 7833 |
| Single-pass | 785 | 5025 |
| For loop | 673 | 4182 |

Also: QString handles the single-chunk load in 550ms, but spends an additional 400ms converting to std::string.
The same operations using the boost library take as long as 3000ms.


@bloerwald

Note that technically this is undefined behaviour now. There is a reason it was a lambda taking an unsigned char: that subtle note in https://en.cppreference.com/w/cpp/string/byte/tolower.

std::tolower() is locale-aware, which takes some additional time. If you want perfect optimisation, also benchmark the lookup table approach.

https://godbolt.org/z/Mabr6117n shows how to do that, and also shows that code-gen is identical for transform and a raw loop. Did you repeat your benchmark multiple times to average out other randomness?
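A minimal sketch of the lookup-table approach mentioned above, assuming a 256-entry table precomputed once so the hot loop never calls the locale-aware tolower (all names here are illustrative, not from the PR):

```cpp
#include <array>
#include <string>

// Build a byte-to-byte mapping once: identity, except A-Z -> a-z and '\' -> '/'.
// This never touches the locale, so the hot loop is a plain table load.
std::array<char, 256> make_normalize_table()
{
  std::array<char, 256> table{};
  for (int i = 0; i < 256; ++i)
    table[i] = static_cast<char>(i);
  for (int i = 'A'; i <= 'Z'; ++i)
    table[i] = static_cast<char>(i - 'A' + 'a');
  table[static_cast<unsigned char>('\\')] = '/';
  return table;
}

const std::array<char, 256> normalize_table = make_normalize_table();

void normalize(std::string& s)
{
  for (char& c : s)
    c = normalize_table[static_cast<unsigned char>(c)];
}
```

With C++17, make_normalize_table could additionally be marked constexpr so the table is built at compile time.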

@bloerwald

bloerwald commented Apr 29, 2025

I did a small benchmark to verify my assumptions. To do that, I took the latest `community-listfile.csv`, ran it through `cut -f 2 -d ';' | tr -d '\r' | shuf`, then randomised the casing and slashes, resulting in 1853276 randomised strings of expected length.

I then tested

char toLower(char c) {
  return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
}
char toSlash(char c) {
  return (c == '\\') ? '/' : c;
}
char toLowerSlash(char c) {
  return (c == '\\') ? '/' : toLower(c);
}
char toLowerSlashLookup(char c) {
  // toLowerBackslashLookup is a precomputed 256-entry table (defined elsewhere).
  return toLowerBackslashLookup[static_cast<unsigned char>(c)];
}

in

void normalizeTwoTransform(std::string& s) {
  std::transform(s.begin(), s.end(), s.begin(), toLower);
  std::transform(s.begin(), s.end(), s.begin(), toSlash);
}
void normalizeManualForTwoOp(std::string& s) {
  for (auto& c : s) {
    c = toLower(c);
    c = toSlash(c);
  }
}

combinations. (The strings are passed by out-reference to avoid benchmarking memcopies in case the moves aren't perfect.) I repeated that ten times in random order to avoid caching/warmup effects, randomising the casing per repetition.
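The one-op variants referenced in the results are not shown above; presumably they apply the combined toLowerSlash in a single pass. A sketch, with the helper repeated for self-containment:

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Repeated from the comment above for self-containment.
char toLowerSlash(char c) {
  return (c == '\\') ? '/'
                     : static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
}

// Presumed definitions of the one-op variants named in the benchmark output.
void normalizeOneTransform(std::string& s) {
  std::transform(s.begin(), s.end(), s.begin(), toLowerSlash);
}
void normalizeManualForOneOp(std::string& s) {
  for (auto& c : s) {
    c = toLowerSlash(c);
  }
}
```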

This resulted in

normalizeTwoTransform: 2287ms
normalizeManualForTwoOp: 2341ms

normalizeOneTransform: 2249ms
normalizeManualForOneOp: 2270ms

normalizeOneTransformLookup: 462ms
normalizeManualForOneOpLookup: 449ms

which is somewhat like I expected. Interestingly, apparently two transforms get half vectorized on my compiler (the slash part only, obviously), while the manual two operations don't. The single operation one doesn't get vectorized in any case for me. Nobody understands compilers, but that does explain why two+transform is faster than two+manual, but one+transform and one+manual are essentially identical.

The lookup one does not get vectorized for me, but it does get an 8x loop unroll (I'm on a 64bit arch). Turning off -funroll-loops does not change the result in a relevant way though, which is somewhat expected as the main benefit comes from not calling locale dependent functions.

At this point, of course the lookup table one would win. But is that actually the best? Auto-vectorizers should be better these days, and as we can see, the limit was having to call tolower. What if we implement that without lookup table, but still without locale dependency?

char toLower(char c) {
  bool const isUppercaseLetter = 'A' <= c && c <= 'Z';
  constexpr auto const toLowerConstant = 'a' - 'A';
  static_assert('A' + toLowerConstant == 'a');
  return c + toLowerConstant * isUppercaseLetter;
}

and we get

normalizeManualForOneOp: 331ms
normalizeOneTransform: 365ms
normalizeManualForTwoOp: 316ms
normalizeTwoTransform: 429ms
normalizeManualForOneOpLookup: 450ms
normalizeOneTransformLookup: 466ms

which has the lookup table worse than the versions using simple tolower. Looking at disassembly shows the autovectorizer working very nicely.

Next step is obviously to combine that

char toLowerSlash(char c) {
  bool const isUppercaseLetter = 'A' <= c && c <= 'Z';
  bool const isUpperslash = c == '\\';
  constexpr auto const toLowerConstant = 'a' - 'A';
  constexpr auto const toLowerSlashConstant = '/' - '\\';
  static_assert('A' + toLowerConstant == 'a');
  static_assert('\\' + toLowerSlashConstant == '/');
  return c + toLowerConstant * isUppercaseLetter + toLowerSlashConstant * isUpperslash;
}

normalizeManualForOneOp: 329ms
normalizeOneTransform: 371ms
normalizeManualForTwoOp: 329ms
normalizeTwoTransform: 423ms
normalizeManualForOneOpLookup: 455ms
normalizeOneTransformLookup: 483ms

It feels like the compiler can do a slightly better job with two operations in a loop, but all of this is highly compiler and architecture dependent and in the end luck.
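As a sanity check (not part of the PR), the branchless combined mapping can be compared against a straightforward reference over every byte value:

```cpp
// Branchless combined mapping, as in the comment above.
char toLowerSlash(char c) {
  bool const isUppercaseLetter = 'A' <= c && c <= 'Z';
  bool const isUpperslash = c == '\\';
  constexpr auto toLowerConstant = 'a' - 'A';
  constexpr auto toLowerSlashConstant = '/' - '\\';
  return static_cast<char>(c + toLowerConstant * isUppercaseLetter
                             + toLowerSlashConstant * isUpperslash);
}

// Straightforward, locale-independent reference to compare against.
char referenceLowerSlash(char c) {
  if (c == '\\') return '/';
  if ('A' <= c && c <= 'Z') return static_cast<char>(c - 'A' + 'a');
  return c;
}
```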

The lookup table is least compiler dependent, but in the perfect world not ideal. The main user group is Windows users, and I have no idea how MSVC compiles these options, sadly. I'm still leaning towards lookup table because while it isn't perfect, it is likely the most-guaranteed-to-be-fine one.

@Adspartan Is there any chance you can repeat this benchmark on Windows? This normalize.cpp.txt (named to avoid the amazing 1990-era security filter for file types) is the code I tested in the end.
listfile.txt.7z.totally-a.zip

@bloerwald

As expected, MSVC is horrible with autovectorization.

Two+transform is 5s; one op shaves off 400ms; avoiding the locale-aware tolower takes off a big chunk, down to 1.5s for both one-op and two-op, making them essentially equal. The lookup table is at 1s.

So I stand by my suggestion: the lookup table is consistently okay, tolower() is the biggest cost, and a manual reimplementation is somewhat fine but probably not worth betting on. Most other libs do lookup tables as well, so just do the same.

@bloerwald

1483843 implements lookup-table-based normalisation.
