Replies: 1 comment 1 reply
-
At a certain point (256kb right now, so anything more than 64k ints) the implementation switches to non-temporal stores. The source comment explains:

/// A non-temporal store is one that allows the CPU to bypass the cache when writing to memory.
///
/// This can be beneficial when working with large amounts of memory where the writes would otherwise
/// cause large amounts of repeated updates and evictions. The hardware optimization manuals recommend
/// the threshold to be roughly half the size of the last level of on-die cache -- that is, if you have approximately
/// 4MB of L3 cache per core, you'd want this to be approx. 1-2MB, depending on if hyperthreading was enabled.
///
/// However, actually computing the amount of L3 cache per core can be tricky or error prone. Native memcpy
/// algorithms use a constant threshold that is typically around 256KB and we match that here for simplicity. This
/// threshold accounts for most processors in the last 10-15 years that had approx. 1MB L3 per core and support
/// hyperthreading, giving a per core last level cache of approx. 512KB.

The general consideration is that microbenchmarks are effectively doing nothing but running the same code over and over, so they end up benefiting greatly from caching and branch prediction, often giving results that may not line up with real world usage.

A typical real world application will be running on the machine with many other processes and services going at the "same time". They will likely be running other application logic, touching a wider range of data, etc. This means that you typically won't get the "optimal" results where the branch predictor is perfectly trained for the loop, or you're likely to end up in a scenario where you have more regular cache misses due to other data needing to fit into the cache, causing your data to be evicted.

Non-temporal stores are used (and recommended by the architecture manuals) when you know you're working with data that is roughly larger than 50% of the available L3 cache space per core. This is because you're touching "enough" data that you're going to effectively evict all other data from the cache and pessimize the rest of the system. So the tradeoff is to be a "good" citizen and keep the entire system running smoothly by allowing your own code path to run a little slower (which in effect makes the end-to-end app run faster).
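To make that tradeoff concrete, here is a minimal sketch of a size-based switch to non-temporal stores using the x86 hardware intrinsics. This is not the runtime's actual implementation: the NonTemporalByteThreshold constant, the Copy helper, and the alignment assumptions are all illustrative, and only the ~256KB cutoff mirrors the comment above.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static unsafe class NonTemporalCopySketch
{
    // Illustrative constant mirroring the ~256KB threshold described above.
    private const int NonTemporalByteThreshold = 256 * 1024;

    public static void Copy(byte* source, byte* destination, int byteCount)
    {
        if (!Sse2.IsSupported || byteCount < NonTemporalByteThreshold)
        {
            // Small copies: ordinary (temporal) stores keep the data in cache,
            // which is what you want when it is likely to be read again soon.
            Buffer.MemoryCopy(source, destination, byteCount, byteCount);
            return;
        }

        // Large copies: bypass the cache so the copy does not evict everything else.
        // For brevity this assumes 'destination' is 16-byte aligned and byteCount is
        // a multiple of 16; real code must handle alignment and remainders.
        for (int i = 0; i < byteCount; i += 16)
        {
            Vector128<byte> block = Vector128.Load(source + i);
            Sse2.StoreAlignedNonTemporal(destination + i, block);
        }

        // Non-temporal stores are weakly ordered, so issue a store fence before
        // the results are relied upon by subsequent ordinary stores.
        Sse.StoreFence();
    }
}
```

Both branches produce the same result; the only difference is whether the destination writes pollute the cache, which is exactly the behavior the threshold is there to control.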
-
I ran benchmarks to compare the BitwiseOr method of the TensorPrimitives class and the pre-.NET 10 implementation of the BitArray class, because I was wondering if all the extra code in TensorPrimitives resulted in actual performance benefits compared to the relatively simple and easy to understand implementation in the old BitArray. The benchmark code and results are as follows:

It is nice to see that the TensorPrimitives method outperforms the old BitArray code if you store the result back into one of the inputs, and that it is equally performant if the destination is an entirely different memory region. But why is the performance suddenly significantly worse when storing the results back into an input if the spans are sufficiently large (TensorPrimitivesToX | 128000)?

The source code contains a comment, quoted in the reply above, that could help to explain the phenomenon.
I am thankful for all answers and everything new that I am able to learn 🙏
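Since the benchmark code itself is not reproduced above, the following is only a hypothetical sketch of this kind of comparison using BenchmarkDotNet. The element type (int), element counts, seed, and benchmark names other than TensorPrimitivesToX (which is taken from the result label mentioned in the question) are assumptions, not the original author's code.

```csharp
using System;
using System.Collections;
using System.Numerics.Tensors;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class BitwiseOrBenchmarks
{
    [Params(1_000, 128_000)]
    public int Length;

    private int[] _x = Array.Empty<int>();
    private int[] _y = Array.Empty<int>();
    private int[] _destination = Array.Empty<int>();
    private BitArray _bitsX = new BitArray(0);
    private BitArray _bitsY = new BitArray(0);

    [GlobalSetup]
    public void Setup()
    {
        var random = new Random(42);
        _x = new int[Length];
        _y = new int[Length];
        _destination = new int[Length];
        for (int i = 0; i < Length; i++)
        {
            _x[i] = random.Next();
            _y[i] = random.Next();
        }
        _bitsX = new BitArray(_x);
        _bitsY = new BitArray(_y);
    }

    // Writes the result back into one of the inputs (the case that slows down
    // once the spans are large enough to cross the non-temporal threshold).
    [Benchmark]
    public void TensorPrimitivesToX() => TensorPrimitives.BitwiseOr<int>(_x, _y, _x);

    // Writes the result into an entirely separate buffer.
    [Benchmark]
    public void TensorPrimitivesToDestination() => TensorPrimitives.BitwiseOr<int>(_x, _y, _destination);

    // The pre-.NET 10 BitArray implementation is always in-place.
    [Benchmark]
    public void BitArrayOr() => _bitsX.Or(_bitsY);
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<BitwiseOrBenchmarks>();
}
```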