-
Notifications
You must be signed in to change notification settings - Fork 9
Open
Labels
Description
Using StringCRC is actually quite bad for calculating a hash of the data, as 1 states
CRC-32 poses the highest risk for hash collisions. This hash function is generally not recommended for use. If a hub were to contain 77,163 hash values, the chance of a hash collision occurring is 50%, which is extremely high compared to other methods.
AllenInstitute/mies-utils-xop#1 has a WIP branch with XXH32 from 2 which has the following performance characteristics.
Function dostuff()
variable ref, i, numLoops = 1e5
string str = "abcdefghijklmnopqrstuvwxyz"
str = ReplicateString(str, 10)
variable result
string resultStr
ref = stopmstimer(-2)
for(i = 0; i < numLoops; i += 1)
result = StringCrC(0, str)
endfor
printf "CRC32: %.15d, time: %f micro seconds\r", result, (stopmstimer(-2) - ref) / numLoops
ref = stopmstimer(-2)
for(i = 0; i < numLoops; i += 1)
result = MU_CalcHash(str, 0)
endfor
printf "XXHash32: %.15d, time: %f micro seconds\r", result, (stopmstimer(-2) - ref) / numLoops
ref = stopmstimer(-2)
for(i = 0; i < numLoops; i += 1)
resultStr = Hash(str, 3)
endfor
printf "MD5: %.15s, time: %f micro seconds\r", resultStr, (stopmstimer(-2) - ref) / numLoops
ref = stopmstimer(-2)
for(i = 0; i < numLoops; i += 1)
resultStr = Hash(str, 1)
endfor
printf "SHA256: %.15s, time: %f micro seconds\r", resultStr, (stopmstimer(-2) - ref) / numLoops
Endgives
•dostuff()
CRC32: 000001268438376, time: 0.198013 mikro seconds
XXHash32: 000001044162042, time: 0.467863 mikro seconds
MD5: 4e6405697346169, time: 1.934640 mikro seconds
SHA256: 6f6a4460cbd9241, time: 5.233524 mikro seconds
so while a tiny bit slower than CRC32 this is much faster as the other hash algorithms.
Adding this new hash would also allow us to move to XXhash from SHA* where we don't need cryptographic hashes.