Skip to content

Use better hashing in cache key calcuation #2162

@t-b

Description

@t-b

Using StringCRC is actually quite bad for calculating a hash of the data, as 1 states

CRC-32 poses the highest risk for hash collisions. This hash function is generally not recommended for use. If a hub were to contain 77,163 hash values, the chance of a hash collision occurring is 50%, which is extremely high compared to other methods.

AllenInstitute/mies-utils-xop#1 has a WIP branch with XXH32 from 2 which has the following performance characteristics.

Function dostuff()

	variable ref, i, numLoops = 1e5
	string str = "abcdefghijklmnopqrstuvwxyz"
	str = ReplicateString(str, 10)
	variable result
	string resultStr
	
	ref = stopmstimer(-2)
	for(i = 0; i < numLoops; i += 1)
		result = StringCrC(0, str)
	endfor

	printf "CRC32:    %.15d, time: %f micro seconds\r", result, (stopmstimer(-2) - ref) / numLoops

	ref = stopmstimer(-2)
	for(i = 0; i < numLoops; i += 1)
		result = MU_CalcHash(str, 0)
	endfor

	printf "XXHash32: %.15d, time: %f micro seconds\r", result, (stopmstimer(-2) - ref) / numLoops

	ref = stopmstimer(-2)
	for(i = 0; i < numLoops; i += 1)
		resultStr = Hash(str, 3)
	endfor

	printf "MD5:      %.15s, time: %f micro seconds\r", resultStr, (stopmstimer(-2) - ref) / numLoops

	ref = stopmstimer(-2)
	for(i = 0; i < numLoops; i += 1)
		resultStr = Hash(str, 1)
	endfor

	printf "SHA256:   %.15s, time: %f micro seconds\r", resultStr, (stopmstimer(-2) - ref) / numLoops
End

gives

•dostuff()
  CRC32:    000001268438376, time: 0.198013 mikro seconds
  XXHash32: 000001044162042, time: 0.467863 mikro seconds
  MD5:      4e6405697346169, time: 1.934640 mikro seconds
  SHA256:   6f6a4460cbd9241, time: 5.233524 mikro seconds

so while a tiny bit slower than CRC32 this is much faster as the other hash algorithms.

Adding this new hash would also allow us to move to XXhash from SHA* where we don't need cryptographic hashes.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions