Is the uniqueness of CRC-32 hashes sufficient to uniquely identify strings containing filenames?

I have sorted lists of filenames concatenated into strings, and I want to identify each such string by a unique checksum.
These strings are a minimum of 100 bytes, a maximum of 4000 bytes, and an average of 1000 bytes long. The total number of strings could be anything, but is more likely in the range of roughly 10,000.
Is CRC-32 suited for this purpose?
E.g. I need each of the following strings to have a different fixed-length (preferably short) checksum:
"/some/path/to/something/some/other/path"
"/some/path/to/something/another/path"
"/some/path"
...
# these strings can get __very__ long (very long strings are the norm)
Is the uniqueness of CRC-32 hashes increased by input length?
Is there a better choice of checksum for this purpose?
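For illustration, this is roughly what computing a fixed-length CRC-32 checksum of such strings looks like, a minimal sketch using Python's zlib (the language choice is just for the example; the paths are the ones above):

```python
import zlib

paths = [
    "/some/path/to/something/some/other/path",
    "/some/path/to/something/another/path",
    "/some/path",
]

for p in paths:
    # zlib.crc32 returns an unsigned 32-bit integer; format it as 8 hex digits
    print(f"{zlib.crc32(p.encode('utf-8')):08x}  {p}")
```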

No.
Unless your filenames were all four characters or less, there is no assurance that the CRCs will be unique. With 10,000 names, the probability of at least two of them having the same CRC is about 1%.
This would be true for any 32-bit hash value.
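That ~1% figure comes from the standard birthday-problem approximation; a quick sketch to reproduce it:

```python
import math

n = 10_000          # number of names
m = 2 ** 32         # number of possible 32-bit hash values

# Probability that at least two of the n names share a hash value,
# using the usual approximation 1 - exp(-n*(n-1) / (2*m)).
p_collision = 1 - math.exp(-n * (n - 1) / (2 * m))
print(f"{p_collision:.2%}")   # about 1%
```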
The best way to assign a unique code to each name is simply to start a counter at zero for the first name and increment it for each subsequent name, assigning the counter value as the code for that name. However, that won't help you compute the code given just the name.
You can use a hash, such as a CRC or some other hash, but you will need to deal with collisions. There are several common approaches in the literature. You would keep a table of hashes with names assigned, and on a collision you could simply increment the hash until you find one not yet used and assign that one. Then, when looking up a name, you start at the computed hash and do a linear search until you find the name or an unused slot.
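A minimal sketch of that open-addressing scheme (the table size and the use of CRC-32 here are illustrative choices, not prescribed):

```python
import zlib

TABLE_SIZE = 1 << 16            # illustrative table size, larger than the expected name count

table = [None] * TABLE_SIZE     # each slot holds a name or None

def slot_for(name: str) -> int:
    """Return the slot index for `name`, inserting it if not already present."""
    i = zlib.crc32(name.encode("utf-8")) % TABLE_SIZE
    while table[i] is not None and table[i] != name:
        i = (i + 1) % TABLE_SIZE        # linear probing: try the next slot on collision
    if table[i] is None:
        table[i] = name                 # first time we see this name: claim the slot
    return i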
As for the hash, I would recommend XXH64. It is a very fast 64-bit hash. You do not need a cryptographic hash for this application, which would be unnecessarily slow.
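If you go with XXH64, the third-party xxhash package for Python (an assumption here, just one common binding for the algorithm) exposes it like this:

```python
import xxhash   # pip install xxhash

h = xxhash.xxh64("/some/path/to/something/some/other/path".encode("utf-8"))
print(h.intdigest())    # 64-bit integer
print(h.hexdigest())    # same value as 16 hex digits
```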

Related

Best hashing function for outputting non-numeric hashes

Bit of an odd question, but tldr, I need an algorithm that will generate a unique fixed length hash that doesn't output any numbers. Is this even possible?
The names of my tables can sometimes be way too long (they are generated based on parameters and can exceed the 63-character limit enforced by PostgreSQL). Thus, I need a way to shorten the names of those tables while avoiding collisions as much as possible. And the names shouldn't include any numbers.
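One way to approach this is to hash the long name and re-encode the digest using only letters; a rough sketch under that assumption (SHA-256 and a 20-letter suffix are arbitrary choices here):

```python
import hashlib

def letters_only_hash(name: str, length: int = 20) -> str:
    """Encode a SHA-256 digest of `name` into lowercase letters a-z only."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    n = int.from_bytes(digest, "big")
    out = []
    while n and len(out) < length:
        n, r = divmod(n, 26)             # peel off one base-26 "digit" at a time
        out.append(chr(ord("a") + r))
    return "".join(out)

long_name = "some_generated_table_name_that_is_far_too_long_for_postgres_to_accept"
# keep a readable prefix and append the letters-only hash; total stays under 63 characters
print(long_name[:40] + "_" + letters_only_hash(long_name))
```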

Can the hash of 2 different inputs be the same?

According to my understanding, hashing is a process of producing a unique fixed-length (let's assume 64-bit) output for an input of ANY length. (Correct me if I am wrong.)
So if I take all of the (x) possible 64-bit hash values that a hash function can produce and append a 0 or a 1 to the end of each, I get a list of size 2x (where each entry is 65 bits long).
If I give all 2x combinations as input to the same hash function, how can it generate a unique hash for every one of them?
You are correct. This is called a hash collision, and it's a real thing. The reason it's not a bigger deal is that the number of hashes is so overwhelmingly large that these types of collisions are rare. Your example of 64 bits is a little unrealistic, though. 256 bits or 512 bits is a more likely scenario. (Even 128 is no longer considered strong enough.) And the range of hashes in this case is so large that finding inputs that create a hash collision is very difficult.
By the pigeonhole principle, hash collisions are inevitable. That is, it is inevitable that there exist two distinct messages m1 != m2 such that their hashes are equal, H(m1) = H(m2).
Therefore, one cannot generate unique hashes for all inputs. With a very, very small (negligible) probability, there will be a collision. Even within 2^64 possible values, there can be a collision for a hash function with 64-bit output.
Better to use a hash function like SHA3-512 or BLAKE2b, and if you really want the hashes to be unique, compare each new hash against the ones you have already generated. If you find a collision, you will be famous.
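To see how quickly collisions show up once the output space is small, here is a sketch that brute-forces a collision on just the first 32 bits of SHA-256; by the birthday bound it typically takes only on the order of 2^16 attempts:

```python
import hashlib

seen = {}          # maps truncated digest -> the input that produced it
i = 0
while True:
    msg = f"message-{i}".encode()
    prefix = hashlib.sha256(msg).digest()[:4]    # keep only the first 32 bits
    if prefix in seen and seen[prefix] != msg:
        print("collision:", seen[prefix], msg, prefix.hex())
        break
    seen[prefix] = msg
    i += 1
```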
SHA3 family can generate 224, 256, 384, or 512-bit outputs.
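In Python's hashlib, for example, those variants look like this (BLAKE2b additionally lets you choose the digest size directly):

```python
import hashlib

data = b"some input"

for name in ("sha3_224", "sha3_256", "sha3_384", "sha3_512"):
    h = hashlib.new(name, data)
    print(name, h.digest_size * 8, "bits:", h.hexdigest())

# BLAKE2b supports any digest size from 1 to 64 bytes
print("blake2b/256:", hashlib.blake2b(data, digest_size=32).hexdigest())
```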

Is it safe to cut the hash?

I would like to store hashes for approximately 2 billion strings. For that purpose I would like to use as little storage as possible.
Consider an ideal hashing algorithm which returns the hash as a series of hexadecimal digits (like an MD5 hash).
As far as I understand the idea, this means that I need the hash to be no less and no more than 8 symbols in length, because such a hash would be capable of distinguishing 4+ billion (16 * 16 * 16 * 16 * 16 * 16 * 16 * 16) distinct strings.
So I'd like to know whether it is safe to cut the hash to a certain length to save space?
(hashes, of course, should not collide)
Yes/No/Maybe - I would appreciate answers with explanations or links to related studies.
P.S. - I know I can test whether an 8-character hash would be OK for storing 2 billion strings, but I would need to compare 2 billion hashes with their 2 billion cut versions. That doesn't seem trivial to me, so I'd better ask before I do it.
The hash is a number, not a string of hexadecimal characters. In the case of MD5, it is 128 bits, or 16 bytes when stored in an efficient binary form. If your problem still applies, you certainly can consider truncating the number (by either coercing it into a smaller word or bit-shifting it first). Good hash algorithms distribute entropy evenly across all bits.
Addendum:
Generally, whenever you deal with hashes, you want to check whether the strings really match. That takes care of the possibility of colliding hashes. The more you cut the hash, the more collisions you're going to get, so it's good to plan for that at this stage.
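If you do cut the hash, a compact way is to keep a fixed-size prefix of the raw binary digest; a sketch with MD5, keeping 4 bytes (the same 32 bits as the 8 hex characters discussed in the question):

```python
import hashlib

def truncated_hash(s: str, nbytes: int = 4) -> bytes:
    """Return the first `nbytes` bytes of the MD5 digest of `s`."""
    return hashlib.md5(s.encode("utf-8")).digest()[:nbytes]

print(truncated_hash("some example string").hex())   # 8 hex chars = 4 bytes = 32 bits
```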
Whether or not it's safe to store x values in a hash domain only capable of representing 2x distinct hash values depends entirely on whether you can tolerate collisions.
Hash functions are effectively random number generators, so your 2 billion calculated hash values will be distributed evenly about the 4 billion possible results. This means that you are subject to the Birthday Problem.
In your case, if you calculate 2^31 (2 billion) hashes with only 2^32 (4 billion) possible hash values, the chance of at least two having the same hash (a collision) is very, very nearly 100%. (And the chance of three being the same is also very, very nearly 100%. And so on.) I can't find the formula for calculating the probable number of collisions based on these numbers, but I suspect it is a huge number.
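A standard back-of-the-envelope estimate for that number: the expected count of colliding pairs among n random hashes drawn from m possible values is roughly n*(n-1)/(2*m):

```python
n = 2 ** 31          # ~2 billion hashed strings
m = 2 ** 32          # ~4 billion possible 32-bit hash values

# Expected number of colliding pairs: C(n, 2) / m  ~=  n^2 / (2m)
expected_pairs = n * (n - 1) / (2 * m)
print(f"{expected_pairs:,.0f}")      # on the order of 500 million colliding pairs
```

So with these numbers you should expect hundreds of millions of colliding pairs, not just a handful.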
If in your case hash collisions are not a disaster (such as in Java's HashMap implementation which deals with collisions by turning the hash target into a list of objects which share the same hash key, albeit at the cost of reduced performance) then maybe you can live with the certainty of a high number of collisions. But if you need uniqueness then you need either a far, far larger hash domain, or you need to assign each record a guaranteed-unique serial ID number, depending on your purposes.
Finally, note that Keccak is capable of generating any desired output length, so it makes little sense to spend CPU resources generating a long hash output only to trim it down afterwards. You should be able to tell your Keccak function to give only the number of bits you require. (Also note that a change in Keccak output length does not affect the initial output bits, so the result will be exactly the same as if you did a manual bitwise trim afterwards.)
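In Python's hashlib, the Keccak-based SHAKE functions behave exactly this way: you request however many bytes you need, and a shorter output is a prefix of a longer one:

```python
import hashlib

h = hashlib.shake_256(b"some input")

short = h.digest(8)     # 64-bit output
longer = h.digest(16)   # 128-bit output

print(short.hex())
print(longer.hex())
print(longer.startswith(short))   # True: the shorter output is a prefix of the longer one
```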

What does JenkinsHash in Hadoop guarantee?

I know that JenkinsHash produces a 32-bit integer (one of 2^32 possible values) for a given input. The documentation at this link:
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/util/JenkinsHash.html
says
Returns:
a 32-bit value. Every bit of the key affects every bit of the return value. Two keys differing by one or two bits will have totally different hash values.
So JenkinsHash can return at most 2^32 different results for its input values.
What if I have more than 2^32 values?
Will it return same result for two different values?
Thanks
As with most hash functions, yes, it may return the same hash value for different input data. The guarantee, according to the documentation you linked to, is that keys differing by one or two bits produce different hash values. As soon as they differ by three bits or more, you have no uniqueness guarantee.
The input data to the hash function may be of a larger size (have more unique input values) than the output of the hash. This trivially makes it so that duplicates must exist in the output data. Consider a hashing function that outputs an integer in the range 1-10 but takes an input in the range 1-100: it is obvious that multiple values must hash to the same value because you cannot enumerate the values 1-100 using only ten different integers. This is called the pigeonhole principle.
Any good hashing function will, however, try to distribute the output values evenly. In the 1-10 example, you can expect a good hashing function to produce a 2 approximately the same number of times as a 6.
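A quick illustration of that even spread, hashing the inputs 1-100 into ten buckets (the choice of SHA-256 here is arbitrary):

```python
import hashlib
from collections import Counter

def bucket(value: int) -> int:
    """Hash the decimal representation of `value` and reduce it to a bucket 0-9."""
    digest = hashlib.sha256(str(value).encode()).digest()
    return int.from_bytes(digest, "big") % 10

counts = Counter(bucket(v) for v in range(1, 101))
print(sorted(counts.items()))   # each bucket ends up with roughly 10 of the 100 inputs
```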
Hashing functions that guarantee uniqueness are called perfect hash functions. They map the input domain to an output domain of at least the same cardinality. A perfect hashing function for the input integers 1-100 must have at least 100 possible output values.
Note that according to Wikipedia the Jenkins hash functions are not cryptographic. This means that you should avoid them for password security and the like, but you can use the hash for somewhat even work distribution and checksums.

Checking for string matches using hashes, without double-checking the entire string

I'm trying to check if two strings are identical as quickly as possible. Can I protect myself from hash collisions without also comparing the entire string?
I've got a cache of items that are keyed by a string. I store the hash of the string, the length of the string, and the string itself. (I'm currently using djb2 to generate the hash.)
To check if an input string is a match to an item in the cache, I compute the input's hash, and compare it to the stored hash. If that matches, I compare the length of the input (which I got as a side effect of computing the hash) to the stored length. Finally, if that matches, I do a full string comparison of the input and the stored string.
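In code, the check order looks roughly like this (a simplified sketch of the cache layout; djb2 is spelled out since it's the hash mentioned above):

```python
from collections import defaultdict

def djb2(s: str) -> int:
    """djb2 string hash, reduced to 32 bits."""
    h = 5381
    for ch in s:
        h = ((h * 33) + ord(ch)) & 0xFFFFFFFF
    return h

# cache: hash value -> list of (length, stored string) entries sharing that hash
cache = defaultdict(list)

def insert(s: str) -> None:
    cache[djb2(s)].append((len(s), s))

def lookup(needle: str):
    h, n = djb2(needle), len(needle)
    for stored_len, stored_str in cache.get(h, ()):   # hash compared via the dict lookup
        if stored_len != n:                           # then the cheap length check
            continue
        if stored_str == needle:                      # full comparison only as the last step
            return stored_str
    return None
```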
Is it necessary to do that full string comparison? For example, is there a string hashing algorithm that can mathematically guarantee that no two strings of the same length will generate the same hash? If not, can an algorithm guarantee that two different strings of the same length will generate different hash codes if any of the first N characters differ?
Basically, any string comparison scheme that offers O(1) performance when the strings differ but better than O(n) performance when they match would be an improvement over what I'm doing now.
For example, is there a string hashing algorithm that can mathematically guarantee that no two strings of the same length will generate the same hash?
No, and there can't be. Think about it: the hash has a finite length, but the strings do not. Say for argument's sake that the hash is 32 bits. Can you create more than 4 billion (2^32) unique strings of the same length? Of course you can - for any reasonable string length there are vastly more possible strings than 32-bit hash values, so comparing the hashes is not enough to guarantee uniqueness. This argument scales to longer hashes.
If not, can an algorithm guarantee that two different strings of the same length will generate different hash codes if any of the first N characters differ?
Well, yes, as long as the number of bits in the hash is as great as the number of bits in the string, but that's probably not the answer you were looking for.
Some of the algorithms used for cyclic redundancy checks do offer guarantees along the lines of: if exactly one bit differs, the CRC is guaranteed to be different over a certain run length of bits. But that only works for relatively short runs.
You should be safe from collisions if you use a modern hashing function such as one of the Secure Hash Algorithm (SHA) variants.