I need to generate a unique 8 character string from a sequential integer (0, 1, 2, 3, etc).
I tried hashing the int with md5/sha256/sha512 and then shortening it to 8 characters but there are quite a lot of collisions which I want to try and avoid if possible.
I've looked into algorithms such as crc32 but the hash produced from that contains too many numbers for my liking.
Can anybody suggest an alternative method of doing what I need?
Background
In the past I've written an encoder/decoder for converting an integer to/from a string using an arbitrary alphabet; namely this one:
abcdefghjkmnopqrstuvwxyzABCDEFGHJKLMNPQRSTUVWXYZ23456789
Lookalike characters are excluded, so 1, I, i, l, O, and 0 are not present in this alphabet. This was done for user convenience, to make a value easier to read and to type out.
As mentioned above, my previous project, python-ipminify, converts a 32-bit IPv4 address to a string using an alphabet similar to the above, but excluding upper-case characters. In my current undertaking, I don't have the constraint of excluding upper-case characters.
I wrote my own Python for this project using the excellent question and answer here on how to build a URL-shortener.
I have published a stand-alone example of the logic here as a Gist.
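For readers without access to the Gist, here is a minimal sketch of an encoder/decoder of this kind. It is not the author's published code; the function names `encode_int` and `decode_int` are made up for illustration, and only the alphabet is taken from the text above.

```python
# Alphabet copied from the text above: 56 symbols, lookalikes removed.
ALPHABET = "abcdefghjkmnopqrstuvwxyzABCDEFGHJKLMNPQRSTUVWXYZ23456789"
BASE = len(ALPHABET)
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def encode_int(n: int) -> str:
    """Write a non-negative integer in base 56 using ALPHABET as digits."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, BASE)
        digits.append(ALPHABET[rem])
    # Digits were produced least-significant first.
    return "".join(reversed(digits))


def decode_int(text: str) -> int:
    """Inverse of encode_int; raises KeyError on a character outside ALPHABET."""
    n = 0
    for ch in text:
        n = n * BASE + INDEX[ch]
    return n


if __name__ == "__main__":
    value = 2**32 - 1  # largest 32-bit value, e.g. an IPv4 address
    token = encode_int(value)
    print(token, decode_int(token) == value)
```

Because 56^6 is larger than 2^32, any 32-bit value fits in at most 6 characters with this alphabet.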
Problem
I'm now writing a performance-critical implementation of this in a compiled language, most likely Rust, but I'd need to port it to other languages as well. I'm also having to accept an arbitrary-length array of bytes, rather than an arbitrary-width integer, as is the case in Python.
I suppose that as long as I use an unsigned integer and use consistent endianness, I could treat the byte array as one long arbitrary-precision unsigned integer and do division over it, though I'm not sure how performance will scale with that. I'd hope that arbitrary-precision unsigned integer libraries would try to use vector instructions where possible, but I'm not sure how this would work when the input length does not match a specific instruction length, i.e. when the input size in bits is not evenly divisible by supported instructions, e.g. 8, 16, 32, 64, 128, 256, 512 bits.
I have also considered breaking up the byte array into 256-bit (32 byte) blocks and using SIMD instructions (I only need to support x86_64 on recent CPUs) directly to operate on larger unsigned integers, but I'm not exactly sure how to deal with size % 32 != 0 blocks; I'd probably need to zero-pad, but I'm not clear on how I would know when to do this during decoding, i.e. when I don't know the underlying length of the source value, only that of the decoded value.
Question
If I'm going the arbitrary unsigned integer width route, I'd essentially be at the mercy of the library author, which is probably fine; I'd imagine that these libraries would be fairly optimized to vectorize as much as possible.
If I try to go the block route, I'd probably zero-pad any remaining bits in the block if the input length was not divisible by the block size during encoding. However, would it even be possible to decode such a value without knowing the decoded value size?
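On the decoding-length worry: one convention that avoids needing the original size is the one Base58 encodings use, where each leading zero byte is written as one leading copy of the alphabet's first symbol. The sketch below applies that idea to the whole-array-as-one-integer route in Python; it is an illustration under that assumption, not a benchmarked design, and `encode_bytes`/`decode_bytes` are hypothetical names. A block-wise scheme would need some equivalent way of carrying the length, such as an explicit length prefix.

```python
ALPHABET = "abcdefghjkmnopqrstuvwxyzABCDEFGHJKLMNPQRSTUVWXYZ23456789"
BASE = len(ALPHABET)
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def encode_bytes(data: bytes) -> str:
    """Encode bytes as one big-endian unsigned integer in base 56."""
    # Leading zero bytes vanish once the bytes become an integer,
    # so record how many there are and emit one marker symbol per byte.
    stripped = data.lstrip(b"\x00")
    zeros = len(data) - len(stripped)
    n = int.from_bytes(stripped, "big")
    digits = []
    while n:
        n, rem = divmod(n, BASE)
        digits.append(ALPHABET[rem])
    return ALPHABET[0] * zeros + "".join(reversed(digits))


def decode_bytes(text: str) -> bytes:
    """Inverse of encode_bytes; no separate length field is required."""
    # The numeric part never starts with ALPHABET[0] (a leading zero digit),
    # so every leading ALPHABET[0] stands for one zero byte.
    stripped = text.lstrip(ALPHABET[0])
    zeros = len(text) - len(stripped)
    n = 0
    for ch in stripped:
        n = n * BASE + INDEX[ch]
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return b"\x00" * zeros + body


if __name__ == "__main__":
    sample = b"\x00\x00\x01\x02\xff"
    token = encode_bytes(sample)
    print(token, decode_bytes(token) == sample)
```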
Along the lines of How to encode integers into other integers, I am wondering if it is possible to encode one integer or a set of integers into one smaller integer or a smaller set of integers, and if so, how it is done. For example, encoding an 8 bit integer into a 4 bit integer, a 256 integer into a 16 bit integer. It doesn't seem possible but perhaps there is something along these lines. Basically, how to get a set of integers to take up less space. Not necessarily encoding into another sequence of bytes, but maybe even into a data structure that is more compact.
Sure, you can always encode them into fewer bits. However you won't be able to decode them back to the original bits. Though you neglected to mention that step, I'm guessing that's what you're looking for.
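A small illustration of why decoding fails, assuming a hypothetical "encoder" `squeeze` that simply keeps the low 4 bits of an 8-bit value. Any other 8-bit to 4-bit mapping runs into the same counting problem: 256 inputs cannot fit into 16 outputs without sharing.

```python
from collections import Counter


def squeeze(n: int) -> int:
    """Hypothetical lossy encoder: keep only the low 4 bits of an 8-bit value."""
    return n & 0x0F


# Count how many distinct 8-bit inputs land on each 4-bit output.
counts = Counter(squeeze(n) for n in range(256))

if __name__ == "__main__":
    print(len(counts), "distinct outputs for 256 inputs")
    print("inputs sharing output 5:", [n for n in range(256) if squeeze(n) == 5])
```

Every output is shared by 16 different inputs, so nothing in the 4-bit value tells a decoder which of them was the original.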
I am comparing personal info of individuals, specifically their name, birthdate, gender, and race by hashing a string containing all of this info, and comparing the hash objects' hexdigests. This produces a 32 digit hexadecimal number, which I am using as a primary key in a database. For example, using my identifying string would work like this:
>>> import hashlib
>>> id_string = "BrianPeterson08041993MW"
>>> byte_string = id_string.encode('utf-8')
>>> hash_id = hashlib.md5(byte_string).hexdigest()
>>> print(hash_id)
3b807ad8a8b3a3569f098a575091bc79
At this point, I am trying to ascertain collision risk. My understanding is that MD5 doesn't have significant collision risk, at least for strings that are relatively small, which mine are (about 20-40 characters in length). However, I am not using the 128-bit digest object, but the 32 digit hexdigest.
Now, I believe the hexdigest is a compression of the digest (that is, it's stored in fewer characters), so isn't there an increased risk of collision when comparing hexdigests? Or am I off-base?
[...]
I guess my question is: don't different representations have different chances to be non-unique based on how many units of information they use to do the representation vs. how many units of information the original message takes to encode? And if so, what is the best representation to use? Um, let me preface your next answer with: talk to me like I'm 10
Old question, but yes, you were a bit off base, so to speak.
It’s the number of random bits that matters, not the length of the presentation.
The digest is just a number, an integer, which can be written as a string using different numbers of distinct digits. For example, a 128-bit number shown in a few different radices:
"340106575100070649932820283680426757569" (base 10)
"ffde24cb47ecbff8d6e461a67c930dc1" (base 16, hexadecimal)
"7vroicmhvcnvsddp31kpu963e1" (base 32)
Shorter is nicer and more convenient (in auth tokens etc), but each representation has the exact same information and chance of collision. Shorter representations are shorter for the same reason as why "55" is shorter than "110111", while still encoding the same thing.
This answer might also clarify things, as well as toying with code like:
new BigInteger("340106575100070649932820283680426757569").toString(2)
...or something equivalent in other languages (Java/Scala above).
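The same experiment can be sketched in Python with the standard `hashlib` and `base64` modules. The input string here is an arbitrary placeholder; the point is only that every representation decodes back to the same 16 bytes.

```python
import base64
import hashlib

data = b"example input"  # arbitrary placeholder input

digest = hashlib.md5(data).digest()        # 16 raw bytes = 128 bits
hexdigest = hashlib.md5(data).hexdigest()  # the same bits as 32 hex characters

as_int = int.from_bytes(digest, "big")
base2 = format(as_int, "0128b")
base10 = str(as_int)
base32 = base64.b32encode(digest).decode("ascii").rstrip("=")  # padding dropped

# Each representation converts back to the identical digest.
from_hex = bytes.fromhex(hexdigest)
from_base2 = int(base2, 2).to_bytes(16, "big")
from_base10 = int(base10).to_bytes(16, "big")
from_base32 = base64.b32decode(base32 + "======")  # restore the padding

if __name__ == "__main__":
    for name, text in [("base 2", base2), ("base 10", base10),
                       ("base 16", hexdigest), ("base 32", base32)]:
        print(f"{name:8} {len(text):4} chars  {text}")
    print(from_hex == from_base2 == from_base10 == from_base32 == digest)
```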
On a more practical level,
[...] which I am using as a primary key in a database
I don't see why you wouldn't do away with any chance of collision by using a normal autoincremented id column (BIGINT AUTO_INCREMENT in MySQL, BIGSERIAL in PostgreSQL).
An abbreviated 32-bit hexdigest (8 hex characters) would not be long enough to effectively guarantee a collision-free database of users.
The formula for the birthday collision probability is here:
What is the probability of md5 collision if I pass in 2^32 sets of string?
Using a 32-bit key would mean that your software would start to break at around 10,000 users. The collision probability would be about 1%. It gets a lot worse very fast after that. At 100,000 users, the collision probability is 69%.
With a 64-bit key, 1 billion users is another breaking point, with a collision probability of about 2.7%; at 10 billion users it is already about 93%.
For 100 billion users (a generous upper bound of the earth's population for the foreseeable future), a 96-bit key is a little risky in my opinion: the collision chance is about 6 in 100 million. Really, you need a 128-bit key, which gives you a collision probability of about 1.5 × 10^-17.
128-bit keys are 128/4 = 32 hex characters long. If you wanted to use a shorter key for aesthetic purposes, you would need 22 alphanumeric characters (a-z, A-Z, 0-9) to exceed 128 bits. Or if you use printable characters (ASCII 32-126), you could get away with 20 characters.
So when you're talking about users, you need at least 128 bits for a collision-free random key, or a 20-32 character long string, or a 128/8 = 16 byte binary representation.
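These figures can be checked with the usual birthday approximation, p ≈ 1 - exp(-n² / (2N)) for n random keys drawn from N possible values. The sketch below assumes uniformly random keys; the helper names `collision_probability` and `chars_needed` are illustrative.

```python
import math


def collision_probability(n: int, bits: int) -> float:
    """Approximate chance of at least one collision among n random keys."""
    space = 2 ** bits
    # Birthday approximation: p ~= 1 - exp(-n^2 / (2 * space)).
    # expm1 keeps precision when the probability is tiny.
    return -math.expm1(-(n * n) / (2 * space))


def chars_needed(bits: int, alphabet_size: int) -> int:
    """Smallest length L such that alphabet_size**L covers 2**bits values."""
    length = 0
    capacity = 1
    while capacity < 2 ** bits:
        capacity *= alphabet_size
        length += 1
    return length


if __name__ == "__main__":
    for users, bits in [(10**4, 32), (10**5, 32), (10**9, 64),
                        (10**10, 64), (10**11, 96), (10**11, 128)]:
        p = collision_probability(users, bits)
        print(f"{users:>15,} users, {bits:3}-bit key: p = {p:.3g}")
    for size in (16, 62, 95):
        print(f"alphabet of {size}: {chars_needed(128, size)} characters for 128 bits")
```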
Initially, I had documents in the form of: resources:{wood:123, coal:1, silver:5} and boxes:{wood:999, coal:20}. In this example, my server's code tests (quite efficiently) if there is enough space for the wood (and it is) and enough space for the coal (and it is) and enough space for the silver (there is not, if space is 0 I don't even include it in boxes) then all is well.
I want to shorten the _id value from wood, coal, silver to a numeric representation, which in turn occupies less space, makes the packets of information smaller when communicating between the client and the server, etc.
I am curious about using 0, 1, 2...as numbers for the _id or _0, _1, _2...
What are the advantages of using Number or String? Are Numbers faster for queries? (ignoring index speed).
I am adding these values manually btw :P
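The number of bytes necessary to represent a non-negative integer is the number of times you can divide it by 256 before reaching zero (with a minimum of one byte); in other words, it grows with the logarithm of the value, not with the value itself. The number of bytes necessary to represent an ASCII string is the number of characters in the string.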
It takes only one byte to represent the numbers 87 or 202, but it takes two and three bytes to represent the same as a string (plus one more if you use the underscore).
Integers are almost certainly what you want here. However, if you're concerned about over-the-wire size, then you might see gains by shortening your keys. Rather than using wood, coal, and silver, you could use w, c, and s, saving you 11 bytes per record pulled.
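The byte counts above can be checked with a short sketch. It computes raw minimum sizes only; it assumes nothing about how a particular database or wire format stores numbers, which typically means fixed-width types plus some overhead. The helper names are illustrative.

```python
def int_bytes(n: int) -> int:
    """Minimum bytes for a non-negative integer (at least one, even for 0)."""
    return max(1, (n.bit_length() + 7) // 8)


def str_bytes(s: str) -> int:
    """Bytes needed for a string when encoded as UTF-8 (one per ASCII char)."""
    return len(s.encode("utf-8"))


if __name__ == "__main__":
    for n in (87, 202, 256, 70000):
        print(n, int_bytes(n), "byte(s) as an integer,",
              str_bytes(str(n)), "as a string,",
              str_bytes("_" + str(n)), "with an underscore")
    long_keys = ("wood", "coal", "silver")
    short_keys = ("w", "c", "s")
    saved = sum(map(str_bytes, long_keys)) - sum(map(str_bytes, short_keys))
    print("bytes saved per record by shortening the keys:", saved)
```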
I would like to store hashes for approximately 2 billion strings. For that purpose I would like to use as little storage as possible.
Consider an ideal hashing algorithm which returns hash as series of hexadecimal digits (like an md5 hash).
As far as I understand it, this means that I need the hash to be exactly 8 symbols long, because such a hash would be capable of distinguishing 4+ billion (16 * 16 * 16 * 16 * 16 * 16 * 16 * 16) distinct strings.
So I'd like to know whether it is safe to cut a hash to a certain length to save space?
(hashes, of course, should not collide)
Yes/No/Maybe - I would appreciate answers with explanations or links to related studies.
P.S. - I know I can test whether an 8-character hash would be OK for storing 2 billion strings. But I would need to compare 2 billion hashes with their 2 billion truncated versions. It doesn't seem trivial to me, so I'd better ask before I do that.
The hash is a number, not a string of hexadecimal digits (characters). In the case of MD5, it is 128 bits, or 16 bytes, when stored in its efficient binary form. If your problem still applies, you can certainly consider truncating the number (either by coercing it into a machine word or by bit-shifting it first). Good hash algorithms distribute values evenly across all bits.
Addendum:
Generally, whenever you deal with hashes, you want to check whether the strings really match. This takes care of the possibility of colliding hashes. The more you cut the hash, the more collisions you're going to get. But it's good to plan for that happening at this phase.
Whether or not it's safe to store x values in a hash domain only capable of representing 2x distinct hash values depends entirely on whether you can tolerate collisions.
Hash functions are effectively random number generators, so your 2 billion calculated hash values will be distributed evenly about the 4 billion possible results. This means that you are subject to the Birthday Problem.
In your case, if you calculate 2^31 (2 billion) hashes with only 2^32 (4 billion) possible hash values, the chance of at least two having the same hash (a collision) is very, very nearly 100%. (And the chance of three being the same is also very, very nearly 100%. And so on.) I can't find the formula for calculating the probable number of collisions based on these numbers, but I suspect it is a huge number.
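For the missing formula, a standard estimate treats the hashes as n items thrown uniformly at random into N slots. The expected number of occupied slots is N(1 - (1 - 1/N)^n), and the expected number of collisions (items landing in an already occupied slot) is n minus that. The sketch below assumes an ideal, uniformly distributed hash; `expected_collisions` is an illustrative name.

```python
import math


def expected_collisions(n: int, bits: int) -> float:
    """Expected number of items that collide with an earlier item."""
    slots = 2 ** bits
    # (1 - 1/slots)**n computed as exp(n * log1p(-1/slots)) for precision.
    occupied = -slots * math.expm1(n * math.log1p(-1.0 / slots))
    return n - occupied


if __name__ == "__main__":
    n = 2 ** 31  # about 2 billion hashes
    print(f"{expected_collisions(n, 32):,.0f} expected collisions in a 32-bit space")
    print(f"{expected_collisions(n, 64):,.3f} expected collisions in a 64-bit space")
```

For 2^31 hashes in a 2^32 space this comes to roughly 457 million collisions, so a little over a fifth of the stored strings would share a hash with an earlier one.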
If in your case hash collisions are not a disaster (such as in Java's HashMap implementation which deals with collisions by turning the hash target into a list of objects which share the same hash key, albeit at the cost of reduced performance) then maybe you can live with the certainty of a high number of collisions. But if you need uniqueness then you need either a far, far larger hash domain, or you need to assign each record a guaranteed-unique serial ID number, depending on your purposes.
Finally, note that Keccak is capable of generating any desired output length, so it makes little sense to spend CPU resources generating a long hash output only to trim it down afterwards. You should be able to tell your Keccak function to give only the number of bits you require. (Also note that a change in Keccak output length does not affect the initial output bits, so the result will be exactly the same as if you did a manual bitwise trim afterwards.)
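The prefix property can be observed with SHAKE-128, the Keccak-based extendable-output function in Python's `hashlib`, used here as a stand-in since the original Keccak variant in question is not specified. The input is an arbitrary placeholder.

```python
import hashlib

data = b"example input"  # arbitrary placeholder input

# Ask for 4 bytes (32 bits) directly, and for 32 bytes to compare against.
short = hashlib.shake_128(data).digest(4)
long = hashlib.shake_128(data).digest(32)

# The short output equals the first bytes of the longer output.
same_prefix = long[:4] == short

if __name__ == "__main__":
    print(short.hex())
    print(long.hex())
    print(same_prefix)
```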