Convert 32-char md5 string to integer - hash

What's the most efficient way to convert an md5 hash to a unique integer to perform a modulus operation?

Since the solution language was not specified, Python is used for this example.
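import os
import hashlib

YOUR_NUMBER = 1000  # placeholder: the modulus you actually need
array = os.urandom(1 << 20)  # 1 MiB of random bytes as sample input
md5 = hashlib.md5()
md5.update(array)
digest = md5.hexdigest()  # 32-character hex string
number = int(digest, 16)  # Python integers are arbitrary precision, so all 128 bits fit
print(number % YOUR_NUMBER)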

You haven't said what platform you're running on, or what the format of this hash is. Presumably it's hex, so you've got 16 bytes of information.
In order to convert that to a unique integer, you basically need a 16-byte (128-bit) integer type. Many platforms don't have such a type available natively, but you could use two long values in C# or Java, or a BigInteger in Java or .NET 4.0.
Conceptually you need to parse the hex string to bytes, and then convert the bytes into an integer (or two). The most efficient way of doing that will entirely depend on which platform you're using.
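As a rough illustration of "parse the hex string to bytes, then convert the bytes into an integer (or two)", here is a minimal Python sketch. The function names are made up for this example, and big-endian byte order is an assumption; any fixed order works as long as it is used consistently.

```python
import struct


def md5_hex_to_int(hex_digest: str) -> int:
    """Interpret a 32-character hex digest as one 128-bit integer."""
    raw = bytes.fromhex(hex_digest)  # 16 bytes
    return int.from_bytes(raw, "big")


def md5_hex_to_pair(hex_digest: str) -> tuple:
    """Split the same 16 bytes into two unsigned 64-bit integers (high, low)."""
    raw = bytes.fromhex(hex_digest)
    high, low = struct.unpack(">QQ", raw)
    return high, low
```

The pair form mirrors the "two long values" idea for platforms without a native 128-bit type: `(high << 64) | low` gives back the single integer.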

There is more data in an MD5 hash than will fit in even a 64-bit integer, so there's no way (without knowing what platform you are using) to get a unique integer. You can get a somewhat unique one by converting the hex version to several integers' worth of data and then combining them (addition or multiplication). How exactly you would go about that depends on what language you are using, though.
A lot of languages implement either an unpack or sscanf function, which are good places to start looking.

If all you need is the modulus, you don't actually need to convert it to a 128-bit integer. You can go digit by digit or byte by byte, like this:
mod = 0;
for (i = 0; i < 32; i++)
{
digit = md5[i]; // I presume you can convert the hex char to its digit value (0-15) yourself.
mod = (mod * 16 + digit) % divider;
}
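The same digit-by-digit idea can be sketched in Python. The function name `hex_mod` is invented for this example; `int(ch, 16)` does the character-to-digit conversion that the loop above leaves to the reader.

```python
def hex_mod(hex_digest: str, divider: int) -> int:
    """Compute int(hex_digest, 16) % divider without building the big integer."""
    mod = 0
    for ch in hex_digest:
        # Shift the running remainder one hex digit left, add the new digit,
        # and reduce again so the value always stays below 16 * divider.
        mod = (mod * 16 + int(ch, 16)) % divider
    return mod
```

In a language with fixed-width integers, the intermediate value `mod * 16 + digit` is the one that has to fit in the chosen type.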

You'll need to define your own hash function that converts an MD5 string into an integer of the desired width. If you want to interpret the MD5 hash as a plain string, you can try the FNV algorithm. It's pretty quick and fairly evenly distributed.
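For reference, a minimal sketch of one FNV variant (64-bit FNV-1a) applied to the MD5 string treated as plain text. The answer does not say which FNV variant or width it has in mind, so that choice is an assumption; the offset basis and prime are the published 64-bit FNV constants.

```python
FNV64_OFFSET_BASIS = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3
MASK64 = (1 << 64) - 1


def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a: XOR each byte in, then multiply by the FNV prime."""
    h = FNV64_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & MASK64  # keep the result within 64 bits
    return h


def md5_string_to_int(md5_hex: str) -> int:
    """Hash the 32-character MD5 string itself down to 64 bits."""
    return fnv1a_64(md5_hex.encode("ascii"))
```

Note that this maps 128 bits onto 64, so the result is no longer guaranteed unique per MD5 value.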

Related

CRC32 integer hash to string

I was looking for a Lua implementation of CRC32 and stumbled upon this:
https://github.com/openresty/lua-nginx-module/blob/master/t/lib/CRC32.lua
However it returns the integer hash, how would I go about getting the string equivalent of it?
Using the input "something" it returns: 1850105976
Using an online CRC32 generator I get: "879fb991"
There are many CRC-32 algorithms. You can find ten different CRC-32s documented in this catalog. The Lua code you found and the online CRC32 you found (somewhere -- no link was provided) are different CRC-32s.
What you seem to mean by a "string equivalent" is the hexadecimal representation of the 32-bit integer. In Lua you can use string.format with the print format %x to get hexadecimal. For the example you gave, 1850105976, that would be 6e466078.
Your "online CRC32 generator" appears to be using the BZIP2 CRC-32, though it is showing you the bytes of the resulting CRC in reversed order (little-endian). So the actual CRC in that case in hexadecimal is 91b99f87. The Lua code you found appears to be using the MPEG-2 CRC-32. The only difference between those is the exclusive-or with ffffffff. So in fact the exclusive-or of the two CRCs you got from the two different sources, 6e466078 ^ 91b99f87 is ffffffff.
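The arithmetic in this answer can be checked with a short Python sketch (Python rather than Lua; the helper names are invented for the example):

```python
def to_hex32(value: int) -> str:
    """Format a 32-bit integer as 8 lowercase hex digits."""
    return format(value & 0xFFFFFFFF, "08x")


def reverse_hex_bytes(hex_string: str) -> str:
    """Reverse the byte order of a hex string, e.g. little-endian to big-endian."""
    return bytes.fromhex(hex_string)[::-1].hex()


lua_crc = 1850105976                             # value returned by the Lua code
online_crc = reverse_hex_bytes("879fb991")       # undo the reversed byte order
xor_of_both = lua_crc ^ int(online_crc, 16)

print(to_hex32(lua_crc))      # 6e466078
print(online_crc)             # 91b99f87
print(to_hex32(xor_of_both))  # ffffffff
```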

Collision-proof hash-like identificator

I need to generate a 6-character ID (letters and digits) to identify a SaaS workspace (unique per user). Of course I could just use numbers, but the ID shouldn't reveal the real workspace number to the end user.
So even for id 1 it should be 6 characters long, something like fX8gz6, and fully decodable to 1 or 000001 or something that I can parse back to the real workspace id. And of course it has to be collision-proof.
What would be the best approach for that?
This is similar to what Amazon uses for its cloud assets, though it uses 8 characters. In fact 8 characters is a natural fit, because it is the output length of Base64-encoding 6 binary bytes.
The original question asked for 6 characters, but assuming you have the flexibility to use 8, here is a possible scheme:
Number your assets with an unsigned 32-bit integer, possibly in auto-increment fashion. Call it real-id. Use this real-id for all your internal purposes.
When you need to display it, follow something like this:
Convert your integer to 4 binary bytes. Most languages have a library to extract the bytes out of integers and vice versa. Call it real-id-bytes.
Take a two-byte random number. Again, you can use libraries to generate an exact 16-bit random number. You can use a cryptographic random number generator for better results, or the plain rand is just fine. Call it rand-bytes.
Obtain the 6-byte display-id-bytes = array-concat(rand-bytes, real-id-bytes).
Obtain display-id = Base64(display-id-bytes). This is exactly 8 characters long and has a mix of lowercase, uppercase and digits (the standard Base64 alphabet can also produce + and /).
Now you have a seemingly random 8 character display-id which can be mapped to the real-id. To convert back:
Take the 8 character display-id
display-id-bytes= Base64Decode(display-id)
real-id-bytes= Discard-the-2-random-bytes-from(display-id-bytes)
real-id= fromBytesToInt32(real-id-bytes)
Simple. Now if you really cannot go for an 8-character display-id, then you have to develop some custom Base64-like algorithm. You might also restrict yourself to only 1 random byte. Also note that this is just an encoding scheme, NOT an encryption scheme, so anyone who knows your scheme can decode the ID. You need to decide whether that is acceptable or not. If not, then I guess you have to do some form of encryption. Whatever that is, 6 characters will surely be far from sufficient.
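A minimal Python sketch of the 8-character scheme, under a few assumptions that are not in the answer: the function names are invented, the real-id is packed big-endian, and the URL-safe Base64 alphabet is used so the ID contains - and _ instead of + and /.

```python
import base64
import secrets
import struct


def encode_display_id(real_id: int) -> str:
    """2 random bytes + 4-byte big-endian real-id, Base64-encoded to 8 characters."""
    rand_bytes = secrets.token_bytes(2)
    real_id_bytes = struct.pack(">I", real_id)  # raises struct.error outside 0..2**32-1
    display_id_bytes = rand_bytes + real_id_bytes
    return base64.urlsafe_b64encode(display_id_bytes).decode("ascii")


def decode_display_id(display_id: str) -> int:
    """Reverse the encoding: drop the 2 random bytes and unpack the real-id."""
    display_id_bytes = base64.urlsafe_b64decode(display_id)
    if len(display_id_bytes) != 6:
        raise ValueError("display-id must decode to exactly 6 bytes")
    return struct.unpack(">I", display_id_bytes[2:])[0]
```

Because 6 bytes is a multiple of 3, the Base64 output carries no `=` padding. The same real-id yields a different display-id on every call, so if the displayed ID must stay stable it has to be generated once and stored.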

Convert unknown symbols to cyrillic

I have this kind of symbols in db table (Наиме) , and I don't know who inserted this data to table.Is there any way to convert them to cyrillic ?
Yes, you can do the conversion. Since you haven't mentioned any language, here is the logic:
1. Assuming the string length is even, take two adjacent characters.
2. Combine the underlying byte values of the two characters into a 16-bit value. This is the multi-byte encoding of one Cyrillic character, which you can decode to its proper representation with a suitable decoding such as UTF-8.
3. Repeat points 1 and 2 for the next two characters until the end of the string.
You can implement this in any language of your choice.
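In Python the whole procedure collapses into one re-encode and decode step. This is a sketch that assumes the stored text is UTF-8 that was misread as Latin-1 (ISO-8859-1), which matches the look of the sample but is not confirmed by the question; the function name is invented.

```python
def fix_mojibake(garbled: str) -> str:
    """Turn each character back into its original byte, then decode as UTF-8."""
    original_bytes = garbled.encode("latin-1")  # one character -> one byte
    return original_bytes.decode("utf-8")       # byte pairs -> Cyrillic letters
```

If the data was misread through a different single-byte code page (for example Windows-1252), substitute that codec name in the `encode` call. Both steps raise an exception on input that does not fit the assumption, which is a useful signal that the guess about the encoding is wrong.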

Looking for a good 64 bit hash for file paths in UTF16

I have a Unicode / UTF-16 encoded path. The path delimiter is U+005C '\'.
The paths are null-terminated root relative windows file system paths, e.g. "\windows\system32\drivers\myDriver32.sys"
I want to hash this path into a 64-bit unsigned integer.
It does not need to be "cryptographically sound".
The hashes should be case insensitive, but able to handle non-ascii letters.
Obviously, the hash also should scatter well.
There are some ideas that I had thought of:
A) Using the Windows file identifier as a "hash". In my case I do want the hash to change if the file gets moved, so this is not an option.
B) Just use a regular string hash: hash += prime * hash + codepoint for the whole string.
I do have the feeling that the fact that the path consists of "segments" (folder names and the final file name) can be leveraged.
To sum up the needs:
1) 64bit hash
2) good distribution / few collisions for file system paths.
3) efficient
4) does not need to be secure
5) case insensitive
I would just use something straightforward. I don't know what language you are using, so the following is pseudocode:
ui64 res = 10000019;
for(i = 0; i < len; i += 2)
{
ui64 merge = ucase(path[i]) * 65536 + ucase(path[i + 1]);
res = res * 8191 + merge; // unchecked arithmetic
}
return res;
I'm assuming that path[i + 1] is safe on the basis that if len is odd then in the last case it will read the U+0000 safely.
I wouldn't try to exploit the gaps in the value space (the gaps in UTF-16, lower-case and title-case characters that never occur after upper-casing, and characters invalid in paths), because they are not distributed in a way that could be exploited speedily. Subtracting 32 from each value (all chars below U+0020, i.e. decimal 32, are invalid in path names) wouldn't be too expensive, but it wouldn't improve the hashing much either.
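A runnable Python sketch of the pseudocode above, with the "unchecked arithmetic" made explicit by masking to 64 bits. The function name is invented, and `str.upper()` stands in for `ucase`; it is an assumption that this is close enough to the case folding Windows applies, and it can change the length of some strings.

```python
import struct

MASK64 = (1 << 64) - 1


def path_hash64(path: str) -> int:
    """Polynomial hash over pairs of upper-cased UTF-16 code units."""
    data = path.upper().encode("utf-16-le")
    units = list(struct.unpack("<%dH" % (len(data) // 2), data))
    if len(units) % 2:
        units.append(0)  # plays the role of the terminating U+0000
    res = 10000019
    for i in range(0, len(units), 2):
        merge = units[i] * 65536 + units[i + 1]
        res = (res * 8191 + merge) & MASK64  # wrap around like unsigned 64-bit math
    return res
```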
Cryptographically secure hashes might not be very efficient in terms of speed, but there are implementations available for virtually any programming language.
Whether using them is feasible for your application depends on how much you depend on speed – a benchmark would give you an appropriate answer to that.
You could use a sub-string of such a hash, e.g. MD5 of your path, converted to lower case beforehand so that the hash is effectively case-insensitive (this requires a lower-casing method that knows how to convert all the non-ASCII UTF-16 characters that may occur in the file system).
Cryptographically secure hashes have the benefit of quite even distribution no matter which sub-string part you take because they are designed to be non-predictable, i.e. each part of the hash ideally depends on the entire hashed data as any other part of it.
Even if you do not need a cryptographic hash, you can still use one, and since your problem is not about security, a "broken" cryptographic hash would be fine. I suggest MD4, which is quite fast. On my PC (a 2.4 GHz Core2 system, using a single core), MD4 hashes more than 700 MB/s, and even for small inputs (less than 50 bytes) it can process about 8 million messages per second. You may find faster non-cryptographic hashes, but it already takes a rather specific situation for it to make a measurable difference.
For the specific properties you are after, you would need:
To "normalize" characters so that uppercase letters are converted to lowercase (for case insensitivity). Note that, generally speaking, case-insensitivity in the Unicode world is not an easy task. From what you explain, I gather that you are only after the same kind of case-insensitivity that Windows uses for file accesses (I think that it is ASCII-only, so conversion uppercase->lowercase is simple).
To truncate the output of MD4. MD4 produces 128 bits; just use the first 64 bits. This will be as scattered as you could wish for.
There are MD4 implementations available in many places, including right in the RFC 1320 I link to above. You may also find open-source MD4 implementations in C and Java in sphlib.
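The normalize-then-truncate recipe can be sketched in Python as follows. Two substitutions are assumptions of this example, not part of the answer: MD5 is used in place of MD4, because `hashlib` does not guarantee that MD4 is available on every build, and `str.lower()` is used for case normalization. The function name is invented.

```python
import hashlib


def path_hash64_md5(path: str) -> int:
    """Lower-case the path, hash its UTF-16 bytes, keep the first 64 bits."""
    normalized = path.lower().encode("utf-16-le")
    digest = hashlib.md5(normalized).digest()  # 16 bytes
    return int.from_bytes(digest[:8], "big")   # truncate to 64 bits
```

Which 8 of the 16 bytes are kept, and in which byte order they are read, does not matter for distribution as long as the choice is fixed.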
You could just create a shared library in C# and use the FileInfo class to obtain the full path of a directory or file. Then call .GetHashCode() on the path, like this:
Hash = fullPath.GetHashCode();
or
int getHashCode(string uri)
{
if (uri == null) throw new ArgumentNullException(nameof(uri));
FileInfo fileInfo = new FileInfo(uri);
return fileInfo.FullName.GetHashCode();
}
Although this is just a 32-bit code, you can duplicate it or append another hash code based on some other characteristic of the file.

What are the limitations of primitive character types in D?

I am currently exploring the specification of the Digital Mars D language, and am having a little trouble understanding the complete nature of the primitive character types. The book Learn to Tango With D is similarly vague on the capabilities and limitations of the language in this area.
The types are given on the website as:
char; // unsigned 8 bit UTF-8
wchar; // unsigned 16 bit UTF-16
dchar; // unsigned 32 bit UTF-32
Since we know that most of the Unicode Transformation (UTF) Format encodings represent characters with a variable bit-width, does this mean that a char in D can only contain the values that will fit in 8 bits, or does it expand in the machine's physical memory when you give it double byte characters? Perhaps there is some other possibility, like automatic casting into the next most appropriate type as you overload the variable?
Let's say, for example, I want to use the UTF-8 char in an editor and type in Chinese. Will it simply fall over, or is it able to deal with Unicode characters more 'correctly', like in C#? Would it still be necessary to provide glue code to allow working with any language supported by Unicode?
I'd appreciate any specific information you can offer on how these types work under the covers, and any general best practices advice on dealing with their limitations.
A single char or wchar represents a UTF code unit. This means that, on its own, a char can either represent an ASCII symbol (0-127) or be part of a UTF-8 sequence representing a Unicode character (code point). Only the dchar type can represent an entire Unicode character, because there are more than 65536 code points in Unicode.
Casting one string type to another (string, wstring and dstring, which are simply dynamic arrays of the character types) will not automatically convert the contents to the respective UTF representation. In order to do this, you must use the functions toUTF8, toUTF16 and toUTF32 from std.utf (or toString / toString16 / toString32 from tango.text.convert.Utf if you use Tango).
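To make the code unit versus code point distinction concrete, here is a small Python sketch (Python rather than D, purely as an illustration; the function name is invented). It counts how many 8-bit, 16-bit and 32-bit units a piece of text needs, which corresponds to how many char, wchar and dchar elements the same text would occupy.

```python
def code_units(text: str) -> dict:
    """Count the UTF-8, UTF-16 and UTF-32 code units needed for the text."""
    return {
        "utf8": len(text.encode("utf-8")),            # 1 byte per unit
        "utf16": len(text.encode("utf-16-le")) // 2,  # 2 bytes per unit
        "utf32": len(text.encode("utf-32-le")) // 4,  # 4 bytes per unit
    }


print(code_units("A"))   # {'utf8': 1, 'utf16': 1, 'utf32': 1}
print(code_units("中"))  # {'utf8': 3, 'utf16': 1, 'utf32': 1}
print(code_units("😀"))  # {'utf8': 4, 'utf16': 2, 'utf32': 1}
```

A Chinese character therefore never fits in one 8-bit unit, and characters outside the Basic Multilingual Plane do not fit in one 16-bit unit either.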
Users have implemented string classes which will automatically use the most memory-efficient representation that can map each character to a single code unit. This allows quick slicing and indexing with a minimal memory overhead. One such implementation is mtext by Christopher E. Miller.
Further reading:
the String handling section in Wikipedia's entry on D
Text in D, by Daniel Keep