Calculating partial hash collisions in OpenCL - hash

I would like to find 2 SHA-256 hashes of 2 strings (both of which start with "helloworld" and then have a number of random ascii characters following) where the first n characters of the hashes match, with n being as large as possible.
For example:
String 1 = helloworld\V.T ao>
String 2 = helloworld EF{B -QMl
Hash 1 = JRFqsbBDZBUx9Ot0LviMEr6rAdKmUai/kx8HD0EskxE=
Hash 2 = JRFnMO6jm0hzdZ+jYZybNl9yVnPl9g5Y0vlz0Rf/6UE=
Here the first three characters of the two hashes ("JRF") match.
Currently I am using Java and MessageDigest for this, which is slow. I was thinking that using my GPU with OpenCL could make the program run much faster; however, I know nothing about OpenCL or how I would go about coding something like this.
Does anyone know of an existing tool that could do this, or maybe some code?
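Before moving to a GPU, it may help to pin down the search itself. Below is a minimal CPU-side sketch in Python, assuming Base64-encoded SHA-256 digests as in the example above. The helper names, the counter-based suffixes (used instead of random characters) and the dictionary lookup are illustrative assumptions, not an existing tool:

```python
import base64
import hashlib


def b64_sha256(text: str) -> str:
    """Base64 text of the SHA-256 digest of `text` (UTF-8 encoded)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")


def common_prefix_len(a: str, b: str) -> int:
    """Number of leading characters that `a` and `b` share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def find_partial_collision(prefix_chars: int, stem: str = "helloworld"):
    """Birthday-style search: remember every hash prefix seen so far and
    stop as soon as a new candidate lands on an already-seen prefix."""
    seen = {}  # hash prefix -> input string that produced it
    counter = 0
    while True:
        candidate = f"{stem}{counter}"
        key = b64_sha256(candidate)[:prefix_chars]
        if key in seen:
            return seen[key], candidate
        seen[key] = candidate
        counter += 1
```

Each Base64 character carries 6 bits, so matching n characters this way takes on the order of 2^(3n) hashes, and the table of seen prefixes grows at the same rate. Memory, not only hashing speed, becomes a limit for large n.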

Related

Collision-proof hash-like identificator

I need to generate a 6-character id (letters and digits) to identify a SaaS workspace (unique per user). Of course I could just use numbers, but the id shouldn't reveal the real workspace number to the end user.
So even for id 1 it should be 6 characters long, something like fX8gz6, and fully decodable to 1 or 000001 or something that I can parse back to the real workspace id. And of course it has to be collision-proof.
What would be the best approach for that?
This is similar to what Amazon uses for its cloud assets, although it uses 8 characters. Eight characters is actually a convenient length, because it is exactly the output of Base64-encoding 6 binary bytes.
Your question asks for 6 characters, but assuming you have the flexibility to use 8, here is a possible scheme:
Number your assets with an unsigned Int32, possibly in auto-increment fashion. Call it real-id. Use this real-id for all your internal purposes.
When you need to display it, follow something like this:
Convert your integer to 4 binary bytes. Every language has a library to extract the bytes from an integer and vice versa. Call the result real-id-bytes.
Take a two-byte random number. Again, you can use libraries to generate exactly 16 random bits. A cryptographic random number generator gives a better result, but plain rand is fine. Call the result rand-bytes.
Obtain the 6-byte display-id-bytes = array-concat(rand-bytes, real-id-bytes)
Obtain display-id = Base64(display-id-bytes). This is exactly 8 characters long and is a mix of lowercase letters, uppercase letters and digits (standard Base64 can also emit '+' and '/'; a URL-safe alphabet replaces them with '-' and '_').
Now you have a seemingly random 8 character display-id which can be mapped to the real-id. To convert back:
Take the 8 character display-id
display-id-bytes= Base64Decode(display-id)
real-id-bytes= Discard-the-2-random-bytes-from(display-id-bytes)
real-id= fromBytesToInt32(real-id-bytes)
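The encode and decode steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the real-id is packed big-endian, and the URL-safe Base64 alphabet is swapped in for the standard one so the id never contains '+' or '/':

```python
import base64
import os
import struct


def encode_display_id(real_id: int) -> str:
    """2 random bytes + 4-byte big-endian real-id -> 8 Base64 characters."""
    rand_bytes = os.urandom(2)                   # 16 random bits
    real_id_bytes = struct.pack(">I", real_id)   # unsigned 32-bit integer
    display_id_bytes = rand_bytes + real_id_bytes
    return base64.urlsafe_b64encode(display_id_bytes).decode("ascii")


def decode_display_id(display_id: str) -> int:
    """Reverse of encode_display_id: discard the 2 random bytes."""
    display_id_bytes = base64.urlsafe_b64decode(display_id)
    real_id_bytes = display_id_bytes[2:]
    return struct.unpack(">I", real_id_bytes)[0]
```

Because 6 bytes is a multiple of 3, the Base64 output needs no '=' padding, which is why the result is always exactly 8 characters. The same real-id yields a different display-id on every call, yet all of them decode to the same value.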
Simple. If you really cannot use an 8-character display-id, you will have to develop a custom Base64-like algorithm. You could also restrict yourself to a single random byte. Note that this is just an encoding scheme, NOT an encryption scheme, so anyone who knows your scheme can decode the id. You need to decide whether that is acceptable. If it is not, you will have to use some form of encryption, and whatever form that takes, 6 characters will be far from sufficient.

Building an ngram frequency table and dealing with multibyte runes

I am currently learning Go and am making a lot of progress. One way I do this is to port past projects and prototypes from a prior language to a new one.
Right now I am busying myself with a "language detector" I prototyped in Python a while ago. In this module, I generate an ngram frequency table and then calculate the difference between a given text and known corpora.
This allows one to effectively determine which corpus is the best match by returning the cosine of two vector representations of the given ngram tables. Yay. Math.
I have a prototype written in Go that works perfectly with plain ASCII characters, but I would very much like to have it working with multibyte Unicode text. This is where I'm doing my head in.
Here is a quick example of what I'm dealing with: http://play.golang.org/p/2bnAjZX3r0
I've only posted the table generating logic since everything already works just fine.
As you can see by running the snippet, the first text works quite well and builds an accurate table. The second text, which is German, has a few double-byte characters in it. Because of the way I am building the ngram sequence, and because these specific runes are made of two bytes, 2 ngrams appear in which the first byte is cut off.
Could someone perhaps post a more efficient solution or, at the very least, guide me through a fix? I'm almost positive I am overanalysing this problem.
I plan on open sourcing this package and implementing it as a service using Martini, thus providing a simple API people can use for simple linguistic computation.
As ever, thanks!
If I understand correctly, you want chars in your Parse function to hold the last n characters in the string. Since you're interested in Unicode characters rather than their UTF-8 representation, you might find it easier to manage it as a []rune slice, and only convert back to a string when you have your ngram ready to add to the table. This way you don't need to special case non-ASCII characters in your logic.
Here is a simple modification to your sample program that does the above: http://play.golang.org/p/QMYoSlaGSv
By keeping a circular buffer of runes, you can minimise allocations. Also note that reading a new key from a map returns the zero value (which for int is 0), which means the unknown key check in your code is redundant.
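// Parse builds an ngram frequency table over runes (needs: import "strings").
func Parse(text string, n int) map[string]int {
    // Every rune is stored twice, n slots apart, so that
    // chars[k:k+n] is always a contiguous view of the last n runes.
    chars := make([]rune, 2*n)
    table := make(map[string]int)
    k := 0
    // Collapse whitespace runs to single spaces and append a trailing space.
    for _, chars[k] = range strings.Join(strings.Fields(text), " ") + " " {
        chars[n+k] = chars[k]
        k = (k + 1) % n
        // Until n runes have been read, the window still contains zero runes.
        table[string(chars[k:k+n])]++
    }
    return table
}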
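For comparison with the Python prototype mentioned in the question, here is a rough sketch of the same table in Python, where a str is already indexed by code point rather than by byte. The function name is hypothetical, and unlike the Go code above it counts only full windows of n characters:

```python
from collections import Counter


def ngram_table(text: str, n: int) -> Counter:
    """Count n-grams over code points after collapsing runs of whitespace."""
    normalized = " ".join(text.split()) + " "
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


# The German "ü" is one code point but two UTF-8 bytes, which is exactly
# the case that breaks byte-based slicing.
assert len("ü") == 1
assert len("ü".encode("utf-8")) == 2
```

Slicing the str keeps "ü" intact inside every ngram, which is the behaviour the []rune buffer restores on the Go side.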

Looking for a good 64 bit hash for file paths in UTF16

I have a Unicode / UTF-16 encoded path. The path delimiter is U+005C '\'.
The paths are null-terminated root relative windows file system paths, e.g. "\windows\system32\drivers\myDriver32.sys"
I want to hash this path into a 64-bit unsigned integer.
It does not need to be "cryptographically sound".
The hashes should be case insensitive, but able to handle non-ascii letters.
Obviously, the hash also should scatter well.
There are some ideas that I had thought of:
A) Using the Windows file identifier as a "hash". In my case I do want the hash to change if the file gets moved, so this is not an option.
B) Just use a regular string hash: hash = prime * hash + codepoint for the whole string.
I do have the feeling that the fact that the path consists of "segments" (folder names and the final file name) can be leveraged.
To sum up the needs:
1) 64bit hash
2) good distribution / few collisions for file system paths.
3) efficient
4) does not need to be secure
5) case insensitive
I would just use something straightforward. I don't know what language you are using, so the following is pseudocode:
ui64 res = 10000019;
for (i = 0; i < len; i += 2)
{
    ui64 merge = ucase(path[i]) * 65536 + ucase(path[i + 1]);
    res = res * 8191 + merge; // unchecked arithmetic
}
return res;
I'm assuming that path[i + 1] is safe on the basis that if len is odd then in the last case it will read the U+0000 safely.
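A rough Python rendering of the pseudocode above, for experimentation. It rests on a few assumptions: `str.upper()` stands in for ucase, the path is split into UTF-16 code units explicitly, an odd number of units is padded with a zero unit (playing the role of the terminating U+0000), and the mask emulates unchecked 64-bit arithmetic because Python integers never overflow. The function name is hypothetical:

```python
import struct

MASK64 = (1 << 64) - 1


def path_hash64(path: str) -> int:
    """Case-insensitive 64-bit hash over pairs of UTF-16 code units."""
    data = path.upper().encode("utf-16-le")
    units = list(struct.unpack("<%dH" % (len(data) // 2), data))
    if len(units) % 2:
        units.append(0)  # stands in for the null terminator
    res = 10000019
    for i in range(0, len(units), 2):
        merge = units[i] * 65536 + units[i + 1]
        res = (res * 8191 + merge) & MASK64  # wrap around at 64 bits
    return res
```

Note that `str.upper()` can change the length of a string (for example "ß" becomes "SS"), so it does not necessarily match the case folding that the file system applies.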
I wouldn't try to exploit the gaps in the input space (the gaps in UTF-16, the lower-case and title-case characters that upper-casing removes, and the characters that are invalid in paths), because they are not distributed in a way that can be exploited quickly. Subtracting 32 from each code unit (all characters below U+0020, i.e. decimal 32, are invalid in path names) wouldn't be too expensive, but it wouldn't improve the hashing much either.
Cryptographically secure hashes might not be very efficient in terms of speed, but there are implementations available for virtually any programming language.
Whether using them is feasible for your application depends on how much you depend on speed – a benchmark would give you an appropriate answer to that.
You could use a sub-string of such a hash, e.g. MD5 of your path, after converting the path to lower case so that the hash is effectively case-insensitive (this requires a lower-casing method that knows how to convert all non-ASCII characters that may occur in the file system).
Cryptographically secure hashes have the benefit of a quite even distribution no matter which sub-string you take, because they are designed to be unpredictable, i.e. each part of the hash ideally depends on the entire hashed data as much as any other part does.
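A minimal sketch of this approach in Python, assuming MD5 from the standard hashlib module, `str.lower()` as the lower-casing step, and the first 8 bytes of the digest as the 64-bit value. The function name is hypothetical:

```python
import hashlib


def path_hash64_md5(path: str) -> int:
    """First 64 bits of MD5 over the lower-cased, UTF-16LE encoded path."""
    digest = hashlib.md5(path.lower().encode("utf-16-le")).digest()
    return int.from_bytes(digest[:8], "big")
```

Any other 8-byte slice of the digest would serve equally well; taking the first 8 bytes is just the simplest choice.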
Even if you do not need a cryptographic hash, you can still use one, and since your problem is not about security, a "broken" cryptographic hash would be fine. I suggest MD4, which is quite fast. On my PC (a 2.4 GHz Core2 system, using a single core), MD4 hashes more than 700 MB/s, and even for small inputs (less than 50 bytes) it can process about 8 million messages per second. You may find faster non-cryptographic hashes, but it takes a rather specific situation for the difference to be measurable.
For the specific properties you are after, you would need:
To "normalize" characters so that uppercase letters are converted to lowercase (for case insensitivity). Note that, generally speaking, case-insensitivity in the Unicode world is not an easy task. From what you explain, I gather that you are only after the same kind of case-insensitivity that Windows uses for file accesses (I think that it is ASCII-only, so conversion uppercase->lowercase is simple).
To truncate the output of MD4. MD4 produces 128 bits; just use the first 64 bits. This will be as scattered as you could wish for.
There are MD4 implementations available in many places, including in RFC 1320 itself, which specifies MD4. You may also find open-source MD4 implementations in C and Java in sphlib.
You could just create a shared library in C# and use the FileInfo class to obtain the full path of a directory or file. Then call .GetHashCode() on the path, like this:
Hash = fullPath.GetHashCode();
or
int getHashCode(string uri)
{
    if (uri == null) throw new ArgumentNullException(nameof(uri));
    FileInfo fileInfo = new FileInfo(uri);
    return fileInfo.FullName.GetHashCode();
}
Although this is only a 32-bit code, you can duplicate it or append another hash code based on some other characteristic of the file to reach 64 bits.

Doing a hash by hand/mathematically

I want to learn how to do a hash by hand (like with paper and pencil). Is this feasible? Any pointers on where to learn about this would be appreciated.
That depends on the hash you want to do. You can do a really simple hash by hand pretty easily -- for example, one trivial one is to take the ASCII values of the string, and add them together, typically doing something like a left-rotate between characters. So, to hash the string "Hash", we'd start with the ASCII values of the letters (in hex): 48 61 73 68. We'll add those together, rotating our result left 4 bits (in a 16-bit word) between letters:
0048 + 0061 = 00A9
00A9 <<< 4 = 0A90
0A90 + 0073 = 0B03
0B03 <<< 4 = B030
B030 + 0068 = B098
Result: B098
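To check a hand calculation like this one, the same steps can be sketched in Python. The function names are hypothetical, and the sketch mirrors the worked example exactly: start with the first letter, then add each following letter, rotating the 16-bit result left by 4 bits between additions but not after the last one:

```python
def rotl16(value: int, bits: int) -> int:
    """Rotate a 16-bit value left by `bits` positions."""
    return ((value << bits) | (value >> (16 - bits))) & 0xFFFF


def toy_hash(text: str) -> int:
    """Add-and-rotate hash over the ASCII codes of `text`."""
    codes = text.encode("ascii")
    if not codes:
        return 0
    h = codes[0]
    for i in range(1, len(codes)):
        h = (h + codes[i]) & 0xFFFF      # add the next letter
        if i < len(codes) - 1:
            h = rotl16(h, 4)             # rotate between letters only
    return h


print(format(toy_hash("Hash"), "04X"))  # prints B098
```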
Doing a cryptographic hash by hand would be a rather different story. It's certainly still possible, but would be extremely tedious, to put it mildly. A cryptographic hash is typically quite a bit more complex, and (more importantly) almost always has a lot of "rounds", meaning that you basically repeat a set of steps a number of times to get from the input to the output. Speaking from experience, just stepping through SHA-1 in a debugger to be sure you've implemented it correctly is a pain -- doing it all by hand would be pretty awful (but as I said, certainly possible anyway).
You can start by looking at: Hash function
I would suggest trying a CRC, since it seems to me to be the easiest to do by hand: https://en.wikipedia.org/wiki/CRC32#Computation .
You can do a smaller length than standard (it's usually 32 bit) to make things easier.
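For checking pencil-and-paper work, a bit-by-bit CRC is short enough to write out. The sketch below is the common reflected CRC-32 (polynomial 0xEDB88320, initial value and final XOR of 0xFFFFFFFF), which is an assumption about which variant you want; a shorter polynomial follows the same shift-and-XOR pattern with fewer bits to track:

```python
def crc32_bitwise(data: bytes) -> int:
    """Reflected CRC-32, processing one bit at a time."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                # Low bit set: shift right and XOR in the polynomial.
                crc = (crc >> 1) ^ 0xEDB88320
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF


print(format(crc32_bitwise(b"123456789"), "08X"))  # prints CBF43926
```

The inner loop is exactly what one would do by hand: look at the lowest bit, shift, and XOR with the polynomial when that bit was 1.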

Convert 32-char md5 string to integer

What's the most efficient way to convert an md5 hash to a unique integer to perform a modulus operation?
Since the solution language was not specified, Python is used for this example.
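import hashlib
import os

MODULUS = 1000  # hypothetical divisor; substitute your own

array = os.urandom(1 << 20)  # 1 MiB of random bytes as sample input
md5 = hashlib.md5()
md5.update(array)
digest = md5.hexdigest()  # 32 hex characters
number = int(digest, 16)  # Python integers have arbitrary precision
print(number % MODULUS)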
You haven't said what platform you're running on, or what the format of this hash is. Presumably it's hex, so you've got 16 bytes of information.
In order to convert that to a unique integer, you basically need a 16-byte (128-bit) integer type. Many platforms don't have such a type available natively, but you could use two long values in C# or Java, or a BigInteger in Java or .NET 4.0.
Conceptually you need to parse the hex string to bytes, and then convert the bytes into an integer (or two). The most efficient way of doing that will entirely depend on which platform you're using.
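As a sketch of that parse-then-convert step, here is one way it might look in Python, which has arbitrary-precision integers. The split into two 64-bit halves is shown only to illustrate what a platform limited to 64-bit integers would store; the function name and the big-endian byte order are assumptions:

```python
def md5_hex_to_ints(hex_digest: str):
    """Return (full 128-bit value, high 64 bits, low 64 bits)."""
    raw = bytes.fromhex(hex_digest)  # 32 hex characters -> 16 bytes
    if len(raw) != 16:
        raise ValueError("expected a 32-character MD5 hex digest")
    value = int.from_bytes(raw, "big")
    high = value >> 64
    low = value & ((1 << 64) - 1)
    return value, high, low
```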
There is more data in an MD5 hash than will fit in even a 64-bit integer, so there's no way (without knowing what platform you are using) to get a unique integer. You can get a somewhat unique one by converting the hex version to several integers' worth of data and then combining them (by addition or multiplication). How exactly you would go about that depends on what language you are using, though.
A lot of languages implement either an unpack or an sscanf function, which are good places to start looking.
If all you need is the modulus, you don't actually need to convert the hash to a 128-bit integer. You can go digit by digit or byte by byte, like this:
mod = 0
for (i = 0; i < 32; i++)
{
    digit = md5[i]; // convert the hex char to its value (0..15) first
    mod = (mod * 16 + digit) % divider;
}
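The loop above can be sketched in Python as follows (the function name is hypothetical). It relies on the identity (a * 16 + d) mod m = ((a mod m) * 16 + d) mod m, so the running value never grows beyond about 16 times the divider:

```python
def hex_mod(hex_digest: str, divider: int) -> int:
    """Remainder of a hex string modulo `divider`, one digit at a time."""
    mod = 0
    for ch in hex_digest:
        digit = int(ch, 16)  # hex character -> value 0..15
        mod = (mod * 16 + digit) % divider
    return mod
```

In a language with fixed-width integers, the divider has to be small enough that mod * 16 + digit cannot overflow.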
You'll need to define your own hash function that converts an MD5 string into an integer of the desired width. If you want to interpret the MD5 hash as a plain string, you can try the FNV algorithm. It's pretty quick and fairly evenly distributed.
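A minimal sketch of that idea, assuming the 64-bit FNV-1a variant with its published offset basis and prime, and treating the MD5 hex string as plain ASCII bytes. The helper `md5_string_bucket` is a hypothetical name added for illustration:

```python
FNV64_OFFSET_BASIS = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3
MASK64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a: XOR each byte in, then multiply by the FNV prime."""
    h = FNV64_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & MASK64
    return h


def md5_string_bucket(md5_hex: str, buckets: int) -> int:
    """Map an MD5 hex string to a bucket index in range(buckets)."""
    return fnv1a_64(md5_hex.encode("ascii")) % buckets
```

This does not yield a unique integer per MD5 value, since 128 bits are being squeezed into 64, but it is enough when the goal is only a well-spread modulus.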