Why doesn't md5 (and other hash algorithms) output in base32? - hash

It seems like most hashes (usually in base16/hex) could be easily represented in base32 in a lossless way, resulting in much shorter (and more easily readable) hash strings.
I understand that naive implementations might mix "O"s, "0"s, "1"s, and "I"s, but one could easily choose alphabetic characters without such problems. There are also enough characters to keep hashes case-insensitive. I know that shorter hash algorithms exist (like crc32), but this idea could be applied to those too for even shorter hashes.
Why, then, do most (if not all) hash algorithm implementations not output in base32, or at least provide an option to do so?
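For concreteness, here is a minimal Python sketch (using the standard hashlib and base64 modules) comparing the two renderings of the same MD5 digest; note that the standard RFC 4648 Base32 alphabet (A-Z and 2-7) already avoids the 0/O and 1/I confusion:

import base64
import hashlib

digest = hashlib.md5(b"hello world").digest()    # 16 raw bytes

hex_form = digest.hex()                          # base16: 32 characters
b32_form = base64.b32encode(digest).decode()     # RFC 4648 Base32: 26 characters plus "======" padding

print(hex_form)    # e.g. 5eb63bbbe01eeed093cb22bb8f5acdc3
print(b32_form)    # 26 significant characters drawn from A-Z and 2-7 only

So for a 128-bit digest the saving is 32 characters down to 26: real, but not dramatic.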

Related

iOS: 3 Strings into one hash. Find other string from hash of 2 strings

I have three strings: StrA, StrB, and StrC.
Their combined hash is YT56ejff653499TYK.
Now, if someone gives me the hash of just StrA and StrB, which is IEoeuor749Hueiur7x, is there a way to extract StrC from YT56ejff653499TYK and IEoeuor749Hueiur7x?
Assuming you are referring to some of the standard one-way hash functions like SHA-2 or similar, this should never be possible.
For example, if this were possible by any means, it would make the password hash salting technique essentially a disclosure of the original password.
In short, with a one-way hash function that is not broken, this should not be possible.

Are there any real alternatives to unicode?

As a C++ developer, supporting Unicode is, putting it mildly, a pain in the butt. Unicode has a few unfortunate properties that make it very hard to determine the case of a letter, convert cases, or do pretty much anything beyond identifying a single known codepoint or so (which may or may not be a letter). The only real rescue, it seems, is ICU for those who are unfortunate enough not to have Unicode support built into the language (i.e. C and C++). Support for Unicode in other languages may or may not be good enough.
So, I thought, there must be a real alternative to Unicode! That is, an encoding that allows easy identification of character classes and of the relationships between characters, without requiring a lookup data structure (tree, table, whatever). I suspect that any such encoding would likely be multi-byte for most text -- that's not a real concern to me, but I accept that it is for others. Providing such an encoding is a lot of work, so I'm not really expecting any such encoding to exist 😞.
Short answer: not that I know of.
As a non-C++ developer, I don't know what specifically is a pain about Unicode, but since you didn't tag the question with C++, I still dare to attempt an answer.
While I'm personally very happy about Unicode in general, I agree that some aspects are cumbersome.
Some of them could arguably be improved if Unicode were redesigned from scratch, e.g. by removing some redundancies like the "Latin Greek" math letters besides the actual Greek ones (but that would also break compatibility with older encodings).
But most of the "pains" just reflect the chaotic usage of writing in the first place.
You mention yourself the problem of uppercase "i", which is "I" in some orthographies and "İ" in others, but there are tons of other difficulties – e.g. German "ß", which is lowercase but has no uppercase equivalent (well, it has one now, but it is rarely used); or letters that look different in final position (Greek "σ"/"ς"); or quotes with inverted meaning («French style» vs. »Swiss style«, “English” vs. „German style“)... I could continue for a while.
I don't see how an encoding could help with that, other than providing tables of character properties, equivalences, and relations, which is what Unicode does.
You say in comments that, by looking at the bytes of an encoded character, you want it to tell you if it's upper or lower case.
To me, this sounds like saying: "When I look at a number, I want it to tell me if it's prime."
I mean, not even ASCII codes tell you if they are upper or lower case; you just memorised the properties table which tells you that 0x41..0x5A is upper case and 0x61..0x7A is lower case.
But it's hard to memorise or hardcode these ranges for all 120k Unicode codepoints. So the easiest thing is to use a table look-up.
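To make the contrast concrete, here is a small Python sketch: the ASCII test is a hard-coded range check, while the general test has to consult the character property table (here via the standard unicodedata module):

import unicodedata

def is_ascii_upper(ch):
    # the "memorised" ASCII rule: 0x41..0x5A is upper case
    return 0x41 <= ord(ch) <= 0x5A

def is_unicode_upper(ch):
    # no arithmetic shortcut for the full repertoire: ask the Unicode
    # character database for the general category ("Lu" = uppercase letter)
    return unicodedata.category(ch) == "Lu"

print(is_ascii_upper("A"), is_unicode_upper("A"))   # True True
print(is_ascii_upper("Σ"), is_unicode_upper("Σ"))   # False True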
There's also a bit of confusion about what "encoding" means.
Unicode doesn't define any byte representation; it only assigns codepoints, i.e. integers, to character definitions, and it maintains the aforementioned tables.
Encodings in the strict sense ("codecs") are the transformation formats (UTF-8 etc.), which define a mapping between the codepoints and their byte representation.
Now it would be possible to define a new UTF which maps codepoints to bytes in a way that provides a pattern for upper/lower case.
But what could that be?
Odd for upper, even for lower case?
But what about letters without upper-/lower-case distinction?
And then, characters that aren't letters?
And what about all the other character categories – punctuation, digits, whitespace, symbols, combining diacritics –, why not represent those as well?
You could put each in a predefined range, but what happens if too many new characters are added to one of the categories?
To sum it up: I don't think what you ask for is possible.

What is the limit to encoding base in case of Unicode strings as opposed to base64 having base = 64?

This is actually related to code golf in general, but also applicable elsewhere. People commonly use base64 encoding to store large amounts of binary data in source code.
Assuming all programming languages are happy to read Unicode source code, what is the maximum N for which we can reliably devise a baseN encoding?
Reliability here means being able to encode/decode any data, so every single combination of input bytes can be encoded and then decoded. The encoded form is exempt from this rule.
The main goal is to minimize the character count, regardless of byte-count.
Would it be base2147483647 (32-bit)?
Also, because I know it may vary from browser-to-browser, and we already have problems with copy-pasting code from codegolf answers to our editors, the copy-paste-ability is also a factor here. I know there is a Unicode range of characters that are not displayed.
NOTE:
I know that for binary data, base64 usually expands data, but here the character-count is the main factor.
It really depends on how reliable you want the encoding to be. Character encodings are designed with trade-offs, and in general the more characters allowed, the less likely it is to be universally accepted i.e. less reliable. Base64 isn't immune to this. RFC 3548, published in 2003, mentions that case sensitivity may be an issue, and that the characters + and / may be problematic in certain scenarios. It describes Base32 (no lowercase) and Base16 (hex digits) as potentially safer alternatives.
It does not get better with Unicode. Adding that many characters introduces many more possible points of failure. Depending on how stringent your requirements are, you might have different values for N. I'll cover a few possibilities from large N to small N, adding a requirement each time.
1,114,112: Code points. This is the number of possible code points defined by the Unicode Standard.
1,112,064: Valid UTF. This excludes the surrogates which cannot stand on their own.
1,111,998: Valid for exchange between processes. Unicode reserves 66 code points as permanent non-characters for internal use only. Theoretically, this is the maximum N you could justifiably expect for your copy-paste scenario, but as you noted, in practice many other Unicode strings will fail that exercise.
120,503: Printable characters only, depending on your definition. I've defined it to be all characters outside of the Other and Separator general categories. Also, starting from this bullet point, N is subject to change in future versions of Unicode.
103,595: NFKD normalized Unicode. Unfortunately, many processes automatically normalize Unicode input to a standardized form. If the process used NFKC or NFKD, some information may have been lost. For more reliability, the encoding should thus define a normalization form, with NFKD being better for increasing character count.
101,684: No combining characters. These are "characters" which shouldn't stand on their own, such as accents, and are meant to be combined with another base character. Some processes might panic if they are left standing alone, or if there are too many combining characters on a single base character. I've now excluded the Mark category.
85: ASCII85, aka. I want my ASCII back. Okay, this is no longer Unicode, but I felt like mentioning it because it's a lesser known ASCII-only encoding. It's mainly used in Adobe's PostScript and PDF formats, and has a 5:4 encoded data size increase, rather than Base64's 4:3 ratio.
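The character counts follow directly from information content: encoding b bytes in base N needs at least ceil(8*b / log2(N)) characters. A quick Python sketch (the N values are some of those listed above) shows how quickly the returns diminish:

import math

def chars_needed(num_bytes, base):
    # information-theoretic lower bound on the encoded length
    return math.ceil(num_bytes * 8 / math.log2(base))

for base in (64, 85, 101684, 1111998):
    print(base, chars_needed(1000, base))
# for 1000 input bytes: base64 needs 1334 characters, base85 needs 1249,
# base101684 needs 481, and base1111998 still needs 399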

doing away with encoding in a genetic algorithm implementation

I was wondering if encoding in a genetic algorithm is really necessary. Let's say I have a program that is supposed to implement a GA to guess a word a user inputs.
I don't see the point in having the chromosomes as a binary string; I would rather have them as just a string of letters, and mutate the string and crossbreed it accordingly.
Is such an approach unorthodox? Will it really affect the outcome, or does it violate the definition of a genetic algorithm?
I do understand that different types of encoding are possible. However, that isn't what I am concerned about. Please keep your answer specific to the program objective of guessing a string that is similar to the one inputted by the user.
THIS IS NOT A QUESTION ABOUT THE CHOICE OF ENCODING, BUT WHETHER I CAN DO AWAY WITH ENCODING ALTOGETHER FOR THIS OBJECTIVE.
Though unorthodox, your approach would be perfectly valid; the crossover and mutation operators may have to be tweaked, however. There are in fact numerous such non-standard encodings in use today, including alphabetic, alphanumeric, decimal, etc.
In your specific case, not encoding an alphabetic chromosome at all is the same as encoding it alphabetically with an identity map. With an alphabetic encoding, the normal crossover operator remains valid, though mutation has to be adjusted so that it writes a random letter at the mutation site, if any (see the sketch below).
Binary encoding is generally used in GAs only for the simplicity and speed of the operations involved; in your case, for example, a string/character comparison generally takes longer than the equivalent integer/boolean operation.
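As an illustration, here is a minimal Python sketch of the unencoded, string-chromosome approach for the word-guessing objective. The names and the parameters (population size, mutation rate, the lower-case alphabet) are arbitrary choices made for this example, not something prescribed by the original answer:

import random
import string

ALPHABET = string.ascii_lowercase + " "

def fitness(candidate, target):
    # number of positions that already match the user's word
    return sum(1 for a, b in zip(candidate, target) if a == b)

def mutate(chromosome, rate=0.05):
    # with small probability, overwrite a site with a random letter
    return "".join(random.choice(ALPHABET) if random.random() < rate else c
                   for c in chromosome)

def crossover(a, b):
    # single-point crossover performed directly on the strings
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def guess(target, pop_size=200, generations=1000):
    pop = ["".join(random.choice(ALPHABET) for _ in target) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda c: fitness(c, target), reverse=True)
        if pop[0] == target:
            return pop[0]
        parents = pop[:pop_size // 5]          # simple truncation selection
        pop = [mutate(crossover(random.choice(parents), random.choice(parents)))
               for _ in range(pop_size)]
    return max(pop, key=lambda c: fitness(c, target))

print(guess("hello world"))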

Looking for a good 64 bit hash for file paths in UTF16

I have a Unicode / UTF-16 encoded path. The path delimiter is U+005C '\'.
The paths are null-terminated root relative windows file system paths, e.g. "\windows\system32\drivers\myDriver32.sys"
I want to hash this path into a 64-bit unsigned integer.
It does not need to be "cryptographically sound".
The hashes should be case-insensitive, but able to handle non-ASCII letters.
Obviously, the hash also should scatter well.
There are some ideas that I have thought of:
A) Using the Windows file identifier as a "hash". In my case I do want the hash to change if the file gets moved, so this is not an option.
B) Just use a regular string hash: hash += prime * hash + codepoint for the whole string.
I have the feeling that the fact that the path consists of "segments" (folder names and the final file name) can be leveraged.
To sum up the needs:
1) 64-bit hash
2) good distribution / few collisions for file system paths.
3) efficient
4) does not need to be secure
5) case insensitive
I would just use something straightforward. I don't know what language you are using, so the following is pseudocode:
ui64 res = 10000019;                 // arbitrary seed constant
for (i = 0; i < len; i += 2)
{
    // pack two case-folded UTF-16 code units into one value
    ui64 merge = ucase(path[i]) * 65536 + ucase(path[i + 1]);
    res = res * 8191 + merge;        // 8191 = 2^13 - 1; unchecked (wrapping) arithmetic
}
return res;
I'm assuming that path[i + 1] is safe to read because, if len is odd, the last iteration simply picks up the terminating U+0000.
I wouldn't try to exploit the gaps in the code-unit space caused by UTF-16 surrogates, by lower-case and title-case characters, or by characters that are invalid in paths, because these are not distributed in a way that could be exploited quickly. Subtracting 32 (all code units below U+0020 are invalid in path names) wouldn't be too expensive, but it wouldn't improve the hashing much either.
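For what it's worth, here is that pseudocode as runnable Python; the function name, the use of str.upper() as a stand-in for ucase(), and the explicit mask in place of the unchecked arithmetic are my own choices for the sketch:

import struct

def path_hash64(path):
    # UTF-16 code units of the upper-cased path plus the terminating U+0000
    # (str.upper() is only an approximation of Windows' case folding)
    data = (path.upper() + "\x00").encode("utf-16-le")
    units = list(struct.unpack("<%dH" % (len(data) // 2), data))
    if len(units) % 2:
        units.append(0)                                    # keep the pair-wise loop in bounds
    res = 10000019
    for i in range(0, len(units), 2):
        merge = units[i] * 65536 + units[i + 1]            # pack two 16-bit code units
        res = (res * 8191 + merge) & 0xFFFFFFFFFFFFFFFF    # emulate wrapping 64-bit arithmetic
    return res

print(hex(path_hash64(r"\windows\system32\drivers\myDriver32.sys")))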
Cryptographically secure hashes might not be very efficient in terms of speed, but there are implementations available for virtually any programming language.
Whether using them is feasible for your application depends on how much you depend on speed – a benchmark would give you an appropriate answer to that.
You could use a sub-string of such a hash, e.g. MD5 of your path, previously converted to lower case so that the hash is effectively case-insensitive (this requires a lower-casing method that knows how to convert all the non-ASCII UTF-16 characters that may occur in the file system).
Cryptographically secure hashes have the benefit of quite even distribution no matter which sub-string part you take because they are designed to be non-predictable, i.e. each part of the hash ideally depends on the entire hashed data as any other part of it.
Even if you do not need a cryptographic hash, you can still use one, and since your problem is not about security, a "broken" cryptographic hash would be fine. I suggest MD4, which is quite fast. On my PC (a 2.4 GHz Core2 system, using a single core), MD4 hashes more than 700 MB/s, and even for small inputs (less than 50 bytes) it can process about 8 million messages per second. You may find faster non-cryptographic hashes, but it already takes a rather specific situation for the difference to be measurable.
For the specific properties you are after, you would need:
To "normalize" characters so that uppercase letters are converted to lowercase (for case insensitivity). Note that, generally speaking, case-insensitivity in the Unicode world is not an easy task. From what you explain, I gather that you are only after the same kind of case-insensitivity that Windows uses for file accesses (I think that it is ASCII-only, so conversion uppercase->lowercase is simple).
To truncate the output of MD4. MD4 produces 128 bits; just use the first 64 bits. This will be as scattered as you could wish for.
There are MD4 implementations available in many places, including in RFC 1320 itself. You may also find open-source MD4 implementations in C and Java in sphlib.
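A minimal Python sketch of the truncation idea, using MD5 rather than MD4 because Python's hashlib only guarantees MD5 (MD4 may still be available as hashlib.new("md4"), depending on the OpenSSL build):

import hashlib

def md5_path_hash64(path):
    # lower-case first so the hash is case-insensitive; str.lower() is only an
    # approximation of the case folding Windows applies to file names
    digest = hashlib.md5(path.lower().encode("utf-16-le")).digest()
    return int.from_bytes(digest[:8], "little")            # keep the first 64 bits

print(hex(md5_path_hash64(r"\windows\system32\drivers\myDriver32.sys")))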
You could just create a shared library in C# and use the FileInfo class to obtain the full path of a directory or file. Then use .GetHashCode() on the path, like this:
Hash = fullPath.GetHashCode();
or
int getHashCode(string uri)
{
    if (uri == null) throw new ArgumentNullException(nameof(uri));
    FileInfo fileInfo = new FileInfo(uri);
    return fileInfo.FullName.GetHashCode();
}
Although this is just a 32-bit code, you can duplicate it or append another hash code based on some other characteristic of the file.