I've searched a lot for MD5 hash collisions, but I've only found binary examples. I would like to find two UTF-8 strings which have the same MD5 hash. Are there any, or do collisions only occur for binary data?
It's definitely possible:
We all agree there are collisions for MD5: by the pigeonhole principle, infinitely many possible inputs are mapped onto a finite set of 128-bit digests (and the birthday paradox means a generic search needs only about 2^64 attempts to find one).
In fact there are infinitely many collisions: with infinitely many inputs and finitely many outputs, some digest must have infinitely many preimages, and MD5 tries to spread inputs uniformly over its outputs.
By that alone, some of these collisions are bound to be valid UTF-8 strings, but they are extremely rare, since most colliding messages are just random binary garbage.
If you absolutely need to find such messages, I recommend using the collision finder written by Patrick Stach, which should return a pair of arbitrary colliding messages within a few hours, or my attempt to improve it. The latter uses techniques presented in later papers by Wang (the first person to demonstrate examples of MD5 collisions), Lian, Sasaki, Yajima and Klima.
I think you could also use a length extension attack to some extent, but it requires a deeper understanding of what happens inside MD5.
There are UTF-8 collisions. By the nature of cryptographic hashes, finding them is intentionally difficult, even for a hash as broken as MD5.
You might search for MD5 rainbow tables, which can be used for password cracking and hence work with UTF-8 strings. As @alk pointed out, a brute-force search is going to take a very long time.
The canonical example of an MD5 hash collision (hex - from here):
Message 1:
d131dd02c5e6eec4693d9a0698aff95c 2fcab58712467eab4004583eb8fb7f89
55ad340609f4b30283e488832571415a 085125e8f7cdc99fd91dbdf280373c5b
d8823e3156348f5bae6dacd436c919c6 dd53e2b487da03fd02396306d248cda0
e99f33420f577ee8ce54b67080a80d1e c69821bcb6a8839396f9652b6ff72a70
Message 2:
d131dd02c5e6eec4693d9a0698aff95c 2fcab50712467eab4004583eb8fb7f89
55ad340609f4b30283e4888325f1415a 085125e8f7cdc99fd91dbd7280373c5b
d8823e3156348f5bae6dacd436c919c6 dd53e23487da03fd02396306d248cda0
e99f33420f577ee8ce54b67080280d1e c69821bcb6a8839396f965ab6ff72a70
contain no NULL bytes, so they can at least be stored as ordinary NUL-terminated strings, and they are often presented as an example of a UTF-8 collision. Strictly speaking, the absence of NUL bytes does not make a byte sequence valid UTF-8, and a strict decoder will reject parts of these messages; decoded leniently, they are meaningless and look like garbage:
Message 1:
1i=\/ʵF~#X>U4 䈃%qAZQ%ɟ7<[>1V4[m6Sⴇ9cH͠3BW~Tp
Ƙ!e+o*p
(some characters were control characters)
Message 2:
1i=\/ʵF~#X>U4 䈃%AZQ%ɟr7<[>1V4[m6S49cH͠3BW~Tp(
Ƙ!eo*p
(same situation)
Oh, and before I forget, here's the MD5 hash:
79054025255fb1a26e4bc422aef54eb4
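If you want to check this yourself, here is a minimal Python sketch (standard library only) that rebuilds both messages from the hex dumps above, confirms that they produce the same MD5 digest, and decodes them leniently, since a strict UTF-8 decode will reject some of these byte sequences. It also illustrates the Merkle-Damgård property mentioned in the other answer: because both messages are whole MD5 blocks and collide in the internal state, appending the same suffix to both preserves the collision.

import hashlib

# The two colliding 128-byte messages, copied from the hex dumps above.
msg1 = bytes.fromhex(
    "d131dd02c5e6eec4693d9a0698aff95c2fcab58712467eab4004583eb8fb7f89"
    "55ad340609f4b30283e488832571415a085125e8f7cdc99fd91dbdf280373c5b"
    "d8823e3156348f5bae6dacd436c919c6dd53e2b487da03fd02396306d248cda0"
    "e99f33420f577ee8ce54b67080a80d1ec69821bcb6a8839396f9652b6ff72a70"
)
msg2 = bytes.fromhex(
    "d131dd02c5e6eec4693d9a0698aff95c2fcab50712467eab4004583eb8fb7f89"
    "55ad340609f4b30283e4888325f1415a085125e8f7cdc99fd91dbd7280373c5b"
    "d8823e3156348f5bae6dacd436c919c6dd53e23487da03fd02396306d248cda0"
    "e99f33420f577ee8ce54b67080280d1ec69821bcb6a8839396f965ab6ff72a70"
)

assert msg1 != msg2                                                # different inputs...
assert hashlib.md5(msg1).digest() == hashlib.md5(msg2).digest()    # ...identical digests
print(hashlib.md5(msg1).hexdigest())                               # 79054025255fb1a26e4bc422aef54eb4

# Not every NUL-free byte string is valid UTF-8, so decode leniently to inspect it.
print(msg1.decode("utf-8", errors="replace"))

# Both messages are exactly two MD5 blocks and collide in the internal state,
# so appending the same suffix to both keeps the digests equal.
suffix = b" any common suffix"
assert hashlib.md5(msg1 + suffix).digest() == hashlib.md5(msg2 + suffix).digest()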
Related
It seems like most hashes (usually in base16/hex) could be easily represented in base32 in a lossless way, resulting in much shorter (and more easily readable) hash strings.
I understand that naive implementations might mix "O"s, "0"s, "1"s, and "I"s, but one could easily choose alphabetic characters without such problems. There are also enough characters to keep hashes case-insensitive. I know that shorter hash algorithms exist (like crc32), but this idea could be applied to those too for even shorter hashes.
Why, then, do most (if not all) hash algorithm implementations not output in base32, or at least provide an option to do so?
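For concreteness, here is a small sketch of the size difference using Python's standard library (the input string and the choice of MD5 are arbitrary; any 128-bit digest behaves the same, and padding characters are stripped for readability):

import base64
import hashlib

digest = hashlib.md5(b"example input").digest()               # 16 raw bytes = 128 bits

as_hex = digest.hex()                                          # base16: 32 characters
as_b32 = base64.b32encode(digest).decode().rstrip("=")         # base32: 26 characters, case-insensitive
as_b64 = base64.b64encode(digest).decode().rstrip("=")         # base64: 22 characters, case-sensitive

print(len(as_hex), as_hex)
print(len(as_b32), as_b32)
print(len(as_b64), as_b64)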
Again and again, I keep asking myself: Why do they always insist on over-complicating everything?!
I've tried to read up about and understand Unicode many times over the years. When they start talking about endians and BOMs and all that stuff, my eyes just "zone out". I physically cannot keep reading and retain what I'm seeing. I fundamentally don't get their desire for over-complicating everything.
Why do we need UTF-16 and UTF-32 and "big endian" and "little endian" and BOMs and all this nonsense? Why wasn't Unicode just defined as "compatible with ASCII, but you can also use multiple bytes to represent all these further characters"? That would've been nice and simple, but nooo... instead we get all this other stuff, Microsoft chose UTF-16 for Windows NT, and nothing is easy or straightforward!
As always, there probably is a reason, but I doubt it's good enough to justify all this confusion and all these problems arising from insisting on making it so complex and difficult to grasp.
Unicode started out as a 16-bit character set, so naturally every character was simply encoded as two consecutive bytes. However, it quickly became clear that this would not suffice, so the limit was increased. The problem was that some programming languages and operating systems had already started implementing Unicode as 16-bit and they couldn’t just throw out everything they had already built, so a new encoding was devised that stayed backwards-compatible with these 16-bit implementations while still allowing full Unicode support. This is UTF-16.
UTF-32 represents every character as a sequence of four bytes, which is utterly impractical and virtually never used to actually store text. However, it is very useful when implementing algorithms that operate on individual codepoints – such as the various mechanisms defined by the Unicode standard itself – because all codepoints are always the same length and iterating over them becomes trivial, so you will sometimes find it used internally for buffers and such.
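A small illustration of the trade-offs described above, using Python only as a convenient way to inspect the encodings (the sample characters are arbitrary: one ASCII letter, one character inside the Basic Multilingual Plane, one outside it):

# 'A' is ASCII, 'é' sits inside the BMP, U+1F600 lies outside the BMP.
for ch in ("A", "é", "\U0001F600"):
    print(
        hex(ord(ch)),
        len(ch.encode("utf-8")),       # 1, 2 and 4 bytes respectively
        len(ch.encode("utf-16-le")),   # 2 bytes, or 4 via a surrogate pair
        len(ch.encode("utf-32-le")),   # always exactly 4 bytes
    )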
UTF-8 meanwhile is what you actually want to use to store and transmit text. It is compatible with ASCII and self-synchronising (unlike the other two) and it is quite space-efficient (unlike UTF-32). It will also never produce eight binary zeroes in a row (unless you are trying to represent the literal NULL character) so UTF-8 can safely be used in legacy environments where strings are null-terminated.
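Both of those properties are easy to check; a quick sketch (the sample text is arbitrary):

text = "naïve café"                       # arbitrary mix of ASCII and non-ASCII
encoded = text.encode("utf-8")

# ASCII characters keep their single-byte ASCII values in UTF-8...
assert "cafe".encode("utf-8") == b"cafe"
# ...and no zero byte ever appears unless U+0000 itself is encoded.
assert b"\x00" not in encoded
assert "\x00".encode("utf-8") == b"\x00"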
Endianness is just an intrinsic property of data types whose code units are larger than one byte: computers simply don't all agree on the order in which to read a sequence of bytes. For Unicode, this problem can be circumvented by including a Byte Order Mark in the text stream, because if you read its byte representation in the wrong order in UTF-16 or UTF-32, you get a noncharacter that has no reason to ever occur, so you know that this particular order cannot be the right one.
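Concretely, the BOM is the code point U+FEFF; read with the wrong byte order it comes out as U+FFFE, a permanent noncharacter, which is how a decoder can tell which interpretation is plausible. A short sketch of how this plays out in UTF-16 (Python's 'utf-16' codec writes and consumes the BOM automatically):

text = "hello"

le = text.encode("utf-16-le")      # little-endian code units, no BOM
be = text.encode("utf-16-be")      # big-endian code units, no BOM
with_bom = text.encode("utf-16")   # prepends a BOM in the platform's byte order

print(le[:2].hex(), be[:2].hex(), with_bom[:2].hex())   # e.g. 6800 0068 fffe on a little-endian machine

# The generic 'utf-16' decoder reads the BOM and picks the right byte order.
assert (b"\xff\xfe" + le).decode("utf-16") == text      # BOM says little-endian
assert (b"\xfe\xff" + be).decode("utf-16") == text      # BOM says big-endian
assert with_bom.decode("utf-16") == text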
!MJXAy.... (41 characters, A-Z a-z 0-9)
I have a list of password hashes from an older web app, but no longer have the source for creating / verifying the hash.
I think it may have been a Django app, but I can't be sure.
I'm looking for either the type of hash, or a link to the source, so I can validate logins.
Any thoughts / suggestions?
You might simply have 40-character hashes that have been "disabled".
! and * are often placed at the beginning or end of a hash to reversibly "disable" accounts (because those characters never appear in most hashes, thereby making any hash containing them impossible to match).
If the account ever needs to be reactivated, the obviously invalid "disabling" character can simply be removed, restoring the original hash without having to know the original password or interact with the user.
If all of the hashes start with an exclamation point, perhaps all of the hashes were disabled on purpose for some reason.
And for what it's worth, the two Django formats that I'm aware of look like this (from the Hashcat example-hashes page, both plaintexts 'hashcat'):
Django (PBKDF2-SHA256):
pbkdf2_sha256$20000$H0dPx8NeajVu$GiC4k5kqbbR9qWBlsRgDywNqC2vd9kqfk7zdorEnNas=
Django (SHA-1):
sha1$fe76b$02d5916550edf7fc8c886f044887f4b1abf9b013
Both of these are salted and don't seem to match your character set. It's possible that your hash is a custom one, with the final step being a base64 conversion (but without the trailing =?).
The closest I can see are an ASCII MD5 hash converted to base64 and a binary SHA-256 hash converted to base64, but both of those are 44 characters.
Your best bet may be to trim off the ! and try to verify them with mdxfind, which will try many different kinds and chains of hashing, encoding, and truncation.
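For what it's worth, both Django formats shown above can be reproduced with nothing but Python's standard library, which gives a quick way to test a candidate scheme against a known plaintext. This is only a sketch based on how Django's salted SHA-1 and PBKDF2-SHA256 hashers are commonly described (sha1(salt + password), and PBKDF2-HMAC-SHA256 with the salt and iteration count taken from the stored string); verify it against your own data before relying on it.

import base64
import hashlib

def django_sha1(password: str, salt: str) -> str:
    # "sha1$<salt>$<hex digest of salt + password>"
    digest = hashlib.sha1((salt + password).encode()).hexdigest()
    return f"sha1${salt}${digest}"

def django_pbkdf2_sha256(password: str, salt: str, iterations: int) -> str:
    # "pbkdf2_sha256$<iterations>$<salt>$<base64 of PBKDF2-HMAC-SHA256>"
    raw = hashlib.pbkdf2_hmac("sha256", password.encode(), salt.encode(), iterations)
    return f"pbkdf2_sha256${iterations}${salt}${base64.b64encode(raw).decode()}"

# Compare against the Hashcat example hashes quoted above (plaintext 'hashcat').
print(django_sha1("hashcat", "fe76b"))
print(django_pbkdf2_sha256("hashcat", "H0dPx8NeajVu", 20000))

If neither format matches your stored strings, the mdxfind approach above is still the better bet.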
I was wondering if encoding in a genetic algorithm is really necessary. Let's say I have a program that is supposed to implement a GA to guess a word a user inputs.
I don't see the point in having the chromosomes be binary strings; I would rather have each one be just a string of letters, and mutate and crossbreed the strings accordingly.
Is such an approach unorthodox? Will it really affect the outcome, or does it violate the definition of a genetic algorithm?
I do understand that different types of encoding are possible. However, that isn't what I am concerned about. Please keep your answer specific to the objective of guessing a string similar to the one the user inputs.
To be clear: this is not a question about the choice of encoding, but about whether I can do away with encoding entirely for this objective.
Though unorthodox, your approach would be perfectly valid; the crossover and mutation operators may have to be tweaked, however. There are in fact numerous such non-standard encodings in use today, including alphabetic, alphanumeric, decimal, etc.
In your specific case, not encoding an alphabetic chromosome is the same as encoding it alphabetically with an identity map. With an alphabetic encoding, the normal crossover operator remains valid, though mutation has to be adjusted so that it generates a random letter at the mutation site, if any.
Binary encoding in GAs is generally followed only for the simplicity and speed of the operations involved; for your case, a string/character comparison generally takes longer than the integer/boolean alternative.
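To make that concrete, here is a minimal sketch of a GA that evolves plain character strings toward a target word, with crossover as a single-point splice and mutation as swapping in a random letter. The target, alphabet, population size and rates are made up for illustration.

import random
import string

TARGET = "hello world"                    # stands in for the word the user typed
ALPHABET = string.ascii_lowercase + " "   # genes are characters, not bits

def fitness(candidate: str) -> int:
    # Number of positions that already match the target.
    return sum(c == t for c, t in zip(candidate, TARGET))

def crossover(a: str, b: str) -> str:
    # Single-point crossover directly on the character strings.
    cut = random.randint(0, len(TARGET))
    return a[:cut] + b[cut:]

def mutate(candidate: str, rate: float = 0.05) -> str:
    # Replace each gene with a random letter with probability `rate`.
    return "".join(
        random.choice(ALPHABET) if random.random() < rate else c
        for c in candidate
    )

population = ["".join(random.choice(ALPHABET) for _ in TARGET) for _ in range(200)]

for generation in range(1000):
    population.sort(key=fitness, reverse=True)
    if population[0] == TARGET:
        break
    parents = population[:50]             # simple truncation selection
    population = parents + [
        mutate(crossover(random.choice(parents), random.choice(parents)))
        for _ in range(150)
    ]

print(generation, max(population, key=fitness))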
Adding support for Unicode passwords is an important feature that should not be ignored by developers.
Still, adding support for Unicode in passwords is a tricky job because the same text can be encoded in different ways in Unicode and you don't want to prevent people from logging in because of this.
Let's say that you'll store the passwords as UTF-8 (and note that this question is not about Unicode encodings; it is about Unicode normalization).
Now the question is: how should you normalize the Unicode data?
You have to be sure that you'll be able to compare it later, and that the next release of the Unicode standard will not invalidate your password verification.
Note: there are still some places where Unicode passwords will probably never be used, but this question is not about why or when to use Unicode passwords; it is about how to implement them properly.
1st update
Is it possible to implement this without using ICU, for example by using the OS to do the normalization?
A good start is to read Unicode TR 15: Unicode Normalization Forms. Then you realize that it is a lot of work and prone to strange errors - you probably already know this part since you are asking here. Finally, you download something like ICU and let it do it for you.
IIRC, it is a multistep process. First you decompose the sequence until you cannot further decompose - for example é would become e + ´. Then you reorder the sequences into a well-defined ordering. Finally, you can encode the resulting byte stream using UTF-8 or something similar. The UTF-8 byte stream can be fed into the cryptographic hash algorithm of your choice and stored in a persistent store. When you want to check if a password matches, perform the same procedure and compare the output of the hash algorithm with what is stored in the database.
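As a concrete sketch of that pipeline using only Python's standard library (unicodedata implements the Unicode normalization forms, so no ICU is required): the form used here is NFD, to match the decompose-and-reorder steps described above, although any single form applied consistently works; SHA-256 stands in for whatever you use, and a real system should feed the normalized UTF-8 bytes into a proper password KDF such as PBKDF2, bcrypt, scrypt or Argon2 rather than a bare hash.

import hashlib
import unicodedata

def password_digest(password: str) -> str:
    # 1. Normalize, so canonically equivalent inputs become identical.
    normalized = unicodedata.normalize("NFD", password)
    # 2. Encode the normalized text as UTF-8.
    encoded = normalized.encode("utf-8")
    # 3. Hash the bytes (use a real password KDF in production).
    return hashlib.sha256(encoded).hexdigest()

# 'é' typed as one precomposed character vs. 'e' plus a combining acute accent:
composed = "caf\u00e9"
decomposed = "cafe\u0301"

assert composed != decomposed                                     # different code point sequences
assert password_digest(composed) == password_digest(decomposed)   # same digest after normalization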
A question back to you: can you explain why you added "without using ICU"? I see a lot of questions asking for things that ICU does (we* think) pretty well, but "without using ICU". Just curious.
Secondly, you may be interested in StringPrep/NamePrep and not just normalization: StringPrep - to map strings for comparison.
Thirdly, you may be interested in UTR#36 and UTR#39 for other Unicode security implications.
*(disclosure: ICU developer :)