Implementing MD5: Inconsistent endianness? - hash

So I tried implementing the MD5 algorithm according to RFC1321 in C# and it works, but there is one thing about the way the padding is performed that I don't understand, here's an example:
If I want to hash the string "1" (without the quotation marks) this results in the following bit representation: 10001100
The next step is appending a single "1"-Bit, represented by 00000001 (big endian), which is followed by "0"-Bits, followed by a 64-bit representation of the length of the original message (low-order word first).
Since the length of the original message is 8 (Bits) I expected 00000000000000000000000000001000 00000000000000000000000000000000 to be appended (low-order word first). However this does not result in the correct hash value, but appending 00010000000000000000000000000000 00000000000000000000000000000000 does.
This looks as if suddenly the little-endian format is being used, but that does not really seem to make any sense at all, so I guess there must be something else that I am missing?

Yes, for md5 you have to add message length in little-endian.
So, message representation for "1" -> 49 -> 00110001, followed by single bit and zeroes. And after add message length in reversed order of bytes (the least significant byte first).
You could also check permutations step by step on this site: https://cse.unl.edu/~ssamal/crypto/genhash.php.
Or there: https://github.com/MrBlackk/md5_sha256-512_debugger

Related

utf-8: How to recover from missing bit?

Quote from Wikipedia
If the byte stream is subject to corruption then some encodings recover better than others. UTF-8 and UTF-EBCDIC are best in this regard as they can always resynchronize at the start of the next code point;
...
UTF-16 and UTF-32 will handle corrupt (altered) bytes by resynchronizing on the next good code point, but an odd number of lost or spurious byte (octet)s will garble all following text
I don't quite understand why UTF-8 works better than UTF-16 when corruption happens. What does "odd number" mean? Does it mean 1 in 0101010111101(bytes stream)?
In addition, Can I say that UTF-16's recovery performance is as good as UTF-8 when no odd number of lost or spurious byte (octet) occur?
Could anyone give some point at how the system work to recover from bit error when using UTF-8 or UTF-16?
This documentation quote refers to recovering the stream after having missing some code points, not correcting the characters that went bad during transmission.
And what it means is simply that: if you miss one or more byte in an utf-8 stream, as soon as the next byte is the beginning of a new code point (usually a character), you can resume your text from there.
In utf-16, all characters are 2 bytes long, and if you miss a single byte then all subsequent bytes will be misplaced, and without further, higher level, correction, no other code-point will be correct for that stream.
For visualizing that, in Python3 interactive mode we could do:
In [19]: a = "Resumé for mr. Fernando"
In [20]: b = a.encode("utf-8"); b = b[:3] + b[4:]
In [21]: print (b.decode("utf-8"))
Resmé for mr. Fernando
In [22]: b = a.encode("utf-16"); b = b[:10] + b[11:]
In [23]: print (b.decode("utf-16", errors="replace"))
Resu 昀漀爀 洀爀⸀ 䘀攀爀渀愀渀搀漀�
(The slicing notation used in Python for those unfamiliar means [<begin (inclusive)>: <end (exclusive)>] as positions. Any parameter left blank is assumed to be the beginning or end of the sliced sequence)
And why "odd" should be obvious at this point: if you miss an even number of bytes, the decoding system will still try to decode at character boundaries - with an odd number, all character boundaries will be incorrect
The magic hides in the structure of the encoding. In UTF-8 structure, all the starting byte of a UTF-8 encoded bytes cannot be misreplace by the byte that is not the starting byte. However, thing is not true in UTF-16.
UTF-8 structure
Let's simulate how a magic machine that receiving UTF-8 encoded bytes as input and outputing code points works when one byte missed in a 3-byte long UTF-8 encoded bytes.
Say, the UTF-8 encoded bytes from two characters 一二 is e4b880(一) and e4ba8c(二). Now the machine reads e4, which is 11100100, knowing that the following code point is a 3-byte long code point. Unfortunately, the following byte b8is missed. Next,80 is read. Then, the first byte of 二 is read, which is e4. However, e4 doesn't fit the rule right: When it comes to 3-byte long code point, byte 3 need to start with 10 according to the table above. Now the machine knows some bytes are missing and the code point is broken. It will try to look around to find the right decode starting point. Obviously, the next byte ba is not a good start because it doesn't fit any of the byte 1 in the table above. Then it will look back, finding that the previous byte e4 is exaclty a good start byte to decode.
In UTF-16, a byte can be misreplaced by the byte that is not the starting byte because of UTF-16 have no way to tell whether it's a starting byte or not.

yaws unicode symbols in {html, ...}

Why {html, "доуч"++[1076,1086,1091,1095]} in yaws-page gives me next error:
Yaws process died: {badarg,[{erlang,list_to_binary,
[[[[208,180,208,190,209,131,209,135,1076,
1086,1091,1095]],
...
"доуч" = [1076,1086,1091,1095] -> gives me exact match, but how yaws translate 2-byte per elem list in two times longer list with 1 byte per elem for "доуч", but doesnt do it for [1076,1086,1091,1095]. Is there some internal represintation of unicode data involed?
I want to output to the web pages lists like [1076,1086,1091,1095], but it crushed.
Erlang source files only support the ISO-LATIN-1 charset. The Erlang console can accept Unicode characters, but to enter them inside a source code file, you need to use this syntax:
K = "A weird K: \x{a740}".
See http://www.erlang.org/doc/apps/stdlib/unicode_usage.html for more info.
You have to do the following to make it work:
{html, "доуч"++ binary_to_list(unicode:characters_to_binary([1076,1086,1091,1095]))}
Why it fails?
In a bit more detail, the list_to_binary fails because it is trying to convert each item in the list to a byte, which it cannot do because each value in [1076,1086,1091,1095] would take more than a byte.
What is going on?
[1076,1086,1091,1095] is a pure unicode string representation of "доуч". Yaws tries to convert the string (list) into a binary string directly using list_to_binary and thus fails. Since each unicode character can take more than one byte, we need to convert it into a byte array. This can be done using:
unicode:characters_to_binary([1076,1086,1091,1095]).
<<208,180,208,190,209,131,209,135>>
This can now be safely converted back and forth between list and binary representations. See unicode for more details.
You can convert back to unicode as follows:
unicode:characters_to_list(<<208,180,208,190,209,131,209,135>>).
[1076,1086,1091,1095]

pySerial and reading binary data

When the device I am communicating with sends binary data, I can recover most of it. However, there always seem to be some bytes missing, replaced by non-standard characters. For instance, one individual output looks like this:
\xc4\xa5\x06\x00.\xb3\x01\x01\x02\x00\x00\x00=\xa9
The period and equals sign should be traditional bytes in hexadecimal format (I confirmed this in another application). Other times I get other weird characters such as ')' or 's'. These characters usually occur in the exact same spot (which varies with the command I passed to the device).
How can I fix this problem?
Are you displaying the output using something like this?:
print output
If some of your bytes happen to correspond with printable characters, they'll show up as characters. Try this:
print output.encode('hex')
to see hex values for all your bytes.
At first I liked #RichieHindle answer, but when I tried it the hex bytes were all bunched together.
To get a friendlier output, I use
print ' '.join(map(lambda x:x.encode('hex'),output))

Displaying Unicode Characters

I already searched for answers to this sort of question here, and have found plenty of them -- but I still have this nagging doubt about the apparent triviality of the matter.
I have read this very interesting an helpful article on the subject: http://www.joelonsoftware.com/articles/Unicode.html, but it left me wondering about how one would go about identifying individual glyphs given a buffer of Unicode data.
My questions are:
How would I go about parsing a Unicode string, say UTF-8?
Assuming I know the byte order, what happens when I encounter the beginning of a glyph that is supposed to be represented by 6 bytes?
That is, if I interpreted the method of storage correctly.
This is all related to a text display system I am designing to work with OpenGL.
I am storing glyph data in display lists and I need to translate the contents of a string to a sequence of glyph indexes, which are then mapped to display list indices (since, obviously, storing the entire glyph set in graphics memory is not always practical).
To have to represent every string as an array of shorts would require a significant amount of storage considering everything I have need to display.
Additionally, it seems to me that 2 bytes per character simply isn't enough to represent every possible Unicode element.
How would I go about parsing a Unicode string, say UTF-8?
I'm assuming that by "parsing", you mean converting to code points.
Often, you don't have to do that. For example, you can search for a UTF-8 string within another UTF-8 string without needing to care about what characters those bytes represent.
If you do need to convert to code points (UTF-32), then:
Check the first byte to see how many bytes are in the character.
Look at the trailing bytes of the character to ensure that they're in the range 80-BF. If not, report an error.
Use bit masking and shifting to convert the bytes to the code point.
Report an error if the byte sequence you got was longer than the minimum needed to represent the character.
Increment your pointer by the sequence length and repeat for the next character.
Additionally, it seems to me that 2
bytes per character simply isn't
enough to represent every possible
Unicode element.
It's not. Unicode was originally intended to be a fixed-with 16-bit encoding. It was later decided that 65,536 characters wasn't enough, so UTF-16 was created, and Unicode was redefined to use code points between 0 and 1,114,111.
If you want a fixed-width encoding, you need 21 bits. But they aren't many languages that have a 21-bit integer type, so in practice you need 32 bits.
Well, I think this answers it:
http://en.wikipedia.org/wiki/UTF-8
Why it didn't show up the first time I went searching, I have no idea.

What is base 64 encoding used for?

I've heard people talking about "base 64 encoding" here and there. What is it used for?
When you have some binary data that you want to ship across a network, you generally don't do it by just streaming the bits and bytes over the wire in a raw format. Why? because some media are made for streaming text. You never know -- some protocols may interpret your binary data as control characters (like a modem), or your binary data could be screwed up because the underlying protocol might think that you've entered a special character combination (like how FTP translates line endings).
So to get around this, people encode the binary data into characters. Base64 is one of these types of encodings.
Why 64?
Because you can generally rely on the same 64 characters being present in many character sets, and you can be reasonably confident that your data's going to end up on the other side of the wire uncorrupted.
It's basically a way of encoding arbitrary binary data in ASCII text. It takes 4 characters per 3 bytes of data, plus potentially a bit of padding at the end.
Essentially each 6 bits of the input is encoded in a 64-character alphabet. The "standard" alphabet uses A-Z, a-z, 0-9 and + and /, with = as a padding character. There are URL-safe variants.
Wikipedia is a reasonably good source of more information.
Years ago, when mailing functionality was introduced, so that was utterly text based, as the time passed, need for attachments like image and media (audio,video etc) came into existence. When these attachments are sent over internet (which is basically in the form of binary data), the probability of binary data getting corrupt is high in its raw form. So, to tackle this problem BASE64 came along.
The problem with binary data is that it contains null characters which in some languages like C,C++ represent end of character string so sending binary data in raw form containing NULL bytes will stop a file from being fully read and lead in a corrupt data.
For Example :
In C and C++, this "null" character shows the end of a string. So "HELLO" is stored like this:
H E L L O
72 69 76 76 79 00
The 00 says "stop here".
Now let’s dive into how BASE64 encoding works.
Point to be noted : Length of the string should be in multiple of 3.
Example 1 :
String to be encoded : “ace”, Length=3
Convert each character to decimal.
a= 97, c= 99, e= 101
Change each decimal to 8-bit binary representation.
97= 01100001, 99= 01100011, 101= 01100101
Combined : 01100001 01100011 01100101
Separate in a group of 6-bit.
011000 010110 001101 100101
Calculate binary to decimal
011000= 24, 010110= 22, 001101= 13, 100101= 37
Covert decimal characters to base64 using base64 chart.
24= Y, 22= W, 13= N, 37= l
“ace” => “YWNl”
Example 2 :
String to be encoded : “abcd” Length=4, it's not multiple of 3. So to make string length multiple of 3 , we must add 2 bit padding to make length= 6. Padding bit is represented by “=” sign.
Point to be noted : One padding bit equals two zeroes 00 so two padding bit equals four zeroes 0000.
So lets start the process :–
Convert each character to decimal.
a= 97, b= 98, c= 99, d= 100
Change each decimal to 8-bit binary representation.
97= 01100001, 98= 01100010, 99= 01100011, 100= 01100100
Separate in a group of 6-bit.
011000, 010110, 001001, 100011, 011001, 00
so the last 6-bit is not complete so we insert two padding bit which equals four zeroes “0000”.
011000, 010110, 001001, 100011, 011001, 000000 ==
Now, it is equal. Two equals sign at the end show that 4 zeroes were added (helps in decoding).
Calculate binary to decimal.
011000= 24, 010110= 22, 001001= 9, 100011= 35, 011001= 25, 000000=0 ==
Covert decimal characters to base64 using base64 chart.
24= Y, 22= W, 9= j, 35= j, 25= Z, 0= A ==
“abcd” => “YWJjZA==”
Base-64 encoding is a way of taking binary data and turning it into text so that it's more easily transmitted in things like e-mail and HTML form data.
http://en.wikipedia.org/wiki/Base64
It's a textual encoding of binary data where the resultant text has nothing but letters, numbers and the symbols "+", "/" and "=". It's a convenient way to store/transmit binary data over media that is specifically used for textual data.
But why Base-64? The two alternatives for converting binary data into text that immediately spring to mind are:
Decimal: store the decimal value of each byte as three numbers: 045 112 101 037 etc. where each byte is represented by 3 bytes. The data bloats three-fold.
Hexadecimal: store the bytes as hex pairs: AC 47 0D 1A etc. where each byte is represented by 2 bytes. The data bloats two-fold.
Base-64 maps 3 bytes (8 x 3 = 24 bits) in 4 characters that span 6-bits (6 x 4 = 24 bits). The result looks something like "TWFuIGlzIGRpc3Rpb...". Therefore the bloating is only a mere 4/3 = 1.3333333 times the original.
Aside from what's already been said, two very common uses that have not been listed are
Hashes:
Hashes are one-way functions that transform a block of bytes into another block of bytes of a fixed size such as 128bit or 256bit (SHA/MD5). Converting the resulting bytes into Base64 makes it much easier to display the hash especially when you are comparing a checksum for integrity. Hashes are so often seen in Base64 that many people mistake Base64 itself as a hash.
Cryptography:
Since an encryption key does not have to be text but raw bytes it is sometimes necessary to store it in a file or database, which Base64 comes in handy for. Same with the resulting encrypted bytes.
Note that although Base64 is often used in cryptography is not a security mechanism. Anyone can convert the Base64 string back to its original bytes, so it should not be used as a means for protecting data, only as a format to display or store raw bytes more easily.
Certificates
x509 certificates in PEM format are base 64 encoded. http://how2ssl.com/articles/working_with_pem_files/
In the early days of computers, when telephone line inter-system communication was not particularly reliable, a quick & dirty method of verifying data integrity was used: "bit parity". In this method, every byte transmitted would have 7-bits of data, and the 8th would be 1 or 0, to force the total number of 1 bits in the byte to be even.
Hence 0x01 would be transmited as 0x81; 0x02 would be 0x82; 0x03 would remain 0x03 etc.
To further this system, when the ASCII character set was defined, only 00-7F were assigned characters. (Still today, all characters set in the range 80-FF are non-standard)
Many routers of the day put the parity check and byte translation into hardware, forcing the computers attached to them to deal strictly with 7-bit data. This force email attachments (and all other data, which is why HTTP & SMTP protocols are text-based), to be convert into a text-only format.
Few of the routers survived into the 90s. I severely doubt any of them are in use today.
From http://en.wikipedia.org/wiki/Base64
The term Base64 refers to a specific MIME content transfer encoding.
It is also used as a generic term for any similar encoding scheme that
encodes binary data by treating it numerically and translating it into
a base 64 representation. The particular choice of base is due to the
history of character set encoding: one can choose a set of 64
characters that is both part of the subset common to most encodings,
and also printable. This combination leaves the data unlikely to be
modified in transit through systems, such as email, which were
traditionally not 8-bit clean.
Base64 can be used in a variety of contexts:
Evolution and Thunderbird use Base64 to obfuscate e-mail passwords[1]
Base64 can be used to transmit and store text that might otherwise cause delimiter collision
Base64 is often used as a quick but insecure shortcut to obscure secrets without incurring the overhead of cryptographic key management
Spammers use Base64 to evade basic anti-spamming tools, which often do not decode Base64 and therefore cannot detect keywords in encoded
messages.
Base64 is used to encode character strings in LDIF files
Base64 is sometimes used to embed binary data in an XML file, using a syntax similar to ...... e.g.
Firefox's bookmarks.html.
Base64 is also used when communicating with government Fiscal Signature printing devices (usually, over serial or parallel ports) to
minimize the delay when transferring receipt characters for signing.
Base64 is used to encode binary files such as images within scripts, to avoid depending on external files.
Can be used to embed raw image data into a CSS property such as background-image.
Some transportation protocols only allow alphanumerical characters to be transmitted. Just imagine a situation where control characters are used to trigger special actions and/or that only supports a limited bit width per character. Base64 transforms any input into an encoding that only uses alphanumeric characters, +, / and the = as a padding character.
Base64 is a binary to a text encoding scheme that represents binary data in an ASCII string format. It is designed to carry data stored in binary format across the network channels.
Base64 mechanism uses 64 characters to encode. These characters consist of:
10 numeric value: i.e., 0,1,2,3,...,9
26 Uppercase alphabets: i.e., A,B,C,D,...,Z
26 Lowercase alphabets: i.e., a,b,c,d,...,z
2 special characters (these characters depends on operating system): i.e. +,/
How base64 works
The steps to encode a string with base64 algorithm are as follow:
Count the number of characters in a String. If it is not multiple of 3, then pad it with special characters (i.e. =) to make it multiple of 3.
Convert string to ASCII binary format 8-bit using the ASCII table.
After converting to binary format, divide binary data into chunks of 6-bits.
Convert chunks of 6-bit binary data to decimal numbers.
Convert decimals to string according to the base64 Index Table. This table can be an example, but as I said, 2 special characters may vary.
Now, we got the encoded version of the input string.
Let's make an example: convert string THS to base64 encoding string.
Count the number of characters: it is already a multiple of 3.
Convert to ASCII binary format 8-bit. We got (T)01010100 (H)01001000 (S)01010011
Divide binary data into chunks of 6-bits. We got 010101 000100 100001 010011
Convert chunks of 6-bit binary data to decimal numbers.We got 21 4 33 19
Convert decimals to string according to the base64 Index Table. We got VEhT
It's used for converting arbitrary binary data to ASCII text.
For example, e-mail attachments are sent this way.
“Base64 encoding schemes are commonly used when there is a need to encode binary data that needs be stored and transferred over media that are designed to deal with textual data. This is to ensure that the data remains intact without modification during transport”(Wiki, 2017)
Example could be the following: you have a web service that accept only ASCII chars. You want to save and then transfer user’s data to some other location (API) but recipient want receive untouched data. Base64 is for that. . . The only downside is that base64 encoding will require around 33% more space than regular strings.
Another Example:: uenc = url encoded = aHR0cDovL2xvYy5tYWdlbnRvLmNvbS9hc2ljcy1tZW4tcy1nZWwta2F5YW5vLXhpaS5odG1s = http://loc.querytip.com/asics-men-s-gel-kayano-xii.html.
As you can see we can’t put char “/” in URL if we want to send last visited URL as parameter because we would break attribute/value rule for “MOD rewrite” – GET parameter.
A full example would be: “http://loc.querytip.com/checkout/cart/add/uenc/http://loc.magento.com/asics-men-s-gel-kayano-xii.html/product/93/”
I use it in a practical sense when we transfer large binary objects (images) via web services. So when I am testing a C# web service using a python script, the binary object can be recreated with a little magic.
[In python]
import base64
imageAsBytes = base64.b64decode( dataFromWS )
The usage of Base64 I'm going to describe here is somewhat a hack. So if you don't like hacks, please do not go on.
I went into trouble when I discovered that MySQL's utf8 does not support 4-byte unicode characters since it uses a 3-byte version of utf8. So what I did to support full 4-byte unicode over MySQL's utf8? Well, base64 encode strings when storing into the database and base64 decode when retrieving.
Since base64 encoding and decoding is very fast, the above worked perfectly.
You have the following points to take note of:
Base64 encoding uses 33% more storage
Strings stored in the database wont be human readable (You could sell that as a feature that database strings use a basic form of encryption).
You could use the above method for any storage engine that does not support unicode.
Mostly, I've seen it used to encode binary data in contexts that can only handle ascii - or a simple - character sets.
The base64 is a binary to a text encoding scheme that represents binary data in an ASCII string format. base64 is designed to carry data stored in binary format across the channels. It takes any form of data and transforms it into a long string of plain text. Earlier we can not transfer a large amount of data like files because it is made up of 2⁸ bit bytes but our actual network uses 2⁷ bit bytes. This is where base64 encoding came into the picture. But, what actually does base64 mean?
let’s understand the meaning of base64.
base64 = base+64
we can call base64 as a radix-64 representation.base64 uses only 6-bits(2⁶ = 64 characters) to ensure the printable data is human readable. but, how? we can also write base65 or base78, but why only 64? let’s prove it.
base64 encoding contains 64 characters to encode any string.
base64 contains:
10 numeric value i.e., 0,1,2,3,…..9.
26 Uppercase alphabets i.e., A,B,C,D,…….Z.
26 Lowercase alphabets i.e., a,b,c,d,……..z.
two special characters i.e., +,/. Depends upon your OS.
The steps followed by the base64 algorithm are as follow:
count the number of characters in a String.
If it is not multiple of 3 pad with special character i.e., = to
make it multiple of 3.
Encode the string in ASCII format.
Now, it will convert the ASCII to binary format 8-bit each.
After converting to binary format, it will divide binary data into
chunks of 6-bits each.
The chunks of 6-bit binary data will now be converted to decimal
number format.
Using the base64 Index Table, the decimals will be again converted
to a string according to the table format.
Finally, we will get the encoded version of our input string.
To expand a bit on what Brad is saying: many transport mechanisms for email and Usenet and other ways of moving data are not "8 bit clean", which means that characters outside the standard ascii character set might be mangled in transit - for instance, 0x0D might be seen as a carriage return, and turned into a carriage return and line feed. Base 64 maps all the binary characters into several standard ascii letters and numbers and punctuation so they won't be mangled this way.
One hexadecimal digit is of one nibble (4 bits). Two nibbles make 8 bits which are also called 1 byte.
MD5 generates a 128-bit output which is represented using a sequence of 32 hexadecimal digits, which in turn are 32*4=128 bits. 128 bits make 16 bytes (since 1 byte is 8 bits).
Each Base64 character encodes 6 bits (except the last non-pad character which can encode 2, 4 or 6 bits; and final pad characters, if any). Therefore, per Base64 encoding, a 128-bit hash requires at least ⌈128/6⌉ = 22 characters, plus pad if any.
Using base64, we can produce the encoded output of our desired length (6, 8, or 10).
If we choose to decide 8 char long output, it occupies only 8 bytes whereas it was occupying 16 bytes for 128-bit hash output.
So, in addition to security, base64 encoding is also used to reduce the space consumed.
Base64 can be used for many purposes.
The primary reason is to convert binary data to something passable.
I sometimes use it to pass JSON data around from one site to another, store information
in cookies about a user.
Note:
You "can" use it for encryption - I don't see why people say you can't, and that it's not encryption, although it would be easily breakable and is frowned upon. Encryption means nothing more than converting one string of data to another string of data that can be either later decrypted or not, and that's what base64 does.