MD5 is 128 bits but why is it 32 characters? - hash

I read some docs about MD5. They said it is 128 bits, but why is it 32 characters? I can't make the character count work out.
1 byte is 8 bits
if 1 character is 1 byte
then 128 bits is 128/8 = 16 bytes right?
EDIT:
SHA-1 produces 160 bits, so how many characters are there?

32 chars is the hexadecimal representation; that's 2 chars per byte.

I wanted to summarize some of the answers into one post.
First, don't think of the MD5 hash as a character string but as a hex number. Therefore, each digit is a hex digit (0-15 or 0-F) and represents four bits, not eight.
Taking that further, one byte or eight bits are represented by two hex digits, e.g. b'1111 1111' = 0xFF = 255.
MD5 hashes are 128 bits in length and generally represented by 32 hex digits.
SHA-1 hashes are 160 bits in length and generally represented by 40 hex digits.
For the SHA-2 family, I think the hash length can be one of a pre-determined set. So SHA-512 can be represented by 128 hex digits.
Again, this post is just based on previous answers.
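If you want to check those lengths yourself, here is a quick Python sketch using the standard-library hashlib module (the input bytes are just an arbitrary example):

    import hashlib

    data = b"hello"  # arbitrary input; the digest length does not depend on it

    print(len(hashlib.md5(data).hexdigest()))     # 32 hex digits  -> 128 bits
    print(len(hashlib.sha1(data).hexdigest()))    # 40 hex digits  -> 160 bits
    print(len(hashlib.sha512(data).hexdigest()))  # 128 hex digits -> 512 bits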

A hex "character" (nibble) is different from a "character"
To be clear on bits vs. bytes vs. characters:
1 byte is 8 bits (for our purposes)
8 bits provides 2**8 possible combinations: 256 combinations
When you look at a hex character, it has 16 combinations: [0-9] + [a-f], the full range of 0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f.
16 is less than 256, so one hex character cannot store a whole byte.
16 is 2**4: that means one hex character can store 4 bits in a byte (half a byte).
Therefore, two hex characters can store 8 bits: 2**8 combinations.
A byte represented in hex is [0-9a-f][0-9a-f], and that covers both halves of the byte (we call a half-byte a nibble).
When you look at a regular single-byte character (we're totally going to skip multi-byte and wide characters here), it can store far more than 16 combinations.
The capabilities of the character are determined by the encoding. For instance, ISO 8859-1 stores an entire byte per character, and its characters use the entire 2**8 range.
If a hex character in an md5() output could store all that, you'd see all the lowercase letters, all the uppercase letters, all the punctuation, things like ¡°ÀÐàð, whitespace (newlines and tabs), and control characters (which you can't even see, and many of which aren't in use).
So they're clearly different, and I hope that provides a good breakdown of the differences.
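To make the hex-character/byte split above concrete, here is a tiny Python sketch (the byte value 0xB7 is just an arbitrary example):

    byte_value = 0xB7                       # one byte: 8 bits, one of 256 possible values
    print(format(byte_value, "08b"))        # 10110111 -> the full 8 bits
    print(format(byte_value, "02x"))        # b7       -> two hex characters, one per nibble
    print(format(byte_value >> 4, "04b"))   # 1011     -> high nibble, hex 'b'
    print(format(byte_value & 0xF, "04b"))  # 0111     -> low nibble,  hex '7'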

MD5 yields hexadecimal digits (0-15 / 0-F), so they are four bits each. 128 / 4 = 32 characters.
SHA-1 yields hexadecimal digits too (0-15 / 0-F), so 160 / 4 = 40 characters.
(Since they're mathematical operations, most hashing functions' output is commonly represented as hex digits.)
You were probably thinking of ASCII text characters, which are 8 bits.

One hex digit = 1 nibble (four bits)
Two hex digits = 1 byte (eight bits)
MD5 = 32 hex digits
32 hex digits = 16 bytes (32 / 2)
16 bytes = 128 bits (16 * 8)
The same applies to SHA-1 except it's 40 hex digits long.
I hope this helps.
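The same arithmetic can be run in reverse in Python; the hex string below is just an example 32-digit MD5 value:

    hex_digest = "9e107d9d372bb6826bd81d3542a419d6"  # example 32-digit MD5 value
    raw = bytes.fromhex(hex_digest)                  # decode the hex representation to raw bytes
    print(len(hex_digest))                           # 32 hex digits
    print(len(raw))                                  # 16 bytes (32 / 2)
    print(len(raw) * 8)                              # 128 bits (16 * 8)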

That's 32 hex characters - 1 hex character is 4 bits.

Those are hexadecimal digits, not characters. One digit = 4 bits.

They're not actually characters, they're hexadecimal digits.

For a clearer understanding, take the 128-bit hash value that MD5 calculates, paste it into a binary-to-hex converter, and look at the length of the hex value. You will get 32 hex characters.

Related

Regarding unicode characters and their utf8 binary representation

Out of curiosity, I wonder why, for example, the character "ł" with code point 322 has a UTF-8 binary representation of 11000101:10000010 (decimal 197:130) and not its actual binary representation 00000001:01000010 (decimal 1:66)?
UTF-8 encodes Unicode code points in the range U+0000..U+007F in a single byte. Code points in the range U+0080..U+07FF use 2 bytes, code points in the range U+0800..U+FFFF use 3 bytes, and code points in the range U+10000..U+10FFFF use 4 bytes.
When the code point needs two bytes, then the first byte starts with the bit pattern 110; the remaining 5 bits are the high order 5 bits of the Unicode code point. The continuation byte starts with the bit pattern 10; the remaining 6 bits are the low order 6 bits of the Unicode code point.
You are looking at ł U+0142 LATIN SMALL LETTER L WITH STROKE (decimal 322). The bit pattern representing hexadecimal 142 is:
00000001 01000010
With the UTF-8 sub-field grouping marked by colons, that is:
00000:001 01:000010
So the UTF-8 code is:
110:00101 10:000010
11000101 10000010
0xC5 0x82
197 130
The same basic ideas apply to 3-byte and 4-byte encodings: you chop off 6 bits per continuation byte, and combine the leading bits with the appropriate marker bits (1110 for 3 bytes, 11110 for 4 bytes; there are as many leading 1 bits as there are bytes in the complete character). There are a bunch of other rules that don't matter much to you right now. For example, you never encode a UTF-16 high surrogate (U+D800..U+DBFF) or a low surrogate (U+DC00..U+DFFF) in UTF-8 (or UTF-32, come to that). You never encode a non-minimal sequence (so although the bytes 0xC0 0x80 could be used to encode U+0000, this is invalid). One consequence of these rules is that the bytes 0xC0 and 0xC1 are never valid in UTF-8 (and neither are 0xF5..0xFF).
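That breakdown can be reproduced directly in Python; a minimal sketch (the formatting calls are mine):

    ch = "ł"                                     # U+0142 LATIN SMALL LETTER L WITH STROKE (decimal 322)
    encoded = ch.encode("utf-8")
    print(format(ord(ch), "016b"))               # 0000000101000010 -> the raw code point bits
    print([format(b, "08b") for b in encoded])   # ['11000101', '10000010']
    print([hex(b) for b in encoded])             # ['0xc5', '0x82']
    print(list(encoded))                         # [197, 130]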
UTF-8 is designed for compatibility with 7-bit ASCII.
To achieve this, the most significant bit (MSB) of each byte in a UTF-8 encoded byte sequence is used to signal whether that byte is part of a multi-byte encoding of a code point. If the MSB is set, then the byte is part of a sequence of 2 or more bytes that encode a single code point. If the MSB is not set, then the byte encodes a code point in the range 0..127.
Therefore, in UTF-8 the byte sequence [1][66] represents the two code points 1 and 66 respectively, since the MSB is not set (= 0) in either byte.
Furthermore, code point 322 must be encoded using a sequence of bytes where the MSB is set (= 1) in each byte.
The precise details of UTF-8 encoding are quite a bit more complex, but there are numerous resources that go into those details.
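A short Python illustration of that MSB rule, using the byte values from the question:

    # MSB clear in both bytes: each byte stands alone (code points 1 and 66)
    print(repr(bytes([1, 66]).decode("utf-8")))     # '\x01B' -> two code points
    # MSB set in both bytes: together they encode one code point (U+0142, 'ł')
    print(repr(bytes([197, 130]).decode("utf-8")))  # 'ł'     -> one code point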

How can I reduce memcached key size using base64 encoding?

Here you can read:
64-bit UID's are clever ways to identify a user, but suck when printed out. 18446744073709551616. 20 characters! Using base64 encoding, or even just hexadecimal, you can cut that down by quite a bit.
but as far as I know, 18446744073709551616 will result in a bigger string if it is encoded using Base64. I know that I'm missing something, because those memcached people are smart, and in the docs they mention more than once that Base64 encoding can be useful to improve a key before storing it in memcached. How is that?
What you're looking at is basically the decimal representation of 64 bits. They're probably talking about encoding the underlying bits directly to Base64 or hex, instead of encoding the decimal representation of the bits. They're essentially just talking about alphabet sizes. The larger the alphabet, the shorter the string:
64 bits as bits (alphabet of 2, 0 or 1) is 64 characters
64 bits as decimal (alphabet of 10, 0 - 9) is 20 characters
64 bits as hexadecimal (alphabet of 16, 0 - F) is 16 characters
etc...
Do not treat the UID like a string; use its 64-bit numerical representation instead. It takes exactly 64 bits, or 8 bytes. Encoding those 8 bytes using hexadecimal will result in a string like "FF1122EE44556677", which is 16 bytes. Using Base64 encoding you will get an even shorter string.
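A rough Python sketch of the size difference (the UID below is the largest value that actually fits in 64 bits; the quote's 18446744073709551616 is 2^64, one past the unsigned 64-bit maximum):

    import base64, struct

    uid = 2**64 - 1                                 # largest 64-bit value; still 20 decimal digits
    raw = struct.pack(">Q", uid)                    # the underlying 8 bytes, not the decimal text

    print(len(str(uid)))                            # 20 characters as decimal text
    print(len(raw.hex()))                           # 16 characters as hexadecimal
    print(len(base64.b64encode(raw)))               # 12 characters as Base64 (incl. '=' padding)
    print(len(base64.b64encode(raw).rstrip(b"=")))  # 11 characters without padding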

Clarification on Joel Spolsky's Unicode article

I'm reading the popular Unicode article from Joel Spolsky and there's one illustration that I don't understand.
What does "Hex Min, Hex Max" mean? What do those values represent? Min and max of what?
Binary can only have 1 or 0. Why do I see tons of letter "v" here?
http://www.joelonsoftware.com/articles/Unicode.html
The Hex Min / Hex Max columns define the range of Unicode characters (typically represented by their Unicode number in hex).
The v's refer to the bits of the original number.
So the first line is saying:
The Unicode characters in the range 0 (hex 00) to 127 (hex 7F) (a 7-bit number) are represented by a 1-byte bit string starting with '0' followed by all 7 bits of the Unicode number.
The second line is saying:
The Unicode numbers in the range 128 (hex 0080) to 2047 (hex 07FF) (an 11-bit number) are represented by a 2-byte bit string where the first byte starts with '110' followed by the first 5 of the 11 bits, and the second byte starts with '10' followed by the remaining 6 of the 11 bits.
etc
Hope that makes sense
Note that the table in Joel's article covers code points that do not, and never will, exist in Unicode. In fact, UTF-8 never needs more than 4 bytes, though the scheme underlying UTF-8 could be extended much further, as shown.
A more nuanced version of the table is available in How does a file with Chinese characters know how many bytes to use per character? It points out some of the gaps. For example, the bytes 0xC0, 0xC1, and 0xF5..0xFF can never appear in valid UTF-8. You can also see information about invalid UTF-8 at Really good bad UTF-8 example test data.
In the table you showed, the Hex Min and Hex Max values are the minimum and maximum U+wxyz values that can be represented using the number of bytes in the 'byte sequence in binary' column. Note that the maximum code point in Unicode is U+10FFFF (and that is defined/reserved as a non-character). This is the maximum value that can be represented by the surrogate encoding scheme in UTF-16 using just 4 bytes (two UTF-16 code units).
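Python's UTF-8 encoder can confirm where those Hex Min / Hex Max boundaries fall (a small sketch; the code points are the range edges from the table plus U+10FFFF):

    for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
        print(f"U+{cp:06X} -> {len(chr(cp).encode('utf-8'))} byte(s)")
    # U+00007F -> 1, U+000080 -> 2, U+0007FF -> 2, U+000800 -> 3,
    # U+00FFFF -> 3, U+010000 -> 4, U+10FFFF -> 4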

Why is a SHA-1 Hash 40 characters long if it is only 160 bit?

The title of the question says it all. I have been researching SHA-1, and most places I see it being 40 hex characters long, which to me is 640 bits. Couldn't it be represented just as well with only 10 hex characters, since 160 bits = 20 bytes? And one hex character can represent 2 bytes, right? Why is it twice as long as it needs to be? What am I missing in my understanding?
And couldn't a SHA-1 hash be even just 5 or fewer characters if using Base32 or Base36?
One hex character can only represent 16 different values, i.e. 4 bits (16 = 2^4).
40 × 4 = 160.
And no, you need much more than 5 characters in base-36.
There are 2^160 different SHA-1 hashes in total.
2^160 = 16^40, so this is another reason why we need 40 hex digits.
But 2^160 = 36^(160 · log_36(2)) = 36^30.9482..., so you still need 31 characters using base-36.
I think the OP's confusion comes from the fact that a string representing a SHA-1 hash takes 40 bytes (at least if you are using ASCII), which equals 320 bits (not 640 bits).
The reason is that the hash is in binary and the hex string is just an encoding of that. So if you were to use a more efficient encoding (or no encoding at all), you could take only 160 bits of space (20 bytes), but the problem with that is it won't be binary safe.
You could use base64 though, in which case you'd need about 27-28 bytes (or characters) instead of 40 (see this page).
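A quick check of those counts in Python (the input string is arbitrary):

    import base64, hashlib

    digest = hashlib.sha1(b"example").digest()  # raw 20 bytes = 160 bits

    print(len(digest))                          # 20 bytes, but not binary-safe as text
    print(len(digest.hex()))                    # 40 hex characters
    print(len(base64.b64encode(digest)))        # 28 characters (one of them '=' padding)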
There are two hex characters per 8-bit-byte, not two bytes per hex character.
If you are working with 8-bit bytes (as in the SHA-1 definition), then a hex character encodes a single high or low 4-bit nibble within a byte. So it takes two such characters for a full byte.
My answer only differs from the previous ones in my theory as to the EXACT origin of the OP's confusion, and in the baby steps I provide for elucidation.
A character takes up different numbers of bytes depending on the encoding used (see here). There are a few contexts these days when we use 2 bytes per character, for example when programming in Java (here's why). Thus 40 Java characters would equal 80 bytes = 640 bits, the OP's calculation, and 10 Java characters would indeed encapsulate the right amount of information for a SHA-1 hash.
Unlike the thousands of possible Java characters, however, there are only 16 different hex characters, namely 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E and F. But these are not the same as Java characters, and take up far less space than the encodings of the Java characters 0 to 9 and A to F. They are symbols signifying all the possible values represented by just 4 bits:
0 0000 4 0100 8 1000 C 1100
1 0001 5 0101 9 1001 D 1101
2 0010 6 0110 A 1010 E 1110
3 0011 7 0111 B 1011 F 1111
Thus each hex character is only half a byte, and 40 hex characters gives us 20 bytes = 160 bits - the length of a SHA-1 hash.
2 hex characters make up a range from 0-255, i.e. 0x00 == 0 and 0xFF == 255. So 2 hex characters are 8 bits, which makes 160 bits for your SHA digest.
SHA-1 is 160 bits
That translates to 20 bytes = 40 hex characters (2 hex characters per byte)

Would it be possible to have a UTF-8-like encoding limited to 3 bytes per character?

UTF-8 requires 4 bytes to represent characters outside the BMP. That's not bad; it's no worse than UTF-16 or UTF-32. But it's not optimal (in terms of storage space).
There are 13 byte values (C0-C1 and F5-FF) that are never used, and multi-byte sequences that are never used, such as the ones corresponding to "overlong" encodings. If these had been available to encode characters, then more of them could have been represented by 2-byte or 3-byte sequences (of course, at the expense of making the implementation more complex).
Would it be possible to represent all 1,114,112 Unicode code points by a UTF-8-like encoding with at most 3 bytes per character? If not, what is the maximum number of characters such an encoding could represent?
By "UTF-8-like", I mean, at minimum:
The bytes 0x00-0x7F are reserved for ASCII characters.
Byte-oriented find / index functions work correctly. You can't find a false positive by starting in the middle of a character like you can in Shift-JIS.
Update -- My first attempt to answer the question
Suppose you have a UTF-8-style classification of leading/trailing bytes. Let:
A = the number of single-byte characters
B = the number of values used for leading bytes of 2-byte characters
C = the number of values used for leading bytes of 3-byte characters
T = 256 - (A + B + C) = the number of values used for trailing bytes
Then the number of characters that can be supported is N = A + BT + CT².
Given A = 128, the optimum is at B = 0 and C = 43. This allows 310,803 characters, or about 28% of the Unicode code space.
Is there a different approach that could encode more characters?
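As a sanity check, that optimum can be brute-forced in a few lines of Python (a sketch of the N = A + BT + CT² formula above, with A fixed at 128):

    # N = A + B*T + C*T^2, with A = 128 and T = 256 - (A + B + C) = 128 - B - C
    best = max(
        (128 + B * T + C * T * T, B, C)
        for B in range(129)
        for C in range(129 - B)
        for T in [128 - B - C]
    )
    print(best)  # (310803, 0, 43): the optimum is B = 0, C = 43, giving 310,803 characters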
It would take a little over 20 bits to record all the Unicode code points (assuming your number is correct), leaving over 3 bits out of 24 for encoding which byte is which. That should be adequate.
I fail to see what you would gain by this, compared to what you would lose by not going with an established standard.
Edit: Reading the spec again, you want the values 0x00 through 0x7f reserved for the first 128 code points. That means you only have 21 bits in 3 bytes to encode the remaining 1,113,984 code points. 21 bits is barely enough, but it doesn't really give you enough extra to do the encoding unambiguously. Or at least I haven't figured out a way, so I'm changing my answer.
As to your motivations, there's certainly nothing wrong with being curious and engaging in a little thought exercise. But the point of a thought exercise is to do it yourself, not try to get the entire internet to do it for you! At least be up front about it when asking your question.
I did the math, and it's not possible (if wanting to stay strictly "UTF-8-like").
To start off, the four-byte range of UTF-8 covers U+010000 to U+10FFFF, which is a huge slice of the available characters. This is what we're trying to replace using only 3 bytes.
By special-casing each of the 13 unused prefix bytes you mention, you could gain 65,536 characters each, which brings us to a total of 13 * 0x10000, or 0xD0000.
This would bring the total 3-byte character range to U+010000 to U+0DFFFF, which is almost all, but not quite enough.
Sure it's possible. Proof:
2^24 = 16,777,216
So there is enough of a bit-space for 1,114,112 characters but the more crowded the bit-space the more bits are used per character. The whole point of UTF-8 is that it makes the assumption that the lower code points are far more likely in a character stream so the entire thing will be quite efficient even though some characters may use 4 bytes.
Assume 0-127 remains one byte. That leaves 8.4M spaces for 1.1M characters. You can then solve this as an equation. Choose an encoding scheme where the first byte determines how many bytes are used. So there are 128 values. Each of these will represent either 256 characters (2 bytes total) or 65,536 characters (3 bytes total). So:
256x + 65536(128-x) = 1114112 - 128
Solving this, you need 111 values of the first byte for 2-byte characters and the remaining 17 for 3-byte characters. To check:
128 + 111 * 256 + 17 * 65536 = 1,142,656, which covers all 1,114,112 code points.
To put it another way:
128 code points require 1 byte;
28,416 code points require 2 bytes; and
1,114,112 code points require 3 bytes.
Of course, this doesn't allow for the inevitable expansion of Unicode, which UTF-8 does. You can adjust this to the first byte meaning:
0-127 (128) = 1 byte;
128-191 (64) = 2 bytes;
192-255 (64) = 3 bytes.
This would be better because simple bitwise AND tests determine the length, and it gives an address space of 4,210,816 code points.
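As a purely illustrative sketch (my own code, not an established encoding), that adjusted scheme could look like this in Python: the first byte selects the total length, and each continuation byte carries a full 8 bits of payload.

    def encode(cp: int) -> bytes:
        """Hypothetical 1/2/3-byte encoder: lead byte in 0-127, 128-191, or 192-255."""
        if cp < 128:                              # 0-127: one byte, ASCII-compatible
            return bytes([cp])
        cp -= 128
        if cp < 64 * 256:                         # next 16,384 code points: two bytes
            return bytes([128 + (cp >> 8), cp & 0xFF])
        cp -= 64 * 256
        if cp < 64 * 65536:                       # next 4,194,304 code points: three bytes
            return bytes([192 + (cp >> 16), (cp >> 8) & 0xFF, cp & 0xFF])
        raise ValueError("outside the 4,210,816-code-point address space")

    print(encode(0x41))      # b'A'             (1 byte)
    print(encode(0x0142))    # b'\x80\xc2'      (2 bytes)
    print(encode(0x10FFFF))  # b'\xd0\xbf\x7f'  (3 bytes)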