Why are the lengths of hashes always multiples of 16?

I can't seem to find an answer I can understand to the question of why hash digests tend to be a multiple of 16 characters long. For example: SHA-512 has 128 characters, SHA-256 has 64 characters, MD5 has 32 characters, etc. What is the reason for that?

Related

How to hash two 32-bit integers into one 32-bit integer without collision?

I am looking for a one-way hash to combine two 32-bit integers into one 32-bit integer. I am not sure whether it's feasible to do this without collisions.
Edit:
I think my integers are generally small. One of them rarely takes more than 14 bits, and the other one rarely takes more than 20 bits.
Edit 2: Thanks for the help in the comments. I think that in cases where the combination of the two integers takes more than 32 bits, I can handle things differently and skip hashing. With that in mind, how should I hash my integers?
Thanks!
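If the two values together fit in 32 bits, packing their bits side by side is collision-free, because distinct (a, b) pairs always produce distinct results; for wider pairs, no collision-free mapping into 32 bits exists (pigeonhole principle). A minimal PHP sketch of that idea, assuming an illustrative 12-bit/20-bit split (the function names and the split are assumptions, not something from the question):
// Hypothetical helper: pack a 12-bit value and a 20-bit value into one 32-bit value.
// Collision-free because distinct (a, b) pairs always produce distinct results.
function pack_pair(int $a, int $b): int {
    if ($a < 0 || $b < 0 || $a >= (1 << 12) || $b >= (1 << 20)) {
        // The pair does not fit in 32 bits; such cases would have to be
        // handled separately, as suggested in Edit 2 above.
        throw new InvalidArgumentException('pair does not fit in 32 bits');
    }
    return ($a << 20) | $b;  // upper 12 bits hold $a, lower 20 bits hold $b
}
// Unpacking recovers the original pair, which shows there are no collisions.
function unpack_pair(int $packed): array {
    return [$packed >> 20, $packed & ((1 << 20) - 1)];
}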

How can I reduce memcached key size using base64 encoding?

Here you can read:
64-bit UID's are clever ways to identify a user, but suck when printed
out. 18446744073709551616. 20 characters! Using base64 encoding, or
even just hexadecimal, you can cut that down by quite a bit
but as far as I know, 18446744073709551616 will result in a bigger string if it is encoded using Base64. I know I'm missing something, because the memcached people are smart, and in the docs they mention more than once that Base64-encoding a key before storing it in memcached can be useful. How is that?
What you're looking at is basically the decimal representation of 64 bits. They're probably talking about encoding the underlying bits directly to Base64 or hex, instead of encoding the decimal representation of the bits. They're essentially just talking about alphabet sizes. The larger the alphabet, the shorter the string:
64 bits as bits (alphabet of 2, 0 or 1) is 64 characters
64 bits as decimal (alphabet of 10, 0 - 9) is 20 characters
64 bits as hexadecimal (alphabet of 16, 0 - F) is 16 characters
etc...
Don't treat the UID as a string; use its 64-bit numerical representation instead. It takes exactly 64 bits, or 8 bytes. Encoding those 8 bytes as hexadecimal results in a string like "FF1122EE44556677", which is 16 bytes. Using Base64 encoding you get an even shorter string.
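A small PHP sketch of the length difference, using a made-up UID value (the value is illustrative; Base64 of 8 raw bytes is always 11 characters plus one padding character):
$uid = 0x1122334455667788;                           // hypothetical 64-bit UID
$raw = pack('J', $uid);                              // the same value as 8 raw bytes (big-endian)
echo strlen((string) $uid), "\n";                    // 19 characters as decimal text (up to 20)
echo strlen(bin2hex($raw)), "\n";                    // 16 characters as hexadecimal
echo strlen(rtrim(base64_encode($raw), '=')), "\n";  // 11 characters as Base64 (12 with padding)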

md5 hash or crc32 which one to use in this case

I need a hash that can be represented in fewer than 26 characters.
MD5 produces a 32-character string; if I convert it to base 36, how good will it be?
I need the hash not for cryptography but for uniqueness, basically to identify each input based on the time of input and the input data. Currently I am thinking of this:
// proposed approach: hash the current time (dots removed) concatenated with a hash of the input
$hash = md5( str_ireplace(".", "", microtime()) . md5($input_data) );
// re-encode the 32-character hex string in base 36 to shorten it
// (note: base_convert() may lose precision on numbers this large)
$unique_id = base_convert($hash, 16, 36);
Should I go with this, or use crc32, which gives a smaller hash but which I'm afraid won't be as unique?
I think a much simpler solution could take place.
According to your statement, you have 26 characters of space. However, to make sure we mean the same thing by "character", let's do some digging.
According to Wikipedia, the MD5 hash is 16 bytes.
The CRC32 algorithm produces 4-byte hashes.
I understand "characters" (in the simplest sense) to be ASCII characters. Each ASCII character (e.g. A = 65) is stored in 8 bits.
So the MD5 algorithm produces hashes of 16 bytes * 8 bits per byte = 128 bits, and CRC32 produces 32 bits.
You must understand that hashes are not mathematically unique, but "likely to be unique."
So my solution, given your description, would be to represent the raw bytes of the hash as single-byte characters.
If you only have the choice between MD5 and CRC32, the answer would be MD5. But you could also fit a 160-bit SHA-1 hash into a string of fewer than 26 characters (it would be 20 single-byte characters long).
If you are concerned about the set of symbols each hash uses: in their usual hex representation, both use only the symbols [0-9a-f].
Finally, when you convert what is essentially a number from one base to another, the number itself doesn't change, so the strength of the algorithm doesn't change; only the way the number is represented changes.
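One concrete way in PHP to get the full 128 bits under the 26-character limit is to Base64-encode the raw digest bytes instead of using the 32-character hex form. This is only a sketch, not the asker's base-36 approach, and the URL-safe alphabet swap is just one optional choice:
$input_data = 'example payload';                            // hypothetical input
$raw = md5(microtime() . $input_data, true);                // 16 raw bytes (128 bits), not the 32-char hex form
$key = rtrim(strtr(base64_encode($raw), '+/', '-_'), '=');  // swap to a URL-safe alphabet, drop padding
echo $key, ' (', strlen($key), " chars)\n";                 // always 22 characters, under the 26-char limit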

MD5 is 128 bits but why is it 32 characters?

I read some docs about MD5; they say it is 128 bits, but then why is it 32 characters? I can't make the character math work out.
1 byte is 8 bits
if 1 character is 1 byte
then 128 bits is 128/8 = 16 bytes right?
EDIT:
SHA-1 produces 160 bits, so how many characters is that?
32 chars as the hexadecimal representation; that's 2 chars per byte.
I wanted to summarize some of the answers into one post.
First, don't think of the MD5 hash as a character string but as a hex number. Therefore, each digit is a hex digit (0-15 or 0-F) and represents four bits, not eight.
Taking that further, one byte or eight bits are represented by two hex digits, e.g. b'1111 1111' = 0xFF = 255.
MD5 hashes are 128 bits in length and generally represented by 32 hex digits.
SHA-1 hashes are 160 bits in length and generally represented by 40 hex digits.
For the SHA-2 family, the hash length is one of a predetermined set of sizes (224, 256, 384, or 512 bits). So SHA-512 is represented by 128 hex digits.
Again, this post is just based on previous answers.
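A quick PHP check of those lengths (hex versus raw output; the input string is arbitrary):
$data = 'hello';                           // arbitrary input
echo strlen(md5($data)), "\n";             // 32 hex digits
echo strlen(md5($data, true)), "\n";       // 16 raw bytes = 128 bits
echo strlen(sha1($data)), "\n";            // 40 hex digits
echo strlen(sha1($data, true)), "\n";      // 20 raw bytes = 160 bits
echo strlen(hash('sha512', $data)), "\n";  // 128 hex digits = 512 bits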
A hex "character" (nibble) is different from a "character"
To be clear on bits vs. bytes vs. characters:
1 byte is 8 bits (for our purposes)
8 bits provides 2**8 possible combinations: 256 combinations
When you look at a hex character,
16 combinations of [0-9] + [a-f]: the full range of 0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f
16 is less than 256, so one hex character cannot store a full byte.
16 is 2**4: that means one hex character can store 4 bits (half a byte).
Therefore, two hex characters can store 8 bits, i.e. 2**8 combinations.
A byte represented as hex characters is [0-9a-f][0-9a-f], and that represents both halves of a byte (we call a half-byte a nibble).
When you look at a regular single-byte character (we're totally going to skip multi-byte and wide characters here),
It can store far more than 16 combinations.
The capabilities of the character are determined by the encoding. For instance, ISO 8859-1, which uses an entire byte per character, covers unaccented and accented letters, digits, punctuation, and control codes.
That full repertoire takes up the entire 2**8 range.
If a hex character in an md5() output could store all of that, you'd see all the lowercase letters, all the uppercase letters, all the punctuation, characters like ¡°ÀÐàð, whitespace (newlines and tabs), and control characters (which you can't even see, and many of which aren't in use).
So they're clearly different, and I hope that provides the best breakdown of the differences.
MD5 yields hexadecimal digits (0-15 / 0-F), so they are four bits each. 128 / 4 = 32 characters.
SHA-1 yields hexadecimal digits too (0-15 / 0-F), so 160 / 4 = 40 characters.
(Since they're mathematical operations, most hashing functions' output is commonly represented as hex digits.)
You were probably thinking of ASCII text characters, which are 8 bits.
One hex digit = 1 nibble (four bits)
Two hex digits = 1 byte (eight bits)
MD5 = 32 hex digits
32 hex digits = 16 bytes (32 / 2)
16 bytes = 128 bits (16 * 8)
The same applies to SHA-1 except it's 40 hex digits long.
I hope this helps.
That's 32 hex characters - 1 hex character is 4 bits.
Those are hexadecimal digits, not characters. One digit = 4 bits.
They're not actually characters, they're hexadecimal digits.
For a clear understanding, paste the 128-bit MD5 hash value into a binary-to-hex converter and look at the length of the hex value: you will get 32 hex characters.

How many characters can be mapped with Unicode?

I am asking for the count of all the possible valid combinations in Unicode, with an explanation. I know a character can be encoded as 1, 2, 3, or 4 bytes. I also don't understand why continuation bytes have restrictions even though the starting byte of a character already makes clear how long the sequence should be.
I am asking for the count of all the possible valid combinations in Unicode with explanation.
1,111,998: 17 planes × 65,536 characters per plane - 2048 surrogates - 66 noncharacters
Note that UTF-8 and UTF-32 could theoretically encode much more than 17 planes, but the range is restricted based on the limitations of the UTF-16 encoding.
137,929 code points are actually assigned in Unicode 12.1.
I also don't understand why continuation bytes have restrictions even though the starting byte of a character already makes clear how long the sequence should be.
The purpose of this restriction in UTF-8 is to make the encoding self-synchronizing.
For a counterexample, consider the Chinese GB 18030 encoding. There, the letter ß is represented as the byte sequence 81 30 89 38, which contains the encoding of the digits 0 and 8. So if you have a string-searching function not designed for this encoding-specific quirk, then a search for the digit 8 will find a false positive within the letter ß.
In UTF-8, this cannot happen, because the non-overlap between lead bytes and trail bytes guarantees that the encoding of a shorter character can never occur within the encoding of a longer character.
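A small PHP sketch of that property (the function name is illustrative): because continuation bytes always match the bit pattern 10xxxxxx, a scanner can skip them to find the next character boundary from any byte offset.
// Hypothetical helper: given a byte offset, return the offset of the next
// character boundary (a lead byte or plain ASCII byte) in a UTF-8 string.
function next_char_boundary(string $utf8, int $pos): int {
    while ($pos < strlen($utf8) && (ord($utf8[$pos]) & 0xC0) === 0x80) {
        $pos++;  // bytes in the range 0x80-0xBF are continuation bytes
    }
    return $pos;
}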
Unicode allows for 17 planes, each of 65,536 possible characters (or 'code points'). This gives a total of 1,114,112 possible characters. At present, only about 10% of this space has been allocated.
The precise details of how these code points are encoded differ with the encoding, but your question makes it sound like you are thinking of UTF-8. The reason for the restrictions on continuation bytes is presumably so that it is easy to find the beginning of the next character (continuation bytes are always of the form 10xxxxxx, but a starting byte can never be of this form).
Unicode supports 1,114,112 code points. There are 2,048 surrogate code points, giving 1,112,064 scalar values. Of these, 66 are noncharacters, leading to 1,111,998 possible encoded characters (unless I made a calculation error).
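The same arithmetic as a quick PHP check:
echo (17 * 65536), "\n";              // 1114112 code points
echo (17 * 65536 - 2048), "\n";       // 1112064 scalar values (surrogates excluded)
echo (17 * 65536 - 2048 - 66), "\n";  // 1111998 possible encoded characters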
To give a metaphorically accurate answer, all of them.
Continuation bytes in the UTF-8 encoding allow for resynchronization of the encoded octet stream in the face of "line noise". The decoder merely needs to scan forward for a byte that does not have a value between 0x80 and 0xBF to know that it is looking at the start of a new code point.
In theory, UTF-8 as originally designed allows for the expression of characters whose Unicode character number is up to 31 bits in length. In practice, this encoding is actually implemented on services like Twitter, where the maximal-length tweet can encode up to 4,340 bits' worth of data (140 characters [valid and invalid], times 31 bits each).
According to Wikipedia, Unicode 12.1 (released in May 2019) contains 137,994 distinct characters.
Unicode has 0x110000 code points in hexadecimal, which is 1,114,112 in decimal.