How can I reduce memcached key size using base64 encoding?

Here you can read:
64-bit UID's are clever ways to identify a user, but suck when printed
out. 18446744073709551616. 20 characters! Using base64 encoding, or
even just hexadecimal, you can cut that down by quite a bit
but as far as I know, 18446744073709551616 will result in a bigger string if it is encoded using Base64. I know that I'm missing something, because those memcached people are smart, and in the docs they mention more than once that Base64 encoding can be useful to shorten a key before storing it in memcached. How is that?

What you're looking at is basically the decimal representation of 64 bits. They're probably talking about encoding the underlying bits directly to Base64 or hex, instead of encoding the decimal representation of the bits. They're essentially just talking about alphabet sizes. The larger the alphabet, the shorter the string:
64 bits as bits (alphabet of 2, 0 or 1) is 64 characters
64 bits as decimal (alphabet of 10, 0 - 9) is 20 characters
64 bits as hexadecimal (alphabet of 16, 0 - F) is 16 characters
etc...
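To make the alphabet-size point concrete, here is a minimal Python sketch (the UID value is just an arbitrary 64-bit example, not anything specific to memcached):

# The same 64-bit value rendered with different alphabet sizes
uid = 2**64 - 1                      # largest value that fits in 64 bits

as_bits    = format(uid, 'b')        # alphabet of 2
as_decimal = str(uid)                # alphabet of 10
as_hex     = format(uid, 'x')        # alphabet of 16

print(len(as_bits), len(as_decimal), len(as_hex))   # 64 20 16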

Don't treat the UID as a string; use its 64-bit numerical representation instead. It takes exactly 64 bits, or 8 bytes. Encoding those 8 bytes as hexadecimal gives a string like "FF1122EE44556677" - 16 bytes. Base64 encoding gives an even shorter string.
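For example, a quick Python sketch of that idea (the UID is arbitrary; to_bytes and the standard base64 alphabet are just one reasonable choice for building a memcached key):

import base64

uid = 18446744073709551615               # example 64-bit UID
raw = uid.to_bytes(8, 'big')             # the underlying 8 bytes

hex_key = raw.hex()                      # 16 characters
b64_key = base64.b64encode(raw).decode() # 12 characters, including one '=' of padding

print(hex_key, len(hex_key))             # ffffffffffffffff 16
print(b64_key, len(b64_key))             # //////////8= 12

Stripping the trailing '=' (it carries no information for a fixed 8-byte input) leaves an 11-character key.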

Related

Which hashing algorithm generates alphanumeric output?

I am looking for a hashing algorithm that generates alphanumeric output. I did a few tests with MD5, SHA3, etc. and they produce hexadecimal output.
Example:
Input: HelloWorld
Output[sha3_256]: 92dad9443e4dd6d70a7f11872101ebff87e21798e4fbb26fa4bf590eb440e71b
The 1st character in the above output is 9. Since the output is in hex format, the maximum possible values are [0-9][a-f].
I am trying to achieve the maximum possible range of values for the 1st character: [0-9][a-z][A-Z].
Any ideas would be appreciated. Thanks in advance.
Where MD5 computes a 128-bit hash and SHA-256 a 256-bit hash, the output they provide is nothing more than a 128-bit, respectively 256-bit, binary number. In short, that is a lot of zeros and ones. In order to use a more human-friendly representation of binary-coded values, software developers and system designers use hexadecimal numbers, which is a representation in base(16). For example, an 8-bit byte can have values ranging from 00000000 to 11111111 in binary form, which can be conveniently represented as 00 to FF in hexadecimal.
You could convert this binary number into base(32) if you want. That is represented using the characters "A-Z2-7". Or you could use base(64), which needs the characters "A-Za-z0-9+/". In the end, it is just a representation.
There is, however, some practical use to base(16) or hexadecimal. In computer lingo, a byte is 8 bits and a word consists of two bytes (16 bits). All of these can be comfortably represented hexadecimally, as 2^8 = 2^4×2^4 = 16×16, whereas 2^8 = 2^5×2^3 = 32×8. Hence, in base(32), a byte is not cleanly represented: you already need 5 bytes to get a clean base(32) representation of exactly 8 characters. That is not comfortable to deal with on a daily basis.
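If the goal is simply alphanumeric output, one option is to take the raw digest bytes and re-encode them in base(32), whose alphabet A-Z2-7 is purely alphanumeric. A Python sketch (SHA3-256 chosen only to mirror the question; strip the '=' padding if you want letters and digits only):

import base64
import hashlib

digest = hashlib.sha3_256(b"HelloWorld").digest()     # 32 raw bytes

hex_form = digest.hex()                               # 64 chars, alphabet [0-9a-f]
b32_form = base64.b32encode(digest).decode()          # 56 chars, alphabet [A-Z2-7] plus '=' padding
b64_form = base64.b64encode(digest).decode()          # 44 chars, but may contain '+' and '/'

print(len(hex_form), len(b32_form), len(b64_form))    # 64 56 44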

Why does my base64 encoded SHA-1 hash contain 56 chars?

Maybe a completely stupid question but I just cannot work it out...
First I need to generate an SHA-1 hash using part of my submission markup. The hash is correct and the output is:
0623f7917a1e2e09e7bcc700482392fba620e6a2
Next I need to base64 encode this hash to a 28 character string. This is where I am struggling, as when I run my code (or use the online generators) I get a 56 character string. The string I get is:
MDYyM2Y3OTE3YTFlMmUwOWU3YmNjNzAwNDgyMzkyZmJhNjIwZTZhMg==
My questions are 1) Is it possible to get a 28 char string from the hash above? and 2) How - where could I be going wrong?
Thank you for any help provided.
A SHA-1 hash is 20 bytes long, but those bytes are unlikely to all be printable characters.
Hence if we want to display those 20 bytes to a human we have to encode them in printable characters.
One way to do this is hexadecimal, where we take each byte, chop it in half and represent each half (a 4-bit value, numerically 0-15) with characters in the range 0123456789abcdef.
Thus each byte is encoded into 2 hex values, so our 20-byte hash value is encoded in 40 bytes of printable characters.
Hex is simple to calculate and it's easy for a human to look at an encoding and work out what the bytes actually look like, but it's not the most efficient as we're only using 16 out of the 95 ASCII printable characters.
Another way to encode arbitrary binary data into printable characters is Base 64. This is more efficient, encoding (on average) 3 bytes in 4 base64 values, but it's a lot harder for a human to parse the encoding.
The behaviour you are seeing is due to encoding a 20-byte hash value into 40 bytes of hex, and then encoding those 40 bytes of hex into 56 bytes (40 / 3 * 4, then rounded up to the nearest 4 bytes) of base64 data.
You need to either encode directly to base64 from the raw hash bytes (if available), or decode the hexadecimal value to bytes before encoding to base64.
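In Python, for instance, the two paths look like this (using the hex value from the question; this is just a sketch of the general idea):

import base64
import binascii

hex_hash = "0623f7917a1e2e09e7bcc700482392fba620e6a2"   # 40 hex chars = 20 raw bytes

# Encoding the 40-character hex *string* yields 56 base64 characters
wrong = base64.b64encode(hex_hash.encode("ascii"))
print(len(wrong))                                        # 56

# Decoding the hex back to 20 raw bytes first yields the expected 28 characters
raw = binascii.unhexlify(hex_hash)
right = base64.b64encode(raw)
print(right.decode(), len(right))                        # BiP3kXoeLgnnvMcASCOS+6Yg5qI= 28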

MD5 is 128 bits but why is it 32 characters?

I read some docs about MD5; they said it's 128 bits, but why is it 32 characters? I can't make the character count work out.
1 byte is 8 bits
if 1 character is 1 byte
then 128 bits is 128/8 = 16 bytes right?
EDIT:
SHA-1 produces 160 bits, so how many characters are there?
32 chars is the hexadecimal representation; that's 2 chars per byte.
I wanted to summarize some of the answers into one post.
First, don't think of the MD5 hash as a character string but as a hex number. Therefore, each digit is a hex digit (0-15 or 0-F) and represents four bits, not eight.
Taking that further, one byte or eight bits are represented by two hex digits, e.g. b'1111 1111' = 0xFF = 255.
MD5 hashes are 128 bits in length and generally represented by 32 hex digits.
SHA-1 hashes are 160 bits in length and generally represented by 40 hex digits.
For the SHA-2 family, I think the hash length can be one of a pre-determined set. So SHA-512 can be represented by 128 hex digits.
Again, this post is just based on previous answers.
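Those lengths are easy to verify with a few lines of Python (the input string is arbitrary):

import hashlib

data = b"hello"                                       # arbitrary input

md5  = hashlib.md5(data)
sha1 = hashlib.sha1(data)

print(len(md5.digest()),  len(md5.hexdigest()))       # 16 bytes -> 32 hex digits
print(len(sha1.digest()), len(sha1.hexdigest()))      # 20 bytes -> 40 hex digits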
A hex "character" (nibble) is different from a "character"
To be clear on the bits vs byte, vs characters.
1 byte is 8 bits (for our purposes)
8 bits provides 2**8 possible combinations: 256 combinations
When you look at a hex character,
16 combinations of [0-9] + [a-f]: the full range of 0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f
16 is less than 256, so one hex character cannot store a full byte.
16 is 2**4: that means one hex character can store 4 bits in a byte (half a byte).
Therefore, two hex characters, can store 8 bits, 2**8 combinations.
A byte represented in hex characters is [0-9a-f][0-9a-f], and that represents both halves of a byte (we call a half-byte a nibble).
When you look at a regular single-byte character, (we're totally going to skip multi-byte and wide-characters here)
It can store far more than 16 combinations.
The capabilities of the character are determined by the encoding. For instance, ISO 8859-1, which uses an entire byte per character, can represent 256 different values: letters, digits, punctuation, accented characters and more.
That full set takes up the entire 2**8 range.
If a hex-character in an md5() could store all that, you'd see all the lowercase letters, all the uppercase letters, all the punctuation, things like ¡°ÀÐàð, whitespace (like newlines and tabs), and control characters (which you can't even see and many of which aren't in use).
So they're clearly different and I hope that provides the best break down of the differences.
MD5 yields hexadecimal digits (0-15 / 0-F), so they are four bits each. 128 / 4 = 32 characters.
SHA-1 yields hexadecimal digits too (0-15 / 0-F), so 160 / 4 = 40 characters.
(Since they're mathematical operations, most hashing functions' output is commonly represented as hex digits.)
You were probably thinking of ASCII text characters, which are 8 bits.
One hex digit = 1 nibble (four-bits)
Two hex digits = 1 byte (eight-bits)
MD5 = 32 hex digits
32 hex digits = 16 bytes ( 32 / 2)
16 bytes = 128 bits (16 * 8)
The same applies to SHA-1 except it's 40 hex digits long.
I hope this helps.
That's 32 hex characters - 1 hex character is 4 bits.
Those are hexadecimal digits, not characters. One digit = 4 bits.
They're not actually characters, they're hexadecimal digits.
For a clear understanding, paste the MD5-computed 128-bit hash value into a binary-to-hex converter and look at the length of the hex value. You will get 32 hex characters.

If UTF-8 is an 8-bit encoding, why does it need 1-4 bytes?

On the Unicode site it's written that UTF-8 can be represented by 1-4 bytes. As I understand from this question https://softwareengineering.stackexchange.com/questions/77758/why-are-there-multiple-unicode-encodings UTF-8 is an 8-bit encoding.
So, what's the truth?
If it's 8-bits encoding, then what's the difference between ASCII and UTF-8?
If it's not, then why is it called UTF-8 and why do we need UTF-16 and others if they occupy the same memory?
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky - Wednesday, October 08, 2003
Excerpt from above:
Thus was invented the brilliant concept of UTF-8. UTF-8 was another system for storing your string of Unicode code points, those magic U+ numbers, in memory using 8 bit bytes. In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes.
This has the neat side effect that English text looks exactly the same in UTF-8 as it did in ASCII, so Americans don't even notice anything wrong. Only the rest of the world has to jump through hoops. Specifically, Hello, which was U+0048 U+0065 U+006C U+006C U+006F, will be stored as 48 65 6C 6C 6F, which, behold! is the same as it was stored in ASCII, and ANSI, and every OEM character set on the planet. Now, if you are so bold as to use accented letters or Greek letters or Klingon letters, you'll have to use several bytes to store a single code point, but the Americans will never notice. (UTF-8 also has the nice property that ignorant old string-processing code that wants to use a single 0 byte as the null-terminator will not truncate strings).
So far I've told you three ways of encoding Unicode. The traditional store-it-in-two-byte methods are called UCS-2 (because it has two bytes) or UTF-16 (because it has 16 bits), and you still have to figure out if it's high-endian UCS-2 or low-endian UCS-2. And there's the popular new UTF-8 standard which has the nice property of also working respectably if you have the happy coincidence of English text and braindead programs that are completely unaware that there is anything other than ASCII.
There are actually a bunch of other ways of encoding Unicode. There's something called UTF-7, which is a lot like UTF-8 but guarantees that the high bit will always be zero, so that if you have to pass Unicode through some kind of draconian police-state email system that thinks 7 bits are quite enough, thank you it can still squeeze through unscathed. There's UCS-4, which stores each code point in 4 bytes, which has the nice property that every single code point can be stored in the same number of bytes, but, golly, even the Texans wouldn't be so bold as to waste that much memory.
And in fact now that you're thinking of things in terms of platonic ideal letters which are represented by Unicode code points, those unicode code points can be encoded in any old-school encoding scheme, too! For example, you could encode the Unicode string for Hello (U+0048 U+0065 U+006C U+006C U+006F) in ASCII, or the old OEM Greek Encoding, or the Hebrew ANSI Encoding, or any of several hundred encodings that have been invented so far, with one catch: some of the letters might not show up! If there's no equivalent for the Unicode code point you're trying to represent in the encoding you're trying to represent it in, you usually get a little question mark: ? or, if you're really good, a box. Which did you get? -> �
There are hundreds of traditional encodings which can only store some code points correctly and change all the other code points into question marks. Some popular encodings of English text are Windows-1252 (the Windows 9x standard for Western European languages) and ISO-8859-1, aka Latin-1 (also useful for any Western European language). But try to store Russian or Hebrew letters in these encodings and you get a bunch of question marks. UTF 7, 8, 16, and 32 all have the nice property of being able to store any code point correctly.
UTF-8 is an 8-bit, variable-width encoding. The first 128 characters of Unicode, when represented with the UTF-8 encoding, have the same representation as the corresponding characters in ASCII.
To understand this further, Unicode treats characters as code points - mere numbers that can be represented in multiple ways (the encodings). UTF-8 is one such encoding. It is the most commonly used, because it gives the best space-consumption characteristics for ASCII-heavy text. If you are storing characters from the ASCII character set in the UTF-8 encoding, then the UTF-8 encoded data takes the same amount of space. This allowed applications that previously used ASCII to move to Unicode seamlessly (well, not quite, but it certainly didn't result in something like Y2K), because the character representations are the same.
I'll leave this extract here from RFC 3629, on how the UTF-8 encoding would work:
Char. number range | UTF-8 octet sequence
(hexadecimal) | (binary)
--------------------+---------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
You'll notice why the encoding will result in characters occupying anywhere between 1 and 4 bytes (the right-hand column) for different ranges of characters in Unicode (the left-hand column).
UTF-16, UTF-32, UCS-2, etc. employ different encoding schemes, where the code points are represented as 16-bit or 32-bit codes instead of the 8-bit code units that UTF-8 uses.
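The table can be checked directly, for example in Python (the sample characters are arbitrary picks from each range):

# One character from each row of the RFC 3629 table
for ch in ["A", "é", "€", "🚀"]:        # U+0041, U+00E9, U+20AC, U+1F680
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex()}")

# U+0041 -> 1 byte(s): 41
# U+00E9 -> 2 byte(s): c3a9
# U+20AC -> 3 byte(s): e282ac
# U+1F680 -> 4 byte(s): f09f9a80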
The '8-bit' encoding means that the individual bytes of the encoding use 8 bits. In contrast, pure ASCII is a 7-bit encoding as it only has code points 0-127. It used to be that software had problems with 8-bit encodings; one of the reasons for Base-64 and uuencode encodings was to get binary data through email systems that did not handle 8-bit encodings. However, it's been a decade or more since that ceased to be allowable as a problem - software has had to be 8-bit clean, or capable of handling 8-bit encodings.
Unicode itself is a 21-bit character set. There are a number of encodings for it:
UTF-32 where each Unicode code point is stored in a 32-bit integer
UTF-16 where many Unicode code points are stored in a single 16-bit integer, but some need two 16-bit integers (so it needs 2 or 4 bytes per Unicode code point).
UTF-8 where Unicode code points can require 1, 2, 3 or 4 bytes to store a single Unicode code point.
So, "UTF-8 can be represented by 1-4 bytes" is probably not the most appropriate way of phrasing it. "Unicode code points can be represented by 1-4 bytes in UTF-8" would be more appropriate.
Just complementing the other answers about UTF-8 coding, which uses 1 to 4 bytes per code point.
As people said above, a 4-byte code totals 32 bits, but of those 32 bits, 11 bits are used as prefixes in the control bytes, i.e. to identify the size of a Unicode symbol's code (between 1 and 4 bytes) and to make it possible to recover the text easily even when starting from the middle of it.
The golden question is: why do we need so many bits (11) for control in a 32-bit code? Wouldn't it be useful to have more than 21 bits for codification?
The point is that the scheme needs to make it easy to find the way back to the 1st byte of a code.
Thus, bytes other than the first byte cannot have all their bits freed up to encode a Unicode symbol, because otherwise they could easily be confused with the first byte of a valid UTF-8 code.
So the model is:
0UUUUUUU for a 1-byte code. We have 7 Us, so there are 2^7 = 128 possibilities, which are the traditional ASCII codes.
110UUUUU 10UUUUUU for a 2-byte code. Here we have 11 Us, so there are 2^11 - 2^7 = 2,048 - 128 = 1,920 possibilities. The gross number 2^11 is discounted by the first 2^7 = 128 codes, which correspond to the 1-byte legacy ASCII codes.
1110UUUU 10UUUUUU 10UUUUUU for a 3-byte code. Here we have 16 Us, so there are 2^16 - 2^11 = 65,536 - 2,048 = 63,488 possibilities.
11110UUU 10UUUUUU 10UUUUUU 10UUUUUU for a 4-byte code. Here we have 21 Us, so there are 2^21 - 2^16 = 2,097,152 - 65,536 = 2,031,616 possibilities. In all of these, U is a bit (0 or 1) used to encode a Unicode symbol in UTF-8.
So the total number of possibilities is 128 + 1,920 + 63,488 + 2,031,616 = 2,097,152 Unicode symbols.
In the available Unicode tables (for example, in the Unicode Pad app for Android, or here), the Unicode codes appear in the form U+H, where H is a hex number of 1 to 6 digits. For example, U+1F680 represents a rocket icon: 🚀.
This code is obtained from the U bits of the symbol's code, read from right to left (21 bits for 4 bytes, 16 for 3 bytes, 11 for 2 bytes and 7 for 1 byte), grouped into bytes, with the incomplete byte on the left padded with 0s.
Below we try to explain why 11 bits of control are needed. Part of the choices made were merely conventions picking between 0 and 1, with no deeper rationale.
Since 0 is used to indicate a one-byte code, 0... is always equivalent to the ASCII code of the 128 characters (backwards compatibility).
For symbols that use more than 1 byte, the 10 at the start of the 2nd, 3rd and 4th bytes always signals that we are in the middle of a code.
To avoid confusion: if the first byte starts with 11, it indicates that this 1st byte begins a Unicode character with a 2-, 3- or 4-byte code. On the other hand, 10 marks a continuation byte, that is, it never begins the encoding of a Unicode symbol. (Obviously the prefix for continuation bytes could not be 1, because 0... and 1... together would exhaust all possible bytes.)
If there were no rule for non-initial bytes, things would be very ambiguous.
With this choice, we know that an initial byte starts with 0 or 11, which can never be confused with a continuation byte, which starts with 10. Just by looking at a byte we already know whether it is an ASCII character, the beginning of a byte sequence (2, 3 or 4 bytes), or a byte from the middle of a byte sequence.
It could have been the opposite choice: the prefix 11 could indicate a continuation byte and the prefix 10 the start byte of a 2-, 3- or 4-byte code. That choice is just a matter of convention.
Also as a matter of choice, a 3rd bit of 0 in the 1st byte means a 2-byte UTF-8 code, and a 3rd bit of 1 in the 1st byte means a 3- or 4-byte UTF-8 code (again, it is impossible to adopt the prefix 11 for 2-byte symbols, since that too would exhaust all possible bytes: 0..., 10... and 11...).
So a 4th bit is required in the 1st byte to distinguish between 3-byte and 4-byte UTF-8 codes.
A 4th bit of 0 means a 3-byte code and 1 means a 4-byte code, which still uses an additional 0 bit that at first sight seems needless.
One of the reasons, beyond the pretty symmetry (0 is always the last prefix bit in the starting byte), for having the additional 0 as the 5th bit in the first byte of a 4-byte Unicode symbol is that it makes an unknown string almost recognizable as UTF-8, because no byte falls in the range 11111000 to 11111111 (F8 to FF, or 248 to 255).
If we hypothetically used 22 bits (taking the last 0 of the 5 prefix bits in the first byte as part of the character code of a 4-byte symbol), there would be 2^22 = 4,194,304 possibilities in total (22 because there would be 4 + 6 + 6 + 6 = 22 bits left for encoding the symbol and 4 + 2 + 2 + 2 = 10 bits as prefix).
With the adopted UTF-8 coding scheme (the 5th bit fixed at 0 for 4-byte codes), there are 2^21 = 2,097,152 possibilities, but only 1,112,064 of these are valid Unicode symbols (21 because there are 3 + 6 + 6 + 6 = 21 bits left for encoding the symbol and 5 + 2 + 2 + 2 = 11 bits as prefix).
As we have seen, not all of the possibilities with 21 bits are used (2,097,152) - far from it (just 1,112,064). So saving one bit doesn't bring tangible benefits.
Another reason is the possibility of using these unused codes for control functions, outside the Unicode world.
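The prefix scheme described above can be seen by dumping the raw UTF-8 bytes of the rocket example U+1F680 (a Python sketch):

for b in "🚀".encode("utf-8"):          # U+1F680 encodes to F0 9F 9A 80
    print(f"{b:08b}")

# 11110000   <- leading byte: prefix 11110 announces a 4-byte sequence
# 10011111   <- continuation byte: prefix 10
# 10011010   <- continuation byte: prefix 10
# 10000000   <- continuation byte: prefix 10
# The 3 + 6 + 6 + 6 = 21 payload bits, concatenated, give 0b000011111011010000000 = 0x1F680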

3-bit encoding = Octal; 4-bit encoding = Hexadecimal; 5-bit encoding =?

Is there an encoding that uses groups of 5 bits to encode binary data?
A-Z contains 26 chars and 0-9 contains 10 chars. That is 36 chars in total, which is more than enough for a 5-bit encoding (only 32 combinations are needed).
Why don't we use a 5-bit encoding instead of Octal or Hexadecimal?
As mentioned in a comment by @S.Lott, base64 (6-bit) is often used for encoding binary data as text when compactness is important.
For debugging purposes (e.g. hex dumps), we use hex because the size of a byte is evenly divisible by 4 bits, so each byte has one unique 2-digit hex representation no matter what other bytes are around it. That makes it easy to "see" the individual bytes when looking at a hex dump, and it's relatively easy to mentally convert between 8-bit binary and 2-digit hex as well. (In base64 there's no 1:1 correspondence between bytes and encoded characters; the same byte can produce different characters depending on its position and the values of other adjacent bytes.)
Yes. It's Base32. Maybe the name would be Triacontakaidecimal, but that's too long and hard to remember, so people simply call it base 32. Similarly, there's also Base64 for groups of 6 bits.
Base32 is much less commonly used than hexadecimal and base64 because it wastes too much space compared to base64, and its 5-bit groups don't divide evenly into common word sizes (8, 16, 32 or 64 bits) the way hexadecimal's 4-bit groups do. 6 is also an even number, hence a better fit than 5 on a binary computer.
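A rough size comparison in Python for the same 20 arbitrary bytes (20 is just the SHA-1 digest size, used here as an example):

import base64
import os

raw = os.urandom(20)                    # 20 arbitrary bytes

print(len(raw.hex()))                   # 40 chars (4 bits per char)
print(len(base64.b32encode(raw)))       # 32 chars (5 bits per char, no padding needed for 20 bytes)
print(len(base64.b64encode(raw)))       # 28 chars (6 bits per char, including '=' padding)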
Sure, why not:
Welcome to Clozure Common Lisp Version 1.7-dev-r14614M-trunk (DarwinX8664)!
? (let ((*print-base* 32)) (print 1234567))
15LK7
1234567
? (let ((*print-base* 32)) (print (expt 45 19)))
KAD5A5KM53ADJVMPNTHPL
25765451768359987049102783203125
?