Maybe a completely stupid question but I just cannot work it out...
First I need to generate an SHA-1 hash using part of my submission markup. The hash is correct and the output is:
0623f7917a1e2e09e7bcc700482392fba620e6a2
Next I need to base64 encode this hash to a 28 character string. This is where I am struggling, as when I run my code (or use the online generators) I get a 56 character string. The string I get is:
MDYyM2Y3OTE3YTFlMmUwOWU3YmNjNzAwNDgyMzkyZmJhNjIwZTZhMg==
My questions are: 1) Is it possible to get a 28 character string from the hash above? And 2) how, and where could I be going wrong?
Thank you for any help provided.
A SHA-1 hash is 20 bytes long, but those bytes are unlikely to all be printable characters.
Hence if we want to display those 20 bytes to a human we have to encode them in printable characters.
One way to do this is hexadecimal, where we take each byte, chop it in half and represent each half (a 4-bit value, numerically 0-15) with characters in the range 0123456789abcdef.
Thus each byte is encoded into 2 hex values, so our 20-byte hash value is encoded in 40 bytes of printable characters.
Hex is simple to calculate and it's easy for a human to look at an encoding and work out what the bytes actually look like, but it's not the most efficient as we're only using 16 out of the 95 ASCII printable characters.
Another way to encode arbitrary binary data into printable characters is Base 64. This is more efficient, encoding (on average) 3 bytes in 4 base64 values, but it's a lot harder for a human to parse the encoding.
The behaviour you are seeing is due to encoding a 20-byte hash value into 40 bytes of hex, and then encoding those 40 bytes of hex into 56 bytes (40 / 3 * 4, then rounded up to the nearest 4 bytes) of base64 data.
You need to either encode directly to base64 from the raw hash bytes (if available), or decode the hexadecimal value to bytes before encoding to base64.
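In Python, for instance, the fix looks like this (a sketch using the digest string from the question): decode the 40 hex characters back to the 20 raw bytes before base64-encoding.

```python
import base64

# The 40-character hex digest from the question (represents 20 raw bytes)
hex_digest = "0623f7917a1e2e09e7bcc700482392fba620e6a2"

# Wrong: base64-encoding the hex *text* treats it as 40 bytes -> 56 chars
wrong = base64.b64encode(hex_digest.encode("ascii")).decode("ascii")

# Right: decode the hex back to its 20 raw bytes first, then base64 -> 28 chars
raw = bytes.fromhex(hex_digest)
right = base64.b64encode(raw).decode("ascii")

print(len(wrong), wrong)   # 56 MDYyM2Y3OTE3YTFlMmUwOWU3YmNjNzAwNDgyMzkyZmJhNjIwZTZhMg==
print(len(right), right)   # 28 BiP3kXoeLgnnvMcASCOS+6Yg5qI=
```

The 28-character length follows from the arithmetic in the answer: 20 bytes / 3 × 4, rounded up to the nearest multiple of 4, is 28.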
Related
I am looking for a hashing algorithm that generates alphanumeric output. I did a few tests with MD5, SHA3, etc. and they produce hexadecimal output.
Example:
Input: HelloWorld
Output[sha3_256]: 92dad9443e4dd6d70a7f11872101ebff87e21798e4fbb26fa4bf590eb440e71b
The 1st character in the above output is 9. Since the output is in hex format, the possible values are [0-9][a-f].
I am trying to achieve the maximum number of possible values for the 1st character: [0-9][a-z][A-Z].
Any ideas would be appreciated. Thanks in advance.
Where MD5 computes a 128-bit hash and SHA-256 a 256-bit hash, the output they provide is nothing more than a 128-bit, respectively 256-bit, binary number. In short, that is a lot of zeros and ones. In order to use a more human-friendly representation of binary-coded values, software developers and system designers use hexadecimal numbers, which is a representation in base(16). For example, an 8-bit byte can have values ranging from 00000000 to 11111111 in binary form, which can be conveniently represented as 00 to FF in hexadecimal.
You could convert this binary number into base(32) if you want, which is represented using the characters "A-Z2-7". Or you could use base(64), which needs the characters "A-Za-z0-9+/". In the end, it is just a representation.
There is, however, some practical use to base(16) or hexadecimal. In computer lingo, a byte is 8 bits and a word consists of two bytes (16 bits). All of these can be comfortably represented hexadecimally, as 2^8 = 2^4 × 2^4 = 16 × 16, whereas 2^8 = 2^5 × 2^3 = 32 × 8. Hence, in base(32), a byte is not cleanly represented: you already need 5 bytes (40 bits) to get a clean base(32) representation of 8 characters. That is not comfortable to deal with on a daily basis.
This is related to the following question:
Why is base128 not used?
If we want to represent binary data as printable characters, we can hex encode it using a set of 16 printable 'digits' from the ASCII set (yielding 2 digits per byte of data), or we can base64 encode it using a set of 64 printable characters from the ASCII set (yielding roughly 1.33 characters per byte of data).
There is no base128 encoding using ASCII characters because ASCII only contains 95 printable characters (there is Ascii85, though, which uses 85 characters: https://en.wikipedia.org/wiki/Ascii85)
What I wonder is whether there is any standardized representation that uses a selection of 256 printable unicode characters that can be represented in UTF-8, effectively yielding an encoding with 1 printable character per byte of data?
There is no such standard encoding. But it can easily be created: choose 256 suitable Unicode characters and use them to encode bytes 0 to 255.
Some of the characters will require 2 or more bytes to encode in UTF-8 as only 94 printable characters have a 1 byte encoding.
The most compact encoding you can achieve with this approach is to take these 94 characters (U+0021 to U+007E) and add 162 printable characters requiring 2 bytes for encoding, e.g. U+00A1 to U+0142. It results in an encoding requiring about 1.63 output bytes per input byte. So it's less efficient than Base64. That's probably the reason it hasn't been standardized.
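A minimal sketch of such a (non-standard) encoding in Python, assuming the character ranges suggested above (the 94 one-byte characters U+0021 to U+007E, plus the 162 two-byte characters U+00A1 to U+0142):

```python
# Alphabet: 94 one-byte UTF-8 chars + 162 two-byte UTF-8 chars = 256 symbols
ALPHABET = [chr(c) for c in range(0x21, 0x7F)] + [chr(c) for c in range(0xA1, 0x143)]
assert len(ALPHABET) == 256
DECODE = {ch: i for i, ch in enumerate(ALPHABET)}

def encode(data: bytes) -> str:
    # One printable character per input byte
    return "".join(ALPHABET[b] for b in data)

def decode(text: str) -> bytes:
    return bytes(DECODE[ch] for ch in text)

payload = bytes(range(256))
encoded = encode(payload)
assert decode(encoded) == payload

# Expected expansion: (94*1 + 162*2) / 256 = 418/256 ~ 1.63 UTF-8 bytes per input byte
ratio = len(encoded.encode("utf-8")) / len(payload)
print(round(ratio, 2))  # 1.63
```

One character per byte, but about 1.63 output bytes per input byte once serialized as UTF-8, confirming it loses to Base64's 1.33.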
Because it is not useful.
A 2-byte UTF-8 sequence can only encode 11 bits (codepoints 0 to 0x7FF).
But in Base64, 2 bytes already encode 12 bits, and it is much simpler.
For 16 bits (codepoints up to 0xFFFF) you need 3 bytes in UTF-8. Base64 can encode 18 bits in 3 bytes.
So: more complex and less efficient.
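The payload-bits-per-encoded-byte comparison implied above, as a quick sketch:

```python
# Payload bits carried per encoded byte for each scheme
utf8_2byte = 11 / 2   # 2-byte UTF-8 sequence: 11 payload bits (U+0080..U+07FF)
base64_2ch = 12 / 2   # 2 Base64 characters: 12 payload bits
utf8_3byte = 16 / 3   # 3-byte UTF-8 sequence: 16 payload bits (up to U+FFFF)
base64_3ch = 18 / 3   # 3 Base64 characters: 18 payload bits

# Base64 carries more payload at both sequence lengths
print(utf8_2byte < base64_2ch)   # True
print(utf8_3byte < base64_3ch)   # True
```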
It would also be more difficult. Correct Unicode text has restrictions on its codepoint sequences: the position and number of combining characters, and codepoints that should not be used (some only internally, some never).
I can't find an answer to this. If I encode a string with Base64, will the encoded output be unique based on the string? I ask because I want to create a token which will contain user information, so I need to make sure the output will be unique depending on the information.
For example if I encode "UnqUserId:987654321 Timestamp:01/02/03" will this be unique so no matter what other userid I put it in there will never be a collision?
Two years late, but here we go:
The short answer is yes, unique binary/hex values will always encode to a unique base64 encoded string.
BUT, multiple base64 encoded strings may represent a single binary/hex value.
This is because hex bytes are not aligned with base64 'digits'. A single hex byte is represented by 8 bits while a single base64 digit is represented by 6 bits. Therefore, any hex value that is not 6-bit aligned can have multiple base64 representations (though correctly implemented base64 encoders should encode to the same base64 representation).
An example of this misalignment is the hex value '0x433356c1'. This value is represented by 32-bits and base64 encodes into 'QzNWwQ=='. This 32-bit value, however, is not 6-bit aligned. So what happens? The base64 encoder pads four zero bits onto the end of the binary representation in this case to make the sequence 36-bits and consequently 6-bit aligned.
When decoding, the base64 decoder now has to decode into an 8-bit aligned value. It truncates the padded bits and decodes the first 32 bits into a hex value. For example, 'QzNWwc==' and 'QzNWwQ==' are different base64 encoded strings, but decode to the same hex value, 0x433356c1. If we look carefully, we notice that the first 32 bits are the same for both of these encoded strings:
'QzNWwc==':
010000 110011 001101 010110 110000 011100
'QzNWwQ==':
010000 110011 001101 010110 110000 010000
The only difference is the last four bits, which are ignored. Keep in mind that no base64 encoder should ever generate 'QzNWwc==' or any other base64 value for 0x433356c1 other than 'QzNWwQ==', since the added padding bits should always be zeros.
In conclusion, it is safe to assume that a unique binary/hex value will always encode to a unique base64 representation using correctly implemented base64 encoders. A 'collision' will only occur during decoding if base64 strings are generated without zeroing the padding/alignment bits.
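This can be checked directly in Python, whose (lenient, default-mode) decoder simply discards the trailing padding bits, so both spellings decode to the same bytes, while the encoder only ever produces the canonical one:

```python
import base64

value = bytes.fromhex("433356c1")  # the 32-bit value from the example

# A correct encoder produces exactly one encoding, with zeroed padding bits
assert base64.b64encode(value) == b"QzNWwQ=="

# A lenient decoder discards the 4 trailing padding bits, so the
# non-canonical spelling 'QzNWwc==' decodes to the very same bytes
assert base64.b64decode(b"QzNWwQ==") == value
assert base64.b64decode(b"QzNWwc==") == value
print("both decode to", value.hex())
```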
Here you can read:
64-bit UID's are clever ways to identify a user, but suck when printed
out. 18446744073709551616. 20 characters! Using base64 encoding, or
even just hexadecimal, you can cut that down by quite a bit
but as far as I know, 18446744073709551616 will result in a bigger string if it is encoded using Base64. I know that I'm missing something, because those memcached people are smart, and in the doc they mention more than once that using Base64 encoding could be useful to improve a key before storing it into memcached. How is that?
What you're looking at is basically the decimal representation of 64 bits. They're probably talking about encoding the underlying bits directly to Base64 or hex, instead of encoding the decimal representation of the bits. They're essentially just talking about alphabet sizes. The larger the alphabet, the shorter the string:
64 bits as bits (alphabet of 2, 0 or 1) is 64 characters
64 bits as decimal (alphabet of 10, 0 - 9) is 20 characters
64 bits as hexadecimal (alphabet of 16, 0 - F) is 16 characters
etc...
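A quick sketch in Python, taking the largest value that fits in 64 bits (2^64 - 1; note the quoted 18446744073709551616 is actually 2^64, one past the 64-bit range) and encoding its underlying 8 bytes:

```python
import base64

n = 2**64 - 1                  # largest value that fits in 64 bits
raw = n.to_bytes(8, "big")     # the 8 underlying bytes

as_decimal = str(n)                           # alphabet of 10
as_hex = raw.hex()                            # alphabet of 16
as_base64 = base64.b64encode(raw).decode()    # alphabet of 64

print(len(as_decimal), as_decimal)   # 20 18446744073709551615
print(len(as_hex), as_hex)           # 16 ffffffffffffffff
print(len(as_base64), as_base64)     # 12 //////////8=
```

Encoding the bits, not the decimal string, is what gets the key from 20 characters down to 12.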
Do not treat the UID like a string; rather use its 64-bit numerical representation, which takes exactly 64 bits or 8 bytes. Encoding 8 bytes using hexadecimal results in a string like "FF1122EE44556677" (16 bytes). Using base64 encoding you get an even shorter string.
Is there an encoding that uses 5 bits as one group to encode a binary data?
A-Z contains 26 chars and 0-9 contains 10 chars. That is 36 chars in total, which is more than sufficient for a 5-bit encoding (only 32 combinations needed).
Why don't we use a 5-bit encoding instead of Octal or Hexadecimal?
As mentioned in a comment by @S.Lott, base64 (6-bit) is often used for encoding binary data as text when compactness is important.
For debugging purposes (e.g. hex dumps), we use hex because the size of a byte is evenly divisible by 4 bits, so each byte has one unique 2-digit hex representation no matter what other bytes are around it. That makes it easy to "see" the individual bytes when looking at a hex dump, and it's relatively easy to mentally convert between 8-bit binary and 2-digit hex as well. (In base64 there's no 1:1 correspondence between bytes and encoded characters; the same byte can produce different characters depending on its position and the values of other adjacent bytes.)
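A small illustration of that position dependence in Python (hypothetical byte values chosen for the example):

```python
import base64

# In hex, the byte 0x41 always appears as '41', wherever it sits:
assert bytes([0x41]).hex() == "41"
assert bytes([0x00, 0x41]).hex() == "0041"   # still '41', just shifted

# In base64, the same byte produces different characters depending
# on its offset within the 3-byte group:
print(base64.b64encode(bytes([0x41])))         # b'QQ=='
print(base64.b64encode(bytes([0x00, 0x41])))   # b'AEE='
```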
Yes. It's Base32. Its proper name might be "triacontakaidecimal", but that's too long and hard to remember, so people simply call it base 32. Similarly, there's also Base64 for groups of 6 bits.
Base32 is much less commonly used than hexadecimal and base64 because it wastes too much space compared to base64, and uses an odd number of bits per character, unlike hexadecimal, whose 4-bit groups divide evenly into 8-, 16-, 32- or 64-bit quantities. 6 is also an even number, and hence works better than 5 on a binary computer.
Sure, why not:
Welcome to Clozure Common Lisp Version 1.7-dev-r14614M-trunk (DarwinX8664)!
? (let ((*print-base* 32)) (print 1234567))
15LK7
1234567
? (let ((*print-base* 32)) (print (expt 45 19)))
KAD5A5KM53ADJVMPNTHPL
25765451768359987049102783203125
?