Which hashing algorithm generates alphanumeric output?

I am looking for a hashing algorithm that generates alphanumeric output. I did a few tests with MD5, SHA3, etc., and they produce hexadecimal output.
Example:
Input: HelloWorld
Output[sha3_256]: 92dad9443e4dd6d70a7f11872101ebff87e21798e4fbb26fa4bf590eb440e71b
The 1st character in the above output is 9. Since the output is in hex format, the possible values for it are [0-9][a-f].
I am trying to get the maximum possible range of values for the 1st character: [0-9][a-z][A-Z].
Any ideas would be appreciated. Thanks in advance.

Where MD5 computes a 128-bit hash and SHA-256 a 256-bit hash, the output they provide is nothing more than a 128-bit, respectively 256-bit, binary number. In short, it is a lot of zeros and ones. In order to use a more human-friendly representation of binary-coded values, software developers and system designers use hexadecimal numbers, which is a representation in base(16). For example, an 8-bit byte can have values ranging from 00000000 to 11111111 in binary form, which can be conveniently represented as 00 to FF in hexadecimal.
You could convert this binary number into base(32) if you want. This is represented using the characters "A-Z2-7". Or you could use base(64), which needs the characters "A-Za-z0-9+/". In the end, it is just a representation.
There is, however, some practical use to base(16) or hexadecimal. In computer lingo, a byte is 8 bits and a word consists of two bytes (16 bits). All of these can be comfortably represented in hexadecimal, since 2^8 = 2^4 × 2^4 = 16 × 16. By contrast, 2^8 = 2^5 × 2^3 = 32 × 8, so in base(32) a byte is not cleanly represented. You already need 5 bytes to have a clean base(32) representation with 8 characters. That is not comfortable to deal with on a daily basis.
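If the goal is a wider alphabet for the first character, here is a minimal Python sketch (standard-library hashlib and base64 only; the input "HelloWorld" is the one from the question) of the same digest in several representations:
import hashlib
import base64

digest = hashlib.sha3_256(b"HelloWorld").digest()  # 32 raw bytes

print(digest.hex())                               # base(16): characters [0-9a-f]
print(base64.b32encode(digest).decode())          # base(32): characters [A-Z2-7], '=' padding
print(base64.b64encode(digest).decode())          # base(64): characters [A-Za-z0-9+/], '=' padding
print(base64.urlsafe_b64encode(digest).decode())  # URL-safe base(64): [A-Za-z0-9-_]
The underlying 256-bit hash is identical in every line; only the representation changes, so the first character can come from a larger alphabet without touching the hash algorithm itself.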

UTF8, codepoints, and their representation in Erlang and Elixir

Going through Elixir's handling of Unicode:
iex> String.codepoints("abc§")
["a", "b", "c", "§"]
Very good, and byte_size/1 of this is not 4 but 5, because the last char takes 2 bytes; I get that.
The ? operator (or is it a macro? I can't find the answer) tells me that
iex(69)> ?§
167
Great; so then I look into the UTF-8 encoding table, and see the value c2 a7 as the hex encoding for the char. That means the two bytes (as witnessed by byte_size/1) are c2 (194 in decimal) and a7 (167 in decimal). That 167 is the result I got when evaluating ?§ earlier. What I don't understand, exactly, is why that number is a "code point", as per the description of the ? operator. When I try to work backwards and evaluate the binary, I get what I want:
iex(72)> <<0xc2, 0xa7>>
"§"
And to make me go completely bananas, this is what I get in Erlang shell:
24> <<167>>.
<<"§">>
25> <<"\x{a7}">>.
<<"§">>
26> <<"\x{c2}\x{a7}">>.
<<"§"/utf8>>
27> <<"\x{c2a7}">>.
<<"§">>
!! While Elixir is only happy with the code above... what is it that I don't understand? Why is Erlang perfectly happy with a single byte, given that Elixir insists the char takes 2 bytes - and the Unicode table seems to agree?
The codepoint is what identifies the Unicode character. The codepoint for § is 167 (0xA7). A codepoint can be represented in bytes in different ways, depending on your encoding of choice.
The confusion here comes from the fact that the codepoint 167 (0xA7) is identified by the bytes 0xC2 0xA7 when encoded to UTF-8.
When you add Erlang to the conversation, you have to remember Erlang's default encoding was/is latin1 (there is an effort to migrate to UTF-8 but I am not sure if it made it to the shell - someone please correct me).
In latin1, the codepoint § (0xA7) is also represented by the byte 0xA7. So explaining your results directly:
24> <<167>>.
<<"§">> %% this is encoded in latin1
25> <<"\x{a7}">>.
<<"§">> %% still latin1
26> <<"\x{c2}\x{a7}">>.
<<"§"/utf8>> %% this is encoded in utf8, as the /utf8 modifier says
27> <<"\x{c2a7}">>.
<<"§">> %% this is latin1
The last one is quite interesting and potentially confusing. In Erlang binaries, if you pass an integer with value more than 255, it is truncated. So the last example is effectively doing <<49831>> which when truncated becomes <<167>>, which is again equivalent to <<"§">> in latin1.
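A small Python sketch of the same distinction (standard library only), separating the code point from its Latin-1 and UTF-8 encodings:
ch = "§"
print(ord(ch))                      # 167 -> the code point (0xA7), independent of any encoding
print(ch.encode("latin-1"))         # b'\xa7' -> one byte in latin1
print(ch.encode("utf-8"))           # b'\xc2\xa7' -> two bytes in UTF-8
print(b"\xc2\xa7".decode("utf-8"))  # '§' -> decoding the UTF-8 bytes back
print(0xC2A7 % 256)                 # 167 -> what Erlang's truncation of <<49831>> leaves behind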
The code point is a number assigned to the character. It's an abstract value, not dependent on any particular representation in actual memory somewhere.
In order to store the character, you have to convert the code point to some sequence of bytes. There are several different ways to do this; each is called a Unicode Transformation Format, and named UTF-n, where the n is the number of bits in the basic unit of encoding. There used to be a UTF-7, used where 7-bit ASCII was assumed and even the 8th bit of a byte couldn't be reliably transmitted; in modern systems, there are UTF-8, UTF-16, and UTF-32.
Since the largest code point value fits comfortably in 21 bits, UTF-32 is the simplest; you just store the code point as a 32-bit integer. (There could theoretically be a UTF-24 or even a UTF-21, but common modern computing platforms deal naturally with values that take up either exactly 8 or a multiple of 16 bits, and have to work harder to deal with anything else.)
So UTF-32 is simple, but inefficient. Not only does it have 11 extra bits that will never be needed, it has 5 bits that are almost never needed. Far and away most Unicode characters found in the wild are in the Basic Multilingual Plane, U+0000 through U+FFFF. UTF-16 lets you represent all of those code points as a plain integer, taking up half the space of UTF-32. But it can't represent anything from U+10000 on up that way, so part of the 0000-FFFF range is reserved as "surrogate pairs" that can be put together to represent a high-plane Unicode character with two 16-bit units, for a total of 32 bits again but only when needed.
Java uses UTF-16 internally, but Erlang (and therefore Elixir), along with most other programming systems, uses UTF-8. UTF-8 has the advantage of completely transparent compatibility with ASCII - all characters in the ASCII range (U+0000 through U+007F, or 0-127 decimal) are represented by single bytes with the corresponding value. But any characters with code points outside the ASCII range require more than one byte each - even those in the range U+0080 through U+00FF, decimal 128 through 255, which only take up one byte in the Latin-1 encoding that used to be the default before Unicode.
So with Elixir/Erlang "binaries", unless you go out of your way to encode things differently, you are using UTF-8. If you look at the high bit of the first byte of a UTF-8 character, it's either 0, meaning you have a one-byte ASCII character, or it's 1. If it's 1, then the second-highest bit is also 1, because the number of consecutive 1-bits counting down from the high bit before you get to a 0 bit tells you how many bytes total the character takes up. So the pattern 110xxxxx means the character is two bytes, 1110xxxx means three bytes, and 11110xxx means four bytes. (There is no legal UTF-8 character that requires more than four bytes, although the encoding could theoretically support up to seven.)
The rest of the bytes all have the two high bits set to 10, so they can't be mistaken for the start of a character. And the rest of the bits are the code point itself.
To use your case as an example, the code point for "§" is U+00A7 - that is, hexadecimal A7, which is decimal 167 or binary 10100111. Since that's greater than decimal 127, it will require two bytes in UTF-8. Those two bytes will have the binary form 110abcde 10fghijk, where the bits abcdefghijk will hold the code point. So the binary representation of the code point, 10100111, is padded out to 00010100111 and split into the sequences 00010, which replaces abcde in the UTF-8 template, and 100111, which replaces fghijk. That yields two bytes with binary values 11000010 and 10100111, which are C2 and A7 in hexadecimal, or 194 and 167 in decimal.
You'll notice that the second byte coincidentally has the same value as the code point you're encoding, but it's important to realize that this correspondence is just a coincidence. There are a total of 64 code points, from 128 (U+0080) through 191 (U+00BF), that work out that way: their UTF-8 encoding consists of a byte with decimal value 194 followed by a byte whose value is equal to the code point itself. But for the other 1,114,048 code points possible in Unicode, that is not the case.
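As a rough sketch of that bit arithmetic in Python (utf8_two_bytes is a hypothetical helper, not part of any library), encoding a two-byte code point by hand:
def utf8_two_bytes(codepoint):
    # Valid only for code points in the two-byte range U+0080..U+07FF.
    assert 0x80 <= codepoint <= 0x7FF
    byte1 = 0b11000000 | (codepoint >> 6)        # 110xxxxx: top 5 of the 11 payload bits
    byte2 = 0b10000000 | (codepoint & 0b111111)  # 10xxxxxx: low 6 payload bits
    return bytes([byte1, byte2])

print(utf8_two_bytes(0xA7).hex())  # 'c2a7'
print("§".encode("utf-8").hex())   # 'c2a7' -> matches the built-in encoder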

Why does my base64 encoded SHA-1 hash contain 56 chars?

Maybe a completely stupid question but I just cannot work it out...
First I need to generate an SHA-1 hash using part of my submission markup. The hash is correct and the output is:
0623f7917a1e2e09e7bcc700482392fba620e6a2
Next I need to base64 encode this hash to a 28-character string. This is where I am struggling, as when I run my code (or use the online generators) I get a 56-character string. The string I get is:
MDYyM2Y3OTE3YTFlMmUwOWU3YmNjNzAwNDgyMzkyZmJhNjIwZTZhMg==
My questions are: 1) Is it possible to get a 28-char string from the hash above? and 2) How... where could I be going wrong?
Thank you for any help provided.
A SHA-1 hash is 20 bytes long, but those bytes are unlikely to all be printable characters.
Hence if we want to display those 20 bytes to a human we have to encode them in printable characters.
One way to do this is hexadecimal, where we take each byte, chop it in half and represent each half (a 4-bit value, numerically 0-15) with characters in the range 0123456789abcdef.
Thus each byte is encoded into 2 hex values, so our 20-byte hash value is encoded in 40 bytes of printable characters.
Hex is simple to calculate and it's easy for a human to look at an encoding and work out what the bytes actually look like, but it's not the most efficient as we're only using 16 out of the 95 ASCII printable characters.
Another way to encode arbitrary binary data into printable characters is Base 64. This is more efficient, encoding (on average) 3 bytes in 4 base64 values, but it's a lot harder for a human to parse the encoding.
The behaviour you are seeing is due to encoding a 20-byte hash value into 40 bytes of hex, and then encoding those 40 bytes of hex into 56 bytes (40 / 3 * 4, then rounded up to the nearest 4 bytes) of base64 data.
You need to either encode directly to base64 from the raw hash bytes (if available), or decode the hexadecimal value to bytes before encoding to base64.
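A minimal Python sketch of the difference (standard-library hashlib and base64; the input string is a placeholder, not the asker's actual markup):
import hashlib
import base64

sha1 = hashlib.sha1(b"example submission markup")  # placeholder input

hex_then_b64 = base64.b64encode(sha1.hexdigest().encode()).decode()
raw_to_b64 = base64.b64encode(sha1.digest()).decode()

print(len(hex_then_b64))  # 56 -> base64 of the 40-character hex string
print(len(raw_to_b64))    # 28 -> base64 of the 20 raw digest bytes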

pbkdf2 key length

What is the $key_length in PBKDF2
It says that it will be derived from the input, but I see people using key lengths of 256 and greater; yet when I enter 256 as the key_length, the output is 512 characters. Is this intentional? Can I safely use 64 as the key_length so the output is 128 characters long?
$key_length is the number of output bytes that you desire from PBKDF2. (Note that if key_length is more than the number of output bytes of the hash algorithm, the whole iteration process is run again for each extra block, slowing down the hashing perhaps more than you desire. SHA-256 gives 32 bytes of output, for example, so asking for 33 bytes will take roughly twice as long as asking for 32.)
The doubling of the length that you mention is because the code converts the output bytes to hexadecimal (i.e. 2 characters per 1 byte) unless you specify $raw_output = true. The test vectors included specify $raw_output = false, since hexadecimal is simply easier to work with and post online. Depending on how you are storing the data in your application, you can decide if you want to store the results as hex, base64, or just raw binary data.
In the IETF specification, Password-Based Cryptography Specification Version 2.0, the key length is defined as the "intended length in octets of the derived key, a positive integer, at most (2^32 - 1) * hLen". Here hLen denotes the length in octets of the pseudorandom function output. For further details on PBKDF2 you can refer to How to store passwords securely with PBKDF2.
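A short Python sketch of both points (hashlib.pbkdf2_hmac from the standard library; the password, salt and iteration count are placeholders): the key length is a byte count, and hex-encoding the raw bytes doubles the character count.
import hashlib

raw = hashlib.pbkdf2_hmac("sha256", b"password", b"salt", 100000, dklen=64)

print(len(raw))        # 64  -> bytes of raw output (the requested key length)
print(len(raw.hex()))  # 128 -> characters once the bytes are hex-encoded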

Why do md5 and sha-* only use alphanumeric characters in their hash result?

I understand not wanting to use '\0', but all the rest of the extended ASCII range is usable, right?
Wouldn't this provide a much better/secure/"less coliding" hash?
You're starting from a false premise -- they produce a result that can (and does) include all 8-bit values from 0 to 255. Just for example, one of the test vectors for SHA-256 is an input of "abc". The result from this (in hexadecimal) is:
ba7816bf 8f01cfea 414140de 5dae2223 b00361a3 96177a9c b410ff61 f20015ad
Just within that test, the result includes bytes with values ranging from 0x00 to 0xff.
For display, that may be (often is) rendered in something like hexadecimal. For transmission in email they're often encoded with something like MIME or UUENCODE. The hash itself, however, is not limited in this way.
Transforming the result this way makes no difference to collision resistance -- you still have 160/256/whatever bits of actual data, but the representation is expanded.
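A quick Python check of that test vector (standard-library hashlib only) shows the raw digest bytes are arbitrary binary, not printable characters:
import hashlib

digest = hashlib.sha256(b"abc").digest()

print(digest.hex())              # ba7816bf...f20015ad, the hex rendering above
print(list(digest[:4]))          # [186, 120, 22, 191] -> the raw byte values
print(min(digest), max(digest))  # 0 255 -> includes unprintable bytes such as 0x00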
The result is just hexadecimal-encoded to be more readable.
In fact, those hash algorithms output numbers, not strings. The hexadecimal rendering uses only the letters a-f in combination with the digits 0-9.
MD5 produces a 128-bit (16-byte) hash.
SHA, depending on whether it is SHA-1 or SHA-256, produces either a 160-bit (20-byte) or a 256-bit (32-byte) hash.
Note that I'm talking about binary length/strength. The longer the hash, the less likely a collision.
The fact that most users stick it into a DB field or whatnot makes it convenient to convert it to ASCII using various binary-to-ASCII conversion algorithms. This should not influence the collision probability at all, since you'll end up with a larger ASCII string.
FWIW, I've been using SHA-1 and SHA-256 in crypto products in binary form for over 5 years, and I'd recommend choosing hashes in the following order, from strongest to weakest: SHA-256, SHA-1, MD5. There are websites that can "reverse" MD5, so I'd strongly suggest avoiding it.

md5 hash or crc32 which one to use in this case

I need a hash that can be represented in fewer than 26 chars.
MD5 produces a 32-char-long string; if I convert it to base 36, how good will it be?
I need the hash not for cryptography but rather for uniqueness, basically identifying each input based on the time of input and the input data. Currently I can think of this:
$hash=md5( str_ireplace(".","",microtime()).md5($input_data) ) ;
$unique_id= base_convert($hash,16,36) ;
Should I go like this, or use crc32, which will give a smaller hash size, but which I'm afraid won't be that unique?
I think a much simpler solution is possible.
According to your statement, you have 26 characters of space. However, to clarify what you and I each understand a "character" to be, let's do some digging.
The MD5 hash, according to Wikipedia, produces 16-byte hashes.
The CRC32 algorithm produces 4-byte hashes.
I understand "characters" (in the simplest sense) to be ASCII characters. Each ASCII character (e.g. A = 65) is 8 bits long.
The MD5 algorithm therefore produces 16 bytes * 8 bits per byte = 128 bits; CRC32 is 32 bits.
You must understand that hashes are not mathematically unique, but "likely to be unique."
So my solution, given your description, would be to represent the bits of the hash directly as ASCII characters.
If you only have the choice between MD5 and CRC32, the answer would be MD5. But you could also fit a SHA-1 160-bit hash into a string of fewer than 26 characters (at one character per byte it would be 20 characters long).
If you are concerned about the set of symbols that each hash uses, the usual hex output of both is in the set [0-9a-f], a subset of [A-Za-z0-9].
Finally, when you convert what are essentially numbers from one base to another, the number doesn't change, therefore the strength of the algorithm doesn't change; it just changes the way the number is represented.
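As a Python sketch of the arithmetic (standard library only; to_base36 is a hypothetical helper standing in for PHP's base_convert, which loses precision on integers this large): an MD5 digest rendered in base 36 needs at most 25 characters, since 36^25 > 2^128.
import hashlib
import string

ALPHABET = string.digits + string.ascii_lowercase  # 0-9a-z, the base-36 digit set

def to_base36(n):
    digits = []
    while n:
        n, r = divmod(n, 36)
        digits.append(ALPHABET[r])
    return "".join(reversed(digits)) or "0"

h = int(hashlib.md5(b"example input").hexdigest(), 16)
b36 = to_base36(h)
print(len(b36), b36)  # at most 25 characters, comfortably under the 26-char limit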