Correct SHA256 implementation with UTF-8 characters - hash

I'm running into issues comparing SHA256 hashes generated by different languages/functions.
For example, SHA256("í") either returns:
f3df1f9c358ae8eceb8fce7c00614288d113ad55315f4ebb909774a7daadfc84
-or-
127035a8ff26256ea0541b5add6dcc3ecdaeea603e606f84e0fd63492fbab2c5
Which of the above hash is correct for a string of one character, and what's the correct way of handling UTF-8 strings?

Which of the above hash is correct for a string of one character
There is no "correct" answer. What's being hashed is the bytes, not the "character". What bytes are hashed exactly depends on the encoding of the string.
"í" in Windows-1252 is byte ED, which hashes as:
f3df1f9c358ae8eceb8fce7c00614288d113ad55315f4ebb909774a7daadfc84
"í" in UTF-8 is bytes C3 AD, which hashes as:
127035a8ff26256ea0541b5add6dcc3ecdaeea603e606f84e0fd63492fbab2c5
"í" in UTF-16LE is bytes ED 00, which hashes as:
430e2ca27910b5ee6e0ec56a12b81325c763376cb8e25a60362dce9444424f95
How exactly that works in various programming languages depends on the languages and the encodings they use for strings.

Related

Base64 SHA-512 hash not working as intended

Hello I'm trying to get the Base64 encoded value of a SHA512 hash. I want my output to match the output using this site but I can't seem to get it when I try step by step. For example,
The string admin gives x61Ey612Kl2gpFL56FT9weDnpSo4AV8j8+qx2AuTHdRyY036xxzTTrw10Wq3+4qQyB+XURPWx1ONxp3Y3pB37A== when I use the site above.
When I try it step by step, I use a SHA-512 hash generator on admin which results in C7AD44CBAD762A5DA0A452F9E854FDC1E0E7A52A38015F23F3EAB1D80B931DD472634DFAC71CD34EBC35D16AB7FB8A90C81F975113D6C7538DC69DD8DE9077EC
and then I use a Base64 encoder on that which gives me QzdBRDQ0Q0JBRDc2MkE1REEwQTQ1MkY5RTg1NEZEQzFFMEU3QTUyQTM4MDE1RjIzRjNFQUIxRDgwQjkzMURENDcyNjM0REZBQzcxQ0QzNEVCQzM1RDE2QUI3RkI4QTkwQzgxRjk3NTExM0Q2Qzc1MzhEQzY5REQ4REU5MDc3RUM=
which is different. How do I obtain the first output above?
There's two different transformations in play here: the SHA-512 hash of an input and the Base64 encoding of an input. They can be combined or used alone.
C7AD44CBAD762A5DA0A452F9E854FDC1E0E7A52A38015F23F3EAB1D80B931DD472634DFAC71CD34EBC35D16AB7FB8A90C81F975113D6C7538DC69DD8DE9077EC is the SHA-512 hash of the text admin represented in uppercase hexadecimal.
QzdBRDQ0Q0JBRDc2MkE1REEwQTQ1MkY5RTg1NEZEQzFFMEU3QTUyQTM4MDE1RjIzRjNFQUIxRDgwQjkzMURENDcyNjM0REZBQzcxQ0QzNEVCQzM1RDE2QUI3RkI4QTkwQzgxRjk3NTExM0Q2Qzc1MzhEQzY5REQ4REU5MDc3RUM= is the SHA-512 hash of the text admin represented in uppercase hexadecimal and then encoded with Base64.
x61Ey612Kl2gpFL56FT9weDnpSo4AV8j8+qx2AuTHdRyY036xxzTTrw10Wq3+4qQyB+XURPWx1ONxp3Y3pB37A== is the SHA-512 hash of the text admin in encoded with Base64. There was no intermediate transformation to hexadecimal.
In other words, x61Ey612Kl2gpFL56FT9weDnpSo4AV8j8+qx2AuTHdRyY036xxzTTrw10Wq3+4qQyB+XURPWx1ONxp3Y3pB37A== is the Base64 encoding of the hash output bytes, and QzdBRDQ0Q0JBRDc2MkE1REEwQTQ1MkY5RTg1NEZEQzFFMEU3QTUyQTM4MDE1RjIzRjNFQUIxRDgwQjkzMURENDcyNjM0REZBQzcxQ0QzNEVCQzM1RDE2QUI3RkI4QTkwQzgxRjk3NTExM0Q2Qzc1MzhEQzY5REQ4REU5MDc3RUM= is the Base64 encoding of the hash output text (in uppercase hexadecimal).

Are MD5 hashes always either capital or lowercase?

I'm passing an HMAC-MD5 encoded parameter into a form and the vendor is returning it as invalid. However, it matches what their hash generator gives me, with the exception of capitalization on the letters. What I did to get around this was use an lcase command. I'm wondering if this will cause me trouble later. Coldfusion generates the hashed string in capital letters, the vendor always seems to use lowercase; is it always one or the other or will they ever be mixed?
MD5 as every other hash function will produce binary output, in case of MD5 it is 16 bytes.
Because those bytes are difficult to handle, they are encoded to a string. In case of MD5 they are usually encoded to 32 lowercase hexadecimal digits, so every byte is represented by 2 characters.
Whether the target system accepts upper- or lowercase encodings or both is up to the system, it is unrelated to the hash function, both are different representations of a the same MD5 hash. So to answer your question, format the output as the target system requires it.
While RFC-1321 MD5 Message-Digest Algorithm doesn't discuss hexadecimal string encoding, the test suite does show results in lowercase.
The MD5 test suite (driver option "-x") should print the following results:
MD5 test suite:
MD5 ("") = d41d8cd98f00b204e9800998ecf8427e
MD5 ("a") = 0cc175b9c0f1b6a831c399e269772661
MD5 ("abc") = 900150983cd24fb0d6963f7d28e17f72
MD5 ("message digest") = f96b697d7cb7938d525a2f31aaf161d0
MD5 ("abcdefghijklmnopqrstuvwxyz") = c3fcd3d76192e4007dfb496cca67e13b
MD5 ("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789") =
d174ab98d277d9f5a5611c2c9f419d9f
MD5 ("123456789012345678901234567890123456789012345678901234567890123456
78901234567890") = 57edf4a22be3c955ac49da2e2107b67a
Lowercase is simply the outcome of C/C++ printf() format specifier %02x, not a requirement: "should print", not "must print".
Ref: RFC-1321 Appendix A.5 Test suite
A hex string can contain anything in the 0-9 and a-f, A-F range, so you should anticipate both upper and lower-case versions.
If you're really stuck trying to interface between two highly opinionated systems, force upper or lower case depending on your requirements.

Correct Hashing Algorithm/Function

Are there any secure hashing algorithms/functions that give all the letters and numbers, and not just 0-9,a-f.
So the output could contain: 0-9, a-z, A-Z and even some symbols.
Any hashing algorithm, really.
Hexadecimal is just a common representation for them. Look at this code snippet (using perl, because you didn't tag a programming language):
use Digest::MD5 qw/md5 md5_hex/;
use MIME::Base64;
my $str = 'Foobar';
# Hexadecimal representation
print md5_hex($str),"\n";
# Base64 encoded representation
print encode_base64(md5($str));
Output:
89d5739baabbbe65be35cbe61c88e06d
idVzm6q7vmW+NcvmHIjgbQ==
The first output is the hexadecimal representation of the MD5 digest of the string; the second is the Base64 encoded representation of the raw digest.
This would work with any digesting algorithm. It does not, however, affect how secure the underlying algorithm actually is.
Use your favorite hashing algorithm/function and convert the output to base64. A mechanism to do that in Java is here: how to convert hex to base64.
Note that the hash value will still be the same, but the presentation will be different. If there's a reason you want to use a fuller symbol set, perhaps you could edit your question.

How can I work with raw bytes in perl

Documentation all directs me to unicode support, yet I don't think my request has anything to do with Unicode. I want to work with raw bytes within the context of a single scalar; I need to be able to figure out its length (in bytes), take substrings of it (in bytes), write the bytes to disc, and over the network. Is there an easy way to do this, without treating the bytes as any sort of encoding in perl?
EDIT
More explicitly,
my $data = "Perl String, unsure of encoding and don't need to know";
my #data_chunked_into_1024_bytes_each = #???
Perl strings are, conceptually, strings of characters, which are positive 32-bit integers that (normally) represent Unicode code points. A byte string, in Perl, is just a string in which all the characters have values less than 256.
(That's the conceptual view. The internal representation is somewhat more complicated, as the perl interpreter tries to store byte strings — in the above sense — as actual byte strings, while using a generalized UTF-8 encoding for strings that contain character values of 256 or higher. But this is all supposed to be transparent to the user, and in fact mostly is, except for some ugly historical corner cases like the bitwise not (~) operator.)
As for how to turn a general string into a byte string, that really depends on what the string you have contains and what the byte string is supposed to contain:
If your string already is a string of bytes — e.g. if you read it from a file in binary mode — then you don't need to do anything. The string shouldn't contain any characters above 255 to being with, and if it does, that's an error and will probably be reported as such by the encryption code.
Similarly, if your string is supposed to encode text in the ASCII or ISO-8859-1 encodings (which encode the 7- and 8-bit subsets of Unicode respectively), then you don't need to do anything: any characters up to 255 are already correctly encoded, and any higher values are invalid for those encodings.
If your input string contains (Unicode) text that you want to encode in some other encoding, then you'll need to convert the string to that encoding. The usual way to do that is by using the Encode module, like this:
use Encode;
my $byte_string = encode( "name of encoding", $text_string );
Obviously, you can convert the byte string back to the corresponding character string with:
use Encode;
my $text_string = decode( "name of encoding", $byte_string );
For the special case of the UTF-8 encoding, it's also possible to use the built-in utf8::encode() function instead of Encode::encode():
utf8::encode( $string );
which does essentially the same thing as:
use Encode;
$string = encode( "utf8", $string );
Note that, unlike Encode::encode(), the utf8::encode() function modifies the input string directly. Also note that the "utf8" above refers to Perl's extended UTF-8 encoding, which allows values outside the official Unicode range; for strictly standards-compliant UTF-8 encoding, use "utf-8" with a hyphen (see Encode documentation for the gory details). And, yes, there's also a utf8::decode() function that does pretty much what you'd expect.
If I understood your question correctly, what you want is the pack/unpack functions: http://perldoc.perl.org/functions/pack.html
As long as your string doesn't contain characters above codepoint 255, it will mostly work as plain byte string, with length and substr operating on bytes. Additionally, most output functions like print expect octets/bytes by default and will actually complain if you try to stuff anything else to them.
You may need to explicitly encode/decode your output if it is known to be in some encoding, but more details can only be added if you ask another specific question for each problematic part of your program.

Play! hash password returns bad result

I'm using Play 1.2.1. I want to hash my users password. I thought that Crypto.passwordHash will be good, but it isn't. passwordHash documentation says it returns MD5 password hash. I created some user accounts in fixture, where I put md5 password hash:
...
User(admin):
login: admin
password: f1682b54de57d202ba947a0af26399fd
fullName: Administrator
...
The problem is, when I try to log in, with something like this:
user.password.equals(Crypto.passwordHash(password))
and it doesn't work. So I put a log statement in my autentify method:
Logger.info("\nUser hashed password is %s " +
"\nPassed password is %s " +
"\nHashed passed password is %s",
user.password, password, Crypto.passwordHash(password));
And the password hashes are indeed different, but hey! The output of passwordHash method isn't even an MD5 hash:
15:02:16,164 INFO ~
User hashed password is f1682b54de57d202ba947a0af26399fd
Passed password is <you don't have to know this :P>
Hashed passed password is 8WgrVN5X0gK6lHoK8mOZ/Q==
How about that? How to fix it? Or maybe I have to implement my own solution?
Crypto.passwordHash returns base64-encoded password hash, while you are comparing to hex-encoded.
MD5 outputs a sequence of 16 bytes, each byte having (potentially) any value between 0 and 255 (inclusive). When you want to print the value, you need to convert the bytes to a sequence of "printable characters". There are several possible conventions, the two main being hexadecimal and Base64.
In hexadecimal notation, each byte value is represented as two "hexadecimal digits": such a digit is either a decimal digit ('0' to '9') or a letter (from 'a' to 'f', case is irrelevant). The 16 bytes thus become 32 characters.
In Base64 encoding, each group of three successive bytes is encoded as four characters, taken in a list of 64 possible characters (digits, lowercase letters, uppercase letters, '+' and '/'). One or two final '=' signs may be added so that the encoded string consists in a number of characters which is multiple of 4.
Here, '8WgrVN5X0gK6lHoK8mOZ/Q==' is the Base64 encoding of a sequence of 16 bytes, the first one having value 241, the second one 104, then 43, and so on. In hexadecimal notation, the first byte would be represented by 'f1', the second by '68', the third by '2b'... and the hexadecimal notation of the complete sequence of 16 bytes is then 'f1682b54de57d202ba947a0af26399fd', the value that you expected.
The play.libs.Codec class contains methods for decoding and encoding Base64 and hexadecimal notations. It also contains Codec.hexMD5() which performs MD5 hashing and returns the value in hexadecimal notation instead of Base64.
as Nickolay said you are comparing Hex vs Base-64 strings. Also, I would recommend using BCrypt for that, not the Crypto tool of Play.