What pool of characters do MD5 and SHA have?

Do MD5 and SHA hashes only contain alphanumeric characters (i.e., A to Z and 0 to 9), or do they exclude some characters?

MD5 and SHA hashes are binary in their raw form; however, their common representation is a hex-encoded string, which contains only the characters [a-fA-F0-9].
So if this is what you meant, then the characters G-Z and g-z are "excluded".
Another, less common, representation is Base64, which uses the characters [0-9a-zA-Z+/] (plus = for padding).
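As a quick illustration (a minimal Python sketch, not part of the original answer), the same raw digest can be shown in both representations:

import base64
import hashlib

digest = hashlib.md5(b"hello").digest()      # 16 raw bytes, not printable text
print(hashlib.md5(b"hello").hexdigest())     # hex form: characters [0-9a-f] only
print(base64.b64encode(digest).decode())     # Base64 form: [A-Za-z0-9+/] plus '=' padding

The same applies to the SHA family; only the digest length differs (e.g. hashlib.sha256).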

Why does PowerShell generated base64 string have dots in it when decoding with something else than PowerShell

I have my code like:
$x = "This text needs to be encoded"
$z = [System.Text.Encoding]::Unicode.GetBytes($x)
$y = [System.Convert]::ToBase64String($z)
Write-Host("$y")
And the following gets printed to the console:
VABoAGkAcwAgAHQAZQB4AHQAIABuAGUAZQBkAHMAIAB0AG8AIABiAGUAIABlAG4AYwBvAGQAZQBkAA==
Now if I were to decode this Base64 with PowerShell like:
$v = [System.Text.Encoding]::Unicode.GetString([System.Convert]::FromBase64String($y))
Write-Host("$v")
It would get decoded properly like:
This text needs to be encoded
However, if I were to put the aforementioned Base64-encoded string into, say, CyberChef and try to decode it with the "From Base64" recipe, the decoded string would be filled with dots, like:
T.h.i.s. .t.e.x.t. .n.e.e.d.s. .t.o. .b.e. .e.n.c.o.d.e.d.
My question is, why does this happen?
Santiago Squarzon has provided the crucial pointer:
What CyberChef's recipe most likely expects is for the bytes that the Base64 string encodes to be based on the UTF-8 encoding of the original string.
By contrast, the - poorly named - [System.Text.Encoding]::Unicode encoding is the UTF-16LE encoding, where characters are represented by (at least) two bytes (with the least significant byte coming first).
Characters whose Unicode code point is less than or equal to 0xFF (255), which includes the entire ASCII range that all characters in your input string fall into, therefore have a NUL byte (value 0x0) as the second byte of their two-byte representation; e.g., the letter T encoded as UTF-16LE is composed of the two-byte sequence 0x54 0x0, where 0x54 by itself represents the letter T in ASCII encoding - and therefore also in UTF-8, which is a superset of ASCII that represents (only) non-ASCII characters as multi-byte sequences.
Therefore, the two-byte sequence 0x54 0x0 is interpreted as two characters in the context of UTF-8: letter T (0x54) and NUL (0x0). NUL has no visual representation per se (it is a non-printable character), but a common convention is to visualize it as ., which is what you saw.
Therefore, create your Base64-encoded string as follows:
$orig = "This text needs to be encoded"
$base64 =
  [System.Convert]::ToBase64String(
    [System.Text.Encoding]::UTF8.GetBytes($orig)
  )
Note: Even though [System.Text.Encoding]::UTF8 is - up to at least .NET 6 - a UTF-8 encoding with BOM, a BOM is (fortunately) not prepended to the input string by the .GetBytes() method. As an aside: Changing this encoding to be BOM-less altogether is being considered prior to .NET 7.
$base64 then contains: VGhpcyB0ZXh0IG5lZWRzIHRvIGJlIGVuY29kZWQ=
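To see the same thing outside PowerShell, here is a minimal Python sketch (an illustration, not part of the original answer) that reproduces both Base64 strings and the dotted output:

import base64

text = "This text needs to be encoded"

# UTF-16LE: every ASCII character takes two bytes, the second of which is NUL (0x00)
utf16_b64 = base64.b64encode(text.encode("utf-16-le")).decode("ascii")
# UTF-8: ASCII characters stay single bytes, which is what CyberChef expects
utf8_b64 = base64.b64encode(text.encode("utf-8")).decode("ascii")

print(utf16_b64)  # VABoAGkAcwAg... (same as the PowerShell output above)
print(utf8_b64)   # VGhpcyB0ZXh0IG5lZWRzIHRvIGJlIGVuY29kZWQ=

# Decoding the UTF-16LE bytes as if they were UTF-8 keeps each NUL as a separate
# character, which many viewers render as a dot:
decoded = base64.b64decode(utf16_b64).decode("utf-8")
print(decoded.replace("\x00", "."))  # T.h.i.s. .t.e.x.t. ...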

What does the first bit (i.e., binary 0) mean in the UTF-8 encoding standard?

I'm a PHP Developer by profession.
Consider the example below:
I want to encode the word "hello" using UTF-8 encoding.
So,
The code points of each of the letters of the word "hello" are as below:
h = 104
e = 101
l = 108
o = 111
So, we can say that this list of decimal numbers represents the string "hello":
104 101 108 108 111
UTF-8 encoding will store "hello" like this (binary):
01101000 01100101 01101100 01101100 01101111
If you observe the above binary encoded value closely, you will notice that the binary equivalent of each decimal number has been prefixed with the bit value 0.
My question is: why has this initial 0 been prefixed to every stored character? What's the purpose of using it in UTF-8 encoding?
What happens when the same string is encoded using UTF-16?
If this initial extra bit is necessary, can it be a 1 instead?
Does NUL Byte mean the binary character 0?
UTF-8 is backwards compatible with ASCII. ASCII uses the values 0 - 127 and has assigned characters to them. That means bytes 0000 0000 through 0111 1111. UTF-8 keeps that same mapping for those same first 128 characters.
Any character not found in ASCII is encoded in the form 1xxx xxxx in UTF-8, i.e. for any non-ASCII character the high bit of every encoded byte is 1. Those characters are encoded in multiple bytes in UTF-8. The first bits of the first byte in the sequence tell the decoder how many bytes the character consists of: 110x xxxx signals a 2-byte character, 1110 xxxx a 3-byte character and 1111 0xxx a 4-byte character. Subsequent bytes in the sequence are in the form 10xx xxxx. So, no, you can't just set that first bit to 1 arbitrarily.
There are various extensions to ASCII (e.g. ISO-8859) which set that first bit as well and thereby add another 128 characters of the form 1xxx xxxx.
There's also 7-bit ASCII which omits the first 0 bit and just uses 000 0000 through 111 1111.
Does NUL Byte mean the binary character 0?
It means the bit sequence 0000 0000, i.e. an all-zero byte with the decimal/hex/octal value 0.
You may be interested in What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
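As an illustration (a small Python sketch, not part of the original answer), printing the UTF-8 bytes in binary makes these patterns visible:

# ASCII text: every UTF-8 byte starts with a 0 bit
for byte in "hello".encode("utf-8"):
    print(f"{byte:3d} = {byte:08b}")

# A non-ASCII character (the euro sign, U+20AC): a 1110xxxx lead byte followed by
# two 10xxxxxx continuation bytes
for byte in "€".encode("utf-8"):
    print(f"{byte:#04x} = {byte:08b}")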
UTF-8 encodes the Unicode code points U+0000 - U+007F (which are the ASCII characters 0-127) in a single byte, using only 7 of its bits. The eighth (high) bit is used to signal that additional bytes are necessary, which happens only when encoding the code points U+0080 - U+10FFFF.
For example, è is codepoint U+00E8, which is encoded in UTF-8 as bytes 0xC3 0xA8 (11000011 10101000 in binary).
Wikipedia explains quite well how UTF-8 is encoded.
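To verify the è example (a one-line Python check, added for illustration):

# è is U+00E8; UTF-8 encodes it as 0xC3 0xA8: a 110xxxxx lead byte plus a 10xxxxxx continuation byte
print(" ".join(f"{byte:08b}" for byte in "è".encode("utf-8")))  # 11000011 10101000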
Does NUL Byte mean the binary character 0?
Yes.

Is a Base64 encoded string completely alphanumeric?

Is a Base64 encoded string completely alphanumeric except for the "=" at the end?
As there are only 26 letters in the alphabet and ten digits, you have only 26+26+10 = 62 distinct alphanumeric characters (counting uppercase and lowercase separately). As Base64 obviously needs 64, two additional characters are required: the plus + and the slash /. Additionally, as you said, = is used as padding at the end of the message, if necessary.
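For a quick confirmation, a small Python sketch (not from the original answer) that counts the alphabet and provokes the two non-alphanumeric characters:

import base64
import string

# The standard Base64 alphabet: A-Z, a-z, 0-9, then '+' and '/'; '=' is only padding
alphabet = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
print(len(alphabet))                                # 64

# Inputs whose 6-bit groups hit indices 62 and 63 produce the two non-alphanumerics
print(base64.b64encode(bytes([0xFB, 0xEF, 0xBE])))  # b'++++'
print(base64.b64encode(bytes([0xFF, 0xFF, 0xFF])))  # b'////'

# Padding appears when the input length is not a multiple of 3
print(base64.b64encode(b"M"))                       # b'TQ=='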

Check characters inside string for their Unicode value

I would like to replace characters with certain Unicode values in a variable with a dash. I have two ideas which might work, but I do not know how to check the value of a character:
1/ process the variable as a string, check every character's value, and place the characters into a new variable (replacing those which are invalid)
2/ use this magic :-)
$variable =~ s/[$char_range]/-/g;
$char_range should be similar to [0-9] or [A-Z], but it should be values for UTF-8 characters. I need the range from 0x00 to 0x7F, to be exact.
The following expression should replace anything that is not ASCII with a hyphen, which is (I think) what you want to do:
s/[\N{U+0080}-\N{U+FFFF}]/-/g
There's no such thing as UTF-8 characters. There are only characters that you encode into UTF-8. Even then, you don't want to make ranges outside of the magical ones that Perl knows about. You're likely to get more than you expect.
To get the ordinal value for a character, use ord:
use utf8;
use v5.10;  # needed for say
my $code_number = ord '😸'; # U+1F638
say sprintf "%#x", $code_number;
However, I don't think that's what you need. It sounds like you want to replace characters in the ASCII range with a -. You can specify ranges of code numbers:
s/[\000-\177]/-/g; # in octal
s/[\x00-\x7f]/-/g; # in hexadecimal
You can specify wide character ordinal values in braces:
s/[\x80-\x{10ffff}]/-/g; # wide characters, replace non-ASCII in this case
When the characters have a common property, you can use that:
s/\p{ASCII}/-/g;
However, if you are replacing things character for character, you might want a transliteration:
$string =~ tr/\000-\177/-/;
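The same idea expressed in Python, for comparison (a hedged sketch, not from the original answers; the sample string is made up):

import re

text = "café ☕ abc123"

# Replace every character in the ASCII range (U+0000-U+007F) with a dash,
# mirroring Perl's s/[\x00-\x7f]/-/g
print(re.sub(r"[\x00-\x7f]", "-", text))

# Or the opposite: replace everything that is NOT ASCII, like s/[\x80-\x{10ffff}]/-/g
print(re.sub(r"[^\x00-\x7f]", "-", text))   # caf- - abc123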

Which encoding uses the \x (backslash x) prefix?

I'm attempting to decode text in which certain 'special characters' are prefixed with \x. I've worked out the following mappings by hand:
\x28 (
\x29 )
\x3a :
e.g. 12\x3a39\x3a03 AM
Does anyone recognise what this encoding is?
It's ASCII. Every occurrence of the four-character sequence \xST is converted to a single character whose ASCII code is ST (in hexadecimal), where S and T are any of 0123456789abcdefABCDEF.
The '\xAB' notation is used in C, C++, Perl, and other languages taking a cue from C, as a way of expressing hexadecimal character codes in the middle of a string.
The notation '\007' means use octal for the character code, when there are digits after the backslash.
In C99 and later, you can also use \uabcd and \U00abcdef to encode Unicode characters in hexadecimal (with 4 and 8 hex digits required; the first two hex digits in \U must be 0 to be valid, and often the third digit will be 0 too — 1 is the only other valid value).
Note that in C, octal escapes are limited to a maximum of 3 digits but hexadecimal escapes are not limited to 2 or 3 digits; the hexadecimal escape ends at the first character that's not a hexadecimal digit. In the question, the sequence is "12\x3a39\x3a03". That is a string containing 4 characters: 1, 2, \x3a39 and \x3a03. The actual value used for the 4-digit hex characters is implementation-defined. To achieve the desired result (using \x3A to represent a colon :), the code would have to use string concatenation:
"12\x3a" "39\x3a" "03"
This now contains 8 characters: 1, 2, :, 3, 9, :, 0, 3.
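If the goal is simply to turn such text back into plain characters, here is a small Python sketch (assuming every escape is exactly two hex digits, as in the example above):

import re

escaped = r"12\x3a39\x3a03 AM"

# Replace each \xNN escape with the character whose code is NN (hexadecimal)
decoded = re.sub(r"\\x([0-9A-Fa-f]{2})", lambda m: chr(int(m.group(1), 16)), escaped)
print(decoded)  # 12:39:03 AM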
I use CyberChef for this sort of thing.
If you drop the text into the input field and drag Magic from the Favourites list into the recipe, it'll tell you the conversion and that you could've used the From_Hex recipe with a \x delimiter.
I'm guessing that what you are dealing with is a Unicode string that has been encoded differently from the output stream it was sent to, i.e. a UTF-16 string output to a Latin-1 device. In that situation, certain characters will be output as escape values to avoid sending control characters or wrong characters to the output device. This happens in Python, at least.