Can UTF-8 contain a zero byte in the middle? - encoding

UTF-8 uses 1-4 bytes for each character. Here is a 4-byte character: 🐍
So in Python 2, len('🐍') == 4 and in JavaScript encodeURI('🐍') === "%F0%9F%90%8D".
The question is, can UTF-8 contain a zero byte in the middle?
For example, the first Russian letter А consists of 2 bytes: 0xD0, 0x90.
Can there be a letter whose encoding has a zero byte somewhere other than the start, i.e. a zero byte in the middle, like 0xAB, 0x00, 0xCD?

The only zero byte in a valid UTF-8 stream would be a representation of U+0000 NULL, which is just 00 (hex) in UTF-8.
No valid encoding of any other character in UTF-8 will produce a full byte without any bits set.
In other words: if your input characters do not contain the NULL character, then your output byte stream is guaranteed not to contain any zero bytes.
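If you want to see that guarantee in action, here is a small Python sketch (an added illustration, not part of the original answer) that scans the UTF-8 encoding of a few strings for zero bytes:

# The UTF-8 encoding of any string that does not contain U+0000
# never produces a 0x00 byte.
samples = ["hello", "А", "ل", "🐍", "mixed Аб🐍 text"]
for text in samples:
    encoded = text.encode("utf-8")
    assert 0 not in encoded, f"unexpected zero byte in {text!r}"
    print(text, "->", encoded.hex(" "))

# Only the NULL character itself produces a zero byte:
print("\x00".encode("utf-8"))  # b'\x00'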

Related

Why does a PowerShell-generated Base64 string have dots in it when decoded with something other than PowerShell?

I have code like this:
$x = "This text needs to be encoded"
$z = [System.Text.Encoding]::Unicode.GetBytes($x)
$y = [System.Convert]::ToBase64String($z)
Write-Host("$y")
And the following gets printed to the console:
VABoAGkAcwAgAHQAZQB4AHQAIABuAGUAZQBkAHMAIAB0AG8AIABiAGUAIABlAG4AYwBvAGQAZQBkAA==
Now if I were to decode this Base64 with PowerShell like this:
$v = [System.Text.Encoding]::Unicode.GetString([System.Convert]::FromBase64String($y))
Write-Host("$v")
It would get decoded properly like:
This text needs to be encoded
However, if I were to put the aforementioned Base64-encoded string into, say, CyberChef and decode it with the "From Base64" recipe, the decoded string would be filled with dots, like:
T.h.i.s. .t.e.x.t. .n.e.e.d.s. .t.o. .b.e. .e.n.c.o.d.e.d.
My question is, why does this happen?
Santiago Squarzon has provided the crucial pointer:
What CyberChef's recipe most likely expects is for the bytes that the Base64 string encodes to be based on the UTF-8 encoding of the original string.
By contrast, the - poorly named - [System.Text.Encoding]::Unicode encoding is the UTF-16LE encoding, where characters are represented by (at least) two bytes (with the least significant byte coming first).
Characters whose Unicode code point is less than or equal to 0xFF (255), which includes the entire ASCII range that all characters in your input string fall into, therefore have a NUL byte (value 0x0) as the second byte of their two-byte representation. For example, the letter T encoded as UTF-16LE is the two-byte sequence 0x54 0x0, where 0x54 by itself represents the letter T in ASCII encoding, and therefore also in UTF-8, which is a superset of ASCII that represents (only) non-ASCII characters as multi-byte sequences.
Therefore, the two-byte sequence 0x54 0x0 is interpreted as two characters in the context of UTF-8: letter T (0x54) and NUL (0x0). NUL has no visual representation per se (it is a non-printable character), but a common convention is to visualize it as ., which is what you saw.
Therefore, create your Base64-encoded string as follows:
$orig = "This text needs to be encoded"
$base64 =
  [System.Convert]::ToBase64String(
    [System.Text.Encoding]::UTF8.GetBytes($orig)
  )
Note: Even though [System.Text.Encoding]::UTF8 is - up to at least .NET 6 - a UTF-8 encoding with BOM, a BOM is (fortunately) not prepended to the input string by the .GetBytes() method. As an aside: Changing this encoding to be BOM-less altogether is being considered prior to .NET 7.
$base64 then contains: VGhpcyB0ZXh0IG5lZWRzIHRvIGJlIGVuY29kZWQ=
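For readers outside PowerShell, the encoding difference is easy to reproduce in a few lines of Python; the sketch below is only an added illustration, not part of the original answer:

import base64

text = "This text needs to be encoded"

# UTF-16LE (what [System.Text.Encoding]::Unicode produces): every ASCII
# character is followed by a 0x00 byte.
utf16_b64 = base64.b64encode(text.encode("utf-16-le")).decode("ascii")
print(utf16_b64)  # VABoAGkAcwAgAHQAZQB4AHQA... (the PowerShell output above)

# UTF-8: ASCII characters are single bytes, so a Base64 consumer that
# assumes UTF-8 round-trips the string cleanly.
utf8_b64 = base64.b64encode(text.encode("utf-8")).decode("ascii")
print(utf8_b64)   # VGhpcyB0ZXh0IG5lZWRzIHRvIGJlIGVuY29kZWQ=

# Decoding the UTF-16LE bytes as if they were UTF-8 shows where the dots come from:
raw = base64.b64decode(utf16_b64)
print(raw.decode("utf-8").replace("\x00", "."))  # T.h.i.s. .t.e.x.t. ...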

In UTF-8, is there any multi-byte character containing the byte \x27 / chr(39) / ' / single-quote character?

.. as the title says: in UTF-8, is there any multi-byte character containing the byte \x27 / chr(39) / ' / single-quote character?
you may wonder why anyone would want to know that?
well, when trying to bypass the function
function quoteLinuxShellArgument(string $argument): string {
    if (false !== strpos($argument, "\x00")) {
        // null bytes cannot be passed through shell arguments
        throw new InvalidArgumentException('it is impossible to quote null bytes in shell arguments');
    }
    return "'" . str_replace("'", "'\\''", $argument) . "'";
}
among my first questions was the one in the title.. is there any?
In UTF-8, any Unicode codepoint that is outside of the ASCII range (U+0000 - U+007F) is required to be encoded using multiple bytes. All of those bytes will have their high bit set to 1.
So no, byte 0x27 (b00100111) will never appear in a multi-byte sequence. 0x27 can only ever be used to encode codepoint U+0027 APOSTROPHE as a single byte.
All of the multi-byte UTF-8 characters have the upper bit set, so there's no chance of colliding with a regular ASCII character. That includes your single quote.
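If you would rather verify that exhaustively than take it on faith, a brute-force check is cheap; the Python sketch below is an added illustration (assuming Python 3), not part of the original answers:

# No multi-byte UTF-8 sequence ever contains 0x27 ('), because every byte
# of such a sequence has its high bit set (0x80-0xFF).
for codepoint in range(0x80, 0x110000):
    if 0xD800 <= codepoint <= 0xDFFF:
        continue  # surrogates cannot be encoded in UTF-8
    encoded = chr(codepoint).encode("utf-8")
    assert all(byte >= 0x80 for byte in encoded), hex(codepoint)
    assert 0x27 not in encoded, hex(codepoint)
print("0x27 never appears inside a multi-byte UTF-8 sequence")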

What does the first bit (i.e. binary 0) mean in the UTF-8 encoding standard?

I'm a PHP Developer by profession.
Consider below example :
I want to encode the word "hello" using UTF-8 encoding.
So,
Equivalent Code Points of each of the letters of the word "hello" are as below :
h = 104
e = 101
l = 108
o = 111
So, we can say that the list of decimal numbers represent the string "hello":
104 101 108 108 111
UTF-8 encoding will store "hello" like this (binary):
01101000 01100101 01101100 01101100 01101111
If you look closely at the binary encoding above, you will notice that the binary equivalent of every decimal number is preceded by the bit value 0.
My question is: why is this initial 0 prefixed to every stored character? What's its purpose in UTF-8 encoding?
What happens when the same string is encoded using UTF-16?
If the extra bit is necessary, could it be a 1 instead?
Does NUL Byte mean the binary character 0?
UTF-8 is backwards compatible with ASCII. ASCII uses the values 0 - 127 and has assigned characters to them. That means bytes 0000 0000 through 0111 1111. UTF-8 keeps that same mapping for those same first 128 characters.
Any character not found in ASCII is encoded in the form of 1xxx xxxx in UTF-8, i.e. for any non-ASCII character the high bit of every encoded byte is 1. Those characters are encoded in multiple bytes in UTF-8. The first bits of the first byte in the sequence tell the decoder how many bytes the character consists of; 110x xxxx signals that it's a 2-byte character, 1110 xxxx a 3-byte character and 1111 0xxx a 4-byte character. Subsequent bytes in the sequence are in the form 10xx xxxx. So, no, you can't just set it to 1 arbitrarily.
There are various extensions to ASCII (e.g. ISO-8859) which set that first bit as well and thereby add another 128 characters of the form 1xxx xxxx.
There's also 7-bit ASCII which omits the first 0 bit and just uses 000 0000 through 111 1111.
Does NUL Byte mean the binary character 0?
It means the bit sequence 0000 0000, i.e. an all-zero byte with the decimal/hex/octal value 0.
You may be interested in What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
UTF-8 encodes Unicode codepoints U+0000 - U+007F (which are the ASCII characters 0-127) using 7 bits. The eighth bit is used to signal when additional bytes are necessary only when encoding Unicode codepoints U+0080 - U+10FFFF.
For example, è is codepoint U+00E8, which is encoded in UTF-8 as bytes 0xC3 0xA8 (11000011 10101000 in binary).
Wikipedia explains quite well how UTF-8 is encoded.
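If it helps to see the lead bits directly, the following Python sketch (an added illustration, not part of either answer) prints the UTF-8 bytes of a few characters in binary:

# Print the UTF-8 bytes of a character in binary to expose the lead bits.
def show_utf8_bits(char: str) -> None:
    encoded = char.encode("utf-8")
    bits = " ".join(f"{byte:08b}" for byte in encoded)
    print(f"{char!r} -> {encoded.hex(' ')} -> {bits}")

show_utf8_bits("h")  # 'h' -> 68 -> 01101000 (ASCII, leading 0)
show_utf8_bits("è")  # 'è' -> c3 a8 -> 11000011 10101000 (2-byte sequence)
show_utf8_bits("€")  # '€' -> e2 82 ac -> 11100010 10000010 10101100 (3-byte sequence)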
Does NUL Byte mean the binary character 0?
Yes.

get Unicode value [PostgreSQL]

If you see this link
It's all about the Unicode code range.
For example:
U+0644 ل d9 84 ARABIC LETTER LAM
In PostgreSQL it's easy to get the hex value:
select encode('ل','hex')
It will return the hex value, d9 84.
But how do I get the Unicode code point?
Thanks
If your input string is in UTF-8, you can use the ascii function:
ascii(string) int
ASCII code of the first character of the argument.
For UTF8 returns the Unicode code point of the character. For other multibyte encodings, the argument must be a strictly ASCII character.
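As a cross-check outside the database (an added illustration, not part of the original answer), Python's ord() yields the same code point that ascii() should report in a UTF8-encoded database, i.e. select ascii('ل') should return 1604:

# U+0644 ARABIC LETTER LAM: ord() returns the code point as a decimal number,
# which is what PostgreSQL's ascii() returns when the database encoding is UTF8.
lam = "ل"
print(ord(lam))                   # 1604
print(hex(ord(lam)))              # 0x644
print(lam.encode("utf-8").hex())  # d984, the same bytes encode(..., 'hex') showed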

Which encoding uses the \x (backslash x) prefix?

I'm attempting to decode text which is prefixing certain 'special characters' with \x. I've worked out the following mappings by hand:
\x28 (
\x29 )
\x3a :
e.g. 12\x3a39\x3a03 AM
Does anyone recognise what this encoding is?
It's ASCII, written with hexadecimal escape sequences. Each occurrence of the four characters \xST is converted to one character whose ASCII code is ST (in hexadecimal), where S and T are any of 0123456789abcdefABCDEF.
The '\xAB' notation is used in C, C++, Perl, and other languages taking a cue from C, as a way of expressing hexadecimal character codes in the middle of a string.
The notation '\007' means use octal for the character code, when there are digits after the backslash.
In C99 and later, you can also use \uabcd and \U00abcdef to encode Unicode characters in hexadecimal (with 4 and 8 hex digits required; the first two hex digits in \U must be 0 to be valid, and often the third digit will be 0 too, since 1 is the only other valid value).
Note that in C, octal escapes are limited to a maximum of 3 digits but hexadecimal escapes are not limited to 2 or 3 digits; the hexadecimal escape ends at the first character that's not a hexadecimal digit. In the question, the sequence is "12\x3a39\x3a03". That is a string containing 4 characters: 1, 2, \x3a39 and \x3a03. The actual value used for the 4-digit hex characters is implementation-defined. To achieve the desired result (using \x3A to represent a colon :), the code would have to use string concatenation:
"12\x3a" "39\x3a" "03"
This now contains 8 characters: 1, 2, :, 3, 9, :, 0, 3.
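For completeness, here is a small Python sketch (an added illustration, not taken from any of the answers) that undoes this kind of escaping; note that Python's \x escape always takes exactly two hex digits, unlike C's:

import codecs

raw = r"12\x3a39\x3a03 AM"  # the literal text from the question

# 'unicode_escape' interprets \xST (two hex digits) exactly as the
# question's mapping table does: \x3a -> ':'
decoded = codecs.decode(raw.encode("ascii"), "unicode_escape")
print(decoded)  # 12:39:03 AM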
I use CyberChef for this sort of thing.
If you drop it in the input field and drag Magic from the Favourites list into the recipe, it'll tell you the conversion and that you could have used the From_Hex recipe with a \x delimiter.
I'm guessing that what you are dealing with is a Unicode string that has been encoded differently than the output stream it was sent to, i.e. a UTF-16 string output to a Latin-1 device. In that situation, certain characters will be output as escape values to avoid sending control characters or wrong characters to the output device. This happens in Python, at least.
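That behaviour is easy to reproduce; the sketch below (an added illustration, assuming Python 3) shows how a character that does not fit the target encoding comes out as an \x escape when the 'backslashreplace' error handler is used:

# When a character cannot be represented in the output encoding, the
# 'backslashreplace' error handler substitutes a \x (or \u) escape sequence.
text = "café 12:39:03"
print(text.encode("ascii", errors="backslashreplace"))
# b'caf\\xe9 12:39:03'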