In UTF-8, is there any multi-byte character containing the byte \x27 / chr(39) / ' / single-quote character?

As the title says: in UTF-8, is there any multi-byte character containing the byte \x27 / chr(39) / ' (the single-quote character)?
You may wonder why anyone would want to know that.
Well, when trying to bypass this function:
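function quoteLinuxShellArgument(string $argument): string {
    if (false !== strpos($argument, "\x00")) {
        throw new \InvalidArgumentException("it is impossible to quote null bytes in shell arguments");
    }
    // Wrap in single quotes; each embedded ' becomes '\'' (close quote, escaped quote, reopen quote).
    return "'" . str_replace("'", "'\\''", $argument) . "'";
}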
Among my first questions was the one in the title: is there any?

In UTF-8, any Unicode codepoint that is outside of the ASCII range (U+0000 - U+007F) is required to be encoded using multiple bytes. All of those bytes will have their high bit set to 1.
So no, byte 0x27 (b00100111) will never appear in a multi-byte sequence. 0x27 can only ever be used to encode codepoint U+0027 APOSTROPHE as a single byte.
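The claim above can be checked by brute force. The following is a minimal Python sketch (the function name is made up for illustration) that encodes every non-ASCII code point and confirms that each resulting byte has its high bit set, so none of them can equal 0x27. Surrogates are skipped because they cannot be encoded in UTF-8.

```python
def multibyte_sequences_avoid_ascii():
    """Return True if every byte of every multi-byte UTF-8 sequence is >= 0x80."""
    for code_point in range(0x80, 0x110000):
        if 0xD800 <= code_point <= 0xDFFF:
            continue  # surrogate code points have no UTF-8 encoding
        encoded = chr(code_point).encode("utf-8")
        # A non-ASCII code point must take at least 2 bytes, all with the high bit set.
        if len(encoded) < 2 or min(encoded) < 0x80:
            return False
    return True


if __name__ == "__main__":
    print(multibyte_sequences_avoid_ascii())
    print("'".encode("utf-8"))  # the apostrophe itself is the single byte 0x27
```

The loop covers roughly 1.1 million code points, so it may take a second or so to run.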

All of the multi-byte UTF-8 characters have the upper bit set, so there's no chance of colliding with a regular ASCII character. That includes your single quote.


Why does PowerShell generated base64 string have dots in it when decoding with something else than PowerShell

I have my code like:
$x = "This text needs to be encoded"
$z = [System.Text.Encoding]::Unicode.GetBytes($x)
$y = [System.Convert]::ToBase64String($z)
Write-Host("$y")
And the following gets printed to the console:
VABoAGkAcwAgAHQAZQB4AHQAIABuAGUAZQBkAHMAIAB0AG8AIABiAGUAIABlAG4AYwBvAGQAZQBkAA==
Now if I were to decode this b64 with powershell like:
$v = [System.Text.Encoding]::Unicode.GetString([System.Convert]::FromBase64String($y))
Write-Host("$v")
It would get decoded properly like:
This text needs to be encoded
However, if I put the aforementioned Base64-encoded string into, say, CyberChef and decode it with the "From Base64" recipe, the decoded string is filled with dots, like:
T.h.i.s. .t.e.x.t. .n.e.e.d.s. .t.o. .b.e. .e.n.c.o.d.e.d.
My question is, why does this happen?
Santiago Squarzon has provided the crucial pointer:
What CyberChef's recipe most likely expects is for the bytes that the Base64 string encodes to be based on the UTF-8 encoding of the original string.
By contrast, the - poorly named - [System.Text.Encoding]::Unicode encoding is the UTF-16LE encoding, where characters are represented by (at least) two bytes (with the least significant byte coming first).
Characters whose Unicode code point is less than or equal to 0xFF (255), which includes the entire ASCII range that all characters in your input string fall into, therefore have a NUL byte (value 0x0) as the second byte of their two-byte representation; e.g., the letter T encoded as UTF-16LE is composed of the two-byte sequence 0x54 0x0, where 0x54 by itself represents the letter T in ASCII encoding - and therefore also in UTF-8, which is a superset of ASCII that represents (only) non-ASCII characters as multi-byte sequences.
Therefore, the two-byte sequence 0x54 0x0 is interpreted as two characters in the context of UTF-8: letter T (0x54) and NUL (0x0). NUL has no visual representation per se (it is a non-printable character), but a common convention is to visualize it as ., which is what you saw.
Therefore, create your Base64-encoded string as follows:
$orig = "This text needs to be encoded"
$base64 =
[System.Convert]::ToBase64String(
[System.Text.Encoding]::UTF8.GetBytes($orig)
)
Note: Even though [System.Text.Encoding]::UTF8 is - up to at least .NET 6 - a UTF-8 encoding with BOM, a BOM is (fortunately) not prepended to the input string by the .GetBytes() method. As an aside: Changing this encoding to be BOM-less altogether is being considered prior to .NET 7.
$base64 then contains: VGhpcyB0ZXh0IG5lZWRzIHRvIGJlIGVuY29kZWQ=
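The same effect can be reproduced outside PowerShell. This is a rough Python sketch, not what CyberChef actually runs: it encodes the string both ways and then renders the decoded bytes one byte per character, printing "." for anything non-printable. The helper name `show_as_single_bytes` is invented for the example.

```python
import base64

text = "This text needs to be encoded"

# What [System.Text.Encoding]::Unicode produces: UTF-16LE, two bytes per ASCII character.
utf16_b64 = base64.b64encode(text.encode("utf-16-le")).decode("ascii")
# What a UTF-8 based decoder expects: one byte per ASCII character.
utf8_b64 = base64.b64encode(text.encode("utf-8")).decode("ascii")


def show_as_single_bytes(b64_string):
    """Decode Base64 and show each byte as one character, '.' for non-printables."""
    raw = base64.b64decode(b64_string)
    return "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in raw)


if __name__ == "__main__":
    print(utf16_b64)
    print(show_as_single_bytes(utf16_b64))  # every letter is followed by a '.' (the NUL byte)
    print(utf8_b64)
    print(show_as_single_bytes(utf8_b64))
```

Decoding the first string with `base64.b64decode(utf16_b64).decode("utf-16-le")` gives back the original text, which mirrors the PowerShell round trip in the question.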

Can UTF-8 contain zero byte in the middle?

UTF-8 uses 1-4 bytes for each character. Here is 4-byte character 🐍
So in Python 2, len('🐍') == 4 and in JavaScript encodeURI('🐍') === "%F0%9F%90%8D".
The question is, can UTF-8 contain a zero byte in the middle?
For example, the first Russian letter А consists of 2 bytes: 0xD0, 0x90.
Can there be a character whose encoding contains a zero byte that is not the leading byte, i.e. a zero in the middle, like this: 0xAB, 0, 0xCD?
The only zero byte in a valid UTF-8 stream would be a representation of U+0000 NULL, which is just 00 (hex) in UTF-8.
No valid encoding of any other character in UTF-8 will produce a full byte without any bits set.
In other words: if your input characters do not contain the NULL character, then your output byte stream is guaranteed not to contain any zero bytes.
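To see why no zero byte can appear, it may help to build the bytes by hand. The sketch below is an illustrative Python encoder (`utf8_encode` is a hypothetical helper, and it does not reject surrogates or out-of-range values). Every lead byte of a multi-byte sequence starts with the bits 110, 1110 or 11110, and every continuation byte starts with 10, so none of them can be 0x00.

```python
def utf8_encode(code_point):
    """Encode one code point to UTF-8 by hand (illustration only, no validation)."""
    if code_point < 0x80:
        # 0xxxxxxx: the only form that can yield 0x00, and only for U+0000.
        return bytes([code_point])
    if code_point < 0x800:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


if __name__ == "__main__":
    print(utf8_encode(0x0410).hex())   # Cyrillic A from the question
    print(utf8_encode(0x1F40D).hex())  # the snake emoji from the question
```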

Mail subject decoding?

I have a email subject like this:
Subject: =?gbk?Q?=B3=F6=C3=C0=C1=E2=C7=BF=C1=A6=B3=E9=CA=AA=BB=FA=D2=BB=CC=A8?=
=?gbk?Q?=A3=AC=D6=E9=BA=A3=B9=E3=D6=DD=C9=FA=BB=EE=B1=D8=B1=B8?=
But I don't know what kind of encoding is this?
Could someone help? Newbie to email protocol.
This subject is encoded in GBK, an extension of the GB2312 character set for simplified Chinese characters, used in the People's Republic of China.
As defined in the RFC 1342 specification (since superseded by RFC 2047), to represent non-ASCII text in Internet Message headers, you have to encode it with the MIME encoded-word syntax:
encoded-word = "=" "?" charset "?" encoding "?" encoded-text "?" "="
charset = token ; legal charsets defined by RFC 1341
encoding = token ; Either "B" or "Q"
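token = 1*<Any CHAR except SPACE, CTLs, and tspecials>
tspecials = "(" / ")" / "<" / ">" / "@" / "," / ";" / ":" / "\" /
<"> / "/" / "[" / "]" / "?" / "." / "="
encoded-text = 1*<Any printable ASCII character other than "?" or
; SPACE> (but see "Use of encoded-words in message
; headers", below)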
The "B" encoding:
The "B" encoding is identical to the "BASE64" encoding defined by
RFC 1341.
The "Q" encoding:
The "Q" encoding is similar to the "Quoted-Printable" content-
transfer-encoding defined in RFC 1341. It is designed to allow text
containing mostly ASCII characters to be decipherable on an ASCII
terminal without decoding.
(1) Any 8-bit value may be represented by a "=" followed by two
hexadecimal digits. For example, if the character set in use
were ISO-8859-1, the "=" character would thus be encoded as
"=3D", and a SPACE by "=20". (Upper case should be used for
hexadecimal digits "A" through "F".)
(2) The 8-bit hexadecimal value 20 (e.g., ISO-8859-1 SPACE) may be
represented as "_" (underscore, ASCII 95.). (This character may
not pass through some internetwork mail gateways, but its use
will greatly enhance readability of "Q" encoded data with mail
readers that do not support this encoding.) Note that the "_"
always represents hexadecimal 20, even if the SPACE character
occupies a different code position in the character set in use.
(3) 8-bit values which correspond to printable ASCII characters other
than "=", "?", and "_" (underscore), MAY be represented as those
characters. (But see section 5 for restrictions.) In
particular, SPACE and TAB MUST NOT be represented as themselves
within encoded words.
In your subject:
Subject:
=?gbk?Q?=B3=F6=C3=C0=C1=E2=C7=BF=C1=A6=B3=E9=CA=AA=BB=FA=D2=BB=CC=A8?= =?gbk?Q?=A3=AC=D6=E9=BA=A3=B9=E3=D6=DD=C9=FA=BB=EE=B1=D8=B1=B8?=
The Q between the question marks shows that the "Q" encoding (which is similar to Quoted-Printable) has been used, hence the presence of = as the escape character instead of %.
You can find an online encoder here, and an online MIME headers decoder here.
Finally, here is your decoded subject:
Subject: 出美菱强力抽湿机一台，珠海广州生活必备
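If you would rather decode it yourself than use an online tool, Python's standard library can do it. This is a small sketch using `email.header.decode_header`; the wrapper `decode_mime_header` is a name made up for this example, and it assumes the charset named in each encoded-word is a codec Python knows (gbk is).

```python
from email.header import decode_header

RAW_SUBJECT = (
    "=?gbk?Q?=B3=F6=C3=C0=C1=E2=C7=BF=C1=A6=B3=E9=CA=AA=BB=FA=D2=BB=CC=A8?= "
    "=?gbk?Q?=A3=AC=D6=E9=BA=A3=B9=E3=D6=DD=C9=FA=BB=EE=B1=D8=B1=B8?="
)


def decode_mime_header(value):
    """Decode MIME encoded-words ("B" or "Q") in a header value to a str."""
    parts = []
    for data, charset in decode_header(value):
        if isinstance(data, bytes):
            # Encoded parts carry their charset; plain parts come back with None.
            parts.append(data.decode(charset or "ascii"))
        else:
            parts.append(data)
    return "".join(parts)


if __name__ == "__main__":
    print(decode_mime_header(RAW_SUBJECT))
```

Whitespace between two adjacent encoded-words is dropped by the decoder, which is why the two halves of the subject join into one string.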

Is a Base64 encoded string completely alphanumeric?

Is a Base64 encoded string completely alphanumeric except for the "=" at the end?
As there are only 26 letters in the alphabet, and ten digits, you only have 26+26+10=62 distinct alphanumeric characters. As Base64 obviously needs 64, two additional characters are needed. These two are plus (+) and slash (/). Additionally, as you said, = is used as padding at the end of the message, if necessary.
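One way to see all 64 symbols is to encode a byte string that contains every 6-bit value exactly once, in order. The following Python sketch does that with the standard `base64` module; `base64_alphabet` is just an illustrative name.

```python
import base64


def base64_alphabet():
    """Return the 64 output symbols of standard Base64, in index order."""
    packed = 0
    for value in range(64):
        # Append each 6-bit value 0..63 to one big integer.
        packed = (packed << 6) | value
    data = packed.to_bytes(48, "big")  # 64 values * 6 bits = 384 bits = 48 bytes
    return base64.b64encode(data).decode("ascii")


if __name__ == "__main__":
    alphabet = base64_alphabet()
    print(alphabet)
    print([c for c in alphabet if not c.isalnum()])  # the two non-alphanumeric symbols
```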

Which encoding uses the \x (backslash x) prefix?

I'm attempting to decode text which is prefixing certain 'special characters' with \x. I've worked out the following mappings by hand:
\x28 (
\x29 )
\x3a :
e.g. 12\x3a39\x3a03 AM
Does anyone recognise what this encoding is?
It's ASCII. All occurrences of the four characters \xST are converted to 1 character, whose ASCII code is ST (in hexadecimal), where S and T are any of 0123456789abcdefABCDEF.
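A minimal Python sketch of that conversion, assuming every escape is exactly \x followed by two hex digits as described above (the function name `decode_hex_escapes` is invented for the example):

```python
import re

# Matches a literal backslash, 'x', and exactly two hexadecimal digits.
_HEX_ESCAPE = re.compile(r"\\x([0-9a-fA-F]{2})")


def decode_hex_escapes(text):
    """Replace each \\xST escape with the character whose code is 0xST."""
    return _HEX_ESCAPE.sub(lambda match: chr(int(match.group(1), 16)), text)


if __name__ == "__main__":
    print(decode_hex_escapes(r"12\x3a39\x3a03 AM"))  # 12:39:03 AM
```

Because the pattern consumes exactly two hex digits, the 39 after \x3a is left alone and stays part of the output.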
The '\xAB' notation is used in C, C++, Perl, and other languages taking a cue from C, as a way of expressing hexadecimal character codes in the middle of a string.
The notation '\007' means use octal for the character code, when there are digits after the backslash.
In C99 and later, you can also use \uabcd and \U00abcdef to encode Unicode characters in hexadecimal (with 4 and 8 hex digits required; the first two hex digits in \U must be 0 to be valid, and often the third digit will be 0 too, since 1 is the only other valid value).
Note that in C, octal escapes are limited to a maximum of 3 digits but hexadecimal escapes are not limited to 2 or 3 digits; the hexadecimal escape ends at the first character that's not a hexadecimal digit. In the question, the sequence is "12\x3a39\x3a03". That is a string containing 4 characters: 1, 2, \x3a39 and \x3a03. The actual value used for the 4-digit hex characters is implementation-defined. To achieve the desired result (using \x3A to represent a colon :), the code would have to use string concatenation:
"12\x3a" "39\x3a" "03"
This now contains 8 characters: 1, 2, :, 3, 9, :, 0, 3.
I use CyberChef for this sort of thing.
If you drop it in the input field and drag Magic from the Favourites list into the recipe it'll tell you the conversion and that you could've used the From_Hex recipe with a \x delimiter.
I'm guessing that what you are dealing with is a Unicode string that was encoded differently from what the output stream it was sent to expects, e.g. a UTF-16 string output to a Latin-1 device. In that situation, certain characters are output as escape sequences to avoid sending control characters or wrong characters to the output device. This happens in Python, at least.