Why does PowerShell generated base64 string have dots in it when decoding with something else than PowerShell - powershell

I have my code like:
$x = "This text needs to be encoded"
$z = [System.Text.Encoding]::Unicode.GetBytes($x)
$y = [System.Convert]::ToBase64String($z)
Write-Host("$y")
And the following gets printed to the console:
VABoAGkAcwAgAHQAZQB4AHQAIABuAGUAZQBkAHMAIAB0AG8AIABiAGUAIABlAG4AYwBvAGQAZQBkAA==
Now if I were to decode this b64 with powershell like:
$v = [System.Text.Encoding]::Unicode.GetString([System.Convert]::FromBase64String($y))
Write-Host("$v")
It would get decoded properly like:
This text needs to be encoded
However, if I was to put the aforementioned b64 encoded string to, say CyberChef and try to decode it with the "From base64" recipe, would the decoded string be filled in with dots like:
T.h.i.s. .t.e.x.t. .n.e.e.d.s. .t.o. .b.e. .e.n.c.o.d.e.d.
My question is, why does this happen?

Santiago Squarzon has provided the crucial pointer:
What CyberChef's recipe most likely expects is for the bytes that the Base64 string encodes to be based on the UTF-8 encoding of the original string.
By contrast, the - poorly named - [System.Text.Encoding]::Unicode encoding is the UTF-16LE encoding, where characters are represented by (at least) two bytes (with the least significant byte coming first).
Characters whose Unicode code point is less than or equal to 0xFF (255), which includes the entire ASCII range that all characters in your input string fall into, therefore have a NUL byte (value 0x0) as the second byte of their two-byte representation; e.g., the letter T encoded as UTF-16LE is composed of the two-byte sequence 0x54 0x0, where 0x54 by itself represents the letter T in ASCII encoding - and therefore also in UTF-8, which is a superset of ASCII that represents (only) non-ASCII characters as multi-byte sequences.
Therefore, the two-byte sequence 0x54 0x0 is interpreted as two characters in the context of UTF-8: letter T (0x54) and NUL (0x0). NUL has no visual representation per se (it is a non-printable character), but a common convention is to visualize it as ., which is what you saw.
Therefore, create your Base64-encoded string as follows:
$orig = "This text needs to be encoded"
$base64 =
[System.Convert]::ToBase64String(
[System.Text.Encoding]::UTF8.GetBytes($orig)
)
Note: Even though [System.Text.Encoding]::UTF8 is - up to at least .NET 6 - a UTF-8 encoding with BOM, a BOM is (fortunately) not prepended to the input string by the .GetBytes() method. As an aside: Changing this encoding to be BOM-less altogether is being considered prior to .NET 7.
$base64 then contains: VGhpcyB0ZXh0IG5lZWRzIHRvIGJlIGVuY29kZWQ=

Related

in UTF-8 is there any multi-byte character containing the byte \x27 / chr(39) / ' / single-quote-character?

.. as the title says, in UTF-8 is there any multi-byte character containing the byte \x27 / chr(39) / ' / single-quote-character ?
you may wonder why anyone would want to know that?
well, when trying to bypass the function
function quoteLinuxShellArgument(string $argument): string {
if(false!==strpos($argument,"\x00")){error it is impossible to quote null bytes in shell arguments}
return "'" . str_replace ( "'", "'\\''", $argument ) . "'";
}
among my first questions was the one in the title.. is there any?
In UTF-8, any Unicode codepoint that is outside of the ASCII range (U+0000 - U+007F) is required to be encoded using multiple bytes. All of those bytes will have their high bit set to 1.
So no, byte 0x27 (b00100111) will never appear in a multi-byte sequence. 0x27 can only ever be used to encode codepoint U+0027 APOSTROPHE as a single byte.
All of the multi-byte UTF-8 characters have the upper bit set, so there's no chance of colliding with a regular ASCII character. That includes your single quote.

Can UTF-8 contain zero byte in the middle?

UTF-8 uses 1-4 bytes for each character. Here is 4-byte character ๐Ÿ
So in Python 2, len('๐Ÿ') == 4 and in JavaScript encodeURI('๐Ÿ') === "%F0%9F%90%8D".
The question is, can UTF-8 contain a zero byte in the middle?
For example, the first Russian letter ะ consists of 2 bytes: 0xD0, 0x90.
May there exist a letter that has not leading zero or zero in the middle, like this 0xAB, 0, 0xCD?
The only zero byte in a valid UTF-8 stream would be a representation of U+0000 NULL, which is just 00 (hex) in UTF-8.
No valid encoding of any other character in UTF-8 will produce a full byte without any bits set.
In other words: if your input characters does not contain the NULL character, then your output byte stream is guaranteed to not contain any zero bytes.

What does the first bit(i.e. binary 0) mean in UTF-8 encoding standard?

I'm a PHP Developer by profession.
Consider below example :
I want to encode the word "hello" using UTF-8 encoding.
So,
Equivalent Code Points of each of the letters of the word "hello" are as below :
h = 104
e = 101
l = 108
o = 111
So, we can say that the list of decimal numbers represent the string "hello":
104 101 108 108 111
UTF-8 encoding will store "hello" like this (binary):
01101000 01100101 01101100 01101100 01101111
If you observe the above binary encoded value closely, you will come to know that every binary equivalent of decimal number has been preceded with the binary bit value 0.
My question is why this initial 0 has been prefixed to every storable character? What's the purpose of using it in UTF-8 encoding?
What has been done when the same string is encoded using UTF-16 format?
If is it necessary then can the initial extra character be a bit value 1?
Does NUL Byte mean the binary character 0?
UTF-8 is backwards compatible with ASCII. ASCII uses the values 0 - 127 and has assigned characters to them. That means bytes 0000 0000 through 0111 1111. UTF-8 keeps that same mapping for those same first 128 characters.
Any character not found in ASCII is encoded in the form of 1xxx xxxx in UTF-8, i.e. for any non-ASCII character the high bit of every encoded byte is 1. Those characters are encoded in multiple bytes in UTF-8. The first bits of the first byte in the sequence tell the decoder of how many bytes the character consists of; 110x xxxx signals that it's a 2-byte character, 1110 xxxx a 3-byte character and 1111 0xxx a 4-byte character. Subsequenct bytes in the sequence are in the form 10xx xxxx. So, no, you can't just set it to 1 arbitrarily.
There are various extensions to ASCII (e.g. ISO-8859) which set that first bit as well and thereby add another 128 characters of the form 1xxx xxxx.
There's also 7-bit ASCII which omits the first 0 bit and just uses 000 0000 through 111 1111.
Does NUL Byte mean the binary character 0?
It means the bit sequence 0000 0000, i.e. an all-zero byte with the decimal/hex/octal value 0.
You may be interested in What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
UTF-8 encodes Unicode codepoints U+0000 - U+007F (which are the ASCII characters 0-127) using 7 bits. The eighth bit is used to signal when additional bytes are necessary only when encoding Unicode codepoints U+0080 - U+10FFFF.
For example, รจ is codepoint U+00E8, which is encoded in UTF-8 as bytes 0xC3 0xA8 (11000011 10101000 in binary).
Wikipedia explains quite well how UTF-8 is encoded.
Does NUL Byte mean the binary character 0?
Yes.

CAM::PDF returning non ascii character instead of quotes

I am having trouble with non ascii characters being returned. I am not sure at which level the issue resides. It could be the actual PDF encoding, the decoding used by CAM::PDF (which is FlateDecode) or CAM::PDF itself. The following returns a string full of the commands used to create the PDF (Tm, Tj, etc).
use CAM::PDF;
my $filename = "sample.pdf";
my $cam_obj = CAM::PDF->new($filename) or die "$CAM::PDF::errstr\n";
my $tree = $cam_obj->getPageContentTree(1);
my $page_string = $tree->toString();
print $page_string;
You can download sample.pdf here
The text returned in the Tj often has one character which is non ASCII. In the PDF, the actual character is almost always a quote, single or double.
While reproducing this I found that the returned character is consistent within the PDF but varies amongst PDFs. I also noticed the PDF is using a specific font file. I'm now looking into font files to see if the same character can be mapped to varying binary values.
:edit:
Regarding Windows-1252. My PDF returns an "ร•" instead of apostrophes. The ร• character is hex 0xD5 in Windows-1252 and UTF-8. If the idea is that the character is encoded with Windows-1252, then it should be a hex 0x91 or 0x92 which it is not. Which is why the following does nothing to the character:
use Encode qw(decode encode);
my $page_string = 'ร•';
my $characters = decode 'Windows-1252', $page_string;
my $octets = encode 'UTF-8', $characters;
open STS, ">TEST.txt";
print STS $octets . "\n";
I'm the author of CAM-PDF. Your PDF is non-compliant. From the PDF 1.7 specification, section 3.2.3 "String Objects":
"Within a literal string, the backslash (\) is used as an escape
character for various purposes, such as to include newline characters,
nonprinting ASCII characters, unbalanced parentheses, or the backslash
character itself in the string. [...] The \ddd escape sequence provides
a way to represent characters outside the printable ASCII character set."
If you have large quantities of non-ASCII characters, you can represent them using hexadecimal string notation.
EDIT: Perhaps my interpretation of the spec is incorrect, given a_note's alternative answer. I'll have to revisit this... Certainly, the spec could be clearer in this area.
Sorry to intrude, and with all due respect, sir, but file IS compliant. Section 3.2.3 further states:
[The \ddd] notation provides a way to specify characters outside the
7-bit ASCII character set by using ASCII characters only. However,
any 8-bit value may appear in a string.
"receiving" - where? You get "ร•" instead of expected what? And doing exactly what? You know that windows command prompt uses dos code page, not windows-1252, right? (oops, new thread again... probably i should register here :-) )

Unable to encode to iso-8859-1 encoding for some chars using Perl Encode module

I have a HTML string in ISO-8859-1 encoding. I need to pass this string to HTML:Entities::decode_entities() for converting some of the HTML ASCII codes to respective chars. To so i am using a module HTML::Parser::Entities 3.65 but after decode_entities() operation my whole string changes to utf-8 string. This behavior seems fine as the documentation of the HTML::Parse. As i need this string back in ISO-8859-1 format for further processing so i have used Encode::encode("iso-8859-1",$str) to change the string back to ISO-8859-1 encoding.
My results are fine excepts for some chars, a question mark is coming instead. One example is single quote ' ASCII code (โ€™)
Can anybody help me if there any limitation of Encode module? Any other pointer will also be helpful to solve the problem.
I am pasting the sample text having the char causing the issue:
my $str = "This is a test string to test the encoding of some chars like โ€™ โ€œ โ€ etc these are failing to encode; some of them which encode correctly are รฉ ยซ etc.";
Thanks
There's a third argument to encode, which controls the checking it does. The default is to use a substitution character, but you can set it to FB_CROAK to get an error message.
The fundamental problem is that the characters represented by โ€™, โ€œ, and โ€ do not exist in ISO-8859-1. You'll have to decide what it is that you want to do with them.
Some possibilities:
Use cp1252, Microsoft's "extended" version of ISO-8859-1, instead of the real thing. It does include those characters.
Re-encode the entities outside the ISO-8859-1 range (plus &), before converting from utf-8 to ISO-8859-1:
my $toEncode = do { no warnings 'utf8'; "&\x{0100}-\x{10FFFF}" };
$string = HTML::Entities::encode_entities($string, $toEncode);
(The no warnings bit is needed because U+10FFFF hasn't actually been assigned yet.)
There are other possibilities. It really depends on what you're trying to accomplish.