Byte Order Mask: confusing the UTF encoding

Byte Order Mask: confusing the UTF encoding - unicode

The Byte Order Mask (BOM) uses the Unicode character U+FEFF to determine the encoding of a text file according to the following rule:
+-------------+-----------------------+
| Bytes | Encoding Form |
+-------------+-----------------------+
| 00 00 FE FF | UTF-32, big-endian |
| FF FE 00 00 | UTF-32, little-endian |
| FE FF | UTF-16, big-endian |
| FF FE | UTF-16, little-endian |
| EF BB BF | UTF-8 |
+-------------+-----------------------+
My question is: is there any combination of bytes that can make one UTF encoding to be confused with another UTF encoding?
For example, if I have a UTF-16 big-endian encoded file without BOM and with the characters U+EFBB and U+BF40 (EF BB BF 40) can it be confused with an UTF-8 encoded file with BOM and the ASCII character #?

Sure, without knowing the encoding, a sequence of U+0000 characters has an unknown length.
00 00 00 00 UTF-8 U+0000 U+0000 U+0000 U+0000
00 00 00 00 UTF-16 U+0000 U+0000
00 00 00 00 UTF-32 U+0000
BTW—Bytes that look like a byte order mark cannot be used to determine the encoding of a text file. In general, it's an unsolvable problem—data loss.

The BOM is designed to find the byte order when the size is known. So there is no U+FFFE code. There is no further limitation on charset, so there can be some overlapping codes. (#TomBlodget has an example of a "degenerate" case)
BOM in UTF-8 is not really needed, but it should be preserved, in order to do a perfect round conversion from other unicode encoding. Just Windows started to use it to distinguish UTF-8 from other encoding (especially outside unicode encoding), and that it is not 100% reliable.
C0 and C1 are bytes not allowed on UTF-8, along various sequences (first bits on byte 1 defines the length of sequence, and so there should be exactly so many bytes with "continuation prefix" (0b10). So usually it is easy to find if a string it is UTF-8 (if not too short or "degenerate").
UTF-32 has valid values just from 0 to U+10FFFF, so this could be used to distinguish it from UTF16 (again, "degenerate" and short strings are not discriminable, OTOH we should expect very often 00 00 in UTF32, and usually no 00 00 on UTF16 normal text, but ev. at the end.).
Control characters and private character set should not be used on "public" Unicode text (but if you agree on the protocol, but so that should not be the case of the question).

Related

Avoid unexpected characters in powershell stream-to-file output

I would expect the following powershell command:
echo 'abc' > /test.txt
to fill the file /test.txt with exactly 3 bytes: 0x61, 0x62, 0x63.
If I inspect the file, I see that its bytes are (all values are hex):
ff fe 61 00 62 00 63 00 0d 00 0a 00
Regardless of what output I try to stream into a file, the following steps apply in the given order, transforming the resulting file contents:
Append \r\n (0d 0a) to the end of the output
Insert a null byte (00) after every character of the output
Prepend ff fe to the entire output
Transformation #1 is relatively tolerable, but the other two transformations are a real pain.
How can I stream content more precisely (without these transformations) using powershell?
Thanks!

Try this
'abc' | out-file -filepath .\test.txt -encoding ascii -nonewline
should get you 3 bytes instead of 12
61 62 63

Powershell Format-Hex does not show end of line. Why?

I cannot see any end of line byte
echo "hello" | Format-Hex -Raw -Encoding Ascii
is there a way to show them?
Edit: I also have a file that shows the same behaviour, and this one contains multiple lines, as confirmed by both cat and notepad.
PS C:\dev\cur CMR-27473_AMI_not_stopping_in_ecat_fault 97984 > cat .\x.txt
helo
helo2
PS C:\dev\cur CMR-27473_AMI_not_stopping_in_ecat_fault 97984 > Get-Content .\x.txt | Format-Hex -Raw
00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00000000 68 65 6C 6F helo
00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00000000 68 65 6C 6F 32 helo2
I do see the two records. But I want to see the end of line characters instead, that is, the raw bytes content.

If you mean newline, there isn't one in the source string. Thus, Format-Hex won't show one.
Windows uses CR LF sequence (0x0a, 0x0d) for newline. To see the control characters, append a newline into the string. Like so,
"hello"+[environment]::newline | Format-Hex -Raw -Encoding Ascii
00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00000000 68 65 6C 6C 6F 0D 0A hello..
One can also use Powershell's backtick escape sequence: "hello`r`n" for the same effect as appending [Environment]::NewLine, though only the latter is platform-aware.
Addendum as per the comment and edit:
Powershell's Get-Content is trying to be smart. In most of the use cases[citation needed], data read from text files does not need to include the newline characters. Get-Content will populate an array and each line read from the file will be in its own element. What use would a newline be?
When output is redirected to a file, Powershell is trying to be smart again. In most of the use cases[citation needed], adding text into a text file means adding new lines of data. Not appending existing a line. There's actually a separate switch for preventing the linefeed: Add-Content -NoNewLine.
What's more, high level languages do not have a specific string termination character. When one has a string object, like the modern languages, the length of the string is stored as an attribute of the string object.
In low level languages, there is no concept of a string. It's just a bunch of characters stuffed together. How, then would one know where a "string" begins and ends? Pascal's approach is to allocate byte in the beginning to contain actual string data length. C uses null-terminated strings. In DOS, assembly programs used dollar -terminated strings.

To complement vonPryz's helpful answer:
tl;dr:
Format-Hex .\x.txt
is the only way to inspect a file's raw byte content in PowerShell; i.e., you need to pass the input file path as a direct argument (to the implied -Path parameter).
Once the pipeline is involved, any strings you're dealing with are by definition .NET string objects, which are inherently UTF-16-encoded.
echo "hello", which is really Write-Output "hello", given that echo is a built-in alias for Write-Output, writes a single string object to the pipeline, as-is - and given that it has no embedded newline, Format-Hex doesn't show one.
For more, read on.
Generally, PowerShell has no concept of sending raw data through a pipeline: you're always dealing with instances of .NET types (objects).
Therefore, when Format-Hex receives pipeline input, it never sees raw byte streams, it operates on .NET strings, which are inherently UTF-16 ("Unicode") strings.
It is only then that the -Encoding parameter applies: it re-encodes the .NET strings on output.
By default, the output encoding is ASCII in Windows PowerShell, and UTF-8 in PowerShell Core.
Note: In Windows PowerShell, this means that by default characters outside the 7-bit ASCII range are transcoded in a "lossy" fashion to the literal ? character (whose Unicode code point and byte value is 0x3F).
The -Raw switch only make sense in combination with [int] (System.Int32)-typed input in Windows PowerShell v5.1 and is obsolete in PowerShell Core, where it has no effect whatsoever.[1]
echo is a built-in alias for the Write-Output cmdlet, and it accepts objects to write to the pipeline.
In your case, that object is a single-line string (an object of type [string] (System.String)), which, as stated, has no embedded newline sequence.
As an aside: PowerShell implicitly outputs anything that isn't captured (assigned to a variable or redirected elsewhere), so your command can be written more idiomatically as:
"hello" | Format-Hex
Similarly, cat is a built-in alias for the Get-Content cmdlet, which reads a text file's content as an array of lines, i.e., into a string array whose elements do not end in a newline.
It is the array elements that are written to the pipeline, one by one, and Format-Hex renders the bytes of each separately - but, again, without any newlines, because the input objects (array elements representing lines without a trailing newline) do not contain any.
The only way to see newlines is to read the file as a whole, which is what the - somewhat confusingly named - -Raw switch does:
Get-Content -Raw .\x.txt | Format-Hex
While this now does reflect the actual newlines present in the file, note that it is not a raw byte representation of the file, for the reasons mentioned.
[1] -Raw's purpose up to v5.1 was never documented, but it is now described as obsolete and having no effect.
In short: [int]-typed input was not necessarily represented by the 4 bytes it comprises - single-byte or double-byte sequences were used, if the value was small enough, in favor of more compact output; -Raw would deactivate this and output the faithful 4-byte representation.
In PS Core [v6+], you now always and invariably get the faithful byte representation, and -Raw has no effect; for the full story see this GitHub pull request.

What does the first bit(i.e. binary 0) mean in UTF-8 encoding standard?

I'm a PHP Developer by profession.
Consider below example :
I want to encode the word "hello" using UTF-8 encoding.
So,
Equivalent Code Points of each of the letters of the word "hello" are as below :
h = 104
e = 101
l = 108
o = 111
So, we can say that the list of decimal numbers represent the string "hello":
104 101 108 108 111
UTF-8 encoding will store "hello" like this (binary):
01101000 01100101 01101100 01101100 01101111
If you observe the above binary encoded value closely, you will come to know that every binary equivalent of decimal number has been preceded with the binary bit value 0.
My question is why this initial 0 has been prefixed to every storable character? What's the purpose of using it in UTF-8 encoding?
What has been done when the same string is encoded using UTF-16 format?
If is it necessary then can the initial extra character be a bit value 1?
Does NUL Byte mean the binary character 0?

UTF-8 is backwards compatible with ASCII. ASCII uses the values 0 - 127 and has assigned characters to them. That means bytes 0000 0000 through 0111 1111. UTF-8 keeps that same mapping for those same first 128 characters.
Any character not found in ASCII is encoded in the form of 1xxx xxxx in UTF-8, i.e. for any non-ASCII character the high bit of every encoded byte is 1. Those characters are encoded in multiple bytes in UTF-8. The first bits of the first byte in the sequence tell the decoder of how many bytes the character consists of; 110x xxxx signals that it's a 2-byte character, 1110 xxxx a 3-byte character and 1111 0xxx a 4-byte character. Subsequenct bytes in the sequence are in the form 10xx xxxx. So, no, you can't just set it to 1 arbitrarily.
There are various extensions to ASCII (e.g. ISO-8859) which set that first bit as well and thereby add another 128 characters of the form 1xxx xxxx.
There's also 7-bit ASCII which omits the first 0 bit and just uses 000 0000 through 111 1111.
Does NUL Byte mean the binary character 0?
It means the bit sequence 0000 0000, i.e. an all-zero byte with the decimal/hex/octal value 0.
You may be interested in What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.

UTF-8 encodes Unicode codepoints U+0000 - U+007F (which are the ASCII characters 0-127) using 7 bits. The eighth bit is used to signal when additional bytes are necessary only when encoding Unicode codepoints U+0080 - U+10FFFF.
For example, è is codepoint U+00E8, which is encoded in UTF-8 as bytes 0xC3 0xA8 (11000011 10101000 in binary).
Wikipedia explains quite well how UTF-8 is encoded.
Does NUL Byte mean the binary character 0?
Yes.

How can I detect the codepage of a serial of text,2 byte for a character,It's polish

How can I detect the codepage of a serial of text,2 byte for a charactor,It's polish.And for normal English charactor ,just add 0x00 to the ansi code, for special Polish character,the two byte have the special meaning. there is no file head ,just bytes stream like this.
Sample here
string: Połączenia
bytes: 50 00/6f 00/42 01/05 01/63 00/7a 00/65 00/69 00/61 00
I think it's not unicode ,because 0x4201 in unicode is a Chinese charactor
not Polish.
So Any one can help me? thanks very much!

Its UTF-16 Big Endian.
$ echo -n "Połączenia" | iconv -f UTF8 -t UTF16BE | hexdump
0000000 5000 6f00 4201 0501 6300 7a00 6500 6e00
0000010 6900 6100

get Unicode value [PostgreSQL]

If you see this link
Its all about unicode code range
example :
U+0644 ل d9 84 ARABIC LETTER LAM
In PostgreSQL its easy to get hex value :
select encode('ل','hex')
it will return the hex value, d9 84.
but how to get the unicode code point ?
Thanks

If your input string is in UTF-8, you can use the ascii function:
ascii(string) int
ASCII code of the first character of the argument.
For UTF8 returns the Unicode code point of the character. For other
multibyte encodings. the argument must be a strictly ASCII character.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse