Unicode BOM for UTF-16LE vs UTF-32LE

It seems like there's an ambiguity between the byte order marks used for UTF-16LE and UTF-32LE. In particular, consider a file that contains the following 8 bytes:
FF FE 00 00 00 00 00 00
How can I tell if this file contains:
The UTF-16LE BOM (FF FE) followed by 3 null characters; or
The UTF-32LE BOM (FF FE 00 00) followed by one null character?
Unicode BOMs are described here: http://unicode.org/faq/utf_bom.html#bom4 but there's no discussion of this ambiguity. Am I missing something?

As the name suggests, the BOM only tells you the byte order, not the encoding. You have to know what the encoding is first, then you can use the BOM to determine whether the least or most significant bytes are first for multibyte sequences.
A fortunate side-effect of the BOM is that you can also sometimes use it to guess the encoding if you don't know it, but that is not what it was designed for and it is no substitute for sending proper encoding information.

It is unambiguous. FF FE is for UTF-16LE, and FF FE 00 00 denotes UTF-32LE. There is no reason to think that FF FE 00 00 is possibly UTF-16LE because the UTFs were designed for text, and users shouldn't be using NUL characters in their text. After all, when was the last time you opened a hex editor and inserted a few bytes of 00 into a text document? ^_^

I have experienced the same problem as Edward. I agree with Dustin that one will not usually use null characters in text files.
However, I have created a file that contains all Unicode characters. I first used the UTF-32LE encoding, then UTF-32BE, UTF-16LE and UTF-16BE, as well as UTF-8.
When trying to re-encode the files to UTF-8, I wanted to compare the result to the already existing UTF-8 file. Because the first character in my files after the BOM is the null character, I could not correctly detect the file with the UTF-16LE BOM: it showed up as UTF-32LE, because the bytes appeared exactly as Edward has described. The first character after the BOM FF FE is 00 00, but the BOM detection found the BOM FF FE 00 00 and therefore detected UTF-32LE instead of UTF-16LE, and my first U+0000 character was swallowed as part of the BOM.
So one should never use a null character as the first character of a file encoded in UTF-16 little-endian, because it makes the UTF-16LE and UTF-32LE BOMs ambiguous.
To solve my problem, I will swap the first and second characters. :-)
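To make the ambiguity concrete, here is a minimal BOM-sniffing sketch in Python (the function and the check order are my own illustration, not from any answer above). Because FF FE is a prefix of FF FE 00 00, a sniffer has to test the longer UTF-32LE signature first, so a UTF-16LE file whose first character is U+0000 gets misdetected as UTF-32LE - exactly the failure described above:

def sniff_bom(data):
    # Longer signatures must be tested first: FF FE is a prefix of FF FE 00 00.
    if data.startswith(b'\xff\xfe\x00\x00'):
        return 'utf-32-le'  # ...or a UTF-16LE file whose first character is U+0000!
    if data.startswith(b'\x00\x00\xfe\xff'):
        return 'utf-32-be'
    if data.startswith(b'\xff\xfe'):
        return 'utf-16-le'
    if data.startswith(b'\xfe\xff'):
        return 'utf-16-be'
    if data.startswith(b'\xef\xbb\xbf'):
        return 'utf-8'
    return None

# A UTF-16LE BOM followed by U+0000 looks exactly like a UTF-32LE BOM:
print(sniff_bom(b'\xff\xfe\x00\x00\x41\x00'))  # prints utf-32-le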

Related

Does PowerShell try to figure out a script's encoding?

When I execute the following simple script in PowerShell 7.1, I get the (correct) value of 3, regardless of whether the script's encoding is Latin1 or UTF8.
'Bär'.length
This surprises me because I was under the (apparently wrong) impression that the default encoding in PowerShell 5.1 is UTF-16LE and in PowerShell 7.1 UTF-8.
Because both versions of the script evaluate the expression to 3, I am forced to conclude that PowerShell 7.1 applies some heuristic to infer a script's encoding when executing it.
Is my conclusion correct and is this documented somewhere?
I was under the (apparently wrong) impression that the default encoding in PowerShell 5.1 is UTF-16LE and in PowerShell 7.1 UTF-8.
There are two distinct default character encodings to consider:
The default output encoding used by various cmdlets (Out-File, Set-Content) and the redirection operators (>, >>) when writing a file.
This encoding varies wildly across cmdlets in Windows PowerShell (PowerShell versions up to 5.1) but now - fortunately - consistently defaults to BOM-less UTF-8 in PowerShell [Core] v6+ - see this answer for more information.
Note: This encoding is always unrelated to the encoding of a file that data may have been read from originally, because PowerShell does not preserve this information and never passes text as raw bytes through - text is always converted to .NET ([string], System.String) instances by PowerShell before the data is processed further.
The default input encoding, when reading a file - both source code read by the engine and files read by Get-Content, for instance - which applies only to files without a BOM (files with a BOM are always recognized properly).
In the absence of a BOM:
Windows PowerShell assumes the system's active ANSI code page, such as Windows-1252 on US-English systems. Note that this means that systems with different active system locales (settings for non-Unicode applications) can interpret a given file differently.
PowerShell [Core] v6+ more sensibly assumes UTF-8, which is capable of representing all Unicode characters and whose interpretation doesn't depend on system settings.
Note that these are fixed, deterministic assumptions - no heuristic is employed.
The upshot is that for cross-edition source code the best encoding to use is UTF-8 with BOM, which both editions recognize properly.
As for a source-code file containing 'Bär'.length:
If the source-code file's encoding is properly recognized, the result is always 3, given that a .NET string instance ([string], System.String) is constructed, which in memory is always composed of UTF-16 code units ([char], System.Char), and given that .Length counts the number of these code units.[1]
Leaving broken files out of the picture (such as a UTF-16 file without a BOM, or a file with a BOM that doesn't match the actual encoding):
The only scenario in which .Length does not return 3 is:
In Windows PowerShell, if the file was saved as a UTF-8 file without a BOM.
Since ANSI code pages use a fixed-width single-byte encoding, each byte that is part of a UTF-8 byte sequence is individually (mis-)interpreted as a character, and since ä (LATIN SMALL LETTER A WITH DIAERESIS, U+00E4) is encoded as 2 bytes in UTF-8, 0xc3 and 0xa4, the resulting string has 4 characters.
Thus, the string renders as BÃ¤r.
By contrast, in PowerShell [Core] v6+, a BOM-less file that was saved based on the active ANSI (or OEM) code page (e.g., with Set-Content in Windows PowerShell) causes all non-ASCII characters (in the 8-bit range) to be considered invalid characters - because they cannot be interpreted as UTF-8.
All such invalid characters are simply replaced with � (REPLACEMENT CHARACTER, U+FFFD) - in other words: information is lost.
Thus, the string renders as B�r - and its .Length is still 3.
[1] A single UTF-16 code unit is capable of directly encoding all 65,536 code points in the so-called BMP (Basic Multilingual Plane) of Unicode, but characters outside this plane are encoded as pairs of code units (surrogate pairs). The upshot: .Length doesn't always return the count of characters, notably not with emoji; e.g., '👋'.length is 2
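Both failure modes described above are easy to reproduce outside PowerShell. Here is a small Python sketch of the same byte-level round trips (my own illustration, assuming a Windows-1252 ANSI code page):

# 'Bär' saved as UTF-8 without a BOM, then read as Windows-1252
# (the Windows PowerShell case):
utf8_bytes = 'Bär'.encode('utf-8')               # b'B\xc3\xa4r'
s = utf8_bytes.decode('windows-1252')
print(s, len(s))                                 # BÃ¤r 4

# 'Bär' saved as Windows-1252, then read as UTF-8 (the PowerShell v6+ case);
# the invalid byte is replaced with U+FFFD, so information is lost:
ansi_bytes = 'Bär'.encode('windows-1252')        # b'B\xe4r'
s = ansi_bytes.decode('utf-8', errors='replace')
print(s, len(s))                                 # B�r 3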
The encoding is unrelated to this case: you are calling string.Length which is documented to return the number of UTF-16 code units. This roughly correlates to letters (when you ignore combining characters and high codepoints like emoji)
Encoding only comes into play when converting implicitly or explicitly to/from a byte array, file, or P/Invoke. It doesn't affect how .NET stores the data backing a string.
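For illustration, a short Python sketch of the code-unit arithmetic (Python's len() counts code points, so the UTF-16 code-unit count that .NET's .Length reports has to be derived from the encoded bytes):

s = '👋'                                 # U+1F44B, outside the BMP
print(len(s))                            # 1 - Python counts code points
print(len(s.encode('utf-16-le')) // 2)   # 2 - UTF-16 code units (a surrogate pair),
                                         #     which is what .NET's .Length reports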
Speaking to the encoding of .ps1 files, that is dependent upon version. Older versions fall back to the system's active ANSI code page, but will respect a BOM for UTF-16 or UTF-8. Newer versions use UTF-8 as the fallback.
In at least 5.1.19041.1, loading the file 'Bär'.Length (27 42 C3 A4 72 27 2E 4C 65 6E 67 74 68) and running it with . .\Bar.ps1 will result in 4 printing.
If the same file is saved as Windows-1252 (27 42 E4 72 27 2E 4C 65 6E 67 74 68), then it will print 3.
tl;dr: string.Length always returns the number of UTF-16 code units. .ps1 files should be saved as UTF-8 with a BOM for cross-version compatibility.
I think that without a BOM, PS 5 assumes ANSI (e.g., Windows-1252), while PS 7 assumes BOM-less UTF-8. This file saved as ANSI in Notepad works in PS 5 but not perfectly in PS 7, just like a BOM-less UTF-8 file with special characters wouldn't work perfectly in PS 5. A UTF-16 .ps1 file would always have a BOM or encoding signature. A PowerShell string in memory is always UTF-16, but a character is considered to have a length of 1, except for emojis. If you have Emacs, Esc-x hexl-mode is a nice way to look at it.
'¿Cómo estás?'
format-hex file.ps1
Label: C:\Users\js\foo\file.ps1

          Offset Bytes                                           Ascii
                 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
          ------ ----------------------------------------------- -----
0000000000000000 27 BF 43 F3 6D 6F 20 65 73 74 E1 73 3F 27 0D 0A '¿Cómo estás?'��

Why doesn't UTF-8 encoding need a Byte Order Mark?

The Unicode FAQ mentions that UTF-8 doesn't need a BOM.
Q: Is the UTF-8 encoding scheme the same irrespective of whether the
underlying processor is little endian or big endian?
A: Yes. Since UTF-8 is interpreted as a sequence of bytes, there is no
endian problem as there is for encoding forms that use 16-bit or
32-bit code units. Where a BOM is used with UTF-8, it is only used as
an encoding signature to distinguish UTF-8 from other encodings — it
has nothing to do with byte order.
For code points above U+007F (such as U+0744), UTF-8 needs 2 to 4 bytes to represent them. Doesn't it need a BOM to specify the endianness of these bytes, or does UTF-8 adopt a default?
UTF-8 gives a strict definition for the order of the bytes that encode a character. No variation between computing platforms is allowed.
For example, the Euro sign U+20AC must be encoded as the byte sequence \xE2\x82\xAC. No other ordering of these bytes is permitted.
UTF-8 uses 1-byte code units, so there is no need for a BOM to indicate a byte order, because there is only 1 byte order possible, and the encoding algorithm determines the ordering of the bytes. For example, U+0744 is encoded in UTF-8 as code units 0xDD 0x84, which are represented in bytes as DD 84. Bytes 84 DD would be an illegal UTF-8 sequence.
This is unlike UTF-16 and UTF-32, which use 2-byte and 4-byte code units, respectively. The encoding algorithm determines the order of the code units, but since the code units themselves are multi-byte, they are subject to endianness. For example, U+0744 is encoded in UTF-16 as the code unit 0x0744, and in UTF-32 as the code unit 0x00000744, which are represented in bytes as 07 44 or 44 07 in UTF-16, and as 00 00 07 44 or 44 07 00 00 in UTF-32, depending on the endianness.
So, a BOM makes sense to indicate which endian is actually being used for UTF-16/32, but not for UTF-8.
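A quick way to see this for yourself is a Python sketch like the following (my own illustration; requires Python 3.8+ for bytes.hex with a separator - the byte values follow directly from the encoding definitions):

ch = '\u0744'
print(ch.encode('utf-8').hex(' '))      # dd 84        - only one order is legal
print(ch.encode('utf-16-be').hex(' '))  # 07 44
print(ch.encode('utf-16-le').hex(' '))  # 44 07
print(ch.encode('utf-32-be').hex(' '))  # 00 00 07 44
print(ch.encode('utf-32-le').hex(' '))  # 44 07 00 00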

Is a .txt expected to be in UTF-8 encoding these days? Must I end it with .utf8?

I'm producing plain-text files. I do not use ASCII/ANSI but UTF-8 encoding, since the year is 2020 and not 1995. Unicode/UTF-8 is very well established now and it would be madness to assume no UTF-8 support these days.
At the same time, I have a feeling that plain-text files (.txt) are associated with ANSI/ASCII encoding, as in, because it's so primitive-looking it must also be primitive in the encoding it uses.
However, I wish to use all kinds of Unicode characters, and not just be limited to the basic ANSI/ASCII ones.
Since plain text has no metadata like HTML does, there is (as far as I know) no way to tell the reader that this .txt uses Unicode/UTF-8, and from what I have learned, you cannot detect it reliably but have to make "educated guesses".
I have seen people add .utf8 to the end of text files before, but this seems kind of ugly and I strongly question how widespread support for this is...
Should I do this?
test.txt.utf8
Whenever the .txt file is using UTF-8? Or will it just make it even harder for people to open them with no actual benefit in terms of detecting it as UTF-8?
You do not elaborate on the use case of the text files you generate, but there actually is a way "to tell the reader that this .txt uses Unicode/UTF-8": the Byte Order Mark at the beginning of the text file. The way the BOM is represented in actual bytes tells the reader which Unicode encoding to use to read the file.
From the Unicode FAQ:
Bytes        Encoding Form
00 00 FE FF  UTF-32, big-endian
FF FE 00 00  UTF-32, little-endian
FE FF        UTF-16, big-endian
FF FE        UTF-16, little-endian
EF BB BF     UTF-8
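In Python, for example, the 'utf-8-sig' codec writes and strips this signature for you (a minimal sketch; the file name and sample text are just examples):

# Write a text file with a UTF-8 BOM (EF BB BF) at the start:
with open('test.txt', 'w', encoding='utf-8-sig') as f:
    f.write('Grüße, 世界')

# The first three bytes are the UTF-8 signature:
with open('test.txt', 'rb') as f:
    print(f.read(3))        # b'\xef\xbb\xbf'

# Reading with 'utf-8-sig' transparently skips the BOM if present:
with open('test.txt', encoding='utf-8-sig') as f:
    print(f.read())         # Grüße, 世界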

How to identify encoding from hex values?

I have text on a website that displays like this: o¨ instead of ö.
I extracted the text out of the CMS and analysed its hex values:
the ö's that are displayed correctly have c3 b6 - UTF-8
the ö's that are displayed incorrectly have 6f cc 88
I couldn't find out what encoding this is. What's a good way to identify the encoding?
6F is the UTF-8 (ASCII) encoding of "o", nothing spectacular.
CC 88 is the UTF-8 encoding of U+0308, COMBINING DIAERESIS.
You're simply looking at the decomposed form of the o-umlaut. A combining diaeresis should be rendered visually combined with the previous character. If your system doesn't do that, it means it doesn't handle Unicode correctly, and/or the font you have chosen is somewhat broken. Perhaps you have to normalise your strings into the composed Unicode form for your system to handle them correctly.
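A minimal normalisation sketch in Python, using the standard unicodedata module (whether NFC actually fixes the rendering depends on your system and font, as noted above):

import unicodedata

decomposed = 'o\u0308'                   # o + COMBINING DIAERESIS (6f cc 88 in UTF-8)
composed = unicodedata.normalize('NFC', decomposed)
print(composed)                          # ö (the single code point U+00F6)
print(composed.encode('utf-8').hex(' '))                           # c3 b6
print(unicodedata.normalize('NFD', 'ö').encode('utf-8').hex(' '))  # 6f cc 88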

Convert Unicode code point to UTF-8 sequence

I am not sure I've got my nomenclature right, so please correct me :)
I've received a text file representing a Pāli dictionary: a list of words separated by newline \n (0x0a) characters. Supposedly, some of the special letters are encoded using UTF-8, but I doubt that.
Loading this text file into any of my editors (vim, Notepad, TextEdit, ..) shows quite scrambled text, for example
mhiti
A closer look at the actual bytes then reveal the following (using hexdump -C)
0a 0a 1e 6d 68 69 74 69 0a 0a ...mhiti..
which seems to me to be the Unicode code point U+1E6D ("ṭ", LATIN SMALL LETTER T WITH DOT BELOW). That particular letter has the UTF-8 encoding e1 b9 ad.
My question: is there a tool which helps me convert this particular file into actual UTF-8 encoding? I tried iconv but without success; I looked briefly into a Python script but would think there's an easier way to get this done. It seems that this is a useful link for this problem, but isn't there a tool that can get this done? Am I missing something?
EDIT: Just to make things a little bit more entertaining, there seem to be actual UTF-8 encoded characters scattered throughout as well. For example, the word "ākiñcaññāyatana" has the following sequence of bytes
01 01  6b  69  c3 b1  63  61  c3 b1  c3 b1  01 01  79  61  74  61  6e  61
ā      k   i   ñ      c   a   ñ      ñ      ā      y   a   t   a   n   a
where the "ā" is encoded by its Unicode code point U-0101, and the "ñ" is encoded by the UTF-8 sequence \xc3b1 which has Unicode code point U-00F1.
EDIT: Here's one that I can't quite figure out what it's supposed to be:
01 1e 37  01 01  76  61  6b  61
?         ā      v   a   k   a
I can only guess, but that too doesn't make sense. The Unicode code point U+011E is a "Ğ" (UTF-8 \xc4\x9e), but that's not a Pāli character AFAIK; then a "7" follows, which doesn't make sense in a word. The Unicode code point U+1E37, however, is a "ḷ" (UTF-8 \xe1\xb8\xb7), which is a valid Pāli character. But that would leave the first byte \x01 by itself. If I had to guess, I would think this is the name "Jīvaka", but that does not match the bytes. LATER: According to the author, this is "Āḷāvaka" - so, assuming the character-encoding heuristics from above, a \x00 is again missing. Adding it back in:
01 00  1e 37  01 01  76  61  6b  61
Ā      ḷ      ā      v   a   k   a
Are there "compressions" that remove \x00 bytes from UTF-16 encoded Unicode files?
I'm assuming in this context that "ṭhiti" makes sense as the contents of that file.
From your description, it looks like that file encodes characters < U+0080 as a single byte and characters > U+0100 as two-byte big-endian. That's not decodable, in general; two linefeeds (U+000A, U+000A) would have the same encoding as GURMUKHI LETTER UU (U+0A0A).
There's no invocation of iconv that will decode it for you; you'll need to use the heuristics you know - based on character ranges or ordering in the file - to write a custom decoder, or ask for another copy in a standard encoding.
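For what it's worth, here is a rough Python sketch of such a custom decoder (entirely my own guess at the scheme described above: linefeeds and printable ASCII pass through as single bytes, anything else starts a two-byte big-endian code point; the U+000A/U+0A0A ambiguity remains, and the stray UTF-8 sequences from the edits above are not handled):

def decode_mixed(data):
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == 0x0A or 0x20 <= b < 0x7F:
            # Assume a linefeed or printable ASCII is a single-byte character.
            out.append(chr(b))
            i += 1
        else:
            # Assume anything else starts a two-byte big-endian code point.
            out.append(chr((b << 8) | data[i + 1]))
            i += 2
    return ''.join(out)

print(decode_mixed(bytes.fromhex('1e6d68697469')))  # ṭhiti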
I think in the end this was my own fault, somehow. Browsing to this file showed a very mangled and broken version of the original UTF-16 encoded file; the "Save as" menu of the browser then saved that broken file, which created the initial question for this thread.
It seems that the web browser tried to display the UTF-16 encoded file, removed non-printable characters like \x00 and converted some others to UTF-8, thus completely mangling the original file.
Using wget to fetch the file fixed the problem, and I could convert it nicely into UTF-8 and use it further.