Convert Unicode code point to UTF-8 sequence - unicode

I am not sure I've got my nomenclature right, so please correct me :)
I've received a text file representing a Pāli dictionary: a list of words separated by newline \n (0x0a) characters. Supposedly, some of the special letters are encoded using UTF-8, but I doubt that.
Loading this text file into any of my editors (vim, Notepad, TextEdit, ..) shows quite scrambled text, for example
mhiti
A closer look at the actual bytes then reveals the following (using hexdump -C)
0a 0a 1e 6d 68 69 74 69 0a 0a ...mhiti..
which seems to me to be the Unicode code point U+1E6D ("ṭ" or LATIN SMALL LETTER T WITH DOT BELOW). That particular letter has UTF-8 encoding e1 b9 ad.
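For what it's worth, the code point / UTF-8 relationship is easy to confirm with a quick Python check:

print("\u1e6d")                          # ṭ
print("\u1e6d".encode("utf-8").hex())    # e1b9ad, the expected UTF-8 bytes
print(hex(ord("ṭ")))                     # 0x1e6d, the raw code point found in the file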
My question: is there a tool which helps me convert this particular file into actual UTF-8 encoding? I tried iconv but without success; I looked briefly into writing a Python script, but I would think there's an easier way to get this done. It seems that this is a useful link for this problem, but isn't there a tool that can handle it? Am I missing something?
EDIT: Just to make things a little bit more entertaining, there seem to be actual UTF-8 encoded characters scattered throughout as well. For example, the word "ākiñcaññāyatana" has the following sequence of bytes
01 01 6b 69 c3 b1 63 61 c3 b1 c3 b1 01 01 79 61 74 61 6e 61
ā k i ñ c a ñ ñ ā y a t a n a
where the "ā" is encoded by its Unicode code point U-0101, and the "ñ" is encoded by the UTF-8 sequence \xc3b1 which has Unicode code point U-00F1.
EDIT: Here's one that I can't quite figure out what it's supposed to be:
01 1e 37 01 01 76 61 6b 61
? ā v a k a
I can only guess, but no reading quite makes sense. The Unicode code point U+011E is "Ğ" (UTF-8 \xc4\x9e), but that's not a Pāli character AFAIK, and then a "7" follows, which doesn't make sense in a word. Alternatively, the Unicode code point U+1E37 is "ḷ" (UTF-8 \xe1\xb8\xb7), which is a valid Pāli character, but that would leave the first byte \x01 by itself. If I had to guess, I would think this is the name "Jīvaka", but that doesn't match the bytes. LATER: According to the author, this is "Āḷāvaka"; so, following the encoding heuristics from above, a \x00 is again missing. Adding it back in:
01 00 1e 37 01 01 76 61 6b 61
Ā ḷ ā v a k a
Are there "compressions" that remove \x00 bytes from UTF-16 encoded Unicode files?

I'm assuming in this context that "ṭhiti" makes sense as the contents of that file.
From your description, it looks like that file encodes characters < U+0080 as a single byte and characters > U+0100 as two-byte big-endian. That's not decodable, in general; two linefeeds (U+000A, U+000A) would have the same encoding as GURMUKHI LETTER UU (U+0A0A).
There's no invocation of iconv that'll decode it for you; you'll need to use the heuristics you know, whether based on character ranges or on ordering in the file, to write a custom decoder (or ask for another copy in a standard encoding).
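For illustration, here is a minimal Python sketch of such a heuristic decoder, based only on the byte patterns shown in the question: printable ASCII passes through, a valid two-byte UTF-8 sequence is decoded as UTF-8, and anything else is paired with the next byte as a big-endian code point. It is a best-effort guess, not a general solution, and it cannot recover the stripped \x00 of the "Āḷāvaka" example.

def decode_mangled(data: bytes) -> str:
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b in (0x09, 0x0a, 0x0d) or 0x20 <= b <= 0x7e:
            out.append(chr(b))                              # printable ASCII / whitespace
            i += 1
        elif 0xc2 <= b <= 0xdf and i + 1 < len(data) and 0x80 <= data[i + 1] <= 0xbf:
            out.append(data[i:i + 2].decode("utf-8"))       # genuine UTF-8 pair, e.g. c3 b1 -> ñ
            i += 2
        elif i + 1 < len(data):
            out.append(chr((b << 8) | data[i + 1]))         # two raw bytes as a big-endian code point
            i += 2
        else:
            out.append("\ufffd")                            # lone trailing byte: give up
            i += 1
    return "".join(out)

print(decode_mangled(bytes.fromhex("1e6d68697469")))                              # ṭhiti
print(decode_mangled(bytes.fromhex("01016b69c3b16361c3b1c3b10101796174616e61")))  # ākiñcaññāyatana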

I think in the end this was my own fault, somehow. Browsing to this file showed a very mangled and broken version of the original UTF-16 encoded file; the browser's "Save as" menu then saved that broken file, which is what prompted the initial question in this thread.
It seems that the web browser tried to display the UTF-16 encoded file as text, dropped non-printable characters like \x00 and converted some others to UTF-8, thus completely mangling the original file.
Using wget to fetch the file fixed the problem, and I could convert it nicely into UTF-8 and use it further.
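For reference, once the intact UTF-16 file is downloaded, the conversion itself is trivial: iconv -f UTF-16 -t UTF-8 in.txt > out.txt does it on the command line, or, as a Python sketch (the file names here are placeholders):

# Python honours the BOM when the encoding is given as "utf-16".
with open("dictionary-utf16.txt", encoding="utf-16") as src, \
     open("dictionary-utf8.txt", "w", encoding="utf-8") as dst:
    dst.write(src.read())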

Related

Does PowerShell try to figure out a script's encoding?

When I execute the following simple script in PowerShell 7.1, I get the (correct) value of 3, regardless of whether the script's encoding is Latin1 or UTF8.
'Bär'.length
This surprises me because I was under the (apparently wrong) impression that the default encoding in PowerShell 5.1 is UTF16-LE and in PowerShell 7.1 UTF-8.
Because both scripts evaluate the expression to 3, I am forced to conclude that PowerShell 7.1 applies some heuristic method to infer a script's encoding when executing it.
Is my conclusion correct and is this documented somewhere?
I was under the (apparently wrong) impression that the default encoding in PowerShell 5.1 is UTF16-LE and in PowerShell 7.1 UTF-8.
There are two distinct default character encodings to consider:
The default output encoding used by various cmdlets (Out-File, Set-Content) and the redirection operators (>, >>) when writing a file.
This encoding varies wildly across cmdlets in Windows PowerShell (PowerShell versions up to 5.1) but now - fortunately - consistently defaults to BOM-less UTF-8 in PowerShell [Core] v6+ - see this answer for more information.
Note: This encoding is always unrelated to the encoding of a file that data may have been read from originally, because PowerShell does not preserve this information and never passes text as raw bytes through - text is always converted to .NET ([string], System.String) instances by PowerShell before the data is processed further.
The default input encoding when reading a file - both source code read by the engine and files read by cmdlets such as Get-Content - which applies only to files without a BOM (because files with a BOM are always recognized properly).
In the absence of a BOM:
Windows PowerShell assumes the system's active ANSI code page, such as Windows-1252 on US-English systems. Note that this means that systems with different active system locales (settings for non-Unicode applications) can interpret a given file differently.
PowerShell [Core] v6+ more sensibly assumes UTF-8, which is capable of representing all Unicode characters and whose interpretation doesn't depend on system settings.
Note that these are fixed, deterministic assumptions - no heuristic is employed.
The upshot is that for cross-edition source code the best encoding to use is UTF-8 with BOM, which both editions recognize properly.
As for a source-code file containing 'Bär'.length:
If the source-code file's encoding is properly recognized, the result is always 3, given that a .NET string instance ([string], System.String) is constructed, which in memory is always composed of UTF-16 code units ([char], System.Char), and given that .Length counts the number of these code units.[1]
Leaving broken files out of the picture (such as a UTF-16 file without a BOM, or a file with a BOM that doesn't match the actual encoding):
The only scenario in which .Length does not return 3 is:
In Windows PowerShell, if the file was saved as a UTF-8 file without a BOM.
Since ANSI code pages use a fixed-width single-byte encoding, each byte that is part of a UTF-8 byte sequence is individually (mis-)interpreted as a character, and since ä (LATIN SMALL LETTER A WITH DIAERESIS, U+00E4) is encoded as 2 bytes in UTF-8, 0xc3 and 0xa4, the resulting string has 4 characters.
Thus, the string renders as BÃ¤r
By contrast, in PowerShell [Core] v6+, a BOM-less file that was saved based on the active ANSI (or OEM) code page (e.g., with Set-Content in Windows PowerShell) causes all non-ASCII characters (in the 8-bit range) to be considered invalid characters - because they cannot be interpreted as UTF-8.
All such invalid characters are simply replaced with � (REPLACEMENT CHARACTER, U+FFFD) - in other words: information is lost.
Thus, the string renders as B�r - and its .Length is still 3.
[1] A single UTF-16 code unit is capable of directly encoding all 65K characters in the so-called BMP (Basic Multi-Lingual Plane) of Unicode, but for characters outside this plane pairs of code units encode a single Unicode character. The upshot: .Length doesn't always return the count of characters, notably not with emoji; e.g., '👋'.length is 2
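The two mis-read scenarios (and the code-unit counting) can be reproduced outside PowerShell as well; here is a small Python sketch that mirrors the examples above:

# UTF-8 bytes of 'Bär' read as if they were Windows-1252 (the Windows
# PowerShell ANSI scenario): every byte becomes one character.
utf8_bytes = "Bär".encode("utf-8")                     # 42 C3 A4 72
print(utf8_bytes.decode("cp1252"))                     # BÃ¤r (4 characters)

# Windows-1252 bytes read as UTF-8 (the PowerShell 7 scenario): the 0xE4
# byte is not valid UTF-8 and is replaced with U+FFFD.
ansi_bytes = "Bär".encode("cp1252")                    # 42 E4 72
print(ansi_bytes.decode("utf-8", errors="replace"))    # B�r (still 3 characters)

# .Length counts UTF-16 code units; characters outside the BMP need two.
print(len("👋".encode("utf-16-le")) // 2)              # 2 code units for one emoji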
The encoding is unrelated to this case: you are calling string.Length which is documented to return the number of UTF-16 code units. This roughly correlates to letters (when you ignore combining characters and high codepoints like emoji)
Encoding only comes into play when converting implicitly or explicitly to/from a byte array, file, or p/invoke. It doesn’t affect how .Net stores the data backing a string.
Speaking to the encoding for PS1 files, that is dependent upon version. Older versions have a fallback encoding of Encoding.ASCII, but will respect a BOM for UTF-16 or UTF-8. Newer versions use UTF-8 as the fallback.
In at least 5.1.19041.1, loading the file 'Bär'.Length (27 42 C3 A4 72 27 2E 4C 65 6E 67 74 68) and running it with . .\Bar.ps1 will result in 4 printing.
If the same file is saved as Windows-1252 (27 42 E4 72 27 2E 4C 65 6E 67 74 68), then it will print 3.
tl;dr: string.Length always returns number of UTF-16 code units. PS1 files should be in UTF-8 with BOM for cross version compatibility.
I think without a BOM, PS 5 assumes ANSI (e.g. Windows-1252), while PS 7 assumes UTF-8 without a BOM. This file saved as ANSI in Notepad works in PS 5 but not perfectly in PS 7, just like a UTF-8 no-BOM file with special characters wouldn't work perfectly in PS 5. A UTF-16 .ps1 file would always have a BOM or encoding signature. A PowerShell string in memory is always UTF-16, but a character is considered to have a length of 1, except for emojis. If you have emacs, esc-x hexl-mode is a nice way to look at it.
'¿Cómo estás?'
format-hex file.ps1
Label: C:\Users\js\foo\file.ps1
Offset Bytes Ascii
00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
------ ----------------------------------------------- -----
0000000000000000 27 BF 43 F3 6D 6F 20 65 73 74 E1 73 3F 27 0D 0A '¿Cómo estás?'��

How to identify encoding from hex values?

I have text on a website that displays like this: o¨ instead of ö
I extracted the text out of the CMS and analysed its hex values:
the ö's that display correctly have c3 b6 - UTF-8
the ö's that display incorrectly have 6f cc 88
I couldn't find out what encoding this is. What's a good way to identify the encoding?
6F is the UTF-8 (ASCII) encoding of "o", nothing spectacular.
CC 88 is the UTF-8 encoding of U+0308, COMBINING DIAERESIS.
You're simply looking at the decomposed form of the o-umlaut. A combining diaeresis character should visually be rendered, well, combined with the previous character. If your system doesn't do that, it means it doesn't treat Unicode correctly, and/or the font you have chosen is somewhat broken. Perhaps you have to normalise your strings into the composed Unicode form instead for your system to handle it correctly.
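If normalisation turns out to be the fix, here is a small Python sketch of converting the decomposed form from the CMS into the composed (NFC) form, assuming the text can be post-processed this way:

import unicodedata

decomposed = "o\u0308"                              # 6f cc 88: o + COMBINING DIAERESIS
composed = unicodedata.normalize("NFC", decomposed)
print(composed)                                     # ö
print(composed.encode("utf-8").hex())               # c3b6, the form that displays correctly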

How have my file names been encoded?

After a long time, I came to review the contents of my HDD and saw a weird file name.
I'm not sure what tool or program changed it this way, but from the content of the file I could work out its original name.
Anyway, I'm dealing with some kind of encoding and I want to identify it. It shouldn't be complicated, at least for those familiar with Unicode and UTF-8. Below I map the characters; perhaps you can tell what has happened.
The table below maps the characters: the second column shows the original code point (its UTF-8 form) and the first column shows the characters it was converted into.
I need to know what happened and how it was converted, so that I can convert it back to UTF-8. That is, what I have is in the first column, and what I need to get back is in the second column:
638 2020 -> 646
639 AF   -> 6AF
637 A7   -> 627
637 B1   -> 631
637 B3   -> 633
637 6BE  -> 62A
20       -> 20
638 67E  -> 641
63A 152  -> 6CC
To describe it further, consider the first row: the original character is U+0646 (stored as the bytes 46 06 as a wide character). In the file name, that single character has been converted into two wide characters, 0x0638 0x2020.
I found the solution myself.
In Notepad++:
Select "Encode in ANSI" from Encoding menu.
Paste the corrupted text.
Select "Encode in UTF-8" from Encoding menu.
That's it. The correct text will be displayed.
If so, how can I do the same with Perl?
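The Notepad++ steps amount to one round-trip: re-encode the mangled text with the system ANSI code page, then decode the resulting bytes as UTF-8. Judging by the mapping table above, the ANSI code page involved appears to be Windows-1256 (Arabic), though that is an assumption. A Python sketch of the round-trip (a Perl version would do the same two steps with the Encode module's encode and decode functions):

# Assumption: the names are UTF-8 byte strings that were mistakenly decoded
# with Windows-1256. Reversing that mistake for the first row of the table:
mangled = "\u0638\u2020"                        # 638 2020
restored = mangled.encode("cp1256").decode("utf-8")
print(restored, hex(ord(restored)))             # ن 0x646, matching the second column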

Unicode BOM for UTF-16LE vs UTF-32LE

It seems like there's an ambiguity between the Byte Order Marks used for UTF-16LE and UTF-32LE. In particular, consider a file that contains the following 8 bytes:
FF FE 00 00 00 00 00 00
How can I tell if this file contains:
The UTF-16LE BOM (FF FE) followed by 3 null characters; or
The UTF-32LE BOM (FF FE 00 00) followed by one null character?
Unicode BOMs are described here: http://unicode.org/faq/utf_bom.html#bom4 but there's no discussion of this ambiguity. Am I missing something?
As the name suggests, the BOM only tells you the byte order, not the encoding. You have to know what the encoding is first, then you can use the BOM to determine whether the least or most significant bytes are first for multibyte sequences.
A fortunate side-effect of the BOM is that you can also sometimes use it to guess the encoding if you don't know it, but that is not what it was designed for and it is no substitute for sending proper encoding information.
It is unambiguous. FF FE is for UTF-16LE, and FF FE 00 00 denotes UTF-32LE. There is no reason to think that FF FE 00 00 is possibly UTF-16LE because the UTFs were designed for text, and users shouldn't be using NUL characters in their text. After all, when was the last time you opened a hex editor and inserted a few bytes of 00 into a text document? ^_^
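In practice this is also how BOM sniffers tend to behave: the longer UTF-32 BOMs are tested before the UTF-16 ones, so FF FE 00 00 is always claimed for UTF-32LE. A rough Python sketch of such longest-match detection (not any particular library's implementation):

BOMS = [
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xef\xbb\xbf",     "utf-8-sig"),
    (b"\xff\xfe",         "utf-16-le"),
    (b"\xfe\xff",         "utf-16-be"),
]

def sniff(data):
    # Return the name of the first matching BOM, longest patterns first.
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None                                    # no BOM: fall back to a default

print(sniff(bytes.fromhex("fffe000000000000")))    # utf-32-le, even if the author
                                                   # meant UTF-16LE text starting with U+0000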
I have experienced the same problem as Edward. I agree with Dustin that one will usually not use null characters in text files.
However, I have created a file that contains all Unicode characters. I first used the UTF-32LE encoding, then UTF-32BE, UTF-16LE and UTF-16BE, as well as UTF-8.
When trying to re-encode the files to UTF-8, I wanted to compare the result to the already existing UTF-8 file. Because the first character in my files after the BOM is the null character, I could not successfully detect the file with the UTF-16LE BOM; it showed up as having a UTF-32LE BOM, because the bytes appeared exactly as Edward has described. The first character after the BOM FF FE is 00 00, but the BOM detection found the BOM FF FE 00 00 and so detected UTF-32LE instead of UTF-16LE, whereby my first 0000 character was swallowed and taken as part of the BOM.
So one should never use a null character as the first character of a file encoded with UTF-16 little-endian, because it makes the UTF-16LE and UTF-32LE BOMs ambiguous.
To solve my problem, I will swap the first and second characters. :-)

Character conversion in SAXParser

I have a problem, a very peculiar one; could you please guide me?
Original message: Kevätsunnuntaisin lentää
The flow of data is HttpConnector -> WSDLConnector -> to the underlying system
The following is the encoding of the first 7 characters
4b 65 76 c3 a4 74 73 75 – In Http Connector – the request XML has UTF-8 encoding
4b 65 76 a3 74 73 75 – in WSDL Connector -
InputSource inputSource = new InputSource(myInputStream);
inputSource.setEncoding("UTF-8");
parser.parse(inputSource);
The original string gets converted to Kev£tsunnuntaisin lent££. Also, there is a loss of a byte.
Could you please guide me where I am going wrong? What must I do to avoid this character conversion?
Thanks for your help!!!
This is very simple: The data in myInputStream is not encoded as UTF-8, hence the decoding fails.
My guess is that you save the output of the HTTP connector as a string and then use that as the input for the WSDL connector. In the string, the data is Unicode, not UTF-8. Use String.getBytes("UTF-8") to get an array of bytes with the correct encoding.
As for all encoding issues: Always tell the computer with which encoding it should work instead of hoping that it will guess correctly. Bytes have no encoding and the computer is not telepathic :) And I hope it never will be ...
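Here is a Python sketch of the failure mode described above; it does not reproduce the exact a3 byte from the question, but it shows why declaring UTF-8 on bytes that are not actually UTF-8 loses data:

original = "Kevätsunnuntaisin lentää"
good = original.encode("utf-8")                   # 4b 65 76 c3 a4 ..., as seen in the HTTP connector
print(good.decode("utf-8"))                       # round-trips cleanly

bad = original.encode("latin-1")                  # one byte per character, not valid UTF-8
print(bad.decode("utf-8", errors="replace"))      # Kev�tsunnuntaisin lent�� - information is lost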