How have my file names been encoded? - unicode

After a long time, I came back to review the contents of my HDD and found some weirdly named files.
I'm not sure what tool or program changed the names this way, but by looking at the contents of each file I could work out its original name.
Anyway, I'm dealing with some kind of encoding conversion and I want to identify it. It shouldn't be complicated for anyone familiar with Unicode and UTF-8. Below, I map the characters so you can see what has happened.
The following table maps the characters: the second column is the original character (as a Unicode code point) and the first column is the converted form it has become in the file name.
I need to know what happened and how to convert it back. That is, what I have is in the first column, and what I need to get is in the second column:
What I have (converted)      What I need (original)
0638 2020                    0646
0639 00AF                    06AF
0637 00A7                    0627
0637 00B1                    0631
0637 00B3                    0633
0637 06BE                    062A
0020                         0020
0638 067E                    0641
063A 0152                    06CC
For more detail, consider the first row: the original character is U+0646 (typed as the bytes 46 06 in little-endian UTF-16). In the file name, that character has been converted into two wide characters, 0x0638 0x2020.

I found the solution myself.
In Notepad++:
Select "Encode in ANSI" from Encoding menu.
Paste the corrupted text.
Select "Encode in UTF-8" from Encoding menu.
That's it. The correct text will be displayed.
So, how can I do the same with Perl?
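In case it helps anyone: a minimal Perl sketch of that same round trip. The assumption (not stated in the question) is that the "ANSI" codepage involved is Windows-1256, the Arabic codepage, which is consistent with the table above: the corrupted names are UTF-8 bytes that were displayed as cp1256 characters.

use strict;
use warnings;
use Encode qw(encode decode);

# A corrupted name as Perl sees it: a string of wide characters
# (the first two rows of the table above).
my $corrupted = "\x{0638}\x{2020}\x{0639}\x{00AF}";

# Undo the damage: turn the characters back into the cp1256 bytes they
# were displayed from, then decode those bytes as the UTF-8 they really are.
my $bytes = encode('cp1256', $corrupted);   # "\xD9\x86\xDA\xAF"
my $fixed = decode('UTF-8',  $bytes);       # "\x{0646}\x{06AF}"

print join(' ', map { sprintf 'U+%04X', ord } split //, $fixed), "\n";
# prints: U+0646 U+06AF

To repair actual file names you would read each corrupted name, pass it through the same two calls, and rename the file; how the names come back from readdir (as bytes or as characters) depends on the platform, so treat this only as a sketch of the conversion itself.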

Related

How to identify encoding from hex values?

I have text on a website that displays like this: o¨ instead of ö
I extracted the text out of the CMS and analysed its hex values:
the ö's that display correctly have c3 b6 - UTF-8
the ö's that display incorrectly have 6f cc 88
I couldn't find out what encoding this is. What's a good way to identify the encoding?
6F is the UTF-8 (ASCII) encoding of "o", nothing spectacular.
CC 88 is the UTF-8 encoding of U+0308, COMBINING DIAERESIS.
You're simply looking at the decomposed form of the o-umlaut. A combining diaeresis character should visually be rendered, well, combined with the previous character. If your system doesn't do that, it means it doesn't handle Unicode correctly, and/or the font you have chosen is somewhat broken. Perhaps you have to normalise your strings into the composed Unicode form instead for your system to handle them correctly.
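If normalisation is the route you take, here is a minimal Perl sketch of composing such a string with Unicode::Normalize. The byte sequence is the one from the question; everything else is generic.

use strict;
use warnings;
use Unicode::Normalize qw(NFC);
use Encode qw(decode encode);

# The observed bytes: "o" followed by COMBINING DIAERESIS (U+0308).
my $decomposed = decode('UTF-8', "\x6F\xCC\x88");

# Compose it: U+006F + U+0308 becomes the single character U+00F6 (ö).
my $composed = NFC($decomposed);

printf "%vX\n", encode('UTF-8', $composed);   # prints C3.B6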

Convert Unicode code point to UTF-8 sequence

I am not sure I've got my nomenclature right, so please correct me :)
I've received a text file representing a Pāli dictionary: a list of words separated by newline \n (0x0a) characters. Supposedly, some of the special letters are encoded using UTF-8, but I doubt that.
Loading this text file into any of my editors (vim, Notepad, TextEdit, ..) shows quite scrambled text, for example
mhiti
A closer look at the actual bytes then reveals the following (using hexdump -C)
0a 0a 1e 6d 68 69 74 69 0a 0a ...mhiti..
which seems to me to be the Unicode code point U+1E6D ("ṭ", or LATIN SMALL LETTER T WITH DOT BELOW). That particular letter has the UTF-8 encoding e1 b9 ad.
My question: is there a tool which helps me convert this particular file into actual UTF-8 encoding? I tried iconv but without success; I looked briefly into a Python script but would think there's an easier way to get this done. It seems that this is a useful link for this problem, but isn't there a tool that can get this done? Am I missing something?
EDIT: Just to make things a little bit more entertaining, there seem to be actual UTF-8 encoded characters scattered throughout as well. For example, the word "ākiñcaññāyatana" has the following sequence of bytes
01 01 6b 69 c3 b1 63 61 c3 b1 c3 b1 01 01 79 61 74 61 6e 61
ā k i ñ c a ñ ñ ā y a t a n a
where the "ā" is encoded by its Unicode code point U+0101, and the "ñ" is encoded by the UTF-8 sequence \xC3\xB1, which corresponds to the Unicode code point U+00F1.
EDIT: Here's one that I can't quite figure out what it's supposed to be:
01 1e 37 01 01 76 61 6b 61
? ā v a k a
I can only guess, but that too doesn't make sense. The Unicode code point U+011E is a "Ğ" (UTF-8 \xC4\x9E), but that's not a Pāli character AFAIK; then a "7" follows, which doesn't make sense in a word. Then the Unicode code point U+1E37 is a "ḷ" (UTF-8 \xE1\xB8\xB7), which is a valid Pāli character. But that would leave the first byte \x01 by itself. If I had to guess, I would think this is the name "Jīvaka", but that would not match the bytes. LATER: According to the author, this is "Āḷāvaka". So, assuming the encoding heuristics from above, a \x00 is missing again. Adding it back in:
01 00 1e 37 01 01 76 61 6b 61
Ā ḷ ā v a k a
Are there "compressions" that remove \x00 bytes from UTF-16 encoded Unicode files?
I'm assuming in this context that "ṭhiti" makes sense as the contents of that file.
From your description, it looks like that file encodes characters < U+0080 as a single byte and characters > U+0100 as two-byte big-endian. That's not decodable, in general; two linefeeds (U+000A, U+000A) would have the same encoding as GURMUKHI LETTER UU (U+0A0A).
There's no invocation of iconv that'll decode it for you; you'll need to apply the heuristics you know, based on character ranges or ordering in the file, and write a custom decoder (or ask for another copy in a standard encoding).
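For what it's worth, here is a sketch of such a custom decoder in Perl, using only the heuristics visible in the question: newlines and printable ASCII pass through unchanged, and any other byte is taken as the high byte of a two-byte big-endian code point. The file names are made up, and the rule deliberately ignores the ambiguities mentioned above (it cannot tell two linefeeds from U+0A0A, and it does not handle the genuine UTF-8 sequences such as \xC3\xB1 that are scattered through the file).

use strict;
use warnings;

# Hypothetical file names.
open my $in,  '<:raw',             'dictionary.bin' or die $!;
open my $out, '>:encoding(UTF-8)', 'dictionary.txt' or die $!;

local $/;                           # slurp the whole file
my $bytes = <$in>;

my $i    = 0;
my $text = '';
while ($i < length($bytes)) {
    my $b = ord substr($bytes, $i, 1);
    if ($b == 0x0A || ($b >= 0x20 && $b < 0x7F)) {
        $text .= chr $b;            # newline or printable ASCII: one byte
        $i    += 1;
    }
    else {
        my $lo = ord substr($bytes, $i + 1, 1);
        $text .= chr(($b << 8) | $lo);   # two-byte big-endian code point
        $i    += 2;
    }
}
print {$out} $text;

Run against the example above, the bytes 0a 0a 1e 6d 68 69 74 69 0a 0a come out as "ṭhiti" surrounded by blank lines, which is the intended word.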
I think in the end this was my own fault, somehow. Browsing to this file showed a very mangled and broken version of the original UTF-16 encoded file; the "Save as" menu in the browser then saved that broken version, which is what led to the initial question in this thread.
It seems that the web browser tried to display the UTF-16 encoded file, removed non-printable characters like \x00, and converted some others to UTF-8, thus completely mangling the original file.
Using wget to fetch the file fixed the problem, and I could convert it nicely into UTF-8 and use it further.
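For reference, once the file really is plain UTF-16, the conversion described here is just as short in Perl as with iconv. The file names below are made up, and Encode's 'UTF-16' decoder picks the endianness from the BOM, so specify 'UTF-16LE' or 'UTF-16BE' explicitly if the file has none.

use strict;
use warnings;
use Encode qw(decode);

open my $in,  '<:raw',             'dictionary-utf16.txt' or die $!;
open my $out, '>:encoding(UTF-8)', 'dictionary-utf8.txt'  or die $!;
local $/;                                   # slurp
print {$out} decode('UTF-16', scalar <$in>);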

HEX-edit UTF-8 file

I am trying to create a UTF-8/no-BOM file with a hex editor. My desired Unicode character is the TUGRIK SIGN (U+20AE), which is e2 82 ae in UTF-8.
I created a UTF-8/no-BOM file with N++, copied the character into it in N++ and saved the file.
Voilà, it looks fine in the hex editor, a fancy e2 82 ae!
So I tried it the other way around, saving the 3 bytes e2 82 ae to a file with wxHexEditor. Crap, N++ thinks for some reason that the file is ANSI (Latin1) encoded.
I don't get it at all.
Might there be a collision with the Windows CP1252 encoding?
Another interesting thing (which I also don't get at all) is that wxHexEditor shows some disassembly for the files.
The disassembly for the N++-created file looks okay in wxHexEditor, but the wxHexEditor-created file shows invalid disassembly.
I would be really glad if someone could explain that black magic to me.
The file itself contains no encoding information, so your editor has to either guess the encoding or just display it in some default encoding, and Latin1 is a common default. In my version of N++ (6.1.2) it opens and displays correctly as UTF-8.
If your version doesn't guess correctly, then perhaps when you created the file in N++ you told N++ in advance that you were about to create a UTF-8 file with no BOM, and that's how it knew to display it correctly at that time.
About the assembler... First, it's not a case of assembly being "linked to" or "associated with" a file; rather, your hex editor simply tries to disassemble any file you give it.
The reason the disassembly differs is that in the "good" file you happened to have selected the first byte (or nothing), so wxHexEditor disassembles the entire file. In the "bad" version you have probably selected the second byte, and 82 ae does not disassemble to any valid code.
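To see the guessing problem in isolation, here is a small Perl sketch of the same experiment: write the three raw bytes exactly as a hex editor would, then read them back while explicitly declaring UTF-8, and they come out as U+20AE. The file name is made up.

use strict;
use warnings;

# Write the raw bytes of the TUGRIK SIGN, just like a hex editor would.
open my $raw, '>:raw', 'tugrik.txt' or die $!;
print {$raw} "\xE2\x82\xAE";
close $raw;

# Read them back, this time telling Perl the file is UTF-8.
open my $in, '<:encoding(UTF-8)', 'tugrik.txt' or die $!;
my $ch = <$in>;
close $in;

printf "U+%04X\n", ord $ch;   # U+20AE

The bytes on disk are identical either way; only the declared encoding differs, which is exactly the decision an editor has to guess at.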

Some UTF-8 characters do not show up on browser

Some UTF-8 characters do not show up in the browser, for example the character whose UTF-8 encoding is C2 96 (which I expected to be a hyphen). The browser displays it as a box containing 00 96, and not as '-' (hyphen). Any reasons for this behavior? How do we correct this?
http://stuffofinterest.com/misc/utf8.php?s=128 (Refer this URL for the codes)
I found that this can be handled with html entities. Is there any way to display this without converting to html entities?
The character you're talking about is an en-dash, not a hyphen. Its Unicode code point is U+2013, and its UTF-8 encoding is E2 80 93, not C2 96. That table you linked to is incorrect. The first two columns have nothing to do with UCS-2 or Unicode; they actually contain the windows-1252 encodings for the characters in question. The columns labeled "UTF-8 Hex" and "UTF-8 Native" are just plain wrong, at least for the rows labeled 128 to 159. The en-dash entities (&ndash; and &#8211;) represent an en-dash, but the UTF-8 sequence C2 96 represents a non-displayable control character.
You shouldn't need to encode those characters manually anyway. Just tell your text editor (or whatever you use to create the content) to save the file as UTF-8.
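A quick Perl check of the bytes involved, in case anyone wants to verify this; nothing here is specific to the site in question.

use strict;
use warnings;
use Encode qw(encode decode);

# The en-dash, U+2013, really is E2 80 93 in UTF-8 ...
printf "%vX\n", encode('UTF-8', "\x{2013}");          # prints E2.80.93

# ... while the bytes C2 96 decode to U+0096, a C1 control character.
printf "U+%04X\n", ord decode('UTF-8', "\xC2\x96");   # prints U+0096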
I suspect this is because the characters between U+0080 and U+009F inclusive are control characters. I'm still slightly surprised that they show differently when encoded directly in the HTML than using entities, but basically you shouldn't be using them to start with. U+0096 isn't really "hyphen", it's "start of guarded area".
See the U+0080-U+00FF code chart for more information. Basically, try to avoid control characters...
Two reasons come to mind:
Are you sure that you have output the correct character code to the browser? Better check in some hex viewer.
The font you are using doesn't have a glyph defined at this code point.

VerQueryValue and multi codepage Unicode characters

In our application we use the VerQueryValue() API call to fetch version info such as the ProductName. For some applications running on a machine set to Traditional Chinese (code page 950), where the ProductName contains Unicode characters spanning multiple code pages, some characters are not translated properly. For instance, in the sequence below,
51 00 51 00 6F 8F F6 4E A1 7B 06 74
some characters are returned as 0x003F (a question mark).
In the above sequence, the Unicode character 8F 6F (U+8F6F) is not picked up and converted properly by the WinAPI call and is simply replaced with 00 3F, since U+8F6F is present in code page 936 (i.e., Simplified Chinese) only.
The .exe has just one translation table - as '\StringFileInfo\080404B0' - which refers to a language ID of '804' for Traditional Chinese only
How should one handle such cases, where the ProductName contains Unicode characters from both code page 936 and code page 950 even though the translation table has only one entry? Is there any other API call to use?
Also, if I right-click on the exe and view the 'Details' tab, it shows the ProductName correctly! So it appears Microsoft uses a different API call or otherwise handles this correctly. I need to know how it is done.
Thanks in advance,
Venkat
It looks somewhat weird to have contents compatible only with one code page inside a block marked as another code page. This is the source of your problem.
The best way to handle multi-codepage issues is obviously to turn your app into a Unicode-aware application. There will be no conversion to any code page anymore, which will make everyone happy.
The LANGID (0804) is only an indication of the language of the contents in the block. If a version info resource has several blocks, you may program your app to look up the block in the language of your user.
When you call VerQueryValue() from an ANSI application, this LANGID is not taken into account when converting the Unicode contents to ANSI: you're ANSI, so Windows assumes you only understand the machine's default ANSI code page.
Note about display in console
Beware of the console! It's an old creature that is not totally Unicode-aware; it is based on code pages. Therefore, you should expect display problems which can't be addressed. Even worse: it uses its own code page (called the OEM code page), which may be different from the usual ANSI code page (although for East Asian languages, the OEM code page equals the ANSI code page).
HTH.