Emacs displays Chinese characters when I open an XML file

I have an XML file. When I open it with Emacs, it displays Chinese characters (see attachment). This happens on my Windows 7 PC with both Emacs and Notepad, and also on my Windows XP PC (see figure A). Figure B shows the hexl-mode view of A.
If I use the Windows XP PC of a colleague and open the file with Notepad, there are no Chinese characters, but there is one strange character. I saved it as a txt file and sent it by email to my Windows 7 PC (see figure C); the strange character was replaced with "?". (Due to restrictions I could not use my colleague's PC again to reproduce the Notepad file with the strange character.)
My question: it seems that there are characters in the XML file which create problems, and I don't know how to cope with that. Does anybody have an idea how I can manage this problem? Does it have something to do with encoding? Thanks for any hints.

By figure B, it looks like this file is encoded with a mixture of big-endian and little-endian UTF-16. It starts with fe ff, which is the byte order mark for big-endian UTF-16, and the XML declaration (<?xml version=...) is also big-endian, but the part starting with <report is little-endian. You can tell because the letters appear on even positions in the first part of the hexl display, but on odd positions further down.
Also, there is a null character (encoded as two bytes, 00 00) right before <report. Null characters are not allowed in XML documents.
However, since only some of the XML elements appear correctly in figure A, it seems that this mixing of byte orders continues throughout the file. The file is corrupt, and this probably needs to be resolved manually.
If there are no non-ASCII characters in the file, I would try to open the file in Emacs as binary (M-x revert-buffer-with-coding-system and specify binary), remove all null bytes (M-% C-q C-@ RET RET; C-q C-@ inserts a literal null character), save the file and hope for the best.
Another possible solution is to mark each region that appears as Chinese characters and recode it with M-x recode-region, giving utf-16-le for "Text was really in" and utf-16-be for "But was interpreted as".
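If you prefer to do the binary clean-up outside Emacs, here is a minimal sketch of the same trick, assuming the file really contains only ASCII characters: in UTF-16, every ASCII character is one text byte plus one null byte regardless of byte order, so stripping the null bytes (and the byte order marks) recovers the text even where big- and little-endian parts are mixed. The file names and the class name are placeholders.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class StripNulBytes {
    public static void main(String[] args) throws IOException {
        byte[] input = Files.readAllBytes(Path.of("broken.xml"));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte b : input) {
            // Drop NUL bytes and UTF-16 byte order marks (0xFE / 0xFF);
            // neither can occur in ASCII text, so nothing else is lost.
            if (b != 0x00 && b != (byte) 0xFE && b != (byte) 0xFF) {
                out.write(b);
            }
        }
        Files.write(Path.of("fixed.xml"), out.toByteArray());
    }
}

Note that the result is plain ASCII, so the encoding="UTF-16" attribute in the XML declaration no longer matches; either fix that attribute or save the file back as proper UTF-16 from Emacs.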

For some reason, Emacs treats "UTF-16" in an XML file's encoding attribute as big-endian, while Windows treats it as little-endian (for example, when exporting from Task Scheduler). Emacs will silently convert LE to BE if you edit and save such an XML file. You can mouse over the "U" at the lower left of the mode line to see the current encoding. Writing encoding="UTF-16LE" or encoding="UTF-16BE" will ruin the file after saving, because no BOM is written. I believe the latest version has this fixed.
<?xml version="1.0" encoding="UTF-16"?>
<hi />
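As an aside, Java's standard charset layer documents the same convention, which makes the difference easy to see; this small sketch (class name is a placeholder) shows that plain UTF-16 writes a big-endian BOM, while the LE and BE variants write no BOM at all:

import java.nio.charset.StandardCharsets;

public class Utf16BomDemo {
    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-16\"?>\n<hi />\n";

        byte[] plain = xml.getBytes(StandardCharsets.UTF_16);    // big-endian with BOM
        byte[] le    = xml.getBytes(StandardCharsets.UTF_16LE);  // no BOM
        byte[] be    = xml.getBytes(StandardCharsets.UTF_16BE);  // no BOM

        System.out.printf("UTF-16   starts with %02X %02X%n", plain[0] & 0xFF, plain[1] & 0xFF); // FE FF
        System.out.printf("UTF-16LE starts with %02X %02X%n", le[0] & 0xFF, le[1] & 0xFF);       // 3C 00 ('<')
        System.out.printf("UTF-16BE starts with %02X %02X%n", be[0] & 0xFF, be[1] & 0xFF);       // 00 3C
    }
}

When decoding, Java's "UTF-16" honours a BOM and falls back to big-endian if there is none, which mirrors the behaviour described above.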

The solution by legoscia, using Emacs's ability to change the encoding of individual regions within a file, solved my problem. Another possibility is the following (a sketch of the conversion step follows the list):
cut the part to convert
paste it into a new file and save it
open that file with an editor which can convert encodings
convert the file and save it
copy the converted text and paste it back into the original file at the place where you cut it out
In my case it worked with Atom, but not with Notepad++.
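For completeness, the conversion step itself can also be done with a few lines of code. This is only a sketch, assuming the cut-out region was little-endian UTF-16 and the rest of the document big-endian (as in legoscia's diagnosis above); the file names and class name are placeholders.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class RecodeRegion {
    public static void main(String[] args) throws IOException {
        // Decode the bytes as what they really were (UTF-16LE) ...
        byte[] raw = Files.readAllBytes(Path.of("region.txt"));
        String text = new String(raw, StandardCharsets.UTF_16LE);
        // ... and write them back in the byte order the rest of the file uses.
        Files.write(Path.of("region-be.txt"), text.getBytes(StandardCharsets.UTF_16BE));
    }
}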
PS: The reason why I used this approach is that Emacs could no longer open this kind of corrupted file. I don't know why, but that is another issue.
Edit 1: Since cutting, pasting and merging is cumbersome, I found out how to open such corrupted files with Emacs after all: emacs -q xmlfile.xml. Using Emacs as legoscia suggested is the best way to repair such files.

Related

recognising encodings in emacs

It is my understanding that txt files do not store encoding information, so text editors simply make an educated guess about the encoding of a given text file and then display the file on screen using that guessed encoding. If the editor guesses right, you get your text on the screen; if it guesses wrong, you (sometimes) get gibberish. Am I getting this right so far?
Now on to my problem. I have my bank statements in a CSV file. When I open it in MS Excel 14 (MS Office 2010), it recognises the encoding and displays the problematic word as "obračun". Great. When I open the file in Emacs 24.3.1, it fails to recognise the correct encoding and displays the problematic word as "obra鑾n". Not so great.
My question is: how do I tell Emacs which encoding the file is in?
Thanks.
From the Emacs Manual:
If Emacs recognizes the encoding of a file incorrectly, you can reread
the file using the correct coding system with C-x RET r
(revert-buffer-with-coding-system). This command prompts for the
coding system to use.
Give utf-16 a try.
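The same idea applies when reading the file from a program rather than an editor: name the coding system explicitly instead of relying on detection. A minimal sketch in Java, where the file name and the UTF-16 guess are assumptions about the asker's bank export:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadWithExplicitCharset {
    public static void main(String[] args) throws IOException {
        // Use the same coding system you would give to C-x RET r in Emacs.
        Charset cs = Charset.forName("UTF-16");
        try (BufferedReader reader = Files.newBufferedReader(Path.of("statement.csv"), cs)) {
            reader.lines().forEach(System.out::println);
        }
    }
}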

Notepad++ can recognize encoding?

I created a file with UTF-8 encoded content (using PHP's fputcsv).
When I open this file in Notepad++, the characters are wrong (Notepad++ starts with ANSI encoding).
When I select Format -> "Encode in UTF-8" from the menu, everything is fine.
I'm worried because Notepad++ can normally recognize the encoding somehow, so maybe something is wrong with my file created with fputcsv? A first byte or something?
Automatically detecting an encoding is not something that can be done accurately. It's pretty much essential that the encoding be specified explicitly. It can be guessed in some cases, but even then not with 100% certainty.
This documentation (Encoding) explains the situation in relation to Notepad++.
They also point out that the difficulty arises especially if the file has not been saved with a Byte Order Mark (BOM).
Given that your file displays correctly once you manually set the encoding, I would say there's nothing wrong with how you are generating and saving the file. The only thing you can check for is whether a BOM is being saved, which might improve the chances of Notepad++ being able to automatically detect the encoding.
It's worth noting that although it may help editors like Notepad++ identify the encoding more accurately, according to the Unicode Standard a BOM is neither required nor recommended for UTF-8.
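If you do decide to experiment with a BOM, it is just three bytes (EF BB BF) written before the content. The question uses PHP's fputcsv, but the idea is language-independent; here is a sketch in Java with a placeholder file name and made-up sample data:

import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class WriteUtf8Bom {
    public static void main(String[] args) throws IOException {
        try (OutputStream out = Files.newOutputStream(Path.of("export.csv"))) {
            out.write(new byte[] { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF }); // UTF-8 BOM
            // "caf\u00e9" = café, "\u00a3" = the pound sign - arbitrary non-ASCII sample data
            out.write("caf\u00e9,\u00a310\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}

With the BOM in place, editors have something concrete to detect; without it, detection remains a guess.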
You have to check the lower right corner of the Notepad++ GUI to see the actual encoding that is being used. The problem is not really specific to Notepad++, because guessing the right encoding is a hard problem without any real solution, so it's better to let the user decide which encoding is the most appropriate in each case.
When you want to reflect the encoding of the text file in a Java program, you have to consider two things: the encoding and the character set. When you open a text file in Notepad++, you can see the encoding under the "Encoding" menu; additionally, look at the character set entries there. Under "Eastern European" you will find "ISO 8859-2", and under "Central European" you will find "Windows-1250". You can then set the corresponding encoding in the Java program by looking it up in this table:
https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html
For example, for the Central European character set "Windows-1250" the table suggests the Java encoding name "Cp1250". Set that encoding and the characters will display properly in your program.
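As a quick illustration (not part of the answer above), the java.nio canonical name "windows-1250" and the table's "Cp1250" resolve to the same charset, so either spelling can be passed to Charset.forName; the class name below is a placeholder:

import java.nio.charset.Charset;

public class Cp1250Alias {
    public static void main(String[] args) {
        Charset fromTable = Charset.forName("Cp1250");       // name from the Oracle table
        Charset canonical = Charset.forName("windows-1250"); // java.nio canonical name
        System.out.println(fromTable.name());                // windows-1250
        System.out.println(fromTable.equals(canonical));     // true
    }
}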

How to "force" a file's ISO-8859-1ness?

I remember when I used to develop websites in Japan - where there are three different character encodings in currency - the developers had a trick to "force" the encoding of a source file so it would always open in their IDEs in the correct encoding.
What they did was to put a comment at the top of the file containing a Japanese character that only existed in that particular character encoding - it wasn't in any of the others! This worked perfectly.
I remember this because now I have a similar, albeit Anglophone, problem.
I've got some files that MUST be ISO-8859-1 but keep opening in my editor (Bluefish 1.0.7 on Linux) as UTF-8. This isn't normally a problem EXCEPT for pound (£) symbols and whatnot. Don't get me wrong, I can fix the file and save it out again as ISO-8859-1, but I want it to always open as ISO-8859-1 in my editor.
So, are there any sort of character hacks - like I mention above - to do this? Or any other methods?
PS. Unicode advocates / evangelists needn't waste their time trying to convert me because I'm already one of them! This is a rickety older system I've inherited :-(
PPS. Please don't say "use a different editor" because I'm an old fart and set in my ways :-)
Normally, if you have a £ encoded as ISO-8859-1 (i.e. a single byte 0xA3), that's not going to form part of a valid UTF-8 byte sequence, unless you're unlucky and it comes right after another top-bit-set character in such a way as to make them work together as a UTF-8 sequence. (You could guard against that by putting a £ on its own at the top of the file.)
So no editor should open any such file as UTF-8; if it did, it'd lose the £ completely. If your editor does that, “use a different editor”—seriously! If your problem is that your editor is loading files that don't contain £ or any other non-ASCII character as UTF-8, causing any new £ you add to them to be saved as UTF-8 afterwards, then again, simply adding a £ character on its own to the top of the file should certainly stop that.
What you can't necessarily do is make the editor load it as ISO-8859-1 as opposed to any other character set in which every single top-bit-set byte is valid. It's only multibyte encodings like UTF-8 and Shift-JIS that you can rule out, by including byte sequences that are invalid for that encoding.
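To make that reasoning concrete, here is a small sketch of the check an editor effectively performs (strict UTF-8 decoding): a lone 0xA3, the ISO-8859-1 pound sign, is rejected, whereas the two-byte UTF-8 form C2 A3 is accepted. Class and method names are placeholders.

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class ValidUtf8Check {
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] latin1Pound = { 'p', 'r', 'i', 'c', 'e', ' ', (byte) 0xA3, '5' };              // ISO-8859-1 £
        byte[] utf8Pound   = { 'p', 'r', 'i', 'c', 'e', ' ', (byte) 0xC2, (byte) 0xA3, '5' }; // UTF-8 £
        System.out.println(isValidUtf8(latin1Pound)); // false - cannot be mistaken for UTF-8
        System.out.println(isValidUtf8(utf8Pound));   // true
    }
}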
What will usually happen on Windows is that the editor will load the file using the system default code page, typically 1252 on a Western machine. (Not actually quite the same as ISO-8859-1, but close.)
Some editors have a feature where you can give them a hint what encoding to use with a comment in the first line, eg. for vim:
# vim: set fileencoding=iso-8859-1 :
The syntax will vary from editor to editor/configuration. But it's usually pretty ugly. Other controls may exist to change default encodings on a directory basis, but since we don't know what you're using...
In the long run, files stored as ISO-8859-1 or any other encoding that isn't UTF-8 need to go away and die, of course. :-)
You can put the character ÿ (0xFF) in the file. It's invalid in UTF-8. BBEdit on Mac then correctly identifies the file as ISO-8859-1. Not sure how your editor of choice will do.

How do I make emacs display a multi-byte encoded file properly? Is it mule?

When I open a multi-byte file, I get this (see the attached screenshot):
Short term, you can revisit the file with an alternate coding system with revert-buffer-with-coding-system (then select utf-16le).
Middle term, you can bump the priority of that utf-16le encoding on load with prefer-coding-system.
Long term, however, you'd better try to understand why Emacs did not pick the right encoding. I'm not sure how I can help there, though, short of digging inside the coding system guts, or at least having a file with which to reproduce the problem.
EDIT: Does this file have a BOM?
If memory serves, Emacs will prompt the user for an encoding if it cannot determine one. When it makes a wrong determination, you can use
C-x RET f coding RET
which will use coding as the coding system for the visited file in the current buffer.
In XML files, Emacs treats this as big-endian, while Windows treats it as little-endian.
<?xml version="1.0" encoding="UTF-16"?>
<hi />
Writing encoding="UTF-16LE" or encoding="UTF-16BE" will ruin the XML file after saving, because it takes off the BOM. A UTF-16LE file without a BOM can still be opened in Notepad.

Is there a way to get the encoding of a text file in UltraEdit?

Is there a setting in UltraEdit that allows me to see the encoding of the file?
In UltraEdit, the encoding that is being used to -display- the file is shown in the status bar at the right somewhere, together with the line-ending type in use, for example "U8-UNIX". You can also manually set the encoding in which the file is displayed; in version 10 this is under menu View -> Set Code Page. You can also -convert- the actual code page of the file under menu File -> Conversions.
If the file does not have a BOM header (a couple of bytes at the start of the file indicating the encoding), the -actual- encoding of the file can only be guessed. And even if the file has a BOM header, there can still be encoding issues.
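Checking for that BOM is the one part that can be read straight off the file; a minimal sketch (the file name and class name are placeholders, and only the three most common BOMs are checked):

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class SniffBom {
    public static void main(String[] args) throws IOException {
        try (InputStream in = Files.newInputStream(Path.of("mystery.txt"))) {
            int b0 = in.read(), b1 = in.read(), b2 = in.read();
            if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) {
                System.out.println("UTF-8 BOM");
            } else if (b0 == 0xFE && b1 == 0xFF) {
                System.out.println("UTF-16 big-endian BOM");
            } else if (b0 == 0xFF && b1 == 0xFE) {
                System.out.println("UTF-16 little-endian BOM");
            } else {
                System.out.println("No BOM - the encoding can only be guessed");
            }
        }
    }
}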
All text editors do this, and some are better at it than others. I haven't done a comparison to see which is best at it. At the moment (2012), I know UltraEdit fails to detect UTF-8 and other variants in text files of 1000 lines or more if the first UTF-8 character only appears later in the document. It also fails to show the encoding properly when you set it manually.
Notepad++ is also not great at detecting it, but when you know the encoding, you can set it manually.
Sublime Text is, as far as I know, best at detecting the encoding, also in large files.
I think there are also some very good command line tools out there, ported from GNU to Windows, to detect encoding. My bet would be that that's going to be the best option.