Emacs ediff, foreign character sets, and text file encodings

Whenever I use a character set in addition to Latin in a text file (mixing Cyrillic and Latin, say), I usually choose utf-16 for the encoding. This way I can edit the file under OS X with either Emacs or TextEdit.
But then ediff in emacs ceases to work. It says only that "Binary files this and that differ".
Can ediff be somehow made to work on text files that include foreign characters?

Customize the variable ediff-diff-options and add the option --text.
(setq ediff-diff-options "--text")
Edit:
Ediff calls out to an external program, the GNU diff utility, to compute the differences; however, diff does not understand UTF-16, and treats files containing null bytes (as UTF-16-encoded files do) as binary files. The option "--text" simply forces it to treat the input files as text files. See the manual for GNU Diffutils: Comparing and Merging Files; in particular 1.7 Binary Files and Forcing Text Comparisons.
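If you would rather not overwrite any switches you may already have in ediff-diff-options, a hedged variant (assuming a reasonably recent Emacs with with-eval-after-load) is to append the option once ediff's diff backend has loaded:
;; append --text to ediff's diff options without clobbering existing ones
(with-eval-after-load 'ediff-diff
  (unless (string-match-p "--text" ediff-diff-options)
    (setq ediff-diff-options (concat ediff-diff-options " --text"))))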

I strongly recommend you use utf-8 instead of utf-16. utf-8 is the standard encoding in most of the Unix world, including Mac OS X, and it does not suffer from those problems.
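To convert an existing buffer, a minimal sketch is to pin its coding system to UTF-8 and re-save it (interactively: C-x RET f utf-8 RET, then C-x C-s):
;; re-save the current buffer encoded as UTF-8
(set-buffer-file-coding-system 'utf-8)
(save-buffer)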

Related

difference between utf-8-emacs and utf-8 on Emacs

Or which one should I use for which purposes? Am I right in assuming that the utf-8-emacs coding system is for emacs lisp files and the utf-8 is for text files?
M-x describe-coding-system on the two returns:
U -- utf-8-emacs
  Support for all Emacs characters (including non-Unicode characters).
  Type: utf-8 (UTF-8: Emacs internal multibyte form)
  EOL type: Automatic selection from:
    [utf-8-emacs-unix utf-8-emacs-dos utf-8-emacs-mac]
  This coding system encodes the following charsets:
    emacs
U -- utf-8 (alias: mule-utf-8)
  UTF-8 (no signature (BOM))
  Type: utf-8 (UTF-8: Emacs internal multibyte form)
  EOL type: Automatic selection from:
    [utf-8-unix utf-8-dos utf-8-mac]
  This coding system encodes the following charsets:
    unicode
Not sure what is meant by
Support for all Emacs characters (including non-Unicode characters).
utf-8-emacs supports additional characters, such as the internal representation of binary data. As this is a non-standard extension of Unicode, a separate encoding was defined for it, so that if you use utf-8 you will not accidentally include these non-standard extensions, which could confuse other software.
You can use either encoding for elisp; unless you need to include binary data or obscure characters that are not part of Unicode, it won't make a difference.
utf-8-emacs is the encoding used internally by Emacs. It's visible in a few places (e.g. auto-save files), but as a general rule you should never use it unless you know what you're doing.
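For illustration, a quick way to see what a given buffer was actually decoded with (you should normally see something like utf-8-unix here, and rarely utf-8-emacs):
;; echo the coding system the current buffer will be saved with
(message "%s" buffer-file-coding-system)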

what's the difference among various types of 'utf-8' in emacs

In Emacs, after typing
M-x revert-buffer-with-coding-system
I could see many types of 'utf-8', for example utf-8, utf-8-auto-unix, utf-8-emacs-unix, etc.
I want to know what's the difference among them.
I have googled them but couldn't find a proper answer.
P.S.
I ask this question because I encountered an encoding problem a few months ago. I wrote a PHP program in Emacs, and in my ~/.emacs I set
(prefer-coding-system 'utf-8)
but when browsing the PHP page in a browser, I found the browser couldn't display the content correctly due to the encoding problem, even though I had written
<meta name="Content-Type" content="text/html; charset=UTF-8" />
in the page.
But after I used Notepad++ to store the file in utf-8, the browser could display the content correctly.
So I want to learn more about encoding in Emacs.
The last part of the encoding name (e.g. mac in utf-8-mac) usually describes the special character(s) used at the end of lines:
-mac: CR, the standard line delimiter on Mac OS (until OS X)
-unix: LF, the standard delimiter on Unix systems (hence the BSD-based Mac OS X)
-dos: CR+LF, the delimiter for DOS / Windows
Some additional encoding suffixes include:
-emacs: support for encoding all Emacs characters (including non-Unicode ones)
-with-signature: force the use of the BOM (see below)
-auto: autodetect the BOM
You can combine the different possibilities; that is what makes the long list shown by Emacs.
To get information on the line-ending type, BOM and charsets provided by an encoding, you can use describe-coding-system, bound to C-h C.
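For illustration, one way to see the utf-8 variants Emacs defines (a sketch; the EOL-specific variants like utf-8-unix are derived from these) is to filter the coding-system list:
(require 'seq)
;; list every non-subsidiary coding system whose name mentions utf-8
(seq-filter (lambda (cs) (string-match-p "utf-8" (symbol-name cs)))
            (coding-system-list))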
Concerning the BOM:
The UTF standards define a special signature to be placed at the beginning of (text) files. For UTF-16 it indicates the order of the bytes, or endianness (since UTF-16 stores characters in 16-bit units): some systems place the most significant byte first (big-endian -> utf-16be), others place the least significant byte first (little-endian -> utf-16le). That signature is called the BOM: the Byte Order Mark.
In UTF-8, each character is represented by a single byte (except for extended characters above 127, which use a special multi-byte sequence), so specifying a byte order makes no sense; but the signature is still useful to recognise a UTF-8 file as opposed to plain ASCII text. A UTF-8 file differs from an ASCII file only where extended characters occur, which can be impossible to detect without parsing the whole file until one is found, whereas the pseudo-BOM makes it visible instantly. (BTW, Emacs is very good at this kind of auto-detection.)
FYI, BOMs are the following bytes as very first bytes of a file:
utf-16le : FF FE
utf-16be : FE FF
utf-8 : EF BB BF
you can ask Emacs to open a file without any conversion with find-file-literally: if the first line begins with ï»¿, you are seeing the raw bytes of the undecoded utf-8 BOM
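As a small illustration (the file name is hypothetical), you can also check the raw bytes from Lisp; this sketch tests whether a file starts with the EF BB BF signature listed above:
;; check whether the first three bytes of a file are the UTF-8 BOM
(with-temp-buffer
  (set-buffer-multibyte nil)                         ; work on raw bytes
  (insert-file-contents-literally "myfile.txt" nil 0 3)
  (equal (buffer-string) (unibyte-string #xEF #xBB #xBF)))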
for some additional help while playing with encodings, you can refer to this complementary answer, "How to see encodings in emacs"
As @wvxvw said, your issue is probably a missing BOM at the beginning of the file, which caused it to be wrongly interpreted and rendered.
BTW, M-x hexl-mode is also a very handy tool for checking the raw content of a file. Thanks for pointing it out to me (I often used an external hex editor for that, when it could have been done directly in Emacs).
Can't say much about the issue, except that after setting
(prefer-coding-system 'utf-8)           ; make utf-8 the preferred coding system
(setq coding-system-for-read 'utf-8)    ; always read files as utf-8
(setq coding-system-for-write 'utf-8)   ; always write files as utf-8
I haven't had any unicode problems for more than 2 years.

Emacs displays Chinese characters if I open an xml file

I have an xml-file. When I open it with Emacs it displays Chinese characters (see attachment). This happens on my Windows 7 PC with Emacs and Notepad, and also on my Windows XP PC (see figure A). Figure B is the hexl-mode view of A.
If I use the Windows XP PC of a colleague and open the file with Notepad, there are no Chinese characters but one strange character. I saved it as a txt-file and sent it by email to my Windows 7 PC (see figure C). The strange character was replaced with "?". (Due to restrictions I could not use my colleague's PC to reproduce the Notepad file with the strange character.)
My questions: it seems that there are characters in the XML file which create problems. I don't know how to cope with that. Does anybody have an idea how I can manage this problem? Does it have something to do with encoding? Thanks for hints.
By figure B, it looks like this file is encoded with a mixture of big-endian and little-endian UTF-16. It starts with fe ff, which is the byte order mark for big-endian UTF-16, and the XML declaration (<?xml version=...) is also big-endian, but the part starting with <report is little-endian. You can tell because the letters appear on even positions in the first part of the hexl display, but on odd positions further down.
Also, there is a null character (encoded as two bytes, 00 00) right before <report. Null characters are not allowed in XML documents.
However, since some of the XML elements appear correctly in figure A, it seems that the endianness confusion continues throughout the file. The file is corrupt, and this probably needs to be resolved manually.
If there are no non-ASCII characters in the file, I would try to open the file in Emacs as binary (M-x revert-buffer-with-coding-system and specify binary), remove all null bytes (M-% C-q C-@ RET RET), save the file and hope for the best.
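The same null-byte cleanup can be sketched in Lisp (run it in the reverted buffer):
;; delete every NUL character in the current buffer
(save-excursion
  (goto-char (point-min))
  (while (search-forward "\0" nil t)
    (replace-match "" t t)))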
Another possible solution is to mark each region appearing with Chinese characters and recode it with M-x recode-region, giving "Text was really in" as utf-16-le and "But was interpreted as" as utf-16-be.
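Non-interactively, the same recoding looks roughly like this (a sketch, assuming the garbled part is the active region; recent Emacsen spell these coding systems utf-16le and utf-16be):
;; the text was really little-endian UTF-16 but was decoded as big-endian
(recode-region (region-beginning) (region-end) 'utf-16le 'utf-16be)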
For some reason, Emacs takes "UTF-16" in an XML file's encoding attribute as big-endian, while Windows takes "UTF-16" as little-endian (for example when exporting from Task Scheduler). Emacs will then silently convert LE to BE when you edit and save such an XML file. You can mouse over the "U" at the lower left of the mode line to see the current encoding. encoding="UTF-16LE" or encoding="UTF-16BE" will ruin the file after saving (no BOM). I believe the latest version has this fixed.
<?xml version="1.0" encoding="UTF-16"?>
<hi />
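If the file has to go back to Windows tools, one hedged workaround (not part of the answer above, just a sketch) is to pin the coding system explicitly before saving, so the byte order and BOM are unambiguous:
;; save as little-endian UTF-16 with a BOM and DOS line endings
(set-buffer-file-coding-system 'utf-16le-with-signature-dos)
(save-buffer)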
legoscia's solution, which uses Emacs's ability to change the encoding within a file, solved my problem. Another possibility is:
cut the part to convert
paste in a new file and save it
open it with an editor which can convert encodings
convert the file and save it
copy the converted string and paste it back into the original file, in place of the part you cut out
In my case it worked with Atom, but not with Notepad++.
PS: The reason I used this approach is that Emacs could no longer open this kind of corrupted file at all. I don't know why, but that is another issue.
Edit 1: Since cutting, pasting and merging is cumbersome, I found out how to open such corrupted files with Emacs after all: emacs -q xmlfile.xml. Using Emacs as legoscia suggested is the best way to repair such files.

How important is file encoding?

How important is file encoding? The default for Notepad++ is ANSI, but would it be better to use UTF-8 or what problems could occur if not using one or the other?
Yes, it would be better if everyone used UTF-8 for all documents always.
Unfortunately, they don't, primarily because Windows text editors (and many other Win tools) default to “ANSI”. This is a misleading name, as it has nothing to do with ANSI X3.4 (aka ASCII) or any other ANSI standard; in fact it means the system default code page of the current Windows machine. That default code page can change between machines, or on the same machine, at which point all text files in “ANSI” that contain non-ASCII characters such as accented letters will break.
So you should certainly create new files in UTF-8, but you will have to be aware that text files other people give you are likely to be in a motley collection of crappy country-specific code pages.
Microsoft's position has been that users who want Unicode support should use UTF-16LE files; it even, misleadingly, calls this encoding simply “Unicode” in save box encoding menus. MS took this approach because in the early days of Unicode it was believed that this would be the cleanest way of doing it. Since that time:
Unicode was expanded beyond 16-bit code points, removing UTF-16's advantage of each code unit being a code point;
UTF-8 was invented, with the advantage that as well as covering all of Unicode, it's backwards-compatible with 7-bit ASCII (which UTF-16 isn't as it's full of zero bytes) and for this reason it's also typically more compact.
Most of the rest of the world (Mac, Linux, the web in general) has, accordingly, already moved to UTF-8 as a standard encoding, eschewing UTF-16 for file storage or network purposes. Unfortunately Windows remains stuck with the archaic and useless selection of incompatible code pages it had back in the early Windows NT days. There is no sign of this changing in the near future.
If you're sharing files between systems that use differing default encodings, then a Unicode encoding is the way to go. If you don't plan on it, or use only the ASCII set of characters and aren't going to work with encodings that, for whatever reason, modify those (I can't think of any at the moment, but you never know...), you don't really need it.
As an aside, this is the sort of stuff that happens when you don't use a Unicode encoding for files with non-ASCII characters on a system with a different encoding from the one the file was created with: http://en.wikipedia.org/wiki/Mojibake
It is very important, since whatever tool you use will show the wrong characters if you use the wrong encoding. Try to load a Cyrillic file in Notepad without using UTF-8 or the like and you'll see a lot of "?" coming up. :)

How to "force" a file's ISO-8859-1ness?

I remember when I used to develop websites in Japan - where there are three different character encodings in currency - the developers had a trick to "force" the encoding of a source file so it would always open in their IDEs in the correct encoding.
What they did was to put a comment at the top of the file containing a Japanese character that only existed in that particular character encoding - it wasn't in any of the others! This worked perfectly.
I remember this because now I have a similar, albeit Anglophone, problem.
I've got some files that MUST be ISO-8859-1 but keep opening in my editor (Bluefish 1.0.7 on Linux) as UTF-8. This isn't normally a problem EXCEPT for pound (£) symbols and whatnot. Don't get me wrong, I can fix the file and save it out again as ISO-8859-1, but I want it to always open as ISO-8859-1 in my editor.
So, are there any sort of character hacks - like I mention above - to do this? Or any other methods?
PS. Unicode advocates / evangelists needn't waste their time trying to convert me because I'm already one of them! This is a rickety older system I've inherited :-(
PPS. Please don't say "use a different editor" because I'm an old fart and set in my ways :-)
Normally, if you have a £ encoded as ISO-8859-1 (i.e. a single byte 0xA3), that's not going to form part of a valid UTF-8 byte sequence, unless you're unlucky and it comes right after another top-bit-set character in such a way that together they form a valid UTF-8 sequence. (You could guard against that by putting a £ on its own at the top of the file.)
So no editor should open any such file as UTF-8; if it did, it'd lose the £ completely. If your editor does that, “use a different editor”—seriously! If your problem is that your editor is loading files that don't contain £ or any other non-ASCII character as UTF-8, causing any new £ you add to them to be saved as UTF-8 afterwards, then again, simply adding a £ character on its own to the top of the file should certainly stop that.
What you can't necessarily do is make the editor load it as ISO-8859-1 as opposed to any other character set in which all single top-bit-set bytes are valid. It's only multibyte encodings like UTF-8 and Shift-JIS that you can rule out, by using byte sequences that are invalid for that encoding.
What will usually happen on Windows is that the editor will load the file using the system default code page, typically 1252 on a Western machine. (Not actually quite the same as ISO-8859-1, but close.)
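As a point of comparison, Emacs exposes its own guessing machinery directly; a small sketch (hypothetical file name) that asks which coding systems the raw bytes could plausibly be:
(with-temp-buffer
  (set-buffer-multibyte nil)
  (insert-file-contents-literally "somefile.txt")
  ;; returns a list of candidate coding systems, ordered by priority
  (detect-coding-region (point-min) (point-max)))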
Some editors have a feature where you can give them a hint about what encoding to use with a comment in the first line, e.g. for vim:
# vim: set fileencoding=iso-8859-1 :
The syntax will vary from editor to editor/configuration. But it's usually pretty ugly. Other controls may exist to change default encodings on a directory basis, but since we don't know what you're using...
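In Emacs the equivalent hint is a file-local variable cookie on the first line of the file (iso-8859-1 is a recognised coding system name there), mirroring the vim line above:
# -*- coding: iso-8859-1 -*-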
In the long run, files stored as ISO-8859-1 or any other encoding that isn't UTF-8 need to go away and die, of course. :-)
You can put the character ÿ (0xFF) in the file. It's invalid in UTF-8. BBEdit on Mac correctly identifies it as ISO-8859-1. Not sure how your editor of choice will do.