Emacs - how to avoid or replace wrong character encodings?

Assume that I receive a Spanish text written in MS word and saved as plain text (.txt). Unfortunately, all the Spanish accents show up like this:
Un \372ltimo an\341lisis
Can anybody tell me how I can avoid this, or at least how I can replace these characters? The replace-regexp functions simply do not find them; otherwise I could write a little elisp function that replaces every occurrence with the corresponding accented Spanish character.

This looks like ISO 8859-1 (Latin-1) encoding.
Visit the file with that coding system instead. If Emacs does not automatically identify the coding system, you can revisit the file with an explicit coding system with revert-buffer-with-coding-system (C-x RET r).
For example, if you are looking at the garbled file you describe,
C-x RET r
latin-1 RET
yes RET
Then set the coding system you want for saving with C-x RET f, specifying something like utf-8.
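To see where the octal escapes come from and what the revisit does, here is a minimal Emacs Lisp sketch. It assumes the file's bytes really are ISO 8859-1, as guessed above; the helper name my-reread-latin-1-save-utf-8 is made up for illustration and is not a built-in command.

```elisp
;; \372 and \341 are the octal values of the raw bytes 0xFA and 0xE1.
;; Decoding those bytes as Latin-1 yields the intended accented letters.
(decode-coding-string "Un \372ltimo an\341lisis" 'latin-1)
;; => "Un último análisis"

;; Hypothetical helper: re-read the visited file as Latin-1, then mark the
;; buffer so that the next save writes UTF-8.
(defun my-reread-latin-1-save-utf-8 ()
  "Revisit the current file as Latin-1 and arrange to save it as UTF-8."
  (interactive)
  (let ((coding-system-for-read 'latin-1))
    ;; First t: ignore auto-save files. Second t: do not ask for confirmation.
    (revert-buffer t t))
  (set-buffer-file-coding-system 'utf-8))
```

Calling M-x my-reread-latin-1-save-utf-8 in the garbled buffer and then saving with C-x C-s should leave a UTF-8 file on disk. Unsaved edits in the buffer are discarded by the revert.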

Related

Fast unicode input in Emacs with US layout

I would like to have a quick way to input unicode characters with multicharacter sequences. For example to input ä I would type \a. Searching for this, I found agda-input.
While I could adapt agda-input for my use, I don't really need the whole Emacs mode for my purpose. So I was wondering whether such a thing already exists.
It is probably also not that difficult to code such an input mode. I would appreciate it if someone could suggest how to do that.

As @legoscia mentioned, you can use the TeX input method for such things, which is probably more general than agda-input (which seems to be specific to one programming language) and is also built in.
(setq default-input-method "TeX")
Then switch to the input method with C-\ or M-x toggle-input-method. You can then type "ä" with \"a. The minibuffer has hints when you type \.
There are other input methods (M-x list-input-methods), but TeX is a good one if you're not concerned with a specific language, or if you know LaTeX.
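If you want exactly the \a style from the question rather than TeX's \"a, a small custom input method can be defined with Quail, the library behind the built-in input methods. This is only a minimal sketch: the method name "my-umlauts" and the key sequences are assumptions chosen for illustration.

```elisp
(require 'quail)

;; Define a tiny input method. "my-umlauts" is a made-up name, "UTF-8" is
;; the language environment it is listed under, "ä" is the title shown in
;; the mode line, and t turns on guidance while a sequence is being typed.
(quail-define-package "my-umlauts" "UTF-8" "ä" t
                      "Hypothetical input method: \\a -> ä, \\o -> ö, \\u -> ü, \\s -> ß.")

;; These rules are added to the package defined just above.
(quail-define-rules
 ("\\a" ?ä) ("\\A" ?Ä)
 ("\\o" ?ö) ("\\O" ?Ö)
 ("\\u" ?ü) ("\\U" ?Ü)
 ("\\s" ?ß))

;; Make it the default, so that C-\ toggles it.
(setq default-input-method "my-umlauts")
```

With the method active, a backslash followed by a key that has no rule should simply insert both characters, so ordinary backslashes remain typable.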

Emacs displays chinese character if I open xml file

I have an XML file. When I open it with Emacs it displays Chinese characters (see attachment). This happens on my Windows 7 PC with both Emacs and Notepad, and also on my Windows XP machine (see figure A). Figure B shows A in hexl-mode.
If I use a colleague's Windows XP PC and open the file with Notepad, there are no Chinese characters, only one strange character. I saved it as a txt file and sent it by email to my Windows 7 PC (see figure C). The strange character was replaced with "?". (Due to restrictions I could not use my colleague's PC again to reproduce the Notepad file with the strange character.)
My questions: it seems that there are characters in the XML file which create problems, and I don't know how to cope with that. Does anybody have an idea how I can manage this problem? Does it have something to do with encoding? Thanks for any hints.

By figure B, it looks like this file is encoded with a mixture of big-endian and little-endian UTF-16. It starts with fe ff, which is the byte order mark for big-endian UTF-16, and the XML declaration (<?xml version=...) is also big-endian, but the part starting with <report is little-endian. You can tell because the letters appear on even positions in the first part of the hexl display, but on odd positions further down.
Also, there is a null character (encoded as two bytes, 00 00) right before <report. Null characters are not allowed in XML documents.
However, since some of the XML elements appear correctly in figure A, it seems that the confusion goes on through the file. The file is corrupt, and this probably needs to be resolved manually.
If there are no non-ASCII characters in the file, I would try to open the file in Emacs as binary (M-x revert-buffer-with-coding-system and specify binary), remove all null bytes (M-% C-q C-@ RET RET), save the file and hope for the best.
Another possible solution is to mark each region appearing with Chinese characters and recode it with M-x recode-region, giving "Text was really in" as utf-16-le and "But was interpreted as" as utf-16-be.
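What recode-region does to a region can be reproduced on a string, which also shows where the Chinese characters come from. A minimal sketch, assuming the damaged part really is little-endian UTF-16 that was read as big-endian; it uses utf-16le and utf-16be, the Emacs coding systems that neither expect nor write a BOM.

```elisp
;; "<report" as little-endian UTF-16 bytes: 3C 00 72 00 65 00 ...
(let* ((bytes (encode-coding-string "<report" 'utf-16le))
       ;; Wrong guess: big-endian pairs the same bytes as 3C00 7200 6500 ...,
       ;; which are code points in the CJK ranges.
       (garbled (decode-coding-string bytes 'utf-16be))
       ;; The repair: encode with the wrong coding system to get the
       ;; original bytes back, then decode them with the right one.
       (repaired (decode-coding-string
                  (encode-coding-string garbled 'utf-16be)
                  'utf-16le)))
  (list (format "U+%04X" (aref garbled 0)) repaired))
;; => ("U+3C00" "<report")
```

The round trip only works while the garbled text has not been edited or saved in a lossy encoding in between.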

For some reason, Emacs interprets "UTF-16" in an XML encoding declaration as big-endian, while Windows treats "UTF-16" as little-endian (for example when exporting from Task Scheduler). If you edit and save such an XML file, Emacs silently converts it from little-endian to big-endian. You can hover the mouse over the "U" at the lower left of the mode line to see the current encoding. Declaring encoding="UTF-16LE" or encoding="UTF-16BE" instead ruins the file on saving, because no BOM is written. I believe the latest version has this fixed.
<?xml version="1.0" encoding="UTF-16"?>
<hi />
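One way to sidestep the ambiguity might be to choose an explicit coding system before saving, so that the byte order does not depend on how "UTF-16" is interpreted. A minimal sketch, assuming the consuming Windows tool expects little-endian UTF-16 with a BOM; that expectation is an assumption, and Emacs may still ask for confirmation on saving if it considers the choice to conflict with the encoding declaration.

```elisp
;; Evaluate in the XML buffer (for example with M-:) before saving.
;; The coding system name states exactly what is written:
;; UTF-16, little-endian, with a BOM ("signature").
(set-buffer-file-coding-system 'utf-16le-with-signature)

;; Inspect what the next save will use.
buffer-file-coding-system
```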

legoscia's solution, which uses Emacs's ability to change the encoding of part of a file, solved my problem. Another possibility is:
Cut the part to convert.
Paste it into a new file and save it.
Open that file with an editor which can convert encodings.
Convert the file and save it.
Copy the converted text and paste it into the original file at the place where you cut the part to convert.
In my case it worked with Atom, but not with Notepad++.
PS: The reason I went this way is that Emacs could no longer open this kind of corrupted file. I don't know why, but that is another issue.
Edit 1: Since copying, pasting and merging is cumbersome, I looked for and found a way to open corrupted files with Emacs: emacs -q xmlfile.xml. Using Emacs as legoscia suggested is the best way to repair such files.

Emacs ediff, foreign character sets, and text file encodings

Whenever I use a character set in addition to Latin in a text file (mixing Cyrillic and Latin, say), I usually choose utf-16 for the encoding. This way I can edit the file under OS X with either Emacs or TextEdit.
But then ediff in emacs ceases to work. It says only that "Binary files this and that differ".
Can ediff be somehow made to work on text files that include foreign characters?

Customize the variable ediff-diff-options and add the option --text.
(setq ediff-diff-options "--text")
Edit:
Ediff calls out to an external program, the GNU utility diff, to compute the differences. However, diff does not understand Unicode: it classifies a file as binary when it finds null bytes near the start, and UTF-16 text is full of them. The option "--text" simply forces it to treat the input files as text files. See the manual for GNU Diffutils, Comparing and Merging Files; in particular section 1.7, Binary Files and Forcing Text Comparisons.

I strongly recommend you use utf-8 instead of utf-16. utf-8 is the standard encoding in most of the Unix world, including Mac OS X, and it does not suffer from those problems.
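Converting an existing UTF-16 file to UTF-8 can be done from Emacs Lisp as well as interactively with C-x RET f. A minimal sketch; the function name my-copy-as-utf-8 is hypothetical, and it assumes the source file starts with a BOM so that the utf-16 coding system can determine the byte order.

```elisp
(defun my-copy-as-utf-8 (from to)
  "Read file FROM as UTF-16 and write its text to file TO as UTF-8.
Hypothetical helper; FROM is left untouched."
  (with-temp-buffer
    ;; Decode the source as UTF-16, taking the byte order from its BOM.
    (let ((coding-system-for-read 'utf-16))
      (insert-file-contents from))
    ;; Encode the decoded text as UTF-8 with Unix (LF) line endings.
    (let ((coding-system-for-write 'utf-8-unix))
      (write-region (point-min) (point-max) to))))
```

For example, (my-copy-as-utf-8 "notes-utf16.txt" "notes-utf8.txt") with made-up file names; writing to a separate file keeps the original around in case the guess about its encoding was wrong.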

How to see the file's encoding in emacs?

I can't find the encoding of the current file. How can I display it?
You can see there are some Chinese characters in the file, but I don't know what encoding it uses.
Is there any way to have it always shown in the Emacs GUI?

You have several ways to get (and set) the buffer encoding:
The U in the mode line tells you that your buffer is in "Unicode"; if you hover the mouse over it, a tooltip shows the current buffer encoding.
You can also see the current encoding with C-h v buffer-file-coding-system RET.
You can change the whole buffer's encoding for the next save with C-x RET f.
You can also override the detected encoding and reload the file with another one using C-x RET r.
You can set an encoding for the next I/O command only with C-x RET c.
There are some other possibilities; take a look at C-x RET C-h.
Fix and diagnose:
Inside a buffer, if you are interested in the encoding or other details of a single character, put point on a Chinese character and type C-u C-x =. (Without the C-u, the same command shows only brief information about the character, and the encoding is not part of it.)
Examine the file yourself:
You can open a text file without any decoding or heuristics with M-x find-file-literally.
Or you can go closer to the metal (hex editor) with M-x hexl-find-file.
If the file is a mess of mixed encodings, you can fix portions with M-x recode-region.
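To keep the full coding system name visible at all times, instead of only the one-letter indicator, the mode line can be extended. A minimal sketch for an init file, assuming the default mode-line-format, which includes mode-line-misc-info; custom mode-line packages may not display it.

```elisp
;; Append the current buffer's coding system, e.g. " [utf-8-unix]", to the
;; part of the mode line reserved for miscellaneous information.
;; The :eval form is re-evaluated whenever the mode line is redrawn.
(add-to-list 'mode-line-misc-info
             '(:eval (format " [%s]" buffer-file-coding-system))
             t)
```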

xemacs: dotemacs config so that one can paste without getting "funny" chars

Copying text from websites via browser, paste into xemacs (21.4) buffer, and tildes, quotes, etc. don't copy correctly.
Example: he’s a dummy -> he\222s a dummy.
Can YOU copy & paste it without problems? If so, please help - how to config my .emacs to solve this. Thanks.

Fire this in your .emacs:
(set-clipboard-coding-system 'utf-16le-dos)
That should do it. Don't forget to hit C-x C-e on that statement, or restart xemacs.

This isn’t a clipboard or cygwin problem. If you save a UTF-8 text file with curly quotes in notepad and open it in XEmacs 21.4, you’ll get junk. According to the XEmacs reference documentation, Unicode is not supported before version 21.5.6. Maybe try a later version?

You're attempting to copy+paste smart quotes into XEmacs. In this case, '\222' is the octal notation for the byte 0x92, which is how the code page Windows-1252 encodes the character RIGHT SINGLE QUOTATION MARK (U+2019).
XEmacs uses UTF-8 internally, so you'll have to configure the copy+paste to convert from Windows-1252 to UTF-8. I don't know how to do that.
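The mapping itself can be checked in a Unicode-capable GNU Emacs, which ships a windows-1252 coding system. This is a minimal sketch of the byte-to-character relationship only; it is not a fix for XEmacs 21.4, where that coding system may not be available.

```elisp
;; "\222" is the raw byte 0x92 written in octal. Decoded as Windows-1252
;; it becomes RIGHT SINGLE QUOTATION MARK.
(let ((fixed (decode-coding-string "he\222s a dummy" 'windows-1252)))
  (list fixed (format "U+%04X" (aref fixed 2))))
;; => ("he’s a dummy" "U+2019")
```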

Simplest thing to do is write a quick function that translates those characters using replace-string.
You could also have xemacs set to accept that code page directly.

Switch to emacs, it works like a champ (GNU Emacs 23.0.91.1 (i386-mingw-nt6.0.6002) from Emacsw32 here). This may be the Emacsw32 patches in action.
