As the title says:
Which encoding do I have to choose from Emacs's long list of encodings (accessible via, e.g., C-x RET f to set the coding system for saving the file) so that it matches ansinew in LaTeX, as given by:
\usepackage[ansinew]{inputenc}
In the inputenc manual you can read:
ansinew    Windows 3.1 ANSI encoding, extension of Latin-1 (synonym for cp1252).
In Emacs, cp1252 is an alias of windows-1252, so you can use either of them.
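To see why the two names are interchangeable but Latin-1 is not a third synonym, here is a small illustration in Python (not Emacs, purely to show the byte-level behavior): cp1252 is a superset of Latin-1, so a character like the euro sign encodes in cp1252 but not in latin-1.

```python
s = "É€"  # É exists in Latin-1; € only exists in the cp1252 extension
print(s.encode("cp1252"))  # b'\xc9\x80'
try:
    s.encode("latin-1")
except UnicodeEncodeError:
    print("the euro sign is not representable in Latin-1")
```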
Related
I have a text file that needs to be in ANSI mode. It specifies only that: ANSI. Notepad++ has an option to convert to ANSI and that does the trick.
In VS Code I don't find this encoding option. So I read up on it and it looks like ANSI doesn't really exist and should actually be called Windows-1252.
However, there's a difference between Notepad++'s ANSI encoding and VS Code's Windows-1252: they pick different codepoints for characters such as an accented uppercase E (É), as is evident from the document.
When I let VS Code guess the encoding of the document converted to ANSI by Notepad++, however, it still guesses Windows-1252.
So the questions are:
Is there something like pure ANSI?
How can VS Code convert to it?
Check out the upcoming VSCode 1.48 (July 2020) and its browser support.
It also solves issue 33720, where you previously had to force an encoding for the whole project with, for instance:
"files.encoding": "utf8",
"files.autoGuessEncoding": false
Now you can set the encoding file by file:
Text file encoding support
All of the text file encodings of the desktop version of VSCode are now supported for web as well.
As you can see, the encoding list includes ISO 8859-1, which is the closest standard to what "ANSI encoding" could represent.
Windows code page 1252 was based on ISO 8859-1 but is not identical to it:
It differs from IANA's ISO-8859-1 by using displayable characters rather than control characters in the 80 to 9F (hex) range.
From Wikipedia:
Historically, the phrase "ANSI Code Page" was used in Windows to refer to non-DOS encodings; the intention was that most of these would be ANSI standards such as ISO-8859-1.
Even though Windows-1252 was the first and by far most popular code page named so in Microsoft Windows parlance, the code page has never been an ANSI standard.
Microsoft explains, "The term ANSI as used to signify Windows code pages is a historical reference, but is nowadays a misnomer that continues to persist in the Windows community."
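That 80 to 9F difference is easy to observe directly; a minimal Python sketch (outside both editors, just to show the byte-level behavior):

```python
b = bytes([0x93, 0x94])  # "smart quotes" in Windows-1252
print(b.decode("cp1252"))   # prints the curly quotes U+201C and U+201D
print(b.decode("latin-1"))  # two invisible C1 control characters U+0093 / U+0094
```

So a file containing bytes in that range renders differently depending on whether a tool interprets "ANSI" as Windows-1252 or as ISO-8859-1.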
Or which one should I use for which purposes? Am I right in assuming that the utf-8-emacs coding system is for emacs lisp files and the utf-8 is for text files?
M-x describe-coding-system on the two return:
U -- utf-8-emacs
Support for all Emacs characters (including non-Unicode characters).
Type: utf-8 (UTF-8: Emacs internal multibyte form)
EOL type: Automatic selection from:
[utf-8-emacs-unix utf-8-emacs-dos utf-8-emacs-mac]
This coding system encodes the following charsets:
emacs
U -- utf-8 (alias: mule-utf-8)
UTF-8 (no signature (BOM))
Type: utf-8 (UTF-8: Emacs internal multibyte form)
EOL type: Automatic selection from:
[utf-8-unix utf-8-dos utf-8-mac]
This coding system encodes the following charsets:
unicode
Not sure what is meant by
Support for all Emacs characters (including non-Unicode characters).
utf-8-emacs supports additional characters, such as the internal representation of binary data. As this is a non-standard extension of Unicode, a separate coding system was defined for it; if you use utf-8, you will not accidentally include these non-standard extensions, which could confuse other software.
You can use either encoding for Elisp; unless you need to include binary data or obscure characters that are not part of Unicode, it won't make a difference.
utf-8-emacs is the encoding used internally by Emacs. It's visible in a few places (e.g. auto-save files), but as a general rule you should never use it unless you know what you're doing.
In Emacs, after typing
M-x revert-buffer-with-coding-system
I could see many variants of utf-8, for example utf-8, utf-8-auto-unix, utf-8-emacs-unix, etc.
I want to know what's the difference among them.
I have googled them but couldn't find a proper answer.
P.S.
I ask this question because I encountered an encoding problem a few months ago. I wrote a php program in Emacs and in my ~/.emacs, I set
(prefer-coding-system 'utf-8)
but when browsing the PHP page in a browser, I found the browser couldn't display the content correctly due to an encoding problem, even though I had written
<meta name="Content-Type" content="text/html; charset=UTF-8" />
in the page.
But after I used notepad++ to store the file in utf-8, the browser could display the content correctly.
So I want to learn more about encoding in Emacs.
The last part of the encoding name (e.g. mac in utf-8-mac) usually describes the special character that will be used at the end of lines:
-mac: CR, the standard line delimiter on MacOS (until OS X)
-unix: LF, the standard delimiter on Unix systems (including the BSD-based Mac OS X)
-dos: CR+LF, the delimiter for DOS / Windows
Some additional encoding parameters include:
-emacs: support for encoding all Emacs characters (including non Unicode)
-with-signature: force the usage of the BOM (see below)
-auto: autodetect the BOM
You can combine the different possibilities, which makes up the list shown in Emacs.
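Purely as an illustration of that naming scheme (this is not how Emacs parses these names internally, and parse_coding_system is a made-up helper), the names decompose into a base encoding, optional modifiers, and an optional EOL suffix:

```python
# map the trailing EOL suffix of an Emacs-style coding-system name
EOL = {"unix": "LF", "dos": "CR+LF", "mac": "CR"}

def parse_coding_system(name):
    # split off a trailing -unix / -dos / -mac suffix if present
    parts = name.split("-")
    eol = EOL.get(parts[-1])
    if eol:
        parts = parts[:-1]
    return "-".join(parts), eol

print(parse_coding_system("utf-8-with-signature-dos"))  # ('utf-8-with-signature', 'CR+LF')
print(parse_coding_system("utf-8-mac"))                 # ('utf-8', 'CR')
```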
To get information about the line-ending type, BOM, and charsets provided by an encoding, you can use describe-coding-system, or: C-h C
Concerning the BOM:
the UTF standard defines a special signature to be placed at the beginning of (text) files. For utf-16, which stores each character in 2 bytes (16 bits), it distinguishes the byte order, or endianness: some systems place the most significant byte first (big-endian -> utf-16be), others place the least significant byte first (little-endian -> utf-16le). That signature is called the BOM: the Byte Order Mark
in utf-8, ASCII characters are represented by a single byte each (characters above 127 use a multi-byte sequence), so specifying a byte order makes no sense; the signature is nevertheless useful for detecting a utf-8 file rather than plain ASCII text. A utf-8 file differs from an ASCII file only in its extended characters, which can be impossible to detect without parsing the whole file until one is found, whereas the pseudo-BOM makes it visible instantly. (BTW, Emacs is very good at such auto-detection)
FYI, the BOMs are the following bytes at the very start of a file:
utf-16le : FF FE
utf-16be : FE FF
utf-8 : EF BB BF
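These signatures can be checked directly against the first bytes of a file; a small Python sketch (detect_bom is a hypothetical helper, not a standard function):

```python
import codecs

def detect_bom(data):
    # compare the first bytes against the known signatures
    if data.startswith(codecs.BOM_UTF8):      # EF BB BF
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "utf-16le"
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "utf-16be"
    return None

print(detect_bom(b"\xef\xbb\xbfhello"))  # utf-8
```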
you can ask Emacs to open a file without any conversion with find-file-literally: if the first line begins with ï»¿, you are seeing the undecoded utf-8 BOM
for some additional help while playing with encodings, you can refer to this complementary answer "How to see encodings in emacs"
As @wvxvw said, your issue is probably a missing BOM at the beginning of the file, which caused it to be wrongly interpreted and rendered.
BTW, M-x hexl-mode is also a very handy tool for checking the raw content of a file. Thanks for pointing it out to me (I often use an external hex editor for that, though it can be done directly in Emacs)
Can't say much about the issue, except that after setting
(prefer-coding-system 'utf-8)
(setq coding-system-for-read 'utf-8)
(setq coding-system-for-write 'utf-8)
I haven't had any unicode problems for more than 2 years.
I'm trying to open a file in Emacs 23.1 which I believe to be encoded in cp1256 (a.k.a. windows-1256, Arabic). However, when using 'C-x RET c' (universal-coding-system-argument) to select an encoding before opening the file, I am told that there is no match for either of these, and in the Completions buffer I can see that cp125[0-578] are all represented, but no cp1256. I found the file $emacs/etc/charsets/CP1256.map, and when I call 'M-x list-character-sets', cp1256 is present. I believe the issue is that Mule is not aware of this charmap; how do I get it to use that file?
Do you see the windows-1256 instead of the cp1256?
Also, you can try to open your file with the default / autodetected encoding, then re-load it with C-x RET r: do you have either of these encodings, windows-1256 / cp1256?
Whenever I use a character set in addition to latin in a text file (mixing Cyrillic and Latin, say), I usually choose utf-16 for the encoding. This way I can edit the file under OS X with either emacs or TextEdit.
But then ediff in emacs ceases to work. It says only that "Binary files this and that differ".
Can ediff be somehow made to work on text files that include foreign characters?
Customize the variable ediff-diff-options and add the option --text.
(setq ediff-diff-options "--text")
Edit:
Ediff calls out to an external program, the GNU utility diff, to compute the differences; however, diff does not understand Unicode and sees UTF-16-encoded files as binary, because the NUL bytes that UTF-16 uses to encode ASCII characters make it classify the file as binary. The option "--text" simply forces it to treat the input files as text files. See the manual for GNU Diffutils: Comparing and Merging Files; in particular 1.7 Binary Files and Forcing Text Comparisons.
I strongly recommend you use utf-8 instead of utf-16. utf-8 is the standard encoding in most of the Unix world, including Mac OS X, and it does not suffer from those problems.
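The byte-level reason diff chokes on UTF-16 but not UTF-8 is easy to see; in Python, for illustration:

```python
text = "hi"
print(text.encode("utf-16-le"))  # b'h\x00i\x00': the NUL bytes make diff classify the file as binary
print(text.encode("utf-8"))      # b'hi': identical to plain ASCII, so diff handles it fine
```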