Emacs c-mode can't recognize utf-8? - emacs

I need to read a C++ header file that contains some Chinese text and is encoded in UTF-8.
Emacs should recognize this encoding, but the Chinese text is not displayed correctly.
When I switch the buffer to text-mode, it is displayed correctly.
I also tested python-mode, lisp-mode, etc., and they all work. Only c-mode, c++-mode and java-mode fail, so it seems something is wrong with cc-mode or cc-vars?
Please help if you know how to fix this weird problem.

That looks more like a missing font (rather than encoding) issue; i.e., your system lacks a properly configured Chinese italic font.

Actually, it is arguably a bug in Emacs: it should fall back to some other font (non-italic, if needed) rather than display blank squares. We have fixed a few such problems over the years, so try the latest Emacs-24 pretest to see if the bug is already fixed there, and otherwise use M-x report-emacs-bug.

Related

Force Emacs to NOT change the encoding

I'll try to put this as clearly as possible, since I myself don't understand very well what's going on.
If I have a buffer opened in Emacs, and it's in, let's say UTF-8 (could be anything really), and I paste some text that is in another encoding (from a PDF for example), Emacs will CHANGE the original encoding (UTF8) to the new encoding...
This is a pain in the ass, because it screws up thousands of other text lines, just so the new line can be correctly displayed...
So I guess my question is: how can I tell Emacs to NEVER change the encoding of the file? If a character can't be represented in the present encoding, then just don't show it (or show it all messed up, like usually happens).
Thanks
Specifying -*- coding: utf-8 -*- (or whatever encoding you want) at the top of the file will force that encoding for that particular file. The relevant manual page is here.
For a more systematic approach, you might want to investigate the docstring for the file-coding-system-alist variable, which forces encodings based on filenames.

How can I get the charset of a string/buffer?

I need an elisp function that guesses the charset of some html, and since Emacs already does that when opening a file, I wonder if I can reuse it somehow, perhaps by writing the string in a temporary buffer, setting the correct charset, and getting it. Are there such functions?
Thanks!
See detect-coding-string.
I don't think that Emacs has anything built in to guess a character encoding, but it can read character-encoding hints in files, such as -*- coding: utf-8 -*-. You can take a look at this external library though. I guess that you're using some web browser for Emacs like W3M, and it probably has something to deal with character encodings based on the HTTP meta-information it receives. This article might also be of some help.

Emacs automatic encoding conversion

When I open a buffer in Emacs containing a German Umlaut (the word "Präsentation" occurs in a string), Emacs automatically converts it to a different encoding as soon as I save the file.
How can I tell Emacs to leave the encoding alone?
M-x set-buffer-file-coding-system is what you are looking for.
You might also have a look at http://www.delorie.com/gnu/docs/emacs/emacs_221.html.
Perhaps you want find-file-literally.
I don't know whether it's a bug in Stack Overflow, or you really mean you see an A~ and the universal currency symbol in Emacs. If the problem is Emacs displaying the wrong characters, then the following might help:
(prefer-coding-system 'utf-8)
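The "A~ and universal currency symbol" pair mentioned above is what one would expect to see when the UTF-8 bytes of "ä" are decoded as Latin-1. A minimal Python sketch of that effect (this is not Emacs code; the variable names are made up for illustration, and only the sample word comes from the question):

```python
# Sample word from the question; everything else here is illustrative.
original = "Präsentation"

# In UTF-8, "ä" (U+00E4) is stored as the two bytes C3 A4.
utf8_bytes = original.encode("utf-8")

# Decoding those bytes as Latin-1 turns each byte into its own character:
# C3 becomes "Ã" and A4 becomes "¤" (the generic currency sign).
misread = utf8_bytes.decode("latin-1")
print(misread)  # PrÃ¤sentation

# Undoing the wrong decoding step recovers the text, provided no bytes were lost.
repaired = misread.encode("latin-1").decode("utf-8")
print(repaired == original)  # True
```

If the buffer shows this pattern, the file itself is probably fine and only the decoding chosen on load is wrong.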

Problem with LaTeX hyperref

I have a URL with Cyrillic characters:
http://www.pravoslavie.bg/Възпитание/Духовно-и-светско-образование
When I compile the document, I get the following as the URL:
http://www.pravoslavie.bg/%5CT2A%5CCYRV%20%5CT2A%5Ccyrhrdsn%20%5CT2A%5Ccyrz%20%5CT2A%5Ccyrp%20%5CT2A%5Ccyri%20%5CT2A%5Ccyrt%20%5CT2A%5Ccyra%20%5CT2A%5Ccyrn%20%5CT2A%5Ccyri%20%5CT2A%5Ccyre%20/%5CT2A%5CCYRD%20%5CT2A%5Ccyru%20%5CT2A%5Ccyrh%20%5CT2A%5Ccyro%20%5CT2A%5Ccyrv%20%5CT2A%5Ccyrn%20%5CT2A%5Ccyro%20-%5CT2A%5Ccyri%20-%5CT2A%5Ccyrs%20%5CT2A%5Ccyrv%20%5CT2A%5Ccyre%20%5CT2A%5Ccyrt%20%5CT2A%5Ccyrs%20%5CT2A%5Ccyrk%20%5CT2A%5Ccyro%20-%5CT2A%5Ccyro%20%5CT2A%5Ccyrb%20%5CT2A%5Ccyrr%20%5CT2A%5Ccyra%20%5CT2A%5Ccyrz%20%5CT2A%5Ccyro%20%5CT2A%5Ccyrv%20%5CT2A%5Ccyra%20%5CT2A%5Ccyrn%20%5CT2A%5Ccyri%20%5CT2A%5Ccyre
and that is not the same. Can I set the encoding to UTF-8 for hyperref? Or how else can I solve the problem?
If you're happy not to use the \url command (i.e., you'll need to break lines manually) you can do the following in regular LaTeX:
\documentclass{article}
\usepackage[T2A]{fontenc}
\usepackage[utf8]{inputenc}
\begin{document}
\texttt{http://www.pravoslavie.bg/Възпитание/Духовно-и-светско-образование}
\end{document}
If you need to get the hyperlinks working, my only suggestion for now is to use either XeTeX or LuaTeX to be able to use proper unicode input/output. Something like the following produces at least the correct-looking output in XeTeX, although the hyperlink itself is broken for some reason :(
\documentclass{article}
\usepackage{fontspec,hyperref}
\setmonofont{Arial Unicode MS}
\begin{document}
\url{http://www.pravoslavie.bg/Възпитание/Духовно-и-светско-образование}
\end{document}
I had a similar problem with the pdftitle field.
Splitting the \usepackage declaration from the setup made it work correctly:
\usepackage{hyperref}
\hypersetup{
pdftitle=Priorità
}
Assuming your LaTeX source is utf8 encoded, try adding \usepackage[utf8]{inputenc} to your document. If utf8 doesn't work try utf8x. See here
If it is, as the other posters seem to assume, a charset issue, make sure the character encodings of the BibTeX source and the TeX document match. Cf. Q#1635788: Different encoding of latex and bibtex files. They don't both have to be UTF-8; I should think that latin-5 or KOI8-R would both work, but UTF-8 is the best supported.
If it isn't, then as per my comment above: look at the software chain that you are using (editor, makefiles, etc.) to see if something is doing unwanted URL escaping for you. Then deal ruthlessly with the offending software.
@Mike Weller:
I already have \usepackage[utf8]{inputenc} in my document. With utf8x I get the following as the URL:
http://www.pravoslavie.bg/\begingroup\let\relax\relax\
endgroup[Pleaseinsert\PrerenderUnicode{Ð}intopreamble]\begingroup\let\relax\relax\
endgroup[Pleaseinsert\PrerenderUnicode{Ñ}intopreamble]\begingroup\let\relax\relax\
endgroup[Pleaseinsert\PrerenderUnicode{з}intopreamble]\begingroup\let\relax\relax\
endgroup[Pleaseinsert\PrerenderUnicode{п}intopreamble]\begingroup\let\relax\relax\
endgroup[Pleaseinsert\PrerenderUnicode{Ð ̧}intopreamble]\begingroup\let\relax\relax\
endgroup[Pleaseinsert\PrerenderUnicode{Ñ}intopreamble]\begingroup\let\relax\relax\
endgroup[Pleaseinsert\PrerenderUnicode{а}intopreamble]\begingroup\let\relax\relax\
endgroup[Pleaseinsert\PrerenderUnicode{Ð1⁄2}intopreamble]\begingroup\let\relax\relax\
endgroup[Pleaseinsert\PrerenderUnicode{Ð ̧}intopreamble]\begingroup\let\relax\relax\
endgroup[Pleaseinsert\PrerenderUnicode{Ðμ}intopreamble]/\begingroup\let\relax\
relax\endgroup[Pleaseinsert\PrerenderUnicode{Ð}intopreamble]\begingroup\let\relax\
relax\endgroup[Pleaseinsert\PrerenderUnicode{Ñ}intopreamble]\begingroup\let\relax\
relax\endgroup[Pleaseinsert\PrerenderUnicode{Ñ}intopreamble]\begingroup\let\relax\
relax\endgroup[Pleaseinsert\PrerenderUnicode{Ð3⁄4}intopreamble]\begingroup\let\relax\
relax\endgroup[Pleaseinsert\PrerenderUnicode{Ð2}intopreamble]\begingroup\let\relax\
relax\endgroup[Pleaseinsert\PrerenderUnicode{Ð1⁄2}intopreamble]\begingroup\let\relax\
relax\endgroup[Pleaseinsert\PrerenderUnicode{Ð3⁄4}intopreamble]-\begingroup\let\
relax\relax\endgroup[Pleaseinsert\PrerenderUnicode{Ð ̧}intopreamble]-\begingroup\
let\relax\relax\endgroup[Pleaseinsert\PrerenderUnicode{Ñ}intopreamble]\begingroup\
let\relax\relax\endgroup[Pleaseinsert\PrerenderUnicode{Ð2}intopreamble]\begingroup\
let\relax\relax\endgroup[Pleaseinsert\PrerenderUnicode{Ðμ}intopreamble]\begingroup\
let\relax\relax\endgroup[Pleaseinsert\PrerenderUnicode{Ñ}intopreamble]\begingroup\
let\relax\relax\endgroup[Pleaseinsert\PrerenderUnicode{Ñ}intopreamble]\begingroup\
let\relax\relax\endgroup[Pleaseinsert\PrerenderUnicode{Ðo}intopreamble]\begingroup\
let\relax\relax\endgroup[Pleaseinsert\PrerenderUnicode{Ð3⁄4}intopreamble]-\
begingroup\let\relax\relax\endgroup[Pleaseinsert\PrerenderUnicode{Ð3⁄4}intopreamble]
\begingroup\let\relax\relax\endgroup[Pleaseinsert\PrerenderUnicode{б}intopreamble]
\begingroup\let\relax\relax\endgroup[Pleaseinsert\PrerenderUnicode{Ñ}intopreamble]
\begingroup\let\relax\relax\endgroup[Pleaseinsert\PrerenderUnicode{а}intopreamble]
\begingroup\let\relax\relax\endgroup[Pleaseinsert\PrerenderUnicode{з}intopreamble]
\begingroup\let\relax\relax\endgroup[Pleaseinsert\PrerenderUnicode{Ð3⁄4}intopreamble]
\begingroup\let\relax\relax\endgroup[Pleaseinsert\PrerenderUnicode{Ð2}intopreamble]
\begingroup\let\relax\relax\endgroup[Pleaseinsert\PrerenderUnicode{а}intopreamble]
\begingroup\let\relax\relax\endgroup[Pleaseinsert\PrerenderUnicode{Ð1⁄2}intopreamble]
\begingroup\let\relax\relax\endgroup[Pleaseinsert\PrerenderUnicode{Ð ̧}intopreamble]
\begingroup\let\relax\relax\endgroup[Pleaseinsert\PrerenderUnicode{Ðμ}intopreamble]D
Edit: the problem is solved. I used URL encoding (percent-encoding) to convert the Cyrillic characters :)
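The poster does not say how the URL was converted. As one possible way to do it outside LaTeX, here is a minimal Python sketch that percent-encodes the path as UTF-8 bytes using the standard library; the helper name `percent_encode_path` is hypothetical and the URL is the one from the question:

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def percent_encode_path(url: str) -> str:
    """Return url with its path percent-encoded as UTF-8 (hypothetical helper)."""
    parts = urlsplit(url)
    # quote() encodes non-ASCII characters as UTF-8 bytes by default;
    # safe="/" leaves the path separators untouched.
    encoded_path = quote(parts.path, safe="/")
    return urlunsplit(
        (parts.scheme, parts.netloc, encoded_path, parts.query, parts.fragment)
    )


url = "http://www.pravoslavie.bg/Възпитание/Духовно-и-светско-образование"
encoded = percent_encode_path(url)
print(encoded)

# Decoding the result gives back the original URL.
print(unquote(encoded) == url)  # True
```

The result is pure ASCII, so it no longer depends on how hyperref and inputenc treat Cyrillic input.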
\usepackage[unicode]{hyperref}
worked for me (since at least June 2010) using the TeX Live distribution
(not sure whether the distribution is relevant).

xemacs: dotemacs config so that one can paste without getting "funny" chars

When I copy text from a website in a browser and paste it into an XEmacs (21.4) buffer, tildes, quotes, etc. are not copied correctly.
Example: he’s a dummy -> he\222s a dummy.
Can you copy and paste it without problems? If so, please help: how do I configure my .emacs to fix this? Thanks.
Put this in your .emacs:
(set-clipboard-coding-system 'utf-16le-dos)
That should do it. Don't forget to hit C-x C-e on that statement, or restart XEmacs.
This isn’t a clipboard or cygwin problem. If you save a UTF-8 text file with curly quotes in notepad and open it in XEmacs 21.4, you’ll get junk. According to the XEmacs reference documentation, Unicode is not supported before version 21.5.6. Maybe try a later version?
You're attempting to copy and paste smart quotes into XEmacs. In this case, '\222' is the octal escape for the byte 0x92, which is how the Windows-1252 code page encodes the character RIGHT SINGLE QUOTATION MARK (U+2019).
XEmacs uses UTF-8 internally, so you'll have to configure the copy+paste to convert from Windows-1252 to UTF-8. I don't know how to do that.
Simplest thing to do is write a quick function that translates those characters using replace-string.
You could also have xemacs set to accept that code page directly.
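To make the \222 explanation above concrete, here is a minimal Python sketch of the byte-level conversion from Windows-1252 to UTF-8. It is only an illustration of what the conversion has to do, not an XEmacs configuration, and the variable names are made up:

```python
# The text as XEmacs showed it: \222 is an octal escape for a single byte.
pasted = b"he\222s a dummy"
print(hex(pasted[2]))  # 0x92

# Interpreting the bytes as Windows-1252 yields the intended curly quote.
text = pasted.decode("cp1252")
print(f"U+{ord(text[2]):04X}")  # U+2019

# Re-encoding as UTF-8 stores that one character as three bytes.
utf8 = text.encode("utf-8")
print(utf8[2:5])  # b'\xe2\x80\x99'
```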
Switch to emacs, it works like a champ (GNU Emacs 23.0.91.1 (i386-mingw-nt6.0.6002) from Emacsw32 here). This may be the Emacsw32 patches in action.