Problem with LaTeX hyperref - encoding

I have a URL with Cyrillic characters:
http://www.pravoslavie.bg/Възпитание/Духовно-и-светско-образование
When I compile the document, I get the following as the URL:
http://www.pravoslavie.bg/%5CT2A%5CCYRV%20%5CT2A%5Ccyrhrdsn%20%5CT2A%5Ccyrz%20%5CT2A%5Ccyrp%20%5CT2A%5Ccyri%20%5CT2A%5Ccyrt%20%5CT2A%5Ccyra%20%5CT2A%5Ccyrn%20%5CT2A%5Ccyri%20%5CT2A%5Ccyre%20/%5CT2A%5CCYRD%20%5CT2A%5Ccyru%20%5CT2A%5Ccyrh%20%5CT2A%5Ccyro%20%5CT2A%5Ccyrv%20%5CT2A%5Ccyrn%20%5CT2A%5Ccyro%20-%5CT2A%5Ccyri%20-%5CT2A%5Ccyrs%20%5CT2A%5Ccyrv%20%5CT2A%5Ccyre%20%5CT2A%5Ccyrt%20%5CT2A%5Ccyrs%20%5CT2A%5Ccyrk%20%5CT2A%5Ccyro%20-%5CT2A%5Ccyro%20%5CT2A%5Ccyrb%20%5CT2A%5Ccyrr%20%5CT2A%5Ccyra%20%5CT2A%5Ccyrz%20%5CT2A%5Ccyro%20%5CT2A%5Ccyrv%20%5CT2A%5Ccyra%20%5CT2A%5Ccyrn%20%5CT2A%5Ccyri%20%5CT2A%5Ccyre
and that is not the same. Can I set the encoding to utf8 for hyperref? Or how else can I solve the problem?

If you're happy not to use the \url command (i.e., you'll need to break lines manually), you can do the following in regular LaTeX:
\documentclass{article}
\usepackage[T2A]{fontenc}
\usepackage[utf8]{inputenc}
\begin{document}
\texttt{http://www.pravoslavie.bg/Възпитание/Духовно-и-светско-образование}
\end{document}
If you need to get the hyperlinks working, my only suggestion for now is to use either XeTeX or LuaTeX, which support proper Unicode input/output. Something like the following produces at least correct-looking output in XeTeX, although the hyperlink itself is broken for some reason :(
\documentclass{article}
\usepackage{fontspec,hyperref}
\setmonofont{Arial Unicode MS}
\begin{document}
\url{http://www.pravoslavie.bg/Възпитание/Духовно-и-светско-образование}
\end{document}

I had a similar problem with the pdftitle field.
Splitting the package declaration and the setup made it work correctly:
\usepackage{hyperref}
\hypersetup{
pdftitle=Priorità
}

Assuming your LaTeX source is utf8-encoded, try adding \usepackage[utf8]{inputenc} to your document. If utf8 doesn't work, try utf8x.

If it is, as the other posters seem to assume, a charset issue, make sure the character encodings of the BibTeX source and the TeX document match. Cf. Q#1635788: Different encoding of latex and bibtex files. You don't need to make both character encodings utf8; I should think that latin-5 or KOI8-R would both work, but utf8 is the best supported.
If it isn't, then as per my comment above: look at the software chain that you are using (editor, makefiles, etc.) to see if something is doing unwanted URL escaping for you. Then deal ruthlessly with the offending software.

@Mike Weller:
I already have \usepackage[utf8]{inputenc} in my document; with utf8x I get the following as the URL:
http://www.pravoslavie.bg/\begingroup\let\relax\relax\endgroup
[Pleaseinsert\PrerenderUnicode{В}intopreamble]\begingroup\let\relax\relax\endgroup
[Pleaseinsert\PrerenderUnicode{ъ}intopreamble]\begingroup\let\relax\relax\endgroup
[Pleaseinsert\PrerenderUnicode{з}intopreamble]...
(and so on, with one [Pleaseinsert\PrerenderUnicode{...}intopreamble] fragment in place of each remaining Cyrillic character of the URL)
Edit: the problem is solved. I used URL encoding (percent-encoding) to convert the Cyrillic characters :)
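For reference, a minimal sketch of the percent-encoded version (each %XX pair below is one UTF-8 byte of the corresponding Cyrillic letter; I derived the encoding by hand, so double-check it):
\documentclass{article}
\usepackage{hyperref}
\begin{document}
% В (U+0412) encodes as %D0%92, ъ (U+044A) as %D1%8A, and so on.
\url{http://www.pravoslavie.bg/%D0%92%D1%8A%D0%B7%D0%BF%D0%B8%D1%82%D0%B0%D0%BD%D0%B8%D0%B5/%D0%94%D1%83%D1%85%D0%BE%D0%B2%D0%BD%D0%BE-%D0%B8-%D1%81%D0%B2%D0%B5%D1%82%D1%81%D0%BA%D0%BE-%D0%BE%D0%B1%D1%80%D0%B0%D0%B7%D0%BE%D0%B2%D0%B0%D0%BD%D0%B8%D0%B5}
\end{document}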

\usepackage[unicode]{hyperref}
worked for me (since at least June 2010) using the TeX Live distribution
(not sure if that is relevant).
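A minimal sketch of that setup, assuming a UTF-8-encoded source compiled with pdflatex (how well the Cyrillic URL survives may depend on the hyperref version):
\documentclass{article}
\usepackage[T2A]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[unicode]{hyperref}
\begin{document}
\url{http://www.pravoslavie.bg/Възпитание/Духовно-и-светско-образование}
\end{document}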

Related

Difference between code and verbatim in Org-mode?

What is the intended difference between ~code~ and =verbatim= markup in Org-mode? Exporting to HTML in both cases yields <code> tags.
Same for LaTeX...
Though, as they are fontified differently in your buffer, you can use them for different semantics.
Personally, I use "code" for var/func names, commands to be typed, etc; and "verbatim" for paths or file names.
I would have loved to have the same number of markups as there are in Texinfo, but that's not the case...
In Org 8.0 (ox-* exporters) there are a few differences.
In LaTeX
Code comes out as \verb{sep}content{sep}, where {sep} is an appropriate delimiter chosen automatically.
Verbatim comes out as \texttt{content} with certain characters escaped/protected.
In HTML and ODT
Code and Verbatim are treated identically
In Texinfo
The behaviour is the same as in LaTeX.
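To illustrate, a small Org snippet and the LaTeX it should export to under the rules above (a sketch; the delimiter ox-latex picks for \verb can vary):
Org source:
Run ~make install~ and check =/usr/local/bin=.
LaTeX export:
Run \verb|make install| and check \texttt{/usr/local/bin}.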

Emacs c-mode can't recognize utf-8?

I need to read a C++ header file which has some Chinese in it and was encoded using utf-8.
Emacs should recognize this encoding, but in c-mode the Chinese characters show up as blank squares.
When I changed it to text-mode, it worked.
I also tested python-mode, lisp-mode, etc.; all work except c-mode, c++-mode, and java-mode. It seems there is something wrong with cc-mode, or with the cc-vars?
Please help me if you know how to fix this weird problem.
That looks more like a missing font (rather than encoding) issue; i.e., your system lacks a properly configured Chinese italic font.
Actually, it is arguably a bug in Emacs: it should fall back to some other font (non-italic, if needed) rather than display blank squares. We have fixed a few such problems over the years, so try the latest Emacs 24 pretest to see if the bug is already fixed there, and otherwise M-x report-emacs-bug.
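As a stopgap, you can tell Emacs explicitly which font to use for Han-script characters; a hypothetical sketch, assuming a CJK font such as WenQuanYi Micro Hei is installed on your system:
;; Map all Han-script characters to an installed CJK font.
;; "WenQuanYi Micro Hei" is an assumption; substitute any Chinese
;; font your system actually provides.
(set-fontset-font t 'han (font-spec :family "WenQuanYi Micro Hei"))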

How can I get the charset of a string/buffer?

I need an elisp function that guesses the charset of some html, and since Emacs already does that when opening a file, I wonder if I can reuse it somehow, perhaps by writing the string in a temporary buffer, setting the correct charset, and getting it. Are there such functions?
Thanks!
See detect-coding-string.
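A minimal sketch of how to use it (the wrapper name my-guess-charset is made up for illustration; detection works best on raw, still-undecoded bytes):
(defun my-guess-charset (html-string)
  "Return Emacs's best guess at the coding system of HTML-STRING.
The non-nil second argument makes `detect-coding-string' return the
single most plausible coding system instead of a list of candidates."
  (detect-coding-string html-string t))

;; For pure ASCII input the guess is simply `undecided'.
(my-guess-charset "<p>plain ascii</p>")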
I don't think that Emacs has something built-in to guess a character encoding, but it can read character-encoding hints in files, such as -*- coding: utf-8 -*- cookies. You can take a look at this external library, though. I guess that you're using some web browser for Emacs like W3M, and it probably has something to deal with character encodings based on the HTTP meta-information it receives. This article might also be of some help.

Using Unicode in fancyvrb’s VerbatimOut

Problem
VerbatimOut from the “fancyvrb” package doesn’t play nicely with UTF-8 characters.
Minimal working example:
\documentclass{minimal}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{fancyvrb}
\begin{document}
\begin{VerbatimOut}{\jobname.test}
é
\end{VerbatimOut}
\input{\jobname.test}
\end{document}
Error message
When compiled using pdflatex (with the file above saved as mini.tex), this gives the error
File ended while scanning use of \UTFviii#three#octets.
A different error occurs when the sole occurrence of é above is replaced by something else, e.g. é */:
Package inputenc Error: Unicode char \u8:### not set up for use with LaTeX.
– indicating that in this case LaTeX succeeds in reading a multi-byte UTF-8 character, but doesn't know what to do with it (i.e. it sees the wrong character).
In fact, when I open the produced .test file manually, it contains the character é, but in Latin-1 encoding!
Proof: when I open the files in a hex editor, I get the following:
Original file: C3 A9 (corresponds to LATIN SMALL LETTER E WITH ACUTE in UTF-8)
Written file: E9 (corresponds to é in Latin-1)
Question
How to set VerbatimOut up correctly?
filecontents* (from “filecontents”) shows that it can work. Unfortunately, I don’t understand either code so I cannot fix fancyvrb’s code by replicating the logic from filecontents manually.
I also cannot use filecontents* instead of VerbatimOut because the former doesn’t work within a \newenvironment, while the latter does.
(Oh, by the way: vanilla Verbatim instead of VerbatimOut also works as expected. The error seems to occur when writing the file, not when reading the verbatim input)
Is your end goal to write symbols and accents in Verbatim? Because you can do that like this:
\documentclass{article}
\usepackage{fancyvrb}
\begin{document}
\begin{Verbatim}[commandchars=\\\{\}]
\'{e} \~{e} \`{e} \^{e}
\end{Verbatim}
\end{document}
The commandchars option allows the \ { } characters to work as they normally would.
Source: http://ctan.mirror.garr.it/mirrors/CTAN/macros/latex/contrib/fancyvrb/fancyvrb.pdf
This is still unfixed? I'll take another look. What exactly do you want: your package to use VerbatimOut, or for it not to interfere with it?
Tests
TeX Live 2009's xelatex compiles fine. With pdflatex, version
This is pdfTeX, Version 3.1415926-1.40.10 (TeX Live 2009)
I get an error message that is rather more useful than the one you got:
! Argument of \UTFviii#three#octets has an extra }.
\par
l.8 é
? i \makeatletter\show\UTFviii#three#octets
! Undefined control sequence.
\GenericError ...
#4 \errhelp \#err# ...
l.8 é
If I were to make a wild guess, I'd say that inputenc with pdftex uses the pdftex primitives to do some hairy storing and restoring of character tables, and some table somewhere has a rare mistake in it.
Possibly related
I saw a post by Vladimir Volovich in the pdf-tex mailing list archives, all the way back from 2003, that discusses a conflict between inputenc & fancyvrb, and posts a patch to "solve the problem". Who knows, maybe he faced the same problem? It might be worth emailing him.
XeTeX has much better Unicode support. The following run through xelatex produces “é” both in \jobname.test and the output PDF.
\documentclass{minimal}
\usepackage{fontspec}
\tracingonline=1
\usepackage{fancyvrb}
\begin{document}
\begin{VerbatimOut}{\jobname.test}
é
\end{VerbatimOut}
\input{\jobname.test}
\end{document}
fontspec loads the Latin Modern fonts, which have Unicode support. The standard TeX Computer Modern fonts don’t have the right tables for Unicode support.
If you use a character that does not have a glyph in the current font, by default XeTeX writes a blank space to the PDF and prints a warning in the log but not on the terminal. \tracingonline=1 prints the warning to the terminal.
On http://wiki.portal.chalmers.se/agda/pmwiki.php?n=Main.LiterateAgda, they suggest that you should use
\usepackage{ucs}
\usepackage[utf8x]{inputenc}
in the preamble. I successfully used this to insert Unicode into a verbatim environment.
\documentclass{article}
\usepackage{fancyvrb}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\newenvironment{MonVerbatim}{%
% Give bytes 128--255 catcode 11 (letter) so the 8-bit engine
% copies UTF-8 bytes through verbatim unchanged.
\count0=128\relax %
\loop
\catcode\count0=11\relax
\advance\count0 by 1\relax
\ifnum\count0<256
\repeat
\VerbatimOut[commandchars=\\\{\}]{VerbatimText.tex}%
}{\endVerbatimOut}
\newcommand\test{A command producing accented characters éà}
\begin{document}
\begin{MonVerbatim}
A little bit text in verbatim mode éà_].
\test
\end{MonVerbatim}
Followed by some accented character éà.
\end{document}
This code is working for me with TeX Live 2018 and pdflatex. You should
probably avoid changing catcodes if you are using a Unicode TeX engine (lualatex or xelatex).
You can use the iftex package to check which TeX engine is in use.
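A sketch of that check using iftex (the branch bodies here are illustrative comments, not code from the answer above):
\usepackage{iftex}
\ifPDFTeX
  % 8-bit engine: the catcode loop above is needed
\else
  % LuaTeX or XeTeX: natively Unicode, skip the catcode changes
\fi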

Japanese characters in a latex \section{} cause an error

I am working on getting Japanese documents created with LaTeX. I have installed the latest version of texlive-2008, which includes CJK.
In my document I have the following:
\documentclass{class}
\usepackage{CJK}
\begin{document}
\begin{CJK*}{UTF8}{min}
\title{[Japanese Characters here 1]}
\maketitle
\section{[Japanese Characters here 2]}
[Japanese Characters here 3]
\end{CJK*}
\end{document}
In the above code there are 3 locations Japanese characters are used.
1 and 3 work fine, whereas 2, which contains Japanese characters in a \section{}, fails with the following error:
! Argument of \@sect has an extra }.
After some research it turns out this error manifests when you've put a fragile command inside a moving argument (moving because a section title can be moved, for example to a table of contents).
Does anyone know how to get this to work, and why LaTeX thinks Japanese characters are "fragile"?
Sorry to post this as an answer rather than a comment to your answer; I don't have enough rep yet to comment. (EDIT: Now I have enough rep to comment, but I'm not sorry anymore. Thanks Will.)
Your solution of replacing
\section{[Japanese Text]}
with
\section{\texorpdfstring{[Japanese Text]}{}}
suggests that you're using the hyperref package. When you use the hyperref package, any sort of not-totally-boring text (e.g. math) within \section causes a problem because \section is having trouble generating pdf bookmarks. \texorpdfstring allows you to specify how you want the section title to appear in the pdf bookmark. For example, I might write
\section{Calculation of \texorpdfstring{$H_2(\mathcal{X})$}{H\_2(X)}}
if I want the section title to be "Calculation of $H_2(\mathcal{X})$" but I want the pdf bookmark to be "Calculation of H_2(X)".
You should probably use xetex/xelatex, as it was created to support Unicode. The change is sometimes not easy for already existing documents, though. (xelatex should be included in texlive; it is just a different executable to call -- this is how it is done in Debian.)
I have managed to get this working now!
Using LaTeX and CJK as before.
\section{[Japanese Text]}
was replaced with
\section{\texorpdfstring{[Japanese Text]}{}}
Now the contents pages and section titles work and update fine.
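Putting it together, a minimal sketch of the working document (assuming hyperref is in use, as the \texorpdfstring fix implies; [Japanese Text] stands in for the actual characters):
\documentclass{article}
\usepackage{CJK}
\usepackage{hyperref}
\begin{document}
\begin{CJK*}{UTF8}{min}
\title{[Japanese Text]}
\maketitle
% The second argument is the (ASCII) PDF bookmark text; here it is left empty.
\section{\texorpdfstring{[Japanese Text]}{}}
[Japanese Text]
\end{CJK*}
\end{document}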