Rendering Swedish Å Ä and Ö with groff, mupdf

I am learning to use groff as an alternative to LaTeX and am struggling with rendering the Å, Ä and Ö characters.
In an English only setting, I have been running groff with:
$ groff -ms example.ms -T pdf > example.pdf
and then viewing my pdf with :
$ mupdf example.pdf
I see in the man page groff_tmac(5) that there is support for Swedish, but when I try to adapt the French example from that man page to Swedish I do not get the results I want: I am still getting two strange characters in place of every Å, Ä, or Ö.
I am trying the command
$ groff -ms -msv example.ms -T pdf > example.pdf
I have scoured the web in both Swedish and English (the Swedish results are even worse; it seems like no one is using groff) and have found zero examples.
I don't need the full answer, just a pointer to where someone smarter than me would start looking. People in my circle just suggest LaTeX, but I am determined to use the much lighter groff.
I am expecting to have Å, Ä and Ö print nicely, so I can do my assignments in groff instead of LaTeX.
Thank You!

When I have accented or other non-ASCII characters in my input text, I run the input file through preconv, which solves almost all my problems.
preconv inputfile | groff > outputfile
If you dislike typing the separate preconv, you can also run groff with -k.
If I use the following inputfile:
test
.br
I am expecting to have Å Ä and Ö print nicely,
and run it through preconv, I get
.lf 1 inputfile
test
.br
I am expecting to have \[u00C5] \[u00C4] and \[u00D6] print nicely,
Of course, you can put in the \[u00C4] etc. by hand, but that makes writing in groff a lot less lightweight. Either way, the PDF that groff produces then shows Å, Ä and Ö correctly.
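Applied to the command from the question, that means either piping the source through preconv first or letting groff run preconv itself via -k (both lines assume the file is called example.ms, as in the question):
$ preconv example.ms | groff -ms -T pdf > example.pdf
$ groff -k -ms example.ms -T pdf > example.pdf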

Related

Encoding from ANSI when having non-Latin letters

I have a very old program (not a server or anything on the internet) that I think uses the ANSI (Windows-1252) encoding.
The problem is that some inputs to this program are written in Arabic.
However, when I try to read the result, the Arabic words come out as very weird characters. For example, the input "نور" is converted to "äæÑ".
The program output should contain a combination of English words and Arabic words.
E.g. it outputs "Name äæÑ" while the correct output should be something like "Name نور".
In general, the English words are correct and readable with both UTF-8 and ANSI. But the Arabic words are read for example as "���" with UTF-8 and as "äæÑ" with ANSI.
I understand that this is because ANSI doesn't have support for non-Latin letters.
But what should I do now? How can I convert the output back to Arabic?
Note: I know the exact input and the exact output that this program should produce.
Note2: I don't have the source code of this program. I just want to convert the output file of this program to have the correct words or encoding.
I solved this problem by typing the following in the terminal:
iconv -f WINDOWS-1256 -t utf8 < my_File.ged > result.ged
I tried to write Java code that does a similar thing, but it didn't give me the result I wanted.
I also tried the same terminal command with WINDOWS-1252 instead of WINDOWS-1256, but that didn't work either. So I guess it is good to try different encodings until one works.
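If you end up guessing like this, one way is to loop over a few candidate code pages and compare the results by eye. A rough sketch (the names must be encodings that iconv -l lists on your system; ISO-8859-6 is just one more Arabic candidate added here as a guess):
for enc in WINDOWS-1252 WINDOWS-1256 ISO-8859-6; do
    iconv -f "$enc" -t UTF-8 < my_File.ged > "result_$enc.ged" || echo "conversion from $enc failed"
done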

How to remove accents and keep Chinese characters using a command?

I’m trying to remove the accented characters (CAFÉ -> CAFE) while keeping all the Chinese characters, using a command. Currently, I’m using iconv to remove the accented characters, but it turns out that all the Chinese characters end up as “?????”. I can’t figure out how to keep the Chinese characters in an ASCII-encoded file at the same time.
How can I do so?
iconv -f utf-8 -t ascii//TRANSLIT//IGNORE -o converted.bin test.bin
There is no way to keep Chinese characters in a file whose encoding is ASCII; this encoding only encodes the code points between NUL (0x00) and 0x7F (DEL), which basically means the basic control characters plus basic English alphabetics and punctuation. (Look at the ASCII chart for an enumeration.)
What you appear to be asking is how to remove accents from European alphabetics while keeping any Chinese characters intact in a file whose encoding is UTF-8. I believe there is no straightforward way to do this with iconv, but it should be comfortably easy to come up with a one-liner in a language with decent Unicode support, like perhaps Perl.
bash$ python -c 'print("\u4effCaf\u00e9\u9f00")' >unizh.txt
bash$ cat unizh.txt
仿Café鼀
bash$ perl -CSD -MUnicode::Normalize -pe '$_ = NFKD($_); s/\p{M}//g' unizh.txt
仿Cafe鼀
Maybe add the -i option to modify the file in-place; this simple demo just writes out the result to standard output.
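In that case the one-liner would look roughly like this (an untested sketch of the same command with -i added, which rewrites unizh.txt directly instead of printing to standard output):
perl -CSD -i -MUnicode::Normalize -pe '$_ = NFKD($_); s/\p{M}//g' unizh.txt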
This has the potentially undesired side effect of normalizing each character to its NFKD form.
Code inspired by “Remove accents from accented characters”; the Chinese characters to test with were gleaned from “What's the complete range for Chinese characters in Unicode?” (the ones on the boundary of the range are not particularly good test cases, so I just guessed a bit).
The iconv tool is meant to convert the way characters are encoded (i.e. saved to a file as bytes). By converting to ASCII (a very limited character set that contains the numbers, some punctuation, and the basic alphabet in upper and lower case), you can save only the characters that can reasonably be matched to that set. So an accented letter like É gets converted to E because that's a reasonably similar ASCII character, but a Chinese character like 公 is so far away from the ASCII character set that only question marks are possible.
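To check what your own iconv does with a small sample before converting a whole file, you can pipe a test string through it (公司 here is just an arbitrary pair of Chinese characters, and the exact transliterations depend on the iconv implementation and your locale, so the output may differ between systems):
$ echo 'CAFÉ 公司' | iconv -f utf-8 -t ascii//TRANSLIT//IGNORE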
The answer by tripleee is probably what you need. But if the conversion to NFKD form is a problem for you, an alternative is using a direct list of characters you want to replace:
sed 'y/áàäÁÀÄéèëÉÈË/aaaAAAeeeEEE/' <test.bin >converted.bin
where you need to list the original characters and their replacements in the same order. Obviously it is more work, so do this only if you need full control over what changes you make.

text file encoding shows correctly on terminal, not in editor

I got a problem with the encoding of a text file.
If I open it with *nix terminal tools like less, cat or more, accented characters are shown correctly.
But if I open it with any editor (e.g. vim), accented characters are scrambled.
My terminal locale is set to UTF-8 and my editor (vim) has its default encoding set to UTF-8. If I open textfile.txt with vim I see scrambled accents whether I set vim's encoding to UTF-8 or to ISO-8859-1.
The output of the file utility is:
$ file textfile.txt
textfile.txt: ISO-8859 English text, with very long lines
I already tried the following with iconv:
iconv -f iso-8859-1 -t utf-8 textfile.txt > textfile.utf8.txt
I get this
$ file textfile.utf8.txt
textfile.utf8.txt: UTF-8 Unicode English text, with very long lines
Opening it with vim still shows scrambled accents, and now the accents are scrambled even when I use cat or more.
My goal is to get this file into UTF-8 format and, obviously, to have the accented characters displayed correctly.
[The brute-force way to do this would be to copy every single output screen of more and paste it into an editor. There must be a smarter way.]
Thanks for any help.
It turned out that the file contained characters from two different encodings; that is why it looked scrambled in every case, and why iconv didn't manage to convert the file successfully. Thanks everyone anyway.
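If you run into a file like that again, one way to locate the offending lines before converting is to ask grep for lines that are not valid text in the current (UTF-8) locale. With GNU grep, -a treats the file as text, -x matches whole lines, and -v inverts the match, so the following prints (with line numbers) every line containing byte sequences that '.*' cannot match as valid UTF-8:
$ grep -naxv '.*' textfile.txt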

Using Unicode in fancyvrb’s VerbatimOut

Problem
VerbatimOut from the “fancyvrb” package doesn’t play nicely with UTF-8 characters.
Minimal working example:
\documentclass{minimal}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{fancyvrb}
\begin{document}
\begin{VerbatimOut}{\jobname.test}
é
\end{VerbatimOut}
\input{\jobname.test}
\end{document}
Error message
When compiled using pdflatex mini, this gives the error
File ended while scanning use of \UTFviii#three#octets.
A different error occurs when the sole occurrence of é above is replaced by something else, e.g. é */:
Package inputenc Error: Unicode char \u8:### not set up for use with LaTeX.
– indicating that in this case, LaTeX succeeds in reading a multi-byte UTF-8 character but doesn't know what to do with it (i.e. it ends up as the wrong character).
In fact, when I open the produced .test file manually, it contains the character é, but in Latin-1 encoding!
Proof: when I open the files in a hex editor, I get the following:
Original file: C3 A9 (corresponds to LATIN SMALL LETTER E WITH ACUTE in UTF-8)
Written file: E9 (corresponds to é in Latin-1)
Question
How to set VerbatimOut up correctly?
The filecontents* environment (from the “filecontents” package) shows that it can work. Unfortunately, I don’t understand either package's code, so I cannot fix fancyvrb by replicating the logic from filecontents manually.
I also cannot use filecontents* instead of VerbatimOut because the former doesn’t work within a \newenvironment, while the latter does.
(Oh, by the way: vanilla Verbatim instead of VerbatimOut also works as expected. The error seems to occur when writing the file, not when reading the verbatim input)
Is your end goal to write symbols and accents in Verbatim? Because you can do that like this:
\documentclass{article}
\usepackage{fancyvrb}
\begin{document}
\begin{Verbatim}[commandchars=\\\{\}]
\'{e} \~{e} \`{e} \^{e}
\end{Verbatim}
\end{document}
The commandchars option allows the \ { } characters to work as they normally would.
Source: http://ctan.mirror.garr.it/mirrors/CTAN/macros/latex/contrib/fancyvrb/fancyvrb.pdf
This is still unfixed? I'll take another look. What exactly do you want: your package to use VerbatimOut, or for it not to interfere with it?
Tests
TeX Live 2009's xelatex compiles fine. With pdflatex, version
This is pdfTeX, Version 3.1415926-1.40.10 (TeX Live 2009)
I get an error message that is rather more useful than the one you got:
! Argument of \UTFviii#three#octets has an extra }.
\par
l.8 é
? i \makeatletter\show\UTFviii#three#octets
! Undefined control sequence.
\GenericError ...
#4 \errhelp \#err# ...
l.8 é
If I were to make a wild guess, I'd say that inputenc with pdftex uses the pdftex primitives to do some hairy storing and restoring of character tables, and some table somewhere has a mistake in it.
Possibly related
I saw a post by Vladimir Volovich in the pdf-tex mailing list archives, all the way back from 2003, that discusses a conflict between inputenc & fancyvrb, and posts a patch to "solve the problem". Who knows, maybe he faced the same problem? It might be worth emailing him.
XeTeX has much better Unicode support. The following run through xelatex produces “é” both in \jobname.test and the output PDF.
\documentclass{minimal}
\usepackage{fontspec}
\tracingonline=1
\usepackage{fancyvrb}
\begin{document}
\begin{VerbatimOut}{\jobname.test}
é
\end{VerbatimOut}
\input{\jobname.test}
\end{document}
fontspec loads the Latin Modern fonts, which have Unicode support. The standard TeX Computer Modern fonts don’t have the right tables for Unicode support.
If you use a character that does not have a glyph in the current font, by default XeTeX writes a blank space to the PDF and prints a warning in the log but not on the terminal. \tracingonline=1 prints the warning to the terminal.
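For completeness, assuming the example above is saved as example.tex (the file name is just a placeholder), the compile step is simply the line below, and \jobname.test then comes out as example.test:
$ xelatex example.tex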
On http://wiki.portal.chalmers.se/agda/pmwiki.php?n=Main.LiterateAgda, they suggest that you should use
\usepackage{ucs}
\usepackage[utf8x]{inputenc}
in the preamble. I successfully used this to insert Unicode into a verbatim environment.
\documentclass{article}
\usepackage{fancyvrb}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\newenvironment{MonVerbatim}{%
% Give every byte in the range 128-255 the catcode of a letter (11),
% so that the bytes of multi-byte UTF-8 characters are written out
% literally by VerbatimOut instead of being treated as inputenc's
% active characters.
\count0=128\relax %
\loop
\catcode\count0=11\relax
\advance\count0 by 1\relax
\ifnum\count0<256
\repeat
\VerbatimOut[commandchars=\\\{\}]{VerbatimText.tex}%
\newcommand\test{A command producing accented characters éà}
\begin{document}
\begin{MonVerbatim}
A little bit text in verbatim mode éà_].
\test
\end{MonVerbatim}
Followed by some accented character éà.
\end{document}
This code works for me with TeX Live 2018 and pdflatex. You should
probably avoid changing catcodes like this if you are using an engine that reads its input as Unicode rather than 8-bit bytes (lualatex or xelatex).
You can use the "iftex" package to check which TeX engine is in use.

Twitter sharing problems with umlauts ä & ö

I'm having trouble sharing messages containing the Scandinavian ä and ö to Twitter through a share button on my site. If I use percent-encoded values above %7F, I just bump into an "Invalid Unicode value in one or more parameters" error.
An example: http://twitter.com/home/?status=%40user+blah%26%E4
I've tried a bunch of different encodings, but none seem to work with ä, ö etc.
Anyone found a solution for this?
Edit:
Part of this problem is related to what address you link your share-tweet. Links to http://twitter.com/home/?status=%40user+blah%26%E4%C3%A4
and
http://www.twitter.com/home/?status=%40user+blah%26%E4%C3%A4
yield very different results.
UTF-8 represents code points above U+007F using more than one byte. So when you want ä (U+00E4), the UTF-8 representation is the two bytes C3 A4 and thus the percent-encoding is %C3%A4. A handy website that will help you with these conversions is https://www.url-encode-decode.com
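If you need to build such links programmatically, any URL-encoding routine that operates on the UTF-8 bytes will do. For example, with the status text from the question (a one-liner using Python's urllib; quote_plus also turns the space into +, matching the links above):
$ python3 -c 'from urllib.parse import quote_plus; print(quote_plus("@user blah&ä"))'
%40user+blah%26%C3%A4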