Displaying Unicode characters in PDF produced by Apache FOP

I have an XML file containing a list of names, some of which use characters/glyphs which are not represented in the default PDF font (Helvetica/Arial):
<name>Paul</name>
<name>你好</name>
I'm processing this file using XSLT and Apache FOP to produce a PDF file which lists the names. Currently I'm getting the following warning on the console and the Chinese characters are replaced by ## in the PDF:
Jan 30, 2016 11:30:56 AM org.apache.fop.events.LoggingEventListener processEvent WARNING: Glyph "你" (0x4f60) not available in font "Helvetica".
Jan 30, 2016 11:30:56 AM org.apache.fop.events.LoggingEventListener processEvent WARNING: Glyph "好" (0x597d) not available in font "Helvetica".
I've looked at the documentation and it seems to suggest that the options available are:
Use an OpenType font - except this isn't supported by FOP.
Switch to a different font just for the non-ASCII parts of text.
I don't want to use different fonts for each language, because there will be PDFs that have a mixture of Chinese and English, and as far as I know there's no way to work out which is which in XSLT/XSL-FO.
Is it possible to embed a single font to cover all situations? At the moment I just need English and Chinese, but I'll probably need to extend that in future.
I'm using Apache FOP 2.1 and Java 1.7.0_91 on Ubuntu. I've seen some earlier questions on a similar topic but most seem to be using a much older version of Apache FOP (e.g. 0.95 or 1.1) and I don't know if anything has been changed/improved in the meantime.
Edit: My question is different (I think) to the suggested duplicate. I've switched to using the Ubuntu Font Family using the following code in my FOP config:
<font kerning="yes" embed-url="../fonts/ubuntu/Ubuntu-R.ttf" embedding-mode="full">
<font-triplet name="Ubuntu" style="normal" weight="normal"/>
</font>
<font kerning="yes" embed-url="../fonts/ubuntu/Ubuntu-B.ttf" embedding-mode="subset">
<font-triplet name="Ubuntu" style="normal" weight="bold"/>
</font>
However, I'm still getting the 'glyph not available' warning:
Jan 31, 2016 10:22:59 AM org.apache.fop.events.LoggingEventListener processEvent
WARNING: Glyph "你" (0x4f60) not available in font "Ubuntu".
Jan 31, 2016 10:22:59 AM org.apache.fop.events.LoggingEventListener processEvent
WARNING: Glyph "好" (0x597d) not available in font "Ubuntu".
I know Ubuntu Regular has these two glyphs because it's my standard system font.
Edit 2: If I use GNU Unifont, the glyphs display correctly. However, it seems to be a font aimed more at console use than at use in documents.

If you cannot find a suitable font supporting both Chinese and English (or you found one but don't much like its Latin glyphs), remember that font-family can contain a comma-separated list of font names, to be tried in that order.
So, you can list your desired font for English text first, and then the one for the Chinese text:
<!-- this has # instead of the missing Chinese glyphs -->
<fo:block font-family="Helvetica" space-after="1em" background-color="#AAFFFF">
Paul 你好</fo:block>
<!-- this has all the glyphs, but I don't like its latin glyphs -->
<fo:block font-family="SimSun" space-after="1em" background-color="#FFAAFF">
Paul 你好</fo:block>
<!-- the best of both worlds! -->
<fo:block font-family="Helvetica, SimSun" space-after="1em" background-color="#FFFFAA">
Paul 你好</fo:block>
The output looks like this: the first block shows # in place of the missing Chinese glyphs, the second shows all the glyphs (but using SimSun's Latin glyphs too), and the third shows the Latin text in Helvetica and the Chinese text in SimSun.

The answer to my question is to either use GNU Unifont, which:
Supports Chinese and English.
Is available under a free licence.
'Just works' if you add it to the FOP config file.
Or alternatively produce separate templates for English and Chinese PDFs and use different fonts for each.
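For reference, the Unifont registration in the FOP configuration can look like the sketch below; the file path and the "Unifont" triplet name are placeholders rather than values taken from the question, so adjust them to your setup:
<!-- hypothetical path to the GNU Unifont TrueType file -->
<font kerning="yes" embed-url="../fonts/unifont/unifont.ttf" embedding-mode="subset">
<font-triplet name="Unifont" style="normal" weight="normal"/>
</font>
The FO then selects it with font-family="Unifont" (or whatever triplet name you registered) on the relevant fo:page-sequence or fo:block.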

Related

SVG to PDF conversion doesn't render Cyrillic characters correctly

I'm trying to convert an SVG that has Cyrillic characters to a PDF document, but those characters are showing as #. I have gone through a few threads reporting a similar issue but could not find a solution.
The PDF is generated, but with the warnings below:
About to transcoder source of type: org.apache.batik.apps.rasterizer.SVGConverterFileSource
Aug 10, 2021 5:36:56 PM org.apache.fop.fonts.Typeface warnMissingGlyph
WARNING: Glyph 12362 (0x304a, ohiragana) not available in font Helvetica
Aug 10, 2021 5:36:56 PM org.apache.fop.fonts.Typeface warnMissingGlyph
WARNING: Glyph 12397 (0x306d, nehiragana) not available in font Helvetica
In the input SVG, no font-family is specified.
As mentioned in a few threads, I need to define a font that has glyphs for the required characters. The question is how to do that. Can you please provide a link that walks through the steps for defining fonts?
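The general approach is the one described in the Apache FOP answer further down in this thread: register, in a FOP configuration file, a font that actually contains the glyphs, and make the conversion use that configuration. A minimal sketch of the mapping, with a hypothetical DejaVuSans.ttf path standing in for any font that covers the required characters:
<!-- hypothetical font file; use any font covering the required scripts -->
<font kerning="yes" embed-url="/path/to/DejaVuSans.ttf" embedding-mode="subset">
<font-triplet name="DejaVu Sans" style="normal" weight="normal"/>
</font>
The SVG text also needs to reference that family (for example font-family="DejaVu Sans" on the text elements), otherwise the default Helvetica is still used; and how the configuration file is passed depends on how the SVG-to-PDF conversion is invoked (the fop command takes it with the -c option, as shown later in this thread).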

Where are the unicode characters on the disk and what's the mapping process?

There are several Unicode-related questions that have been confusing me for some time.
For the following reasons, I think the Unicode characters exist on disk:
Executing echo "\u6211" in a terminal prints the glyph corresponding to the Unicode code point U+6211.
There's a concept of the UCD (Unicode Character Database), and we can download its latest version. UCD latest
Some newer Unicode characters, like the latest emojis, cannot be displayed on my Mac until I upgrade the macOS version.
So if the Unicode characters do exist on disk, then:
Where are they?
How can I upgrade them?
What's the process of mapping a Unicode code point to a glyph?
If I use a specific font, then what's the process of mapping the code point to a glyph?
If I don't, then what's the process of mapping the code point to a glyph?
I would very much appreciate it if someone could shed light on these problems.
Executing echo "\u6211" in a terminal prints the glyph corresponding to the Unicode code point U+6211.
That's echo -e in bash.
› echo "\u6211"
\u6211
› echo -e "\u6211"
我
Where are they?
In the font files.
Some newer Unicode characters, like the latest emojis, cannot be displayed on my Mac until I upgrade the macOS version.
How can I upgrade them?
Installing/upgrading a suitable font with the emojis should be enough. I don't have macOS, so I cannot verify this.
I use "Noto Color Emoji" version 2.011/20180424, it works fine.
What's the process of mapping a Unicode code point to a glyph?
The application (e.g. a text editor) provides the font rendering subsystem (Quartz? on macOS) with Unicode text and a font name. The font renderer analyses the codepoints of the text and decides whether this is simple text (e.g. Latin, Chinese, stand-alone emojis) or complex text (e.g. Latin with many marks, Thai, Arabic, emojis with zero-width joiners). The renderer finds the corresponding outlines in the font file. If the file does not have the required glyph, the renderer may use a similar font, or use a configured fallback font that yields a poor substitute (white box, black question mark, etc.). The outlines then undergo shaping (to compose complex glyphs) and line breaking. Finally, the font renderer hands the result off to the display system.
Apart from the shaping, very little of this has to do with Unicode or encoding. Font rendering already worked that way before Unicode existed; of course, font files and rendering were much simpler 30 years ago. Encoding only matters when someone wants to load or save text from an application.
Summary: investigate
TrueType/OpenType font editing software, so you can see what's contained in the font files
font renderers; on Linux, look at the libraries Pango and FreeType
Generally speaking, operating system components that use text use the Unicode character set. In particular, font files use the Unicode character set. But not all font files support all Unicode codepoints.
When a codepoint is not supported by one font, the system might fall back to another font that does support it. This is particularly true of web browsers. But ultimately, if the codepoint is not supported, an unfilled rectangle is rendered. (There is no character for that rectangle, because it's not a character. In fact, if you were able to copy and paste it as text, you should get the original character that couldn't be rendered.)
In web development, the web page can either supply or give the location of fonts that should work for the codepoints it uses.
Other programs typically use the operating system's rendering facilities and therefore the fonts available through it. How to install a font in an operating system is not a programming question (unless you are including a font in an installer for your program). For more information on that, you could see if the question fits with the Ask Different (Apple) Stack Exchange site.

fontspec (xetex/luatex) and fontconfig

LuaTeX and XeTeX are passing on a warning from the fontspec package, and I'm not sure whether I should worry. The following minimal code will generate the warning:
\documentclass{report}
\usepackage{fontspec}
\setmainfont[Script=Cyrillic]{Arial}
\begin{document}
\end{document}
The warning is
fontspec warning: "script-not-exist"
Font 'Arial' does not contain script 'Cyrillic'.
Replacing 'Arial' with, for example, 'Charis SIL', results in no warning.
I believe I've traced the problem back to fc-list/ fontconfig. In the output of fc-list for the Arial font, I get the following 'capabilities' line:
capability: "otlayout:arab otlayout:hebr otlayout:latn"(s)
This line does not mention Cyrillic (which would be the code 'cyrl', according to the \newfontscript{Cyrillic}{cyrl} line in fontspec-xetex.sty). Fonts for which I do not get this warning, such as Charis SIL, explicitly mention Cyrillic in their capabilities lines:
capability: "ttable:Silf otlayout:DFLT otlayout:cyrl otlayout:latn"(s)
Unfortunately, documentation of this "capability" line in the output of fc-list is limited to the line
capability String List of layout capabilities in the font
(that's from https://www.freedesktop.org/software/fontconfig/fontconfig-user.html and a couple of other places on the web)
My question is basically, should I worry about this warning message? Arial (and Charis) list 'ru', the ISO 639-1 code for Russian, in the 'lang' field. Just what is this "layout" capability for the Cyrillic script that Charis SIL supports, but Arial apparently does not, and why does it matter?
BTW, this is with the TeX Live 2017 distro.

Apache FOP insert special character [duplicate]

I am maintaining a program which uses Apache FOP for printing PDF documents. There have been a couple of complaints about Chinese characters coming up as "####". I have found an existing thread about this problem and done some research on my side.
http://apache-fop.1065347.n5.nabble.com/Chinese-Fonts-td10789.html
I do have the uming.ttf font files installed on my system. Unlike the person in that thread, I am still getting the "####".
From this point forward, has anyone seen a workaround that would allow printing complex characters in a PDF document using Apache FOP?
Three steps must be taken for Chinese characters to show correctly in a PDF file created with FOP (this is also true for any characters not available in the default font, and more generally whenever you want to use a non-default font).
Let us use this simple FO example to show the warnings produced by FOP when something is wrong:
<?xml version="1.0" encoding="UTF-8"?>
<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">
<fo:layout-master-set>
<fo:simple-page-master master-name="one">
<fo:region-body />
</fo:simple-page-master>
</fo:layout-master-set>
<fo:page-sequence master-reference="one">
<fo:flow flow-name="xsl-region-body">
<!-- a block of chinese text -->
<fo:block>博洛尼亚大学中国学生的毕业论文</fo:block>
</fo:flow>
</fo:page-sequence>
</fo:root>
Processing this input, FOP gives several warnings similar to this one:
org.apache.fop.events.LoggingEventListener processEvent
WARNING: Glyph "?" (0x535a) not available in font "Helvetica".
...
Without any explicit font-family indication in the FO file, FOP defaults to using Helvetica, which is one of the Base-14 fonts (fonts that are available everywhere, so there is no need to embed them).
Each font supports a set of characters, assigning a visible glyph to each of them; when a font does not support a character, the above warning is produced, and the PDF shows "#" instead of the missing glyph.
Step 1: set font-family in the FO file
If the default font doesn't support the characters of our text (or we simply want to use a different font), we must use the font-family property to state the desired one.
The value of font-family is inherited, so if we want to use the same font for the whole document we can set the property on the fo:page-sequence; if we need a special font just for some paragraphs or words, we can set font-family on the relevant fo:block or fo:inline.
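As a sketch, the page-sequence variant would simply move the property up, using the same font family as the example below:
<!-- font-family set once, inherited by the whole page sequence -->
<fo:page-sequence master-reference="one" font-family="SimSun">
...
</fo:page-sequence>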
So, our input becomes (using a font I have as an example):
<?xml version="1.0" encoding="UTF-8"?>
<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">
<fo:layout-master-set>
<fo:simple-page-master master-name="one">
<fo:region-body />
</fo:simple-page-master>
</fo:layout-master-set>
<fo:page-sequence master-reference="one">
<fo:flow flow-name="xsl-region-body">
<!-- a block of chinese text -->
<fo:block font-family="SimSun">博洛尼亚大学中国学生的毕业论文</fo:block>
</fo:flow>
</fo:page-sequence>
</fo:root>
But now we get a new warning, in addition to the old ones!
org.apache.fop.events.LoggingEventListener processEvent
WARNING: Font "SimSun,normal,400" not found. Substituting with "any,normal,400".
org.apache.fop.events.LoggingEventListener processEvent
WARNING: Glyph "?" (0x535a) not available in font "Times-Roman".
...
FOP doesn't know how to map "SimSun" to a font file, so it defaults to a generic Base-14 font (Times-Roman) which does not support our chinese characters, and the PDF still shows "#".
Step 2: configure font mapping in FOP's configuration file
Inside FOP's folder, the file conf/fop.xconf is an example configuration; we can directly edit it or make a copy to start from.
The configuration file is an XML file, and we have to add the font mappings inside /fop/renderers/renderer[@mime = 'application/pdf']/fonts/ (there is a renderer section for each possible output MIME type, so check that you are inserting your mapping in the right one):
<?xml version="1.0"?>
<fop version="1.0">
...
<renderers>
<renderer mime="application/pdf">
...
<fonts>
<!-- specific font mapping -->
<font kerning="yes" embed-url="/Users/furini/Library/Fonts/SimSun.ttf" embedding-mode="subset">
<font-triplet name="SimSun" style="normal" weight="normal"/>
</font>
<!-- "bulk" font mapping -->
<directory>/Users/furini/Library/Fonts</directory>
</fonts>
...
</renderer>
...
</renderers>
</fop>
each font element points to a font file
each font-triplet entry identifies a combination of font-family + font-style (normal, italic, ...) + font-weight (normal, bold, ...) mapped to the font file in the parent font element
using directory elements it is also possible to automatically configure all the font files inside the indicated folders (but this takes some time if the folders contain a lot of fonts)
If we have a complete file set with specific versions of the desired font (normal, italic, bold, light, bold italic, ...) we can map each file to the precise font triplet, thus producing a very sophisticated PDF.
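For instance, a sketch of such a complete mapping for one family; the family name and file paths below are made up for illustration:
<!-- hypothetical family with separate files per style/weight -->
<font kerning="yes" embed-url="/path/to/MyFamily-Regular.ttf" embedding-mode="subset">
<font-triplet name="MyFamily" style="normal" weight="normal"/>
</font>
<font kerning="yes" embed-url="/path/to/MyFamily-Italic.ttf" embedding-mode="subset">
<font-triplet name="MyFamily" style="italic" weight="normal"/>
</font>
<font kerning="yes" embed-url="/path/to/MyFamily-Bold.ttf" embedding-mode="subset">
<font-triplet name="MyFamily" style="normal" weight="bold"/>
</font>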
On the opposite end of the spectrum, we can map all the triplets to the same font file, if that's all we have available: in the output all text will appear the same, even if parts of it were marked as italic or bold in the FO file.
Note that we don't need to register all possible font triplets; if one is missing, FOP will use the font registered for a "similar" one (for example, if we don't map the triplet "SimSun,italic,400" FOP will use the font mapped to "SimSun,normal,400", warning us about the font substitution).
We are not done yet, as without the next and last step nothing changes when we process our input file.
Step 3: tell FOP to use the configuration file
If we are calling FOP from the command line, we use the -c option to point to our configuration file, for example:
$ fop -c /path/to/our/fop.xconf input.fo input.pdf
From Java code we can use (see also FOP's site):
fopFactory.setUserConfig(new File("/path/to/our/fop.xconf"));
Now, at last, the PDF should correctly use the desired fonts and appear as expected.
If instead FOP terminates abruptly with an error like this:
org.apache.fop.cli.Main startFOP
SEVERE: Exception org.apache.fop.apps.FOPException: Failed to resolve font with embed-url '/Users/furini/Library/Fonts/doesNotExist.ttf'
it means that FOP could not find the font file, and the font configuration needs to be checked again; typical causes are
a typo in the font url
insufficient privileges to access the font file

how to generate Chinese Characters using Postscript?

Does anyone know how to generate Chinese characters using PostScript or related tools? I'd like to use Unicode to represent the Chinese characters, but it seems that PostScript doesn't support Unicode. In addition, I'd like to specify several fonts to generate the same character.
Thus, I have two questions:
1. How do I use Unicode in PostScript? Or how do I enumerate the Chinese character set in the PostScript way?
2. How do I specify font configurations using PostScript?
Lastly, in case PostScript cannot do this job, what tools should I turn to for my purpose?
Thank you very much!
-Jin
In Adobe's official PostScript language specification there is no specific support for Unicode fonts. (And this is the final version of the spec for PS Level 3, valid since its publication in 1999 -- PostScript as a language is no longer developed...)
However, PostScript supports (since Level 2) multi-byte fonts (2-, 3- and 4-byte) in a generic way (see 'CID'). All PostScript fonts need an "encoding": an encoding is basically a table telling at which index position of a font the glyph description for a given character can be found. So while there are no Unicode fonts as such, there are multi-byte CID fonts which provide ranged subsets of Unicode.
Also, there are no freely re-distributable CMaps. (A CMap maps character codes to the character IDs, CIDs, of a CID-keyed font.) If you need a CMap, you have to derive it from the Windows codepage and the matching Adobe CMap.
If you are just looking for a "super-simple" method to use Unicode text strings without needing to check for ranges, languages, etc.: sorry to disappoint you. There is no such way. That would be a pipe dream.
Have a look at CID-keyed fonts instead. These are designed to include a large number of glyphs. (See page 364ff in the PLRM.)
Update: Linked to the correct page with CID font description.