Apache FOP insert special character [duplicate] - apache-fop

I am maintaining a program which uses Apache FOP to print PDF documents. There have been a couple of complaints about Chinese characters coming out as "####". I found an existing thread about this problem and did some research on my side.
http://apache-fop.1065347.n5.nabble.com/Chinese-Fonts-td10789.html
I do have the uming.ttf font installed on my system. Unlike the person in this thread, I am still getting the "####".
Has anyone seen a workaround that would allow printing complex characters in a PDF document using Apache FOP?

Three steps must be taken for Chinese characters to show up correctly in a PDF file created with FOP (this is also true for any character not available in the default font, and more generally whenever a non-default font is wanted).
Let us use this simple FO example to show the warnings FOP produces when something is wrong:
<?xml version="1.0" encoding="UTF-8"?>
<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">
<fo:layout-master-set>
<fo:simple-page-master master-name="one">
<fo:region-body />
</fo:simple-page-master>
</fo:layout-master-set>
<fo:page-sequence master-reference="one">
<fo:flow flow-name="xsl-region-body">
<!-- a block of chinese text -->
<fo:block>博洛尼亚大学中国学生的毕业论文</fo:block>
</fo:flow>
</fo:page-sequence>
</fo:root>
Processing this input, FOP gives several warnings similar to this one:
org.apache.fop.events.LoggingEventListener processEvent
WARNING: Glyph "?" (0x535a) not available in font "Helvetica".
...
Without any explicit font-family indication in the FO file, FOP defaults to using Helvetica, which is one of the Base-14 fonts (fonts that are available everywhere, so there is no need to embed them).
Each font supports a set of characters, assigning a visible glyph to each of them; when a font does not support a character, the above warning is produced and the PDF shows "#" instead of the missing glyph.
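The notion of a font "supporting" a character can be checked outside FOP too. As a minimal illustration (plain AWT, unrelated to FOP's own font subsystem; the logical "Dialog" font is just an example that resolves on any JVM):

```java
import java.awt.Font;

public class GlyphCheck {
    public static void main(String[] args) {
        // AWT's logical "Dialog" font always resolves to some installed font
        Font font = new Font("Dialog", Font.PLAIN, 12);
        // Basic Latin is covered by virtually every font
        System.out.println("A supported: " + font.canDisplay('A'));
        // CJK coverage depends on the concrete font; a 'false' here is
        // the same situation FOP reports with its glyph warning
        System.out.println("\u535a supported: " + font.canDisplay('\u535a'));
    }
}
```

This only illustrates glyph coverage; FOP reads the font files named in its configuration, not AWT's font list.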
Step 1: set font-family in the FO file
If the default font doesn't support the characters of our text (or we simply want to use a different font), we must use the font-family property to state the desired one.
The value of font-family is inherited, so if we want to use the same font for the whole document we can set the property on the fo:page-sequence; if we need a special font just for some paragraphs or words, we can set font-family on the relevant fo:block or fo:inline.
So, our input becomes (using a font I have as example):
<?xml version="1.0" encoding="UTF-8"?>
<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">
<fo:layout-master-set>
<fo:simple-page-master master-name="one">
<fo:region-body />
</fo:simple-page-master>
</fo:layout-master-set>
<fo:page-sequence master-reference="one">
<fo:flow flow-name="xsl-region-body">
<!-- a block of chinese text -->
<fo:block font-family="SimSun">博洛尼亚大学中国学生的毕业论文</fo:block>
</fo:flow>
</fo:page-sequence>
</fo:root>
But now we get a new warning, in addition to the old ones!
org.apache.fop.events.LoggingEventListener processEvent
WARNING: Font "SimSun,normal,400" not found. Substituting with "any,normal,400".
org.apache.fop.events.LoggingEventListener processEvent
WARNING: Glyph "?" (0x535a) not available in font "Times-Roman".
...
FOP doesn't know how to map "SimSun" to a font file, so it falls back to a generic Base-14 font (Times-Roman), which does not support our Chinese characters, and the PDF still shows "#".
Step 2: configure font mapping in FOP's configuration file
Inside FOP's folder, the file conf/fop.xconf is an example configuration; we can directly edit it or make a copy to start from.
The configuration file is an XML file, and we have to add the font mappings inside /fop/renderers/renderer[@mime='application/pdf']/fonts/ (there is a renderer section for each possible output MIME type, so check that you are inserting your mapping in the right one):
<?xml version="1.0"?>
<fop version="1.0">
...
<renderers>
<renderer mime="application/pdf">
...
<fonts>
<!-- specific font mapping -->
<font kerning="yes" embed-url="/Users/furini/Library/Fonts/SimSun.ttf" embedding-mode="subset">
<font-triplet name="SimSun" style="normal" weight="normal"/>
</font>
<!-- "bulk" font mapping -->
<directory>/Users/furini/Library/Fonts</directory>
</fonts>
...
</renderer>
...
</renderers>
</fop>
each font element points to a font file;
each font-triplet entry identifies a combination of font-family + font-style (normal, italic, ...) + font-weight (normal, bold, ...) mapped to the font file in the parent font element;
using directory elements it is also possible to automatically register all the font files inside the indicated folders (though this takes some time if the folders contain many fonts).
If we have a complete set of files with specific versions of the desired font (normal, italic, bold, light, bold italic, ...) we can map each file to its precise font triplet, producing typographically rich output.
At the opposite end of the spectrum, we can map all the triplets to the same font file if that is all we have: in the output all text will look the same, even where the FO file marked it as italic or bold.
Note that we don't need to register all possible font triplets: if one is missing, FOP will use the font registered for a "similar" one (for example, if we don't map the triplet "SimSun,italic,400", FOP will use the font mapped to "SimSun,normal,400", warning us about the substitution).
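If the font family comes as separate files per style, the mapping can cover the four common triplets. A sketch (the file names and paths are hypothetical; SimSun itself does not actually ship separate style files):

```xml
<fonts>
  <!-- regular -->
  <font kerning="yes" embed-url="/path/to/SimSun.ttf" embedding-mode="subset">
    <font-triplet name="SimSun" style="normal" weight="normal"/>
  </font>
  <!-- bold -->
  <font kerning="yes" embed-url="/path/to/SimSun-Bold.ttf" embedding-mode="subset">
    <font-triplet name="SimSun" style="normal" weight="bold"/>
  </font>
  <!-- italic -->
  <font kerning="yes" embed-url="/path/to/SimSun-Italic.ttf" embedding-mode="subset">
    <font-triplet name="SimSun" style="italic" weight="normal"/>
  </font>
  <!-- bold italic -->
  <font kerning="yes" embed-url="/path/to/SimSun-BoldItalic.ttf" embedding-mode="subset">
    <font-triplet name="SimSun" style="italic" weight="bold"/>
  </font>
</fonts>
```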
We are not done yet: without the next and last step, nothing changes when we process our input file.
Step 3: tell FOP to use the configuration file
If we are calling FOP from the command line, we use the -c option to point to our configuration file, for example:
$ fop -c /path/to/our/fop.xconf input.fo input.pdf
From java code we can use (see also FOP's site):
fopFactory.setUserConfig(new File("/path/to/our/fop.xconf"));
Now, at last, the PDF should correctly use the desired fonts and appear as expected.
If instead FOP terminates abruptly with an error like this:
org.apache.fop.cli.Main startFOP
SEVERE: Exception org.apache.fop.apps.FOPException: Failed to resolve font with embed-url '/Users/furini/Library/Fonts/doesNotExist.ttf'
it means that FOP could not find the font file, and the font configuration needs to be checked again; typical causes are
a typo in the font url
insufficient privileges to access the font file

Related

Unicode (Cyrillic) text in SVG in Apache FOP (XML)

(I don't know which FOP version I'm using. I only have API/HTTP access to the service. I'm asking.)
I'm making a PDF with Apache FOP. It contains SVG with <text>. Some text is Bulgarian/Cyrillic, inside and outside the SVG.
In a web browser, Bulgarian needs no encoding:
In a <table>: <td class="label">Спортуване</td>
In an <svg>: <text x="153" y="91">Спортуване</text>
In Apache FOP input XML, it needs encoding:
Outside SVG: <fop:block>Спортуване</fop:block>
Every character is encoded as a numeric character reference.
That works, outside SVG.
Inside SVG: <svg:text x="153" y="91" >Спортуване</svg:text>
(Local namespace is svg)
Same exact encoding
Doesn't work =(
So the only part that doesn't work is text inside the SVG. Text outside works perfectly.
It's not the font. I can change inside and outside SVG fonts to Times (the default), and the same happens with the new font: works outside SVG, doesn't inside.
Web browser result:
PDF result:
PDF, outside SVG:
My special spelling of Aesthetics (aëstéthics) does work in the PDF SVG. It's only partly encoded, since some characters are safe ASCII: <svg:text x="436" y="218">aëstéthics</svg:text>. The Ӓ encoded characters work, like they do outside the SVG for Cyrillic.
How do I add unicode <text> in SVG in FOP??
The 'new' version (1.1) on the same server, on CLI:
Without a fonts config, no Unicode works: all text is #####, inside and outside SVG.
With a fonts config (TTF fonts copied from Windows):
fop -fo input.xml -c fop.conf.xml -pdf output-1.pdf
A bunch of these warnings:
The following feature isn't implemented by Apache FOP, yet: table-layout="auto" (on fo:table) (See position 29:34)
and a few of these:
Glyph "И" (0x418, Iicyrillic) not available in font "Helvetica".
but the result is very decent. The same as on version 0.9. No Unicode inside the SVG, but outside works.
Omg omg omg version 2.1 on a different server works!!
Even with weird java error:
[warning] /usr/bin/fop: JVM flavor 'sun' not understood
If only version 2.1 were available on the right server, with HTTP API access...
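For completeness, the character-reference encoding discussed above can be generated with plain Java (a sketch, no FOP involved):

```java
public class NcrEncode {
    // Replace every non-ASCII character with a hexadecimal
    // numeric character reference, e.g. 'С' (U+0421) -> &#x421;
    static String encode(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp < 128) {
                sb.append((char) cp);
            } else {
                sb.append("&#x")
                  .append(Integer.toHexString(cp).toUpperCase())
                  .append(';');
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("\u0421\u043F\u043E\u0440\u0442")); // "Спорт"
    }
}
```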

docx4J - Set default font or encoding to UTF-8 for docx output file

I'm using docx4j to build a translation app whose input is a docx file and whose output is docx too. I have problems with Chinese character input. This is the w:rFonts tag of the input file: <w:rFonts w:hint="eastAsia" w:ascii="MingLiU" w:hAnsi="MingLiU" w:eastAsia="MingLiU" w:cs="MingLiU"/>
How can I change the font to Times New Roman in the output file, or change the encoding to UTF-8?
Thank you guys!
The encoding should be UTF-8 already. That's standard for docx files.
The simplest way to change to "Times New Roman" is to set the attributes of the rFonts tag above; that is, wherever it says "MingLiU".
To do that, get the rFonts object (in direct formatting, styles etc)
You should also change the font in rPrDefaults, since this takes effect anywhere where it isn't overridden by another rFonts tag.
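As a sketch of the resulting XML (attribute values swapped in for the MingLiU ones quoted in the question, the hint attribute kept as in the original):

```xml
<w:rFonts w:hint="eastAsia" w:ascii="Times New Roman" w:hAnsi="Times New Roman"
          w:eastAsia="Times New Roman" w:cs="Times New Roman"/>
```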

Some non-ASCII characters appear as a thick vertical line in Talend 5.6

I have a file named "→Ψjohn.txt" and I want to remove those special characters from the file name, updating it to "john.txt". But Talend renders those characters as a thick vertical line, so it does not recognize the source file in its physical location. Can anyone please suggest a solution?
The file exists in a database as well as in the physical location; when I read it from the database, the job has to remove the special characters and update the name both in the database and in the physical location.
in database the file looks like this
database
when I am reading from database using talend it looks like following
talend
Thanks in advance
You don't have to specify the exact file name: you can use tFileList to get all the files in a specific directory, and you can use a regex to mask some names, for example iterating over all *john.txt files.
Once you have the actual file name, use a regex to remove the unwanted characters (for example \W matches non-word characters) and rename the file through a system command or with tFileCopy.
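The regex step can be sketched in plain Java (the file path is hypothetical; note that \W alone would also strip the dot, so the sketch keeps word characters and dots):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CleanFileName {
    // Strip every character that is neither a word character nor a dot,
    // so "→Ψjohn.txt" becomes "john.txt"
    static String clean(String name) {
        return name.replaceAll("[^\\w.]", "");
    }

    public static void main(String[] args) throws IOException {
        Path source = Paths.get("\u2192\u03A8john.txt"); // hypothetical location
        if (Files.exists(source)) {
            // rename the file on disk to the cleaned name
            Path target = source.resolveSibling(clean(source.getFileName().toString()));
            Files.move(source, target);
        }
        System.out.println(clean("\u2192\u03A8john.txt"));
    }
}
```

The database side would then be a plain UPDATE with the same cleaned name.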
After seeing the pictures, this is not a Talend issue but has to do with the font used in the Talend (Eclipse) console and the encoding setting of Java.
Those rectangles (better visible at a bigger font size) show that the font cannot represent your characters: it has no symbols for them.
Talend (Eclipse) settings
In Talend, navigate to Window / Preferences and choose General / Appearance / Colors and Fonts (as described in the Eclipse help). Check what font you are using for Debug / Console font. I've had good results with the font Consolas, which is set from Talend version 6 onwards. Beforehand it was Courier New.
Java encoding
You should check that Java uses UTF-8 encoding to display the characters. The console encoding has to be set to UTF-8. See this answer for an explanation of how to do this.
Alternative
Alternatively, you could store all the log data in a file and open that file in e.g. Notepad++ to check whether the output is generated correctly and only displayed wrongly.

Displaying Unicode characters in PDF produced by Apache FOP

I have an XML file containing a list of names, some of which use characters/glyphs which are not represented in the default PDF font (Helvetica/Arial):
<name>Paul</name>
<name>你好</name>
I'm processing this file using XSLT and Apache FOP to produce a PDF file which lists the names. Currently I'm getting the following warning on the console and the Chinese characters are replaced by ## in the PDF:
Jan 30, 2016 11:30:56 AM org.apache.fop.events.LoggingEventListener processEvent WARNING: Glyph "你" (0x4f60) not available in font "Helvetica".
Jan 30, 2016 11:30:56 AM org.apache.fop.events.LoggingEventListener processEvent WARNING: Glyph "好" (0x597d) not available in font "Helvetica".
I've looked at the documentation and it seems to suggest that the options available are:
Use an OpenType font - except this isn't supported by FOP.
Switch to a different font just for the non-ASCII parts of text.
I don't want to use different fonts for each language, because there will be PDFs that have a mixture of Chinese and English, and as far as I know there's no way to work out which is which in XSLT/XSL-FO.
Is it possible to embed a single font to cover all situations? At the moment I just need English and Chinese, but I'll probably need to extend that in future.
I'm using Apache FOP 2.1 and Java 1.7.0_91 on Ubuntu. I've seen some earlier questions on a similar topic but most seem to be using a much older version of Apache FOP (e.g. 0.95 or 1.1) and I don't know if anything has been changed/improved in the meantime.
Edit: My question is different (I think) to the suggested duplicate. I've switched to using the Ubuntu Font Family using the following code in my FOP config:
<font kerning="yes" embed-url="../fonts/ubuntu/Ubuntu-R.ttf" embedding-mode="full">
<font-triplet name="Ubuntu" style="normal" weight="normal"/>
</font>
<font kerning="yes" embed-url="../fonts/ubuntu/Ubuntu-B.ttf" embedding-mode="subset">
<font-triplet name="Ubuntu" style="normal" weight="bold"/>
</font>
However, I'm still getting the 'glyph not available' warning:
Jan 31, 2016 10:22:59 AM org.apache.fop.events.LoggingEventListener processEvent
WARNING: Glyph "你" (0x4f60) not available in font "Ubuntu".
Jan 31, 2016 10:22:59 AM org.apache.fop.events.LoggingEventListener processEvent
WARNING: Glyph "好" (0x597d) not available in font "Ubuntu".
I know Ubuntu Regular has these two glyphs because it's my standard system font.
Edit 2: If I use GNU Unifont, the glyphs display correctly. However, it seems to be a font aimed more at console use than in documents.
If you cannot find a suitable font supporting both Chinese and English (or you found one but don't much like its Latin glyphs), remember that font-family can contain a comma-separated list of names, to be used in that order.
So, you can list your desired font for English text first, and then the one for the Chinese text:
<!-- this has # instead of the missing Chinese glyphs -->
<fo:block font-family="Helvetica" space-after="1em" background-color="#AAFFFF">
Paul 你好</fo:block>
<!-- this has all the glyphs, but I don't like its latin glyphs -->
<fo:block font-family="SimSun" space-after="1em" background-color="#FFAAFF">
Paul 你好</fo:block>
<!-- the best of both worlds! -->
<fo:block font-family="Helvetica, SimSun" space-after="1em" background-color="#FFFFAA">
Paul 你好</fo:block>
The output looks like this:
The answer to my question is either to use GNU Unifont, which:
Supports Chinese and English.
Is available under a free licence.
'Just works' if you add it to the FOP config file.
Or, alternatively, to produce separate templates for English and Chinese PDFs and use different fonts for each.

Where can I get a Font-family to language pair map for Microsoft Word

I am programmatically generating an MS Word 2011 bilingual file (containing text from two languages) using docx4j. My plan is to set the font family of each run based on the language of its text, e.g. when a Latin and an Indic language are passed, all English text gets 'Times New Roman' and Hindi text a Devanagari font.
The MS Word documentation doesn't have any information on this. Any help finding a list of the prominent languages MS Word supports and their corresponding font families is appreciated.
The starting point is the rFonts element.
As it says:
This element specifies the fonts which shall be used to display the
text contents of this run. Within a single run, there may be up to
four types of content present which shall each be allowed to use a
unique font:
• ASCII
• High ANSI
• Complex Script
• East Asian
The use of each of these fonts shall be determined by the Unicode
character values of the run content, unless manually overridden via
use of the cs element
For further commentary, and the actual algorithm used by docx4j (in its PDF output), which aims to mimic Word, see RunFontSelector.
To simplify a bit, you need to work out which of the 4 attributes Word would use for your Hindi (from its Unicode character values), then set that attribute to the font you want.
You can set the attribute to an actual font name, or use a theme reference (see the RunFontSelector code for how that works).
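To make the idea concrete, here is a rough sketch of "pick the rFonts slot from the code point". The ranges are illustrative only; docx4j's RunFontSelector implements the real, much longer algorithm:

```java
public class RunFontSlot {
    // Very rough sketch: map a Unicode code point to one of the four
    // rFonts attributes. Real Word/docx4j logic covers many more ranges
    // and also honours the w:hint and w:cs overrides.
    static String slotFor(int cp) {
        if (cp <= 0x7F) return "ascii";
        if (cp >= 0x0900 && cp <= 0x097F) return "cs";       // Devanagari -> Complex Script
        if (cp >= 0x4E00 && cp <= 0x9FFF) return "eastAsia"; // CJK ideographs
        return "hAnsi";                                       // e.g. accented Latin
    }

    public static void main(String[] args) {
        System.out.println(slotFor('A'));    // ascii
        System.out.println(slotFor(0x0905)); // cs (Devanagari letter A)
        System.out.println(slotFor(0x4F60)); // eastAsia (你)
    }
}
```

Once you know the slot, set that attribute on the run's rFonts to the font you want.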
If I were you, I'd create a docx in Word which is set up as you like, then look at its underlying XML. If it uses theme references in the font attributes, you can either use the docx you created as a template for your docx4j work, or you can manually 'resolve' the references and replace them with the actual font names.
If you want to programmatically reproduce what Word has created for you, you can upload your docx to the docx4j webapp to generate suitable code.
Finally, note that the fonts need to be available on the computer opening the docx (unless they are embedded in it); if they aren't, another font may be substituted.