SVG to PDF conversion doesn't render Cyrillic characters correctly - apache-fop

I'm trying to convert an SVG that contains Cyrillic characters to a PDF document, but those characters show up as #. I have gone through a few threads reporting a similar issue but could not find a solution.
The PDF is generated, but with the warnings below:
About to transcoder source of type: org.apache.batik.apps.rasterizer.SVGConverterFileSource
Aug 10, 2021 5:36:56 PM org.apache.fop.fonts.Typeface warnMissingGlyph
WARNING: Glyph 12362 (0x304a, ohiragana) not available in font Helvetica
Aug 10, 2021 5:36:56 PM org.apache.fop.fonts.Typeface warnMissingGlyph
WARNING: Glyph 12397 (0x306d, nehiragana) not available in font Helvetica
The input SVG does not have any font-family specified.
As mentioned in a few threads, I need to define a font that has glyphs for the required characters. The question is: how do I do that? Can you please provide a link that walks through the steps for defining fonts?
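As a rough sketch only (not taken from the question or any answer): FOP fonts are registered in the FOP configuration file passed to the converter, and the registered font must actually contain Cyrillic glyphs. The font and path used here (DejaVu Sans) are assumptions for illustration; the entry has the same shape as the <font> elements shown in the threads below:
<font kerning="yes" embed-url="fonts/DejaVuSans.ttf" embedding-mode="subset">
  <font-triplet name="DejaVu Sans" style="normal" weight="normal"/>
</font>
The SVG text (or a default applied to it) then has to use a font-family matching the triplet name, e.g. font-family="DejaVu Sans"; with no font-family at all, FOP appears to keep substituting Helvetica, which is what the warnings above report.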

Related

fontspec (xetex/luatex) and fontconfig

LuaTeX and XeTeX are passing on a warning from the fontspec package, and I'm not sure whether I should worry. The following minimal code generates the warning:
\documentclass{report}
\usepackage{fontspec}
\setmainfont[Script=Cyrillic]{Arial}
\begin{document}
\end{document}
The warning is
fontspec warning: "script-not-exist"
Font 'Arial' does not contain script 'Cyrillic'.
Replacing 'Arial' with, for example, 'Charis SIL', results in no warning.
I believe I've traced the problem back to fc-list/fontconfig. In the output of fc-list for the Arial font, I get the following 'capability' line:
capability: "otlayout:arab otlayout:hebr otlayout:latn"(s)
This line does not mention Cyrillic (which would be the code 'cyrl', according to the \newfontscript{Cyrillic}{cyrl} line in fontspec-xetex.sty). Fonts for which I do not get this warning, such as Charis SIL, explicitly mention Cyrillic in their capabilities lines:
capability: "ttable:Silf otlayout:DFLT otlayout:cyrl otlayout:latn"(s)
Unfortunately, documentation of this "capability" line in the output of fc-list is limited to the line
capability String List of layout capabilities in the font
(that is from https://www.freedesktop.org/software/fontconfig/fontconfig-user.html and a couple of other places on the web)
My question is basically, should I worry about this warning message? Arial (and Charis) list 'ru', the ISO 639-1 code for Russian, in the 'lang' field. Just what is this "layout" capability for the Cyrillic script that Charis SIL supports, but Arial apparently does not, and why does it matter?
BTW, this is with the TeX Live 2017 distro.

Displaying Unicode characters in PDF produced by Apache FOP

I have an XML file containing a list of names, some of which use characters/glyphs which are not represented in the default PDF font (Helvetica/Arial):
<name>Paul</name>
<name>你好</name>
I'm processing this file using XSLT and Apache FOP to produce a PDF file which lists the names. Currently I'm getting the following warnings on the console, and the Chinese characters are replaced by ## in the PDF:
Jan 30, 2016 11:30:56 AM org.apache.fop.events.LoggingEventListener processEvent WARNING: Glyph "你" (0x4f60) not available in font "Helvetica".
Jan 30, 2016 11:30:56 AM org.apache.fop.events.LoggingEventListener processEvent WARNING: Glyph "好" (0x597d) not available in font "Helvetica".
I've looked at the documentation and it seems to suggest that the options available are:
Use an OpenType font - except this isn't supported by FOP.
Switch to a different font just for the non-ASCII parts of text.
I don't want to use different fonts for each language, because there will be PDFs that have a mixture of Chinese and English, and as far as I know there's no way to work out which is which in XSLT/XSL-FO.
Is it possible to embed a single font to cover all situations? At the moment I just need English and Chinese, but I'll probably need to extend that in future.
I'm using Apache FOP 2.1 and Java 1.7.0_91 on Ubuntu. I've seen some earlier questions on a similar topic but most seem to be using a much older version of Apache FOP (e.g. 0.95 or 1.1) and I don't know if anything has been changed/improved in the meantime.
Edit: My question is different (I think) from the suggested duplicate. I've switched to using the Ubuntu Font Family with the following code in my FOP config:
<font kerning="yes" embed-url="../fonts/ubuntu/Ubuntu-R.ttf" embedding-mode="full">
<font-triplet name="Ubuntu" style="normal" weight="normal"/>
</font>
<font kerning="yes" embed-url="../fonts/ubuntu/Ubuntu-B.ttf" embedding-mode="subset">
<font-triplet name="Ubuntu" style="normal" weight="bold"/>
</font>
However, I'm still getting the 'glyph not available' warning:
Jan 31, 2016 10:22:59 AM org.apache.fop.events.LoggingEventListener processEvent
WARNING: Glyph "你" (0x4f60) not available in font "Ubuntu".
Jan 31, 2016 10:22:59 AM org.apache.fop.events.LoggingEventListener processEvent
WARNING: Glyph "好" (0x597d) not available in font "Ubuntu".
I know Ubuntu Regular has these two glyphs because it's my standard system font.
Edit 2: If I use GNU Unifont, the glyphs display correctly. However, it seems to be a font aimed more at console use than in documents.
If you cannot find a suitable font supporting both Chinese and English (or you found one but don't much like its Latin glyphs), remember that font-family can contain a comma-separated list of names, to be tried in that order.
So, you can list your desired font for English text first, and then the one for the Chinese text:
<!-- this has # instead of the missing Chinese glyphs -->
<fo:block font-family="Helvetica" space-after="1em" background-color="#AAFFFF">
Paul 你好</fo:block>
<!-- this has all the glyphs, but I don't like its latin glyphs -->
<fo:block font-family="SimSun" space-after="1em" background-color="#FFAAFF">
Paul 你好</fo:block>
<!-- the best of both worlds! -->
<fo:block font-family="Helvetica, SimSun" space-after="1em" background-color="#FFFFAA">
Paul 你好</fo:block>
The output shows the first block with # in place of the missing Chinese glyphs, the second block with every glyph but SimSun's Latin shapes, and the third block combining Helvetica for the Latin text with SimSun for the Chinese text.
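Note that for this fallback to work, every font named in the list other than the built-in base-14 fonts (Helvetica among them) has to be registered in the FOP configuration; the GNU Unifont option mentioned below is registered the same way. A minimal sketch, assuming a SimSun TrueType file at a hypothetical path:
<font kerning="yes" embed-url="../fonts/simsun.ttf" embedding-mode="subset">
  <font-triplet name="SimSun" style="normal" weight="normal"/>
</font>
The font-triplet name has to match the name used in font-family exactly.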
The answer to my question is either to use GNU Unifont, which:
Supports Chinese and English.
Is available under a free licence.
'Just works' if you add it to the FOP config file.
Or alternatively produce separate templates for English and Chinese PDFs and use different fonts for each.

Where can I get a Font-family to language pair map for Microsoft Word

I am programmatically generating a bilingual MS Word 2011 file (containing text in two languages) using docx4j. My plan is to set the font family of the text based on its language: e.g. when a Latin and an Indian language are passed, all English text would use 'Times New Roman' and all Hindi text a Devanagari font.
The MS Word documentation doesn't have any information on this. Any help in finding a list of all the prominent languages MS Word supports and their corresponding font families is appreciated.
The starting point is the rFonts element.
As it says:
This element specifies the fonts which shall be used to display the text contents of this run. Within a single run, there may be up to four types of content present which shall each be allowed to use a unique font:
• ASCII
• High ANSI
• Complex Script
• East Asian
The use of each of these fonts shall be determined by the Unicode character values of the run content, unless manually overridden via use of the cs element.
For further commentary and the actual algorithm used by docx4j (in its PDF output), which aims to mimic Word, see RunFontSelector.
To simplify a bit, you need to work out which of the 4 attributes Word would use for your Hindi (from its Unicode character values), then set that attribute to the font you want.
You can set the attribute to an actual font name, or use a theme reference (see the RunFontSelector code for how that works).
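As a rough sketch only (the font names, and the assumption that Word routes Devanagari through the Complex Script slot, are illustrative, not taken from the answer), setting rFonts on a run with docx4j might look like this:
import java.io.File;
import org.docx4j.jaxb.Context;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.wml.*;

public class BilingualRunFonts {
    public static void main(String[] args) throws Exception {
        WordprocessingMLPackage pkg = WordprocessingMLPackage.createPackage();
        ObjectFactory factory = Context.getWmlObjectFactory();

        // One run holding mixed English and Hindi text; Word picks the font slot per character.
        Text text = factory.createText();
        text.setValue("Hello नमस्ते");
        R run = factory.createR();
        run.getContent().add(factory.createRT(text));

        // rFonts: ascii/hAnsi cover the Latin text, cs covers the Devanagari (complex script) text.
        RFonts rFonts = factory.createRFonts();
        rFonts.setAscii("Times New Roman");
        rFonts.setHAnsi("Times New Roman");
        rFonts.setCs("Mangal"); // hypothetical Devanagari-capable font
        RPr rPr = factory.createRPr();
        rPr.setRFonts(rFonts);
        run.setRPr(rPr);

        P p = factory.createP();
        p.getContent().add(run);
        pkg.getMainDocumentPart().getContent().add(p);
        pkg.save(new File("bilingual.docx"));
    }
}
Inspecting the XML of a docx saved from Word itself, as suggested next, remains the safest way to confirm which attribute and which font Word actually uses for your text.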
If I were you, I'd create a docx in Word which is set up as you like, then look at its underlying XML. If it uses theme references in the font attributes, you can either use the docx you created as a template for your docx4j work, or you can manually 'resolve' the references and replace them with the actual font names.
If you want to programmatically reproduce what Word has created for you, you can upload your docx to the docx4j webapp to generate suitable code.
Finally, note that the fonts need to be available on the computer opening the docx (unless the fonts are embedded in the docx). If they aren't, another font may be substituted.

File containing Unicode characters appears corrupted when opened in Notepad++

I'm using the latest version of Notepad++ (6.3.1) on a Windows OS. When I try to open a file containing Unicode characters, it appears corrupted despite changing the encoding to UTF-8; it is displayed like "[][][][]". Am I missing something in the settings? Kindly help.
Thanks
This is a font issue. You need a font containing the Japanese characters installed on your computer, and you also need Notepad++ set to use such a font for the kind of text being viewed. But it seems that Notepad++ is capable of using fallback fonts when needed (e.g., when the selected font does not contain all the characters appearing in the text), so the problem is probably that no font on your system contains the characters. See e.g. the list of East Asian Unicode fonts for Windows computers.
Not a font issue, unfortunately. Try with these characters, with UTF-16 encoding:
🔊, 🎥, 📕 (> U+FFFF)
Conclusion: Notepad++ doesn't have full Unicode support (unlike Windows Notepad or AkelPad)
Notepad++ is also inconsistent. With a UTF-8 document using the Lucida Console font, create the line
⇐⇑⇒⇓⇔⇕⇖⇗⇘⇙
and insert a newline in the middle: the second line becomes 5 blocks. Then delete the newline, and all 10 characters display properly again.
With the MS Gothic font, this test always displays the proper characters.
Notepad++ v7.5.1 (64-bit)
Build time : Aug 29 2017 - 02:38:44
Path : C:\Program Files\Notepad++\notepad++.exe
Admin mode : OFF
Local Conf mode : OFF
OS : Windows 10 (64-bit)
Plugins : mimeTools.dll NppConverter.dll

How to draw Thai text to a PDF file using the libharu library

I am using the free PDF library libharu to generate a PDF file, but I have an encoding problem: I cannot draw Thai-language text to the PDF file; all the text shows as "???..".
Does somebody know how to fix it?
Thanks
I have succeeded in rendering ideographic texts (not Thai, but Chinese and Japanese) using libharu. First of all, I used Unicode mode; please refer to the HPDF_UseUTFEncodings() function documentation.
In C, here is the sequence of libharu API calls needed to overcome this problem:
HPDF_UseUTFEncodings(docHandle);
HPDF_SetCurrentEncoder(docHandle, "UTF-8");
Here docHandle is a valid HPDF_Doc object.
The next part is loading a UTF-capable font properly:
const char *libFontName = HPDF_LoadTTFontFromFile(docHandle, fontFileName, HPDF_TRUE); /* HPDF_TRUE embeds the font */
HPDF_Font font = HPDF_GetFont(docHandle, libFontName, "UTF-8");
After these calls you may render Unicode text containing Thai characters. Also note the embedding flag (the 3rd parameter of HPDF_LoadTTFontFromFile): without embedding, your PDF file may be unreadable because of external font references. Unless you are very concerned about output PDF size, you may as well just embed the fonts.
I've tested a couple of Thai .ttf fonts found via Google and they rendered OK. Also (it may be important, but I'm not sure) I'm using the fork of libharu at https://github.com/kdeforche/libharu, which has since been merged into the master branch.
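Putting the pieces together, a minimal self-contained sketch (the font path and the Thai sample string are placeholders; any TrueType font with Thai glyphs should work) might look like this:
#include "hpdf.h"

int main(void)
{
    HPDF_Doc pdf = HPDF_New(NULL, NULL);

    /* Switch the document to UTF-8 text handling. */
    HPDF_UseUTFEncodings(pdf);
    HPDF_SetCurrentEncoder(pdf, "UTF-8");

    /* Load and embed a TrueType font containing Thai glyphs (placeholder path). */
    const char *fontName = HPDF_LoadTTFontFromFile(pdf, "fonts/Garuda.ttf", HPDF_TRUE);
    HPDF_Font font = HPDF_GetFont(pdf, fontName, "UTF-8");

    HPDF_Page page = HPDF_AddPage(pdf);
    HPDF_Page_SetFontAndSize(page, font, 16);

    /* The source file itself must be saved as UTF-8 for this literal to be valid. */
    HPDF_Page_BeginText(page);
    HPDF_Page_TextOut(page, 50, 750, "สวัสดี");
    HPDF_Page_EndText(page);

    HPDF_SaveToFile(pdf, "thai.pdf");
    HPDF_Free(pdf);
    return 0;
}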
When you write text to the PDF, use the correct font and encoding. The libharu documentation lists all the possibilities: https://github.com/libharu/libharu/wiki/Fonts
In your case, you must use the ISO8859-11 (Thai, TIS 620-2569) character set.
An example (the text is Spanish, and the last argument is an Objective-C string literal converted to a C string):
HPDF_Font fontEn = HPDF_GetFont(pdf, "Helvetica-Bold", "ISO8859-2");
HPDF_Page_TextOut(page1, 50.00, 750.00, [@"Código para correcta codificación en libharu" cStringUsingEncoding:NSISOLatin1StringEncoding]);