How to print the chinese and japanese characters using I text xml worker [duplicate] - itext

I'm facing a problem when trying to export a Vietnamese document as PDF using iText.
I put Vietnamese words in .xml file like this
<td fontfamily="Helvetica" fontstyle="0" fontsize="9" align="0" colspan="48" lineoccupied="1">T\u1ED5 ch\u1EE9c tham gia</td>
then having java to get the phrases from xml file and convert it into Unicode using this method:
public String convertToUnicode(String s) {
int i = 0, len = s.length();
char c;
StringBuffer sb = new StringBuffer(len);
try {
while (i < len) {
c = s.charAt(i++);
if (c == '\\') {
if (i < len) {
c = s.charAt(i++);
if (c == 'u') {
if (Character.digit(s.charAt(i), 16) != -1
&& Character.digit(s.charAt(i + 1), 16) != -1
&& Character.digit(s.charAt(i + 2), 16) != -1
&& Character.digit(s.charAt(i + 3), 16) != -1) {
if (s.substring(i).length() >= 4) {
c = (char) Integer.parseInt(s.substring(i, i + 4), 16);
i += 4;
} else {
sb.append('\\');
}
} else {
sb.append('\\');
}
} // add other cases here as desired...
}
} // fall through: \ escapes itself, quotes any character but u
sb.append(c);
}
} catch (Exception e) {
System.out.println("Error Generate PDF :: " + e.getStackTrace().toString());
return s;
}
return sb.toString();
}
After that, export String to PDF - encoding UTF-8.
But the program failed to display Vietnamese character '\u1ED5' and '\u1EE9'
The output becomes "T chc tham gia"
Could you please show me how to fix this issue?
Thanks :)

There are 3 XML Worker examples involving Asian languages on the official iText web site. They parse an XHTML file containing Chinese characters, but it should be easy to adapt them to Vietnamese examples.
You can find the HTML files were going to parse here:
hero.html
hero2.html
Both files contain the following text:
長空 (Broken Sword), 秦王殘劍 (Flying Snow), 飛雪 (Moon), 如月 (the King), and 秦王 (Sky).
In the first case, a font is defined using CSS:
<span style="font-size:12.0pt; font-family:MS Mincho">長空</span>
In the second case, no specific font is defined:
<body><p>長空 (Broken Sword), 秦王殘劍 (Flying Snow), 飛雪 (Moon), 如月 (the King), and 秦王 (Sky).</p></body>
These files contain UTF-8 characters, so we're going to parse them like this:
XMLWorkerHelper.getInstance().parseXHtml(writer, document,
new FileInputStream(HTML), Charset.forName("UTF-8"));
The first thing you need, is a font that supports Vietnamese characters. That's something iText can't help you with. In your HTML file, you've defined Helvetica, but that's a standard Type1 font that is never embedded when using iText and that doesn't know how to draw Vietnamese glyphs. That's never going to work.
The first example D07_ParseHtmlAsian will automatically search for a font named MS Mincho. If it finds that font (for instance because you have msmincho.ttc in your Windows fonts directory), the font will show up in your PDF. See hero.pdf. If it doesn't find a font with that name, then the glyphs won't be visible, because you didn't provide any font program for those glyphs.
The second example D07bis_ParseHtmlAsian offers a workaround in case you don't have MS Mincho anywhere. In that case, you have to use an XMLWorkerFontProvider and register a font that can be used instead of MS Mincho. For instance: we use a font stored in the file cfmingeb.ttf and assign the alias MS Mincho:
XMLWorkerFontProvider fontProvider = new XMLWorkerFontProvider(XMLWorkerFontProvider.DONTLOOKFORFONTS);
fontProvider.register("resources/fonts/cfmingeb.ttf", "MS Mincho");
The resulting file asian.pdf is slightly different from what we expect, but now we can at least see the Chinese glyphs.
In the third example, the HTML file doesn't tell us anything about the font that needs to be used. We'll define the font using CSS like this:
CSSResolver cssResolver = new StyleAttrCSSResolver();
CssFile cssFile = XMLWorkerHelper.getCSS(new ByteArrayInputStream("body {font-family:tsc fming s tt}".getBytes()));
cssResolver.addCss(cssFile);
Now, all the text in the body will use the font TSC FMing S TT (stored in the file cfmingeb.ttf). You can see the difference in the resulting PDF asian2.pdf.

I think you need an encoding as UTF-8 for your HTML and use &#xUNUM; for hex or &#NUM; for regular code to embed your special characters. Not sure where but somewhere in your program since it is not display shown, but your final HTML should be:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML LEVEL 1//EN">
<HTML>
<HEAD>
<TITLE>Your Page Title</TITLE>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">
</HEAD>
<BODY>
<!-- YOUR CONTENT HERE -->
<td fontfamily="Helvetica" fontstyle="0" fontsize="9"
align="0" colspan="48"
lineoccupied="1">Tổ chức tham gia</td>
</BODY>
</HTML>
You can cut and paste the above into an HTML file and view the result. More reading pleasure is here Unicode and HTML

Related

How can I get text value of TMPro Text without markup tags in Unity?

I am trying to get text value in TMPro Text component without markup tags but haven't found any solutions.
Say, <b><color=red>Hello </color><b> world is the value in TMPro Text, and I just want Hello world in c# script.
Bear in mind that tag <b> and color will change, so I would love to remove tags dynamically, meaning I would like not to replacing each tag by text.replace("<b>", "") kind of things.
Does anyone know how to do it?
I dont know about the option of using HTML on tmp but you can attach the text to your script by create a new variable like that:
[SerializeField] TMP_Text textVar;
then you can drag you tmp game object to the component that include this script
and the you can change the text like that:
textVar.text = "what ever";
or get text like that:
string textString = textVar.text;
for the color you can use
Color color = textVar.color;
You can use TMP_Text.GetParsedText () to get the text after it has been parsed and rich text tags removed.
Alternatively you can also use regular expressions to search a string for rich text tags, and remove them while preserving the rest of the string.
using System.Text.RegularExpressions;
public static string GetString (string str)
{
Regex rich = new Regex (#"<[^>]*>");
if (rich.IsMatch (str))
{
str = rich.Replace (str, string.Empty);
}
return str;
}

How to specify Japanese encoding for a UILabel?

When I attempt to display a Japanese string in a UILabel on iOS, it gets displayed using Chinese encoding instead of Japanese.
The two encodings are nearly identical, except in a few specific cases. For example, here is how the character 直 (Unicode U+76F4) is rendered in Chinese (top) vs. Japanese (bottom):
(see here for more examples)
The only time Japanese strings render correctly is when the user's system locale is ja-jp (Japan), but I'd like it to render as Japanese for all users.
Is there any way to force the Japanese encoding? Android has TextView.TextLocale, but I don't see anything similar on iOS UILabel
(Same question for Android. I tagged this Swift/Objective-C because, although I'm looking for a Xamarin.iOS solution, the API is almost the same)
You just need to specify language identifier for attributed string, like
let label = UILabel()
let text = NSAttributedString(string: "直", attributes: [
.languageIdentifier: "ja", // << this !!
.font: UIFont.systemFont(ofSize: 64)
])
label.attributedText = text
Tested with Xcode 13.2 / iOS 15.2
I found an extremely hacky solution that seems to work. However, it seems absurd that there's no way to simply set the locale of a label, so if anyone finds something I missed, please post an answer.
The trick relies on the fact that the Hiragino font displays kanji using Japanese encoding rather than Chinese encoding by default. However, the font looks like shit for English text, so I have to search every string in every label for Japanese substrings and manually change the font using NSMutableAttributedString. The font is also completely broken so I had to find another workaround to fix that.
[assembly: ExportRenderer(typeof(Label), typeof(RingotanLabelRenderer))]
namespace MyApp
{
public class MyLabelRenderer : LabelRenderer
{
private readonly UIFont HIRAGINO_FONT = UIFont.FromName("HiraginoSans-W6", 1); // Size gets updated later
protected override void OnElementPropertyChanged(object sender, PropertyChangedEventArgs e)
{
base.OnElementPropertyChanged(sender, e);
// BUGFIX: Chinese encoding is shown by default. Switch to Hiragino font, which correctly shows Japanese characters
// Taken from https://stackoverflow.com/a/71045204/238419
if (Control?.Text != null && e.PropertyName == "Text")
{
var kanjiRanges = GetJapaneseRanges(Control.Text).ToList();
if (kanjiRanges.Count > 0)
{
var font = HIRAGINO_FONT.WithSize((nfloat)Element.FontSize);
var attributedString = Control.AttributedText == null
? new NSMutableAttributedString(Control.Text)
: new NSMutableAttributedString(Control.AttributedText);
// Search through string for all instances of Japanese characters and update the font
foreach (var (start, end) in kanjiRanges)
{
int length = end - start + 1;
var range = new NSRange(start, length);
attributedString.AddAttribute(UIStringAttributeKey.Font, font, range);
// Bugfix: Hiragino font is broken (https://stackoverflow.com/a/44397572/238419) so needs to be adjusted upwards
// jesus christ Apple
attributedString.AddAttribute(UIStringAttributeKey.BaselineOffset, (NSNumber)(Element.FontSize/10), range);
}
Control.AttributedText = attributedString;
}
}
}
// Returns all (start,end) ranges in the string which contain only Japanese strings
private IEnumerable<(int,int)> GetJapaneseRanges(string str)
{
for (int i = 0; i < str.Length; i++)
{
if (IsJapanese(str[i]))
{
int start = i;
while (i < str.Length - 1 && KanjiHelper.IsJapanese(str[i]))
{
i++;
}
int end = i;
yield return (start, end);
}
}
}
private static bool IsJapanese(char character)
{
// An approximation. See https://github.com/caguiclajmg/WanaKanaSharp/blob/792f45a27d6e543d1b484d6825a9f22a803027fd/WanaKanaSharp/CharacterConstants.cs#L110-L118
// for a more accurate version
return character >= '\u3000' && character <= '\u9FFF'
|| character >= '\uFF00';
}
}
}

Display Chinese characters in a table [duplicate]

I'm facing a problem when trying to export a Vietnamese document as PDF using iText.
I put Vietnamese words in .xml file like this
<td fontfamily="Helvetica" fontstyle="0" fontsize="9" align="0" colspan="48" lineoccupied="1">T\u1ED5 ch\u1EE9c tham gia</td>
then having java to get the phrases from xml file and convert it into Unicode using this method:
public String convertToUnicode(String s) {
int i = 0, len = s.length();
char c;
StringBuffer sb = new StringBuffer(len);
try {
while (i < len) {
c = s.charAt(i++);
if (c == '\\') {
if (i < len) {
c = s.charAt(i++);
if (c == 'u') {
if (Character.digit(s.charAt(i), 16) != -1
&& Character.digit(s.charAt(i + 1), 16) != -1
&& Character.digit(s.charAt(i + 2), 16) != -1
&& Character.digit(s.charAt(i + 3), 16) != -1) {
if (s.substring(i).length() >= 4) {
c = (char) Integer.parseInt(s.substring(i, i + 4), 16);
i += 4;
} else {
sb.append('\\');
}
} else {
sb.append('\\');
}
} // add other cases here as desired...
}
} // fall through: \ escapes itself, quotes any character but u
sb.append(c);
}
} catch (Exception e) {
System.out.println("Error Generate PDF :: " + e.getStackTrace().toString());
return s;
}
return sb.toString();
}
After that, export String to PDF - encoding UTF-8.
But the program failed to display Vietnamese character '\u1ED5' and '\u1EE9'
The output becomes "T chc tham gia"
Could you please show me how to fix this issue?
Thanks :)
There are 3 XML Worker examples involving Asian languages on the official iText web site. They parse an XHTML file containing Chinese characters, but it should be easy to adapt them to Vietnamese examples.
You can find the HTML files were going to parse here:
hero.html
hero2.html
Both files contain the following text:
長空 (Broken Sword), 秦王殘劍 (Flying Snow), 飛雪 (Moon), 如月 (the King), and 秦王 (Sky).
In the first case, a font is defined using CSS:
<span style="font-size:12.0pt; font-family:MS Mincho">長空</span>
In the second case, no specific font is defined:
<body><p>長空 (Broken Sword), 秦王殘劍 (Flying Snow), 飛雪 (Moon), 如月 (the King), and 秦王 (Sky).</p></body>
These files contain UTF-8 characters, so we're going to parse them like this:
XMLWorkerHelper.getInstance().parseXHtml(writer, document,
new FileInputStream(HTML), Charset.forName("UTF-8"));
The first thing you need, is a font that supports Vietnamese characters. That's something iText can't help you with. In your HTML file, you've defined Helvetica, but that's a standard Type1 font that is never embedded when using iText and that doesn't know how to draw Vietnamese glyphs. That's never going to work.
The first example D07_ParseHtmlAsian will automatically search for a font named MS Mincho. If it finds that font (for instance because you have msmincho.ttc in your Windows fonts directory), the font will show up in your PDF. See hero.pdf. If it doesn't find a font with that name, then the glyphs won't be visible, because you didn't provide any font program for those glyphs.
The second example D07bis_ParseHtmlAsian offers a workaround in case you don't have MS Mincho anywhere. In that case, you have to use an XMLWorkerFontProvider and register a font that can be used instead of MS Mincho. For instance: we use a font stored in the file cfmingeb.ttf and assign the alias MS Mincho:
XMLWorkerFontProvider fontProvider = new XMLWorkerFontProvider(XMLWorkerFontProvider.DONTLOOKFORFONTS);
fontProvider.register("resources/fonts/cfmingeb.ttf", "MS Mincho");
The resulting file asian.pdf is slightly different from what we expect, but now we can at least see the Chinese glyphs.
In the third example, the HTML file doesn't tell us anything about the font that needs to be used. We'll define the font using CSS like this:
CSSResolver cssResolver = new StyleAttrCSSResolver();
CssFile cssFile = XMLWorkerHelper.getCSS(new ByteArrayInputStream("body {font-family:tsc fming s tt}".getBytes()));
cssResolver.addCss(cssFile);
Now, all the text in the body will use the font TSC FMing S TT (stored in the file cfmingeb.ttf). You can see the difference in the resulting PDF asian2.pdf.
I think you need an encoding as UTF-8 for your HTML and use &#xUNUM; for hex or &#NUM; for regular code to embed your special characters. Not sure where but somewhere in your program since it is not display shown, but your final HTML should be:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML LEVEL 1//EN">
<HTML>
<HEAD>
<TITLE>Your Page Title</TITLE>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">
</HEAD>
<BODY>
<!-- YOUR CONTENT HERE -->
<td fontfamily="Helvetica" fontstyle="0" fontsize="9"
align="0" colspan="48"
lineoccupied="1">Tổ chức tham gia</td>
</BODY>
</HTML>
You can cut and paste the above into an HTML file and view the result. More reading pleasure is here Unicode and HTML

iText(Sharp) - how to avoid creating a blank page?

I'm generating a PDF document using iTextSharp version 5.5.7, using their "streaming" mode - by which I mean I'm not specifying the location of every piece of text, I'm just adding Paragraphs to the Document and letting iTextSharp figure out where to draw them. The text I'm outputting is the result of a report generator, so it is different every time.
The problem I'm running into is this: imagine that, given the page size and the selected font, I can fit 40 lines of text on a page. I output 40 Paragraphs, then I output a blank Paragraph (contents=" "), then an image that fills an entire page. iTextSharp does exactly what I tell it - I end up with one page full of text, a blank page, and then a page containing my image.
But now my document looks funny - there is this unexpectedly blank page in the middle of everything.
I can't just say "don't output any blank lines" because of course that blank line might show up after only 20 lines of text, in which case it has to be there. I need some way to either tell iTextSharp "include this paragraph only if it's not the only thing on a page" or else somehow detect that the page is blank in OnEndPage() and suppress its output (without screwing up my page numbers).
Any suggestions on how I can do this?
ADDED LATER
Output from the report generator:
<html>
<p>Information header</p>
<p>Detail</p>
<p>Detail</p>
<p>Detail</p>
<p></p> <!-- Blank line inserted by report generator for clarity -->
<p>Information header</p>
<p>Detail</p>
<p>Detail</p>
<p>Detail</p>
...
<p>Detail</p> <!-- just by random happenstance this is the last line that will fit on the first page -->
<p></p> <!-- This line happens to be blank, I have no control over it -->
<img src="blah blah"></image>
My (pseudo) code:
foreach (HtmlNode node in htmlFromReportGenerator)
{
if (node is text)
pdfDoc.Add(new Paragraph(node.text));
else if (node is image)
pdfDoc.Add(new Image(node.image));
}
Following Bruno's suggestion, my (pseudo)code now looks like this:
Paragraph lastParagraph = null;
foreach (HtmlNode node in htmlFromReportGenerator)
{
if (node is text)
{
Paragraph parg = new Paragraph(node.text);
if ( (lastParagraph != null) && (text.Trim().Length == 0) )
lastParagraph.SpacingAfter += parg.Leading;
else
{
pdfDoc.Add(parg);
lastParagraph = parg;
}
}
else if (node is image)
{
pdfDoc.Add(new Image(node.image));
lastParagraph = null;
}
}

Controlling the font of the content of an HTML Table in iTextSharp

I am using the most recent DLLs and trying to render HTML fragments into a PDF document using the following code:
Private Function ReadHtml(ByVal text As String) As Paragraph
Dim par = NewParagraph()
Try
Dim htmlText = Server.HtmlDecode(text)
If Not htmlText.StartsWith("<") Then
htmlText = "<span>" & htmlText & "</span>"
End If
Using reader As New IO.StringReader(htmlText)
Dim mh As New MyHandler()
XMLWorkerHelper.GetInstance().ParseXHtml(mh, reader)
'use mh.elements
For Each element In mh.Elements
Dim list = TryCast(element, List)
If list IsNot Nothing Then
element = Clone(list)
End If
par.Add(element)
Next
setFont(par, m_rowFont)
End Using
Catch ex As Exception
Throw New Exception("Exception in ReadHtml using: '" & text & "'")
End Try
Return par
End Function
When this function returns, I take the paragraph and insert it into the PDF. The problem I am having is that while I can set the font in an outer div, and the resulting PDF will render that correctly, if I include an HTML table inside of the div where I've set the font, everything inside of the table renders using the Page's default font.
How do I control the font of the table content?
Your Html in font-size and font-family remove.
public string ConvertPdfHtml(string html)
{
html = CleanStyleAttribute(html, "font-family");
html = CleanStyleAttribute(html, "font-size");
....
return html;
}
private string CleanStyleAttribute(string html, string styleAttributeName)
{
return Regex.Replace(html, styleAttributeName + "(?>:(.*?);)","");
}