This question already has answers here:
RTL not working in pdf generation with itext 5.5 for Arabic text (3 answers)
For my website, I use itextpdf 5.5.4 to generate PDF downloads. The website is aimed at English speakers. Recently, a user from Egypt used the site, entered some Arabic content, and contacted me about a problem he ran into.
This is his Arabic content, shown correctly in the browser:
This is the incorrect display in the PDF:
Here is the Java code I have. Please note that it does generate PDFs with Chinese characters correctly:
BaseFont base = BaseFont.createFont("/fonts/ARIALUNI.ttf", BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
Font f = new Font(base, 10f);
String htmlString = string_with_Arabic_text;
Paragraph p = new Paragraph(htmlString, f);
p.setSpacingBefore(20.0f);
p.setSpacingAfter(7.0f);
document.add(p);
How can I fix this problem?
In Eclipse (the IDE I use), the Arabic characters display correctly in htmlString. At the moment, I cannot upgrade to the latest version of itextpdf for various project reasons.
iText 5 has limited support for non-Western writing systems. It supports right-to-left writing, but only in the context of ColumnText and PdfPCell objects.
This is an iText 5 example with ColumnText where p contains text in Arabic:
// Write directly to the page content, inside a rectangular column
ColumnText canvas = new ColumnText(writer.getDirectContent());
canvas.setSimpleColumn(36, 750, 559, 780);
// Arabic runs from right to left
canvas.setRunDirection(PdfWriter.RUN_DIRECTION_RTL);
canvas.addElement(p);
canvas.go();
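Note that ColumnText positions text at absolute coordinates: you define the column rectangle yourself, and if the content doesn't fit, you have to define a new column and call go() again.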
This is an iText 5 example with PdfPCell where p contains text in Arabic:
PdfPCell cell = new PdfPCell(p);
cell.setRunDirection(PdfWriter.RUN_DIRECTION_RTL);
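A PdfPCell only makes it into the document as part of a table, so in practice you wrap it in a PdfPTable. A minimal sketch (the one-column table is just an illustration):
// iText 5: a cell is rendered as part of a PdfPTable
PdfPTable table = new PdfPTable(1);   // one column
table.setWidthPercentage(100);
PdfPCell cell = new PdfPCell(p);      // p contains the Arabic text
cell.setRunDirection(PdfWriter.RUN_DIRECTION_RTL);
table.addCell(cell);
document.add(table);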
This is very annoying, as it would mean that you have to rewrite your entire application so that all text is added either in a ColumnText or in a PdfPCell object. You'd also have to examine the content to check if you need to change the run direction.
As you have to rewrite the application anyway, it would be best to upgrade to iText 7, because iText 7 has an add-on that detects the writing system based on the Unicode values of the content (see pdfCalligraph). When Arabic or Hebrew text is detected, the add-on changes the writing direction from left to right to right to left. See How to display Arabic strings from RTL in PDF generated using itext 7 API?
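For comparison, a minimal iText 7 sketch for an RTL paragraph might look like this (the font path and the arabicText variable are placeholders; with pdfCalligraph on the classpath, glyph shaping is applied automatically):
// iText 7 (kernel + layout modules)
PdfDocument pdf = new PdfDocument(new PdfWriter("arabic.pdf"));
Document document = new Document(pdf);
// Any font that covers the Arabic range will do; the path is an example
PdfFont font = PdfFontFactory.createFont("/fonts/ARIALUNI.ttf", PdfEncodings.IDENTITY_H);
document.add(new Paragraph(arabicText)
    .setFont(font)
    .setTextAlignment(TextAlignment.RIGHT)
    .setBaseDirection(BaseDirection.RIGHT_TO_LEFT));
document.close();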
I see that you are hard-coding your document in Java. Please note that you can save yourself a lot of work by creating the content in HTML, and then converting it to PDF using the pdfHTML add-on. The HTML to PDF tutorial has some examples involving Arabic. See the section on internationalization in chapter 6, and the following FAQ entries:
Which languages are supported in pdfHTML?
How to convert HTML containing Arabic/Hebrew characters to PDF?
iText 7 is also the first version that supports additional writing systems, such as Devanagari, Tamil, and Telugu. For more info, read the pdfCalligraph white paper.
Important: the pdfCalligraph add-on is closed source. You'll need a trial license to test it and a commercial license to use it in production. Note that the version of iText that you are currently using is licensed as AGPL software, which implies that you can't use it in a closed-source context. You mention external users, which means that you are distributing your service. Did you open-source all of your own source code? If not, you should purchase a commercial license for your use of iText.
Related
We have a use case where we generate PDFs from HTML for both RTL and LTR languages. Can anyone share the differences between Headless Chrome and iText, so we can evaluate which is better for us?
It depends on your expectations of the PDF. If you just want an ordinary PDF, then you can choose any tool that converts HTML to PDF.
However, if you want an archivable PDF (PDF/A), an accessible PDF (PDF/UA), or a PDF 2.0 document, then iText 7 + the pdfHTML add-on + the pdfCalligraph add-on is the better choice. I don't know of any other HTML to PDF conversion software that is PDF 2.0-ready, nor do I think many HTML to PDF converters support PDF/A or PDF/UA. For instance: with an ordinary HTML to PDF converter you can convert Arabic content to a PDF, but when you try to extract the Arabic content back out of that PDF, you will get a result that is slightly different. With iText 7, you create PDF documents from which the content can be extracted correctly.
See How to convert HTML containing Arabic/Hebrew characters to PDF? for an RTL example. This FAQ entry is part of the HTML to PDF tutorial.
NOTE: I'm the original developer of iText; you should get the point of view of the people developing Headless Chrome too.
The PDF will be created from a dynamic HTML page.
Using iText 5 or 7 with the XMLWorkerHelper class would be a lengthy process.
The pdfcrowd API seems to work, but it cannot generate PDFs on localhost or any other private IP. I am ready to pay for their services if they can solve this issue.
First you need to get iText 7 (the core library) and the pdfHTML add-on (the part that parses the HTML+CSS and converts it to iText objects). Go to GitHub to find out how to download them.
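If you build with Maven, the dependencies look roughly like this (the version numbers are only examples; check the release pages for the current ones):
<!-- versions below are examples, not a recommendation -->
<dependency>
    <groupId>com.itextpdf</groupId>
    <artifactId>itext7-core</artifactId>
    <version>7.1.9</version>
    <type>pom</type>
</dependency>
<dependency>
    <groupId>com.itextpdf</groupId>
    <artifactId>html2pdf</artifactId>
    <version>2.1.6</version>
</dependency>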
Suppose that you have this HTML:
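For illustration, something as simple as this (with style.css as a hypothetical stylesheet name):
<html>
  <head>
    <link rel="stylesheet" type="text/css" href="style.css"/>
  </head>
  <body>
    <p class="arabic">اللغة العربية هي لغة جميلة</p>
  </body>
</html>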
With this corresponding CSS:
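Again purely illustrative, the matching stylesheet could be as simple as:
.arabic {
    direction: rtl;
    text-align: right;
}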
Then you can use this code snippet:
ConverterProperties converterProperties =
    new ConverterProperties().setBaseUri(resourceLocation);
HtmlConverter.convertToPdf(
    new FileInputStream(HTMLSource),
    new FileOutputStream(pdfDestination), converterProperties);
Where resourceLocation points at your base URI, HTMLSource is the path to your HTML file, and pdfDestination is the path where you want the resulting PDF to be written.
When you execute this code, you get a PDF in which the HTML, including the Arabic content, is rendered correctly.
Note that buying a commercial license may be necessary if you intend to use iText in the context of a proprietary software project.
I am trying to build a desktop application in C# that generates Hindi PDFs, but the Unicode encoding is not well supported. Any ideas on how to fix this?
string ARIALUNI_TTF = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Fonts), "ARIALUNI.TTF");
BaseFont bf = iTextSharp.text.pdf.BaseFont.CreateFont(ARIALUNI_TTF, BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
iTextSharp.text.Font font = new iTextSharp.text.Font(bf, 8, iTextSharp.text.Font.NORMAL);
Will IDENTITY_H give support for Hindi encoding?
Hindi is not supported yet. A font like mangal.ttf, which supports the Devanagari script, will show you the glyphs in iTextSharp, but not the ligatures. Work is being done on the Indic front, not only for Hindi support but also for Telugu, Gujarati, and others.
You basically require support for Asian characters. A similar thread can be found here (Stack Overflow). The implementation revolves around the usage of BaseFont (use the createFont method), which lets you indicate the font and the appropriate encoding. You can find the example on the official iText site here. Note that the example is in Java, but the same implementation is available in .NET as well.
I have a .rtf file that I need to display within a JavaFX GUI.
My research indicates that the JavaFX TextFlow supports rich text through a tree of Node objects. However, I am at a loss as to how to get my .rtf file represented as such a tree of Nodes.
I feel like there should be an intuitive way to parse the .rtf file into the Node tree, but I just can't seem to find a way to do it!
Parsing RTF and Rendering in a TextFlow
You could parse the RTF and generate a TextFlow representation of it (similar to what is done for markdown markup in this markdown editor). I believe this would be a difficult task (the RTF 1.9.1 specification is 277 pages long), and describing how to do it would be too long and complicated for a StackOverflow answer (even if I could describe it, which I probably could not).
Converting RTF to a format JavaFX can more easily render
I suggest using a converter (either offline or via an online service) to convert your RTF to another format before trying to render it in JavaFX. If you know the documents in advance, you can pre-convert them before shipping your application; if you don't, you will have to provide a real-time conversion facility with your application. I won't recommend a particular service, but you can google and do some research on RTF conversion to see if there is one that fits. As a target format you could choose PDF, HTML, or an image (e.g. PNG).
JavaFX will natively display:
Images, using an ImageView.
HTML, using a WebView (see the sketch after this list).
A third-party library can be used to display PDF documents or other formats in JavaFX.
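For instance, if you convert the RTF to HTML first, displaying it takes only a few lines. A minimal sketch, where converted.html stands in for whatever your converter produced:
// Minimal JavaFX application that renders pre-converted HTML
import javafx.application.Application;
import javafx.scene.Scene;
import javafx.scene.web.WebView;
import javafx.stage.Stage;

public class RtfPreview extends Application {
    @Override
    public void start(Stage stage) {
        WebView webView = new WebView();
        // "converted.html" is a placeholder for your converter's output
        webView.getEngine().load(getClass().getResource("converted.html").toExternalForm());
        stage.setScene(new Scene(webView, 800, 600));
        stage.show();
    }

    public static void main(String[] args) {
        launch(args);
    }
}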
Is there any clear and proper process for converting a PDF file into a Word file, with all formatting and images, in an ASP.NET web application?
The best way to do that is by using OCR. It will recognize the text and the images in the PDF file, and you can then save the result as a DOC file. I know a third-party toolkit named LEADTOOLS that should help you meet your requirements, since it supports the ASP.NET environment. You can check their Online OCR Demo.
Also, you can check their website for more information, or contact their support team.
PDF is a presentational format in which all content is placed at absolute positions. There are no paragraphs or other structural elements (unless it is a Tagged PDF). Technically, you could output every word character by character, in any order, and visually it would still look like normal text. Thus, a proper conversion to Word requires content recognition or some kind of OCR (e.g. ABBYY FineReader).
There are some paid components on the market that do text extraction, and some that convert pages to images (obviously not a desirable approach for converting to Word).
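To make that concrete, here is a sketch of plain text extraction with iText 5's parser classes: it gives you a stream of text in a reconstructed reading order, not paragraphs, styles, or images.
// iText 5 text extraction; iTextSharp has the same classes in .NET
PdfReader reader = new PdfReader("input.pdf");
StringBuilder sb = new StringBuilder();
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    // The strategy sorts text chunks by position to approximate reading order
    sb.append(PdfTextExtractor.getTextFromPage(reader, i, new LocationTextExtractionStrategy()));
}
reader.close();
System.out.println(sb);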