What is the difference between iText and headless Chrome?

We have a use case to generate PDF from HTML for both RTL and LTR languages. Can anyone share the differences between headless Chrome and iText so we can evaluate which is better for us?

It depends on your expectations of the PDF. If you just want an ordinary PDF, then you can choose any tool that converts HTML to PDF.
However, if you want an archivable PDF (PDF/A), an accessible PDF (PDF/UA) or a PDF 2.0 document, then iText 7 + the pdfHTML add-on + the pdfCalligraph add-on is the better choice. I don't know of any other HTML to PDF conversion software that is PDF 2.0-ready, nor do I think many HTML to PDF converters support PDF/A or PDF/UA. For instance: with an ordinary HTML to PDF converter you can convert Arabic content to a PDF, but when you try to extract the Arabic text from that PDF again, you will get a result that is slightly different from the original. With iText 7, you create PDF documents from which the text can be extracted correctly.
See How to convert HTML containing Arabic/Hebrew characters to PDF? for an RTL example. This FAQ entry is part of the HTML to PDF tutorial.
NOTE: I'm the original developer of iText; you should get the point of view of the people developing Headless Chrome too.
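For the "ordinary PDF" case, headless Chrome can be driven from the command line with its --headless and --print-to-pdf switches. The following is a minimal sketch of doing that from Java, assuming a Chrome binary is installed and reachable under the name passed in ("google-chrome" here is an assumption and differs per platform). The class and method names are hypothetical, and the file names are placeholders.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

final class HeadlessChromePdfSketch {

    /** Builds the command line that asks headless Chrome to print an HTML file to PDF. */
    static List<String> buildCommand(String chromeBinary, Path html, Path pdf) {
        return Arrays.asList(
                chromeBinary,                               // assumption: name or path of the Chrome executable
                "--headless",                               // run without a visible window
                "--disable-gpu",                            // commonly added for headless runs
                "--print-to-pdf=" + pdf.toAbsolutePath(),   // where Chrome writes the PDF
                html.toAbsolutePath().toUri().toString());  // file: URL of the input page
    }

    /** Runs Chrome and returns its exit code. */
    static int convert(String chromeBinary, Path html, Path pdf)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(chromeBinary, html, pdf))
                .inheritIO()
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        int exitCode = convert("google-chrome", Paths.get("page.html"), Paths.get("page.pdf"));
        System.out.println("Chrome exited with code " + exitCode);
    }
}
```

A PDF produced this way is whatever the browser's print engine emits; whether it meets PDF/A or PDF/UA requirements would have to be checked separately.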

Related

Chinese/Japanese Characters getting Truncated in Adobe pdf Forms

Native Language Characters are truncated when you start typing them or paste them in form fields using Adobe. If I open it on chrome, it works fine. Currently I am using iTextSharp to populate the data properly to the pdf form. But as soon as I click on the field, font changes and the data gets truncated. Screenshot is attached.
Note: Javascripts from PDF are already removed.
EDIT: PDF is available on https://gofile.io/?c=6nO7ul
Test Characters: 汉字汉

Arabic characters do not display correctly [duplicate]

This question already has answers here:
RTL not working in pdf generation with itext 5.5 for Arabic text
(3 answers)
Closed 4 years ago.
For my website, I use itextpdf 5.5.4 to generate PDF downloads. The website is meant for people who speak English. Recently, a user from Egypt used the site, entered some Arabic content, and contacted me with the problem he has.
This is his Arabic content shown correctly in the browser:
This is the incorrect display in the PDF:
Here is the Java code I have. Please note that it is actually able to generate PDF with Chinese characters CORRECTLY:
BaseFont base = BaseFont.createFont("/fonts/ARIALUNI.ttf", BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
Font f = new Font(base, 10f);
String htmlString = string_with_Arabic_text;
Paragraph p = new Paragraph(htmlString, f);
p.setSpacingBefore(20.0f);
p.setSpacingAfter(7.0f);
document.add(p);
How to fix the problem?
In Eclipse (the IDE I use), I am able to see Arabic characters display correctly in htmlString. At this moment, I cannot upgrade to use the latest version of itextpdf due to various project reasons.
iText 5 has limited support for non-Western writing systems. It supports right-to-left writing, but only in the context of ColumnText and PdfPCell objects.
This is an iText 5 example with ColumnText where p contains text in Arabic:
ColumnText canvas = new ColumnText(writer.getDirectContent());
canvas.setSimpleColumn(36, 750, 559, 780);
canvas.setRunDirection(PdfWriter.RUN_DIRECTION_RTL);
canvas.addElement(p);
canvas.go();
This is an iText 5 example with PdfPCell where p contains text in Arabic:
PdfPCell cell = new PdfPCell(p);
cell.setRunDirection(PdfWriter.RUN_DIRECTION_RTL);
This is very annoying, as it would mean that you have to rewrite your entire application so that all text is added either in a ColumnText or in a PdfPCell object. You'd also have to examine the content to check if you need to change the run direction.
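The content check mentioned above can be sketched with plain JDK calls. This is a minimal sketch and not part of iText: the class and method names are hypothetical, and it only looks for strongly right-to-left characters (such as Hebrew and Arabic letters) using Character.getDirectionality. In iText 5, the result could decide whether to pass PdfWriter.RUN_DIRECTION_RTL or PdfWriter.RUN_DIRECTION_LTR to setRunDirection.

```java
final class RunDirectionSketch {

    /** Returns true if the text contains at least one strongly right-to-left character. */
    static boolean containsRtl(String text) {
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            byte direction = Character.getDirectionality(codePoint);
            if (direction == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                    || direction == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC) {
                return true;
            }
            // Advance by one or two chars so supplementary code points are handled.
            i += Character.charCount(codePoint);
        }
        return false;
    }

    public static void main(String[] args) {
        String arabic = "\u0645\u0631\u062D\u0628\u0627"; // Arabic letters, written as escapes
        String hebrew = "\u05E9\u05DC\u05D5\u05DD";       // Hebrew letters, written as escapes
        String chinese = "\u6C49\u5B57";                  // Chinese characters, written as escapes

        System.out.println(containsRtl(arabic));   // true
        System.out.println(containsRtl(hebrew));   // true
        System.out.println(containsRtl(chinese));  // false
        System.out.println(containsRtl("Hello"));  // false
    }
}
```

This only answers "is there any RTL text here"; mixed-direction paragraphs would need a real bidirectional algorithm, which is beyond this sketch.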
As you have to rewrite the application anyway, it would be best to upgrade to iText 7, because iText 7 has an add-on that detects the writing system based on the Unicode values of the content (see pdfCalligraph). When Arabic or Hebrew text is detected, the add-on changes the writing direction from "left to right" to "right to left." See How to display Arabic strings from RTL in PDF generated using itext 7 API?
I see that you are coding your document. Please note that you can save yourself a lot of work by creating the content in HTML, and then converting it to PDF using the pdfHTML add-on. The HTML to PDF tutorial has some examples involving Arabic. See the section on internationalization in chapter 6, and the following FAQ entries:
Which languages are supported in pdfHTML?
How to convert HTML containing Arabic/Hebrew characters to PDF?
iText 7 is also the first version that supports more writing systems, such as Devanagari, Tamil, Telugu, and others. For more info, read the pdfCalligraph white paper.
Important: the pdfCalligraph add-on is closed source. You'll need a trial license to test it and a commercial license to use it in production. Note that the current version of iText that you are using is licensed as AGPL software, which implies that you can't use your project in a closed source context. You mention external users, which means that you are distributing your service. Did you open source all your own source code? If not, you should purchase a commercial license for your use of iText.

Displaying Rich Text (.rtf) in JavaFX

I have a .rtf file that I need to display within a JavaFX GUI.
My research indicates that the JavaFX TextFlow supports rich text through a tree of Node objects. However, I am at a loss on how to get my .rtf file represented as this tree of Nodes.
I feel like there should be an intuitive way to parse the .rtf file into the Node tree, but I just can't seem to find a way to do it!
Parsing RTF and Rendering in a TextFlow
You could parse the RTF and generate a TextFlow representation of it (similar to what is done for this markdown editor for markdown markup). I believe this would be a difficult task for you (the RTF 1.9.1 specification is 277 pages long). Describing how to do this would be too long and complicated for a StackOverflow answer (even if I could describe it, which I probably could not).
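For simple documents there may be a shortcut that avoids writing a parser. The following is a minimal sketch, assuming the JDK's Swing RTFEditorKit is acceptable (it handles only a subset of RTF): it reads the RTF into a styled document and flattens that into runs of text with bold and italic flags. The Run class and the parse method are hypothetical names, and mapping each run to a JavaFX Text node inside a TextFlow is left out to keep the sketch free of JavaFX dependencies.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.AttributeSet;
import javax.swing.text.BadLocationException;
import javax.swing.text.DefaultStyledDocument;
import javax.swing.text.Element;
import javax.swing.text.StyleConstants;
import javax.swing.text.rtf.RTFEditorKit;

final class RtfRunsSketch {

    /** One stretch of text that shares the same basic styling. */
    static final class Run {
        final String text;
        final boolean bold;
        final boolean italic;

        Run(String text, boolean bold, boolean italic) {
            this.text = text;
            this.bold = bold;
            this.italic = italic;
        }
    }

    /** Parses RTF markup and returns its styled runs in document order. */
    static List<Run> parse(String rtf) {
        DefaultStyledDocument doc = new DefaultStyledDocument();
        List<Run> runs = new ArrayList<>();
        try {
            new RTFEditorKit().read(new StringReader(rtf), doc, 0);
            collect(doc.getDefaultRootElement(), doc, runs);
        } catch (IOException | BadLocationException e) {
            throw new IllegalArgumentException("Could not parse RTF", e);
        }
        return runs;
    }

    /** Walks the element tree; leaf elements carry the character attributes. */
    private static void collect(Element element, DefaultStyledDocument doc, List<Run> runs)
            throws BadLocationException {
        if (element.isLeaf()) {
            int start = element.getStartOffset();
            // The last leaf may extend one position past the document content.
            int end = Math.min(element.getEndOffset(), doc.getLength());
            if (end > start) {
                AttributeSet attrs = element.getAttributes();
                runs.add(new Run(doc.getText(start, end - start),
                        StyleConstants.isBold(attrs),
                        StyleConstants.isItalic(attrs)));
            }
            return;
        }
        for (int i = 0; i < element.getElementCount(); i++) {
            collect(element.getElement(i), doc, runs);
        }
    }
}
```

In a JavaFX application, each Run could then become a Text node whose font weight and posture are set from the bold and italic flags before it is added to the TextFlow.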
Converting RTF to a format JavaFX can more easily render
I suggest using a converter (either offline or an online service) to convert your RTF to another format before trying to render it in JavaFX. If you know the documents in advance, you can pre-convert them before shipping your application; if you don't, you will have to provide a real-time conversion facility with your application. I won't recommend a particular service, but you can search for RTF conversion tools to see if there is one that fits. As a target format you could choose PDF, HTML, or an image (e.g. PNG).
JavaFX will natively display:
Images using an ImageView.
HTML using a WebView.
A 3rd party library can be used to display PDF documents or other formats using JavaFX.

How do I automate converting PDF to HTML?

I work for a publisher and am trying to extract content from our fully laid out PDFs. I've tried pdftohtml, pdftotext, pdfminer, and other Python-based approaches to getting the content, as well as saving to Word, HTML, XML, etc. from the original Acrobat files.
I don't need just the text, I also need the text formatting. That's because, for example, I need all the blue text in the document.
When I save to HTML, Word, etc. from Acrobat, the resulting files contain screenshots of the pages, not the laid out text. When I extract text using different Python modules I get the text but lose the text formatting.
The only solution I've found is to manually copy and paste from the PDF into a word doc, then saving as HTML. I'm hoping to automate this.
Why does copying from Acrobat into Word achieve what I can't do by other means? Has anybody come across this problem before?
Maybe you can consider another method. The software at https://pdfapi.codeplex.com/ can convert PDF files to HTML directly via MVS. If you are able to use MVS, I think the software mentioned above could be useful for converting the text in your PDF files to HTML while keeping the formatting intact. Of course, this is just a suggestion; you can give it a try.

asp.net web application to convert pdf to word

Is there any clear and proper process to convert a pdf file into a word file with all formatting and images in asp.net web application?
The best way to do that is by using OCR. It will recognize the text and the images in the PDF file, and then you can save the result to a DOC file. I know a third-party toolkit named LEADTOOLS that should help you meet your requirements, since it supports the ASP.NET environment. You can check their Online OCR Demo.
Also, you can check their website for more information, or contact their support team.
PDF is a presentational format where all the content is placed at absolute positions. There are no paragraphs or other structured elements (unless it is a Tagged PDF). Technically, you can output every word character by character in any order, and visually it would still look like normal text. Thus, to make a proper conversion to Word, some form of content recognition or OCR is required (e.g. ABBYY FineReader).
There are some paid components on the market that perform text extraction, and some that convert pages to images (obviously, this is not a desirable approach for converting to Word).
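The point about absolute positioning can be illustrated without any PDF library. The following is a minimal sketch, assuming glyphs have already been extracted as (x, y, character) triples by some tool: it groups them into lines by vertical position and orders each line by horizontal position. The Glyph class, the reconstruct method and the tolerance parameter are hypothetical, and the sketch assumes left-to-right text with y growing upwards, as in PDF user space.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ReadingOrderSketch {

    /** A single character placed at an absolute position on the page. */
    static final class Glyph {
        final char ch;
        final double x;
        final double y;

        Glyph(char ch, double x, double y) {
            this.ch = ch;
            this.x = x;
            this.y = y;
        }
    }

    /** Rebuilds text lines from glyphs that may arrive in any order. */
    static String reconstruct(List<Glyph> glyphs, double lineTolerance) {
        List<Glyph> byHeight = new ArrayList<>(glyphs);
        // Highest y first, because the top of the page has the largest y value.
        byHeight.sort(Comparator.comparingDouble((Glyph g) -> g.y).reversed());

        StringBuilder out = new StringBuilder();
        List<Glyph> line = new ArrayList<>();
        for (Glyph g : byHeight) {
            // Start a new line once a glyph sits clearly below the current line.
            if (!line.isEmpty() && line.get(0).y - g.y > lineTolerance) {
                appendLine(out, line);
                line.clear();
            }
            line.add(g);
        }
        appendLine(out, line);
        return out.toString();
    }

    /** Sorts one line left to right and appends it to the output. */
    private static void appendLine(StringBuilder out, List<Glyph> line) {
        if (line.isEmpty()) {
            return;
        }
        line.sort(Comparator.comparingDouble((Glyph g) -> g.x));
        if (out.length() > 0) {
            out.append('\n');
        }
        for (Glyph g : line) {
            out.append(g.ch);
        }
    }
}
```

Even this toy version has to guess where lines begin and end; recovering words, paragraphs, columns and tables from positions alone is the content recognition step that makes PDF to Word conversion hard.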