Chinese/Japanese Characters getting Truncated in Adobe pdf Forms

Native-language characters are truncated when I type or paste them into form fields in Adobe. If I open the PDF in Chrome, it works fine. I am currently using iTextSharp to populate the PDF form, and the data is filled in correctly. But as soon as I click on a field, the font changes and the data gets truncated. Screenshot is attached.
Note: JavaScript has already been removed from the PDF.
EDIT: PDF is available on https://gofile.io/?c=6nO7ul
Test Characters: 汉字汉

Related

What is the difference between IText and Headless chrome

We have a use case to generate PDF from HTML for both RTL and LTR languages. Can anyone share the differences between Headless Chrome and iText so we can evaluate which is better for us?

It depends on your expectations of the PDF. If you just want an ordinary PDF, then you can choose any tool that converts HTML to PDF.
However, if you want an archivable PDF (PDF/A), an accessible PDF (PDF/UA) or a PDF 2.0 document, then iText 7 + the pdfHTML add-on + the pdfCalligraph add-on is the better choice. I don't know of any other HTML to PDF conversion software that is PDF 2.0-ready, nor do I think many HTML to PDF converters support PDF/A or PDF/UA. For instance: with an ordinary HTML to PDF converter you can convert Arabic content to a PDF, but when you try to extract the Arabic content from that PDF again, the result will be slightly different from the original. With iText 7, you create PDF documents from which the text can be extracted correctly.
See How to convert HTML containing Arabic/Hebrew characters to PDF? for an RTL example. This FAQ entry is part of the HTML to PDF tutorial.
NOTE: I'm the original developer of iText; you should get the point of view of the people developing Headless Chrome too.

Does TinyMCE PowerPaste support pasting images from a browser as Base64

From the TinyMCE PowerPaste plugin documentation:
"If you configure PowerPaste to allow local images (see the powerpaste_allow_local_images setting below) then images pasted from Microsoft Word and other sources will appear in TinyMCE as Base64 encoded images."
It's not clear what "and other sources" means. Does it include things copied from a browser?
From https://stackoverflow.com/a/39842881/329660 I assume that an image pasted as part of a chunk of HTML can't be pasted as binary data in TinyMCE.
But if the user right-clicks the image and chooses "Copy image" in their browser, is PowerPaste supposed to paste the full Base64 data into the editor (assuming, of course, that the browser put the data on the clipboard in the correct format in the first place)?

The "other sources" for this is typically just Excel. Word and Excel are the two office products that consistently place image binaries into the clipboard in RTF format that PowerPaste can extract.
If another office product did the same, PowerPaste would see that and grab it as well, but at present I don't know of another office product that does exactly what Word and Excel do with images.

IText Pdf - RadioBox(On/Off) not appearing for some pdf

In our application we are using the iText PDF 5.5.3 library.
We have checked some PDFs in which checkboxes are displayed correctly (checked/unchecked).
However, there are some PDFs with radio buttons that do not display the radio button state (on/off) correctly.
I also used this link to validate the PDFs, and the Java code
String[] values = form.getAppearanceStates("Checkbox");
returns null values.
I also tried iText RUPS and found that the PDFs that work show form field names in the RUPS Form tab, while the PDFs that do not work show no form fields there.
I tried generating a PDF from a Word document: it doesn't display form fields in RUPS, and I can't check or uncheck the checkbox in Adobe Acrobat Reader either.
What could be the solution to display radio buttons with the on/off state?
Edit -
I have created a sample web application to reproduce the issue.
Please set up the attached web application and let me know the fix for the issue.
Please download from this link

You have successfully discovered the difference between interactive PDF forms and "flat" PDF documents that look like a form to the human eye, but that aren't interactive forms.
To make the "flat" forms interactive, you need to open those flat documents in PDF editing software (e.g. Adobe Acrobat) and you need to add a form field manually.
You can ask Acrobat to guess where it should add fields, but Acrobat will be wrong in many cases for obvious reasons. You always need a human if you want it to be done correctly.
As for creating an interactive PDF from Word... Forget about it. Use OpenOffice or LibreOffice.
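As a rough first triage before opening a file in RUPS, one can look for the `/AcroForm` key in the raw bytes of the PDF. The following is a minimal sketch, not the iText API: the class name `AcroFormProbe` and the method `looksInteractive` are hypothetical, and the check is only a heuristic, because the document catalog may sit inside a compressed object stream where the key is not visible as plain text. A file that fails this check may still be interactive, and a file that passes it still has to be inspected for actual fields.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of iText: a crude heuristic only.
class AcroFormProbe {

    // Returns true if the raw PDF bytes contain the "/AcroForm" key in plain text.
    static boolean looksInteractive(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so binary data survives decoding.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        return raw.contains("/AcroForm");
    }

    public static void main(String[] args) throws IOException {
        // Each argument is assumed to be a path to a PDF file.
        for (String path : args) {
            byte[] bytes = Files.readAllBytes(Paths.get(path));
            System.out.println(path + ": " + (looksInteractive(bytes)
                    ? "has an /AcroForm entry"
                    : "no /AcroForm entry visible (possibly flat)"));
        }
    }
}
```

Running it over the working and non-working PDFs side by side gives a quick hint about which ones are flat; RUPS remains the place to confirm the result.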

XMLWorker not rendering full html on different machine

I have some code which produces HTML and converts it to PDF correctly on my machine.
When I test the same code on a different machine, some very simple text disappears from the generated PDF: what could be the reason? I have inspected the intermediate HTML and it is the same.
Does the locale affect rendering text from HTML to PDF? What other settings could potentially affect that?
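One way to narrow this down is to print a few environment settings on both machines and compare them. The sketch below assumes a Java setup (XML Worker also exists for .NET, so this is an assumption), and the class name `EnvReport` is hypothetical. It only reports the default locale, the default charset and a few system properties; whether any of these values is actually responsible for the missing text is not established by the question.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical diagnostic helper: collects settings that commonly differ between machines.
class EnvReport {

    static Map<String, String> collect() {
        Map<String, String> env = new LinkedHashMap<>();
        env.put("default.locale", Locale.getDefault().toString());
        env.put("default.charset", Charset.defaultCharset().name());
        // Standard JVM system properties; String.valueOf guards against a missing property.
        for (String key : new String[] {"file.encoding", "os.name", "java.version"}) {
            env.put(key, String.valueOf(System.getProperty(key)));
        }
        return env;
    }

    public static void main(String[] args) {
        // Print one "key = value" line per setting so two runs can be diffed.
        collect().forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

If the two outputs differ, for example in the default charset, that difference is a candidate to test, not a confirmed cause.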

How do I automate converting PDF to HTML?

I work for a publisher and am trying to extract content from our fully laid out PDFs. I've tried pdftohtml, pdftotext, pdfminer, and other Python-based approaches to getting the content, as well as saving to Word, HTML, XML, etc. from the original Acrobat files.
I don't need just the text, I also need the text formatting. That's because, for example, I need all the blue text in the document.
When I save to HTML, Word, etc. from Acrobat, the resulting files contain screenshots of the pages, not the laid out text. When I extract text using different Python modules I get the text but lose the text formatting.
The only solution I've found is to manually copy and paste from the PDF into a Word document, then save it as HTML. I'm hoping to automate this.
Why does copying from Acrobat into Word achieve what I can't do by other means? Has anybody come across this problem before?

Maybe you can consider another method. The software at https://pdfapi.codeplex.com/ can convert PDF files to HTML directly via MVS. If you are able to use MVS, I think it could help you convert the text in PDF files to HTML while keeping the formatting. It's just a suggestion, but it may be worth a try.