Rich text printout with itext - itext

Users have asked for rich text input in the textareas of a Wicket web application. The input text is persisted in a database. I am thinking of using one of the rich text editors out there (TinyMCE, CKEditor). The problem is that this persisted text also needs to be printed in PDF format with iText in some cases. Is there a straightforward way to print HTML rich text to PDF? I don't mind whether the printout keeps the rich text formatting or not. Any ideas?
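Since the question says plain text is acceptable in the printout, one fallback is to flatten the editor's HTML to plain text before handing it to iText as an ordinary string. The sketch below uses only the Java standard library; `RichTextFlattener` and `toPlainText` are hypothetical names, not iText API, and the regex-based stripping is deliberately crude (it assumes the reasonably well-formed markup that editors such as TinyMCE or CKEditor produce). If the formatting should survive, iText's own HTML add-ons (XML Worker for iText 5, pdfHTML for iText 7) are probably worth checking first.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of iText: reduces editor HTML to plain text
// so it can be passed to any PDF text API as an ordinary string.
final class RichTextFlattener {

    // <br> and the closing tags of common block elements become line breaks.
    private static final Pattern LINE_BREAKS = Pattern.compile(
            "<br\\s*/?>|</(p|div|li|h[1-6])\\s*>", Pattern.CASE_INSENSITIVE);

    // Anything else that looks like a tag is dropped.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    static String toPlainText(String html) {
        String text = LINE_BREAKS.matcher(html).replaceAll("\n");
        text = ANY_TAG.matcher(text).replaceAll("");
        // Decode only the most common entities; &amp; goes last so that
        // "&amp;lt;" ends up as the literal text "&lt;".
        text = text.replace("&nbsp;", " ")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&");
        return text.trim();
    }

    public static void main(String[] args) {
        System.out.println(toPlainText(
                "<p>Hello <strong>world</strong></p><p>Tom &amp; Jerry</p>"));
    }
}
```

The resulting string has one line per paragraph, so it can be split on line breaks and added to the PDF paragraph by paragraph.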

Related

TinyMCE autocloses HTML tags - How to disable?

Same question as here
I have two TinyMCE editors, one for the header and the other for the footer (this needs to be done for an email template).
For example, I want to have
<div>abra in the Header editor. After saving, it becomes <div>abra</div> (the tag gets closed).
And
cadabra</div> in the Footer editor. After saving, it becomes cadabra (the tag gets removed),
so that at the end I could get <div>abracadabra</div>
How can I disable this?
You cannot disable TinyMCE from trying to create valid, well-formed HTML. The engine that drives TinyMCE is designed to ensure that the content in any one editor is valid and well-formed. Although you know that the data across the two editors will be correct once combined, TinyMCE won't allow you to do this. You could certainly post-process the data when extracting it from TinyMCE to get your desired end result.
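One way to do that post-processing, as a rough sketch: let TinyMCE close the header's tags, then move those trailing closing tags to after the footer when the template is assembled. `TemplateJoiner` and `join` are hypothetical names (TinyMCE has no such function), and the sketch assumes that every closing tag at the very end of the saved header is meant to wrap the rest of the email.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical post-processing helper; TinyMCE itself has no such function.
final class TemplateJoiner {

    // One or more closing tags (with optional whitespace) at the very end.
    private static final Pattern TRAILING_CLOSERS =
            Pattern.compile("(?:</[A-Za-z][A-Za-z0-9]*>\\s*)+$");

    // Moves the closing tags at the end of the saved header to after the footer.
    static String join(String header, String body, String footer) {
        Matcher m = TRAILING_CLOSERS.matcher(header);
        if (!m.find()) {
            return header + body + footer; // nothing to move
        }
        String opening = header.substring(0, m.start());
        String closers = header.substring(m.start());
        return opening + body + footer + closers;
    }

    public static void main(String[] args) {
        // Header as saved by TinyMCE, empty body, footer as saved by TinyMCE.
        System.out.println(join("<div>abra</div>", "", "cadabra"));
    }
}
```

With the values from the question, `join("<div>abra</div>", "", "cadabra")` gives `<div>abracadabra</div>`.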

Chinese/Japanese Characters getting Truncated in Adobe pdf Forms

Native language characters are truncated when you start typing them or paste them into form fields using Adobe. If I open the form in Chrome, it works fine. Currently I am using iTextSharp to populate the data into the PDF form, and that works properly. But as soon as I click on a field, the font changes and the data gets truncated. Screenshot is attached.
Note: JavaScript has already been removed from the PDF.
EDIT: PDF is available on https://gofile.io/?c=6nO7ul
Test Characters: 汉字汉

Sanitizing inputs with AEM

We have various people updating our AEM website; however, when they copy and paste from Word or from online sources, the pasted text retains its HTML. I'm wondering if AEM has any built-in way of sanitizing the input so I don't need to build one.
If you are using a Rich Text Editor field in the dialog, then the text will be parsed and some tags will be stripped. Take a look here for more information about how to configure it and how it works.
We had a rich text edit component with the same issue: authors were able to place HTML styling into the RTE, and those styles collided with the application styles and broke components. The fix was to strip out all HTML styling using the jsoup API before rendering the content back on screen.
The usual approach in AEM is to protect the user on output (i.e. take the input as-is and use the built-in XSS API when rendering that input).
https://docs.adobe.com/docs/en/cq/5-6-1/deploying/security_checklist.html#Protect%20against%20Cross-Site%20Scripting%20%28XSS%29
https://docs.adobe.com/content/docs/en/cq/5-6-1/developing/securitychecklist/_jcr_content/par/download/file.res/xss_cheat_sheet.pdf
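To illustrate the "protect on output" idea from the answer above: the stored input is left untouched, and it is only encoded at the moment it is written into the page. In AEM the encoding would be done by the built-in XSS API; the class below is a hypothetical standard-library stand-in that shows the principle, not AEM's implementation.

```java
// Hypothetical stand-in for a real encoder such as AEM's XSS API.
final class OutputEncoder {

    // Replaces the characters that are significant in HTML text and
    // attribute values with character references.
    static String encodeForHtml(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#x27;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Pasted markup is rendered as visible text instead of being interpreted.
        System.out.println(encodeForHtml("<b onclick=\"x()\">Tom & Jerry</b>"));
    }
}
```

This makes pasted markup harmless but also displays it literally, so it suits plain-text fields; for rich text fields, tag stripping along the lines of the jsoup answer is the closer fit.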

How do I automate converting PDF to HTML?

I work for a publisher and am trying to extract content from our fully laid out PDFs. I've tried pdftohtml, pdftotext, pdfminer, and other Python-based approaches to getting the content, as well as saving to Word, HTML, XML, etc. from the original Acrobat files.
I don't need just the text, I also need the text formatting. That's because, for example, I need all the blue text in the document.
When I save to HTML, Word, etc. from Acrobat, the resulting files contain screenshots of the pages, not the laid out text. When I extract text using different Python modules I get the text but lose the text formatting.
The only solution I've found is to manually copy and paste from the PDF into a word doc, then saving as HTML. I'm hoping to automate this.
Why does copying from Acrobat into Word achieve what I can't do by other means? Has anybody come across this problem before?
Maybe you can consider another method. The software (https://pdfapi.codeplex.com/) can convert PDF files to HTML directly via MVS. If you are able to use MVS, I think it could help you convert the text in your PDF files to HTML while keeping the formatting. Of course, it's just a suggestion; give it a try.

How can I extract the first paragraph of a PDF document using Perl's CAM::PDF?

How can I extract the first paragraph of a PDF document using Perl's CAM::PDF?
use CAM::PDF;
print CAM::PDF->new('file.pdf')->getPageText(1);
will get you all of the text from the page. But CAM::PDF is definitely not the best tool for this particular job (I'm the author). I added text extraction on a whim just to see if I could do it.
Plain PDF really is not a markup language. Text is drawn at specific locations. There is something called Tagged PDF and if your documents are tagged, your job might be easier.
I would be inclined to run the documents through a PDF-to-text translator and grab the first chunk of text out of that, provided the text is stored as text in your PDF and not as images.
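Grabbing "the first chunk" from the converter's output could look like the sketch below. It assumes the PDF-to-text tool separates paragraphs with a blank line, which depends on the tool and on the document's layout; `FirstParagraph` is a hypothetical name, and the converted text is expected to be read into a string beforehand.

```java
// Hypothetical helper that works on the output of a PDF-to-text converter.
final class FirstParagraph {

    static String of(String text) {
        // Normalise line endings so only "\n" needs to be handled.
        String normalized = text.replace("\r\n", "\n").replace('\r', '\n');
        // Trim surrounding whitespace, then cut at the first blank line.
        String[] blocks = normalized.trim().split("\n\\s*\n", 2);
        // Re-join the wrapped lines of the paragraph into a single line.
        return blocks[0].replaceAll("\\s*\n\\s*", " ").trim();
    }

    public static void main(String[] args) {
        String converted = "\n\nFirst line\nwrapped here.\n\nSecond paragraph.";
        System.out.println(of(converted));
    }
}
```

If the document starts with a title or a running header, that line will be returned instead of the first body paragraph, so some filtering on length may be needed.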