I have some code which produces HTML and convert it to PDF correctly on my machine.
When I test the same code on different machine, some very simple text disappears from the generated PDF: what could be the reason? I have investigated the intermediate HTML and it is the same.
Does the locale affect rendering text from HTML to PDF? What others settings could be potentially affect that?
Related
I am writing a Flutter web application that needs to have a customizable template for printing reports: inside the template there will be some placeholders that must be replaced with data at the moment of printing. The "printing" itself will be done by having the user download and open a PDF file, then print it through the browser, the OS or anything else, but that's beyond the scope, at the moment.
The default template would be something like this, where <BUYER_DATA> and <TRANSACTION_DATA> will be replaced with the data of the transaction the user is printing, along with some other "technical" tags (i.e., the page number and the pages count):
Header with site name
727 Chester Rd New Trafford, Stretford
Manchester
<BUYER_DATA>
01-12-2022 14:40
<TRANSACTION_DATA>
AppName - TM 2022
Page <PAGE_NO> of <PAGES_COUNT>
The user is allowed to edit this template in any aspect (boldness, size, colors, etc), provided that the tags related to the data are not removed from it.
So, to achieve this, I added a WYSIWYG html editor inside a page in order to save the template as an HTML-formatted string. And this works fine: only then I realized that the well-known flutter printing library doesn't support conversion from HTML directly to PDF on web, and all my plans began to crumble.
I then tried to discover if there's some other way to achieve the same by replacing the HTML template with something else, like markdown, but it seems that there's nothing that could help me.
The question is: anybody knows of a package capable of converting from HTML, markdown or such, directly into PDF?
I just need to know so I can stop googling around and decide to write my own parser for the HTML and convert it into a series of Widgets of the before-mentioned printing package.
Native Language Characters are truncated when you start typing them or paste them in form fields using Adobe. If I open it on chrome, it works fine. Currently I am using iTextSharp to populate the data properly to the pdf form. But as soon as I click on the field, font changes and the data gets truncated. Screenshot is attached.
Note: Javascripts from PDF are already removed.
EDIT: PDF is available on https://gofile.io/?c=6nO7ul
Test Characters: 汉字汉
I have come across a scenario where I have to read html data from database and display it in pdf reports. This html data also contains table structure <table></table> tags and other html element inside it. Previously we used jasper reports for our reporting needs but recently as we came to know that the above functionality is not supported in jasper, I wanted to know which reporting tool can be used so that it can be incorporated with servoy. Does birt provide this functionality?
AFAIK none of the well-known reporting tools does support this, although in BIRT it works "somehow" - but not good enough to be usable.
The reason for this is simple, I think: A reporting tool would have to incorporate a complete browser engine like WebKit or others to achieve this, because it would have to "understand" the structure for its page-breaking algorithm.
Yes, BIRT has a text element where we can set the display type to HTML. If the html table is in a dataset field you will just have to include it in the expression of the text using "value-of" tag, something like this:
<VALUE-OF format="HTML">row["htmlTableField"]</VALUE-OF>
PDF format is taking such html elements into account, including most of simple style settings such background color, text-align, borders etc.
Usually the reports render just fine with html.
There are some tricks to displaying html correctly in BIRT.
You may use a Dynamic Text element and set to html or auto.
Here are some tricks to handling free form text..
Make sure your xml is valid, I recommend replacing line breaks or you may catch a scenario where the rptdocument will not export.
Also, if possible keep these in auto layout, when using run + render. The page breaks may actually be calculated once on run and again on render. You might experience breaking issues with fixed. The page may attempt to display all the html prior to breaking a page when using the RUN() phase, in web viewer or the rptdocument. Then when rendering to pdf the the breaks are applied differently, with fixed layout.
I am running into an odd problem with iText. I have a document with a few fields. On my server, I open the local document, set the fields and send the output of the stamper to the browser.
Works perfectly on my local devel machine.
The pdf generated on the server is missing the PDF contents. I only see the content of the fields I set, the rest is completely blank.
Any tips?
Your application on your local machine respects the bytes of the PDF you're using as a template. Your application on the server doesn't respect those bytes. Maybe you've copied the template using the wrong encoding, making all the binary characters corrupt. Or maybe your application is reading the template using the wrong encoding with the same result.
You can find out by opening your PDF file in a text editor (not inside a PDF viewer). Look for the keyword stream and inspect the bytes that follow this keyword. Do you see the difference? In the PDF produced on your local machine, the bytes look like a normal binary stream. In the PDF produced on your server, the bytes look awkward. For instance: it consists of plenty of question marks.
How to solve: check if the template was copied correctly. If so, check the way you're reading the document. For instance: read the PDF template into a byte array without using iText and write it to a new byte array. Can you reproduce the process of corruption? If so, tweak your application (the one that doesn't involve iText) until you've got the correct encoding.
I work for a publisher and am trying to extract content from our fully laid out PDFs. I've tried pdftohtml, pdftotext, pdfminer, and other Python-based approaches to getting the content, as well as saving to Word, HTML, XML, etc. from the original Acrobat files.
I don't need just the text, I also need the text formatting. That's because, for example, I need all the blue text in the document.
When I save to HTML, Word, etc. from Acrobat, the resulting files contain screenshots of the pages, not the laid out text. When I extract text using different Python modules I get the text but lose the text formatting.
The only solution I've found is to manually copy and paste from the PDF into a word doc, then saving as HTML. I'm hoping to automate this.
Why does copying from Acrobat into Word achieve what I can't do by other means? Has anybody come across this problem before?
Maybe you can consider another method. The software (https://pdfapi.codeplex.com/) can convert pdf files to html directly via MVS. If you are able to use the MVS, i think the software i mentioned above is useful for you to convert the text in pdf files to html that can keep the format perfectly. Of course, it's just a referral, you can have a try.