Making a PDF output in raster format instead of vector using itextsharp - itext

I have written C# code that saves product specifications to a PDF document using iTextSharp, mainly with PdfPTable and Chunks/Paragraphs inside the PdfPCells. However, I have been told the output is unacceptable because the text can be highlighted and copied, and the document storage and retrieval server software currently in use does not support "vector" based PDFs. I'm not exactly certain what the difference is between a raster PDF and a vector PDF. Essentially, every page of the PDF document should be an image so that the text cannot be highlighted. Is there any way to do this without using the DirectContent? Below is an image illustrating a portion of the generated PDF and how its text can be selected and copied, which is the incorrect behavior.
I would like to avoid writing directly to the canvas, unless there is a way to do that and still have iTextSharp handle my formatting and paging.
The Windows application PDF2R works well, but it doesn't seem to offer any programmatic interface. I have found libraries that claim to do this sort of conversion, but they cost several thousand dollars. I'd like to stay within my budget and use iTextSharp or something far cheaper.

I would suggest generating an image of each page's content using the System.Drawing classes and then inserting that image into the PDF document.
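For illustration, here is a minimal sketch of that idea, assuming iTextSharp plus System.Drawing on Windows. The page size, DPI, font, file name and the plain-text content are all placeholders; it does not reproduce a PdfPTable layout, it only shows the mechanics of putting one bitmap per page so nothing is selectable:

using System.Drawing;
using System.Drawing.Imaging;
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

public static class RasterPageSketch
{
    public static void Create(string path, string pageText)
    {
        // 1. Draw the page content onto an in-memory bitmap (roughly A4 at 300 DPI).
        using (var bmp = new Bitmap(2480, 3508))
        using (var g = Graphics.FromImage(bmp))
        using (var font = new System.Drawing.Font("Arial", 24))
        using (var ms = new MemoryStream())
        {
            g.Clear(System.Drawing.Color.White);
            g.DrawString(pageText, font, Brushes.Black, new RectangleF(100, 100, 2280, 3308));
            bmp.Save(ms, ImageFormat.Png);

            // 2. Make that bitmap the only content of the PDF page.
            var doc = new Document(PageSize.A4, 0, 0, 0, 0);
            PdfWriter.GetInstance(doc, new FileStream(path, FileMode.Create));
            doc.Open();
            var img = iTextSharp.text.Image.GetInstance(ms.ToArray());
            img.ScaleToFit(PageSize.A4.Width, PageSize.A4.Height);
            doc.Add(img);
            doc.Close();
        }
    }
}

Note that iTextSharp itself cannot rasterize pages it has already laid out, so if you want to keep its table formatting and paging you would have to render the finished vector PDF to images with a separate tool and then rebuild the document from those images.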

Call this code on your PdfWriter object:
writer.SetEncryption(PdfWriter.STRENGTH40BITS, null, null, PdfWriter.AllowPrinting);
This won't prevent users from selecting text, but it will prevent them from copying and pasting it. Give it a try.
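If it helps, here is a minimal sketch showing where that call fits; the encryption has to be configured before the document is opened, and the file name and content below are placeholders:

using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

public static class EncryptedPdfSketch
{
    public static void Create()
    {
        Document doc = new Document();
        PdfWriter writer = PdfWriter.GetInstance(doc, new FileStream("specs.pdf", FileMode.Create));

        // Must be set before doc.Open(); everything except printing is disallowed.
        writer.SetEncryption(PdfWriter.STRENGTH40BITS, null, null, PdfWriter.AllowPrinting);

        doc.Open();
        doc.Add(new Paragraph("Product specifications..."));
        doc.Close();
    }
}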

Related

Converting Email to PDF

I have expended a good deal of effort trying to convert emails to PDF.
I am using Delphi 10.4 although that is not necessarily relevant to the question.
I came up with a solution that involves extraction of the body from the email in whatever format (HTML, RTF or TXT). I use INDY for this or Outlook if email is in MSG format.
I then save the body to file and open it using MS Word via automation. Then it should be a simple matter of saving the Word document in PDF format.
However, MS Word doesn't seem to read html files that well.
From the numerous samples of emails that I have tried, I have come across several issues which were complex to solve.
Examples:
HTML tables expanding beyond the document's page width. I solved this by working out the page width, setting the offending table's width as fixed at the page width, and finally resizing its columns proportionally to the new width.
That worked well until I tried to process an email containing HTML tables with differing numbers of columns/cells per row, which caused a crash. I solved that by handling the exception and iterating through each table row by row, working with its cells rather than its columns.
Images within table cells often overflow the cell and the page width. I solved this by iterating through all InlineShapes, checking whether each is inside a table and, if so, setting its width to the cell width.
There have been other issues, but I now have something that seems to work pretty well on a fairly disparate bunch of emails.
But I would think it incredibly likely that there will be new issues that will crop up from time to time and since this procedure is designed to deal unsupervised with batches of emails, this is a concern.
So my question is: does anyone know of a better way of dealing with this? For example, is there some simple way of getting Word to "nicely" format the HTML on loading, so that it displays and saves to PDF in a readable fashion, similar to how the same email looks when opened in Outlook?
Have you tried using the WordEditor property of the Outlook Inspector object? This returns the Microsoft Word Document Object Model of the message and you can export directly to PDF from that.
Here is a basic example...
Private Sub Demo()
    Dim MailItem As MailItem
    Dim FileName As String

    ' Destination for the exported PDF
    FileName = "C:\Users\Sam\Desktop\Email.pdf"

    ' Work with the currently selected message
    Set MailItem = ActiveExplorer.Selection.Item(1)

    With MailItem.GetInspector
        ' 17 = wdExportFormatPDF
        .WordEditor.ExportAsFixedFormat FileName, 17
        .Close 0    ' 0 = olSave
    End With

    MsgBox "Export complete"
End Sub

manipulating Microsoft Word DOCX files that have links and track changes using Python

I have been using the excellent python-docx package to read, modify, and write Microsoft Word files. The package supports extracting the text from each paragraph. It also allows accessing a paragraph a "run" at a time, where the run is a set of characters that have the same font information. Unfortunately, when you access a paragraph by runs, you lose the links, because the package does not support links. The package also does not support accessing change tracking information.
My problem is that I need to access change tracking information. Or, more specifically, I need to copy paragraphs that have change tracking indicated from one document to another.
I've tried doing this at the XML level. For example, this code snippet appends the contents of file1.docx to file2.docx:
from docx import Document
doc1 = Document("file1.docx")
doc2 = Document("file2.docx")
doc2.element.body.append(doc1.element.body)
doc2.save("file2-appended.docx")
For more complicated files, when I try to open the result on my Mac I get an error dialog, but if I click OK, the contents are there. The manipulation also works without problems for very simple files.
What am I missing?
The .element attribute is really an "internal" interface and should be named ._element. In most other places I have named it that. What you're getting there is the root element of the document part. You can see what it is by calling:
print(doc2.element.xml)
That element has one and only one w:body element below it, which is what you get with doc2.element.body (.xml works on that too, by the way, if you want to inspect that element).
What your code is doing is appending one w:body element at the end of another w:body element, thereby forming invalid XML. The WordprocessingML vocabulary is quite strict about which element can follow which, how many may appear, and so forth. The only surprise to me is that it apparently sometimes works for you :)
If you want to manipulate the XML directly, which is what the ._element attribute is there for, you need to do it carefully, in view of the (complex) WordprocessingML XML Schema.
Unlike when you stick to the published API, there's no safety net once ._element (or .element) appears in your code.
Inside the body XML can be relationships to external document parts, like images and hyperlinks. These will only be valid within the document in which they appear. This might explain why some files can be repaired.

PDF Writing programmatically using Java iText

Task: I am trying to edit a PDF form programmatically using Java.
Issue: I am currently using the iText libraries, but I am unable to get the position (x and y coordinates) of the text where I want to write in the PDF.
What I have already tried: I implemented a RenderListener, but TextRenderInfo.getText() gives me only half a word or a few characters.
What I want: I want to get the rendered text as properly formed words, as they appear in the PDF document.
I need an example of PDF text editing using iText for Java.
Thanks in advance.
What you're looking for is not easy to do (I would almost go so far as to say it is impossible).
According to the spec, PDF documents need only contain the instructions required to render the document in a viewer.
One of the symptoms of that (which you are seeing with TextRenderInfo) is that there is no real concept of structure.
For instance the paragraph "Lorem Ipsum Dolor Sit Amet" might get rendered as:
render "Ipsum Do"
render "Lorem " (space included)
render "lor"
render "Sit Amet" (leading space not included, but realized by shifting position)
TextRenderInfo is doing exactly what the name suggests, you are getting information about each text rendering instruction. The standard itself makes it impossible to give guarantees like "it will always contain a full word/phrase" or "instructions appear in logical reading order".
If you are doing this to get the location of some piece of text, simply collect all TextRenderInfo objects, and sort with a custom comparator that takes their position into account. Once sorted, you can loop over them, by doing so you have information about the text being rendered and the coordinates at which the text is being rendered. Look in SimpleTextExtractionStrategy to get an idea of how we do this.
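Here is a rough sketch of that approach. It is written with iTextSharp (the C# port used elsewhere on this page); the iText 5 Java API is essentially the same (RenderListener, TextRenderInfo, getBaseline()). The sort order shown is a naive top-to-bottom, left-to-right comparison and the file name is a placeholder:

using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Collects every text-rendering instruction together with its baseline start point,
// so the chunks can be sorted into reading order afterwards.
public class PositionedTextListener : IRenderListener
{
    public class Chunk
    {
        public string Text;
        public float X;
        public float Y;
    }

    public readonly List<Chunk> Chunks = new List<Chunk>();

    public void RenderText(TextRenderInfo renderInfo)
    {
        Vector start = renderInfo.GetBaseline().GetStartPoint();
        Chunks.Add(new Chunk { Text = renderInfo.GetText(), X = start[Vector.I1], Y = start[Vector.I2] });
    }

    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderImage(ImageRenderInfo renderInfo) { }
}

It could be used roughly like this:

PdfReader reader = new PdfReader("input.pdf");
var listener = new PositionedTextListener();
new PdfReaderContentParser(reader).ProcessContent(1, listener);
// Sort top-to-bottom (PDF y grows upward), then left-to-right.
listener.Chunks.Sort((a, b) => a.Y == b.Y ? a.X.CompareTo(b.X) : b.Y.CompareTo(a.Y));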
If you want to inject custom values into a PDF (you mentioned editing), then consider using forms (XFA or AcroForm). Forms essentially let you define placeholders that iText can later fill with content. Some form elements also allow dynamic resizing.
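If AcroForm fields are available, filling one is straightforward. A minimal sketch with iTextSharp follows (the Java PdfStamper/AcroFields API is analogous); the field name and file names are made up:

using System.IO;
using iTextSharp.text.pdf;

public static class FormFillSketch
{
    public static void Fill()
    {
        PdfReader reader = new PdfReader("template.pdf");
        using (FileStream output = new FileStream("filled.pdf", FileMode.Create))
        {
            PdfStamper stamper = new PdfStamper(reader, output);
            stamper.AcroFields.SetField("customerName", "ACME Corp.");
            stamper.FormFlattening = true; // optional: freeze the result so the fields are no longer editable
            stamper.Close();
        }
        reader.Close();
    }
}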
If you have access to the data before it gets turned into a PDF, it would of course be even better to manipulate the data at that point, since programmatically editing a PDF is by nature a hard problem to tackle.

How to read form fragment references in a pdf document?

I am working with adobe LiveCycle ES4 and I am attempting to make a custom LiveCycle component (in java) that counts references to a specified form fragment. However I am having some difficulty finding documentation on form fragments in pdf files. So my question is how would I go about reading form fragment references from a pdf document?
Also any documentation, API, or library that may help me with this task would be greatly appreciated.
--Form Fragments--
A form fragment is a collection of form objects (fields, buttons, shapes, tables, etc.), along with related style/formatting, that is saved as a separate .xdp file in a form fragment library (usually a directory on the LiveCycle server that holds the fragment files). References to form fragments can be inserted into a form when working in LiveCycle Designer. This is particularly useful for creating many similar forms (for example, a form fragment with fields for contact information). When a form fragment is edited, the changes are reflected across all forms that hold a reference to that fragment (when the PDF is opened and has access to the form fragment library).
The PDF specification currently is an official ISO standard - ISO 32000. You should be able to get the document from the ISO organisation or from your country's standards organisation.
However, before being an ISO standard, PDF was developed and maintained by Adobe and they still have the specification available on their site: http://www.adobe.com/devnet/pdf/pdf_reference.html
There are differences between this document and the ISO 32000 specification, but they are largely editorial, so for your purpose I'd look at the Adobe document.
Using the additional information in the comments, your test file, and a low-level PDF document browser (pdfToolbox in this case - note that I'm affiliated with this product), I found the following information:
In the "Catalog" object for your test PDF, you'll find a key named "AcroForm" that points to a dictionary with information about your form.
In that "AcroForm" dictionary you'll find a key called "XFA" that contains what appears to be almost all information about the XFA form that LifeCycle designer generated.
That "XFA" key points to an array which seems to consist of pairs of information. Element 0 is a string called "preamble", element 1 seems to be the data belonging to that string. So each pair of elements is a bit of information.
The information in that array consists of "preamble", "config", "template", "localeset", "xmpmeta" and "postamble". If you look at the element for "template" (the 6th element in the array if you count 1-based), you'll find the data you are looking for. The data is stored as a Flate-compressed stream that you'll have to decompress - then it's just XML data that should be fairly easy to parse. In it are three lines that should be of particular interest to you:
<subform x="6.35mm" y="6.35mm" name="TestFragment1"
<subform x="3.175mm" y="34.925mm" name="TestFragment2"
<subform x="0.125in" y="2.75in" name="TestFragment2"
I'm assuming the XFA specification pointed to by mkl contains more information about these things, but it seems like simply looking for "subform" elements in the XML should get you the references to the form fragments pretty easily.
In the XFA xml data in the sample PDF you provided there are processing instructions like this:
<?templateDesigner expand 1?><?designerFragmentSource CjxzdWJmb3JtIHVzZWhyZWY9Ii4uXC4uXC4uXEFkb2JlXEFkb2JlIExpdmVDeWNsZSBFUzRcZm9y
bV9mcmFnbWVudHNcVGVzdEZyYWdtZW50MS54ZHAjc29tKCR0ZW1wbGF0ZS5mb3JtMS5UZXN0RnJh
Z21lbnQxKSIgeD0iNi4zNW1tIiB5PSI2LjM1bW0iIHhtbG5zPSJodHRwOi8vd3d3LnhmYS5vcmcv
c2NoZW1hL3hmYS10ZW1wbGF0ZS8zLjMvIgo+PD90ZW1wbGF0ZURlc2lnbmVyIGV4cGFuZCAxPz48
L3N1YmZvcm0KPg==?>
Decoding the base64 encoded argument in there one gets:
<subform usehref="..\..\..\Adobe\Adobe LiveCycle ES4\form_fragments\TestFragment1.xdp#som($template.form1.TestFragment1)" x="6.35mm" y="6.35mm" xmlns="http://www.xfa.org/schema/xfa-template/3.3/"
><?templateDesigner expand 1?></subform
>
So it looks like your
custom LiveCycle component (in java) that counts references to a specified form fragment
should parse the XFA XML, look for these designerFragmentSource processing instructions, and analyze them.
Please be aware, though, that these processing instructions are most likely proprietary Adobe additions (at least I did not find them in the current XFA specification). Thus, as soon as some third-party tool touches the XFA XML, the PIs might no longer be accurate. You cannot even be sure what happens across different Adobe software versions.
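To make that concrete, here is a rough sketch of such a reference counter. It is written with iTextSharp (C#) to match the rest of this page, though the iText Java calls are analogous; the fragment file name is a placeholder, and it assumes the XFA entry is the array-of-packets form described above:

using System;
using System.IO;
using System.Text;
using System.Xml;
using iTextSharp.text.pdf;

public static class FragmentReferenceCounter
{
    // Counts designerFragmentSource processing instructions in the XFA "template"
    // packet whose decoded content mentions the given fragment file name.
    public static int Count(string pdfPath, string fragmentFileName)
    {
        int count = 0;
        PdfReader reader = new PdfReader(pdfPath);
        PdfDictionary acroForm = reader.Catalog.GetAsDict(PdfName.ACROFORM);
        PdfArray xfa = acroForm.GetAsArray(PdfName.XFA); // pairs: name string, stream

        for (int i = 0; i < xfa.Size - 1; i += 2)
        {
            PdfString name = xfa.GetAsString(i);
            if (name == null || !"template".Equals(name.ToString()))
                continue;

            // GetStreamBytes applies the FlateDecode filter for us.
            PRStream stream = (PRStream)xfa.GetDirectObject(i + 1);
            byte[] xmlBytes = PdfReader.GetStreamBytes(stream);

            var doc = new XmlDocument();
            using (var ms = new MemoryStream(xmlBytes))
            {
                doc.Load(ms);
            }

            foreach (XmlProcessingInstruction pi in
                     doc.SelectNodes("//processing-instruction('designerFragmentSource')"))
            {
                // The PI data is base64; decoded, it contains the subform's usehref.
                string decoded = Encoding.UTF8.GetString(Convert.FromBase64String(pi.Value));
                if (decoded.Contains(fragmentFileName))
                    count++;
            }
        }

        reader.Close();
        return count;
    }
}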

showing bullets, numbering and images in pdf

In an ASP.NET MVC 2 application I need to export a view to PDF, and I am using iTextSharp to generate it. The application has documents containing different instructions, each with an instruction title and several attached images. I need to export all the instructions under a document to PDF. The instruction title contains bullets and numbering and is stored in the database in HTML format. The images attached to each instruction are stored in the database as bytes. My PDF contains a table with each row representing an instruction. I am facing two basic issues in generating the PDF from this data:
When I try to show the title, it is shown with HTML tags. In the PDF I need to show the title with bullets and numbers (i.e. as entered through the editor).
I need to show the byte-stream images attached to each instruction in the PDF.
See the first three paragraphs of my post here for the answer to your first problem.
For your second question, iTextSharp (and more to the point, the PDF format) supports several image formats. For formats that aren't supported, you can use the .NET Framework to convert many of them to a usable format. (The easiest to work with is probably JPEG.)
See this post if you need to know how to get raw bytes from a database. Once you've got the bytes you can use iTextSharp.text.Image.GetInstance(), which has (as of 5.1.1) 15 overloads, including ones that take raw bytes, a System.Drawing.Image, and a System.IO.Stream.
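A short sketch of the image part only (the HTML title handling is covered by the linked post), assuming the helper below receives the title text and image bytes you pulled from the database; the names are made up:

using iTextSharp.text;
using iTextSharp.text.pdf;

public static class InstructionRows
{
    // Adds one row per instruction: the title in one cell, the stored image in the next.
    public static void AddInstructionRow(PdfPTable table, string titleText, byte[] imageBytes)
    {
        table.AddCell(new PdfPCell(new Phrase(titleText)));

        Image img = Image.GetInstance(imageBytes);   // accepts JPEG, PNG, GIF, BMP, ...
        table.AddCell(new PdfPCell(img, true));      // true = scale the image to fit the cell
    }
}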