PDF writing programmatically using Java iText

Task: I am trying to edit a PDF form programmatically using Java.
Issue: I am currently using the iText libraries, but I am unable to get the position (x and y coordinates) of the text where I want to write in the PDF.
What I have already tried: I implemented a RenderListener, but TextRenderInfo.getText() gives me half a word or single characters.
What I want: I want to get the rendered text as proper words, formatted as they are in the PDF document.
I need an example of PDF text editing using iText in Java.
Thanks in advance.

What you're looking for is not easy to do (I am hesitant to even say impossible).
According to the spec, pdf documents need only contain the instructions needed to render the document in a viewer.
One of the symptoms of that (which you are seeing with TextRenderInfo) is that there is no real concept of structure.
For instance the paragraph "Lorem Ipsum Dolor Sit Amet" might get rendered as:
render "Ipsum Do"
render "Lorem " (space included)
render "lor"
render "Sit Amet" (leading space not included, but realized by shifting position)
TextRenderInfo is doing exactly what the name suggests: you are getting information about each text rendering instruction. The standard itself makes it impossible to give guarantees like "it will always contain a full word/phrase" or "instructions appear in logical reading order".
If you are doing this to get the location of some piece of text, simply collect all TextRenderInfo objects and sort them with a custom comparator that takes their position into account. Once they are sorted, you can loop over them, which gives you both the text being rendered and the coordinates at which it is rendered. Look at LocationTextExtractionStrategy to get an idea of how we do this.
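To make the "collect and sort" idea concrete, here is a minimal sketch in plain Java. It does not use iText at all: Chunk and ChunkSorter are hypothetical names, and the x/y values are assumed to have been copied out of each TextRenderInfo beforehand (for example from its baseline start point). It also assumes that chunks on the same line share exactly the same y value, which real PDFs do not guarantee.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ChunkSorter {
    /** Hypothetical holder for one text rendering instruction: its text and start position. */
    static final class Chunk {
        final String text;
        final float x;
        final float y;

        Chunk(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Reading order: higher on the page first (larger y), then left to right (smaller x). */
    static final Comparator<Chunk> READING_ORDER =
            Comparator.comparingDouble((Chunk c) -> -c.y).thenComparingDouble(c -> c.x);

    /** Sorts a copy of the chunks into reading order and concatenates their text. */
    static String joinInReadingOrder(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(READING_ORDER);
        StringBuilder sb = new StringBuilder();
        for (Chunk c : sorted) {
            sb.append(c.text);
        }
        return sb.toString();
    }
}
```

Fed the four instructions from the example above (with made-up coordinates), this puts "Lorem " first again, but the space before "Sit Amet" is still missing because it only exists as a jump in position. Recovering it requires looking at the gaps between chunks as well.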
If you want to inject custom values into a PDF (you mentioned editing), then consider using forms (XFA or AcroForm). Forms allow you to basically define placeholders that iText can later fill with content. Some form elements will allow dynamic resizing.
If you have access to the data before it gets turned into a PDF, it would of course be even better to manipulate the data at that point, since editing a PDF programmatically is by nature a hard problem to tackle.

Related

Converting Email to PDF

I have expended a good deal of effort trying to convert emails to PDF.
I am using Delphi 10.4 although that is not necessarily relevant to the question.
I came up with a solution that involves extraction of the body from the email in whatever format (HTML, RTF or TXT). I use INDY for this or Outlook if email is in MSG format.
I then save the body to file and open it using MS Word via automation. Then it should be a simple matter of saving the Word document in PDF format.
However, MS Word doesn't seem to read html files that well.
From the numerous samples of emails that I have tried, I have come across several issues which were complex to solve.
Examples:
HTML tables expanding beyond the document's page width. I solved this by working out what the page width is, setting the offending table's width as fixed and equal to the page width, and finally resizing its columns proportionately to the new width.
That worked well until I tried to process an email with html tables with differing numbers of columns/cells per row. That causes a crash. I solved that by handling the exception and iterating through each table by row and working with its cells rather than columns.
Images within table cells often overlap the cell and the page width. Solved by iterating through all InlineShapes, checking whether they are within a table and, if so setting their width to the cell width.
There have been other issues, but I now have something that seems to work pretty well on a fairly disparate bunch of emails.
But I would think it incredibly likely that there will be new issues that will crop up from time to time and since this procedure is designed to deal unsupervised with batches of emails, this is a concern.
So my question is: does anyone know of a better way of dealing with this? For example, is there some simple way of getting Word to "nicely" format the HTML on loading, so that it displays and saves to PDF in a readable fashion, similar to how it looks when you open the same email in Outlook?
Have you tried using the WordEditor property of the Outlook Inspector object? This returns the Microsoft Word Document Object Model of the message and you can export directly to PDF from that.
Here is a basic example...
Private Sub Demo()
    Dim MailItem As MailItem
    Dim FileName As String
    FileName = "C:\Users\Sam\Desktop\Email.pdf"
    ' Use the first item selected in the active Outlook explorer window
    Set MailItem = ActiveExplorer.Selection.Item(1)
    With MailItem.GetInspector
        ' 17 = wdExportFormatPDF
        .WordEditor.ExportAsFixedFormat FileName, 17
        ' 0 = olSave
        .Close 0
    End With
    MsgBox "Export complete"
End Sub

iTextSharp extracts wrapped cell contents into new lines - how do you identify to which column a given wrapped piece of data belongs now?

I am using iTextSharp to extract data from pdfs.
I stumbled across the following problem, depicted by the scenario below:
I created a sample excel file to illustrate. Here is what it looks like:
I convert it to a pdf, using one of the many free online converters available out there, which generates a pdf looking like (when I generated the pdf I did not apply the styling to the excel):
Now, using iTextSharp to extract the data from the pdf, returns me the following string as the data extracted:
As you can see, wrapped cell data generate new lines, where each wrapped piece of data separated by a single white space.
The problem: how does one now identify which column a given piece of wrapped data belongs to? If only iTextSharp preserved as many white spaces as columns...
In my example, how can I identify which column 111 belongs to?
Update 1:
A similar problem occurs whenever a field has more than one word (i.e., contains white spaces). For example, considering the 1st line of the sample above:
say it looked like
---A--- ---B--- ---C--- ---D---
aaaaaaa bb b cccc
iText again would generate the extraction for this one as:
aaaaaaa bb b cccc
Same problem here, in having to determine the borders of each column.
Update 2:
A sample of the real pdf file I am working with:
This is how the pdf data looks like.
In addition to Chris' generic answer, some background in iText(Sharp) content parsing...
iText(Sharp) provides a framework for content extraction in the namespace iTextSharp.text.pdf.parser / package com.itextpdf.text.pdf.parser. This framework reads the page content, keeps track of the current graphics state, and forwards information on pieces of content to the IExtRenderListener or IRenderListener / ExtRenderListener or RenderListener the user (i.e. you) provides. In particular it does not interpret structure into this information.
This render listener may be a text extraction strategy (ITextExtractionStrategy / TextExtractionStrategy), i.e. a special render listener which is predominantly designed to extract a pure text stream without formatting or layout information. And for this special case iText(Sharp) additionally provides two sample implementations, the SimpleTextExtractionStrategy and the LocationTextExtractionStrategy.
For your task you need a more sophisticated render listener which either
exports the text with coordinates (Chris in one of his answers has provided an extended LocationTextExtractionStrategy which can additionally provide positions and bounding boxes of text chunks) allowing you in additional code to analyse tabular structures; or
does the analysis of tabular data itself.
I do not have an example for the latter variant because generically recognizing and parsing tables is a whole project in itself. You might want to look into the Tabula project for inspiration; this project is surprisingly good at the task of table extraction.
PS: If you feel more at home with trying to extract structured content from a pure string representation of the content which nonetheless tries to reflect the original layout, you might try something like what is proposed in this answer, a variant of the LocationTextExtractionStrategy working similar to the pdftotext -layout tool; only the changes to be applied to the LocationTextExtractionStrategy are shown there.
PPS: Extraction of data from very specific PDF tables may be much easier; for example have a look at this answer which demonstrates that after some PDF analysis the specific way a given table is created might give rise to a simple custom render listener for extracting the table data. This can make sense for a single PDF with a table spanning many many pages like in the case of that answer, or it can make sense if you have many PDFs identically created by the same software.
This is why I asked for a representative sample file in a comment to your question
Concerning your comments
Still with the pdf example above, both with an implementation from scratch of ITextExtractionStrategy and with extending LocationExtractionStrategy, I see that each RenderText is called at the following chunks: Fi, el, d, A, Fi, el, d... and so on. Can this be changed?
The chunks of text you get as separate RenderText calls are not separated by accident or some random decision of iText. They are the very strings drawn separately in the page content!
In your sample "Fi", "el", "d", and "A" come in different RenderText calls because the content stream contains operations in which first "Fi" is drawn, then "el", then "d", then "A".
This may sound weird at first. A common cause for such torn up words is that PDF does not use the kerning information from fonts; to apply kerning, therefore, the PDF generating software has to insert tiny forward or backward jumps between characters which should be farther from or nearer to each other than without kerning. Thus, words often are torn apart between kerning pairs.
So this cannot be changed, you will get those pieces, and it is the job of the text extraction strategy to put them together.
By the way, there are worse PDFs: some PDF generators position each and every glyph separately, in particular generators which are primarily meant for building GUIs but can, as a feature, automatically export GUI canvases as PDFs.
I would expect that in entering the realm of "adding my own implementation" I would have control over how to determine what is a "chunk" of text.
You can... well, you have to decide which of the incoming pieces belong together and which don't. E.g. do glyphs with the same y coordinate form a single line? Or do they form separate lines in different columns which just happen to be located next to each other?
So yes, you decide which glyphs you interpret as a single word or as content of a single table cell, but your input consists of the groups of glyphs used in the actual PDF content stream.
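As a rough illustration of that decision, the following plain Java sketch glues the pieces of one line back together and only inserts a space where the horizontal gap between two pieces exceeds a threshold. Piece, LineAssembler and gapThreshold are hypothetical names, not iText API. The sketch assumes you have already recorded the start and end x coordinate of every RenderText piece, grouped the pieces by line, and sorted them from left to right.

```java
import java.util.List;

class LineAssembler {
    /** Hypothetical holder for one drawn piece of text and its horizontal extent. */
    static final class Piece {
        final String text;
        final float startX;
        final float endX;

        Piece(String text, float startX, float endX) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
        }
    }

    /**
     * Concatenates pieces of a single line (already sorted by startX).
     * A space is inserted only when the gap to the previous piece is larger
     * than gapThreshold; smaller gaps are treated as kerning adjustments.
     */
    static String assemble(List<Piece> pieces, float gapThreshold) {
        StringBuilder sb = new StringBuilder();
        Piece previous = null;
        for (Piece piece : pieces) {
            if (previous != null && piece.startX - previous.endX > gapThreshold) {
                sb.append(' ');
            }
            sb.append(piece.text);
            previous = piece;
        }
        return sb.toString();
    }
}
```

A suitable threshold depends on the font size; a fraction of the width of a space character in the current font is a common starting point, but that choice is yours and is not something the PDF dictates.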
Not only that, in none of the interface's methods I can "spot" how/where it deals with non-text data/images - so I could intercede with the spacing issue (RenderImage is not called)
RenderImage will be called for embedded bitmap images, JPEGs etc. If you want to be informed about vector graphics, your strategy will also have to implement IExtRenderListener which provides methods ModifyPath, RenderPath and ClipPath.
This isn't really an answer but I needed a spot to show some things that might help you understand things.
First "conversion" from Excel, Word, PowerPoint, HTML or whatever to PDF is almost always going to be a destructive change. The destructive part is very important and it happens because you are taking data from a program that has very specific knowledge of what that data represents (Excel) and you are turning it into drawing commands in a very generic universal format (PDF) that only cares about what the data looks like, not the data itself. Unless the data is "tagged" (and it almost never is these days still) then there is no context for the drawing commands. There are no paragraphs, there are no sentences, there are no columns, rows, tables, etc. There's literally just draw this letter at x,y and draw this word at a,b.
Second, imagine your Excel file had the following data and for some reason the last column was narrower than the others when the PDF was made:
Column A | Column B | Column
C
Data #1 Data #2 Data
#3
You and I have context so we know that the second and fourth lines are really just the continuation of the first and third lines. But since iText doesn't have any context during extraction it doesn't think like that and it sees four lines of text. In fact, since it doesn't have context it doesn't even see columns, just the lines themselves.
Third, although a very small thing you need to understand that you don't draw spaces in PDF. Imagine the three column table below:
Column A | Column B | Column C
                      Yes
If you extracted that from a PDF you'd get this data:
Column A | Column B | Column C
Yes
Inside the PDF the word "Yes" will be just drawn at a certain x coordinate that you and I consider to be under the third column and it won't have a bunch of spaces in front of it.
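Since the x coordinate is the only thing that ties "Yes" to the third column, one way to recover the column is to compare that coordinate against known column boundaries. The plain Java sketch below assumes you already know the left edge of every column (for instance from the x positions of the header texts, which is an assumption about your document). ColumnLocator and columnIndex are hypothetical names.

```java
class ColumnLocator {
    /**
     * Returns the index of the column whose left edge is the last one at or
     * before x, or -1 if x lies to the left of the first column.
     * columnStarts must be sorted in ascending order.
     */
    static int columnIndex(float[] columnStarts, float x) {
        int index = -1;
        for (int i = 0; i < columnStarts.length; i++) {
            if (x >= columnStarts[i]) {
                index = i;
            } else {
                break;
            }
        }
        return index;
    }
}
```

With column starts at 0, 100 and 200, a piece of text drawn at x = 215 lands in column index 2, i.e. the third column, no matter how many spaces the extracted string contains in front of it.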
As I said at the beginning, this isn't much of an answer but hopefully it will explain to you the problem that you are trying to solve. If your PDF is tagged then it will have context and you can use that context during extraction. Context isn't universal, however, so there usually isn't just a magic "insert context" checkbox. Excel actually does have a checkbox (if I remember correctly) to make a tagged PDF during export and it ultimately creates a tagged PDF using HTML-like tags for tables. Very primitive, but it works. However, it will be up to you to parse this context.
Leaving here an alternative strategy for extracting the data. It does not solve the problem of how spaces are treated (or can be treated), but it gives you somewhat more control over the extraction by letting you specify the geometric areas you want to extract text from. Taken from here.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Note: despite the parameter names, PDF coordinates are in user space units, not pixels.
public static System.util.RectangleJ GetRectangle(float distanceInPixelsFromLeft, float distanceInPixelsFromBottom, float width, float height)
{
    return new System.util.RectangleJ(
        distanceInPixelsFromLeft,
        distanceInPixelsFromBottom,
        width,
        height);
}

public static void Strategy2()
{
    // In this example, I'll declare a pageNumber integer variable to
    // only capture text from the page I'm interested in
    int pageNumber = 1;
    var text = new StringBuilder();
    List<Tuple<string, int>> result = new List<Tuple<string, int>>();
    // The PdfReader object implements IDisposable.Dispose, so you can
    // wrap it in the using keyword to automatically dispose of it
    using (var pdfReader = new PdfReader("D:/Example.pdf"))
    {
        float distanceInPixelsFromLeft = 20;
        //float distanceInPixelsFromBottom = 730;
        float width = 300;
        float height = 10;
        // Walk down the page in horizontal bands, each 10 units high
        for (int i = 800; i >= 0; i -= 10)
        {
            var rect = GetRectangle(distanceInPixelsFromLeft, i, width, height);
            var filters = new RenderFilter[1];
            filters[0] = new RegionTextRenderFilter(rect);
            ITextExtractionStrategy strategy =
                new FilteredTextRenderListener(
                    new LocationTextExtractionStrategy(),
                    filters);
            var currentText = PdfTextExtractor.GetTextFromPage(
                pdfReader,
                pageNumber,
                strategy);
            currentText =
                Encoding.UTF8.GetString(Encoding.Convert(
                    Encoding.Default,
                    Encoding.UTF8,
                    Encoding.Default.GetBytes(currentText)));
            //text.Append(currentText);
            result.Add(new Tuple<string, int>(currentText, currentText.Length));
        }
    }
    // You'll do something else with it, here I write it to a console window
    //Console.WriteLine(text.ToString());
    foreach (var line in result.Distinct().Where(r => !string.IsNullOrWhiteSpace(r.Item1)))
    {
        Console.WriteLine("Text: [{0}], Length: {1}", line.Item1, line.Item2);
    }
    //Console.WriteLine("", string.Join("\r\n", result.Distinct().Where(r => !string.IsNullOrWhiteSpace(r.Item1))));
}
Outputs:
PS.: We are still left with the problem of how to deal with spaces/non text data.

How to create a block letters form input in libreoffice writer

I would like to create a document that includes an input form.
The printed version of the form should have little boxes for block letter input ("monospace font") like this:
The form will be printed and filled out manually using pens (but it would be good if the form could also be easily filled out digitally as a PDF form).
Is there a convenient way to do this, other than creating separate input boxes, tables or other quick fixes, that does not make it inconvenient to fill in the form digitally?
One way could be to use a background image with the required block pattern.
If you only want it printable - create a document and set the image as background.
If you want a computer fillable form for a SEPA banking transaction form - do a search, as there are free PDF forms available.

OPEN XML add custom not visible data to paragraph/table

Is there a way to store additional data for a paragraph that would be persisted after the user opens and saves the document in MS Word?
I've been using CustomXML for this, but it turns out that this is no longer possible because MS Word strips all CustomXML elements from the document structure.
Every single paragraph or a table has an ID that I would like to "pair back" to my data-source.
So later when I read the docx again I can identify origins of every unchanged paragraph/table in the document.
A possibility would be to insert a SET field. This creates a bookmark in the document to which you can assign information. There's no way to protect it from the user removing it, however. A DATA field might also be a possibility.
Unlike "vanish" (which I believe is equivalent to "hidden" font format) the information would not display if the user is in the habit of displaying non-printing information. It will display, however, if the user toggles on field codes (Alt+F9).
You can have a divId on a paragraph, and in xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" there are attributes w14:textId and w14:paraId.
For example:
<w:p w14:textId="81a184ad" w14:paraId="81a184ad" >
<w:pPr>
<w:divId w:val="124349312"/>
See [MS-DOCX] for details.
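If you go the w14:paraId route, reading the IDs back is a matter of ordinary namespace-aware XML parsing. The sketch below uses only the JDK's DOM parser and assumes you have already unzipped the docx and loaded word/document.xml into a string. ParaIdReader and paraIds are hypothetical names, and the sketch does not check whether Word keeps these IDs stable across a save.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class ParaIdReader {
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static final String W14_NS = "http://schemas.microsoft.com/office/word/2010/wordml";

    /** Returns the w14:paraId of every w:p element that has one, in document order. */
    static List<String> paraIds(String documentXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for the *NS lookups below
            Document doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(documentXml.getBytes(StandardCharsets.UTF_8)));
            NodeList paragraphs = doc.getElementsByTagNameNS(W_NS, "p");
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < paragraphs.getLength(); i++) {
                Element paragraph = (Element) paragraphs.item(i);
                // getAttributeNS returns an empty string when the attribute is absent
                String id = paragraph.getAttributeNS(W14_NS, "paraId");
                if (!id.isEmpty()) {
                    ids.add(id);
                }
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document XML", e);
        }
    }
}
```

The returned IDs can then be matched against whatever mapping you stored on your side when the document was generated.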
Alternatively, have a look at content controls, which you can wrap around paragraphs and tables (or put inside them). These have an ID property; they also let you store arbitrary text in their tag property. The string is limited in length to something like 120 chars.
A rather noddy solution, but have you considered using a custom run for your data and hiding it from display using Vanish?
<w:rPr>
<w:vanish />
</w:rPr>
Adding vanish to the run properties will hide the run from display, and you might use this to store custom data without affecting the output of the document.

Making a PDF output in raster format instead of vector using itextsharp

I have written C# code to save product specifications to a PDF document using iTextSharp, mainly with PdfPTable and Chunks/Paragraphs in the PdfPCells. However, I have been told that the output is unacceptable because you can highlight and copy the text from the document, and the document storage and retrieval server software that they are currently using does not support "Vector" based PDFs. I'm not exactly certain what the difference is between a raster PDF and a vector PDF. Basically, every page of the PDF document should be an image so that the text cannot be highlighted. Is there any way to do this without using the DirectContent? Below is an image illustrating a portion of the PDF that was created, and how the text can be selected and copied, which is the incorrect functionality.
I would like to avoid directly writing to the canvas, unless there is a way to do this and still have itextsharp handle my formatting and proper paging.
The windows application PDF2R works well, but doesn't seem to offer any programmatic solutions. I have found libraries that stated that they do this sort of conversion, but are several thousand dollars. I'd like to work within my budget and use the itextsharp or something much cheaper than this.
I would suggest you try to generate an image using the System.Drawing class and then insert that into the PDF document.
Call this code on your PdfWriter object:
writer.SetEncryption(PdfWriter.STRENGTH40BITS, null, null, PdfWriter.AllowPrinting);
This won't prevent users from selecting text, but it will prevent them from copying and pasting it. Give it a try.