iTextSharp - Continuing ordered list on second page with a number other than '1' - itext

I am fairly new to iTextSharp. I create PDFs by adding variable data (text/barcodes/images) to existing PDF documents/templates (think boiler plate). Most commonly, I have to place various sections of text in specific places. I know how to create an ordered list, but I have come across a situation where the list begins with #1 on the first page and then #2-4 on the top of the second page. I use two different templates for p1 and p2.
I am currently creating the document by creating ColumnTexts, placing SimpleColumns with specific coordinates, and then placing phrases inside. I am not sure whether this is the best way, so I am open to alternative solutions.
I have checked out several places including http://www.mikesdotnetting.com/article/83/lists-with-itextsharp but I see nothing that describes how to start a list at a number other than '1'. None of the 6 overloads provide a parameter for starting number.
Thanks!

There are two answers to your question. The first one is to point you to the official documentation. There is a method setFirst() that (I quote) sets the number that has to come first in the list.
You are using the C# port of iText, so if you want the list to start counting at 10, you need to do something like:
list.First = 10;
The second answer takes more time, but it is probably the better one. You don't need two List objects, one for the first page and one for the second page. It's better to add the List to a ColumnText object and then distribute the column over two pages.
Take a look at the ListInColumn example. It takes an existing PDF (with the text "Hello World Hello People") and it adds a list using ColumnText: list_in_column.pdf
This is how it's done:
PdfReader reader = new PdfReader(src);
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
List list = new List(List.ORDERED);
for (int i = 0; i < 10; i++) {
    list.add("...");
}
ColumnText ct = new ColumnText(stamper.getOverContent(1));
ct.addElement(list);
Rectangle rect = new Rectangle(250, 400, 500, 806);
ct.setSimpleColumn(rect);
int status = ct.go();
if (ColumnText.hasMoreText(status)) {
    ct.setCanvas(stamper.getOverContent(2));
    ct.setSimpleColumn(rect);
    ct.go();
}
stamper.close();
To add the content on the first page, I use:
ColumnText ct = new ColumnText(stamper.getOverContent(1));
You are probably using similar code.
The content is added using the line:
int status = ct.go();
If not all the content was added, I change the canvas to add the rest of the content on the second page:
ct.setCanvas(stamper.getOverContent(2));
The rest of the code is pretty standard.
I think the setCanvas() method is the missing piece in your puzzle, although in your case, you'll need:
ct.Canvas = stamper.GetOverContent(2);

Related

iTextSharp extracts wrapped cell contents into new lines - how do you identify to which column a given wrapped piece of data belongs now?

I am using iTextSharp to extract data from pdfs.
I stumbled across the following problem, depicted by the scenario below:
I created a sample excel file to illustrate. Here is what it looks like:
I convert it to a pdf, using one of the many free online converters available out there, which generates a pdf looking like (when I generated the pdf I did not apply the styling to the excel):
Now, using iTextSharp to extract the data from the pdf, returns me the following string as the data extracted:
As you can see, wrapped cell data generate new lines, where each wrapped piece of data separated by a single white space.
The problem: how does one identify, now, to which column a given piece of wrapped data belongs to ? If only iTextSharp preserved as many white spaces as columns...
In my example - how can I identify to which column does 111 belong ?
Update 1:
A similar problem occurs whenever a field has more than one word (i.e., contains white spaces). For example, considering the 1st line of the sample above:
say it looked like
---A--- ---B--- ---C--- ---D---
aaaaaaa bb b cccc
iText again would generate the extraction for this one as:
aaaaaaa bb b cccc
Same problem here, in having to determine the borders of each column.
Update 2:
A sample of the real pdf file I am working with:
This is how the pdf data looks like.
In addition to Chris' generic answer, some background in iText(Sharp) content parsing...
iText(Sharp) provides a framework for content extraction in the namespace iTextSharp.text.pdf.parser / package com.itextpdf.text.pdf.parser. This framework reads the page content, keeps track of the current graphics state, and forwards information on pieces of content to the IExtRenderListener or IRenderListener / ExtRenderListener or RenderListener the user (i.e. you) provides. In particular it does not interpret structure into this information.
This render listener may be a text extraction strategy (ITextExtractionStrategy / TextExtractionStrategy), i.e. a special render listener which is predominantly designed to extract a pure text stream without formatting or layout information. And for this special case iText(Sharp) additionally provides two sample implementations, the SimpleTextExtractionStrategy and the LocationTextExtractionStrategy.
For your task you need a more sophisticated render listener which either
exports the text with coordinates (Chris in one of his answers has provided an extended LocationTextExtractionStrategy which can additionally provide positions and bounding boxes of text chunks) allowing you in additional code to analyse tabular structures; or
does the analysis of tabular data itself.
I do not have an example for the latter variant because generically recognizing and parsing tables is a whole project in itself. You might want to look into the Tabula project for inspiration; this project is surprisingly good at the task of table extraction.
PS: If you feel more at home with trying to extract structured content from a pure string representation of the content which nonetheless tries to reflect the original layout, you might try something like what is proposed in this answer, a variant of the LocationTextExtractionStrategy working similar to the pdftotext -layout tool; only the changes to be applied to the LocationTextExtractionStrategy are shown there.
PPS: Extraction of data from very specific PDF tables may be much easier; for example have a look at this answer which demonstrates that after some PDF analysis the specific way a given table is created might give rise to a simple custom render listener for extracting the table data. This can make sense for a single PDF with a table spanning many many pages like in the case of that answer, or it can make sense if you have many PDFs identically created by the same software.
This is why I asked for a representative sample file in a comment to your question.
Concerning your comments
Still with the pdf example above, both with an implementation from scratch of ITextExtractionStrategy and with extending LocationExtractionStrategy, I see that each RenderText is called at the following chunks: Fi, el, d, A, Fi, el, d... and so on. Can this be changed?
The chunks of text you get as separate RenderText calls are not separated by accident or some random decision of iText. They are the very strings drawn separately in the page content!
In your sample "Fi", "el", "d", and "A" come in different RenderText calls because the content stream contains operations in which first "Fi" is drawn, then "el", then "d", then "A".
This may sound weird at first. A common cause for such torn up words is that PDF does not use the kerning information from fonts; to apply kerning, therefore, the PDF generating software has to insert tiny forward or backward jumps between characters which should be farther from or nearer to each other than without kerning. Thus, words often are torn apart between kerning pairs.
So this cannot be changed, you will get those pieces, and it is the job of the text extraction strategy to put them together.
By the way, there are worse PDFs: some PDF generators position each and every glyph separately, most notably generators that primarily build GUIs but can, as a feature, automatically export GUI canvases as PDFs.
I would expect that in entering the realm of "adding my own implementation" I would have control over how to determine what is a "chunk" of text.
You can... well, you have to decide which of the incoming pieces belong together and which don't. E.g. do glyphs with the same y coordinate form a single line? Or do they form separate lines in different columns which just happen to be located next to each other?
So yes, you decide which glyphs you interpret as a single word or as content of a single table cell, but your input consists of the groups of glyphs used in the actual PDF content stream.
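The grouping decision can be sketched without any iText types. This is only a minimal sketch under stated assumptions: the Fragment class, the mergeLine method, and the two gap thresholds are hypothetical names and values, not part of iText(Sharp). It also assumes the fragments were already grouped by baseline, and that their start and end x coordinates were collected in your render listener from the TextRenderInfo passed to RenderText.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class FragmentMerger {

    /** Hypothetical holder for one separately drawn piece of text. */
    public static final class Fragment {
        final String text;
        final float startX; // x coordinate where the fragment begins
        final float endX;   // x coordinate where the fragment ends

        public Fragment(String text, float startX, float endX) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
        }
    }

    /**
     * Joins fragments that lie on the same baseline, left to right.
     * A gap up to wordGap glues two fragments together (e.g. kerning jumps),
     * a gap above wordGap becomes a space, and a gap above cellGap becomes
     * a tab, which marks a likely cell boundary.
     */
    public static String mergeLine(List<Fragment> fragments, float wordGap, float cellGap) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble(f -> f.startX));
        StringBuilder line = new StringBuilder();
        Fragment previous = null;
        for (Fragment current : sorted) {
            if (previous != null) {
                float gap = current.startX - previous.endX;
                if (gap > cellGap) {
                    line.append('\t');
                } else if (gap > wordGap) {
                    line.append(' ');
                }
            }
            line.append(current.text);
            previous = current;
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // Made-up coordinates for the "Fi", "el", "d", "A" pieces from the question.
        List<Fragment> fragments = Arrays.asList(
                new Fragment("Fi", 10f, 18f),
                new Fragment("el", 18.2f, 26f),
                new Fragment("d", 26.1f, 30f),
                new Fragment("A", 33f, 39f),
                new Fragment("Value", 120f, 150f));
        // Prints "Field A", a tab, then "Value".
        System.out.println(mergeLine(fragments, 1f, 20f));
    }
}
```

Suitable thresholds depend on the font size and on how the PDF was generated, so they usually have to be tuned per document type.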
Not only that, in none of the interface's methods I can "spot" how/where it deals with non-text data/images - so I could intercede with the spacing issue (RenderImage is not called)
RenderImage will be called for embedded bitmap images, JPEGs etc. If you want to be informed about vector graphics, your strategy will also have to implement IExtRenderListener which provides methods ModifyPath, RenderPath and ClipPath.
This isn't really an answer but I needed a spot to show some things that might help you understand things.
First, "conversion" from Excel, Word, PowerPoint, HTML or whatever to PDF is almost always going to be a destructive change. The destructive part is very important and it happens because you are taking data from a program that has very specific knowledge of what that data represents (Excel) and you are turning it into drawing commands in a very generic universal format (PDF) that only cares about what the data looks like, not the data itself. Unless the data is "tagged" (and even these days it almost never is) then there is no context for the drawing commands. There are no paragraphs, there are no sentences, there are no columns, rows, tables, etc. There's literally just draw this letter at x,y and draw this word at a,b.
Second, imagine your Excel file had the following data and for some reason that last column was narrower than the others when the PDF was made:
Column A | Column B | Column
C
Data #1 Data #2 Data
#3
You and I have context so we know that the second and fourth lines are really just the continuation of the first and third lines. But since iText doesn't have any context during extraction it doesn't think like that and it sees four lines of text. In fact, since it doesn't have context it doesn't even see columns, just the lines themselves.
Third, although a very small thing you need to understand that you don't draw spaces in PDF. Imagine the three column table below:
Column A | Column B | Column C
                      Yes
If you extracted that from a PDF you'd get this data:
Column A | Column B | Column C
Yes
Inside the PDF the word "Yes" will be just drawn at a certain x coordinate that you and I consider to be under the third column and it won't have a bunch of spaces in front of it.
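Since only the x coordinate survives, one way to recover the column is to compare it against the x positions at which the header cells start. The following is a rough sketch, not iText API: the ColumnLocator class, the columnIndex method, and all coordinates are hypothetical, and it assumes left-aligned cells and header positions sorted in ascending order.

```java
public class ColumnLocator {

    /**
     * Returns the index of the last column whose header starts at or to the
     * left of x, or -1 if x lies to the left of the first column.
     * headerStartXs must be sorted in ascending order.
     */
    public static int columnIndex(float[] headerStartXs, float x) {
        int index = -1;
        for (int i = 0; i < headerStartXs.length; i++) {
            if (headerStartXs[i] <= x) {
                index = i;
            }
        }
        return index;
    }

    public static void main(String[] args) {
        String[] headers = {"Column A", "Column B", "Column C"};
        // Made-up x positions at which the three header texts were drawn.
        float[] headerStartXs = {50f, 150f, 250f};
        // Made-up x position at which "Yes" was drawn.
        float yesX = 252f;
        int column = columnIndex(headerStartXs, yesX);
        // prints: "Yes" belongs to Column C
        System.out.println("\"Yes\" belongs to " + headers[column]);
    }
}
```

Right-aligned or centered cells would need a comparison against column ranges instead of header start positions.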
As I said at the beginning, this isn't much of an answer but hopefully it will explain to you the problem that you are trying to solve. If your PDF is tagged then it will have context and you can use that context during extraction. Context isn't universal, however, so there usually isn't just a magic "insert context" checkbox. Excel actually does have a checkbox (if I remember correctly) to make a tagged PDF during export and it ultimately creates a tagged PDF using HTML-like tags for tables. Very primitive, but it works. However, it will be up to you to parse this context.
Leaving here an alternative strategy for extracting the data. It does not solve the problem of how spaces are treated/can be treated, but it gives you somewhat more control over the extraction by letting you specify the geometric areas you want to extract text from. Taken from here.
// Namespaces used: System, System.Collections.Generic, System.Linq, System.Text,
// iTextSharp.text.pdf, iTextSharp.text.pdf.parser
public static System.util.RectangleJ GetRectangle(float distanceInPixelsFromLeft, float distanceInPixelsFromBottom, float width, float height)
{
    return new System.util.RectangleJ(
        distanceInPixelsFromLeft,
        distanceInPixelsFromBottom,
        width,
        height);
}

public static void Strategy2()
{
    // In this example, I'll declare a pageNumber integer variable to
    // only capture text from the page I'm interested in
    int pageNumber = 1;
    var text = new StringBuilder();
    List<Tuple<string, int>> result = new List<Tuple<string, int>>();
    // The PdfReader object implements IDisposable.Dispose, so you can
    // wrap it in the using keyword to automatically dispose of it
    using (var pdfReader = new PdfReader("D:/Example.pdf"))
    {
        float distanceInPixelsFromLeft = 20;
        //float distanceInPixelsFromBottom = 730;
        float width = 300;
        float height = 10;
        // Walk down the page in horizontal bands that are 10 units high
        for (int i = 800; i >= 0; i -= 10)
        {
            var rect = GetRectangle(distanceInPixelsFromLeft, i, width, height);
            var filters = new RenderFilter[1];
            filters[0] = new RegionTextRenderFilter(rect);
            ITextExtractionStrategy strategy =
                new FilteredTextRenderListener(
                    new LocationTextExtractionStrategy(),
                    filters);
            var currentText = PdfTextExtractor.GetTextFromPage(
                pdfReader,
                pageNumber,
                strategy);
            currentText =
                Encoding.UTF8.GetString(Encoding.Convert(
                    Encoding.Default,
                    Encoding.UTF8,
                    Encoding.Default.GetBytes(currentText)));
            //text.Append(currentText);
            result.Add(new Tuple<string, int>(currentText, currentText.Length));
        }
    }
    // You'll do something else with it, here I write it to a console window
    //Console.WriteLine(text.ToString());
    foreach (var line in result.Distinct().Where(r => !string.IsNullOrWhiteSpace(r.Item1)))
    {
        Console.WriteLine("Text: [{0}], Length: {1}", line.Item1, line.Item2);
    }
    //Console.WriteLine("", string.Join("\r\n", result.Distinct().Where(r => !string.IsNullOrWhiteSpace(r.Item1))));
}
Outputs:
PS.: We are still left with the problem of how to deal with spaces/non text data.

Is it possible to include page numbers with onEndPage events?

I am working through some more examples from the new itextpdf website...(good job by the way) I normally add page numbers in headers and footers as a second pass over the document once it is completed.
Is there a way to add the page number dynamically in a header or footer as an onendpage event?
Clearly this could be done by using a counter with document.addPage(), but text may create new pages all by itself when given a large text block so this would then not work.
Thank you very much for your comment on the new web site!
You can indeed get the current page number in the onEndPage() method and add it to the document. Please take a look at the MovieHistory2 example, or better yet: the MovieCountries1 example.
Allow me to simplify the onEndPage() method of these examples:
public void onEndPage(PdfWriter writer, Document document) {
    Rectangle rect = writer.getPageSize();
    // Centered horizontally, 18 points above the bottom edge of the page
    ColumnText.showTextAligned(writer.getDirectContent(),
        Element.ALIGN_CENTER, new Phrase(
            String.format("page %d", writer.getPageNumber())),
        (rect.getLeft() + rect.getRight()) / 2, rect.getBottom() + 18, 0);
}
In this snippet, writer.getPageNumber() will give you the current page number. I add it to the page at the bottom-middle.

How to set initial view properties?

I want to set the properties of an already existing PDF document that appear under the Initial View tab in Acrobat.
Document Options:
Show = Bookmarks Panel and Page
Page Layout = Continuous
Magnification = Fit Width
Open to Page number = 1
Window Options:
Show = Document Title
As shown in the screenshot below:
I tried the following code:
PdfStamper stamper = new PdfStamper(reader, new FileStream(dPDFFile, FileMode.Create));
stamper.AddViewerPreference(PdfName.DISPLAYDOCTITLE, new PdfBoolean(true));
The above code makes the viewer show the document title.
But the following code does not work.
For Page Layout:
stamper.AddViewerPreference(PdfName.PAGELAYOUT, new PdfName("OneColumn"));
For Bookmarks Panel and Page:
stamper.AddViewerPreference(PdfName. PageMode, new PdfName("UseOutlines"));
So please guide me on the correct way to meet my requirement.
I'm adding an extra answer in answer to the extra question in the comments of the previous answer:
When you have a PdfWriter instance named writer, you can set the Viewer preferences like this:
writer.ViewerPreferences = viewerpreference;
In this case, the viewerpreference is a value that can have one of the following values:
PdfWriter.PageLayoutSinglePage
PdfWriter.PageLayoutOneColumn
PdfWriter.PageLayoutTwoColumnLeft
PdfWriter.PageLayoutTwoColumnRight
PdfWriter.PageLayoutTwoPageLeft
PdfWriter.PageLayoutTwoPageRight
See the PageLayoutExample for more info.
You can also change the page mode as is shown in the ViewerPreferencesExample. In which case the different values are "OR"-ed:
PdfWriter.PageModeFullScreen
PdfWriter.PageModeUseThumbs
PdfWriter.PageLayoutTwoColumnRight | PdfWriter.PageModeUseThumbs
PdfWriter.PageModeFullScreen | PdfWriter.NonFullScreenPageModeUseOutlines
PdfWriter.FitWindow | PdfWriter.HideToolbar
PdfWriter.HideWindowUI
Currently, you've only used the PrintPreferences example from the official documentation:
writer.AddViewerPreference(PdfName.PRINTSCALING, PdfName.NONE);
writer.AddViewerPreference(PdfName.NUMCOPIES, new PdfNumber(3));
writer.AddViewerPreference(PdfName.PICKTRAYBYPDFSIZE, PdfBoolean.PDFTRUE);
But in some cases, it's just easier to use:
writer.ViewerPreferences = viewerpreference;
Note that the official documentation is the book "iText in Action - Second Edition." The examples are written in Java, but you can find the C# version here. There is a new book in the works called "The ABC of PDF", but so far only 4 chapters were written. You'll find more info here: http://itextpdf.com/learn
The part about the different options to create a PdfDestination is already present in "The ABC of PDF".
As for setting the language, this is done like this:
stamper.Writer.ExtraCatalog.Put(PdfName.LANG, new PdfString("EN"));
The result is shown in the following screen shot:
As you can see, there is now a Lang entry with value EN added to the catalog.
The two items Show = Bookmarks Panel and Page and Page Layout = Continuous are controlled one layer up from the ViewerPreferences in the document's /Catalog. You can get to this via:
stamper.Writer.ExtraCatalog
In your case you're looking for:
// Acrobat's Single Page Continuous
stamper.Writer.ExtraCatalog.Put(PdfName.PAGELAYOUT, PdfName.ONECOLUMN);
// Show bookmarks
stamper.Writer.ExtraCatalog.Put(PdfName.PAGEMODE, PdfName.USEOUTLINES);
The items Magnification = Fit Width and Open to Page number = 1 are also part of the /Catalog but in a special key called /OpenAction. You can set this using:
stamper.Writer.SetOpenAction();
In your case you're looking for:
//Create a destination that fits the width (fit horizontal)
var D = new PdfDestination(PdfDestination.FITH);
//Create an open action that points to a specific page using this destination
var OA = PdfAction.GotoLocalPage(1, D, stamper.Writer);
//Set the open action on the writer
stamper.Writer.SetOpenAction(OA);

Inserting a table into multiple PDF pages with columnText

I am trying to insert a table into a PDF template. It is successful when the table fits on the page. However, if it is too big then we lose data. I basically just want it to paste what is left of the ColumnText to a next page which looks like page # 5.
Here is my current code, it is creating a blank white page in front of page #4 and it is writing the remaining ColumnText data over where it already pasted the first time.
PdfImportedPage templatePage = stamper.GetImportedPage(pdfReader, 5);
int pageNum = 5;
while (true)
{
    ct.SetSimpleColumn(-75, 50, PageSize.A4.Height + 25, PageSize.A4.Width - 200);
    if (!ColumnText.HasMoreText(ct.Go()))
        break;
    pageNum++;
    stamper.InsertPage(pageNum, new Rectangle(792f, 612f));
    stamper.GetOverContent(pageNum).AddTemplate(templatePage, 0, 0);
}
I've created a small code sample named AddLongTable that you can use to complete your code. The reason why all the content is added to the same page is simple. You forgot this line:
ct.setCanvas(stamper.getOverContent(pageNum));
Note that my example is written in Java, but I'm sure you'll know how to adapt it to C#. In the C# port, the canvas is set through a property: ct.Canvas = stamper.GetOverContent(pageNum); If you post your fix in a comment, I'll update my answer, adding the C# version of the solution.

Second page for PDFContentByte

I'm pretty sure I'm missing something simple, but since I've been breaking my head on this for a while, I'm just going to ask.
I'm using JavaScript to access the iText (Java) library to take a fillable PDF and serve it up via a browser. The process has worked for my first one, and now I'm doing one where the original fillable PDF has 2 pages. I've been trying to get the second page for a while now. I'm using the PdfContentByte to get it to the browser, and it works except I can't seem to get the PdfContentByte to have a second page. My relevant code is below. When I add the second template (page2) the way I do, it moves what I'm writing, but I'm still just getting one (US letter) page.
This may not be the most efficient code, but like I said, I've been trying a few things on this. If someone has a pointer, I would be very grateful.
var cb:com.itextpdf.text.pdf.PdfContentByte = writer.getDirectContent();
var cb2:com.itextpdf.text.pdf.PdfContentByte = writer.getDirectContent();
var reader2:com.itextpdf.text.pdf.PdfReader = new com.itextpdf.text.pdf.PdfReader(os.toByteArray());
var page:com.itextpdf.text.pdf.PdfImportedPage = writer.getImportedPage(reader2, 1);
cb.addTemplate(page, 0, 0); //this works as expected
var page2:com.itextpdf.text.pdf.PdfImportedPage = writer.getImportedPage(reader2, 2);
// this will add, and with the 100 do an offset, but the
// "physical size" of the paper is the same
cb2.addTemplate(page2, 0, 100);
Have a look at chapter 6 of iText in Action, 2nd edition, especially at subsection 6.4.1: Concatenating and splitting PDF documents.
Listing 6.22, ConcatenateStamp.java, shows you how you should create a PDF from copies of pages of multiple other PDFs; the sample actually additionally adds a new "Page X of Y" footer which you may keep or remove from the sample.