How to generate a Table of Contents (TOC) from a merged file, using the heading of each page as the TOC entry - iText

How can I generate a Table of Contents (TOC) from a merged file? The TOC entries should be the heading of each page. I have seen many examples, but all of them build the TOC on a page-number basis. I am using iText PDF 5.5.11.

I would try the following workflow:
Extract the text from the region where you expect the header to be
Store all headers and their corresponding pages (e.g. in a list of strings)
Loop over the list and collapse consecutive duplicates (e.g. [TitleA, TitleA, TitleB, ..] should become [TitleA, TitleB])
Now you know on which page every header appears for the first time
Use this information to build the TOC
If your document is tagged, this can be done more reliably; extracting text from the approximate position of the headers is only a heuristic.
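The "collapse duplicates and remember the first page" step can be sketched as follows. This is only an illustration in Python of the list handling, not iText code; it assumes the header text of every page has already been extracted into a list (one entry per page, in page order), and the function name `build_toc` is made up for this example.

```python
def build_toc(page_headers):
    """Return (header, first_page) pairs from a per-page list of headers.

    page_headers[i] is assumed to hold the header extracted from page i + 1.
    Consecutive pages with the same header collapse into a single entry.
    """
    toc = []
    for page_number, header in enumerate(page_headers, start=1):
        # Start a new TOC entry only when the header changes.
        if not toc or toc[-1][0] != header:
            toc.append((header, page_number))
    return toc


headers = ["TitleA", "TitleA", "TitleB", "TitleC", "TitleC"]
toc = build_toc(headers)
```

The resulting pairs can then be rendered as a TOC page or turned into bookmarks, whichever the merged file needs.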

Related

iTextSharp extracts wrapped cell contents into new lines - how do you identify to which column a given wrapped piece of data belongs now?

I am using iTextSharp to extract data from pdfs.
I stumbled across the following problem, depicted by the scenario below:
I created a sample excel file to illustrate. Here is what it looks like:
I convert it to a pdf, using one of the many free online converters available out there, which generates a pdf looking like (when I generated the pdf I did not apply the styling to the excel):
Now, using iTextSharp to extract the data from the pdf, returns me the following string as the data extracted:
As you can see, wrapped cell data generates new lines, where each wrapped piece of data is separated by a single white space.
The problem: how does one now identify which column a given piece of wrapped data belongs to? If only iTextSharp preserved as many white spaces as columns...
In my example: how can I identify which column 111 belongs to?
Update 1:
A similar problem occurs whenever a field has more than one word (i.e., contains white spaces). For example, considering the 1st line of the sample above:
say it looked like
---A--- ---B--- ---C--- ---D---
aaaaaaa bb b cccc
iText again would generate the extraction for this one as:
aaaaaaa bb b cccc
Same problem here, in having to determine the borders of each column.
Update 2:
A sample of the real pdf file I am working with:
This is how the pdf data looks like.
In addition to Chris' generic answer, some background in iText(Sharp) content parsing...
iText(Sharp) provides a framework for content extraction in the namespace iTextSharp.text.pdf.parser / package com.itextpdf.text.pdf.parser. This framework reads the page content, keeps track of the current graphics state, and forwards information on pieces of content to the IExtRenderListener or IRenderListener / ExtRenderListener or RenderListener the user (i.e. you) provides. In particular it does not interpret structure into this information.
This render listener may be a text extraction strategy (ITextExtractionStrategy / TextExtractionStrategy), i.e. a special render listener which is predominantly designed to extract a pure text stream without formatting or layout information. And for this special case iText(Sharp) additionally provides two sample implementations, the SimpleTextExtractionStrategy and the LocationTextExtractionStrategy.
For your task you need a more sophisticated render listener which either
exports the text with coordinates (Chris in one of his answers has provided an extended LocationTextExtractionStrategy which can additionally provide positions and bounding boxes of text chunks) allowing you in additional code to analyse tabular structures; or
does the analysis of tabular data itself.
I do not have an example for the latter variant because generically recognizing and parsing tables is a whole project in itself. You might want to look into the Tabula project for inspiration; this project is surprisingly good at the task of table extraction.
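For the first variant, the "additional code" that analyses coordinates could look roughly like the sketch below. It is written in Python for brevity and is not iText code: it assumes the render listener has already exported each text chunk as a `(text, x, y)` tuple, and that the left edge of every column is known (for example from the x positions of the header cells). The coordinates and the name `assign_columns` are hypothetical.

```python
from bisect import bisect_right


def assign_columns(chunks, column_starts):
    """Group (text, x, y) chunks into table rows by column.

    column_starts must be sorted ascending. A chunk belongs to the last
    column whose left edge is <= the chunk's x coordinate. Chunks are
    assumed to share exactly the same y when they are on the same line;
    real data would need a tolerance here.
    """
    rows = {}
    for text, x, y in chunks:
        col = max(bisect_right(column_starts, x) - 1, 0)
        cells = rows.setdefault(y, [""] * len(column_starts))
        cells[col] = (cells[col] + " " + text).strip()
    # PDF y coordinates grow upwards, so the top line has the largest y.
    return [rows[y] for y in sorted(rows, reverse=True)]


column_starts = [50, 150, 250, 350]          # hypothetical left edges of A..D
chunks = [
    ("aaaaaaa", 50, 700), ("bb b", 150, 700), ("cccc", 350, 700),
    ("111", 250, 690),                        # wrapped data on the next line
]
table = assign_columns(chunks, column_starts)
```

With these made-up coordinates, "111" lands in the third column and the empty cell in the first line stays empty, which is exactly the information a plain string extraction loses.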
PS: If you feel more at home with trying to extract structured content from a pure string representation of the content which nonetheless tries to reflect the original layout, you might try something like what is proposed in this answer, a variant of the LocationTextExtractionStrategy working similar to the pdftotext -layout tool; only the changes to be applied to the LocationTextExtractionStrategy are shown there.
PPS: Extraction of data from very specific PDF tables may be much easier; for example have a look at this answer which demonstrates that after some PDF analysis the specific way a given table is created might give rise to a simple custom render listener for extracting the table data. This can make sense for a single PDF with a table spanning many many pages like in the case of that answer, or it can make sense if you have many PDFs identically created by the same software.
This is why I asked for a representative sample file in a comment to your question.
Concerning your comments
Still with the pdf example above, both with an implementation from scratch of ITextExtractionStrategy and with extending LocationExtractionStrategy, I see that each RenderText is called at the following chunks: Fi, el, d, A, Fi, el, d... and so on. Can this be changed?
The chunks of text you get as separate RenderText calls are not separated by accident or some random decision of iText. They are the very strings drawn separately in the page content!
In your sample "Fi", "el", "d", and "A" come in different RenderText calls because the content stream contains operations in which first "Fi" is drawn, then "el", then "d", then "A".
This may sound weird at first. A common cause for such torn up words is that PDF does not use the kerning information from fonts; to apply kerning, therefore, the PDF generating software has to insert tiny forward or backward jumps between characters which should be farther from or nearer to each other than without kerning. Thus, words often are torn apart between kerning pairs.
So this cannot be changed, you will get those pieces, and it is the job of the text extraction strategy to put them together.
By the way, there are worse PDFs: some PDF generators position each and every glyph separately, most notably generators which primarily build GUIs but can, as an extra feature, automatically export GUI canvases as PDFs.
I would expect that in entering the realm of "adding my own implementation" I would have control over how to determine what is a "chunk" of text.
You can... well, you have to decide which of the incoming pieces belong together and which don't. E.g. do glyphs with the same y coordinate form a single line? Or do they form separate lines in different columns which just happen to be located next to each other?
So yes, you decide which glyphs you interpret as a single word or as content of a single table cell, but your input consists of the groups of glyphs used in the actual PDF content stream.
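As an illustration of such a decision, here is a minimal Python sketch (not iText code) that glues the pieces of one line back together whenever the horizontal gap between them is small. It assumes each piece has been exported as `(text, x_start, x_end)`; the coordinates and the `max_gap` threshold are made-up values that would have to be tuned to the font size of the real document.

```python
def join_chunks(chunks, max_gap=1.0):
    """Merge (text, x_start, x_end) pieces of a single line into words.

    Two neighbouring pieces are treated as one word when the gap between
    the end of the first and the start of the second is at most max_gap.
    """
    words = []
    prev_end = None
    for text, x_start, x_end in sorted(chunks, key=lambda c: c[1]):
        if prev_end is not None and x_start - prev_end <= max_gap:
            words[-1] += text      # tiny gap (e.g. kerning): same word
        else:
            words.append(text)     # large gap: a new word or a new cell
        prev_end = x_end
    return words


pieces = [("Fi", 10.0, 18.0), ("el", 18.2, 26.0), ("d", 26.1, 30.0), ("A", 34.0, 40.0)]
words = join_chunks(pieces)
```

A second, larger threshold could be used in the same way to tell a space between words from a jump to the next table column.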
Not only that, in none of the interface's methods I can "spot" how/where it deals with non-text data/images - so I could intercede with the spacing issue (RenderImage is not called)
RenderImage will be called for embedded bitmap images, JPEGs etc. If you want to be informed about vector graphics, your strategy will also have to implement IExtRenderListener which provides methods ModifyPath, RenderPath and ClipPath.
This isn't really an answer but I needed a spot to show some things that might help you understand things.
First "conversion" from Excel, Word, PowerPoint, HTML or whatever to PDF is almost always going to be a destructive change. The destructive part is very important and it happens because you are taking data from a program that has very specific knowledge of what that data represents (Excel) and you are turning it into drawing commands in a very generic universal format (PDF) that only cares about what the data looks like, not the data itself. Unless the data is "tagged" (and it almost never is these days still) then there is no context for the drawing commands. There are no paragraphs, there are no sentences, there are no columns, rows, tables, etc. There's literally just draw this letter at x,y and draw this word at a,b.
Second, imagine your Excel file had the following data and for some reason the last column was narrower than the others when the PDF was made:
Column A | Column B | Column
                      C
Data #1    Data #2    Data
                      #3
You and I have context so we know that the second and fourth lines are really just the continuation of the first and third lines. But since iText doesn't have any context during extraction it doesn't think like that and it sees four lines of text. In fact, since it doesn't have context it doesn't even see columns, just the lines themselves.
Third, although a very small thing you need to understand that you don't draw spaces in PDF. Imagine the three column table below:
Column A | Column B | Column C
                      Yes
If you extracted that from a PDF you'd get this data:
Column A | Column B | Column C
Yes
Inside the PDF the word "Yes" will be just drawn at a certain x coordinate that you and I consider to be under the third column and it won't have a bunch of spaces in front of it.
As I said at the beginning, this isn't much of an answer but hopefully it will explain to you the problem that you are trying to solve. If your PDF is tagged then it will have context and you can use that context during extraction. Context isn't universal, however, so there usually isn't just a magic "insert context" checkbox. Excel actually does have a checkbox (if I remember correctly) to make a tagged PDF during export and it ultimately creates a tagged PDF using HTML-like tags for tables. Very primitive, but it works. However, it will be up to you to parse this context.
Leaving here an alternative strategy for extracting the data. It does not solve the problem of how spaces are (or can be) treated, but it gives you somewhat more control over the extraction by letting you specify the geometric areas you want to extract text from. Taken from here.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// The class name is arbitrary; it only gives the two methods a home.
public static class RegionExtraction
{
    // Note: despite the parameter names, the values are PDF user-space
    // units (points by default), not pixels.
    public static System.util.RectangleJ GetRectangle(float distanceInPixelsFromLeft, float distanceInPixelsFromBottom, float width, float height)
    {
        return new System.util.RectangleJ(
            distanceInPixelsFromLeft,
            distanceInPixelsFromBottom,
            width,
            height);
    }

    public static void Strategy2()
    {
        // In this example, I'll declare a pageNumber integer variable to
        // only capture text from the page I'm interested in
        int pageNumber = 1;
        var text = new StringBuilder();
        List<Tuple<string, int>> result = new List<Tuple<string, int>>();
        // The PdfReader object implements IDisposable.Dispose, so you can
        // wrap it in the using keyword to automatically dispose of it
        using (var pdfReader = new PdfReader("D:/Example.pdf"))
        {
            float distanceInPixelsFromLeft = 20;
            //float distanceInPixelsFromBottom = 730;
            float width = 300;
            float height = 10;
            // Scan the page from top to bottom in horizontal strips of height 10
            for (int i = 800; i >= 0; i -= 10)
            {
                var rect = GetRectangle(distanceInPixelsFromLeft, i, width, height);
                var filters = new RenderFilter[1];
                filters[0] = new RegionTextRenderFilter(rect);
                ITextExtractionStrategy strategy =
                    new FilteredTextRenderListener(
                        new LocationTextExtractionStrategy(),
                        filters);
                var currentText = PdfTextExtractor.GetTextFromPage(
                    pdfReader,
                    pageNumber,
                    strategy);
                // Round-trips the text through the default encoding
                // (kept from the original sample)
                currentText =
                    Encoding.UTF8.GetString(Encoding.Convert(
                        Encoding.Default,
                        Encoding.UTF8,
                        Encoding.Default.GetBytes(currentText)));
                //text.Append(currentText);
                result.Add(new Tuple<string, int>(currentText, currentText.Length));
            }
        }
        // You'll do something else with it, here I write it to a console window
        //Console.WriteLine(text.ToString());
        foreach (var line in result.Distinct().Where(r => !string.IsNullOrWhiteSpace(r.Item1)))
        {
            Console.WriteLine("Text: [{0}], Length: {1}", line.Item1, line.Item2);
        }
    }
}
Outputs:
PS.: We are still left with the problem of how to deal with spaces/non text data.

Merging documents using OpenXml and section breaks causes empty paragraphs

I am stitching a couple of documents together with a requirement that each document should retain its header and footer information in the final document. Using AltChunk instead of raw OpenXml or DocumentBuilder saves a lot of effort with regards to styles, formatting, references, parts, etc.
Unfortunately, after a couple of days I can't seem to get a 100% working version due to a small and frustrating issue and I need some insight.
My code is loosely based on this article.
I modify each sub document, prior to appending it (as an AltChunk) to a working document, by moving the last section properties into the last paragraph (in order to retain header and footer references), but Word seems to be adding a blank paragraph to each of these documents as it renders them in the final document. I end up with:
document 1 with correct header and footer
section properties/break
blank paragraph
document 2 with correct header and footer
section properties/break
blank paragraph
etc.
I can't remove the blank paragraphs afterwards, as I ideally don't want to use WAS to render the document first.
It seems as if you cannot have a next-page section break without a following paragraph?
After further investigation, it seems that there is no way around this in my usage scenario. I would need to place the last section properties in the body element, but due to my way of processing with nested AltChunk elements, that would not work.
I have changed my approach completely and went back to a more detailed append procedure using OpenXml Power Tools and some LINQ to Xml.
I'm using DocumentBuilder and it works perfectly for me!
var sources = new List<OpenXmlPowerTools.Source>();
sources.Add(new OpenXmlPowerTools.Source(new WmlDocument(@tempReportPart1)));
sources.Add(new OpenXmlPowerTools.Source(new WmlDocument(@tempReportPart2)));
var outputPath = @"C:\Users\xpto\Documents\TestFolder\myNewDocument.docx";
DocumentBuilder.BuildDocument(sources, outputPath);
I had a similar empty paragraph issue while importing HTML files.
My solution is:
After inserting the HTML AltChunk, I add a GUID placeholder. After processing the file, I open the file again, locate the GUID and check whether there is an empty paragraph before it; if so, I remove the empty paragraph and the GUID. This seems to work perfectly in my solution.
Hope it helps.

Update TOC after merging 2 pdfs using itext

Let me explain the scenario.
1) I have an existing PDF with a TOC, named A.pdf, with 10 pages
2) I have two more PDFs with TOCs, named B.pdf and C.pdf, with 5 pages each
3) Now I need to add B.pdf to A.pdf after the 3rd page
4) and C.pdf to A.pdf after the 7th page.
5) And I need to update the TOC based on the final sequence
Does anyone have an idea how to implement this using iText?
Please read the documentation, more specifically chapter 7 of my book. You'll find an example named ConcatenateBookmarks that does exactly what you're asking. That is: if by TOC, you are referring to bookmarks stored in an outline tree. In the example, we read all bookmarks using the SimpleBookmark class, we compose a new outline tree, shifting the bookmarks depending on the number of pages in each of the existing documents, and then we add the composed outline tree to the resulting PDF using the setOutlines() method.
If by TOC you mean a sequence of pages showing a table of contents without any semantics or interactive features, you're asking for something that is impossible due to the nature of PDF (which you'll discover once you start reading ISO-32000).
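For the bookmark case, the page shifting itself is plain arithmetic. The sketch below shows it in Python using the scenario from the question; it is not the ConcatenateBookmarks example and uses no iText API. Bookmarks are reduced to hypothetical `(title, page)` pairs, and `merge_outlines` is a name invented for this illustration.

```python
def merge_outlines(base_bookmarks, inserts):
    """Compute bookmark pages after inserting documents into a base PDF.

    base_bookmarks: list of (title, page) in the base document.
    inserts: list of (after_page, page_count, bookmarks), where after_page
             refers to the ORIGINAL page numbering of the base document.
    """
    merged = []
    for title, page in base_bookmarks:
        # Every document inserted before this page pushes it further back.
        shift = sum(count for after, count, _ in inserts if after < page)
        merged.append((title, page + shift))
    for after, _count, bookmarks in inserts:
        # The inserted document starts right after 'after', plus the pages
        # of any documents inserted earlier in the base document.
        offset = after + sum(c for a, c, _ in inserts if a < after)
        for title, page in bookmarks:
            merged.append((title, page + offset))
    return sorted(merged, key=lambda entry: entry[1])


a_bookmarks = [("A1", 1), ("A2", 4), ("A3", 8)]     # A.pdf, 10 pages
inserts = [
    (3, 5, [("B1", 1)]),                            # B.pdf after page 3
    (7, 5, [("C1", 1), ("C2", 3)]),                 # C.pdf after page 7
]
outline = merge_outlines(a_bookmarks, inserts)
```

In iText the same offsets would be applied to the bookmark structures read from each document before the combined outline is set on the result.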

Set xlsx to recalculate formulae on open

I am generating xlsx files and would like to not have to compute the values of all formulae during this process.
That is, I would like to set <v> to 0 (or omit it) for cells with an <f>, and have Excel fill in the values when it is opened.
One suggestion was to have a macro run Calculate on startup, but I have been unable to find a complete guide on how to do this with signed macros to avoid prompting the user. A flag you can set somewhere within the xlsx would be far better.
Edit: I'm not looking for answers that involve using Office programs to make changes. I am looking for file format details.
The Python module XlsxWriter sets the formula <v> value to 0 (unless the actual value is known) and the <calcPr> fullCalcOnLoad attribute to true in the xl/workbook.xml file:
<calcPr fullCalcOnLoad="1"/>
This works for all Excel and OpenOffice, LibreOffice, Google Docs and Gnumeric versions that I have tested.
The place it won't work is for non-spreadsheet applications that cannot re-calculate the formula value such as file viewers.
If calculation mode is set to automatic, Excel always (re)calculates workbooks on open.
So, just generate your files with calculation mode set to "Automatic".
In xl/workbook.xml, add the following node to the workbook node:
<calcPr calcMode="auto"/>
Also check Description of how Excel determines the current mode of calculation.
You can use macros as suggested, however you will create a less secure and less compatible workbook without avoiding user interaction to force calculation.
If you opt for VBA, you can call Application.Calculate in the Workbook_Open event.
In your XML contents, simply omit the <v> element in each cell that has a formula; this will force MS Excel to recalculate the formula, whatever the Excel options are.
Instead of:
<c r="B2" s="1">
<f>SUM(A1:C1)</f>
<v>6</v>
</c>
Have:
<c r="B2" s="1">
<f>SUM(A1:C1)</f>
</c>
If you have to update the formulas in already existing XML contents, you can easily code a small parser that searches for each <c> element. If the <c> element has an <f> element, delete its <v> element.
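Such a small parser could look like the following sketch, using only Python's standard library. It operates on the XML text of a single worksheet part (e.g. the content of xl/worksheets/sheet1.xml); unzipping and re-zipping the xlsx is left out. The function name `strip_cached_values` is made up, and note that ElementTree may rename namespace prefixes other than the default one when serializing, so treat this as an illustration rather than a drop-in tool.

```python
import xml.etree.ElementTree as ET

# Main SpreadsheetML namespace used by worksheet parts.
NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
ET.register_namespace("", NS)  # keep it as the default namespace on output


def strip_cached_values(sheet_xml):
    """Remove the <v> element from every <c> cell that contains an <f>."""
    root = ET.fromstring(sheet_xml)
    # Materialize the cell list first so the tree is not modified
    # while it is being traversed.
    for cell in list(root.iter(f"{{{NS}}}c")):
        if cell.find(f"{{{NS}}}f") is not None:
            for value in cell.findall(f"{{{NS}}}v"):
                cell.remove(value)
    return ET.tostring(root, encoding="unicode")
```

Cells without a formula keep their <v> element, so constant values are untouched.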
Faced the same problem when exporting xlsx'es via openxml (with fastest SAX + template file approach w/o zip stream rewinds).
Despite Calculation option=Automatic, no recalculation on opening the file.
Furthermore no recalculation via Calculate Now and Calculate Sheet buttons.
Only upon selecting the cell and pressing enter ;(
Original formula: SUM(A3:A999)
Solution:
Create an internal hidden sheet
Place end row number (999 in my case) into any cell in hidden sheet (P1 in my case)
Reference row number in the cell via INDIRECT operator
Final formula: SUM(A3:INDIRECT("A"&Internal!P1))
Please refer to the attached gifs
before.gif
after.gif
P.S.
Theoretically, in P1 you can implement dynamic row number calculation via something like =LOOKUP(2;1/(Sheet1!A:A<>"");ROW(Sheet1!A:A)), but my customers were satisfied with the hardcoded row number solution.

Table of contents with page number - how to implement

Is it possible to implement a table of contents with page numbers on the first page of the PDF report?
I've read the links below, which I found via Google:
1) http://community.jaspersoft.com/questions/541300/table-contents-ireport
2) http://community.jaspersoft.com/questions/529040/generation-page-numbers-table-content
In the first link, they use scriptlets for this. I want a table of contents with page numbers on the first page of the PDF report, but I do not understand where to start. Any ideas?
I would recommend checking this sample from the original documentation.
There is no way (or at least I don't know of one and haven't found one) to generate the table of contents at the beginning, since the page numbers are not known yet at that point. So you will have to generate it at the end (in the summary band) and afterwards move it to where you want to place it. To move it, use the JasperPrint class and its methods getPages, addPage and removePage.
I guess you will have subreports, if so, you need to pass the JRBeanCollectionDataSource you will be filling during runtime to each subreport (and return the value back to the master report).
Hope that helps.