I am using iTextSharp to extract data from pdfs.
I stumbled across the following problem, depicted by the scenario below:
I created a sample excel file to illustrate. Here is what it looks like:
I convert it to a pdf, using one of the many free online converters available out there, which generates a pdf looking like (when I generated the pdf I did not apply the styling to the excel):
Now, using iTextSharp to extract the data from the pdf, returns me the following string as the data extracted:
As you can see, wrapped cell data generate new lines, where each wrapped piece of data separated by a single white space.
The problem: how does one identify, now, to which column a given piece of wrapped data belongs to ? If only iTextSharp preserved as many white spaces as columns...
In my example - how can I identify to which column does 111 belong ?
Update 1:
A similar problem occurs whenever a field has more than one word (i.e., contains white spaces). For example, considering the 1st line of the sample above:
say it looked like
---A--- ---B--- ---C--- ---D---
aaaaaaa bb b cccc
iText again would generate the extraction for this one as:
aaaaaaa bb b cccc
Same problem here, in having to determine the borders of each column.
Update 2:
A sample of the real pdf file I am working with:
This is how the pdf data looks like.
In addition to Chris' generic answer, some background in iText(Sharp) content parsing...
iText(Sharp) provides a framework for content extraction in the namespace iTextSharp.text.pdf.parser / package com.itextpdf.text.pdf.parser. This franework reads the page content, keeps track of the current graphics state, and forwards information on pieces of content to the IExtRenderListener or IRenderListener / ExtRenderListener or RenderListener the user (i.e. you) provides. In particular it does not interpret structure into this information.
This render listener may be a text extraction strategy (ITextExtractionStrategy / TextExtractionStrategy), i.e. a special render listener which is predominantly designed to extract a pure text stream without formatting or layout information. And for this special case iText(Sharp) additionally provides two sample implementations, the SimpleTextExtractionStrategy and the LocationTextExtractionStrategy.
For your task you need a more sophisticated render listener which either
exports the text with coordinates (Chris in one of his answers has provided an extended LocationTextExtractionStrategy which can additionally provide positions and bounding boxes of text chunks) allowing you in additional code to analyse tabular structures; or
does the analysis of tabular data itself.
I do not have an example for the latter variant because generically recognizing and parsing tables is a whole project in itself. You might want to look into the Tabula project for inspiration; this project is surprisingly good at the task of table extraction.
PS: If you feel more at home with trying to extract structured content from a pure string representation of the content which nonetheless tries to reflect the original layout, you might try something like what is proposed in this answer, a variant of the LocationTextExtractionStrategy working similar to the pdftotext -layout tool; only the changes to be applied to the LocationTextExtractionStrategy are shown there.
PPS: Extraction of data from very specific PDF tables may be much easier; for example have a look at this answer which demonstrates that after some PDF analysis the specific way a given table is created might give rise to a simple custom render listener for extracting the table data. This can make sense for a single PDF with a table spanning many many pages like in the case of that answer, or it can make sense if you have many PDFs identically created by the same software.
This is why I asked for a representative sample file in a comment to your question
Concerning your comments
Still with the pdf example above, both with an implementation from scratch of ITextExtractionStrategy and with extending LocationExtractionStrategy, I see that each RenderText is called at the following chunks: Fi, el, d, A, Fi, el, d... and so on. Can this be changed?
The chunks of text you get as separate RenderText calls are not separated by accident or some random decision of iText. They are the very strings drawn separately in the page content!
In your sample "Fi", "el", "d", and "A" come in different RenderText calls because the content stream contains operations in which first "Fi" is drawn, then "el", then "d", then "A".
This may sound weird at first. A common cause for such torn up words is that PDF does not use the kerning information from fonts; to apply kerning, therefore, the PDF generating software has to insert tiny forward or backward jumps between characters which should be farther from or nearer to each other than without kerning. Thus, words often are torn apart between kerning pairs.
So this cannot be changed, you will get those pieces, and it is the job of the text extraction strategy to put them together.
By the way, there are worse PDFs, some PDF generators position each and every glyph separately, foremost such generators which predominantly build GUIs but can as a feature automatically export GUI canvasses as PDFs.
I would expect that in entering the realm of "adding my own implementation" I would have control over how to determine what is a "chunk" of text.
You can... well, you have to decide which of the incoming pieces belong together and which don't. E.g. do glyphs with the same y coordinate form a single line? Or do they form separate lines in different columns which just happen to be located next to each other.
So yes, you decide which glyphs you interpret as a single word or as content of a single table cell, but your input consists of the groups of glyphs used in the actual PDF content stream.
Not only that, in none of the interface's methods I can "spot" how/where it deals with non-text data/images - so I could intercede with the spacing issue (RenderImage is not called)
RenderImage will be called for embedded bitmap images, JPEGs etc. If you want to be informed about vector graphics, your strategy will also have to implement IExtRenderListener which provides methods ModifyPath, RenderPath and ClipPath.
This isn't really an answer but I needed a spot to show some things that might help you understand things.
First "conversion" from Excel, Word, PowerPoint, HTML or whatever to PDF is almost always going to be a destructive change. The destructive part is very important and it happens because you are taking data from a program that has very specific knowledge of what that data represents (Excel) and you are turning it into drawing commands in a very generic universal format (PDF) that only cares about what the data looks like, not the data itself. Unless the data is "tagged" (and it almost never is these days still) then there is no context for the drawing commands. There are no paragraphs, there are no sentences, there are no columns, rows, tables, etc. There's literally just draw this letter at x,y and draw this word at a,b.
Second, imagine you Excel file had that following data and for some reason that last column was narrower than the others when the PDF was made:
Column A | Column B | Column
C
Data #1 Data #2 Data
#3
You and I have context so we know that the second and fourth lines are really just the continuation of the first and third lines. But since iText doesn't have any context during extraction it doesn't think like that and it sees four lines of text. In fact, since it doesn't have context it doesn't even see columns, just the lines themselves.
Third, although a very small thing you need to understand that you don't draw spaces in PDF. Imagine the three column table below:
Column A | Column B | Column C
Yes
If you extracted that from a PDF you'd get this data:
Column A | Column B | Column C
Yes
Inside the PDF the word "Yes" will be just drawn at a certain x coordinate that you and I consider to be under the third column and it won't have a bunch of spaces in front of it.
As I said at the beginning, this isn't much of an answer but hopefully it will explain to you the problem that you are trying to solve. If your PDF is tagged then it will have context and you can use that context during extraction. Context isn't universal, however, so there usually isn't just a magic "insert context" checkbox. Excel actually does have a checkbox (if I remember correctly) to make a tagged PDF during export and it ultimately creates a tagged PDF using HTML-like tags for tables. Very primitive but it will works. However it will be up to you to parse this context.
Leaving here an alternative strategy for extracting the data - that does not solve the problem of who are spaces treated/can be treated, but gives you somewhat more control over the extraction by specifying geometric areas you want to extract text from. Taken from here.
public static System.util.RectangleJ GetRectangle(float distanceInPixelsFromLeft, float distanceInPixelsFromBottom, float width, float height)
{
return new System.util.RectangleJ(
distanceInPixelsFromLeft,
distanceInPixelsFromBottom,
width,
height);
}
public static void Strategy2()
{
// In this example, I'll declare a pageNumber integer variable to
// only capture text from the page I'm interested in
int pageNumber = 1;
var text = new StringBuilder();
List<Tuple<string, int>> result = new List<Tuple<string, int>>();
// The PdfReader object implements IDisposable.Dispose, so you can
// wrap it in the using keyword to automatically dispose of it
using (var pdfReader = new PdfReader("D:/Example.pdf"))
{
float distanceInPixelsFromLeft = 20;
//float distanceInPixelsFromBottom = 730;
float width = 300;
float height = 10;
for (int i = 800; i >= 0; i -= 10)
{
var rect = GetRectangle(distanceInPixelsFromLeft, i, width, height);
var filters = new RenderFilter[1];
filters[0] = new RegionTextRenderFilter(rect);
ITextExtractionStrategy strategy =
new FilteredTextRenderListener(
new LocationTextExtractionStrategy(),
filters);
var currentText = PdfTextExtractor.GetTextFromPage(
pdfReader,
pageNumber,
strategy);
currentText =
Encoding.UTF8.GetString(Encoding.Convert(
Encoding.Default,
Encoding.UTF8,
Encoding.Default.GetBytes(currentText)));
//text.Append(currentText);
result.Add(new Tuple<string, int>(currentText, currentText.Length));
}
}
// You'll do something else with it, here I write it to a console window
//Console.WriteLine(text.ToString());
foreach (var line in result.Distinct().Where(r => !string.IsNullOrWhiteSpace(r.Item1)))
{
Console.WriteLine("Text: [{0}], Length: {1}", line.Item1, line.Item2);
}
//Console.WriteLine("", string.Join("\r\n", result.Distinct().Where(r => !string.IsNullOrWhiteSpace(r.Item1))));
Outputs:
PS.: We are still left with the problem of how to deal with spaces/non text data.
I have loaded data in Orange using a CSV file, and some columns are being automatically detected as continuous (C) while in reality they're ordinal dicrete (i.e. may have integer values from 1 to 5).
I was expecting widgets such as the "Edit Domain Data" to let me change these meta attributes, but I couldn't figure it out. The only workaround I've managed so far was to save the CSV as a TAB file and add the meta attributes myself.
Am I missing something very silly? :)
I'm probably biting off more than I can chew with this particular problem, but I'll try to be as specific as possible in case it's within my scope. Disclaimer: I'm not terribly experienced with MS Word, beyond simple data entry/some formatting, and I have absolutely zero experience working with macros or VBasic. Unfortunately, I'm afraid the solution to my problem will come in the form of one of those last two.
THE GOAL:
What I want to do is to have placeholder text throughout my template document that will change content but not formatting when the first instance of it is changed. Basically, I'm writing a template for support manuals for a software suite. Each app has certain similar features like the menu bar, data entry screen, diagnostic log screen, transaction history, etc., so I am pre-writing those sections and using placeholders when I need to insert certain app specific properties.
I started off using the Insert->Quick Parts->Document Property->Subject tool which I used as a placeholder for the app name. I set the Property to [Subject] and then used Insert->Quick Parts->Field->Subject throughout the document, wherever I needed to include the app name. This worked fine in this case because the app name will always be capitalized. I simply change the text in the first [Subject] (which is content controlled) and update the fields throughout the document, and they all match nicely, easy-peasy, work done, go home and drink beer, right?
Not quite.
Our software handles part tracking via scanners and SQL Server, so while the interface and menu in the apps remains largely unchanged, the parts they track change from app to app. Because of this, I need to change the part name when I reference it within the text of the manuals; for example, if I'm working in ToiletPap.app and our TP is tracked by the roll, I need every mention of [Component] to be changed to roll. If I'm working in LightBulbs.app, I need [Component] to say bulb.
My first efforts went toward creating a custom doc property called Component using the Advanced tab under the Document Properties dropmenu. I then created a plaintext content control around my first [Component] titled Component and made my next [Component] a field with modified code: {COMPONENT * MERGEFORMAT}. This comes from copying what I can find when [Subject] works. This didn't work at all; updating the text in the first CC doesn't change the Content doc prop, and my fields return "!Undefined Bookmark, COMPONENT".
I got close to what I need by using the [Comments] doc property, set initially to [Component]. I used it just like [Subject], but (this is when I realized that capitalization was going to be an issue) when I mention my [component] in-text, as often as not, I need to to be lowercase instead of upper.
I've looked on MS's forums and a few others as well as here on SO, and I can't find anyone who's trying to do the same thing, much less an answer to how. Please keep in mind when answering, it would be a great help to me if you would include step-by-step instructions on how to enter/implement the code you provide because, as I mentioned, I have no idea how to go about editing macros/VBasic for MS Word.
To restate and summarize my overall question: How can I use a placeholder that displays the text "[Component]" so that, when I change the first instance of [Component] to something else, say "hopper", every subsequent instance of [Component] is updated to hopper but maintains its current capitalization and formatting scheme?
Apologies for the length of the request, but I wanted to make sure I explained the situation as accurately as possible. Thanks in advance for your consideration and responses.
I managed to solve this one after a couple extra hours of tinkering. I didn't need macros or VBasic, either.
On the first instance of [component] I created a plain-text content control to act as a container (not a necessity, but it makes it look nicer. Will likely cause a problem eventually, but for now, it's working as intended) and bookmarked it. Then, for all other instances of [container] I selected each and used Insert->Quick Parts->Field->Ref with the following field code:
REF Text1 \*Lower
Where "Text1" is my bookmark and "*Lower" indicates all lower case. The *Lower can be replaced with *Upper or *FirstCap to indicate all upper case or capitalize the first letter respectively. Now, each field reflects the text of the first with the capitalization appropriate to each field's location within the document. Just like using the doc prop with [Subject], ^a -> f9 is needed to update all fields within the document.
Using FPDI, I first use a template to create a table of contents. I then import additional pages, which are linked to by the table of contents. What I'm experiencing is that FPDI is shrinking the templates and possible adding white space. I believe I have eliminated any browser added whitespace/shrinkage by merging the same documents into one via the passthru() command.
I have pasted code here: http://pastebin.com/VeLEN8nz. Line 45 - 57 is where the Table of Contents gets included as a template file.
The original file is here: http://truckingshow.com/TOC.pdf
The post-FPDI file is here: http://truckingshow.com/TOC-afterFPDI.pdf
The most noticeable difference is in the right and bottom margins.
Thank you for taking a look, please let me know if I can provide more info.
The document that was given to me was "Letter", not "A4" (the default for FPDF())
Instead of $pdf = new FPDI('P', 'pt') Simply using $pdf = new FPDI('P', 'pt', 'Letter') solved the problem.
Thanks!
When using the Add or Edit form from the pager I'm wondering how a simple static label can be added in the form without it creating any additional columns in it's affect on colNames[]'s and colModel[]'s. For example I have a quite simple typical Add form which opens from the pager containing a few label's and form elements: Name, Email, Web Site, etc., and then the lower section of the form has a few drop down menus containing the number 1 through 10 with the idea being to ask the user to pick a value between 1 and 10 to put a value on the importance to them about the product or service which is listed beside it. Just above this section I want to add some text only to give a brief instruction asking the user to "Choose the importance of the following products and services using the scale: [1=Low interest --- 10=Very high interest]". I cannot figure out how to get a text label inserted in the form without having to define a column with a formoption{} etc which is not needed for just some descriptive text. I know about the "bottominfo: 'some text'" for adding text to the bottom of the form but I need to insert some text similar to that mid-way (or other positions) in the form without it affecting the tabular structure of the grid. Is this even possible? TIA.
You can modify Edit or Add forms inside of afterShowForm. The ids of the form fields are like "tr_Name". There consist from "tr_" prefix and the corresponding column name.
I modified the code example from my old answer so that in the Add dialod there exist an additional line with the bold text "Additional Information:". In the "Edit" dialog (like one want in the original question) the input field for one column is disabled. You can see the example live here. I hope that a working code example can say more as a lot of words.