itextsharp PDF to text dump

itextsharp PDF to text dump - itext

I am looking for a way to actually get the contents of the file itself, in its text format, dumped. E.g.: i don't want a dictionary object, i don't want some sort of extractionstrategy option, i just want the same text document that itextsharp uses to parse... the WHOLE thing as a string or stringbuilder...
I have not yet found a way to do this using any tools what so ever... my problem is that i am trying to read a dynamic PDF into a C# application... and we all know that those darn dynamic PDFs can't be parsed by iTextSharp (AcroForm and AcroFields always comes up empty), so i figured that if i can get the actual text dump of the entire file, i can see what it looks like and parse it myself for this specific task (e.g.: make a class for each document i know i can received, and make a map there based on what i see).
If anyone can help me do that, or even better, find a way, in C#, to extract the XML Source for the PDF (kinda like clicking the XML Source tab in LiveCycle) instead, it would be greatly appreciated.
Thanks!
Matt

If you are looking for the actual operators and commands of each page in the raw text format, try the following code:
var reader = new PdfReader("test.pdf");
int intPageNum = reader.NumberOfPages;
for (int i = 1; i <= intPageNum; i++)
{
byte[] contentBytes = reader.GetPageContent(i);
File.WriteAllBytes("page-" + i + ".txt", contentBytes);
}
reader.Close();

I am looking for a way to actually get the contents of the file
itself, in its text format, dumped. E.g.: i don't want a dictionary
object, i don't want some sort of extractionstrategy option, i just
want the same text document that itextsharp uses to parse... the WHOLE
thing as a string or stringbuilder...
Unfortunately the data that itextsharp uses to parse are not yet text; the operators in that data are given in some textual format but the actual glyphs may be given in a completely arbitrary ad-hoc encoding. That been said, often some standard encoding is used as it is the most simple solution for the components in use. You cannot in general count on that, though. The answer by VahidN shows you how to access the starting points for that content; not seldomly, though, that page content data he extracts only contain references to resources which are contained in different objects.
my problem is that i am trying to read a dynamic PDF into a C#
application... and we all know that those darn dynamic PDFs can't be
parsed by iTextSharp (AcroForm and AcroFields always comes up empty),
This sounds as if you actually have a completely different task at hand. Dynamic forms and their contents are not part of the page content but instead stored in a separate XML Forms Architecture stream.
iText in Action, 2nd edition, in chapter 8 gives you some information on how to access the XFA stream data, for a first glimps look at the sample XfaMovie.cs.
You might also want to look at the iText XML Worker project for easier manipulation of XFA streams.

if you just want to dump the text, try this:
PdfReader reader = new PdfReader(pdfFileName);
String text = "";
nPages = reader.NumberOfPages;
for (int i = 0; i < nPages; i++)
{
text += PdfTextExtractor.GetTextFromPage(reader, i + 1);
}

Related

Additional PDF Attachment in E-Mail (ABAP)

I'm currently trying to send the results of a selection via E-Mail, more precisely as an attachment. My goal is to create a XML-File (which works so far) and a PDF, both fed from the internal table in which the selected data is held. The internal table is declared with a custom type. My current code for sending the E-Mail with the XML attachment looks like following:
lr_send_request = cl_bcs=>create_persistent( ).
lr_document = cl_document_bcs=>create_document( i_type = 'HTM'
i_text = lt_text
i_subject = lv_subject ).
* ----- converting data of internal table so it is suitable for XML
...
* -----
lr_document->add_attachment( i_attachment_type = 'BIN'
i_attachment_subject = 'output.xml'
i_attachment_size = xml_size
i_attachment_language = sy-langu
i_att_content_hex = xml_content ).
lr_send_request->set_document( lr_document ).
On the web I was only able to find how to convert spooljob (whatever that is :/) into PDF. With functions like that I may be able to solve my problem but then I can't attach the XML anymore.
How can I convert the data of the internal table into a PDF file to attach it to the E-Mail in the same way I do with the XML?

There are multiple way to create PDF:
Create report with Smartform and get output in PDF format. Sample code
If your system has adobe form license create with adobe form.
Use zcl_pdf class for creating native pdf file.
Using CONVERT_ABAPSPOOLJOB_2_PDF FM for getting printer spool as pdf (thanks #Sandra Rossi).
If your PDF is simple (not include complex table, vertical text, images, etc) use third option, otherwise try first or second.

Novacode LineChart type

I have a code that implements a Novacode.LineChart. And the LineChart type which is shown by default is this one:
But I dont want this type of chart, I want it without points, like this:
This is the code where I create the chart:
LineChart c = new LineChart();
c.AddLegend(ChartLegendPosition.Bottom, false);
c.Grouping = Grouping.Stacked;
Anyone knows how can I hide thoose points and show only the lines? Thanks to everyone!!

Your question is shown up while I was searching for the exact same feature. It's probably a bit late but I hope it would be useful for other people in need of this feature.
My so called answer is not more than a few lines of dirty and unmanageable hack so unless you are not in dire need, I do not recommend to follow this way.
I also do not know if is it an approved approach here but I prefer to write the solution step by step so it may help you to grasp the concept and use better methods.
After I have realized that I was unable to use DocX to create a line chart without markers, using currently provided API, I wanted to know what were the differences between actual and desired output. So I saved a copy of .docx file with line chart after I manually edited the chart to expected result.
Before and after the edit
As you may already know, a .docx is a container format and essentially comprised of a few different folders and files. You can open it up with a .zip archive extractor. I used 7-Zip for this task and found chart file at location of /word/charts/chart1.xml but this may differ depending on the file, but you can easily figure it out.
Compared both of chart1.xml files and the difference was, the file without the markers had and extra XML tag with an additional attribute;
<c:marker>
<c:symbol val="none" />
</c:marker>
I had to somehow add this segment of code to chart. I added these up to example code provided by DocX. You can follow up from: DocX/ChartSample.cs at master
This is where the fun begins. Easy part first.
using System.Xml;
using System.Xml.Linq;
using Xceed.Words.NET;
// Create a line chart.
var line_chart = new LineChart();
// Create the data.
var PlaceholderData = ChartData.GenerateRandomDataForLinechart();
// Create and add series
var Series_1 = new Series("Your random chart with placeholder data");
Series_1.Bind(PlaceholderData, "X-Axis", "Y-Axis");
line_chart.AddSeries(Series_1);
// Create a new XmlDocument object and clone the actual chart XML
XmlDocument XMLWithNewTags = new XmlDocument();
XMLWithNewTags.LoadXml(line_chart.Xml.ToString());
I've used XPath Visualizer Tool to determine the XPath query, which is important to know because you can't just add the marker tag to somewhere and expect it to work. Why do I tell this? Because I appended marker tag on a random line and expected it to work. Naive.
// Set a namespace manager with the proper XPath location and alias
XmlNamespaceManager NSMngr = new XmlNamespaceManager(XMLWithNewTags.NameTable);
string XPathQuery = "/c:chartSpace/c:chart/c:plotArea/c:lineChart/c:ser";
string xmlns = "http://schemas.openxmlformats.org/drawingml/2006/chart";
NSMngr.AddNamespace("c", xmlns);
XmlNode NewNode = XMLWithNewTags.SelectSingleNode(XPathQuery, NSMngr);
Now create necessary tags on newly created XML Document object with specified namespace
XmlElement Symbol = XMLWithNewTags.CreateElement("c", "symbol", xmlns);
Symbol.SetAttribute("val", "none");
XmlElement Marker = XMLWithNewTags.CreateElement("c", "marker", xmlns);
Marker.AppendChild(Symbol);
NewNode.AppendChild(Marker);
And we should copy the contents of latest changes to actual XML object. But oops, understandably it is defined as private so it is a read-only object. This is where I thought like "Okay, I've fiddled enough with this. I better find another library" but then decided to go on because reasons.
Downloaded DocX repo, changed this line to
get; set;
recompiled, copied Xceed.Words.NET.dll to both projectfolder/packages and projectfolder/projectname/bin/Debug folder and finally last a few lines were
// Copy the contents of latest changes to actual XML object
line_chart.Xml = XDocument.Parse(XMLWithNewTags.InnerXml);
// Insert chart into document
document.InsertChart(line_chart);
// Save this document to disk.
document.Save();
Is it worth it? I'm not sure but I have learned a few things while working on it. There're probably lots of bad programming practises in this answer so please tell me if you see one. Sorry for meh English.

PDF Parsing with SWIFT

I want to parse a PDF that has no images, only text. I'm trying to find pieces of text. For example to search the string "Name:" and be able to read the characters after ":".
I'm already able to open a PDF, get the number of pages, and to loop on them. The problem is when I want to use functions like CGPDFDictionaryGetStream or CGPDFStreamCopyData, because they use pointers. I have not found much info on the internet for swift programmers.
Maybe the easiest way would be to parse all the content to an NSString. Then I could do the rest.
Here my code:
// Get existing Pdf reference
let pdf = CGPDFDocumentCreateWithURL(NSURL(fileURLWithPath: path))
let pageCount = CGPDFDocumentGetNumberOfPages(pdf);
for index in 1...pageCount {
let myPage = CGPDFDocumentGetPage(pdf, index)
//Search somehow the string "Name:" to get whats written next
}

You can use PDFKit to do this. It is part of the Quartz framework and is available on both iOS and MacOS. It is also pretty fast, I was able to search through a PDF with over 15000 characters in just 0.07s.
Here is an example:
import Quartz
let pdf = PDFDocument(url: URL(fileURLWithPath: "/Users/...some path.../test.pdf"))
guard let contents = pdf?.string else {
print("could not get string from pdf: \(String(describing: pdf))")
exit(1)
}
let footNote = contents.components(separatedBy: "FOOT NOTE: ")[1] // get all the text after the first foot note
print(footNote.components(separatedBy: "\n")[0]) // print the first line of that text
// Output: "The operating system being written in C resulted in a more portable software."
You can also still access most of (if not all of) the properties you had before. Such as pdf.pageCount for the number of pages, and pdf.page(at: <Int>) to get a specific page.

This is a pretty intensive task. There are libs like PDFKitten which are not maintained anymore. Here is a port of PDFKitten to swift that i did, with some modifications to the way the string searching / content indexing is done, as well as support for truetype fonts.
https://github.com/SimpleApp/PDFParser
[disclaimer : lib author]
[second disclaimer: this lib is 100% mit open sourced. The library has nothing to do with the company, it's not an ad or even a product, i'm posting this comment to help people, and then maybe grow a community around it, because it's a very common requirement and nothing free works well enough]
EDIT : the reason it's a pretty intensive task (not to mention all the character encoding issues), is that the PDF format doesn't have the notion of a "line of text" or even a "word". All it has is character printing instruction. Which means that if you want to find a "word", you'll have to recompute the frame of every blocks of character, using font information, and find the ones can be coalesced into a single word.
That's the reason why you won't find a lot of libraries doing those kind of features, and even some big project fail sometimes at providing correct copy/paste or text search features.

ITextSharp - Adding a Watermark; Leaving the PDF editable - FormFlattening = false

We have a large amount of PDF files that we are creating a web site to allow users to download them and when they do we want to:
Put a watermark on it with their name.
We want the form fields to be left open so they can enter their information.
We want to be able to print and save the document
When I put the Watermark on the document and then open it I get a message from Adobe:
"The document has been changed since it was created and use of extended features is no longer available. Please contact the author..."
According the book "iText-in-Action", this is a security issue (Chapter 8). There seems to be 2 ways to open them:
Remove usage rights : This breaks # 3 above.
Open it in append mode : It does not matter if modify it and save it with "FormFlattening = false" or true, if I put a water mark on the form the fields are no longer editable.
The error message from Adobe does describe the problem, I have modified the content of the document with the watermark, and the form fields become blocked because of this.
I have tried opening the document putting the watermark on it and saving it to a new file, and then closing it. Then reopening it and the trying to unblock the form fields, but it does not work.
Does anyone know if this is possible?
I have read something about templates; I don't know if this is a solution because of the work to convert the documents to templates? Does anyone know if this would help?
Below is a sample of my code for using an Image as a watermark, although I have tried adding text as well:
PdfReader reader = new PdfReader(sourceFile.FullName);
//reader.RemoveUsageRights();
var fileStream = new FileStream(outputPath, FileMode.Create, FileAccess.ReadWrite);
PdfStamper pdfStamper = new PdfStamper(reader, fileStream, '\0', true);
Image image = Image.GetInstance(imagePath);
image.SetAbsolutePosition(250, 300);
for (int i = 1; i <= reader.NumberOfPages; i++) // Must start at 1 because 0 is not an actual page.
{
PdfContentByte pdfPageContents = pdfStamper.GetUnderContent(i);
pdfPageContents.AddImage(image);
}
pdfStamper.FormFlattening = false; // enable this if you want the PDF flattened.
//bool have = pdfStamper.PartialFormFlattening("test");
pdfStamper.Close(); // Always close the stamper or you'll have a 0 byte stream.

A document that is Reader-enabled is digitally signed using a private key owned by Adobe. If Adobe Reader can validate that signature using Adobe's public key, the extra functionality (e.g. allowing you to save a form that has been filled out) is enabled.
Adding a watermark isn't part of the actions you're allowed to do with a digitally signed document. There is absolutely no way you can achieve what you want without invalidating the digital signature that triggers the reader enabling.
In short: you're trying to do something that is impossible. You can only achieve this by using Adobe software because you need Adobe's private key to 'restore' the reader enabling after breaking it.

Good advice.
Every time I reboot my computer Adobe complains about needing to be updated. Last thing I want is to be stuck with a hack, that may not work in the future.
One of my attempts was to create a watermark on a different layer of the PDF in the hopes that it would not see it as changing the text layer of the PDF, but this did not work. My boss had a thought of grabing the text from the original PDF and coping it to a new document and then putting the watermark on it. even though I created a new PDF it still sees it as modifying it, and the fields are still not editable.
Still stuck

generic text reading

I am working on a project where I need to read some generic text...I am looking for any api by I can read generic text and also can convert it to .csv file...
Can any one plz help...
using java on windows os...
--------------------------MORE Detail---------------------------------------------------------------------------------------
let me clarify:
Assume I have a pdf document or for that matter any file type document. I intend to use Print to Generic text printer option and get the file in that format.Finally, I intend to use some API which shoudl enable me to programatically read this Generic Text Format file. I intend to extract text from this generic text file.
So, be it any file (.doc/.pdf/.xls etc wtatever), I intend to create a Generic Text Format file using print option. Then run my code to read those files and extract some information.
PS: Assume that I have a Status report form with standard fields. Ok. But, some people might submit in .pdf, some in .doc , some in text format. But, every document contains same fields, but probably with diferent layouts.
Now, I am looking for a generic solution, by which i shoudl be able to convert every file type in to generic text file format and then apply some logic to extract my Status report fields.

In Java this is more or less what you need to read a text file, assuming it's comma separated (just change the string in the "line.split" method if you need something else). It also skips the header.
public void parse(String filename) throws IOException {
File file = new File(filename);
FileInputStream fis = new FileInputStream(file);
InputStreamReader isr = new InputStreamReader(fis);
BufferedReader br = new BufferedReader(isr);
String line;
int header = 1;
while ((line = br.readLine()) != null) {
if (header == 1) {
header = 2;
continue; // skips header
}
String[] splitter = line.split(",");
// do whatever
System.out.println(splitter[0]);
}
}

CSV is a format for data in columns. It's not very useful for, say, a Wikipedia article.
The Apache Tika library will take all kinds of data and turn it into bland XML, from which you can make CSV as you like.
It would help if you would edit your question to clarify 'generic' versus' generated', and tell more about the data.
As for Windows printer drivers, are you looking to do something like 'print to pdf' as 'print to csv'? If so, I suspect that you need to start from MSDN samples of printer drivers and code this the hard way.
The so-called 'generic text file format' is not a structured format. It's completely unpredictable what you will find in there for any given input to the printer system.

A generic free book: Text Processing in Python

Just used the standard Java classes for I/O:
BufferedWriter, File, FileWriter, IOException, PrintWriter
.csv is simply a comma-separated values file. So just name your output file with a .csv extension.
You'll also need to figure out how you'd like to split your content.
Here are Java examples to get you going:
writing to a text file
how to read lines from a file

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse