How can I extract the first paragraph of a PDF document using Perl's CAM::PDF? - perl

print CAM::PDF->new('file.pdf')->getPageText(1);
will get you all of the text from the first page. However, CAM::PDF is definitely not the best tool for this particular job (I'm the author). I added text extraction on a whim, just to see if I could do it.

Plain PDF really is not a markup language. Text is drawn at specific locations. There is something called Tagged PDF and if your documents are tagged, your job might be easier.
I would be inclined to run the documents through a PDF-to-text translator and grab the first chunk of text out of that, provided the text is stored as text in your PDF and not as images.
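The "convert to text, then grab the first chunk" idea can be sketched as follows. This is only a rough illustration, assuming the pdftotext command-line tool (from Poppler/Xpdf) is installed and that paragraphs in its output are separated by blank lines; the function names are made up for this example. Note that a heading or page header may well come out as the "first paragraph".

```python
import re
import subprocess


def first_paragraph(text: str) -> str:
    """Return the first non-empty block of text, with line breaks collapsed."""
    # Assumption: one or more blank lines (or a form feed) separate paragraphs.
    for block in re.split(r"\n\s*\n", text.replace("\f", "\n\n")):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if lines:
            return " ".join(lines)
    return ""


def first_paragraph_of_pdf(path: str) -> str:
    """Extract page 1 with pdftotext and return its first paragraph."""
    # "-f 1 -l 1" limits output to the first page; "-" writes to stdout.
    result = subprocess.run(
        ["pdftotext", "-f", "1", "-l", "1", path, "-"],
        capture_output=True, text=True, check=True,
    )
    return first_paragraph(result.stdout)
```

Calling first_paragraph_of_pdf('file.pdf') would only return something useful if the PDF actually stores text rather than scanned images.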

Related

How do I automate converting PDF to HTML?

I work for a publisher and am trying to extract content from our fully laid out PDFs. I've tried pdftohtml, pdftotext, pdfminer, and other Python-based approaches to getting the content, as well as saving to Word, HTML, XML, etc. from the original Acrobat files.
I don't need just the text, I also need the text formatting. That's because, for example, I need all the blue text in the document.
When I save to HTML, Word, etc. from Acrobat, the resulting files contain screenshots of the pages, not the laid out text. When I extract text using different Python modules I get the text but lose the text formatting.
The only solution I've found is to manually copy and paste from the PDF into a Word document, then save it as HTML. I'm hoping to automate this.
Why does copying from Acrobat into Word achieve what I can't do by other means? Has anybody come across this problem before?
Maybe you can consider another method. The software at https://pdfapi.codeplex.com/ can convert PDF files to HTML directly via MVS. If you are able to use MVS, I think it would be useful for converting the text in your PDF files to HTML while keeping the formatting. Of course, this is just a suggestion; give it a try.

How to read pdf table content data?

I have a requirement to read a PDF file that contains only tabular data, like an Excel file. I need to extract the cell values from the given PDF file.
Is this possible in any way using the iText API? If you have something to share, or know of any other solution, please share it.
The PDF format is just a canvas where text and graphics are placed without any structure information. As such there aren't any iText-objects in a PDF file. In each page there will probably be a number of Strings, but you can't reconstruct a phrase or a paragraph using these strings. There are probably a number of lines drawn, but you can't retrieve a Table-object based on these lines.
In short: parsing the content of a PDF-file is NOT POSSIBLE with iText.
You can try this! This lets you read PDF pages.
I recently ran into this problem. I wasn't able to make it work with iText.
An alternate solution I found was to open the PDF document in Adobe and export it to XML. At least with my PDFs, it preserved the table information, and I was then able to work with the XML programmatically to generate tabular files such as Excel.
The other issue I ran into was that Adobe only lets you export one file at a time, and I had lots of files. Luckily, Adobe also has a merge function. I ended up merging all the files together, exporting them as one big XML file, and working with that file to generate what I needed.
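Working with the exported XML might look roughly like the sketch below. The element names (Table, TR, TH, TD) are an assumption about what the export contains; check them against your own file, since the answer above does not say what the XML looked like. The function names are hypothetical, and nested tables are not handled.

```python
import csv
import io
import xml.etree.ElementTree as ET


def tables_from_xml(xml_text: str) -> list:
    """Return every table as a list of rows, each row a list of cell strings."""
    root = ET.fromstring(xml_text)
    tables = []
    for table in root.iter("Table"):      # assumed tag name
        rows = []
        for tr in table.iter("TR"):       # assumed tag name
            # itertext() also picks up text nested inside a cell, e.g. in <P>;
            # split/join collapses runs of whitespace.
            cells = [" ".join("".join(cell.itertext()).split())
                     for cell in tr if cell.tag in ("TH", "TD")]
            rows.append(cells)
        tables.append(rows)
    return tables


def table_to_csv(rows: list) -> str:
    """Serialize one table as CSV text, which Excel can open."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

With one big merged export, each table found could be written to its own CSV file.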

Modifying the text in a PDF document

I am adding a single line of info to the end of the first page of a PDF document using the PdfStamper class. Now I need to update that info periodically. How can I modify the text I stamped previously? Is there a way to do it?
Thanks.
See this post from Bruno Lowagie (the creator of iText) about PDF not being a word processor. In that post he talks about using forms instead to accomplish what you are looking for, which is one route you can go down.
The second route, which I'd recommend, is just having two PDFs. Have your base PDF that you open, write to and save to your output PDF. When you need to update the PDF, delete the output PDF, re-open the base PDF, write your new text and save it to the output PDF again. This accomplishes your goals without having to edit anything.
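The two-file workflow could be sketched like this. It is a language-neutral illustration written in Python, not iText code: the stamp argument is a hypothetical callable standing in for whatever actually writes the line (PdfStamper in the question), and regenerate is a name made up for this example.

```python
import os
import shutil
import tempfile


def regenerate(base_path, output_path, stamp, text):
    """Rebuild output_path from the untouched base file plus fresh text."""
    # Work on a temporary copy so a failed run never leaves a half-written output.
    out_dir = os.path.dirname(os.path.abspath(output_path))
    fd, tmp_path = tempfile.mkstemp(dir=out_dir)
    os.close(fd)
    try:
        shutil.copyfile(base_path, tmp_path)
        stamp(tmp_path, text)              # hypothetical: adds the line to the copy
        os.replace(tmp_path, output_path)  # replaces any previous output
    finally:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```

Because every run starts again from the base file, the old stamped text never has to be found or edited; it simply is not there in the new output.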

PDF to Jasper XML

I have a PDF file (softcopy) that was created using iText. My company has now decided to use JasperReports for the new release. I need to use that PDF file to design a JasperReports template and populate it with data.
Is there any plugin for JasperReports that can convert from PDF to JasperReports JRXML, or what else do I need to do? Any suggestions?
A PDF is a description of how to render a document on a page. Things
like "draw a vertical line here", "write 'foo bar baz' here in
Courier". It does not contain any information about the format or
organisation of the stuff it is rendering. You won't be able to tell
that you're looking at a table, or a list of bullet points, or a
paragraph, or anything like that.
The PDF format does contain information on a page-by-page basis.
Therefore, page breaks are the one piece of format/organisation
information that you can find.
If you want anything more than a raw stream of completely unformatted,
disorganised text, one per page, you are out of luck. It's virtually
impossible.
from javaranch
You can use http://xmlprinter.com/ and then use an XSLT to transform the resulting XML into the desired JRXML.
I'm working on it. If I finish it, I will post the result on GitHub or some other public and open place.
Good luck

Converting large amounts of text and dynamic data into PDF

I have a three page Word document that needs to be converted into PDF. This Word document was given to me as a template to show me what the PDF output should look like. I tried converting this document into PDF, created a PDF form and used iTextSharp to open the form, populate it with data and return it back to the client. This is all great but due to large amounts of data stored, the placeholders were insufficient and the text would be truncated or hidden.
My second attempt was to create an MVC 2 View without master page, pass the model to the view, take the HTML representation of the View, pass it over to iTextSharp and render the PDF. The problem here was that iTextSharp failed on some tags (one of them was <hr> tag). I managed to get rid of the problematic tag, but then tables were not rendered properly. Namely, the border attribute was ignored so I ended up with borderless tables. That attempt failed.
I need a suggestion or advice on the most efficient way to create a PDF document in MVC 2 which would be maintainable in the long run. I really don't want my actions to be 200+ lines long. Working directly with the Word document is not the best solution as I have never worked with VSTO so I don't quite know what it would look like to open Word and manipulate text inside of it and add dynamic data and then convert that dynamically into PDF.
Any suggestion is highly welcome.
Best regards!
One thing that I've done in the past is to save the Word file as a DOCX and unzip it, since a DOCX is just a renamed zip file. Within the archive, open up /word/document.xml and you'll see your document. There are a lot of weird XML tags in there, but overall you should get a pretty good idea of where your content is. Then just add placeholder text like {FIRST_NAME}, save the file and re-zip.
Then from code you can just perform the same steps, unzipping with something like SharpZipLib or DotNetZip, swapping placeholder copy, re-zipping and then using very simple Word automation to Save-As a PDF.
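The unzip, swap and re-zip steps can be sketched as follows. The answer above names SharpZipLib and DotNetZip for .NET; this illustration swaps in Python's standard zipfile module to show the same idea. fill_docx is a hypothetical name, and the code assumes word/document.xml is UTF-8 and that each placeholder sits in one unbroken piece of text (Word sometimes splits typed text across several runs, which would break a plain string replace).

```python
import zipfile
from xml.sax.saxutils import escape


def fill_docx(template_path, output_path, values):
    """Copy a DOCX, replacing {PLACEHOLDER} tokens in word/document.xml."""
    with zipfile.ZipFile(template_path) as src, \
         zipfile.ZipFile(output_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "word/document.xml":
                xml = data.decode("utf-8")
                for key, value in values.items():
                    # Escape &, < and > so the result stays well-formed XML.
                    xml = xml.replace("{" + key + "}", escape(value))
                data = xml.encode("utf-8")
            # Every other archive member is copied through unchanged.
            dst.writestr(item, data)
```

For example, fill_docx('template.docx', 'filled.docx', {'FIRST_NAME': 'Ann'}) would produce a filled copy that can then be saved as PDF in a separate step.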
The other route is to fully utilize iTextSharp and actually write Paragraphs, PdfPTables and everything else yourself. It takes a lot longer to set up but would give you the most control.
Q: You say "... but due to large amounts of data stored, the placeholders were insufficient and the text would be truncated or hidden".
How do you end up having too much data? If the Word template can "hold" the data in 3 pages, it should fit in 3 PDF pages.
I used to use iTextSharp to create my PDFs, but I almost always ended up building the PDF document from scratch myself (not really a <200 line solution). Have you considered another library? I recently switched to MigraDoc's PDFSharp. It is way simpler to use than iText, with lots of examples and documentation.
Just my two cents
The Word document object model is quite easy to understand. The document body contains a series of paragraphs and tables. Using the Open XML SDK, you can iterate through each paragraph or table in the Word document and retrieve its content and styles. Then you can generate the PDF document on the fly using that retrieved information. This will work under MVC too.
But if your Word document contains complex elements, it will take more time to implement this approach. Also, this approach would only work with Word 2007 and 2010 files.
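To give an idea of what iterating over paragraphs and tables involves, here is a minimal sketch that reads word/document.xml directly with Python's standard XML parser instead of the Open XML SDK. The function names are made up for this example, and it only collects text; styles, nested tables and other complex elements are left out.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def _text(element) -> str:
    # Text lives in <w:t> elements inside runs.
    return "".join(t.text or "" for t in element.iter(W + "t"))


def body_blocks(document_xml: str) -> list:
    """Return ("paragraph", text) and ("table", rows) tuples in document order."""
    body = ET.fromstring(document_xml).find(W + "body")
    blocks = []
    for child in body:
        if child.tag == W + "p":
            blocks.append(("paragraph", _text(child)))
        elif child.tag == W + "tbl":
            rows = [[_text(tc) for tc in tr.findall(W + "tc")]
                    for tr in child.findall(W + "tr")]
            blocks.append(("table", rows))
    return blocks
```

Each block could then be handed to whatever PDF library writes the output, for example as a paragraph or a table object.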
Also, as far as I know, the HTML-to-PDF options currently available in the iTextSharp library only work with a known set of tags.
Another suggestion is to make use of commercially available .NET components. There are a lot of good solutions available, for example Syncfusion.