Read PDF files in Perl

I have a collection of pdfs and would like to read those pdfs through a perl program. I just want to read and print those pdfs. Is there any module available in perl to do this?

If you want to automate printing, I would recommend using the command line interface of Acrobat Reader. If you want to parse the content, I would use CAM::PDF, but the results depend strongly on the PDF.
The command line parameters for printing are
AcroRd32.exe /t filename printername drivername portname
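A minimal sketch of driving that command line from Perl, one file per call. The install path and the printer, driver and port names are placeholders, not values from the answer, and print_command is a hypothetical helper.

```perl
use strict;
use warnings;

# Hypothetical values: adjust to the local installation and printer.
my $acrobat = 'C:/Program Files (x86)/Adobe/Reader/AcroRd32.exe';

# Build the argument list for:
#   AcroRd32.exe /t filename printername drivername portname
sub print_command {
    my ($exe, $file, $printer, $driver, $port) = @_;
    return ($exe, '/t', $file, $printer, $driver, $port);
}

for my $file (@ARGV) {
    my @cmd = print_command($acrobat, $file, 'MyPrinter', 'MyDriver', 'MyPort');
    # List form of system() passes each argument separately,
    # which avoids quoting problems with spaces in paths.
    system(@cmd) == 0 or warn "Printing $file failed: $?\n";
}
```

Run it with the PDF files as arguments; with no arguments it does nothing.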

Depending exactly what you want to do with the PDFs and their contents (that is, do you want a perl module which essentially replaces Acrobat Reader, or do you just want to extract and print the text of the documents) CPAN might provide what you want as it contains quite a few modules related to PDFs: CPAN Search of PDF modules

Related

Index PDF files and generate keywords summary

I have a large amount of PDF files in my local filesystem I use as documentation base and I would like to create an index of these files.
I would like to :
Parse the contents of the PDF files to get keywords.
Select the most relevant keywords to make a summary.
Create static HTML pages for some keywords with entries linked to the appropriate files.
My questions are :
Is there an existing tool to perform the whole job ?
What is the most appropriate tool to parse the content of PDF files, filter the words (by length) and count them?
I consider using Perl, swish-e, pdfgrep to make a script. Do you know other tools which could be useful?
Given that points 2 and 3 seem custom, I'd recommend writing your own script: call a tool from it to parse the PDFs, process the output as you please, and write the HTML (perhaps using another tool).
Perl is well suited for that, since it excels at the kind of text processing you'll need and also supports all kinds of file formats via modules.
As for reading PDFs, here are some options if your needs aren't too elaborate:
Use the CAM::PDF (and CAM::PDF::PageText) or PDF::API2 modules
Use pdftotext from the poppler library (probably in the poppler-utils package)
Use pdftohtml with the -xml option, and read the generated simple XML file with XML::LibXML or XML::Twig
The last two are external tools which you use via Perl's builtins like system.
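As a rough sketch of the pdftotext route, the script below opens a pipe to the tool instead of using system, so the extracted text comes back as a string. It assumes pdftotext is on the PATH; the helper names pdftotext_command and pdf_to_text are made up for illustration.

```perl
use strict;
use warnings;

# Argument list for pdftotext; the trailing '-' sends the text to stdout.
sub pdftotext_command {
    my ($pdf) = @_;
    return ('pdftotext', '-layout', $pdf, '-');
}

# Run pdftotext and return everything it prints as one string.
sub pdf_to_text {
    my ($pdf) = @_;
    open(my $fh, '-|', pdftotext_command($pdf))
        or die "Cannot run pdftotext: $!\n";
    local $/;                       # slurp mode: read all output at once
    my $text = <$fh>;
    close($fh) or die "pdftotext failed on $pdf (status $?)\n";
    return defined $text ? $text : '';
}

print pdf_to_text($_) for @ARGV;
```

The -layout flag asks pdftotext to keep the physical layout of the page; drop it if reading order matters more than column alignment.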
The following text processing, to build your summary and design the output, is precisely what languages like Perl are for. The couple of tasks that are mentioned take a few lines of code.
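For instance, filtering words by length and counting them could look roughly like this. The sample text, the minimum length of 4 and the helper names are illustrative assumptions; only ASCII letters are treated as word characters.

```perl
use strict;
use warnings;

# Count words of at least $min_len characters, case-insensitively.
sub keyword_counts {
    my ($text, $min_len) = @_;
    my %count;
    for my $word (lc($text) =~ /[a-z]+/g) {
        next if length($word) < $min_len;
        $count{$word}++;
    }
    return \%count;
}

# Return the $n most frequent words, ties broken alphabetically.
sub top_keywords {
    my ($count, $n) = @_;
    my @sorted = sort { $count->{$b} <=> $count->{$a} || $a cmp $b } keys %$count;
    $n = @sorted if $n > @sorted;
    return @sorted[0 .. $n - 1];
}

my $sample = "Perl reads PDF text. Perl counts words; perl filters short words.";
my $counts = keyword_counts($sample, 4);
print "$_: $counts->{$_}\n" for top_keywords($counts, 3);
```

A real summary would probably also skip a stop-word list ("this", "that", "with" and so on), which is one more hash lookup inside the loop.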
Then write out HTML, either directly if simple or using a suitable module. Given your purpose, you may want to look into HTML::Template. Also see this post, for example.
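Writing the HTML directly can be as small as the sketch below, which builds one static page per keyword with links to the matching files. The function names, page layout and file names are invented for the example, and the links are only HTML-escaped, not URL-encoded.

```perl
use strict;
use warnings;

# Escape the characters that are special in HTML text and attributes.
sub escape_html {
    my ($s) = @_;
    my %ent = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');
    $s =~ s/([&<>"])/$ent{$1}/g;
    return $s;
}

# Build one static page for a keyword, linking to the files that mention it.
sub keyword_page {
    my ($keyword, @files) = @_;
    my $title = escape_html($keyword);
    my $items = join '', map {
        my $f = escape_html($_);
        qq{  <li><a href="$f">$f</a></li>\n};
    } @files;
    return "<html><head><title>$title</title></head><body>\n"
         . "<h1>$title</h1>\n<ul>\n$items</ul>\n</body></html>\n";
}

print keyword_page('perl', 'docs/intro.pdf', 'docs/regex.pdf');
```

Once the pages need shared headers, navigation or loops over nested data, a template module such as HTML::Template keeps the markup out of the code.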
Full parsing of PDF may be infeasible, but if the files aren't too complex it should work.
If your process for selecting keywords and building statistics is fairly common, there are integrated tools for document management (search for bibliography managers). However, I think that most of them resort to external tools to parse pdf so you may still be better off with your own script.

Perl Extracting XML Tag Attribute Using Split Or Regex

I am working on a file upload system that also parses the files that are uploaded and generates another file based on info inside the file uploaded. The files being uploaded as XML files. I only need to parse the first XML tag in each file and only need to get the value of the single attribute in the tag.
Sample XML:
<LAB title="lab title goes here">...</LAB>
I am looking for a good way of extracting the value of the title attribute using the Perl split function or using Regex. I would use a Perl XML parser if I had the ability to install Perl modules on the server I am hosting my code on, however I do not have that ability.
This XML is located in an XML file, that I am opening and then attempting to parse out the attribute value. I have tried using both Split and Regex to no luck. However, I am not very familiar with Perl or regular expressions.
This is the basic outline of my code so far:
open(LAB, "<", "path-to-file-goes-here") or die "Unable to open lab.\n";
foreach my $line (<LAB>) {
    my @pieces = split(/"(.*)"/, $line);
    foreach my $piece (@pieces) {
        print "$piece\n";
    }
}
I have tried using split to match against title alone using
/title/
Or match against the = character or the " character using
/\=/ or /\"/
I have also tried doing similar things using regex and have had no luck as well. I am not sure if I am just not using the proper expression or if this is not possible using split/regex. Any help on the matter would be much appreciated, as I am admittedly a novice at Perl still. If this type of question has been answered elsewhere, I apologize. I did some searching and could not find a solution. Most threads suggest using an XML parsing Perl module, which I would if I had the privileges to install them.
"But I can't use CPAN" is a quick way to get yourself downvoted on the Perl tag (though it wasn't I who did so). There are many ways that you can use CPAN, even if you don't have root. In fact you can have your own Perl even if you don't have root. While I highly recommend some of those options, for now the easiest way to do this is just to download some pure-Perl modules and include them in your codebase. Mojolicious has a very small, but very useful XML/DOM parser called Mojo::DOM, which is a likely candidate for this kind of process.
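For the narrow case in the question (one attribute on the first tag), a capture group does the job that split was being bent into. This is only a sketch: it assumes the value is double-quoted and contains no escaped entities, and it is not a substitute for a real parser on arbitrary XML. The name lab_title is made up for the example.

```perl
use strict;
use warnings;

# Return the title attribute of the first <LAB ...> tag, or undef if none.
# Assumes a double-quoted attribute value; this is not a general XML parser.
sub lab_title {
    my ($xml) = @_;
    return $1 if $xml =~ /<LAB\b[^>]*?\btitle\s*=\s*"([^"]*)"/;
    return;
}

if (@ARGV) {
    open(my $fh, '<', $ARGV[0]) or die "Unable to open $ARGV[0]: $!\n";
    local $/;                       # slurp the whole file
    my $title = lab_title(scalar <$fh>);
    print defined $title ? "$title\n" : "No title attribute found\n";
}
```

The pattern anchors on the tag name first, so a title attribute on some other element is not picked up by accident.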

Read excel file without using module

Can I read an excel file without using any module?
I tried like just reading a normal file and it printed binary characters; maybe because of encoding?
But reading csv files is working normally.
Excel files are binary files, and the format of the pre-2007 ones is apparently quite hairy. I believe .xlsx files are actually zipped XML, so unzipping them should yield something human-readable, but I've never tried it. Why do you want to not use a module though?
Some further reading, if you're interested:
http://joelonsoftware.com/items/2008/02/19.html
http://en.wikipedia.org/wiki/Office_Open_XML_file_formats
Can I read an excel file without using any module?
In theory yes. In practice no.
An Excel XLS file is a binary file within a binary file. The first step would be to parse the Excel BIFF data out of the OLE COM document container. This data isn't necessarily in sequential order.
Then you have to parse the Excel BIFF data, allowing for differences between versions, a shared string table with different encodings and CONTINUE blocks that map large data records in a parser unfriendly way.
The Excel XLSX format is a little easier since it is a collection of XML files in a Zip container. However, if you aren't using modules then even that would be a pain.
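The two containers are easy to tell apart from the first bytes of the file, which also explains the "binary characters" seen when printing one as text. A small sketch using only built-ins; excel_container is a hypothetical name, and the check says nothing about whether the file is a valid workbook.

```perl
use strict;
use warnings;

# Classify a file by its first bytes: OLE container (.xls), zip (.xlsx), or other.
sub excel_container {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!\n";
    read($fh, my $head, 8);
    close $fh;
    $head = '' unless defined $head;
    return 'ole' if $head eq "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1";   # OLE compound file
    return 'zip' if substr($head, 0, 4) eq "PK\x03\x04";           # zip local file header
    return 'other';
}

print "$_: ", excel_container($_), "\n" for @ARGV;
```

A CSV file falls into the "other" bucket, which is why reading it line by line works without any module.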
The Perl modules that deal with Excel files represent hundreds of man hours of work. Expect to invest a similar amount of work to avoid them.
And why can't you use modules?
You can try figuring out the format of what an Excel spreadsheet looks like, code for that, and then use that in your program. Maybe write it as a module and submit it to CPAN. Wait a second! There's already a module like that there!
The whole purpose of CPAN is to prevent you from having to reinvent the wheel. You need to read an Excel spreadsheet, and someone has done the hard work to figure out how to do this, and is giving it to you free of charge. A $40,000 value [1], and it's yours for free! The CPAN system makes installing modules fairly simple. You run the cpan command. There's no real reason to avoid modules that can save you hundreds of hours of work.
And what type of modules do you avoid? Is it all modules, or is it only modules that are not included in the standard distribution? I hate to think you don't use things like File::Copy or Data::Dumper just because they're modules, even though they're included by default in most Perl distributions.
[1] Imagine hiring a team to write code to convert an Excel file so it can be read by a Perl program. They'd have to figure out the ins and outs of the file format, code for all sorts of edge cases, and run it through all sorts of tests to make sure it really works. A rough estimate, if we don't include things like charts, embedded content, and remote data access, would be about 200 man-hours, but only because the format has actually been documented.

Trouble reading text from a pdf in Perl

I am trying to read the text content of a pdf file into a Perl variable. From other SO questions/answers I get the sense that I need to use CAM::PDF. Here's my code:
#!/usr/bin/perl -w
use CAM::PDF;
my $pdf = CAM::PDF->new('1950-01-01.pdf');
print $pdf->numPages(), " pages\n\n";
my $text = $pdf->getPageText(1);
print $text, "\n";
I tried running this on this pdf file. There are no errors reported by Perl. The first print statement works; it prints "2 pages" which is the correct number of pages in this document.
The next print statement does not return anything readable. Here's what the output looks like in Emacs:
2 pages
^A^B^C^D^E^C^F^D^G^H
^D^A^K^L^C^M^D^N^C^M^O^D^P^C^Q^Q^C ^D^R^K^M^O^D ^A^B^C^D^E
^F^G^G^H^E
^K^L
^M^N^E^O^P^E^O^Q^R^S^E
.... more lines with similar codes ....
Is there something I can do to make this work? I don't understand pdf files too well, but I thought that because I can easily copy and paste the text from the PDF file using Acrobat, it must be recognized as text and not an image, so I hoped this meant I could extract it with Perl.
Any guidance would be much appreciated.
PDFs can have different kinds of content. A PDF may not have any readable text at all, only bitmaps and graphical content, for example. The PDF you linked to has compressed data in it. Open it with a text editor, and you will see that the content is in a "/Filter/FlateDecode" block. Perhaps CAM::PDF doesn't support that. Google FlateDecode for a few ideas.
Looking further into that PDF, I see that it also uses embedded subsets of fonts, with custom encodings. Even if CAM::PDF handles the compression, the custom encoding may be what's throwing it off. This may help: Web page from a software company, describing the problem
I'm fairly certain that the issue isn't with your perl code, it is with the PDF file. I ran the same script on one of my own PDF files, and it works just fine.
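When processing many files, it can help to flag pages whose extracted text looks like the output in the question (mostly control codes) so they can be routed to another tool. The check below is a heuristic sketch, not part of CAM::PDF; the function names and the 10% threshold are arbitrary assumptions.

```perl
use strict;
use warnings;

# Fraction of characters that are control codes other than tab/newline/CR.
sub control_char_ratio {
    my ($text) = @_;
    return 0 unless length $text;
    my $controls = () = $text =~ /[\x00-\x08\x0B\x0C\x0E-\x1F]/g;
    return $controls / length($text);
}

# Hypothetical threshold: treat more than 10% control codes as "not real text".
sub looks_garbled {
    my ($text) = @_;
    return control_char_ratio($text) > 0.10;
}

print looks_garbled("\x01\x02\x03\x04\x05\x03\x06") ? "garbled\n" : "readable\n";
print looks_garbled("Plain page text\n")             ? "garbled\n" : "readable\n";
```

In a script like the one in the question, the string to test would be the value returned by getPageText for each page.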

How can I search and replace in a PDF document using Perl?

Does anyone know of a free Perl program (command line preferable), module, or any way to search and replace text in a PDF file without using it like an editor?
Basically I want to write a program (in Perl preferably) to automate replacing certain words (e.g. our old address) in a few hundred PDF files. I could use any program that supports command line arguments. I know there are many modules on CPAN that manipulate or create pdfs but they don't have (that I've seen) any sort of simple search and replace.
Thanks in advance for any and all advice!!!
Take a look at CAM::PDF. More specifically the changeString method.
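A hedged sketch of a batch replacement with CAM::PDF. Note that it swaps in the page-content route (getPageContent and setPageContent, then cleanoutput) rather than changeString, because that call sequence is simple to show end to end; replace_literal and replace_in_pdf are names made up here. It only finds text that appears literally in the page content stream, so strings split into fragments or drawn with a custom font encoding will be missed.

```perl
use strict;
use warnings;

# Replace every literal occurrence of $old with $new; returns (new text, count).
sub replace_literal {
    my ($content, $old, $new) = @_;
    my $n = ($content =~ s/\Q$old\E/$new/g) || 0;
    return ($content, $n);
}

# Rewrite each page's content stream and save the result to $outfile.
sub replace_in_pdf {
    my ($infile, $outfile, $old, $new) = @_;
    require CAM::PDF;               # loaded lazily; must be installed from CPAN
    my $pdf = CAM::PDF->new($infile) or die "Cannot read $infile\n";
    for my $page (1 .. $pdf->numPages()) {
        my ($content, $n) = replace_literal($pdf->getPageContent($page), $old, $new);
        $pdf->setPageContent($page, $content) if $n;
    }
    $pdf->cleanoutput($outfile);
}

# Usage: script.pl in.pdf out.pdf "old text" "new text"
replace_in_pdf(@ARGV) if @ARGV == 4;
```

Writing to a separate output file keeps the originals intact, which matters when a few hundred documents are changed in one run.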
How did you generate those PDFs in the first place? Doing the search-and-replace in the original sources and regenerating the PDFs seems more viable. Directly editing PDFs can be very difficult, and I'm not aware of any free tools that can do it easily.