How can I search and replace in a PDF document using Perl?

Does anyone know of a free Perl program (command line preferred), module, or any other way to search and replace text in a PDF file without using it like an editor?
Basically I want to write a program (in Perl, preferably) to automate replacing certain words (e.g. our old address) in a few hundred PDF files. I could use any program that supports command line arguments. I know there are many modules on CPAN that manipulate or create PDFs, but none of them (that I've seen) offer any sort of simple search and replace.
Thanks in advance for any and all advice!!!

Take a look at CAM::PDF. More specifically, the changeString method.
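If changeString doesn't do what you need, a common fallback with CAM::PDF is to run a regex over each page's content stream. A minimal sketch, with the file names and address strings as placeholders; be warned that PDF generators often split a phrase across several text-drawing operators, so a naive substitution like this can miss matches:

use strict;
use warnings;
use CAM::PDF;

my $pdf = CAM::PDF->new('old.pdf') or die "$CAM::PDF::errstr\n";
for my $page (1 .. $pdf->numPages()) {
    my $content = $pdf->getPageContent($page);    # raw page content stream
    $content =~ s/Old Address/New Address/g;      # hypothetical strings
    $pdf->setPageContent($page, $content);
}
$pdf->cleanoutput('new.pdf');                     # clean up and write out

Run it once on a sample file and diff the extracted text (e.g. with pdftotext) to confirm the replacement actually landed.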

How did you generate those PDFs in the first place? Searching and replacing in the original sources and re-generating the PDFs seems more viable. Editing PDFs directly can be very difficult, and I'm not aware of any free tools that do it easily.

Related

Merge 2 pdf files and preserve forms

I'd like to merge at least 2 PDF files into one while preserving all the form elements in the original PDFs. The form elements include text fields, radio buttons, check boxes, drop down menus and others. Please have a look at this sample PDF file with forms:
http://foersom.com/net/HowTo/data/OoPdfFormExample.pdf
Now try to merge it with any other arbitrary PDF file.
Can you do it?
EDIT: As for the implementation, I'd ideally prefer a command line solution on a Linux platform using open source tools such as 'ghostscript', or any other tool that you think is appropriate for this task.
Of course, everybody is welcome to supply any working solution to this problem, including a coded solution that involves writing a script which makes API calls to a PDF-processing library. However, I'd suggest taking the path of least resistance first (a command line solution).
Best Regards
EDIT #2: Well, there are indeed several command line tools that merge PDFs. However, AFAIK, these tools don't seem to preserve the forms in the original PDFs! They appear to simply concatenate the printouts of all those PDFs into a single printout, which is then presented as a single PDF.
Furthermore, if you print a PDF file with forms to a file, you lose all the forms in it. This is clearly not what I'm looking for.
I have found success using pdftk, an open-source tool that runs on Linux and can be called from your terminal.
To concatenate multiple pdfs into one (and preserve form-fillable elements), you can use the following command:
pdftk input1.pdf input2.pdf cat output output-file.pdf
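If you need to script this over many files, here is a hedged sketch of calling pdftk from Perl (the paths are placeholders, and it assumes pdftk is on your PATH):

use strict;
use warnings;
my @inputs = glob 'forms/*.pdf';    # hypothetical input set
system('pdftk', @inputs, 'cat', 'output', 'merged.pdf') == 0
    or die "pdftk failed: $?";

One caveat: if two inputs contain form fields with the same name, the merged fields may end up linked to a single value, so renaming fields beforehand can be necessary.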

Index PDF files and generate keywords summary

I have a large number of PDF files in my local filesystem that I use as a documentation base, and I would like to create an index of these files.
I would like to:
Parse the contents of the PDF files to get keywords.
Select the most relevant keywords to make a summary.
Create static HTML pages for some keywords, with entries linked to the appropriate files.
My questions are:
Is there an existing tool to perform the whole job?
What is the most appropriate tool to parse the PDF files' content, filter words (by length), and count them?
I am considering using Perl, swish-e and pdfgrep to make a script. Do you know of other tools which could be useful?
Given that points 2 and 3 seem custom, I'd recommend writing your own script: use an external tool from it to parse the PDFs, process the output as you please, and write the HTML (perhaps using another tool).
Perl is well suited for that, since it excels at the kind of text processing you'll need and also provides support for working with all kinds of file formats via modules.
As for reading the PDFs, here are some options if your needs aren't too elaborate:
Use the CAM::PDF (with CAM::PDF::PageText) or PDF::API2 modules
Use pdftotext from the poppler library (probably in the poppler-utils package)
Use pdftohtml with the -xml option and read the generated simple XML file with XML::LibXML or XML::Twig
The last two are external tools which you invoke via Perl's builtins like system or a piped open; see the sketch below.
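For example, a minimal sketch of the pdftotext route, counting words filtered by length (the file name and length cutoff are placeholders):

use strict;
use warnings;
my $file = 'doc.pdf';                          # hypothetical input
open my $fh, '-|', 'pdftotext', $file, '-'     # '-' sends the extracted text to stdout
    or die "Can't run pdftotext: $!";
my %count;
while (<$fh>) {
    $count{lc $_}++ for grep { length() > 4 } split /\W+/;
}
close $fh;
for my $word ((sort { $count{$b} <=> $count{$a} } keys %count)[0 .. 19]) {
    last unless defined $word;                 # fewer than 20 distinct words
    print "$word: $count{$word}\n";
}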
The rest of the text processing, to build your summary and shape the output, is precisely what languages like Perl are for. The couple of tasks mentioned take a few lines of code.
Then write out HTML, either directly if simple or using a suitable module. Given your purpose, you may want to look into HTML::Template. Also see this post, for example.
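A hedged sketch of the HTML::Template step (the template and data are made up for illustration):

use strict;
use warnings;
use HTML::Template;

my $template = HTML::Template->new(scalarref => \<<'HTML');
<h1><TMPL_VAR NAME=KEYWORD></h1>
<ul>
<TMPL_LOOP NAME=FILES><li><a href="<TMPL_VAR NAME=PATH>"><TMPL_VAR NAME=PATH></a></li>
</TMPL_LOOP></ul>
HTML

$template->param(
    KEYWORD => 'perl',                                  # hypothetical keyword
    FILES   => [ { PATH => 'docs/perl-intro.pdf' } ],   # hypothetical matching files
);
print $template->output();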
Full parsing of PDF may be infeasible, but if the files aren't too complex it should work.
If your process for selecting keywords and building statistics is fairly common, there are integrated tools for document management (search for bibliography managers). However, I think most of them resort to external tools to parse the PDFs, so you may still be better off with your own script.

Read Excel file without using module

Can I read an Excel file without using any module?
I tried reading it like a normal file, and it printed binary characters; maybe because of the encoding?
But reading CSV files works normally.
Excel files are binary, and the format of the pre-2007 ones is apparently quite hairy. I believe .xlsx files are actually zipped XML, so unzipping them should yield something human-readable, but I've never tried it. Why do you want to avoid modules, though?
Some further reading, if you're interested:
http://joelonsoftware.com/items/2008/02/19.html
http://en.wikipedia.org/wiki/Office_Open_XML_file_formats
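If you want to check the zipped-XML claim yourself, here's a quick sketch that shells out to the external unzip tool (the file name is hypothetical):

# An .xlsx file is a Zip container of XML parts
system('unzip', '-o', 'book.xlsx', '-d', 'book_contents') == 0
    or die "unzip failed: $?";
# The cell data typically lives in xl/worksheets/sheet1.xml,
# and the strings it references in xl/sharedStrings.xml.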
Can I read an Excel file without using any module?
In theory yes. In practice no.
An Excel XLS file is a binary file within a binary file. The first step would be to parse the Excel BIFF data out of the OLE compound document container. This data isn't necessarily in sequential order.
Then you have to parse the Excel BIFF data itself, allowing for differences between versions, a shared string table with different encodings, and CONTINUE blocks that map large data records in a parser-unfriendly way.
The Excel XLSX format is a little easier, since it is a collection of XML files in a Zip container. However, if you aren't using modules, even that would be a pain.
The Perl modules that deal with Excel files represent hundreds of man-hours of work. Expect to invest a similar amount of work to avoid them.
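For contrast, here is roughly what the module route looks like, sketched with Spreadsheet::ParseExcel (the file name is a placeholder):

use strict;
use warnings;
use Spreadsheet::ParseExcel;

my $parser   = Spreadsheet::ParseExcel->new();
my $workbook = $parser->parse('report.xls')
    or die $parser->error(), "\n";
for my $sheet ($workbook->worksheets()) {
    my ($row_min, $row_max) = $sheet->row_range();
    my ($col_min, $col_max) = $sheet->col_range();
    for my $row ($row_min .. $row_max) {
        for my $col ($col_min .. $col_max) {
            my $cell = $sheet->get_cell($row, $col) or next;
            print $cell->value(), "\t";
        }
        print "\n";
    }
}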
And why can't you use modules?
You can try figuring out the format of an Excel spreadsheet, write code to handle it, and then use that in your program. Maybe write it as a module and submit it to CPAN. Wait a second! There's already a module like that there!
The whole purpose of CPAN is to prevent you from having to reinvent the wheel. You need to read an Excel spreadsheet, and someone has done the hard work to figure out how to do this, and is giving it to you free of charge. A $40,000 value¹, and it's yours for free! The CPAN system makes installing modules fairly simple: you run the cpan command. There's no real reason to avoid modules that can save you hundreds of hours of work.
And what type of modules do you avoid? Is it all modules, or only modules that are not included in the standard distribution? I hate to think you don't use things like File::Copy or Data::Dumper just because they're modules, even though they're included by default in most Perl distributions.
¹ Imagine hiring a team to write code to convert an Excel file so it can be read by a Perl program. They'd have to figure out the ins and outs of the file format, code for all sorts of edge cases, and run it through all sorts of tests to make sure it really works. A rough estimate, if we don't include things like charts, embedded content, and remote data access, would be about 200 man-hours, and only because the format actually has been documented.

Screen scraping: Automating a vim script

In vim, I loaded a series of web pages (one at a time) into a vim buffer (using the vim netrw plugin) and then parsed the HTML (using the vim elinks plugin). All good. I then wrote a series of vim scripts using regexes, with a final result of a few thousand lines, where each line was formatted correctly (CSV) for uploading into a database.
In order to do that I had to use vim's marking functionality so that I could loop over specific points of the document and reassemble it back into one CSV line. Now I am considering automating this with Perl's "Mechanize" library of classes (UserAgent, etc.).
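For the fetching side, a minimal sketch with WWW::Mechanize (the URL is a placeholder):

use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
$mech->get('http://example.com/listing.html');   # hypothetical page
my $html = $mech->content();                     # hand this to an HTML parser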
Questions:
Can vim's ability to "mark" sections of a document (in order to perform substitutions on them) be accomplished in Perl?
It was suggested to use "elinks" directly, which I take to mean loading the page into a headless browser using elinks and running Perl scripts on the content from there(?)
If that's correct, would there be a deployment problem with elinks when I migrate the site from my localhost LAMP stack setup to a hosting company like Bluehost?
Thanks
Edit 1:
TRYING TO MIGRATE KNOWLEDGE FROM VIM TO PERL:
If @flesk (below) is right, then how would I go about performing this routine (written in vim), which "marks" lines in a text file ("i" and "j") and then uses that as a range ('i,'j) to perform the last two substitutions?
:g/^\s*\h/d|let @"=substitute(@"[:-2],'\s\+and\s\+',',','')|ki|/\n\s*\h\|\%$/kj|
\ 'i,'js/^\s*\(\d\+\)\s\+-\s\+The/\=@".','.submatch(1).','/|'i,'js/\s\+//g
I am not seeing this capability in the perldoc perlre manual. Am I missing a module, or some basic Perl understanding of m// or qr//?
I'm sure all you need is some kind of HTML parser. For example, I'm using HTML::TreeBuilder::XPath.
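To make that concrete, here is a hedged sketch that replaces the mark-and-range idea with XPath node sets (the selector and page structure are made up for illustration):

use strict;
use warnings;
use HTML::TreeBuilder::XPath;

my $html = do { local $/; <> };   # slurp the page from a file or stdin
my $tree = HTML::TreeBuilder::XPath->new_from_content($html);
for my $row ($tree->findnodes('//table[@id="data"]/tr')) {    # hypothetical selector
    my @cells = map { $_->as_trimmed_text() } $row->findnodes('./td');
    print join(',', @cells), "\n";                            # one CSV line per row
}
$tree->delete();

Instead of setting marks 'i and 'j and operating on the range, you select exactly the nodes you care about and loop over them.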

Create Numbers file and open it with Numbers on iPad

I would like to do a task that is quite simple on other operating systems, but not so trivial on iOS. Namely, I want to create a file and open it in Numbers.
I can preview the file with UIDocumentInteractionController and then offer the user the option to open it.
This seems to me quite a reasonable solution. However, I need to offer the proper file format. I suppose CSV and XLS would be reasonable to implement and would most probably work, but I would still like to use the native Numbers format if possible. However, I can't find any info about this file format.
Basically, this task is about exporting data to another app and then continuing to work with it there.
I don't know of a library that can create native Numbers files. There are however some libraries that allow creating XLS files. Since Numbers fully supports XLS, this is probably the way to go.
There is a commercial library available that might work on the iPhone (it costs $200): http://www.libxl.com/
As for free XLS libraries, I only know of xlwt, a Python module. You could set up a web service that creates an XLS file for your app, using xlwt on the server side.
If you want to pass information to Numbers, you can probably also use CSV files. If you do, you must be aware of some things. There are two kinds of CSV files: the comma-separated kind (used in English-speaking countries) and the semicolon-separated kind (used in continental Europe).
Comma-separated CSV files look like this, for example:
"ID","First Name","Last Name","Salary"
1,"John","Malkovich",3400.20
2,"Fred","Astaire",2000.60
The second kind of CSV file is semicolon-separated and uses a comma as the decimal mark. These files look like this:
"ID";"First Name";"Last Name";"Salary"
1;"John";"Malkovich";3400,20
2;"Fred";"Astaire";2000,60
On the Macintosh, Numbers expects a different format depending on the Region setting. If you have your Region set to the US, it will expect the first kind. If you choose Germany, it will expect the second kind.
I don't know what kind of files Numbers on the iPad expects.
Another alternative would be using copy and paste. Try copying tab-separated text into the clipboard.
I hope this helps. I've contacted the libxl team and they responded with a link to the demo version of their iPhone library: http://www.libxl.com/download/libxl-iphone.zip