I have a large number of PDF files on my local filesystem that I use as a documentation base, and I would like to create an index of these files.
I would like to:
1. Parse the contents of the PDF files to get keywords.
2. Select the most relevant keywords to make a summary.
3. Create static HTML pages for some keywords with entries linked to the appropriate files.
My questions are:
Is there an existing tool that performs the whole job?
What is the most appropriate tool to parse the content of PDF files, filter the words (by length) and count them?
I am considering using Perl, swish-e and pdfgrep to write a script. Do you know of other tools that could be useful?
Given that points 2 and 3 seem custom, I'd recommend writing your own script: call a tool from it to parse the PDF, process its output as you please, and write the HTML (perhaps using another tool).
Perl is well suited for that, since it excels at the kind of processing you'll need and also supports all kinds of file formats via modules.
As for reading PDF, here are some options if your needs aren't too elaborate:
Use the CAM::PDF (and CAM::PDF::PageText) or PDF-API2 modules
Use pdftotext from the Poppler library (probably in the poppler-utils package)
Use pdftohtml with the -xml option, and read the generated simple XML file with XML::LibXML or XML::Twig
The last two are external tools, which you run via Perl's builtins such as system.
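To illustrate the external-tool route, here is a minimal sketch of calling pdftotext from a script. It is written in Python only for illustration (in Perl the same call would go through system or backticks); the function names are made up for this example, and it assumes pdftotext is installed and on the PATH.

```python
import subprocess


def pdftotext_command(pdf_path):
    # "-" as the output file asks pdftotext to write to standard output;
    # -layout tries to keep the original physical layout of the text.
    return ["pdftotext", "-layout", pdf_path, "-"]


def extract_text(pdf_path):
    # check=True raises CalledProcessError if pdftotext exits non-zero.
    result = subprocess.run(
        pdftotext_command(pdf_path), capture_output=True, check=True
    )
    # Assumption: the output is UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling extract_text on each file gives you plain text to feed into the keyword step.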
The following text processing, to build your summary and design the output, is precisely what languages like Perl are for. The couple of tasks that are mentioned take a few lines of code.
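As a rough idea of how small that step is, here is a sketch of the "filter by word size and count" task. It is in Python for illustration (a Perl hash does the same job); the name keyword_counts and the default thresholds are assumptions of this example, not anything from the question.

```python
import re
from collections import Counter


def keyword_counts(text, min_len=5, top=10):
    # Split into lowercase alphabetic words; digits and punctuation are dropped.
    words = re.findall(r"[A-Za-z]+", text.lower())
    # Keep only words of at least min_len letters, then count them.
    counts = Counter(w for w in words if len(w) >= min_len)
    # Return the `top` most frequent (word, count) pairs.
    return counts.most_common(top)
```

A real summary would probably also want a stop-word list, but the structure stays the same.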
Then write out HTML, either directly if simple or using a suitable module. Given your purpose, you may want to look into HTML::Template. Also see this post, for example.
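For the "directly if simple" case, a static page per keyword can be little more than a list of links. The sketch below is in Python for illustration; keyword_page and the page layout are hypothetical, and a template module such as HTML::Template would replace the string building in a Perl script.

```python
from html import escape
from urllib.parse import quote


def keyword_page(keyword, pdf_paths):
    # One <li> per PDF: the href is URL-quoted, the visible text HTML-escaped.
    items = "\n".join(
        f'<li><a href="{quote(p)}">{escape(p)}</a></li>' for p in pdf_paths
    )
    title = escape(keyword)
    return (
        f"<html><head><title>{title}</title></head>\n"
        f"<body><h1>{title}</h1>\n<ul>\n{items}\n</ul></body></html>\n"
    )
```

Writing the returned string to a file named after the keyword gives one static page per keyword.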
Full parsing of PDF may be infeasible, but if the files aren't too complex it should work.
If your process for selecting keywords and building statistics is fairly common, there are integrated tools for document management (search for bibliography managers). However, I think that most of them resort to external tools to parse pdf so you may still be better off with your own script.
I'd like to merge at least 2 PDF files into one while preserving all the form elements in the original PDFs. The form elements include text fields, radio buttons, check boxes, drop down menus and others. Please have a look at this sample PDF file with forms:
http://foersom.com/net/HowTo/data/OoPdfFormExample.pdf
Now try to merge it with any other arbitrary PDF file.
Can you do it?
EDIT: As for the implementation, I'd ideally prefer a command-line solution on a Linux platform using open source tools such as 'ghostscript', or any other tool that you think is appropriate to solve this task.
Of course, everybody is welcome to supply any working solution to this problem, including a coded solution that involves writing a script which makes some API calls to a pdf-processing library. However, I'd suggest to take the path of least resistance first (CMD Solution).
Best Regards
EDIT #2: Well, there are indeed several CMD tools that merge PDFs. However, AFAIK these tools don't seem to preserve the forms in the original PDFs! They appear to simply concatenate the printouts of all those PDFs into a single printout, which is then presented as a single PDF.
Furthermore, if you print a PDF file with forms to a file, you lose all the forms in it. This is clearly not what I'm looking for.
I have had success using pdftk, an open-source tool that runs on Linux and can be called from your terminal.
To concatenate multiple pdfs into one (and preserve form-fillable elements), you can use the following command:
pdftk input1.pdf input2.pdf cat output output-file.pdf
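If there are many files to merge, the same command can be assembled by a script. This is only a sketch in Python; pdftk_cat_command and merge_pdfs are names invented for this example, and it assumes pdftk is installed and on the PATH.

```python
import subprocess


def pdftk_cat_command(inputs, output):
    # Builds: pdftk in1.pdf in2.pdf ... cat output out.pdf
    if len(inputs) < 2:
        raise ValueError("need at least two input PDFs to merge")
    return ["pdftk", *inputs, "cat", "output", output]


def merge_pdfs(inputs, output):
    # check=True raises CalledProcessError if pdftk reports a failure.
    subprocess.run(pdftk_cat_command(inputs, output), check=True)
```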
Setting
I'm writing scripts to generate documentation for a very modular project. I need to be able to generate the qch/man/xml documentation for each module and for the whole project.
What I'm currently doing is
Generate tags file
Generate qch, xml, ... for each single module.
Generate again for the whole project
What I'm looking for
I would like to be able to do 1 and 2, and then merge the results of 2 to get 3, without having to parse the whole thing again. Is that somehow possible?
In [1] and [2] they do not need to get the individual modules, they just parse them separately to gain time. It's useful, but not enough for me.
I'm ok with having some scripting hacks if needed, but until now I did not find anything working.
Things I read
[1] Doxygen is Slow
[2] Merging doxygen modules
Can I read an excel file without using any module?
I tried reading it just like a normal file and it printed binary characters; maybe because of the encoding?
Reading CSV files works normally, though.
Excel files are binary files, and the format of the pre-2007 ones is apparently quite hairy. I believe .xlsx files are actually zipped XML, so unzipping them should yield something human-readable, but I've never tried it. Why do you want to not use a module though?
Some further reading, if you're interested:
http://joelonsoftware.com/items/2008/02/19.html
http://en.wikipedia.org/wiki/Office_Open_XML_file_formats
Can I read an excel file without using any module?
In theory yes. In practice no.
An Excel XLS file is a binary file within a binary file. The first step would be to parse the Excel BIFF data out of the OLE COM document container. This data isn't necessarily in sequential order.
Then you have to parse the Excel BIFF data, allowing for differences between versions, a shared string table with different encodings and CONTINUE blocks that map large data records in a parser unfriendly way.
The Excel XLSX format is a little easier since it is a collection of XML files in a Zip container. However, if you aren't using modules then even that would be a pain.
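To make the "XML files in a Zip container" point concrete, here is a small sketch that pulls the shared string table out of an .xlsx file. It is in Python and uses only its standard library (which still means modules, just bundled ones); shared_strings is a name made up for this example. It assumes the workbook contains xl/sharedStrings.xml, which a workbook without text cells may lack, and it does nothing about mapping strings back to cells.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by SpreadsheetML parts inside an .xlsx package.
MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def shared_strings(xlsx_file):
    # xlsx_file may be a path or a binary file-like object.
    with zipfile.ZipFile(xlsx_file) as zf:
        root = ET.fromstring(zf.read("xl/sharedStrings.xml"))
    strings = []
    for si in root.findall(f"{{{MAIN_NS}}}si"):
        # A string item holds one <t>, or several inside rich-text runs;
        # this simply concatenates every <t> descendant.
        strings.append("".join(t.text or "" for t in si.iter(f"{{{MAIN_NS}}}t")))
    return strings
```

Even this covers only one of the several XML parts, which is the reason the dedicated modules exist.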
The Perl modules that deal with Excel files represent hundreds of man hours of work. Expect to invest a similar amount of work to avoid them.
And why can't you use modules?
You can try figuring out the format of what an Excel spreadsheet looks like, code for that, and then use that in your program. Maybe write it as a module and submit it to CPAN. Wait a second! There's already a module like that there!
The whole purpose of CPAN is to prevent you from having to reinvent the wheel. You need to read an Excel spreadsheet, and someone has done the hard work to figure out how to do this, and is giving it to you free of charge. A $40,000 value [1], and it's yours for free! The CPAN system makes installing modules fairly simple. You run the cpan command. There's no real reason to avoid modules that can save you hundreds of hours of work.
And what type of modules do you avoid? Is it all modules, or only modules that are not included in the standard distribution? I hate to think you don't use things like File::Copy or Data::Dumper just because they're modules, even though they're included by default in most Perl distributions.
[1] Imagine hiring a team to write code to convert an Excel file so it can be read by a Perl program. They'd have to figure out the ins and outs of the file format, code for all sorts of edge cases, and run it through all sorts of tests to make sure it really works. A rough estimate, if we don't include things like charts, embedded content, and remote data access, would be about 200 man-hours, and only because the format has actually been documented.
I am using Perl to automate report generation. Reports are generated in HTML. The same report can be opened in MS Word format, and tables generated in HTML look good in Word too.
Problem:
I also need to insert a few graphs in the report. For HTML, I am using the SVG::TT::Graph::Line Perl module to generate the graphs.
The idea here is to keep a single HTML file that contains all tables and graphs.
Currently everything looks good in HTML, but when I open the same file in Word, the graphs are replaced by data (because I am using the SVG Perl module).
Just wondering what would be the best way to generate graphs for Word file that doesn't change my code much.
Any suggestions with the Perl modules to be used would be much appreciated.
I haven't tried this, but the only thing I can think of is to use ImageMagick to convert the SVG to PNG and then use a Data URI to embed the image in the HTML.
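A sketch of that idea, equally untested against Word: convert the SVG with ImageMagick's convert command and inline the PNG as a data URI. It is in Python for illustration (in Perl, MIME::Base64 would do the encoding); the function names are made up for this example, and it assumes ImageMagick is installed with SVG support.

```python
import base64
import subprocess
from html import escape


def svg_to_png_bytes(svg_path):
    # "png:-" tells ImageMagick to write the PNG to standard output.
    result = subprocess.run(
        ["convert", svg_path, "png:-"], capture_output=True, check=True
    )
    return result.stdout


def png_img_tag(png_bytes, alt="graph"):
    # Base64-encode the image and embed it directly in the src attribute.
    b64 = base64.b64encode(png_bytes).decode("ascii")
    return f'<img alt="{escape(alt)}" src="data:image/png;base64,{b64}">'
```

The resulting tag replaces the inline SVG in the report, so the HTML stays a single file.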
Does anyone know of a free Perl program (command line preferable), module, or any other way to search and replace text in a PDF file without using it like an editor?
Basically I want to write a program (in Perl preferably) to automate replacing certain words (e.g. our old address) in a few hundred PDF files. I could use any program that supports command line arguments. I know there are many modules on CPAN that manipulate or create pdfs but they don't have (that I've seen) any sort of simple search and replace.
Thanks in advance for any and all advice!!!
Take a look at CAM::PDF. More specifically the changeString method.
How did you generate those PDFs in the first place? Doing the search-and-replace in the original sources and regenerating the PDFs seems more viable. Editing PDFs directly can be very difficult, and I'm not aware of any free tools that can do it easily.