Read an Excel file without using a module - Perl

Can I read an Excel file without using any modules?
I tried reading one as if it were a plain text file, and it printed binary characters; maybe because of the encoding?
Reading CSV files this way works fine, though.

Excel files are binary files, and the format of the pre-2007 ones is apparently quite hairy. I believe .xlsx files are actually zipped XML, so unzipping them should yield something human-readable, but I've never tried it. Why do you want to not use a module though?
Some further reading, if you're interested:
http://joelonsoftware.com/items/2008/02/19.html
http://en.wikipedia.org/wiki/Office_Open_XML_file_formats

Can I read an excel file without using any module?
In theory yes. In practice no.
An Excel XLS file is a binary file within a binary file. The first step would be to parse the Excel BIFF data out of the OLE COM document container. This data isn't necessarily in sequential order.
Then you have to parse the Excel BIFF data, allowing for differences between versions, a shared string table with different encodings, and CONTINUE blocks that map large data records in a parser-unfriendly way.
The Excel XLSX format is a little easier since it is a collection of XML files in a Zip container. However, if you aren't using modules then even that would be a pain.
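To see this for yourself, here is a minimal sketch that peeks inside an .xlsx using only Perl builtins and the external unzip tool (the file name book.xlsx is an assumption for illustration):

#!/usr/bin/perl
# Peek inside an .xlsx with no CPAN modules by shelling out to the
# external `unzip` tool (assumed to be on the PATH).
use strict;
use warnings;

my $file = 'book.xlsx';

# List the XML parts inside the Zip container.
print `unzip -Z1 "$file"`;

# Dump the shared string table; you would still have to hand-parse
# this XML (and each sheet's XML) to recover actual cell values.
print `unzip -p "$file" xl/sharedStrings.xml`;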
The Perl modules that deal with Excel files represent hundreds of man hours of work. Expect to invest a similar amount of work to avoid them.

And why can't you use modules?
You could try to figure out what the format of an Excel spreadsheet looks like, write code for that, and then use it in your program. Maybe write it as a module and submit it to CPAN. Wait a second! There's already a module like that there!
The whole purpose of CPAN is to prevent you from having to reinvent the wheel. You need to read an Excel spreadsheet, and someone has already done the hard work of figuring out how to do this, and is giving it to you free of charge. A $40,000 value¹, and it's yours for free! The CPAN system makes installing modules fairly simple: you run the cpan command. There's no real reason to avoid modules that can save you hundreds of hours of work.
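For instance, here is a minimal sketch of reading an XLS file with Spreadsheet::ParseExcel from CPAN (the file name book.xls is an assumption for illustration):

#!/usr/bin/perl
# Read every cell of every worksheet in an XLS file and print the
# values tab-separated, one row per line.
use strict;
use warnings;
use Spreadsheet::ParseExcel;

my $parser   = Spreadsheet::ParseExcel->new();
my $workbook = $parser->parse('book.xls');
die $parser->error(), "\n" unless defined $workbook;

for my $worksheet ( $workbook->worksheets() ) {
    my ( $row_min, $row_max ) = $worksheet->row_range();
    my ( $col_min, $col_max ) = $worksheet->col_range();
    for my $row ( $row_min .. $row_max ) {
        for my $col ( $col_min .. $col_max ) {
            my $cell = $worksheet->get_cell( $row, $col );
            print $cell ? $cell->value() : '', "\t";
        }
        print "\n";
    }
}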
And what types of modules do you avoid? Is it all modules, or only those that aren't included in the standard distribution? I'd hate to think you avoid things like File::Copy or Data::Dumper just because they're modules, even though they're included by default in most Perl distributions.
¹ Imagine hiring a team to write code to convert an Excel file so it can be read by a Perl program. They'd have to figure out the ins and outs of the file format, code for all sorts of edge cases, and run it through all sorts of tests to make sure it really works. A rough estimate, if we don't include things like charts, embedded content, and remote data access, would be about 200 man-hours, and only because the format actually has been documented.

Related

Index PDF files and generate keywords summary

I have a large number of PDF files in my local filesystem that I use as a documentation base, and I would like to create an index of these files.
I would like to:
Parse the contents of the PDF files to get keywords.
Select the most relevant keywords to make a summary.
Create static HTML pages for some keywords, with entries linked to the appropriate files.
My questions are:
Is there an existing tool to perform the whole job?
What is the most appropriate tool to parse the PDF files' content, filter words (by size), and count them?
I'm considering using Perl, swish-e, and pdfgrep to make a script. Do you know of other tools that could be useful?
Given that points 2 and 3 seem custom, I'd recommend writing your own script: call a tool from it to parse the PDFs, process the output as you please, and write the HTML (perhaps using another tool).
Perl is well suited for that, since it excels at the kind of text processing you'll need and also, via modules, provides support for working with all kinds of file formats.
As for reading the PDFs, here are some options, if your needs aren't too elaborate:
Use the CAM::PDF (and CAM::PDF::PageText) or PDF::API2 modules
Use pdftotext from the poppler library (probably in the poppler-utils package)
Use pdftohtml with the -xml option, and read the generated simple XML file with XML::LibXML or XML::Twig
The last two are external tools which you use via Perl's builtins like system.
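As a concrete starting point, here is a minimal sketch of the first option, using CAM::PDF to extract the text and count word frequencies (the file name doc.pdf and the minimum word length are assumptions for illustration):

#!/usr/bin/perl
# Extract text from each page of a PDF and tally word frequencies,
# keeping only words of a minimum length (point 2 of the question).
use strict;
use warnings;
use CAM::PDF;
use CAM::PDF::PageText;

my $pdf = CAM::PDF->new('doc.pdf') or die "$CAM::PDF::errstr\n";

my %count;
for my $pagenum ( 1 .. $pdf->numPages() ) {
    my $tree = $pdf->getPageContentTree($pagenum) or next;
    my $text = CAM::PDF::PageText->render($tree);
    $count{ lc $_ }++ for grep { length() >= 4 } split /\W+/, $text;
}

# Print the 20 most frequent candidate keywords.
my @top = ( sort { $count{$b} <=> $count{$a} } keys %count )[ 0 .. 19 ];
print "$_\t$count{$_}\n" for grep { defined } @top;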
The text processing that follows, to build your summary and design the output, is precisely what languages like Perl are for. The couple of tasks mentioned take a few lines of code.
Then write out the HTML, either directly if it's simple, or using a suitable module. Given your purpose, you may want to look into HTML::Template. Also see this post, for example.
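A minimal sketch of that last step with HTML::Template, assuming a template file keyword.tmpl that contains a <TMPL_VAR NAME=KEYWORD> and a <TMPL_LOOP NAME=FILES> (all the names here are illustrative):

#!/usr/bin/perl
# Fill a per-keyword template and write a static HTML page.
# keyword.tmpl, the keyword, and the file list are assumptions.
use strict;
use warnings;
use HTML::Template;

my $tmpl = HTML::Template->new( filename => 'keyword.tmpl' );

$tmpl->param(
    KEYWORD => 'example-keyword',
    FILES   => [    # one row per document that mentions the keyword
        { PATH => 'docs/a.pdf', COUNT => 12 },
        { PATH => 'docs/b.pdf', COUNT => 7 },
    ],
);

open my $fh, '>', 'example-keyword.html' or die $!;
print {$fh} $tmpl->output();
close $fh;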
Full parsing of a PDF may be infeasible, but if the files aren't too complex the above should work.
If your process for selecting keywords and building statistics is fairly common, there are integrated tools for document management (search for bibliography managers). However, I think most of them resort to external tools to parse the PDFs, so you may still be better off with your own script.

Use MATLAB to open a file with an outside program and execute 'save as'

Alright, here's what I'm dealing with (you can skip to TLDR if all you need to see is what I want to run):
I'm having an issue with file formatting for a nasty conglomeration of several ancient programs I've strung together. I have some data in .CSV format and I need to put it into .SPC format. I've tried a set of proprietary MATLAB programs called 'GS tools' for fast and easy conversion, but fast and easy doesn't look like it's gonna happen here, since there are discrepancies between how .spc files are organized now and how they were organized back when my ancient programs were written.
If I could find the source code for the old programs I could probably alter the GS tools code to write my .spc files appropriately, but all I can find are broken links circa 2002 and earlier. Seeing as I don't know what my programs are looking for, I have no choice but to try resaving my data with other programs until one of them produces something workable.
I found my Cinderella program: if I open the data I have in a program called Spekwin and save the file with a .spc extension... voila! Everything else runs on those files. The problem is that I have hundreds of these files, and I'd like to automate the conversion process.
I either need to extract the writing rubric Spekwin uses for .spc files (I believe that info is stored in a dll file within the program, but I'm not sure if that actually makes sense) and use it as a rule to write a file from my input data, or I need a piece of code that will open a file with Spekwin, tell Spekwin to save that file under the .spc extension, and terminate Spekwin.
TLDR: Need a command that tells the computer to open a file with a certain program, save that file under a different extension through that program (essentially open*.csv>save as>*.spc), then terminate the program.
OR--I need a way to tell MATLAB to write a file according to rules specified by a .dll, but I'm not sure I fully understand what that entails.
Of course I'm open to suggestions on other ways to handle this.

How to transfer data from Excel to PowerPoint using Perl?

I am using ActivePerl from ActiveState and would like to transfer data from an Excel sheet to a particular table on the slides of a particular PowerPoint file.
How do I do it in Perl?
I suggest you call your Perl script from VBA.
You can do that like this (in VBA), where ShellWait is a helper routine that runs a command and waits for it to finish (it isn't built into VBA; the built-in Shell function returns immediately):
ShellWait "c:\perl\bin\perl.exe c:\path\to\script.pl", vbNormalFocus
Alternatively, try the Shell function that VBA provides.
UPDATE
I suggest you write the data to disk first (in Excel), and then read it from disk (in PowerPoint), doing both with VBA. Using Perl for this makes little sense, because you're going to get wrapped up in file-parsing problems and all sorts of other issues.

Which module is efficient for parsing a .pdf file in one go? CAM::PDF or PDF::API2

I want to extract all the keywords from a huge (50 MB) PDF file.
Which module is good for parsing large PDF files?
I'm concerned about memory when parsing a huge file and extracting almost all of the keywords!
Here I want SAX-style parsing (one-go parsing), not DOM-style (by analogy to XML).
To read text out of a PDF, we used CAM::PDF, and it worked just fine. It wasn't hugely fast on some larger files, but its ability to handle them was not bad. We certainly had a few that were ~100 MB, and they were handled OK. If I recall correctly, we struggled with a few that were 130 MB on a 32-bit (Windows) Perl, but we had a whole lot of other stuff in memory at the time. We did look at PDF::API2, but it seemed more oriented to generating PDFs than reading from them. We didn't throw large files at PDF::API2, so I can't give a real benchmark figure.
The only significant downside we found with using CAM::PDF is that PDF 1.6 is becoming more common, and that doesn't work at all in CAM::PDF yet. That might not be an issue for you, but it might be something to consider.
In answer to your question, I'm pretty sure both modules read the whole source PDF into memory in one form or another, but I don't think CAM::PDF builds as many complex structures on top of it. So neither is really SAX-like, but CAM::PDF seemed lighter in general, and it can retrieve one page at a time, which might reduce the load when extracting very large texts.
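A minimal sketch of that page-at-a-time approach with CAM::PDF (big.pdf is an assumed file name):

#!/usr/bin/perl
# Pull text from a large PDF one page at a time, so the rendered
# text is never all in memory at once (the module still loads the
# PDF itself).
use strict;
use warnings;
use CAM::PDF;

my $pdf = CAM::PDF->new('big.pdf') or die "$CAM::PDF::errstr\n";

for my $page ( 1 .. $pdf->numPages() ) {
    my $text = $pdf->getPageText($page);
    next unless defined $text;
    # Process each page's text immediately rather than accumulating it.
    my @words = grep { length } split /\W+/, $text;
    print "page $page: ", scalar(@words), " words\n";
}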

How can I search and replace in a PDF document using Perl?

Does anyone know of a free Perl program (command line preferred), module, or any other way to search and replace text in a PDF file without using it like an editor?
Basically, I want to write a program (preferably in Perl) to automate replacing certain words (e.g. our old address) in a few hundred PDF files. I could use any program that supports command line arguments. I know there are many modules on CPAN that manipulate or create PDFs, but none that I've seen offer any sort of simple search and replace.
Thanks in advance for any and all advice!!!
Take a look at CAM::PDF, more specifically its changeString method.
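A minimal sketch (the file names and address strings are assumptions for illustration):

#!/usr/bin/perl
# Replace a literal string throughout a PDF with CAM::PDF's
# changeString, then write a new file.
use strict;
use warnings;
use CAM::PDF;

my $pdf = CAM::PDF->new('old.pdf') or die "$CAM::PDF::errstr\n";

# changeString matches literal strings in the content streams, so
# text that the PDF writer split into pieces may not be found.
$pdf->changeString('123 Old Street', '456 New Avenue');

$pdf->cleanoutput('new.pdf');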
How did you generate those PDFs in the first place? Searching and replacing in the original sources and re-generating the PDFs seems more viable. Editing PDFs directly can be very difficult, and I'm not aware of any free tools that can do it easily.