Perl code for inserting graphs into MS Word - perl

I am using Perl to automate report generation. The reports are generated in HTML, and the same report can be opened in MS Word. Tables generated in HTML look good in Word too.
Problem:
I also need to insert a few graphs in the report. For HTML, I am using the SVG::TT::Graph::Line Perl module to generate the graphs.
The idea here is to keep a single HTML file that contains all tables and graphs.
Currently everything looks good in HTML, but when I open the same file in Word, the graphs are replaced by data (because I am using an SVG Perl module).
Just wondering what would be the best way to generate graphs for the Word file without changing my code much.
Any suggestions with the Perl modules to be used would be much appreciated.

I haven't tried this, but the only thing I can think of is to use ImageMagick to convert the SVG to PNG and then use a Data URI to embed the image in the HTML.
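A minimal sketch of that idea, assuming ImageMagick's convert command is installed and on the PATH. The helper names png_img_tag and svg_to_png_bytes are made up for illustration, and, as the answer says, whether Word actually renders a data URI is untested here.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64);

# Build an <img> tag whose source is a Base64 data URI for the given PNG bytes.
sub png_img_tag {
    my ($png_bytes, $alt) = @_;
    my $b64 = encode_base64($png_bytes, '');    # '' suppresses line breaks
    return qq{<img alt="$alt" src="data:image/png;base64,$b64">};
}

# Convert an SVG file to PNG by shelling out to ImageMagick (assumed installed),
# then return the raw PNG bytes.
sub svg_to_png_bytes {
    my ($svg_file) = @_;
    my $png_file = "$svg_file.png";
    system('convert', $svg_file, $png_file) == 0
        or die "convert failed for $svg_file (exit status $?)\n";
    open my $fh, '<:raw', $png_file or die "Cannot read $png_file: $!\n";
    local $/;                                   # slurp the whole file
    my $bytes = <$fh>;
    close $fh;
    return $bytes;
}

# Usage: perl svg2img.pl graph.svg  (prints the <img> tag to paste into the report)
if (@ARGV) {
    print png_img_tag(svg_to_png_bytes($ARGV[0]), 'graph'), "\n";
}
```

The SVG written by SVG::TT::Graph::Line would be saved to a file first, and the resulting tag emitted in place of the inline SVG.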

Related

Index PDF files and generate keywords summary

I have a large amount of PDF files in my local filesystem I use as documentation base and I would like to create an index of these files.
I would like to:
1) Parse the contents of the PDF files to get keywords.
2) Select the most relevant keywords to make a summary.
3) Create static HTML pages for some keywords with entries linked to the appropriate files.
My questions are:
Is there an existing tool to perform the whole job?
What is the most appropriate tool to parse the content of PDF files, filter (by word size), and count the words?
I am considering using Perl, swish-e, and pdfgrep to make a script. Do you know other tools which could be useful?
Given that points 2 and 3 seem custom, I'd recommend writing your own script: call a tool from it to parse the PDF, process its output as you please, and write the HTML (perhaps using another tool).
Perl is well suited for that, since it excels in processing that you'll need and also provides support for working with all kinds of file formats, via modules.
As for reading PDFs, here are some options if your needs aren't too elaborate:
Use the CAM::PDF (and CAM::PDF::PageText) or PDF::API2 modules
Use pdftotext from the poppler library (probably in the poppler-utils package)
Use pdftohtml with the -xml option, and read the generated simple XML file with XML::LibXML or XML::Twig
The last two are external tools which you use via Perl's builtins like system.
The following text processing, to build your summary and design the output, is precisely what languages like Perl are for. The couple of tasks that are mentioned take a few lines of code.
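As a rough sketch of that processing step, assuming pdftotext is installed and that "keywords" simply means the most frequent words of a minimum length (the question leaves the selection rule open). The names extract_text and top_keywords are hypothetical.

```perl
use strict;
use warnings;

# Run pdftotext (assumed installed) and return the extracted text.
# The '-' argument makes pdftotext write to standard output.
sub extract_text {
    my ($pdf) = @_;
    open my $fh, '-|', 'pdftotext', $pdf, '-'
        or die "Cannot run pdftotext: $!\n";
    local $/;                       # slurp everything at once
    my $text = <$fh>;
    close $fh;
    return $text // '';
}

# Count words case-insensitively, drop those shorter than $min_len,
# and return up to $limit [word, count] pairs, most frequent first.
sub top_keywords {
    my ($text, $min_len, $limit) = @_;
    my %count;
    $count{lc $1}++ while $text =~ /([A-Za-z]+)/g;
    delete @count{ grep { length($_) < $min_len } keys %count };
    my @sorted = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
    splice @sorted, $limit if @sorted > $limit;
    return map { [ $_, $count{$_} ] } @sorted;
}

# Usage: perl keywords.pl file.pdf
if (@ARGV) {
    printf "%-20s %d\n", @$_ for top_keywords(extract_text($ARGV[0]), 5, 10);
}
```

Ties are broken alphabetically so the output is stable between runs. A stop-word list would be a natural next refinement.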
Then write out HTML, either directly if simple or using a suitable module. Given your purpose, you may want to look into HTML::Template. Also see this post, for example.
Full parsing of PDF may be infeasible, but if the files aren't too complex it should work.
If your process for selecting keywords and building statistics is fairly common, there are integrated tools for document management (search for bibliography managers). However, I think that most of them resort to external tools to parse pdf so you may still be better off with your own script.

How to create reports containing text and figures with MATLAB

I am using a MATLAB script to tune the control system on a machine. When the tuning is complete, I would like a report containing text (especially serial number, date/time and the values determined during tuning) and plots, especially transfer functions.
What do you recommend?
Whatever solution I use should be compatible with the MATLAB compiler so I can distribute my solution to a team of field engineers.
Ideally the report will be a PDF document.
The MATLAB report generator does not seem to be the right product as it appears that I have to break up my script into little pieces and embed them in the report template. My script contains opportunities for the user to intervene and change values or reject the tune if plots don't look right and my hunch is that this will be difficult if the code runs from the report generator. Also, I fear code structure and maintainability will be lost if the code structure is determined by the requirements of the report template.
Please comment if my assumptions are wrong.
UPDATE
I have now switched to the MATLAB Report Generator with release R2016b, and it is working very well for users of my compiled code. Unfortunately, it means that colleagues who have a MATLAB licence also need to buy the Report Generator to run my tools as scripts.
As the MATLAB Report Generator's development manager, I am concerned that this question may leave the wrong impression about the Report Generator's capabilities.
For one thing, the Report Generator does not require you to break a script up into little pieces and run them inside a template. You can do this if you choose and in some circumstances, it makes sense, but it is not a requirement. In fact, many Report Generator applications use a MATLAB script or program to interact with a user, generate data in the MATLAB workspace, and as a final step, generate a report from the workspace data.
Moreover, as of the R2014b version, the MATLAB Report Generator comes with a document generation API, called the DOM API, that allows you to embed document generation statements in a MATLAB program. For example, you can programmatically create a document object, add and format text, paragraphs, tables, images, lists, and subdocuments, and output Microsoft Word, HTML, or PDF output, depending on the output type you select. You can even programmatically fill in the blanks in forms that you create, using Word or an HTML editor.
The API runs on Windows, Linux, and Mac platforms and generates Word and HTML output on all three, without the use of Word. On Windows, it uses Word under the hood to produce PDF output from the Word documents that it generates.
The latest release of the MATLAB Report Generator introduces a PowerPoint API with capabilities similar to the DOM API. If you need to include report generation in your MATLAB application, please don't rule out the MATLAB Report Generator based on past impressions. You may be surprised at just how powerful it has become.
I've done this quite a bit. You're right that MATLAB Report Generator is typically not a great solution. @Max suggests the right approach (automating Word through its COM interface), but I'd add a few extra comments and tips, based on my experience.
Remember that if you go with this solution, you are depending on your end users running Windows and having a copy of Office on their machines. If you ultimately want to produce a PDF report, that will need to be Office 2010 or above.
I would bet that you'll find it easier to automate the report generation in Excel rather than Word. Given that you're producing a report from MATLAB, you'll likely be wanting quite a lot of things in tables of numbers, which are easier to lay out in Excel.
If you are going to do it in Word, the easiest way is to first (without MATLAB) create a template .doc/.docx file, which contains any generic text that will be the same for all reports and blank tables for any information. Turn on track changes, and insert empty comments at each point that you will be filling in information. Then within your report creation routine in MATLAB, connect to Word and iterate through each comment, replacing it with whatever data you wish.
If you are learning to automate Excel from MATLAB, this page from the Excel Interop documentation is really helpful. There's an equivalent one for Word.
Unlike @Max, I've never had good results by saving figures to an .emf file and then inserting them. In theory that does preserve editability, but I've never found that valuable. Instead, get the figure looking right (and the right size) in MATLAB, then copy it to the clipboard with print(figHandle, '-dbitmap') and paste to Excel with Worksheet.Range('A1').PasteSpecial.
To save as a PDF, use Workbook.ExportAsFixedFormat('xlTypePDF', pathToOutputFile).
Hope that helps!
I think you are right about the report generator.
In my opinion the fastest/easiest approach would be to generate the report as an HTML document. For that you just need to save the figures and write a text file; conversion should be trivial.
A quite similar approach would be to create a LaTeX file and then create a PDF from it, though for this you'd need to install LaTeX on your deployed machines.
Lastly, you could use MATLAB's good Java integration. There are several libraries you could use, like this. But I wonder if all the complication would be worth it.
Have you considered driving Microsoft Word through its ActiveX interface? I've done this in compiled Matlab programs and it works well. Look at the Matlab help for actxserver(): The object you want to create is of type Word.Application.
Edit to add: To get figures into the document, save them as .emf files using the -dmeta argument to print(), then add them to the document like this:
WordServer.Selection.InlineShapes.AddPicture(fileName);

Trouble reading text from a pdf in Perl

I am trying to read the text content of a pdf file into a Perl variable. From other SO questions/answers I get the sense that I need to use CAM::PDF. Here's my code:
#!/usr/bin/perl -w
use CAM::PDF;
my $pdf = CAM::PDF->new('1950-01-01.pdf');
print $pdf->numPages(), " pages\n\n";
my $text = $pdf->getPageText(1);
print $text, "\n";
I tried running this on this pdf file. There are no errors reported by Perl. The first print statement works; it prints "2 pages" which is the correct number of pages in this document.
The next print statement does not return anything readable. Here's what the output looks like in Emacs:
2 pages
^A^B^C^D^E^C^F^D^G^H
^D^A^K^L^C^M^D^N^C^M^O^D^P^C^Q^Q^C ^D^R^K^M^O^D ^A^B^C^D^E
^F^G^G^H^E
^K^L
^M^N^E^O^P^E^O^Q^R^S^E
.... more lines with similar codes ....
Is there something I can do to make this work? I don't understand pdf files too well, but I thought that because I can easily copy and paste the text from the PDF file using Acrobat, it must be recognized as text and not an image, so I hoped this meant I could extract it with Perl.
Any guidance would be much appreciated.
PDFs can have different kinds of content. A PDF may not have any readable text at all, only bitmaps and graphical content, for example. The PDF you linked to has compressed data in it. Open it with a text editor, and you will see that the content is in a "/Filter/FlateDecode" block. Perhaps CAM::PDF doesn't support that. Google FlateDecode for a few ideas.
Looking further into that PDF, I see that it also uses embedded subsets of fonts, with custom encodings. Even if CAM::PDF handles the compression, the custom encoding may be what's throwing it off. This may help: Web page from a software company, describing the problem
I'm fairly certain that the issue isn't with your Perl code; it is with the PDF file. I ran the same script on one of my own PDF files, and it works just fine.

Understanding WordProcessingML tags and avoid unnecessary tags

I am using the MS Word API to generate a .docx file containing data fetched from a DB, to which I apply the respective styles, fonts, symbols, etc. If the amount of data fetched from the DB is large, there is a problem displaying it in the .docx file. I found that MS Word 2007 internally writes some content through tags that may not be needed to display the data. So I am trying to figure out which MS Word tags are actually necessary in the .xml file, so that I can avoid the unnecessary ones and build only the tags needed to display the data. I am therefore planning to write my own .xml with just the MS Word tags that are needed, rather than generating the .xml from a .docx file.
My queries are:
1) Is it right that MS Word generates some tags that may not be needed when converting a .docx to document.xml, and that this makes the file heavy? If so, which tags are they, so that I can avoid them when writing my own .xml file?
2) Please send links that explain the MS Word tags and their advantages: which tags are needed and which are not?
3) Is my approach of writing a new .xml similar to document.xml (from the .docx conversion) worth pursuing, so that I can build the .xml with only the tags I need and improve the performance of the data display?
Please shed some light on this. Thanks in advance.
Thanks,
Rithu
You'll want to learn WordprocessingML in much more detail to do this. It certainly isn't impossible, but it is quite a learning curve to start with. Probably the best place to start is with this eBook. If you go the manual route, you'll need a zip technology. If you're in Visual Studio, you can make the writing of all of this easier by using the Open XML SDK.
As to your questions on 'unnecessary tags', it's hard to believe that there would be much at all in the file that is unnecessary. But that depends on what you consider not needed - for example, if a word is caught as misspelled, there will be a "dirty=1" attribute on the Run tag. If you're okay with displaying misspelled words, then that could be considered unnecessary. It really depends on what you're displaying it for, and in what.
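To give a feel for how small a hand-written document.xml can be, here is a sketch in Perl that emits only a body with plain paragraphs, each holding one run of text. The sub names are invented for illustration. A real .docx also needs [Content_Types].xml and a _rels/.rels part zipped alongside word/document.xml, which this sketch does not produce.

```perl
use strict;
use warnings;

# Main WordprocessingML namespace.
my $W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

# Escape the characters that are special in XML text content.
sub xml_escape {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    return $s;
}

# Return a minimal document.xml: one paragraph (w:p) per input string,
# each containing a single run (w:r) with a single text element (w:t).
sub minimal_document_xml {
    my @paragraphs = @_;
    my $body = join '', map {
        '<w:p><w:r><w:t xml:space="preserve">' . xml_escape($_) . '</w:t></w:r></w:p>'
    } @paragraphs;
    return qq{<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n}
         . qq{<w:document xmlns:w="$W_NS"><w:body>$body</w:body></w:document>\n};
}

print minimal_document_xml('Row 1: a < b', 'Row 2: R&D');
```

Comparing this against the document.xml that Word itself writes for the same text is a quick way to see which extra elements Word adds.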

How can I search and replace in a PDF document using Perl?

Does anyone know of a free Perl program (command line preferable), module, or any way to search and replace text in a PDF file without using an editor?
Basically I want to write a program (in Perl preferably) to automate replacing certain words (e.g. our old address) in a few hundred PDF files. I could use any program that supports command line arguments. I know there are many modules on CPAN that manipulate or create PDFs but they don't have (that I've seen) any sort of simple search and replace.
Thanks in advance for any and all advice!!!
Take a look at CAM::PDF. More specifically the changeString method.
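A hedged sketch of one way to do this with CAM::PDF. It does not use changeString itself. Instead it edits each page's content stream through getPageContent and setPageContent, which is a simpler route to show. It assumes the text is stored as a contiguous literal string in a standard encoding. PDFs with subset fonts, custom encodings, or text split across several drawing operators will not match. The names replace_literal and replace_in_pdf are made up here.

```perl
use strict;
use warnings;

# Replace every literal occurrence of $old with $new in a content stream.
# \Q...\E makes sure $old is matched as plain text, not as a regex.
sub replace_literal {
    my ($content, $old, $new) = @_;
    $content =~ s/\Q$old\E/$new/g;
    return $content;
}

# Apply the replacement to every page of $in and write the result to $out.
sub replace_in_pdf {
    my ($in, $out, $old, $new) = @_;
    require CAM::PDF;    # CPAN module, loaded only when a PDF is processed
    my $doc = CAM::PDF->new($in) or die "Cannot open $in\n";
    for my $page (1 .. $doc->numPages()) {
        my $content = $doc->getPageContent($page);
        $doc->setPageContent($page, replace_literal($content, $old, $new));
    }
    $doc->cleanoutput($out);
}

# Usage: perl replace.pl in.pdf out.pdf 'Old Street 1' 'New Road 2'
replace_in_pdf(@ARGV) if @ARGV == 4;
```

Writing to a separate output file keeps the originals intact, which matters when running over a few hundred documents.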
How did you generate those PDFs in the first place? Searching and replacing in the original sources and regenerating the PDFs seems more viable. Directly editing PDFs can be very difficult, and I'm not aware of any free tools that can do it easily.