Is there any easy way to print text using zend_pdf? - zend-framework

I want to create a PDF file using Zend_Pdf, but this approach seems inefficient because I have to specify the exact location of every piece of text :(
$page->drawText("Some text", $x, $y);
Is there any easier way to print text?

As far as I know, there is no easy way to print text with Zend_Pdf, because you have to position everything yourself.
For generating PDFs in PHP, I have always used TCPDF to convert HTML to PDF, with great results. Unless you are forced to use Zend_Pdf, I think you should avoid it. (I used version 1.11, so things might have changed.)

Related

Perl Extracting XML Tag Attribute Using Split Or Regex

I am working on a file upload system that also parses the uploaded files and generates another file based on the information inside each one. The files being uploaded are XML files. I only need to parse the first XML tag in each file and only need to get the value of the single attribute in that tag.
Sample XML:
<LAB title="lab title goes here">...</LAB>
I am looking for a good way of extracting the value of the title attribute using the Perl split function or using Regex. I would use a Perl XML parser if I had the ability to install Perl modules on the server I am hosting my code on, however I do not have that ability.
This XML is located in an XML file that I am opening and then attempting to parse to get the attribute value. I have tried using both split and regex with no luck. However, I am not very familiar with Perl or regular expressions.
This is the basic outline of my code so far:
open(LAB, "<", "path-to-file-goes-here") or die "Unable to open lab.\n";
foreach my $line (<LAB>) {
    my @pieces = split(/"(.*)"/, $line);
    foreach my $piece (@pieces) {
        print "$piece\n";
    }
}
I have tried using split to match against title alone using
/title/
Or match against the = character or the " character using
/\=/ or /\"/
I have also tried doing similar things using regex and have had no luck either. I am not sure if I am just not using the proper expression or if this is not possible using split/regex. Any help on the matter would be much appreciated, as I am admittedly still a novice at Perl. If this type of question has been answered elsewhere, I apologize. I did some searching and could not find a solution. Most threads suggest using an XML parsing Perl module, which I would do if I had the privileges to install one.
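For the narrow case described above, a single regex match with a capture group can pull the value out directly, which is easier than trying to carve the line up with split. This is only a minimal sketch: it assumes the opening tag sits on one line and that the attribute value is double-quoted with no embedded quotes, as in the sample. The helper name extract_title is made up for illustration, and this is not a general XML parser.

```perl
use strict;
use warnings;

# Hypothetical helper: return the title attribute of the first <LAB ...> tag
# found in $text, or undef if there is none.
sub extract_title {
    my ($text) = @_;
    # [^>]*? skips any other attributes; ([^"]*) captures up to the closing quote.
    if ($text =~ /<LAB\b[^>]*?\btitle\s*=\s*"([^"]*)"/) {
        return $1;
    }
    return;
}

my $sample = '<LAB title="lab title goes here">...</LAB>';
my $title  = extract_title($sample);
print defined $title ? "$title\n" : "no title found\n";
```

When reading from a file, you would call the helper on each line and stop at the first defined result. Anything beyond this simple shape (single quotes, entities, tags split across lines) is where a real parser starts to pay off.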
"But I can't use CPAN" is a quick way to get yourself downvoted on the Perl tag (though it wasn't I who did so). There are many ways that you can use CPAN, even if you don't have root. In fact you can have your own Perl even if you don't have root. While I highly recommend some of those options, for now, the easiest way to do this is just to download some Pure Perl modules, and included them in your codebase. Mojolicious has a very small, but very useful XML/DOM parser called Mojo::DOM which is a likely candidate for this kind of process.

Trouble reading text from a pdf in Perl

I am trying to read the text content of a pdf file into a Perl variable. From other SO questions/answers I get the sense that I need to use CAM::PDF. Here's my code:
#!/usr/bin/perl -w
use CAM::PDF;
my $pdf = CAM::PDF->new('1950-01-01.pdf');
print $pdf->numPages(), " pages\n\n";
my $text = $pdf->getPageText(1);
print $text, "\n";
I tried running this on this pdf file. There are no errors reported by Perl. The first print statement works; it prints "2 pages" which is the correct number of pages in this document.
The next print statement does not return anything readable. Here's what the output looks like in Emacs:
2 pages
^A^B^C^D^E^C^F^D^G^H
^D^A^K^L^C^M^D^N^C^M^O^D^P^C^Q^Q^C ^D^R^K^M^O^D ^A^B^C^D^E
^F^G^G^H^E
^K^L
^M^N^E^O^P^E^O^Q^R^S^E
.... more lines with similar codes ....
Is there something I can do to make this work? I don't understand pdf files too well, but I thought that because I can easily copy and paste the text from the PDF file using Acrobat, it must be recognized as text and not an image, so I hoped this meant I could extract it with Perl.
Any guidance would be much appreciated.
PDFs can have different kinds of content. A PDF may not have any readable text at all, only bitmaps and graphical content, for example. The PDF you linked to has compressed data in it. Open it with a text editor, and you will see that the content is in a "/Filter/FlateDecode" block. Perhaps CAM::PDF doesn't support that. Google FlateDecode for a few ideas.
Looking further into that PDF, I see that it also uses embedded subsets of fonts, with custom encodings. Even if CAM::PDF handles the compression, the custom encoding may be what's throwing it off. This may help: Web page from a software company, describing the problem
I'm fairly certain that the issue isn't with your perl code, it is with the PDF file. I ran the same script on one of my own PDF files, and it works just fine.
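For what it's worth, FlateDecode is ordinary zlib compression, so a compressed stream can be inflated with the Compress::Zlib module. Here is a rough sketch, assuming a stream that uses only /FlateDecode, has a direct /Length value listed before the filter, and uses no predictor. The fake object is built inside the script for illustration and is not taken from the linked PDF, and the helper name is made up.

```perl
use strict;
use warnings;
use Compress::Zlib qw(compress uncompress);

# Build a stand-in for one PDF stream object (illustration only).
my $content = "BT /F1 12 Tf 72 720 Td (Hello) Tj ET\n";
my $packed  = compress($content);
my $object  = "4 0 obj\n<< /Length " . length($packed)
            . " /Filter /FlateDecode >>\nstream\n$packed\nendstream\nendobj\n";

# Hypothetical helper: inflate every FlateDecode stream found in $pdf_bytes.
sub inflate_streams {
    my ($pdf_bytes) = @_;
    my @out;
    while ($pdf_bytes =~ m{/Length\s+(\d+)[^>]*/FlateDecode[^>]*>>\s*stream\r?\n}g) {
        # The stream data starts right where the match ended.
        my $raw  = substr($pdf_bytes, pos($pdf_bytes), $1);
        my $text = uncompress($raw);    # undef if the data is not valid zlib
        push @out, $text if defined $text;
    }
    return @out;
}

print for inflate_streams($object);
```

Even after inflating, what you get is page-drawing operators rather than plain prose, and with a subset font that uses a custom encoding the bytes inside the string operands still won't map to readable characters without the font's mapping table.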

Setting an Excel worksheet's custom page size (not the printable area) from Perl

Long story short, all I'm really attempting to do is print my reports on half sheets. I had Kinko's chop a pack of printer paper in half and my laser printer is perfectly happy to suck them in and print the reports properly, if the paper size of the Excel report is set to exactly 8.5" x 5.64".
That can be done easily in Excel, but it's the one and only adjustment, in my project, I wasn't able to automate with Perl using Spreadsheet::WriteExcel. The CPAN documentation states that you can pick from some of the default sizes normally available with Excel, but doesn't provide an option to specify your own paper size.
Even if you establish the custom size you need in Excel beforehand, making it available in future spreadsheets, as one of your selectable paper sizes, there doesn't seem to be an index, using set_paper($index), that would specify that newly established custom size.
Thank you in advance!
#!/usr/local/gnu/bin/perl --
use strict;
use warnings;
use Spreadsheet::WriteExcel;
my $repWB = Spreadsheet::WriteExcel->new('../tmp/test.xls');
my $repWS = $repWB->add_worksheet('AA');
$repWS->set_paper(1);
Since VBScript can do many of these things natively, maybe you could try embedding the necessary VBScript into your main Perl script using Inline::WSC.
You could determine the needed VBScript by recording an Excel macro of you setting the print size. Then embed that code into your main Perl script.
I am the author of Spreadsheet::WriteExcel.
As far as I know there isn't an option in Excel to set a custom paper size. This is usually set in the printer (correct me if I am wrong).
Excel can store printer information along with the workbook data so there may be a workaround.
Could you send me a single-sheet workbook with data in cell A1 only, plus a copy with the custom page size set? I'll take a look to see if it is possible.
P.S. The available option A5 (index 11) in landscape mode is fairly close: 8.27" x 5.83". Or the undocumented "Organiser L" (index 129), which should be half Letter size: 8.5" x 5.5".
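To see how close the stock sizes are to the requested 8.5" x 5.64", the metric dimensions can be converted to inches. A quick sketch: the A5 figures (148 x 210 mm) are the standard ISO size, the half-Letter figures are simply Letter cut in two, and the helper name is made up.

```perl
use strict;
use warnings;

# Convert millimetres to inches, rounded to two decimals.
sub mm_to_in {
    my ($mm) = @_;
    return sprintf("%.2f", $mm / 25.4);
}

my %landscape = (
    'A5'          => [ mm_to_in(210), mm_to_in(148) ],   # ISO A5 turned sideways
    'half Letter' => [ '8.50', '5.50' ],                 # 11 in / 2 = 5.5 in
    'requested'   => [ '8.50', '5.64' ],
);

for my $name (sort keys %landscape) {
    printf "%-12s %s x %s in\n", $name, @{ $landscape{$name} };
}
```

In Spreadsheet::WriteExcel, A5 is paper index 11, so $repWS->set_paper(11) followed by $repWS->set_landscape() would be the closest documented combination. Whether the printer feeds the cut sheets correctly at that size is something you would have to try.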

How to programmatically convert PostScript to PDF with the fewest steps?

Is there any way to just slap on a header and use a PS file as a PDF, assuming that the PS is very simple and doesn't do anything complicated?
I want to do this programmatically, not using ps2pdf.
Thanks.
You can certainly *try* "just slapping on a header" ... but I don't think you'll get too far :-)
Personally, I'd suggest ps2pdf is the best solution (for example, invoke it with ShellExec() or system()).
But if you want a programmatic solution, ps2pdf is just a wrapper around Ghostscript. Have you considered using the Ghostscript libraries?
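If calling the Ghostscript executable directly is acceptable, the invocation is short. A sketch in Perl, assuming the gs binary is on the PATH; the helper name and file names are placeholders. The options shown are the usual ones for driving the pdfwrite device non-interactively.

```perl
use strict;
use warnings;

# Hypothetical helper: the argument list for a Ghostscript PS-to-PDF run.
sub gs_command {
    my ($ps_file, $pdf_file) = @_;
    return (
        'gs', '-q', '-dSAFER', '-dNOPAUSE', '-dBATCH',
        '-sDEVICE=pdfwrite',
        "-sOutputFile=$pdf_file",
        $ps_file,
    );
}

my @cmd = gs_command('input.ps', 'output.pdf');
print join(' ', @cmd), "\n";

# The list form of system() bypasses the shell, so odd file names are safe:
# system(@cmd) == 0 or die "Ghostscript failed: $?";
```

The system() call is left commented out so the snippet runs without Ghostscript installed; uncomment it to do the actual conversion.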
You cannot wrap a PostScript file into a PDF file.
Although a PDF file looks similar to a PostScript file,
a PDF file must have a special structure, including a cross-reference
table at the end with file offsets to different parts of the PDF file.
To understand the PDF file format you can download the PDF Reference from:
http://partners.adobe.com/public/developer/en/pdf/PDFReference.pdf
If your software generates the PostScript file, maybe you can
extend it to write a PDF file too? It takes some time to understand
the PDF file format but it is not especially difficult if you are familiar with PostScript.
If this is too difficult, then use ps2pdf to do the hard work for you.
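To make the structural point concrete, here is a minimal sketch in Perl that builds a one-page PDF by hand, including the cross-reference table and its byte offsets. The object layout, the Helvetica font, and the function name are illustrative assumptions, and the text is assumed to be plain ASCII. A real converter would also have to translate the PostScript drawing operators into a PDF content stream, which this does not attempt.

```perl
use strict;
use warnings;

# Hypothetical helper: build a one-page PDF showing $text; returns raw bytes.
sub minimal_pdf {
    my ($text) = @_;
    $text =~ s/([\\()])/\\$1/g;    # escape characters special in PDF strings
    my $stream = "BT /F1 24 Tf 72 720 Td ($text) Tj ET";
    my @objects = (
        "<< /Type /Catalog /Pages 2 0 R >>",
        "<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        "<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
          . "/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        "<< /Length " . length($stream) . " >>\nstream\n$stream\nendstream",
        "<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    );

    my $pdf = "%PDF-1.4\n";
    my @offsets;
    for my $i (0 .. $#objects) {
        push @offsets, length($pdf);    # byte offset where this object starts
        $pdf .= ($i + 1) . " 0 obj\n$objects[$i]\nendobj\n";
    }

    my $xref_pos = length($pdf);
    my $count    = @objects + 1;        # plus the mandatory free entry 0
    $pdf .= "xref\n0 $count\n";
    $pdf .= "0000000000 65535 f \n";    # each xref entry is exactly 20 bytes
    $pdf .= sprintf("%010d 00000 n \n", $_) for @offsets;
    $pdf .= "trailer\n<< /Size $count /Root 1 0 R >>\n";
    $pdf .= "startxref\n$xref_pos\n%%EOF\n";
    return $pdf;
}

my $pdf = minimal_pdf('Hello from Perl');
printf "built %d bytes\n", length $pdf;
# To save it: open(my $out, '>:raw', 'hello.pdf') or die $!; print {$out} $pdf;
```

Every offset in the table has to match the actual position of its object in the file, which is why a header alone cannot turn a PostScript file into a PDF.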

How can I do a full-text search of PDF files from Perl?

I have a bunch of PDF files and my Perl program needs to do a full-text search of them to return which ones contain a specific string.
To date I have been using this:
my @search_results = `grep -i -l \"$string\" *.pdf`;
where $string is the text to look for.
However this fails for most pdf's because the file format is obviously not ASCII.
What can I do that's easiest?
Clarification:
There are about 300 PDFs whose names I do not know in advance. PDF::Core is probably overkill. I am trying to get pdftotext and grep to play nice with each other, but since I don't know the names of the PDFs, I haven't found the right syntax yet.
Final solution using Adam Bellaire's suggestion below:
@search_results = `for i in \$( ls ); do pdftotext \$i - | grep --label="\$i" -i -l "$search_string"; done`;
The PerlMonks thread here talks about this problem.
It seems that for your situation, it might be simplest to get pdftotext (the command line tool), then you can do something like:
my @search_results = `pdftotext myfile.pdf - | grep -i -l \"$string\"`;
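If the file names are not known in advance, the loop over files can live in Perl rather than in the shell, which also sidesteps quoting problems with the search string. A sketch, assuming pdftotext is on the PATH; the function names are illustrative, and the optional extractor argument exists only so the matching logic can be exercised without real PDFs.

```perl
use strict;
use warnings;

# Run pdftotext on one file and return its text ('-' sends output to stdout).
# The list form of open avoids the shell, so file names with spaces are safe.
sub pdf_text {
    my ($file) = @_;
    open(my $fh, '-|', 'pdftotext', $file, '-') or return '';
    local $/;                       # slurp mode
    my $text = <$fh>;
    close($fh);
    return defined $text ? $text : '';
}

# Return the files whose extracted text contains $needle, case-insensitively.
sub search_pdfs {
    my ($needle, $files, $extract) = @_;
    $extract ||= \&pdf_text;
    # \Q...\E quotes the needle so it is matched literally, not as a regex.
    return grep { $extract->($_) =~ /\Q$needle\E/i } @$files;
}

my @pdfs = glob('*.pdf');
my @search_results = search_pdfs('some string', \@pdfs);
print "$_\n" for @search_results;
```

One caveat: pdftotext breaks the text into lines, so a phrase that wraps across a line in the PDF will not match unless you normalise whitespace first.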
My library, CAM::PDF, has support for extracting text, but it's an inherently hard problem given the graphical orientation of PDF syntax. So, the output is sometimes gibberish. CAM::PDF bundles a getpdftext.pl program, or you can invoke the functionality like so:
my $doc = CAM::PDF->new($filename) || die "$CAM::PDF::errstr\n";
for my $pagenum (1 .. $doc->numPages()) {
    my $text = $doc->getPageText($pagenum);
    print $text;
}
I second Adam Bellaire's solution. I used the pdftotext utility to create a full-text index of my ebook library. It's somewhat slow but does its job. As for full-text search, try Plucene or KinoSearch to store the full-text index.
You may want to look at PDF::Core.
The easiest full-text index/search I've used is mysql. You just insert into the table with the appropriate index on it. You need to spend some time working out the relative weightings for fields (a match in the title might score higher than a match in the body), but this is all possible, albeit with some hairy sql.
Plucene is deprecated (there hasn't been any active work on it in the last two years afaik) in favour of KinoSearch. KinoSearch grew, in part, out of understanding the architectural limitations of Plucene.
If you have ~300 pdfs, then once you've extracted the text from the PDF (assuming the PDF has text and not just images of text ;) and depending on your query volumes you may find grep is sufficient.
However, I'd strongly suggest the mysql/kinosearch route as they have covered a lot of ground (stemming, stopwords, term weighting, token parsing) that you don't benefit from getting bogged down with.
KinoSearch is probably faster than the mysql route, but the mysql route gives you more widely used standard software/tools/developer-experience. And you get the ability to use the power of sql to augment your free-text search queries.
So unless you're talking HUGE data-sets and insane query volumes, my money would be on mysql.
You could try Lucene (the Perl port is called Plucene). The searches are incredibly fast and I know that PDFBox already knows how to index PDF files with Lucene. PDFBox is Java, but chances are there is something very similar somewhere in CPAN. Even if you can't find something that already adds PDF files to a Lucene index it shouldn't be more than a few lines of code to do it yourself. Lucene will give you quite a few more searching options than simply looking for a string in a file.
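There's also a very quick and dirty way. Text in a PDF file is sometimes stored as plain text: if you open a PDF in a text editor or run 'strings' on it, you may see the text in there. This only works when the content streams are uncompressed; if they are compressed (for example with FlateDecode), the text will not be visible this way. The rest of the binary junk is usually embedded fonts, images, etc.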