What other diagnostic methods can I use to solve this particular Perl problem? - perl

After a lot of experiments, I still can't get the following script working. I need some guidance on how to diagnose this particular Perl problem. Thanks in advance.
This script is for testing the use of Office 2007 OCR API:
use warnings;
use strict;
use Win32::OLE;
use Win32::OLE::Const;
Win32::OLE::Const->Load("Microsoft Office Document Imaging 12\.0 Type Library")
or
die "Cannot use the Office 2007 OCR API";
my $miDoc = Win32::OLE->new('MODI.Document')
or die "Cannot create a MODI object";
#Loads an existing TIFF file
$miDoc->Create('OCR-test.tif');
#Performs OCR with the OCR language set to English
$miDoc->OCR(LangId => 'miLANG_ENGLISH');
#Get the OCR result
my $OCRresult = $miDoc->{Images}->Item(0)->{Layout}{Text};
print $OCRresult;
I did a small test. I loaded an .MDI file containing the OCR information. I deleted the OCR method line and ran the script and I got the expected text output of "print $OCRresult". But otherwise, Perl throws me the error saying
Use of uninitialized value $OCRresult in print at E:\OCR-test.pl line 15
I'm suspecting that something's wrong with the line
$miDoc->OCR(LangId => 'miLANG_ENGLISH');
I tried leaving the parens empty or using three parameters, like 'miLANG_ENGLISH',1,1 etc., but without any luck.
I also tried using Microsoft Office Document Imaging to test whether the TIF I'm experimenting with was text recognizable, and the result was positive.
So what other diagnostic methods do I have?
Or can someone who happens to have Office 2007 test my code with any JPG, BMP, or TIF picture that has text content and see if something's wrong?
Thanks in advance.
UPDATE
Haha, I've finally figured out where the problem is and how to solve it. @hobbs, thank you for leaving the comment :) Things got interesting. While I was responding to your comment, I added the URL of the Office Document Imaging 2003 VBA Language Reference and took yet another look at it. The following information caught my eye:
LangId can be one of the following MiLANGUAGES constants.
miLANG_CHINESE_SIMPLIFIED (2052, &H804)
I changed the following OCR method line:
$miDoc->OCR('miLANG_ENGLISH',1,1);
to this:
$miDoc->OCR(2052,1,1);
A few notes:
1. I'm running ActivePerl 5.10.0 on Windows XP (Chinese version)
2. Before this, I had already tried $miDoc->OCR(9), but without luck
And suddenly and kind of magically that pesky ERROR saying "Use of uninitialized value $OCRresult in print at E:\OCR-test.pl line 15" disappeared completely and the OCRed text appeared on the screen. The OCR result was not satisfying but the parameter "2052" refers to Chinese and the TIF image contains all English. So I changed the parameter to
$miDoc->OCR(9,1,1) but this time without luck. Windows threw me this error:
unknown software exception (0x0000000d)
I changed the TIF image to one that contains all Chinese characters and changed the parameter to "$miDoc->OCR(2052,1,1);" again and this time everything worked just like expected. The OCR result was satisfying.
Now I think there's something odd about my Office 2007 OCR API. Someone running Windows XP (English version) with Office 2007 installed would probably not encounter that exception with the parameter
$miDoc->OCR(9,1,1);
Anyway, I'm really happy that I've finally got things working :D
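The fix above is consistent with how Win32::OLE::Const works: Load returns a reference to a hash of constant names and values, so the quoted string 'miLANG_ENGLISH' is just text and not the number the OCR method expects. Here is a minimal sketch of looking the value up by name. It uses a stand-in hash instead of the real type library so that it runs anywhere; 2052 is the value quoted from the language reference, 9 is the English value tried in the question, and lang_id is a hypothetical helper, not part of any module.

```perl
use strict;
use warnings;

# Stand-in for the hash reference returned by Win32::OLE::Const->Load(...).
# On a real system this would be: my $const = Win32::OLE::Const->Load($name);
my $const = {
    miLANG_ENGLISH            => 9,      # value tried in the question
    miLANG_CHINESE_SIMPLIFIED => 2052,   # value quoted from the language reference
};

# Hypothetical helper: turn a constant name into its numeric value and die
# loudly on a typo instead of silently passing a string on to OCR().
sub lang_id {
    my ($constants, $name) = @_;
    die "Unknown constant: $name\n" unless exists $constants->{$name};
    return $constants->{$name};
}

print lang_id($const, 'miLANG_CHINESE_SIMPLIFIED'), "\n";

# With the real object, the call would then look like:
#   $miDoc->OCR(lang_id($const, 'miLANG_ENGLISH'), 1, 1);
```

Looking the number up by name keeps the script readable and avoids hard-coding 2052 or 9.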

For starters I would try dumping the value of $miDoc->{Images} -- does it exist? If it exists and it's a collection does it contain anything? If it contains anything, what is it? An error? Or maybe just a different structure than you're expecting? warn, Dumper, and a little exploration can go a long way.
Incidentally, if you want to do the "modern" thing and don't mind grabbing a nifty tool off of CPAN, try Devel::Dwarn -- it makes dumping to stderr even more fun than it was already :)
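As a rough sketch of that kind of exploration with core modules only: the nested hash below is a stand-in for the MODI object (the real one needs Windows and Office), and describe is a hypothetical helper name. The idea is to walk the chain one step at a time and dump each step to stderr, so the first step that comes back undefined is easy to spot.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Sortkeys = 1;   # stable key order in the dumps
$Data::Dumper::Indent   = 1;   # compact indentation

# Hypothetical helper: report on STDERR what kind of value we got, dump it,
# and return the one-line summary.
sub describe {
    my ($label, $value) = @_;
    my $summary = !defined($value) ? 'undef'
                : ref($value)      ? 'reference of type ' . ref($value)
                :                    'plain scalar';
    warn "$label is $summary\n" . Dumper($value);
    return $summary;
}

# Stand-in data shaped like the chain used in the question.
my $doc = { Images => [ { Layout => { Text => undef } } ] };

describe('Images',      $doc->{Images});
describe('first image', $doc->{Images}[0]);
describe('Layout/Text', $doc->{Images}[0]{Layout}{Text});

# With real Win32::OLE objects it is also worth printing
# Win32::OLE->LastError() after each call.
```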

Related

How to locate code causing corrupt binary output in Perl

I have a relatively complex Perl program that manages various pages and resources for my sites. Somewhere along the line I messed up something in a library of several thousand lines that provides essential services to most of the different scripts for the system so that scripts within my codebase that output PDF or PNG files no longer can output those files validly. If I rewrite the scripts that do the output to avoid using that library, they work, but I'd like to figure out what I broke within my library that is causing it to hurt binary output.
For example, in one snippet of code, I open a PDF file (or any sort of file -- it detects the mime type automatically) and then print it directly:
#Figure out MIME type.
use File::MimeInfo::Magic;
$mimeType = mimetype($filename);
my $fileData;
open (resource, $filename);
foreach my $self (<resource>) { $fileData .= $self; }
close (resource);
print "Content-type: " . $mimeType . "\n\n";
print $fileData;
exit;
This worked great, but at some point while editing the suspect library I mentioned, I did something that broke it and I'm stumped as to what I did. I had been playing with different utf8 encoding functions, but as far as I can tell, I removed all of my experimental code and the problem remains. Merely loading that library, without calling any of its functions, breaks the script's ability to output the file.
The output is actually being corrupted visibly, if I open it in a text editor. If I compare the source file that is opened by the code given above and the output, the source file and the output file have many differences despite there being no processing in the code above before output (those links are to a sample PDF that was run through the broken code).
I've tried retracing my steps for days and cannot find what is wrong in the problematic library. I hadn't used this function in a while and I wrote a lot of new code since I last tested it, so it is hard to know precisely where the problem is. My hope is someone may be able to look at the corrupted output file in comparison to the source file and at least point me in the direction of what I should be looking for that could cause such a result. I feel like I'm looking for a needle in a haystack.
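One thing that could produce this symptom, offered as a guess and not a diagnosis: an encoding layer left on STDOUT by the utf8 experiments (for example through binmode or the open pragma), which re-encodes every byte above 0x7F. The sketch below assumes nothing about the library itself. The helper names layers_of and written_through are made up for illustration, and the sample bytes are arbitrary. It prints the layers on STDOUT and shows how the same bytes change length when written through a :utf8 layer.

```perl
use strict;
use warnings;

# List the PerlIO layers on the output side of a handle. A "utf8" or
# "encoding(...)" entry on STDOUT would alter binary output.
sub layers_of {
    my ($fh) = @_;
    return join ' ', PerlIO::get_layers($fh, output => 1);
}

# Write bytes through the given layer into an in-memory buffer and
# return what actually came out the other end.
sub written_through {
    my ($layer, $bytes) = @_;
    my $buffer = '';
    open(my $out, ">$layer", \$buffer)
        or die "Cannot open in-memory handle: $!";
    print {$out} $bytes;
    close $out;
    return $buffer;
}

my $sample = "%PDF\xE2\xE3";   # arbitrary sample with two bytes above 0x7F

printf "STDOUT layers: %s\n", layers_of(\*STDOUT);
printf "through :raw:  %d bytes\n", length(written_through(':raw',  $sample));
printf "through :utf8: %d bytes\n", length(written_through(':utf8', $sample));
```

If the layer list printed after loading the library contains something the script did not ask for, calling binmode(STDOUT, ':raw') before printing the file data is a quick way to test whether that layer is the culprit.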

FileMaker Error: PDF could not be created on this disk

I don't really know if this would be a good place to ask this question, but FileMaker's forums haven't really been all that helpful. Our graphics department recently has been having issues with a script that they have been using for a few years now, and it just stopped working. I know nothing about FileMaker's language and have never used it before, I've just been asked to try and get it figured out.
The version that we are using is Advanced Pro 18.
Here is a snapshot of the script that is being run
This is the error it produces:
Any help would be appreciated, Thanks!
Check whether any font used in the layout is missing on their computer.
If any fonts have an uppercase extension (.TTF), change it to lowercase (.ttf).
If that is not the case, try the Arial font for all the fields in the layout.
Make sure you have enough space.
Make sure your pdf document with the same name is not open.
Reinstall your pdf reader.
Suggestion: You can make the script step simpler.
You should use full file path to set the $Filename variable in line 7 and line 18, like these:
in Windows:
Set Variable [$Filename; Value: "filewin:/DriveLetter:/DirectoryName/" & Log Book::calculate job # & ".pdf"]
or in Mac:
Set Variable [$Filename; Value: "filemac:/VolumeName/DirectoryName/" & Log Book::calculate job # & ".pdf"]
One late additional note: make sure the filename is free of "prohibited" characters. If the string produced by Log Book::calculate job # included a "/" character, for example, you'd likely see the same error message.

tSendMail - New Line Trouble

I am trying to create an email with some job status information, which I want to spread across multiple lines. However, whatever I do, I get the output on one line. I have changed the MIME type to HTML and used "\n", "\r", "\r\n", and String Objects newline. Nothing seems to work.
I did notice that these characters get processed, even though the outcome isn't as expected: I don't see them in the email body, which suggests that the text processor accepts them but doesn't process them the way it should. Is this a bug in the component?
I am on Talend Open Studio 7.0.1, on an Ubuntu 16.04.4 VM, on a Windows 10 system (if that helps).
HTML <br> works.
I tried it earlier, but it looks like I didn't structure my HTML tags well, so it failed. I redid it from the start and got it right.
Guess what - The more you try, the more you learn. :)

Perl Extracting XML Tag Attribute Using Split Or Regex

I am working on a file upload system that also parses the uploaded files and generates another file based on info inside each uploaded file. The uploaded files are XML files. I only need to parse the first XML tag in each file and only need to get the value of the single attribute in the tag.
Sample XML:
<LAB title="lab title goes here">...</LAB>
I am looking for a good way of extracting the value of the title attribute using the Perl split function or using Regex. I would use a Perl XML parser if I had the ability to install Perl modules on the server I am hosting my code on, however I do not have that ability.
This XML is located in an XML file that I am opening and then attempting to parse to get the attribute value. I have tried using both split and regex with no luck. However, I am not very familiar with Perl or regular expressions.
This is the basic outline of my code so far:
open(LAB, "<", "path-to-file-goes-here") or die "Unable to open lab.\n";
foreach my $line (<LAB>) {
    my @pieces = split(/"(.*)"/, $line);
    foreach my $piece (@pieces) {
        print "$piece\n";
    }
}
I have tried using split to match against title alone using
/title/
Or match against the = character or the " character using
/\=/ or /\"/
I have also tried doing similar things using regex and have had no luck as well. I am not sure if I am just not using the proper expression or if this is not possible using split/regex. Any help on the matter would be much appreciated, as I am admittedly a novice at Perl still. If this type of question has been answered elsewhere, I apologize. I did some searching and could not find a solution. Most threads suggest using an XML parsing Perl module, which I would if I had the privileges to install them.
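For the narrow case described here (one known tag, one double-quoted attribute), a capturing regex is probably simpler than split. The following is a minimal sketch, not a general XML parser: it assumes the tag is spelled LAB, that the title value is double-quoted and contains no escaped quotes, and that the whole opening tag sits on one line. The sub name lab_title is made up for illustration.

```perl
use strict;
use warnings;

# Return the title attribute of the first <LAB ...> tag in the string,
# or undef if there is no such tag or attribute.
sub lab_title {
    my ($xml) = @_;
    return $xml =~ /<LAB\b[^>]*?\btitle\s*=\s*"([^"]*)"/ ? $1 : undef;
}

# Sample line taken from the question.
my $sample = '<LAB title="lab title goes here">...</LAB>';
print lab_title($sample), "\n";
```

When reading from the file, the same sub can be called on each line until it returns a defined value, at which point the loop can stop because only the first tag matters.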
"But I can't use CPAN" is a quick way to get yourself downvoted on the Perl tag (though it wasn't I who did so). There are many ways that you can use CPAN, even if you don't have root. In fact you can have your own Perl even if you don't have root. While I highly recommend some of those options, for now, the easiest way to do this is just to download some Pure Perl modules, and included them in your codebase. Mojolicious has a very small, but very useful XML/DOM parser called Mojo::DOM which is a likely candidate for this kind of process.

Trouble reading text from a pdf in Perl

I am trying to read the text content of a pdf file into a Perl variable. From other SO questions/answers I get the sense that I need to use CAM::PDF. Here's my code:
#!/usr/bin/perl -w
use CAM::PDF;
my $pdf = CAM::PDF->new('1950-01-01.pdf');
print $pdf->numPages(), " pages\n\n";
my $text = $pdf->getPageText(1);
print $text, "\n";
I tried running this on this pdf file. There are no errors reported by Perl. The first print statement works; it prints "2 pages" which is the correct number of pages in this document.
The next print statement does not return anything readable. Here's what the output looks like in Emacs:
2 pages
^A^B^C^D^E^C^F^D^G^H
^D^A^K^L^C^M^D^N^C^M^O^D^P^C^Q^Q^C ^D^R^K^M^O^D ^A^B^C^D^E
^F^G^G^H^E
^K^L
^M^N^E^O^P^E^O^Q^R^S^E
.... more lines with similar codes ....
Is there something I can do to make this work? I don't understand pdf files too well, but I thought that because I can easily copy and paste the text from the PDF file using Acrobat, it must be recognized as text and not an image, so I hoped this meant I could extract it with Perl.
Any guidance would be much appreciated.
PDFs can have different kinds of content. A PDF may not have any readable text at all, only bitmaps and graphical content, for example. The PDF you linked to has compressed data in it. Open it with a text editor, and you will see that the content is in a "/Filter/FlateDecode" block. Perhaps CAM::PDF doesn't support that. Google FlateDecode for a few ideas.
Looking further into that PDF, I see that it also uses embedded subsets of fonts, with custom encodings. Even if CAM::PDF handles the compression, the custom encoding may be what's throwing it off. This may help: Web page from a software company, describing the problem
I'm fairly certain that the issue isn't with your Perl code; it is with the PDF file. I ran the same script on one of my own PDF files, and it works just fine.