I'm trying to go paperless with all my utility bills, and that means downloading the statements from Suddenlink instead of stuffing the paper ones into a filing cabinet.
I've used WWW::Mechanize before and I've liked it (why did I try to do this stuff in LWP for so long?), so I've gone ahead and put together a workable script. I can log in, navigate to the page that lists the PDF links, and loop through those.
I do the following:
my $pdf = $mech->clone();
for my $link ($mech->find_all_links(url_regex => qr/viewstatement\.html/)) {
    # [removed for brevity]
    unless (-f "Suddenlink/$year/$date.pdf") {
        $pdf->get($link->url);
        $pdf->save_content("Suddenlink/$year/$date.pdf", binary => 1);
    }
}
When I compare one of these files with the same downloaded via Chrome, it's apparent what the problem is. Both files are identical on up to about 8-24 kbytes (it varies), but the Chrome pdf will be complete, and the perl-script pdf will be truncated.
It's late, and there's nothing obviously wrong with the code. Google is turning up a few problems with save_content(), but not anything like what I'm getting.
What am I doing wrong?
...[S]et $mech->agent_alias() to something. [Suddenlink is] doing a connection reset whenever they see a weird user agent string.
– John O 18 hours ago
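For anyone who hits the same truncation, a minimal sketch of the fix John describes (the alias string is just one of the browser aliases WWW::Mechanize ships with):

use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();

# Pretend to be a normal browser so the server doesn't reset the connection.
$mech->agent_alias('Windows Mozilla');

# ... then log in and download the statements as before ...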
I have a relatively complex Perl program that manages various pages and resources for my sites. Somewhere along the line I messed up something in a library of several thousand lines that provides essential services to most of the scripts in the system, and now scripts in my codebase that output PDF or PNG files can no longer output those files validly. If I rewrite the output scripts to avoid using that library, they work, but I'd like to figure out what I broke in the library that is hurting binary output.
For example, in one snippet of code, I open a PDF file (or any sort of file -- it detects the mime type automatically) and then print it directly:
# Figure out the MIME type.
use File::MimeInfo::Magic;
my $mimeType = mimetype($filename);

# Slurp the file verbatim into memory.
open(my $resource, '<', $filename) or die "Can't open $filename: $!";
my $fileData = do { local $/; <$resource> };
close($resource);

print "Content-type: " . $mimeType . "\n\n";
print $fileData;
exit;
This worked great, but at some point while editing the suspect library I mentioned, I did something that broke it and I'm stumped as to what I did. I had been playing with different utf8 encoding functions, but as far as I can tell, I removed all of my experimental code and the problem remains. Merely loading that library, without calling any of its functions, breaks the script's ability to output the file.
The output is actually corrupted visibly if I open it in a text editor. If I compare the source file that the code above opens with what the code outputs, the two have many differences, despite there being no processing before the output.
I've tried retracing my steps for days and cannot find what is wrong in the problematic library -- I hadn't used this function in a while and have written a lot of new code since I last tested it, so it is hard to know precisely where the problem is. My hope is that someone may be able to look at the corrupted output in comparison to the source file and at least point me in the direction of what could cause such a result. I feel like I'm looking for a needle in a haystack.
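One thing I can check, given the utf8 experiments mentioned above, is whether merely loading the library leaves an encoding layer on STDOUT (for example via "use open" or a stray binmode), which would silently re-encode binary output. A diagnostic sketch, reusing the $mimeType and $fileData variables from the code above:

use PerlIO;

# Show which layers are currently on STDOUT; a ":utf8" or ":encoding(...)"
# layer here would explain corrupted binary output.
warn "STDOUT layers: ", join(' ', PerlIO::get_layers(\*STDOUT)), "\n";

# Force raw output before printing the file, to rule the layers out.
binmode(STDOUT, ':raw');
print "Content-type: $mimeType\n\n";
print $fileData;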
I am trying to create an email with some job status information, which I want to spread across multiple lines. However, whatever I do, the output comes out on one line. I have changed the MIME type to HTML, and tried "\n", "\r", "\r\n", and the String object's newline. Nothing seems to work.
Although I noticed that these characters do get processed, the outcome isn't as expected: I don't see them in the email body, which suggests the text processor accepts them but just doesn't handle them the way it should. Am I looking at a bug in the component?
I am on Talend Open Studio 7.0.1, on an Ubuntu 16.04.4 VM, on a Windows 10 system (if that helps).
HTML <br> works.
I tried it earlier, but it looks like I didn't structure my HTML tags well, so it failed. I redid it from the start and got it right.
Guess what - The more you try, the more you learn. :)
I'm not sure how this can be a FOP issue, but I've never seen it with PDFs from any other source, so I've tried to investigate further.
Our application creates PDFs via xsl-fo, using FOP. This has worked great for a couple of years -- occasionally a user will have trouble printing a specific document, and see a very particular type of corruption, wherein most characters are "incremented". That is to say, 1 becomes 2, M becomes N, period becomes a slash, and the word invoice becomes the mildly amusing "jowpjdf". The document displays fine (typically in Adobe Reader). We've generally worked around it, but now an even odder case presents itself.
A new addition to our application generates two substantially similar PDFs with FOP, then concatenates them using Perl's PDF::Reuse to grab the files from the filesystem and create a new document, which is then sent to the user by email. The user opens the document fine in Reader, hits print, and something new happens... Page 1 prints perfectly, but page 2 is corrupt in exactly the manner described above.
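In sketch form (not our exact code; the file names are placeholders), the PDF::Reuse concatenation looks roughly like this:

use strict;
use warnings;
use PDF::Reuse;

prFile('combined.pdf');    # start the output document
prDoc('part1.pdf');        # pull in every page of the first FOP-generated PDF
prDoc('part2.pdf');        # append every page of the second one
prEnd();                   # write and close the output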
If it was a consistent print driver issue, I'd expect to see both pages corrupted. If it was a FOP issue, likewise. If it was a PDF::Reuse issue, I'd expect to see more fundamental breakage, and this breakage is not new since we started concatenating documents. I'm at a loss where to investigate next.
Has anyone seen similar corruption in PDFs, especially when generating using Apache FOP?
tl;dr PDFs created using FOP sometimes print with every character shifted by 1, e.g. A->B, 3->4
After a lot of experiments, I still can't get the following script working. I need some guidance on how to diagnose this particular Perl problem. Thanks in advance.
This script is for testing the use of Office 2007 OCR API:
use warnings;
use strict;
use Win32::OLE;
use Win32::OLE::Const;
Win32::OLE::Const->Load("Microsoft Office Document Imaging 12.0 Type Library")
    or die "Cannot use the Office 2007 OCR API";
my $miDoc = Win32::OLE->new('MODI.Document')
or die "Cannot create a MODI object";
#Loads an existing TIFF file
$miDoc->Create('OCR-test.tif');
#Performs OCR with the OCR language set to English
$miDoc->OCR(LangId => 'miLANG_ENGLISH');
#Get the OCR result
my $OCRresult = $miDoc->{Images}->Item(0)->{Layout}{Text};
print $OCRresult;
I did a small test: I loaded an .MDI file that already contained OCR information, deleted the OCR method line, ran the script, and got the expected text output from "print $OCRresult". Otherwise, Perl throws me the error:
Use of uninitialized value $OCRresult in print at E:\OCR-test.pl line 15
I'm suspecting that something's wrong with the line
$miDoc->OCR(LangId => 'miLANG_ENGLISH');
I tried leaving the parens empty and using three parameters, like 'miLANG_ENGLISH',1,1, etc., but without any luck.
I also tried using Microsoft Office Document Imaging to test whether the TIF I'm experimenting with was text-recognizable, and the result was positive.
So what other diagnostic methods do I have?
Or can someone who happens to have Office 2007 test my code with a whatever jpg,bmp or tif pictures that have text content and see if something's wrong?
Thanks in advance.
UPDATE
Haha, I've finally figured out where the problem is and how to solve it. #hobbs, thank you for leaving the comment :) Things are interesting. When I was trying to respond to your comment, I added a link to the Office Document Imaging 2003 VBA Language Reference and took yet another look at the material there. The following information caught my eye:
LangId can be one of the following MiLANGUAGES constants.
miLANG_CHINESE_SIMPLIFIED (2052, &H804)
I changed the following OCR method line:
$miDoc->OCR('miLANG_ENGLISH',1,1);
to this:
$miDoc->OCR(2052,1,1);
A few notes:
1. I'm running ActivePerl 5.10.0 on Windows XP (Chinese version)
2. Before this, I had already tried $miDoc->OCR(9) but without luck
And suddenly, kind of magically, that pesky error saying "Use of uninitialized value $OCRresult in print at E:\OCR-test.pl line 15" disappeared completely and the OCRed text appeared on the screen. The OCR result was not satisfying, but that's because the parameter 2052 means Chinese while the TIF image contains only English. So I changed the parameter to
$miDoc->OCR(9,1,1), but this time without luck. Windows threw me this error:
unknown software exception (0x0000000d)
I changed the TIF image to one that contains all Chinese characters, changed the parameter back to "$miDoc->OCR(2052,1,1);", and this time everything worked exactly as expected. The OCR result was satisfying.
Now I think there's something weird about my Office 2007 OCR API; someone who happens to run Windows XP (English version) with Office 2007 installed would probably not encounter that exception error with the parameter
$miDoc->OCR(9,1,1);
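A side note in case it helps anyone else: instead of hard-coding magic numbers like 2052, the MiLANGUAGES constants can be imported from the type library so the intent stays readable. A sketch, assuming the same type library name as in my script above (I haven't verified it on other setups):

use strict;
use warnings;
use Win32::OLE;
use Win32::OLE::Const 'Microsoft Office Document Imaging 12.0 Type Library';

my $miDoc = Win32::OLE->new('MODI.Document')
    or die "Cannot create a MODI object";

$miDoc->Create('OCR-test.tif');

# The constants are imported as barewords, e.g. miLANG_CHINESE_SIMPLIFIED == 2052
$miDoc->OCR(miLANG_CHINESE_SIMPLIFIED, 1, 1);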
Anyway, I'm really happy that I've finally got things working :D
For starters I would try dumping the value of $miDoc->{Images} -- does it exist? If it exists and it's a collection does it contain anything? If it contains anything, what is it? An error? Or maybe just a different structure than you're expecting? warn, Dumper, and a little exploration can go a long way.
Incidentally, if you want to do the "modern" thing and don't mind grabbing a nifty tool off of CPAN, try Devel::Dwarn -- it makes dumping to stderr even more fun than it was already :)
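Something along these lines (using the $miDoc object from your script) would show whether the collection is empty, an error, or just shaped differently than you expect:

use Data::Dumper;

my $images = $miDoc->{Images};
warn "Images collection: ", Dumper($images);
warn "Item count: ", $images ? $images->{Count} : 'undef', "\n";
warn "Last OLE error: ", Win32::OLE->LastError(), "\n";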
I'm using PDF::FromHTML to generate a PDF from HTML (as the name of the module would imply) :)
I'm able to run it from the command line fine and receive my expected output - however when I use the exact same code in my web app, the output doesn't look correct at all - text is appearing in the wrong places and not all of it is appearing.
I am using the exact same input file in the web app and on the command line - for some reason, when it's called from inside my web app, the output comes out differently.
Here is the code:
use PDF::FromHTML;
my $filename = '/tmp/backup.html';
my $font = 'Helvetica';
my $encoding = 'utf-8';
my $size = 12;
my $landscape = 0;
my $pdf = PDF::FromHTML->new(
encoding => $encoding,
);
my $input_file = $filename;
my $output_file = "$input_file.pdf";
warn "$input_file\n$output_file\n";
$pdf->load_file($input_file);
$pdf->convert(
Font => $font,
LineHeight => $size,
Landscape => $landscape,
);
$pdf->write_file($output_file);
The web app code is the same, just with that block thrown into a method.
I have looked at the two generated PDF files in a hex editor and found the differences. They're the same until a block whose intent I can't understand...
Good PDF contents at that block:
/Length 302 >> stream
**binary data
endstream endobj
10 0 obj << /Filter [ /FlateDecode ] /Length 966
Bad PDF contents:
/Length 306 >> stream
**binary data
endstream endobj
10 0 obj << /Filter [ /FlateDecode ] /Length 559
As you can see, the length of the content differs, as does the binary data contained in that stream (the length 302 vs. length 306 one), as well as in the next stream (the length 966 vs. 559 one).
I'm not entirely sure what could be causing this discrepancy, the only thing I can think of is some sort of difference in the environments when I'm running this as my user on the command line versus running it from the web app. I don't know where I should start with debugging that, however.
In general, the CGI environment is different than your interactive login environment just like someone else's login environment is different than yours. The trick is to figure out what thing you have set or unset on your command line that makes your program work.
You might want to see my Troubleshooting Perl CGI scripts for a step-by-step method to track down these problems.
Some things to investigate:
Is your CGI script running on the same platform (i.e. is it a Windows versus Unix sorta thing)?
What's different about the environment variables?
Does your CGI script use the same version of Perl?
Does that perl binary have different compilation options?
Are you using the same versions of the modules?
If some of those modules use external libraries, are they the same?
A useful technique is to make your login shell temporarily have the same setup as your CGI environment. Once you do that, you should get the same results on the command line even if those results are wrong. However, once you get the wrong results you can start tracking it down from the command line.
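A small sketch of that: drop something like this next to the failing script (the file name is made up) and diff its output against "env | sort" from your login shell.

#!/usr/bin/perl
# cgi-env.cgi -- print everything about the environment the web server gives us
use strict;
use warnings;

print "Content-type: text/plain\n\n";
print "Perl binary: $^X\n";
print "Perl version: $]\n";
print "$_=$ENV{$_}\n" for sort keys %ENV;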
Good luck.
Couple of suggestions:
PDF::FromHTML uses PDF::Writer, which in turn uses a PDF rendering library as a plugin (I think the options are PDFLib and some others). Is the same version of that library available as a plugin in both environments?
Does your HTML input file have a CSS file that you haven't uploaded?
Try setting the other PDF::FromHTML variables: PageWidth, PageResolution, PageSize etc
Is the ordering of the output text different, or merely the positions? If it's position, then try setting PageWidth etc., as the library being used (PDFLib or whatever) may pick different defaults between the two environments. If the ordering is wrong, then I have no idea.
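Setting the page geometry explicitly would look roughly like this; the extra parameter names are the ones mentioned above and the values are placeholders, so check the docs of the PDF::FromHTML version you actually have installed:

$pdf->convert(
    Font           => $font,
    LineHeight     => $size,
    Landscape      => $landscape,
    PageWidth      => 612,    # pin these down so both environments agree
    PageResolution => 540,
);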
The two PDF blocks you posted don't really show much - just shows that the compressed sections are of different sizes. There's nothing actually wrong syntactically with either example.
Maybe there is some encoding problem? Have a look at the headers.
I would take a good look at what user the web server is running as and what that user's environment variables look like. Also pay attention to that user's permissions on the directories. And are there other things limiting the web server user, such as SELinux on a Linux box?