Why would PDF::FromHTML behave differently when called from my web app?

I'm using PDF::FromHTML to generate a PDF from HTML (as the name of the module would imply). :)
I'm able to run it from the command line and get my expected output; however, when I use the exact same code in my web app, the output doesn't look correct at all: text appears in the wrong places, and not all of it appears.
I am using the exact same input file in the web app and on the command line, but for some reason the output differs when it's generated from inside my web app.
Here is the code:
use PDF::FromHTML;

my $filename  = '/tmp/backup.html';
my $font      = 'Helvetica';
my $encoding  = 'utf-8';
my $size      = 12;
my $landscape = 0;

my $pdf = PDF::FromHTML->new(
    encoding => $encoding,
);

my $input_file  = $filename;
my $output_file = "$input_file.pdf";
warn "$input_file\n$output_file\n";

$pdf->load_file($input_file);
$pdf->convert(
    Font       => $font,
    LineHeight => $size,
    Landscape  => $landscape,
);
$pdf->write_file($output_file);
The web app code is the same, just with that block thrown into a method.
I have looked at the two generated PDF files in a hex editor and found the differences. They're identical up to a block whose purpose I can't make out...
Good PDF contents at that block:
/Length 302 >> stream
**binary data
endstream endobj
10 0 obj << /Filter [ /FlateDecode ] /Length 966
Bad PDF contents:
/Length 306 >> stream
**binary data
endstream endobj
10 0 obj << /Filter [ /FlateDecode ] /Length 559
As you can see, the length of the content differs, as does the binary data inside that stream (the Length 302 vs. Length 306 one) and the next stream (the Length 966 vs. 559 one).
I'm not entirely sure what could be causing this discrepancy; the only thing I can think of is some difference in the environments between running this as my user on the command line and running it from the web app. I don't know where I should start with debugging that, however.
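If it helps to see what actually changed, you can inflate a FlateDecode stream and diff the decompressed contents rather than the raw compressed bytes. A minimal sketch, assuming you've extracted the bytes between stream and endstream into a file (stream.bin is a made-up name):

use strict;
use warnings;
use Compress::Zlib;    # exports uncompress() for zlib/FlateDecode data

# Read the raw stream bytes (pulled out with the hex editor) and inflate them.
open my $fh, '<:raw', 'stream.bin' or die "Can't open stream.bin: $!";
my $deflated = do { local $/; <$fh> };
close $fh;

print uncompress($deflated);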

In general, the CGI environment is different from your interactive login environment, just as someone else's login environment is different from yours. The trick is to figure out what you have set or unset on your command line that makes your program work.
You might want to see my Troubleshooting Perl CGI scripts for a step-by-step method to track down these problems.
Some things to investigate:
Is your CGI script running on the same platform (i.e. is it a Windows versus Unix sort of thing)?
What's different about the environment variables? (A sketch for dumping them follows below.)
Does your CGI script use the same version of Perl?
Does that perl binary have different compilation options?
Are you using the same versions of the modules?
If some of those modules use external libraries, are they the same?
A useful technique is to make your login shell temporarily match the CGI environment. Once you do, you should get the same results on the command line, even if those results are wrong, and from there you can start tracking the problem down.
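A minimal sketch of that comparison: dump %ENV from the CGI script, dump the environment from your login shell, and diff the two.

#!/usr/bin/perl
# Minimal sketch: run this once through the web server, then compare
# its output against `env | sort` from your login shell.
use strict;
use warnings;

print "Content-type: text/plain\n\n";
print "$_=$ENV{$_}\n" for sort keys %ENV;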
Good luck.

Couple of suggestions:
PDF::FromHTML uses PDF::Writer, which in turn uses a PDF rendering library as a plugin (I think the options are PDFLib and some others). Are the same versions of those libraries available as plugins in both environments?
Does your HTML input file reference a CSS file that you haven't uploaded?
Try setting the other PDF::FromHTML variables: PageWidth, PageResolution, PageSize, etc. (see the sketch below).
Is the ordering of the output text different, or merely the positions? If it's position, then try setting PageWidth etc., as the library being used (PDFLib or whatever) may pick different defaults in the two environments. If the ordering is wrong, then I have no idea.
The two PDF blocks you posted don't really show much, just that the compressed sections are of different sizes. There's nothing actually wrong syntactically with either example.
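As a concrete starting point for the third suggestion, a sketch of pinning the geometry in the convert() call. PageWidth and PageResolution are the option names mentioned above, and the values are placeholders; verify both against your installed version's documentation:

$pdf->convert(
    Font           => $font,
    LineHeight     => $size,
    Landscape      => $landscape,
    # Pin the page geometry explicitly so the two environments can't
    # pick different defaults. The values here are placeholders.
    PageWidth      => 640,
    PageResolution => 540,
);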

Maybe there is some encoding problem? Have a look at the headers.

I would take a good look at what user the web server is running as and what that user's environment variables look like. Also pay attention to that user's permissions on the directories involved. And is anything else limiting the web server's user, such as SELinux on a Linux box?
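A quick way to check the first point, as a sketch: have the script itself report which user it runs as.

use strict;
use warnings;

# Report the effective UID and the matching username, if any.
my $uid  = $>;
my $name = getpwuid($uid) || '(unknown)';
print "Content-type: text/plain\n\n";
print "running as: $name (uid $uid)\n";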

Related

How to locate code causing corrupt binary output in Perl

I have a relatively complex Perl program that manages various pages and resources for my sites. Somewhere along the line I messed up something in a library of several thousand lines that provides essential services to most of the different scripts in the system, so that the scripts in my codebase that output PDF or PNG files can no longer output those files validly. If I rewrite the output scripts to avoid using that library, they work, but I'd like to figure out what I broke within the library that is hurting binary output.
For example, in one snippet of code, I open a PDF file (or any sort of file -- it detects the mime type automatically) and then print it directly:
# Figure out the MIME type.
use File::MimeInfo::Magic;
my $mimeType = mimetype($filename);

# Read the file into memory, then print it with the right header.
my $fileData;
open(my $resource, '<', $filename) or die "Can't open $filename: $!";
while (my $line = <$resource>) { $fileData .= $line; }
close($resource);

print "Content-type: $mimeType\n\n";
print $fileData;
exit;
This worked great, but at some point while editing the suspect library I mentioned, I did something that broke it and I'm stumped as to what I did. I had been playing with different utf8 encoding functions, but as far as I can tell, I removed all of my experimental code and the problem remains. Merely loading that library, without calling any of its functions, breaks the script's ability to output the file.
The output is actually being corrupted visibly, if I open it in a text editor. If I compare the source file opened by the code above against the output, the two have many differences, despite there being no processing between reading and printing (the original post linked to a sample PDF that had been run through the broken code).
I've tried retracing my steps for days and cannot find what is wrong in the problematic library. I hadn't used this function in a while and I've written a lot of new code since I last tested it, so it is hard to know precisely where the problem is. My hope is that someone can look at the corrupted output in comparison to the source file and at least point me in the direction of what to look for that could cause such a result. I feel like I'm looking for a needle in a haystack.
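Given the utf8 experiments mentioned above, one plausible culprit worth ruling out (a sketch, not a diagnosis): the suspect library may be leaving an encoding layer on STDOUT, which corrupts binary output in exactly this visible way.

# Show the I/O layers currently applied to STDOUT; a stray ':utf8'
# or ':encoding(...)' pushed by the library would mangle binary data.
warn join(' ', PerlIO::get_layers(*STDOUT)), "\n";

# Quick test: force the handle back to raw bytes before printing the
# file. If the output comes out clean, the library changed the layers.
binmode STDOUT;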

Perl's Mechanize and save_content()

I'm trying to go paperless with all my utility bills, and that means downloading the statements from Suddenlink instead of stuffing the paper ones into a filing cabinet.
I've used WWW::Mechanize before and I've liked it (why did I try to do this stuff in LWP for so long?), so I went ahead and got a workable script ready. I can log in, navigate to the page that lists the PDF links, and loop through those.
I do the following:
my $pdf = $mech->clone();
for my $link ($mech->find_all_links(url_regex => qr/viewstatement\.html/)) {
    # [removed for brevity]
    unless (-f "Suddenlink/$year/$date.pdf") {
        $pdf->get($link->url);
        $pdf->save_content("Suddenlink/$year/$date.pdf", binary => 1);
    }
}
When I compare one of these files with the same statement downloaded via Chrome, it's apparent what the problem is. Both files are identical up to about 8-24 kbytes (it varies), but the Chrome PDF will be complete, and the Perl-script PDF will be truncated.
It's late, and there's nothing obviously wrong with the code. Google is turning up a few problems with save_content(), but not anything like what I'm getting.
What am I doing wrong?
...[S]et $mech->agent_alias() to something. [Suddenlink is] doing a connection reset whenever they see a weird user agent string.
– John O
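Following that comment, a one-line sketch (agent_alias is part of WWW::Mechanize, and 'Windows Mozilla' is one of its built-in alias names):

# Present a browser-like User-Agent before cloning, so the server
# stops resetting the connection on an unfamiliar agent string.
$mech->agent_alias('Windows Mozilla');
my $pdf = $mech->clone();    # the clone inherits the agent string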

Screen scraping: Automating a vim script

In vim, I loaded a series of web pages (one at a time) into a vim buffer (using the vim netrw plugin) and then parsed the html (using the vim elinks plugin). All good. I then wrote a series of vim scripts using regexes with a final result of a few thousand lines where each line was formatted correctly (csv) for uploading into a database.
In order to do that I had to use vim's marking functionality so that I could loop over specific points of the document and reassemble it back together into one csv line. Now, I am considering automating this by using Perl's "Mechanize" library of classes (UserAgent, etc).
Questions:
Can vim's ability to "mark" sections of a document (in order to perform substitutions on them) be accomplished in Perl?
It was suggested to use "elinks" directly, which I take to mean loading the page into a headless browser using elinks and running Perl scripts on the content from there(?)
If that's correct, would there be a deployment problem with elinks when I migrate the site from my localhost LAMP stack to a hosting company like Bluehost?
Thanks
Edit 1:
TRYING TO MIGRATE KNOWLEDGE FROM VIM TO PERL:
If @flesk (below) is right, then how would I go about performing this routine (written in vim) that "marks" lines in a text file ("i" and "j") and then uses that as a range ('i,'j) to perform the last two substitutions?
:g/^\s*\h/d|let@"=substitute(@"[:-2],'\s\+and\s\+',',','')|ki|/\n\s*\h\|\%$/kj|
\ 'i,'js/^\s*\(\d\+\)\s\+-\s\+The/\=@".','.submatch(1).','/|'i,'js/\s\+//g
I am not seeing this capability in the perldoc perlre manual. Am I missing either a module or some basic Perl understanding of m// or qr//?
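For the marking question specifically, the nearest Perl analog to a 'i,'j range is the flip-flop (range) operator, which is true from one matching line through the next. A sketch with made-up BEGIN/END patterns standing in for whatever identifies the marked lines:

use strict;
use warnings;

while (my $line = <>) {
    # True from a line matching the first pattern through a line
    # matching the second, much like vim's 'i,'j range.
    if ($line =~ /^BEGIN/ .. $line =~ /^END/) {
        $line =~ s/\s+//g;    # e.g. the s/\s\+//g from the vim script
        print "$line\n";
    }
}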
I'm sure all you need is some kind of HTML parser. For example I'm using HTML::TreeBuilder::XPath.
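A minimal sketch of that approach, assuming the data sits in HTML table rows (the XPath expressions and CSV shape are placeholders):

use strict;
use warnings;
use HTML::TreeBuilder::XPath;

my $html = do { local $/; <> };    # saved page source (stand-in for a real fetch)
my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Emit one CSV line per table row.
for my $row ($tree->findnodes('//table//tr')) {
    my @cells = map { $_->as_trimmed_text } $row->findnodes('./td');
    print join(',', @cells), "\n";
}
$tree->delete;    # free the parse tree when done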

Open a local web page from Perl

I'm writing a Perl script that creates HTML output and I would like to have it open in the user's preferred browser. Is there a good way to do this? I can't see a way of using ShellExecute since I don't have an http: address for it.
Assuming you saved your output to "../data/index.html",
$ret = system( 'start ..\data\index.html' );
should open the file in the default browser.
Added:
Advice here:
my $filename = "/xyzzy.html"; #whatever
system("start file://$filename");
If I understand what you're trying to do, this will not work. You would have to set up a web server, like Apache, and configure it to execute your script. This wouldn't be a trivial task if you've never done it before.
Since this is Windows, the easy option is to dump the data to a temporary file using File::Temp, making sure it has a .htm or .html extension and that it isn't cleaned up immediately on script exit, so that the file remains; i.e., you probably want something like File::Temp->new(UNLINK => 0, SUFFIX => '.htm'). Then you ought to be able to use Win32::FileOp's ShellExecute to open the file normally. This makes all sorts of assumptions about file types being associated with file extensions, but then, that's how Windows tends to work.
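A sketch of that suggestion (assumes Windows, a generated $html string, and that .htm files are associated with the default browser):

use strict;
use warnings;
use File::Temp ();

my $html = "<html><body><h1>report</h1></body></html>";    # stand-in for your generated output

# UNLINK => 0 keeps the file after the script exits so the browser
# can still read it; the .htm suffix drives the file association.
my $tmp = File::Temp->new(UNLINK => 0, SUFFIX => '.htm');
print {$tmp} $html;
close $tmp;

# 'start' is a cmd.exe builtin; the empty "" fills the window-title slot.
system(qq{start "" "} . $tmp->filename . qq{"});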

How do non-executable exploits work?

Hello, the question is: how do non-executable exploits work? When I say non-executable, I mean those that don't have the .exe file extension, like Word exploits (.doc) or others. How do they perform executable actions if they are not compiled?
That varies from exploit to exploit.
While .doc isn't an executable format, it does contain interpreted VBA code, which is generally where the malicious content was hidden. When you opened the document, an onOpen event or some such would fire and execute the malicious payload. Hence why most Office installations have macros disabled by default these days; far too much scope for abuse.
There are also plenty of things that will run on your system without being a .exe, for example .com, .vbs, and .hta files.
Then there are formats which have no normal executable content but can be attacked in other ways, usually by taking advantage of poorly written routines that load the files, which can allow things like buffer overflows.
The other way is to exploit bugs in the code that handles those files. Often this will be a 'buffer overflow'. Perhaps the code is expecting a header of 100 bytes, but the malicious file has 120 bytes. That causes the program to overwrite other data in its memory, and if you can smash the 'stack' with your extra bytes, it's possible to redirect the processor to 'payload' code embedded in your file.
Google "buffer overflow exploit" for more.