Image corruption using LWP::UserAgent - perl

The following script runs in a loop, retrieving images using LWP::UserAgent, and resizing them using Image::Magick.
I am getting this error from Image::Magick when reading the downloaded image:
Exception 450: Unsupported marker type 0x54
If I copy the LWP-downloaded image to my computer, open it in a photo editor, save it as a .jpg file, upload it, and attempt to read it with Image::Magick, then all is fine. This leads me to believe that the image is not being saved correctly.
I need to use LWP::UserAgent because the server I am connecting to won't allow the download unless it thinks a client is requesting the data.
use LWP::UserAgent;
use Image::Magick;
my $ua = LWP::UserAgent->new;
$ua->agent("Mozilla/8.0");   # pretend to be a browser so the server allows the download
my $PICURL ="http://www.example.com/img.aspx?pid=cjfsaf79afffafhfah777af7";
my $PICDEST ="/var/vhosts/mysite.com/httpdocs/images";
my $PICNAME ="01.jpg";
my $response = $ua->get($PICURL);
open(my $outfile, '>:raw', "$PICDEST/$PICNAME")
    or die "Can't write $PICDEST/$PICNAME: $!";
if ($response->is_success) {
    print $outfile $response->content;
    $Pi++;
    $PTOT++;
}
else {
    die $response->status_line;
}
close($outfile);   # flush the file before Image::Magick reads it
my $image = Image::Magick->new;
my $err = $image->Read("$PICDEST/$PICNAME");
warn $err if $err;   # Image::Magick methods return an error string on failure
$image->Scale(width => 800, height => 600);
$image->Write("$PICDEST/$PICNAME");
$image->Scale(width => 216, height => 163);
$image->Set(quality => 90);
$image->Write("$PICDEST/TH_$PICNAME");

Never use
$response->content()
here. You want
$response->decoded_content( charset => 'none' )
which undoes any Content-Encoding (gzip, deflate) the server applied while leaving the raw bytes alone.
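A minimal corrected version of the download step, keeping the $PICURL, $PICDEST, and $PICNAME variables from the question:
my $response = $ua->get($PICURL);
die $response->status_line unless $response->is_success;
open(my $outfile, '>:raw', "$PICDEST/$PICNAME")
    or die "Can't write $PICDEST/$PICNAME: $!";
print $outfile $response->decoded_content( charset => 'none' );
close($outfile);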

You are probably getting a compressed or otherwise encoded result; try ->decoded_content instead of ->content.
From the HTTP::Response doc:
$r->content( $bytes )
This is used to get/set the raw content and it is inherited from the HTTP::Message base class. See HTTP::Message for details and other methods that can be used to access the content.
$r->decoded_content( %options )
This will return the content after any Content-Encoding and charsets have been decoded. See HTTP::Message for details.

I know this is very old at this point, but I just ran into this as well. I was saving the image to disk before working with it, and I needed to put the file handle I was writing to into binmode:
open my $fh, '>', $path or die "Can't open $path: $!";   # $path stands in for your own file name
binmode($fh);   # switch the handle to raw bytes
print $fh $image_data;                                   # $image_data: the downloaded bytes
close($fh);

Session data not being updated by script using that session

I think I have narrowed down the issue, so here is a better question.
My script, 'index', opens an existing session (a session does exist, created earlier by a login script).
It correctly uses values from that session in the page output, so evidently it is accessing the session from either memory or the server's sessions_storage dir.
The script is written to add two values to the session, but that's not actually happening, and this is where it gets annoyingly frustrating.
After running the script, I check the session file in FileZilla: those two values do not exist. However, if I dump the session at the bottom of my script, the two values do show in that output.
If I delete the session from my browser and then reload the page, those two values and a few others show up in the new session file, but, of course, the values stored by previous scripts (e.g. login) are missing.
This is what I have worked out thus far:
All other scripts (from login through to this 'index') are creating, storing, and retrieving to/from the session without issue.
The 'index' script is not adding to the existing session file.
A new session, forced by deleting the session cookie from the browser, shows the data being stored as expected in the correct server dir.
Calling flush() at the end of my script (or anywhere after session creation/loading) has made no difference.
Can any of you with fresh eyes tell me what's (not) going on?
my $sessions_dir_location = '/' . $var . '/' . $www . '/' . $vhosts . '/' . $domain . '/name_of_sessions_storage_dir/';
my $session = new CGI::Session(undef, $cgi, {Directory=>"$sessions_dir_location"}) or die CGI::Session->errstr;
my $session_id = $session->id();
$session->flush();
my %vars = $cgi->Vars;
my $business_id = $vars{'business_id'};
print qq(<pre>bid=$business_id</pre>); #successful
$session->param('business_id', $business_id); #unsuccessful
print qq(<pre>session_id = $session_id); #successful
print $session->dump; # shows the business_id value as being stored.
print qq(</pre>);
The following works for me; it increases the session parameter business_id by one on each call. Note that flush() is called after the parameters have been set:
use strict;
use warnings;
use CGI;
use CGI::Session;
my $cgi = CGI->new();
my $sessions_dir_location = "/tmp/sessions";
# Data Source Name, defaults to "driver:file;serializer:default;id:md5"
my $dsn = undef;
# new() : returns new session object, or undef on failure. Error message is
# accessible through errstr() - class method.
my $session = CGI::Session->new(
$dsn, $cgi, {Directory=>"$sessions_dir_location"}) or die CGI::Session->errstr();
my %vars = $cgi->Vars;
my $cgi_bsid = $vars{business_id};
my $session_bsid = $session->param("business_id");
my $new_bsid = $cgi_bsid // $session_bsid // 0;
$new_bsid++;
$session->param('business_id', $new_bsid);
$session->flush() or die CGI::Session->errstr();
# CGI::Session will use this cookie to identify the user at his/her next request
# and will be able to load his/her previously stored session data.
print $session->header();
my $session_id = $session->id();
print $cgi->start_html();
print join "<br>",
qq(session_id=$session_id),
qq(cgi_bsid=$cgi_bsid),
qq(session_bsid=$session_bsid),
qq(new_id=$new_bsid);
print $cgi->end_html();

Can't figure out why Sereal encoder/decoder round-trip is not returning proper object

With all the hating on Storable, I decided to check out Sereal for my serialization needs. Plus, I was having some 32-bit/64-bit cross-platform issues with Storable, so I figured this would be a good time.
After having some issues, I boiled the problem down to the following code. (I'm persisting an HTTP::Request object, hence the example code.)
This is my encode test; I'm storing to a file:
use Sereal::Encoder;
use HTTP::Request;
use HTTP::Headers;
use URI;
my $encoder = Sereal::Encoder->new();
open(my $fh, ">", 'myfile.data') or die $!;
binmode($fh);
my $uri = URI->new('http://www.example.com');
my $headers = HTTP::Headers->new(
Content_Type => 'application/json',
);
my $http_request = HTTP::Request->new(POST => $uri, $headers, 'bleh');
print $fh $encoder->encode( $http_request );
close($fh);
And on the same machine (same perl, 5.18), I run the following:
use Sereal::Decoder;
use File::Slurp qw(read_file);
use URI;
my $data = read_file('myfile.data') or die $!;
my $dec = Sereal::Decoder->new();
my $decoded = $dec->decode($data);
print $decoded->{_uri}->scheme,"\n";
And the output of running the encoding program, and then the decoding program is:
Can't locate object method "scheme" via package "URI::http" at testd.pl line 8.
Anyhow, it was really nagging me as to what the problem was. I ended up reverting to Storable, using nfreeze to solve my arch issues, but I was wondering why my attempt to transition to Sereal crashed and burned.
Thanks!
Sereal, unlike Storable, won't automatically load a module when it encounters a serialized object. The automatic loading is considered a security issue in Storable, so Sereal is working as intended.[1]
At the point where scheme is called in the second test program, URI::http hasn't been loaded yet, so the method call results in an error. It seems that URI will load its subclasses when its constructor is used on a string that "looks like" one of them, e.g.
URI->new('http://www.stackoverflow.com');
loads the URI::http module. So one solution would be to add a dummy invocation of that constructor to ensure URI::http is loaded, or to use URI::http manually instead. Either option makes the print $decoded->{_uri}->scheme line of the second script work as expected, but I think the second is the lesser of two evils (importing an undocumented submodule of URI, versus an arbitrary method call done purely for a side effect that is not immediately obvious).
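For example, with the manual-use approach, the decoding script from the question works with one extra line:
use Sereal::Decoder;
use File::Slurp qw(read_file);
use URI::http;   # make sure the class of the serialized object is loaded
my $data = read_file('myfile.data') or die $!;
my $dec = Sereal::Decoder->new();
my $decoded = $dec->decode($data);
print $decoded->{_uri}->scheme, "\n";   # now prints "http"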

Including an embedded image in an Outlook HTML email via Perl

I need to generate an HTML email with a banner image embedded, and it must go through an Outlook 2007 mail client. I tried to base64-encode the image and put it inline (it looked good), but Outlook would not send the email. I have culled through many different articles (in various programming languages) that have gotten me to this point, but it is still not working. This code creates the email and attaches the image, but the image is not displayed.
use Win32::OLE;
use Win32::OLE::Const 'Microsoft Outlook';
my $oMailer = new Win32::OLE('Outlook.Application') or
die "Unable to start an Outlook instance: $!\n";
my $oEmail = $oMailer->CreateItem(0) or
die "Unable to create mail item: $!\n";
$oEmail->{'To'} = 'me@here.org';
$oEmail->{'Subject'} = "Embedded image test";
$oEmail->{'BodyFormat'} = olFormatHTML;
$oEmail->{'HTMLBody'} = "<html><body><img src=\"cid:banner.jpg\"></body></html>";
my $attachments = $oEmail->Attachments();
my $bannerAttachment = $attachments->Add('C:/test/banner.jpg', olEmbeddeditem);
$bannerAttachment->PropertyAccessor->SetProperty(
"http://schemas.microsoft.com/mapi/proptag/0x3712001E", "banner.jpg");
$oEmail->save();
(BTW, I removed all the Win32::OLE->LastError() checks before posting because none of them failed anyway.)
When adding the attachment, the attachment Type is not set to olEmbeddeditem (5); I don't know if this is relevant to the problem.
The SetProperty call does not set the value either. It is supposed to set the Content-ID (cid) that is referenced by the img src in the HTML. I used the code below to GetProperty, and it returns an empty string.
my $CIDvalue = $bannerAttachment->PropertyAccessor->GetProperty(
"http://schemas.microsoft.com/mapi/proptag/0x3712001E");
print ">>>CIDvalue = $CIDvalue\n";
So close I can taste it!
Careful reading of the Perl docs for Win32::OLE revealed a SetProperty method that was apparently being called instead of the M$ one I thought I was calling. Changing the code to:
$bannerAttachment->PropertyAccessor->Invoke(
    'SetProperty',
    "http://schemas.microsoft.com/mapi/proptag/0x3712001E",
    "banner.jpg");
made it work and there was great rejoicing :)
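As a sanity check, the GetProperty snippet from the question should now return the value that was just set:
my $CIDvalue = $bannerAttachment->PropertyAccessor->GetProperty(
    "http://schemas.microsoft.com/mapi/proptag/0x3712001E");
print ">>>CIDvalue = $CIDvalue\n";   # should now print banner.jpg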

improving LWP::Simple perl performance

Alas, I have yet another question:
I have been tasked with reading a webpage and extracting the links from that page (easy stuff with HTML::TokeParser). He (my boss) then insists that I read from these links and grab some details from each of those pages, and parse ALL of that information into an XML file, which can later be read.
So, I can set this up fairly simply like so:
#!/usr/bin/perl -w
use strict;
use LWP::Simple;
require HTML::TokeParser;
$| = 1;  # unbuffer output
my $base = 'http://www.something_interesting/';
my $path = 'http://www.something_interesting/Default.aspx';
my $rawHTML = get($path);  # attempt to d/l the page to mem
my $p = HTML::TokeParser->new(\$rawHTML) || die "Can't parse: $!";
open(my $out, '>', 'output.xml') or die "Can't open output.xml: $!";
while (my $token = $p->get_tag("a")) {
    my $url = $token->[1]{href} || "-";
    if ($url =~ /event\.aspx\?eventid=(\d+)/) {
        ( my $event_id = $url ) =~ s/event\.aspx\?eventid=(\d+)/$1/;
        my $text = $p->get_trimmed_text("/a");
        print $out $event_id, "\n";
        print $out $text, "\n";
        my $details = $base . $url;
        my $contents = get($details);
        # now set up another HTML::TokeParser and parse each of those pages
    }
}
This would probably be OK if there were maybe 5 links on this page. However, I'm trying to read from ~600 links and grab info from each of those pages. So, needless to say, my method is taking a LONG time... I honestly don't know how long, since I've never let it finish.
It was my idea to simply write something that only gets the information as needed (e.g. a Java app that looks up the information from the link that you want)... however, this doesn't seem to be acceptable, so I'm turning to you guys :)
Is there any way to improve on this process?
You will probably see a speed boost -- at the expense of less simple code -- if you execute your get()s in parallel instead of sequentially.
Parallel::ForkManager is where I would start (and even includes an LWP::Simple get() example in its documentation), but there are plenty of other alternatives to be found on CPAN, including the fairly dated LWP::Parallel::UserAgent.
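A minimal sketch of that approach, assuming the question's LWP::Simple get() and an @urls array holding the ~600 detail-page URLs (the worker count is arbitrary):
use strict;
use warnings;
use Parallel::ForkManager;
use LWP::Simple qw(get);
my @urls = @ARGV;                         # stand-in for the extracted links
my $pm = Parallel::ForkManager->new(10);  # at most 10 children at once
for my $url (@urls) {
    $pm->start and next;   # fork; the parent moves on to the next URL
    my $content = get($url);
    warn "failed to fetch $url\n" unless defined $content;
    # parse $content here; a child cannot share memory with the parent,
    # so write results to a per-child file or pass them back via finish()
    $pm->finish;
}
$pm->wait_all_children;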
If you want to fetch more than one item from a server and do so speedily, use TCP Keep-Alive. Drop the simplistic LWP::Simple and use the regular LWP::UserAgent with the keep_alive option. That will set up a connection cache, so you will not incur the TCP connection build-up overhead when fetching more pages from the same host.
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request::Common;
my @urls = @ARGV or die 'URLs!';
my %opts = ( keep_alive => 10 ); # cache 10 connections
my $ua = LWP::UserAgent->new( %opts );
for (@urls) {
    my $req = HEAD $_;
    print $req->as_string;
    my $rsp = $ua->request( $req );
    print $rsp->as_string;
}
my $cache = $ua->conn_cache;
my @conns = $cache->get_connections;
# has methods of Net::HTTP, IO::Socket::INET, IO::Socket
WWW::Mechanize is a great piece of work to start with, and if you are looking at modules, I'd also suggest Web::Scraper.
Both are well documented on CPAN and should help you get going quickly.
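For instance, a short WWW::Mechanize sketch that collects the same event links as the question's TokeParser loop (the URL and pattern are taken from the question):
use strict;
use warnings;
use WWW::Mechanize;
my $mech = WWW::Mechanize->new();
$mech->get('http://www.something_interesting/Default.aspx');
# find_all_links() returns WWW::Mechanize::Link objects
my @links = $mech->find_all_links( url_regex => qr/event\.aspx\?eventid=\d+/ );
print $_->url_abs, "\n" for @links;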
There's a good chance it's blocking on the HTTP GET request while it waits for the response from the network. Use an asynchronous HTTP library and see if it helps.
use strict;
use warnings;
use threads;  # or: use forks;
use Thread::Queue qw( );
use constant MAX_WORKERS => 10;
my @urls = @ARGV;  # the pages to fetch
my $request_q  = Thread::Queue->new();
my $response_q = Thread::Queue->new();
# Create the workers.
my @workers;
for (1..MAX_WORKERS) {
    push @workers, async {
        while (my $url = $request_q->dequeue()) {
            $response_q->enqueue(process_request($url));
        }
    };
}
# Submit work to workers.
$request_q->enqueue(@urls);
# Signal the workers they are done.
for (1..@workers) {
    $request_q->enqueue(undef);
}
# Wait for the workers to finish.
$_->join() for @workers;
# Collect the results. end() makes dequeue() return undef once the queue is drained.
$response_q->end();
while (defined(my $item = $response_q->dequeue())) {
    process_response($item);
}
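The process_request and process_response routines are left undefined above; a hedged sketch of what they might look like, assuming LWP::Simple does the fetching (the hash layout is only an illustration):
use LWP::Simple qw(get);
sub process_request {
    my ($url) = @_;
    # runs inside a worker thread; Thread::Queue shared-clones the hashref on enqueue
    return { url => $url, content => get($url) };
}
sub process_response {
    my ($item) = @_;
    warn "no content for $item->{url}\n" unless defined $item->{content};
    # parse $item->{content} with HTML::TokeParser here
}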
Your issue is the scraping being more CPU-intensive than I/O-intensive. While most people here would suggest using more CPUs, I'll try to show a great advantage of Perl being used as a "glue" language.
Everyone agrees that libxml2 is an excellent XML/HTML parser, and libcurl is an awesome download agent.
However, in the Perl universe, many scrapers are based on LWP::UserAgent and HTML::TreeBuilder::XPath (which is similar to HTML::TokeParser while being XPath-compliant).
In those cases, you can use drop-in replacement modules to handle the downloads and HTML parsing via libcurl/libxml2:
use LWP::Protocol::Net::Curl;
use HTML::TreeBuilder::LibXML;
HTML::TreeBuilder::LibXML->replace_original();
I saw an average 5x speed increase just by prepending these 3 lines to several scrapers I used to maintain.
But since you're using HTML::TokeParser, I'd recommend trying Web::Scraper::LibXML instead (plus LWP::Protocol::Net::Curl, which affects both LWP::Simple and Web::Scraper).

How can I fix the "Couldn't create file parser context for file ..." bug with Perl libxml on Debian?

When I try to read an XML file with XML::Simple, sometimes I get this error message:
Couldn't create file parser context for file ...
After some googling, it seems to be a problem with libxml-libxml-perl that is supposed to be fixed in the version I use (1.59-2).
Any ideas?
Edit: (code)
sub Read
{
    my ($file, $no_option) = @_;
    my %XML_INPUT_OPTIONS = ( KeyAttr => [], ForceArray => 1 );
    if ((defined $file) && (-f $file))
    {
        my @stats = stat($file);
        if ((defined $XML_CACHE{$file})
            && ($stats[9] == $XML_CACHE{$file}{modif_time}))
        {
            return ($XML_CACHE{$file}{xml});
        }
        else
        {
            my $xml = eval { XMLin($file,
                (defined $no_option ? () : %XML_INPUT_OPTIONS)) };
            AAT::Syslog("AAT::XML", "XML_READ_ERROR", $@) if ($@);
            $XML_CACHE{$file}{modif_time} = $stats[9];
            $XML_CACHE{$file}{xml} = $xml;
            return ($xml);
        }
    }
    return (undef);
}
And yes, I should (and will) use the XML::Simple cache feature...
Does the error continue with "No such file or directory at ..."? If so, then I think the problem is that, for whatever reason, by the time you get to that point in the script, whatever you are passing to XML::Simple has no XML file in it. Long story short, the script may be passing a bad (blank? empty?) variable to XML::Simple, at which point the module chokes. To debug, add a check on whatever you hand to XML::Simple before you pass it along, as sketched below. (The paragraph after the sketch gives a concrete example explaining why I think this may be your problem.)
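A hedged sketch of that kind of guard, written against the Read() function from the question (the messages are illustrative):
# hypothetical pre-flight checks before handing $file to XMLin()
die "Read(): no file name given"     unless defined $file && length $file;
die "Read(): '$file' does not exist" unless -e $file;
die "Read(): '$file' is empty"       unless -s $file;
my $xml = eval { XMLin($file, %XML_INPUT_OPTIONS) };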
A few months ago, I had a similar problem with Weather::Google. In a nutshell, the weather module was trying to get data from Google via LWP::Simple without a user agent. Google began (apparently) to reject requests without a user agent. I had to backtrack through the modules because the error appeared to come from XML::Simple. In fact, it was caused by what was done in LWP::Simple and Weather::Google. Or rather, the error was a result of Weather::Google not checking the data that was in an object created via LWP::Simple. In a case like this, it can be hard at first to see what's going wrong and where.