Can't figure out why Sereal encoder/decoder round-trip is not returning proper object - perl

With all the hating on Storable, I decided to check out Sereal for my serialization needs. I was also having some 32-bit/64-bit cross-platform issues with Storable, so I figured this would be a good time.
After running into trouble, I boiled the problem down to the following code. (I'm persisting an HTTP::Request object, hence the example.)
This is my encode test; I'm storing to a file:
use Sereal::Encoder;
use HTTP::Request;
use HTTP::Headers;
use URI;
my $encoder = Sereal::Encoder->new();
open(my $fh, ">", 'myfile.data') or die $!;
binmode($fh);
my $uri = URI->new('http://www.example.com');
my $headers = HTTP::Headers->new(
    Content_Type => 'application/json',
);
my $http_request = HTTP::Request->new(POST => $uri, $headers, 'bleh');
print $fh $encoder->encode( $http_request );
close($fh);
And on the same machine (same perl, 5.18), I run the following:
use Sereal::Decoder;
use File::Slurp qw(read_file);
use URI;
my $data = read_file('myfile.data') or die $!;
my $dec = Sereal::Decoder->new();
my $decoded = $dec->decode($data);
print $decoded->{_uri}->scheme,"\n";
And the output of running the encoding program, and then the decoding program is:
Can't locate object method "scheme" via package "URI::http" at testd.pl line 8.
Anyhow, it was really nagging me what the problem was. I ended up reverting to Storable, using nfreeze to solve my architecture issues, but I'm still wondering why my attempt to transition to Sereal crashed and burned.
Thanks!

Sereal, unlike Storable, won't automatically load a module when it encounters a serialized object. This is a security issue with Storable, so Sereal is working as intended.[1]
At the point where scheme is called in the second test program, URI::http hasn't been loaded yet, so the method call results in an error. It seems that URI will load its subclasses when its constructor is used on a string that "looks like" one of them, e.g.
URI->new('http://www.stackoverflow.com');
loads the URI::http module. So one solution would be to add a dummy invocation of that constructor to ensure URI::http is loaded, or manually use URI::http instead. Either option causes the print $decoded->{_uri}->scheme line of the second script to work as expected, but I think the second is the lesser of two evils (importing an undocumented submodule from URI versus an arbitrary method call done specifically for its not-immediately-obvious side effect).
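For example, here's a minimal sketch of the decoding script with the class loaded explicitly (it assumes the myfile.data written by the encoder above):

use strict;
use warnings;
use Sereal::Decoder;
use File::Slurp qw(read_file);
use URI;
use URI::http;   # make sure the class of the serialized object is loaded

my $data    = read_file('myfile.data', binmode => ':raw') or die $!;
my $decoded = Sereal::Decoder->new->decode($data);
print $decoded->{_uri}->scheme, "\n";   # now prints "http"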

Related

Image corruption using LWP::UserAgent

The following script runs in a loop, retrieving images using LWP::UserAgent, and resizing them using Image::Magick.
I am getting this error from Image::Magick when reading the downloaded image:
Exception 450: Unsupported marker type 0x54
If I copy the LWP-downloaded image to my computer, open it in a photo editor, save it as a .jpg file, upload it, and then read it with Image::Magick, all is fine. This leads me to believe that the image is not being saved correctly.
I need to use LWP::UserAgent because the server I am connecting to won't allow the download unless it thinks a client is requesting the data.
use LWP::UserAgent;
use Image::Magick;

$ua = new LWP::UserAgent;
$ua->agent("$0/0.1 " . $ua->agent);
$ua->agent("Mozilla/8.0");

my $PICURL  = "http://www.example.com/img.aspx?pid=cjfsaf79afffafhfah777af7";
my $PICDEST = "/var/vhosts/mysite.com/httpdocs/images";
my $PICNAME = "01.jpg";

my $response = $ua->get($PICURL);

open(outfile, ">:raw", "$PICDEST/$PICNAME") or die $!;
binmode outfile;
if ($response->is_success) {
    print outfile $response->content;
    $Pi++;
    $PTOT++;
}
else {
    die $response->status_line;
}
close(outfile);

$image = new Image::Magick;
$image->Read("$PICDEST/$PICNAME");
$image->Scale(width => 800, height => 600);
$image->Write("$PICDEST/$PICNAME");
$image->Scale(width => 216, height => 163);
$image->Set(quality => 90);
$image->Write("$PICDEST/TH_$PICNAME");
Never use
$response->content()
here. You want
$response->decoded_content( charset => 'none' )
You are probably getting a compressed or otherwise Content-Encoded response; decoded_content undoes any Content-Encoding, and charset => 'none' tells it to leave the result as raw bytes instead of decoding them to characters.
From the HTTP::Response doc:
$r->content( $bytes )
This is used to get/set the raw content and it is inherited from the HTTP::Message base class. See HTTP::Message for details and other methods that can be used to access the content.
$r->decoded_content( %options )
This will return the content after any Content-Encoding and charsets have been decoded. See HTTP::Message for details.
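Applied to the download above, a minimal sketch of the fix might look like this (the URL and paths are the question's placeholders):

use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical URL and destination, copied from the question
my $picurl  = "http://www.example.com/img.aspx?pid=cjfsaf79afffafhfah777af7";
my $picfile = "/var/vhosts/mysite.com/httpdocs/images/01.jpg";

my $ua       = LWP::UserAgent->new( agent => "Mozilla/8.0" );
my $response = $ua->get($picurl);
die $response->status_line unless $response->is_success;

# decoded_content removes any Content-Encoding (e.g. gzip), and
# charset => 'none' keeps the result as raw bytes, which is what we want for an image.
open( my $fh, ">:raw", $picfile ) or die "Can't write $picfile: $!";
print {$fh} $response->decoded_content( charset => 'none' );
close($fh) or die $!;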
I know this is very old at this point, but I just ran into this as well. I was saving the image to disk before working with it, and while doing that I needed to set the file handle I was streaming/writing to to binmode.
open $fh....
binmode($fh)
print $fh .....
close $fh

Perl module loading - Safeguarding against: perhaps you forgot to load "Bla"?

When you run perl -e "Bla->new", you get this well-known error:
Can't locate object method "new" via package "Bla"
(perhaps you forgot to load "Bla"?)
Happened in a Perl server process the other day due to an oversight of mine. There are multiple scripts, and most of them have the proper use statements in place. But there was one script that was doing Bla->new in sub blub at line 123 while missing a use Bla at the top, and when it was hit by a request before any of the other scripts using Bla had been loaded by the server process, then bang!
Testing the script in isolation would be the obvious way to safeguard against this particular mistake, but alas the code is dependent upon a humungous environment. Do you know of another way to safeguard against this oversight?
Here's one example how PPI (despite its merits) is limited in its view on Perl:
use strict;
use HTTP::Request::Common;
my $req = GET 'http://www.example.com';
$req->headers->push_header( Bla => time );
my $au=Auweia->new;
__END__
PPI::Token::Symbol '$req'
PPI::Token::Operator '->'
PPI::Token::Word 'headers'
PPI::Token::Operator '->'
PPI::Token::Word 'push_header'
PPI::Token::Symbol '$au'
PPI::Token::Operator '='
PPI::Token::Word 'Auweia'
PPI::Token::Operator '->'
PPI::Token::Word 'new'
Setting the header and assigning Auweia->new parse the same, so I'm not sure how you can build upon such a shaky foundation. I think the problem is that Auweia could also be a subroutine; perl cannot tell until runtime.
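For reference, a dump like the one above can be produced with PPI::Dumper; a minimal sketch (the snippet is the same one as above, held in a string so it is self-contained):

use strict;
use warnings;
use PPI;
use PPI::Dumper;

my $source = <<'END_PERL';
use HTTP::Request::Common;
my $req = GET 'http://www.example.com';
$req->headers->push_header( Bla => time );
my $au = Auweia->new;
END_PERL

my $doc = PPI::Document->new(\$source);
PPI::Dumper->new( $doc, whitespace => 0, comments => 0 )->print;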
Further Update
Okay, from @Schwern's instructive comments below I learnt that PPI is just a tokenizer, and you can build upon it if you accept its limitations.
Testing is the only answer worth the effort. If the code contains mistakes like forgetting to load a class, it probably contains other mistakes. Whatever the obstacles, make it testable. Otherwise you're patching a sieve.
That said, you have two options. You can use Class::Autouse which will try to load a module if it isn't already loaded. It's handy, but because it affects the entire process it can have unintended effects.
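A minimal sketch of that approach, assuming the class that might be missing is Bla:

use Class::Autouse qw(Bla);   # or: Class::Autouse->autouse('Bla');

# Even without an explicit "use Bla;", the first method call on Bla
# will trigger loading of Bla.pm.
my $obj = Bla->new;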
Or you can use PPI to scan your code and find all the class method calls. PPI::Dumper is very handy to understand how PPI sees Perl.
use strict;
use warnings;
use PPI;
use PPI::Dumper;
my $file = shift;
my $doc = PPI::Document->new($file);
# How PPI sees a class method call.
# PPI::Token::Word 'Class'
# PPI::Token::Operator '->'
# PPI::Token::Word 'method'
$doc->find( sub {
    my($node, $class) = @_;

    # First we want a word
    return 0 unless $class->isa("PPI::Token::Word");

    # If the word is itself a method call, it's not a class name.
    return 0 if $class->method_call;
    my $class_name = $class->literal;

    # Next to it is a -> operator
    my $op = $class->snext_sibling or return 0;
    return 0 unless $op->isa("PPI::Token::Operator") and $op->content eq '->';

    # And then another word which PPI identifies as a method call.
    my $method = $op->snext_sibling or return 0;
    return 0 unless $method->isa("PPI::Token::Word") and $method->method_call;
    my $method_name = $method->literal;

    printf "%s->%s seen at %s line %d.\n", $class_name, $method_name, $file, $class->line_number;
});
You don't say what server environment you're running under, but from what you say it sounds like you could do with preloading all your modules in advance, before executing any individual pages. Not only would this prevent the problems you're describing (where every script has to remember to load all the modules it uses), but it would also save you memory.
In pre-forking servers (as is commonly used with mod_perl and Apache) you really want to load as much of your code as possible before your server forks for the first time, so that the code is stored once in copy-on-write shared memory rather than multiple times in each child process when it is loaded on demand.
For information on preloading in Apache, see the relevant section of Practical mod_perl.
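A minimal sketch of what that preloading can look like under mod_perl (the module names are placeholders; the file is pulled in once in the parent process via a PerlRequire directive in httpd.conf):

# startup.pl -- loaded via: PerlRequire /path/to/startup.pl
use strict;
use warnings;

# Preload everything the scripts use; the empty import list avoids
# polluting the startup namespace.
use Bla ();
use HTTP::Request::Common ();
use DBI ();

1;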

improving LWP::Simple perl performance

Alas, I have yet another question:
I have been tasked with reading a webpage and extracting links from that page (easy stuff with HTML::TokeParser). My boss then insists that I follow these links, grab some details from each of those pages, and parse ALL of that information into an XML file, which can later be read.
So, I can set this up fairly simply like so:
#!/usr/bin/perl -w
use strict;
use LWP::Simple;
require HTML::TokeParser;

$| = 1; # unbuffer output

my $base = 'http://www.something_interesting/';
my $path = 'http://www.something_interesting/Default.aspx';

my $rawHTML = get($path); # attempt to d/l the page to mem
my $p = HTML::TokeParser->new(\$rawHTML) || die "Can't open: $!";

open(my $out, ">", "output.xml") or die $!;

while (my $token = $p->get_tag("a")) {
    my $url = $token->[1]{href} || "-";
    if ($url =~ /event\.aspx\?eventid=(\d+)/) {
        ( my $event_id = $url ) =~ s/event\.aspx\?eventid=(\d+)/$1/;
        my $text = $p->get_trimmed_text("/a");
        print $out $event_id, "\n";
        print $out $text, "\n";

        my $details  = $base . $url;
        my $contents = get($details);

        # now set up another HTML::TokeParser, and parse each of those files.
    }
}
This would probably be OK if there were maybe 5 links on this page. However, I'm trying to read from ~600 links and grab info from each of these pages. So, needless to say, my method is taking a LONG time... I honestly don't know how long, since I've never let it finish.
It was my idea to simply write something that only gets the information as needed (e.g., a Java app that looks up the information from the link you want)... however, this doesn't seem to be acceptable, so I'm turning to you guys :)
Is there any way to improve on this process?
You will probably see a speed boost -- at the expense of less simple code -- if you execute your get()s in parallel instead of sequentially.
Parallel::ForkManager is where I would start (and even includes an LWP::Simple get() example in its documentation), but there are plenty of other alternatives to be found on CPAN, including the fairly dated LWP::Parallel::UserAgent.
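A minimal sketch of that approach, assuming the ~600 detail URLs have already been collected (the URL pattern below is just a placeholder):

use strict;
use warnings;
use LWP::Simple;
use Parallel::ForkManager;

# Placeholder list; in the real script this would be the event detail URLs.
my @urls = map { "http://www.something_interesting/event.aspx?eventid=$_" } 1 .. 600;

my $pm = Parallel::ForkManager->new(10);   # at most 10 concurrent downloads

for my $url (@urls) {
    $pm->start and next;          # parent: move on to the next URL
    my $contents = get($url);     # child: fetch one page
    # ... parse $contents and write this page's results out here ...
    $pm->finish;                  # child exits
}
$pm->wait_all_children;

Note that each download runs in a separate process, so results need to be written to per-child files or passed back with Parallel::ForkManager's data-retrieval mechanism rather than collected in a shared in-memory structure.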
If you want to fetch more than one item from a server and do so speedily, use TCP Keep-Alive. Drop the simplistic LWP::Simple and use the regular LWP::UserAgent with the keep_alive option. That will set up a connection cache, so you will not incur the TCP connection build-up overhead when fetching more pages from the same host.
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request::Common;

my @urls = @ARGV or die 'URLs!';
my %opts = ( keep_alive => 10 ); # cache 10 connections
my $ua = LWP::UserAgent->new( %opts );

for ( @urls ) {
    my $req = HEAD $_;
    print $req->as_string;
    my $rsp = $ua->request( $req );
    print $rsp->as_string;
}

my $cache = $ua->conn_cache;
my @conns = $cache->get_connections;
# has methods of Net::HTTP, IO::Socket::INET, IO::Socket
WWW::Mechanize is a great piece of work to start with and if you are looking at modules, I'd also suggest Web::Scraper
Both have docs at the links I provided and should help you get going quickly.
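As a rough illustration of the Mechanize route, assuming the same event.aspx link pattern as in the question:

use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new;
$mech->get('http://www.something_interesting/Default.aspx');

# find_all_links saves us from hand-rolling a token loop
my @links = $mech->find_all_links( url_regex => qr/event\.aspx\?eventid=\d+/ );

for my $link (@links) {
    print $link->url, "\t", $link->text, "\n";
    # $mech->get( $link->url_abs ) would fetch the detail page here
}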
There's a good chance it's blocking on the http get request while it waits for the response from the network. Use an asynchronous http library and see if it helps.
use strict;
use warnings;
use threads;    # or: use forks;
use Thread::Queue qw( );

use constant MAX_WORKERS => 10;

my $request_q  = Thread::Queue->new();
my $response_q = Thread::Queue->new();

# Create the workers.
my @workers;
for (1..MAX_WORKERS) {
    push @workers, async {
        while (my $url = $request_q->dequeue()) {
            $response_q->enqueue(process_request($url));
        }
    };
}

# Submit work to workers.
$request_q->enqueue(@urls);

# Signal the workers they are done.
for (1..@workers) {
    $request_q->enqueue(undef);
}

# Wait for the workers to finish.
$_->join() for @workers;

# Collect the results (dequeue_nb returns undef once the queue is empty,
# so this loop doesn't block after the last response).
while (defined(my $item = $response_q->dequeue_nb())) {
    process_response($item);
}
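The snippet above assumes @urls, process_request() and process_response() exist elsewhere; a rough sketch of those pieces, using LWP::UserAgent for the actual fetch (details are placeholders, and the declarations would need to appear before the workers are created and the queue is filled):

use LWP::UserAgent;

my @urls = @ARGV;   # or the list scraped from the index page

sub process_request {
    my ($url) = @_;
    my $ua  = LWP::UserAgent->new;
    my $rsp = $ua->get($url);
    return [ $url, $rsp->is_success ? $rsp->decoded_content : undef ];
}

sub process_response {
    my ($item) = @_;
    my ($url, $html) = @$item;
    unless (defined $html) {
        warn "failed to fetch $url\n";
        return;
    }
    # ... parse $html and append to the XML output here ...
}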
Your issue is that scraping is more CPU-intensive than I/O-intensive. While most people here would suggest using more CPUs, I'll try to show a great advantage of using Perl as a "glue" language.
Everyone agrees that libxml2 is an excellent XML/HTML parser, and libcurl is an awesome download agent.
However, in the Perl universe, many scrapers are based on LWP::UserAgent and HTML::TreeBuilder::XPath (which is similar to HTML::TokeParser, while being XPath-compliant).
In those cases, you can use drop-in replacement modules to handle downloads and HTML parsing via libcurl/libxml2:
use LWP::Protocol::Net::Curl;
use HTML::TreeBuilder::LibXML;
HTML::TreeBuilder::LibXML->replace_original();
I saw an average 5x speed increase just by prepending these 3 lines in several scrapers I used to maintain.
But as you're using HTML::TokeParser, I'd recommend trying Web::Scraper::LibXML instead (plus LWP::Protocol::Net::Curl, which affects both LWP::Simple and Web::Scraper).

Curl Perl module not working, formadd method missing

I want to use the following script:
use FileHandle;
use WWW::Curl::Easy;
use WWW::Curl::Form;
my ($file, $curl, $curlf, $return, $minified);
$file = new FileHandle();
$curl = new WWW::Curl::Easy();
$curl->setopt(CURLOPT_URL, "http://closure-compiler.appspot.com/compile");
$curlf = new WWW::Curl::Form();
$curlf->formadd('output_format', 'text');
$curlf->formadd('output_info', 'compiled_code');
$curlf->formadd('compilation_level', 'ADVANCED_OPTIMIZATIONS');
$curlf->formaddfile($name, 'js_code', 'multipart/form-data');
$curl->setopt(CURLOPT_HTTPPOST, $curlf);
$file->open(\$minified, ">");
$curl->setopt(CURLOPT_WRITEDATA, $file);
$return = $curl->perform();
The following error is thrown:
Can't locate object method "formadd" via package "WWW::Curl::Form" at ./minifyjs.pl ....
WHY??? The WWW::Curl module is installed properly; I used the libwww-curl-perl package under Debian/Ubuntu.
Can anyone help me please?
Whoops.
Looks like this commit broke formadd. The XS sub doesn't match the PREFIX = curl_form_ declaration (as it's named curl_formadd), so perl doesn't know how to map the Perl version of the method back to XS.
4.12 was the first release that tried to support WWW::Curl::Form; it looks like it didn't work after all. Not sure how I missed this one. I should probably note here that WWW::Curl::Form support wasn't exactly a high-priority TODO item on my list, given the existence of various high-quality form-handling modules on CPAN. I only accepted the patch for the sake of feature completeness. You're encouraged to use those modules for managing form content. The standard WWW::Curl use-case statement applies.
I released 4.13 to fix this issue. Good catch!
Check out WWW::Mechanize. It has a lot of nice form methods.
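For comparison, here's a rough sketch of the same POST using plain LWP::UserAgent with an ordinary form-encoded request instead of WWW::Curl::Form (the field names mirror the question's script; input.js is a placeholder, and the Closure Compiler API details are assumptions):

use strict;
use warnings;
use LWP::UserAgent;

# Slurp the JavaScript to be minified (placeholder file name).
my $js_code = do { local $/; open my $fh, '<', 'input.js' or die $!; <$fh> };

my $ua  = LWP::UserAgent->new;
my $rsp = $ua->post(
    'http://closure-compiler.appspot.com/compile',
    {
        output_format     => 'text',
        output_info       => 'compiled_code',
        compilation_level => 'ADVANCED_OPTIMIZATIONS',
        js_code           => $js_code,
    },
);
die $rsp->status_line unless $rsp->is_success;
my $minified = $rsp->decoded_content;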

How can I fix the "Couldn't create file parser context for file ..." bug with Perl libxml on Debian?

When I try to read an XML file with XML::Simple, sometimes I get this error message:
Couldn't create file parser context for file ...
After some googling, it seems to be a problem with libxml-libxml-perl and is supposed to be fixed in the version I use (1.59-2).
Any ideas?
Edit: (code)
sub Read
{
    my ($file, $no_option) = @_;
    my %XML_INPUT_OPTIONS = ( KeyAttr => [], ForceArray => 1 );

    if ((defined $file) && (-f $file))
    {
        my @stats = stat($file);
        if ((defined $XML_CACHE{$file})
            && ($stats[9] == $XML_CACHE{$file}{modif_time}))
        {
            return ($XML_CACHE{$file}{xml});
        }
        else
        {
            my $xml = eval { XMLin($file,
                (defined $no_option ? () : %XML_INPUT_OPTIONS)) };
            AAT::Syslog("AAT::XML", "XML_READ_ERROR", $@) if ($@);
            $XML_CACHE{$file}{modif_time} = $stats[9];
            $XML_CACHE{$file}{xml} = $xml;
            return ($xml);
        }
    }
    return (undef);
}
And yes, I should & will use XML::Simple cache feature...
Does the error continue "No such file or directory at..."? If so, then I think that the problem is that (for whatever reason) when you get to that point in the script, whatever you are passing to XML::Simple has no xml file in it. Long story short, the script you are using may be passing a bad variable (blank? empty?) to XML::Simple at which point the module chokes. To debug, add a check on whatever you hand to XML::Simple before you pass it along. (See the next paragraph for a concrete example explaining why I think this may be your problem.)
A few months ago, I had a similar problem with Weather::Google. In a nutshell, the weather module was trying to get data from Google via LWP::Simple without a user agent. Google began (apparently) to reject requests without a user agent. I had to backtrack through the modules because the error appeared to come from XML::Simple. In fact, it was caused by what was done in LWP::Simple and Weather::Google. Or rather, the error was a result of Weather::Google not checking the data that was in an object created via LWP::Simple. In a case like this, it can be hard at first to see what's going wrong and where.
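As a concrete starting point for the check suggested above, a minimal sketch (the die messages are placeholders):

# Before handing $file to XML::Simple, confirm it actually points at
# a readable, non-empty file.
die "No file name passed to Read()\n" unless defined $file && length $file;
die "No such file: $file\n"           unless -f $file;
die "File is empty: $file\n"          unless -s $file;

my $xml = XMLin($file, %XML_INPUT_OPTIONS);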