improving LWP::Simple perl performance - perl

Alas, I have yet another question:
I have been tasked with reading a webpage and extracting links from that page (easy stuff with HTML::TokeParser). My boss then insists that I follow each of those links, grab some details from each of those pages, and write ALL of that information to an XML file, which can later be read.
So, I can set this up fairly simply like so:
#!/usr/bin/perl -w
use strict;
use LWP::Simple;
require HTML::TokeParser;
$|=1; # unbuffer output
my $base = 'http://www.something_interesting/';
my $path = 'http://www.something_interesting/Default.aspx';
my $rawHTML = get($path); # attempt to d/l the page to mem
my $p = HTML::TokeParser->new(\$rawHTML) || die "Can't open: $!";
open (my $out, "> output.xml") or die;
while (my $token = $p->get_tag("a")) {
    my $url = $token->[1]{href} || "-";
    if ($url =~ /event\.aspx\?eventid=(\d+)/) {
        ( my $event_id = $url ) =~ s/event\.aspx\?eventid=(\d+)/$1/;
        my $text = $p->get_trimmed_text("/a");
        print $out $event_id,"\n";
        print $out $text,"\n";
        my $details = $base.$url;
        my $contents = get($details);
        # now set up another HTML::TokeParser, and parse each of those files.
    }
}
This would probably be OK if there were maybe 5 links on this page. However, I'm trying to read from ~600 links and grab info from each of those pages. So, needless to say, my method is taking a LONG time... I honestly don't know how long, since I've never let it finish.
It was my idea to simply write something that only gets the information as needed (e.g., a Java app that looks up the information from the link that you want)... however, this doesn't seem to be acceptable, so I'm turning to you guys :)
Is there any way to improve on this process?

You will probably see a speed boost -- at the expense of slightly more complex code -- if you execute your get()s in parallel instead of sequentially.
Parallel::ForkManager is where I would start (and even includes an LWP::Simple get() example in its documentation), but there are plenty of other alternatives to be found on CPAN, including the fairly dated LWP::Parallel::UserAgent.
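For illustration, here is a minimal sketch of that approach, assuming the ~600 detail URLs have already been collected into @urls (the URL and the worker count are placeholders, not a drop-in solution):
use strict;
use warnings;
use LWP::Simple;
use Parallel::ForkManager;

my @urls = ('http://www.something_interesting/event.aspx?eventid=1');  # placeholder list
my $pm = Parallel::ForkManager->new(10);   # at most 10 children fetching at once

for my $url (@urls) {
    $pm->start and next;        # parent: launch a child, move on to the next URL
    my $html = get($url);       # child: fetch one detail page
    # ... parse $html and write its piece of the output here ...
    $pm->finish;                # child exits
}
$pm->wait_all_children;
Note that each child is a separate process, so results have to be written to files (or passed back via Parallel::ForkManager's run_on_finish mechanism) rather than kept in shared variables.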

If you want to fetch more than one item from a server and do so speedily, use HTTP keep-alive (persistent connections). Drop the simplistic LWP::Simple and use the regular LWP::UserAgent with the keep_alive option. That sets up a connection cache, so you will not incur the TCP connection set-up overhead when fetching more pages from the same host.
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request::Common;

my @urls = @ARGV or die 'URLs!';
my %opts = ( keep_alive => 10 ); # cache 10 connections
my $ua = LWP::UserAgent->new( %opts );

for ( @urls ) {
    my $req = HEAD $_;
    print $req->as_string;
    my $rsp = $ua->request( $req );
    print $rsp->as_string;
}

my $cache = $ua->conn_cache;
my @conns = $cache->get_connections;
# has methods of Net::HTTP, IO::Socket::INET, IO::Socket

WWW::Mechanize is a great piece of work to start with, and if you are looking at modules, I'd also suggest Web::Scraper.
Both have docs at the links I provided and should help you get going quickly.
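As a rough sketch of the Mechanize route, reusing the URL and event regex from the question (everything else is a placeholder, not a finished solution):
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
$mech->get('http://www.something_interesting/Default.aspx');

# find_all_links pulls out just the event links in one call
my @links = $mech->find_all_links( url_regex => qr/event\.aspx\?eventid=\d+/ );

for my $link (@links) {
    my ($event_id) = $link->url =~ /eventid=(\d+)/;
    my $text       = $link->text;
    # $mech->get($link->url_abs) and parse the detail page here
}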

There's a good chance it's blocking on the http get request while it waits for the response from the network. Use an asynchronous http library and see if it helps.
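For example, HTTP::Async keeps several requests in flight at once; a minimal sketch (the URL list is a placeholder, and the parsing step is left out):
use strict;
use warnings;
use HTTP::Async;
use HTTP::Request;

my @urls = ('http://www.something_interesting/event.aspx?eventid=1');  # placeholder
my $async = HTTP::Async->new;
$async->add( HTTP::Request->new( GET => $_ ) ) for @urls;

# Responses come back as they arrive, not in the order they were sent.
while ( my $response = $async->wait_for_next_response ) {
    my $html = $response->decoded_content;
    # parse $html here
}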

use strict;
use warnings;
use threads; # or: use forks;
use Thread::Queue qw( );

use constant MAX_WORKERS => 10;

my @urls = @ARGV;  # or however you collect the ~600 detail-page URLs

my $request_q  = Thread::Queue->new();
my $response_q = Thread::Queue->new();

# Create the workers.
my @workers;
for (1..MAX_WORKERS) {
    push @workers, async {
        while (my $url = $request_q->dequeue()) {
            $response_q->enqueue(process_request($url));
        }
    };
}

# Submit work to workers.
$request_q->enqueue(@urls);

# Signal the workers they are done.
for (1..@workers) {
    $request_q->enqueue(undef);
}

# Wait for the workers to finish.
$_->join() for @workers;

# Collect the results. dequeue_nb doesn't block, so the loop
# ends once the queue has been drained.
while (defined(my $item = $response_q->dequeue_nb())) {
    process_response($item);
}
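The snippet assumes process_request() and process_response() exist somewhere; a minimal sketch of what they might look like (fetching inside the workers, parsing back in the main thread) could be:
use LWP::Simple qw( get );

sub process_request {
    my ($url) = @_;
    my $html = get($url);       # runs inside a worker thread
    return [ $url, $html ];     # anything queueable works as a result
}

sub process_response {
    my ($item) = @_;
    my ($url, $html) = @$item;
    # parse $html (e.g. with HTML::TokeParser) back in the main thread
}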

Your issue is that scraping is more CPU-intensive than I/O-intensive. While most people here would suggest throwing more CPU at it, I'll try to show a great advantage of Perl being used as a "glue" language.
Everyone agrees that libxml2 is an excellent XML/HTML parser, and libcurl is an awesome download agent.
However, in the Perl universe, many scrapers are based on LWP::UserAgent and HTML::TreeBuilder::XPath (which is similar to HTML::TokeParser, while being XPath-compliant).
In those cases, you can use drop-in replacement modules to handle downloads and HTML parsing via libcurl/libxml2:
use LWP::Protocol::Net::Curl;
use HTML::TreeBuilder::LibXML;
HTML::TreeBuilder::LibXML->replace_original();
I saw an average 5x speed increase just by prepending these 3 lines in several scrapers I used to maintain.
But since you're using HTML::TokeParser, I'd recommend trying Web::Scraper::LibXML instead (plus LWP::Protocol::Net::Curl, which affects both LWP::Simple and Web::Scraper).
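A rough sketch of what that combination might look like (the selector and URL are placeholders, not taken from the original code):
use LWP::Protocol::Net::Curl;   # route LWP downloads through libcurl
use Web::Scraper::LibXML;       # Web::Scraper DSL, backed by libxml2
use URI;

my $events = scraper {
    process 'a', 'links[]' => '@href';
};
my $result = $events->scrape( URI->new('http://www.something_interesting/Default.aspx') );
# $result->{links} is now an array ref of the extracted hrefs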

Related

Can't figure out why Sereal encoder/decoder round-trip is not returning proper object

With all the hating on Storable, I decided to check out Sereal for my serialization needs. Plus I was having some 32-bit/64-bit cross-platform issues with Storable, so I figured this would be a good time.
After having some issues, I boiled the problem down to the following code. (I'm persisting an HTTP::Request object, hence the example code.)
This is my encode test; I'm storing to a file:
use Sereal::Encoder;
use HTTP::Request;
use HTTP::Headers;
use URI;
my $encoder = Sereal::Encoder->new();
open(my $fh, ">", 'myfile.data') or die $!;
binmode($fh);
my $uri = URI->new('http://www.example.com');
my $headers = HTTP::Headers->new(
    Content_Type => 'application/json',
);
my $http_request = HTTP::Request->new(POST => $uri, $headers, 'bleh');
print $fh $encoder->encode( $http_request );
close($fh);
And on the same machine (same perl, 5.18), I run the following:
use Sereal::Decoder;
use File::Slurp qw(read_file);
use URI;
my $data = read_file('myfile.data') or die $!;
my $dec = Sereal::Decoder->new();
my $decoded = $dec->decode($data);
print $decoded->{_uri}->scheme,"\n";
And the output of running the encoding program, and then the decoding program is:
Can't locate object method "scheme" via package "URI::http" at testd.pl line 8.
Anyhow, it was really nagging me as to what the problem was. I ended up reverting back to Storable and using nfreeze to solve my arch issues, but I was wondering why my attempt to transition to Sereal crashed and burned.
Thanks!
Sereal, unlike Storable, won't automatically load a module when it encounters a serialized object. That automatic loading is a security issue with Storable, so Sereal is working as intended.
At the point where scheme is called in the second test program, URI::http hasn't been loaded yet, so the method call results in an error. It seems that URI will load its subclasses when its constructor is used on a string that "looks like" one of them, e.g.
URI->new('http://www.stackoverflow.com');
loads the URI::http module. So one solution would be to add a dummy invocation of that constructor to ensure URI::http is loaded, or manually use URI::http instead. Either option causes the print $decoded->{_uri}->scheme line of the second script to work as expected, but I think the second is the lesser of two evils (importing an undocumented submodule from URI versus an arbitrary method call done specifically for its not-immediately-obvious side effect).
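In other words, the second program works once the class is loaded up front; a sketch of the fix described above:
use Sereal::Decoder;
use File::Slurp qw(read_file);
use URI;
use URI::http;   # ensure the serialized object's class is loaded before decoding

my $data    = read_file('myfile.data') or die $!;
my $decoded = Sereal::Decoder->new->decode($data);
print $decoded->{_uri}->scheme, "\n";   # now resolves, since URI::http is loaded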

In Perl, can I dynamically add methods to only one object of a package?

I'm working with WWW::Mechanize to automate web-based back office clicking I need to do to get my test e-commerce orders into the state I need them to be to test changes I have made to a particular part of a long, multi-part workflow. To process a lot of orders in a batch, I need to click the Home link often. To make that shorter, I hacked a method into WWW::Mechanize at run time like this (based on an example in Mastering Perl by brian d foy):
{ # Shortcut to go back to the home page by calling $mech->go_home
    # I know I'll get a warning and do not want it!
    no warnings 'once';
    my $homeLink = $mech->find_link( text => 'Home' )->url_abs();
    $homeLink =~ s/system=0/system=1/;
    *WWW::Mechanize::go_home = sub {
        my ($self) = @_;
        return $self->get($homeLink);
    };
}
This works great, and does not hurt anyone because the script I'm using it in is only used by me and is not part of the larger system.
But now I wonder if it is possible to tell only one $mech object that it has this method, so that another WWW::Mechanize object created later (to, say, do some cross-referencing without mixing up the one that has an active session to my back office) cannot use that method.
I'm not sure if that is possible at all, since, if I understand the way objects work in Perl, the -> operator tells it to look for the subroutine go_home inside the package WWW::Mechanize and pass the $mech as the first argument to it. Please correct me if this understanding is wrong.
I've experimented by adding a sort of hard-coded check that only lets the original $mech object use the function.
my $onlyThisMechMayAccessThisMethod = "$mech";
my $homeLink = $mech->find_link( text => 'Home' )->url_abs();
$homeLink =~ s/system=0/system=1/;
*WWW::Mechanize::go_home = sub {
    my ($self) = @_;
    return undef unless $self eq $onlyThisMechMayAccessThisMethod;
    return $self->get($homeLink);
};
Since "$mech" contains the address of where the data is stored (e.g. WWW::Mechanize=HASH(0x2fa25e8)), another object will look differently when stringified this way.
I am not convinced however that this is the way to go. So my question is: Is there a better way to only let one object of the WWW::Mechanize class have this method? I'm also glad about other suggestions regarding this code.
This is just
$mech->follow_link(text => 'Home')
and I don't think it's special enough to warrant a method of its own, or to need restricting to an exclusive club of objects.
It's also worth noting that there is no need to mess with typeglobs to declare a subroutine in a different package. You just have to write, for example
sub WWW::Mechanize::go_home {
    my ($self) = @_;
    return $self->get($homeLink);
}
But the general solution is to subclass WWW::Mechanize and declare as members only those objects you want to have the new method.
File MyMechanize.pm
package MyMechanize;

use strict;
use warnings;

use parent 'WWW::Mechanize';

sub go_home {
    my $self = shift;
    my $homeLink = $self->find_link(text => 'Home')->url_abs;
    $homeLink =~ s/system=0/system=1/;
    return $self->get($homeLink);
}

1;
File test.pl
use strict;
use warnings;
use MyMechanize;

my $mech = MyMechanize->new;
$mech->get('http://mydomain.com/path/to/site/page.html');
$mech->go_home;

RequestTracker and Mason destroying my class over and over again

I'm hacking on Request Tracker, which is written in Perl and uses Mason for the web interface. I'm trying to make a customized page that involves an autohandler, an HTML page, and pulls in some methods from other comps. I have a simple class that I want to use to track a few things that I need for my parts of the interface. Right now all it tracks is a database handle.
package RTx::FooBar::Web;

use strict;
use warnings;

use RTx::FooBar::Handle;

sub new
{
    my $proto = shift;
    $RT::Logger->debug("creating new");
    my $class = ref($proto) || $proto;
    my $self = {};
    bless( $self, $class );
    my $handle = RTx::FooBar::Handle->new();
    $handle->Connect();
    $self->{cfHandle} = $handle;
    return $self;
}

sub DESTROY {
    my $self = shift;
    $RT::Logger->debug("destroy");
    delete $self->{cfHandle};
}

sub CFHandle
{
    my $self = shift;
    return $self->{cfHandle};
}

1;
I tried sticking that into the session so I could use it wherever I needed it in the web interface. So I try to use it in one web page; the autohandler does:
% $m->call_next;
<%INIT>
$RT::Logger->debug("my autohandler");
use RTx::FooBar::Web;
$session{cfWeb} ||= RTx::FooBar::Web->new();
</%INIT>
The thing that's bugging me right now (other than the fact that it's not working) is that the logging in the "new" method prints out once, but the logging in the DESTROY method prints out 56 times. And each time, the debug in RTx::FooBar::Handle->DESTROY prints out as well, indicating that $self->{cfHandle} wasn't removed. Can anybody suggest why this might be happening? Is it because session is a tied hash?
*Update* I'm no longer using $session, and it's STILL destroying my handle 56 times after creating it once.
Just a guess here: $session is re-creating your objects from an externalized version (perhaps via Storable or the like). So it gets the object back without calling new, hence no logging. Not sure why it would be getting destroyed every time, though.
It looks like $session is in fact a tied hash, to an Apache::Session or something implementing the same interface (the docs talk about db backing for sessions).
If you want to have a global persistent object (i.e., not tied to a single request) then give it a fully-qualified name, e.g. $RTx::FooBar::Web::TheObject ||= RTx::FooBar::Web->new();, or use something like Cache::MemoryCache. Alternatively, set up your own global variable via the MasonAllowGlobals setting, although you might need to get into RT's config files to change that.
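For instance, the autohandler from the question could be adjusted roughly like this (the package-global name is just illustrative):
% $m->call_next;
<%INIT>
$RT::Logger->debug("my autohandler");
use RTx::FooBar::Web;
# A fully-qualified package variable persists across requests in the same
# interpreter, unlike the tied %session entry.
$RTx::FooBar::Web::TheObject ||= RTx::FooBar::Web->new();
</%INIT>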

Perl module loading - Safeguarding against: perhaps you forgot to load "Bla"?

When you run perl -e "Bla->new", you get this well-known error:
Can't locate object method "new" via package "Bla"
(perhaps you forgot to load "Bla"?)
Happened in a Perl server process the other day due to an oversight of mine. There are multiple scripts, and most of them have the proper use statements in place. But there was one script that was doing Bla->new in sub blub at line 123 but missing a use Bla at the top, and when it was hit by a request before any of the other scripts that use Bla had been loaded by the server process, then bang!
Testing the script in isolation would be the obvious way to safeguard against this particular mistake, but alas the code is dependent upon a humungous environment. Do you know of another way to safeguard against this oversight?
Here's one example of how PPI (despite its merits) is limited in its view of Perl:
use strict;
use HTTP::Request::Common;
my $req = GET 'http://www.example.com';
$req->headers->push_header( Bla => time );
my $au=Auweia->new;
__END__
PPI::Token::Symbol '$req'
PPI::Token::Operator '->'
PPI::Token::Word 'headers'
PPI::Token::Operator '->'
PPI::Token::Word 'push_header'
PPI::Token::Symbol '$au'
PPI::Token::Operator '='
PPI::Token::Word 'Auweia'
PPI::Token::Operator '->'
PPI::Token::Word 'new'
Setting the header and assigning Auweia->new parse the same. So I'm not sure how you can build upon such a shaky foundation. I think the problem is that Auweia could also be a subroutine; perl.exe cannot tell until runtime.
Further Update
Okay, from @Schwern's instructive comments below I learnt that PPI is just a tokenizer, and you can build upon it if you accept its limitations.
Testing is the only answer worth the effort. If the code contains mistakes like forgetting to load a class, it probably contains other mistakes. Whatever the obstacles, make it testable. Otherwise you're patching a sieve.
That said, you have two options. You can use Class::Autouse which will try to load a module if it isn't already loaded. It's handy, but because it affects the entire process it can have unintended effects.
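For the first option, usage is roughly as simple as this sketch (Bla being the question's placeholder class):
use Class::Autouse qw( Bla );          # defer loading Bla until a method is first called
# or, more aggressively (catches any class, process-wide):
use Class::Autouse qw( :superloader );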
Or you can use PPI to scan your code and find all the class method calls. PPI::Dumper is very handy to understand how PPI sees Perl.
use strict;
use warnings;
use PPI;
use PPI::Dumper;

my $file = shift;
my $doc = PPI::Document->new($file);

# How PPI sees a class method call.
#   PPI::Token::Word     'Class'
#   PPI::Token::Operator '->'
#   PPI::Token::Word     'method'
$doc->find( sub {
    my($node, $class) = @_;

    # First we want a word
    return 0 unless $class->isa("PPI::Token::Word");

    # It's not a class, it's actually a method call.
    return 0 if $class->method_call;
    my $class_name = $class->literal;

    # Next to it is a -> operator
    my $op = $class->snext_sibling or return 0;
    return 0 unless $op->isa("PPI::Token::Operator") and $op->content eq '->';

    # And then another word which PPI identifies as a method call.
    my $method = $op->snext_sibling or return 0;
    return 0 unless $method->isa("PPI::Token::Word") and $method->method_call;
    my $method_name = $method->literal;

    printf "$class_name->$method_name seen at %s line %d.\n", $file, $class->line_number;
});
You don't say what server environment you're running under, but from what you say it sounds like you could do with preloading all your modules in advance, before executing any individual pages. Not only would this prevent the problem you're describing (where every script has to remember to load all the modules it uses), but it would also save you memory.
In pre-forking servers (as is commonly used with mod_perl and Apache) you really want to load as much of your code as possible before your server forks for the first time, so that the code is stored once in copy-on-write shared memory rather than multiple times in each child process when it is loaded on demand.
For information on pre-loading in Apache, see the relevant section of Practical mod_perl.
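In a typical mod_perl setup this usually means a startup.pl pulled in once by the parent process; a rough sketch (module names are illustrative, apart from the question's Bla):
# startup.pl -- loaded by the parent Apache process, e.g. via
#   PerlRequire /path/to/startup.pl
# in httpd.conf, before any children are forked.
use strict;
use warnings;
use Bla ();                     # the class that was being forgotten
use HTTP::Request::Common ();
use LWP::UserAgent ();
1;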

Efficient Way To Check 10,000s of Blog Feeds in Perl

We have 10,000s of blogs we want to check multiple times a day for new posts. I'd love some ideas with example code on the most efficient way to do this using Perl.
Currently we are just using LWP::UserAgent to download each RSS feed and then checking each URL in the resulting feed against a MySQL database table of already found URLs one at a time. Needless to say this doesn't scale well and is super inefficient.
Thanks in advance for your help & advice!
Unfortunately, there is probably no other way than to do some kind of polling.
Luckily, implementing the PubSubHubbub protocol can greatly help reduce the amount of polling for the feeds that support it.
For those feeds that don't support PubSubHubbub, you'll have to make sure you use HTTP-level mechanisms (like ETag or If-Modified-Since headers) to know if/when a resource has been updated.
Also make sure you implement some kind of back-off mechanism.
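A conditional GET with LWP might look roughly like this (the feed URL and the stored validators are placeholders for whatever your database holds per feed):
use strict;
use warnings;
use LWP::UserAgent;

my $feed_url      = 'http://example.com/atom.xml';   # placeholder
my $last_etag     = undef;   # loaded from your database, per feed
my $last_modified = undef;   # ditto

my $ua = LWP::UserAgent->new;
my %headers;
$headers{'If-None-Match'}     = $last_etag     if defined $last_etag;
$headers{'If-Modified-Since'} = $last_modified if defined $last_modified;

my $res = $ua->get($feed_url, %headers);
if ($res->code == 304) {
    # 304 Not Modified -- nothing new, skip parsing this feed
}
elsif ($res->is_success) {
    # parse the feed, then remember the validators for next time
    $last_etag     = $res->header('ETag');
    $last_modified = $res->header('Last-Modified');
}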
Perhaps look at AnyEvent::Feed. It is asynchronous (using the AnyEvent event loop), with configurable polling intervals, built-in support for 'seen' articles, and support for both RSS and Atom feeds. You could create a single process polling every feed, or multiple processes polling different sections of your feed list.
From the synopsis:
use AnyEvent;
use AnyEvent::Feed;

my $feed_reader =
    AnyEvent::Feed->new (
        url      => 'http://example.com/atom.xml',
        interval => $seconds,   # poll interval in seconds

        on_fetch => sub {
            my ($feed_reader, $new_entries, $feed, $error) = @_;

            if (defined $error) {
                warn "ERROR: $error\n";
                return;
            }

            for (@$new_entries) {
                my ($hash, $entry) = @$_;
                # $hash is a unique hash describing the $entry
                # $entry is the XML::Feed::Entry object of the new entries
                # since the last fetch.
            }
        }
    );
Seems like two questions rolled into one: fetching and comparing. Others have answered the fetch part. As for comparing:
I've been reading about Redis lately and it seems like a good fit for you, as it can do a lot of simple operations per second (let's say ~80k/s). So checking whether you already have a URL should go really fast. Never actually used it though ;)
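As a rough sketch of the idea (using the Redis module from CPAN; the key name and URL list are placeholders), a set membership check and insert can be done in a single SADD call:
use strict;
use warnings;
use Redis;

my $redis = Redis->new;   # assumes a redis-server on localhost:6379
my @found_urls = ('http://example.com/post/1');   # URLs pulled from a fetched feed

for my $url (@found_urls) {
    # SADD returns 1 if the URL was new, 0 if it was already in the set.
    if ($redis->sadd('seen_urls', $url)) {
        # new post -- process it
    }
}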
An idea: have you tried comparing on size before parsing the RSS? It might save you some time if the feeds change infrequently.
10,000 is not that many.
You could probably handle them using some simple approach, like forking some worker processes that get RSS URLs from the db, fetch them, and update the database:
# $n is the number of worker processes; open_db, get_next_url, feed_rss,
# update_rss and wait_for_workers are sketched helpers to fill in.
my @pids;
for (1..$n) {
    my $pid = fork;
    if (!$pid) {
        defined $pid or die "fork failed";
        my $db = open_db();
        while (1) {
            $url = get_next_url($db) or last;
            $rss = feed_rss($url);
            update_rss($db, $rss);
        }
        exit(0);
    }
    push @pids, $pid;
}
wait_for_workers(@pids);
That is, assuming you are not able to use any of the existing applications already pointed out by other responders.