How can I add a progress bar to WWW::Mechanize? - perl

I have the following code:
$mech->get($someurl, ":content_file" => "$i.flv");
So I'm getting the contents of a url and saving it as an flv file. I'd like to print out every second or so how much of the download is remaining. Is there any way to accomplish this in WWW::Mechanize?

WWW::Mechanize says that the get method is a "well-behaved" overload of LWP::UserAgent get. Looking at the docs for LWP::UserAgent, you can provide a content_cb key which is called with each chunk of the downloaded file:
$mech->get( $someurl, ":content_cb" => \&callback );
sub callback
{
my( $data, $response, $proto ) = #_;
# save $data to $i.flv
# print download notification
}

Many thanks to Peter Kovacs' answer for leading me to the correct answer. It turned out to be a bit more elaborate than I'd expected though so I decided to (horror) answer my own question.
As Peter showed, I can set a callback like so:
$m->get($u, ":content_cb" => \&callback);
But now I can't save the content using the :content_file value, because I can only choose one of the two. The callback function gets passed the data, and I ended up writing that to a file instead.
I also get a response object which contains the total size of the content as friedo pointed out. So by keeping a running total of content received so far and dividing it by the total content I can find out what percent of the content has been downloaded. Here's the full callback function:
open (VID,">$i.flv") or die "$!";
$total = 0;
sub callback
{
my( $data, $response, $proto ) = #_;
print VID "$data"; # write data to file
$total+= length($data);
$size = $response->header('Content-Length');
print floor(($total/$size)*100),"% downloaded\n"; # print percent downloaded
}
I hope that helps someone.

Related

How to read from URL line-wise?

I'm looking for the "moral equivalent" of the (fictitious) openremote below:
my $handle = openremote( 'http://some.domain.org/huge.tsv' ) or die $!;
while ( <$handle> ) {
chomp;
# etc.
# do stuff with $_
}
close $handle;
IOW, I'm looking for a way to open a read handle to a remote file so that I can read from it line-by-line. (Typically this file will be larger than I want to read entirely into memory. This means that solutions based on stuffing the value returned by LWP::Simple::get (for example) into an IO::String are not suitable.)
I'm sure this is really basic stuff, but I have not been able to find it after a lot of searching.
Here's a "solution" much like the other responses but it cheats a bit by using IO::All
use IO::All ;
my $http_io = io->http("http://some.domain.org/huge.tsv");
while (my $line = $http_io->getline || $http_io->getline) {
print $line;
}
After you have an object with io->http you can use IO methods to look at it (like getline() etc.).
Cheers.
You can use LWP::UserAgent's parameter :content_file => $filename to save the big file to disk directly, without filling the memory with it, and then you can read that file in your program.
$ua->get( $url, ':content_file' => $filename );
Or you can use the parameter :content_cb => \&callback and in the callback subroutine you can process the data chunk by chunk as it is downloaded. This is probably the way you need.
$ua->get( $url, ':content_cb' => \&callback );
sub callback {
my ( $chunk, $response, $protocol ) = #_;
#Do whatever you like with $chunk
}
Read (a little) more about this with perldoc LWP::UserAgent.
Use LWP::Simple coupled with IO::String like so:
#!/usr/bin/env perl
use strict;
use warnings;
use LWP::Simple;
use IO::String;
my $handle = IO::String->new(get("http://stackoverflow.com"));
while (defined (my $line = <$handle>)) {
print $line;
}
close $handle;
Hope it works for you.
Paul

Downloading Text from Several Links using WWW::Mechanize

For an entire week I have been attempting to write a code that will download links from a webpage and then loop through each link to dump the content written on each link's page. The original webpage I downloaded has 500 links to separate web pages that each contain important information for me. I only want to go one level down. However I am having several issues.
RECAP: I want to download the links from a webpage and automatically have my program print off the text contained in those links. I would prefer to have them printed in a file.
1) When I download the links from the original website, the useful ones are not written out fully. (ie they say "/festevents.nsf/all?openform" which is not a usable webpage)
2) I have been unable to print the text content of the page. I have been able to print the font details, but that is useless.
#Download all the modules I used#
use LWP::UserAgent;
use HTML::TreeBuilder;
use HTML::FormatText;
use WWW::Mechanize;
use Data::Dumper;
#Download original webpage and acquire 500+ Links#
$url = "http://wx.toronto.ca/festevents.nsf/all?openform";
my $mechanize = WWW::Mechanize->new(autocheck => 1);
$mechanize->get($url);
my $title = $mechanize->title;
print "<b>$title</b><br />";
my #links = $mechanize->links;
foreach my $link (#links) {
# Retrieve the link URL
my $href = $link->url_abs;
#
# $URL1= get("$link");
#
my $ua = LWP::UserAgent->new;
my $response = $ua->get($href);
unless($response->is_success) {
die $response->status_line;
}
my $URL1 = $response->decoded_content;
die Dumper($URL1);
#This part of the code is just to "clean up" the text
$Format=HTML::FormatText->new;
$TreeBuilder=HTML::TreeBuilder->new;
$TreeBuilder->parse($URL1);
$Parsed=$Format->format($TreeBuilder);
open(FILE, ">TorontoParties.txt");
print FILE "$Parsed";
close (FILE);
}
Please help me! I am desperate! If possible please explain to me the logic behind each step? I have been frying my brain on this for a week and I want help seeing other peoples logic behind the problems.
Too much work. Study the WWW::Mechanize API to realise that almost all of that functionality is already built-in. Untested:
use strictures;
use WWW::Mechanize qw();
use autodie qw(:all);
open my $h, '>:encoding(UTF-8)', 'TorontoParties.txt';
my $mechanize = WWW::Mechanize->new;
$mechanize->get('http://wx.toronto.ca/festevents.nsf/all?openform');
foreach my $link (
$mechanize->find_all_links(url_regex => qr'/festevents[.]nsf/[0-9a-f]{32}/[0-9a-f]{32}[?]OpenDocument')
) {
$mechanize->get($link->url_abs);
print {$h} $mechanize->content(format => 'text');
}
close $h;

perl html parsing lib/tool

Is there some powerful tools/libs for perl like BeautifulSoup to python?
Thanks
HTML::TreeBuilder::XPath is a decent solution for most problems.
I never used BeautifulSoup, but from quick skim over its documentation you might want HTML::TreeBuilder. It can process even broken documents well and allows traverse over parsed tree or query items - look at look_down method in HTML::Element.
If you like/know XPath, see daxim's recommendation. If you like to pick items via CSS selector, have a look at Web::Scraper or Mojo::DOM.
As you're looking for power, you can use XML::LibXML to parse HTML. The advantage then is that you have all the power of the fastest and best XML toolchain (excecpt MSXML, which is MS only) available to Perl to process your document, including XPath and XSLT (which would require a re-parse if you used another parser than XML::LibXML).
use strict;
use warnings;
use XML::LibXML;
# In 1.70, the recover and suppress_warnings options won't shup up the
# warnings. Hence, a workaround is needed to keep the messages away from
# the screen.
sub shutup_stderr {
my( $subref, $bufref ) = #_;
open my $fhbuf, '>', $bufref;
local *STDERR = $fhbuf;
$subref->(); # execute code that needs to be shut up
return;
}
# ==== main ============================================================
my $url = shift || 'http://www.google.de';
my $parser = XML::LibXML->new( recover => 2 ); # suppress_warnings => 1
# Note that "recover" and "suppress_warnings" might not work - see above.
# https://rt.cpan.org/Public/Bug/Display.html?id=58024
my $dom; # receive document
shutup_stderr
sub { $dom = $parser->load_html( location => $url ) }, # code
\my $errmsg; # buffer
# Now process document as XML.
my #nodes = $dom->getElementsByLocalName( 'title' );
printf "Document title: %s\n", $_->textContent for #nodes;
printf "Lenght of error messages: %u\n", length $errmsg;
print '-' x 72, "\n";
print $dom->toString( 1 );

Mechanize question regarding submit()

I tried looking around on the forum and googling for answers but cannot figure it out. After submitting a form for a webpage that requires time to do some computation does Mechanize wait for all the computation to finish (even if it's taking an hour?). It seems as if that doesn't happen. I am iterating through a subroutine that creates a Mechanize object and submits a form and downloads the output file after computation is done. However, I feel like it jumps to the next iteration of loop without completing all those tasks since some times the computation takes a long time. Does anyone have any suggestions? Thanks. This is the subroutine
sub microinspector {
my ($sequence, $folder) = #_;
print STDOUT "subroutine sequence: $sequence\n";
my $browser = WWW::Mechanize->new();
$browser->get("http://bioinfo.uni-plovdiv.bg/microinspector/");
$browser->form_number(1);
$browser->field("target_sequence", $sequence);
$browser->select("Choose an organism : ", "Mus musculus");
$browser->submit();
#print $browser->content();
my #links = $browser->links();
chdir($folder) or die "Cannot chdir to $folder";
foreach my $link (#links) {
#print $link->url();
if( $link->url() =~ /csv$/i ){
my $result = $browser->get( $link->url() );
my $filename = ( $link->url() =~ /\/([^\/]+)$/ )[0];
print "Saving $filename\n";
open( OUT, ">$filename" );
print OUT $result->content();
close( OUT );
}
}
}
WWW::Mechanize can take an optional timeout parameter (specified in seconds) in its constructor (which is passed to its parent class LWP::UserAgent in this case). I think the default is like 180 seconds.
Try increasing it, like:
my $browser = WWW::Mechanize->new(
timeout => 60 * 10, # 10 minutes
);
See the LWP::UserAgent docs on the timeout method for the specific semantics of how this is treated. It's mostly as you expect, but just in case.

What is the easiest way in pure Perl to stream from another HTTP resource?

What is the easiest way (without opening a shell to curl and reading from stdin) in Perl to stream from another HTTP resource? I'm assuming here that the HTTP resource I'm reading from is a potentially infinite stream (or just really, really long)
Good old LWP allows you to process the result as a stream.
E.g., here's a callback to yourFunc, reading/passing byte_count bytes to each call to yourFunc (you can drop that param if you don't care how large the data is to each call, and just want to process the stream as fast as possible):
use LWP;
...
$browser = LWP::UserAgent->new();
$response = $browser->get($url,
':content_cb' => \&yourFunc,
':read_size_hint' => byte_count,);
...
sub yourFunc {
my($data, $response) = #_;
# do your magic with $data
# $respose will be a response object created once/if get() returns
}
HTTP::Lite's request method allows you to specify a callback.
The $data_callback parameter, if used, is a way to filter the data as it is received or to handle large transfers. It must be a function reference, and will be passed: a reference to the instance of the http request making the callback, a reference to the current block of data about to be added to the body, and the $cbargs parameter (which may be anything). It must return either a reference to the data to add to the body of the document, or undef.
However, looking at the source, there seems to be a bug in sub request in that it seems to ignore the passed callback. It seems safer to use set_callback:
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Lite;
my $http = HTTP::Lite->new;
$http->set_callback(\&process_http_stream);
$http->http11_mode(1);
$http->request('http://www.example.com/');
sub process_http_stream {
my ($self, $phase, $dataref, $cbargs) = #_;
warn $phase, "\n";
return;
}
Output:
C:\Temp> ht
connect
content-length
done-headers
content
content-done
data
done
It looks like a callback passed to the request method is treated differently:
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Lite;
my $http = HTTP::Lite->new;
$http->http11_mode(1);
my $count = 0;
$http->request('http://www.example.com/',
\&process_http_stream,
\$count,
);
sub process_http_stream {
my ($self, $data, $times) = #_;
++$$times;
print "$$times====\n$$data\n===\n";
}
Wait, I don't understand. Why are you ruling out a separate process? This:
open my $stream, "-|", "curl $url" or die;
while(<$stream>) { ... }
sure looks like the "easiest way" to me. It's certainly easier than the other suggestions here...
Event::Lib will give you an easy interface to the fastest asynchronous IO method for your platform.
IO::Lambda is also quite nice for creating fast, responsive, IO applications.
Here is a version I ended up using via Net::HTTP
This is basically a copy of the example from the Net::HTTP man page / perl doc
use Net::HTTP;
my $s = Net::HTTP->new(Host => "www.example.com") || die $#;
$s->write_request(GET => "/somestreamingdatasource.mp3");
my ($code, $mess, %h) = $s->read_response_headers;
while (1) {
my $buf;
my $n = $s->read_entity_body($buf, 4096);
die "read failed: $!" unless defined $n;
last unless $n;
print STDERR "got $n bytes\n";
print STDOUT $buf;
}