perl html parsing lib/tool

Are there any powerful tools/libs for Perl like BeautifulSoup for Python?
Thanks

HTML::TreeBuilder::XPath is a decent solution for most problems.
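For a quick feel of it, here is a minimal sketch (the inline HTML and the XPath expression are made up for illustration):
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# new_from_content parses a string; it copes with broken HTML too.
my $tree = HTML::TreeBuilder::XPath->new_from_content(
    '<html><body><p class="x">hello</p></body></html>'
);

# Query the tree with XPath, just as you would an XML document.
print $_->as_text, "\n" for $tree->findnodes('//p[@class="x"]');

$tree->delete;    # HTML::TreeBuilder trees should be freed explicitly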

I have never used BeautifulSoup, but from a quick skim over its documentation you might want HTML::TreeBuilder. It can process even broken documents well and lets you traverse the parsed tree or query items - see the look_down method in HTML::Element.
If you like/know XPath, see daxim's recommendation. If you prefer to pick items via CSS selectors, have a look at Web::Scraper or Mojo::DOM.
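For instance, a rough sketch of the look_down style of querying (the markup here is made up for illustration):
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new_from_content(
    '<div class="item"><a href="/a">A</a><a name="x">no href</a></div>'
);

# look_down takes attribute => value pairs and/or coderefs that are
# called with each candidate HTML::Element.
my @links = $tree->look_down(
    _tag => 'a',
    sub { defined $_[0]->attr('href') },
);
print $_->attr('href'), "\n" for @links;

$tree->delete;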

As you're looking for power, you can use XML::LibXML to parse HTML. The advantage is that you then have all the power of the fastest and best XML toolchain (except MSXML, which is MS-only) available to Perl to process your document, including XPath and XSLT (which would require a re-parse if you used a parser other than XML::LibXML).
use strict;
use warnings;
use XML::LibXML;

# In 1.70, the recover and suppress_warnings options won't shut up the
# warnings. Hence, a workaround is needed to keep the messages away from
# the screen.
sub shutup_stderr {
    my ($subref, $bufref) = @_;
    open my $fhbuf, '>', $bufref;
    local *STDERR = $fhbuf;
    $subref->();    # execute code that needs to be shut up
    return;
}

# ==== main ============================================================

my $url    = shift || 'http://www.google.de';
my $parser = XML::LibXML->new( recover => 2 );    # suppress_warnings => 1
# Note that "recover" and "suppress_warnings" might not work - see above.
# https://rt.cpan.org/Public/Bug/Display.html?id=58024

my $dom;    # receives the document
shutup_stderr
    sub { $dom = $parser->load_html( location => $url ) },    # code
    \my $errmsg;                                              # buffer

# Now process the document as XML.
my @nodes = $dom->getElementsByLocalName( 'title' );
printf "Document title: %s\n", $_->textContent for @nodes;
printf "Length of error messages: %u\n", length $errmsg;
print '-' x 72, "\n";
print $dom->toString( 1 );

Related

Find Favicons in HTML using Perl

I'm trying to look for favicons (and variants) for a given URL using Perl (I'd like to avoid using an external service such as Google's favicon finder). There's a CPAN module, WWW::Favicon, but it hasn't been updated in over a decade -- a decade in which now-important variants such as "apple-touch-icon" have come to replace the venerable "ico" file.
I thought I found the solution in WWW::Mechanize, since it can list all of the links in a given URL, including <link> header tags. However, I cannot seem to find a clean way to use the "find_link" method to search for the rel attribute.
For example, I tried using 'rel' as the search term, hoping maybe it was in there despite not being mentioned in the documentation, but it doesn't work. This code returns an error about an invalid "link-finding parameter."
my $results = $mech->find_link( 'rel' => "apple-touch-icon" );
use Data::Dumper;
say STDERR Dumper $results;
I also tried using other link-finding parameters, but none of them seem to be suited to searching out a rel attribute.
The only way I could figure out how to do it is by iterating through all links and looking for a rel attribute like this:
my $results = $mech->find_all_links();
foreach my $result (@{ $results }) {
    my $attrs = $result->attrs();
    # 'tag' => "apple-touch-icon"
    foreach my $attr (sort keys %{ $attrs }) {
        if ($attrs->{'rel'} =~ /^apple-touch-icon.*$/) {
            say STDERR "I found it: " . $result->url();
        }
        # Add tests for other types of icons here,
        # e.g. "mask-icon" and "shortcut icon".
    }
}
That works, but it seems messy. Is there a better way?
Here's how I'd do it with Mojo::DOM. Once you fetch an HTML page, use dom to do all the parsing. From that, use a CSS selector to find the interesting nodes:
link[rel*=icon i][href]
This CSS selector looks for link tags that have both the rel and href attributes. Additionally, I require that the value in rel contain (*=) "icon", case-insensitively (the i). If you want to assume that all nodes will have the href, just leave off [href].
Once I have the list of links, I extract just the value in href and turn that list into an array reference (although I could do the rest with Mojo::Collection methods):
use v5.10;
use Mojo::UserAgent;

my $ua = Mojo::UserAgent->new->max_redirects(3);

my $results = $ua->get( shift )
    ->result
    ->dom
    ->find( 'link[rel*=icon i][href]' )
    ->map( attr => 'href' )
    ->to_array
    ;

say join "\n", @$results;
That works pretty well so far:
$ perl mojo.pl https://www.perl.org
https://cdn.perl.org/perlweb/favicon.ico
$ perl mojo.pl https://www.microsoft.com
https://c.s-microsoft.com/favicon.ico?v2
$ perl mojo.pl https://leanpub.com/mojo_web_clients
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-57x57-b83f183ad6b00aa74d8e692126c7017e.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-60x60-6dc1c10b7145a2f1156af5b798565268.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-72x72-5037b667b6f7a8d5ba8c4ffb4a62ec2d.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-76x76-57860ca8a817754d2861e8d0ef943b23.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-114x114-27f9c42684f2a77945643b35b28df6e3.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-120x120-3819f03d1bad1584719af0212396a6fc.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-144x144-a79479b4595dc7ca2f3e6f5b962d16fd.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-152x152-aafe015ef1c22234133158a89b29daf5.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-16x16-c1207cd2f3a20fd50de0e585b4b307a3.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-32x32-e9b1d6ef3d96ed8918c54316cdea011f.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-96x96-842fcd3e7786576fc20d38bbf94837fc.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-128x128-e97066b91cc21b104c63bc7530ff819f.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-196x196-b8cab44cf725c4fa0aafdbd237cdc4ed.png
Now, the problem comes if you find more interesting cases that you can't easily write a selector for. Suppose not all of the rel values have "icon" in them. You can get a little more fancy by specifying multiple selectors separated by commas so you don't have to use the experimental case insensitivity flag:
link[rel*=icon][href], link[rel*=ICON][href]
or different values in rel:
link[rel="shortcut icon"][href], link[rel="apple-touch-icon-precomposed"][href]
Line up as many of those as you like.
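For example, lining up several common rel values (these particular values are just illustrative) gives:
link[rel="icon"][href], link[rel="shortcut icon"][href], link[rel="apple-touch-icon"][href], link[rel="apple-touch-icon-precomposed"][href]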
But, you could also filter your results without the selectors. Use Mojo::Collection's grep to pick out the nodes that you want:
my %Interesting = ...;

my $results = $ua->get( shift )
    ->result
    ->dom
    ->find( '...' )
    ->grep( sub { exists $Interesting{ $_->attr('rel') } } )
    ->map( attr => 'href' )
    ->to_array
    ;
I have a lot more examples of Mojo::DOM in Mojo Web Clients, and I think I'll go add this example now.
The problem is very easy to solve:
- use any module that can load a webpage
- define a $regex covering all possible favicon variations
- look for <link rel="$regex" href="icon_address" ...>
Note:
The script has a default YouTube URL embedded in the code.
use strict;
use warnings;
use feature 'say';
use HTTP::Tiny;

my $url   = shift || 'https://www.youtube.com/';
my $icons = get_favicon($url);

say for @{$icons};

sub get_favicon {
    my $url = shift;

    my @lookup = (
        'shortcut icon',
        'apple-touch-icon',
        'image_src',
        'icon',
        'alternative icon'
    );
    my $re = join('|', @lookup);

    my $html  = load_page($url);
    my @icons = ($html =~ /<link rel="(?:$re)" href="(.*?)"/gmsi);

    return \@icons;
}

sub load_page {
    my $url = shift;

    my $response = HTTP::Tiny->new->get($url);
    my $html;

    if ($response->{success}) {
        $html = $response->{content};
    } else {
        say 'ERROR: Could not extract webpage';
        say 'Status: ' . $response->{status};
        say 'Reason: ' . $response->{reason};
        exit;
    }

    return $html;
}
Run as script.pl
https://www.youtube.com/s/desktop/8259e7c9/img/favicon.ico
https://www.youtube.com/s/desktop/8259e7c9/img/favicon_32.png
https://www.youtube.com/s/desktop/8259e7c9/img/favicon_48.png
https://www.youtube.com/s/desktop/8259e7c9/img/favicon_96.png
https://www.youtube.com/s/desktop/8259e7c9/img/favicon_144.png
https://www.youtube.com/img/desktop/yt_1200.png
Run as script.pl "http://www.microsoft.com/"
https://c.s-microsoft.com/favicon.ico?v2
Run as script.pl "http://finance.yahoo.com/"
https://s.yimg.com/cv/apiv2/default/icons/favicon_y19_32x32_custom.svg

How to read from URL line-wise?

I'm looking for the "moral equivalent" of the (fictitious) openremote below:
my $handle = openremote( 'http://some.domain.org/huge.tsv' ) or die $!;
while ( <$handle> ) {
    chomp;
    # etc.
    # do stuff with $_
}
close $handle;
In other words, I'm looking for a way to open a read handle to a remote file so that I can read from it line by line. (Typically this file will be larger than I want to read entirely into memory. This means that solutions based on stuffing the value returned by LWP::Simple::get (for example) into an IO::String are not suitable.)
I'm sure this is really basic stuff, but I have not been able to find it after a lot of searching.
Here's a "solution" much like the other responses, but it cheats a bit by using IO::All.
use IO::All;

my $http_io = io->http("http://some.domain.org/huge.tsv");
while (my $line = $http_io->getline || $http_io->getline) {
    print $line;
}
After you have an object with io->http you can use IO methods to look at it (like getline() etc.).
Cheers.
You can use LWP::UserAgent's parameter :content_file => $filename to save the big file to disk directly, without filling the memory with it, and then you can read that file in your program.
$ua->get( $url, ':content_file' => $filename );
Or you can use the parameter :content_cb => \&callback and in the callback subroutine you can process the data chunk by chunk as it is downloaded. This is probably the way you need.
$ua->get( $url, ':content_cb' => \&callback );
sub callback {
    my ( $chunk, $response, $protocol ) = @_;
    # Do whatever you like with $chunk
}
Read (a little) more about this with perldoc LWP::UserAgent.
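Since :content_cb hands you arbitrary-sized chunks rather than lines, line-wise processing still needs a small buffer; here is a minimal sketch of that idea (the URL and the handle_line sub are placeholders):
use strict;
use warnings;
use LWP::UserAgent;

my $ua  = LWP::UserAgent->new;
my $buf = '';

$ua->get(
    'http://some.domain.org/huge.tsv',
    ':content_cb' => sub {
        my ($chunk, $response, $protocol) = @_;
        $buf .= $chunk;
        # Hand every complete line to the handler; keep the partial
        # tail of the buffer for the next chunk.
        while ($buf =~ s/^([^\n]*)\n//) {
            handle_line($1);
        }
    },
);
handle_line($buf) if length $buf;    # a final line without a newline

sub handle_line {
    my ($line) = @_;
    # do stuff with $line
}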
Use LWP::Simple coupled with IO::String like so:
#!/usr/bin/env perl
use strict;
use warnings;
use LWP::Simple;
use IO::String;
my $handle = IO::String->new(get("http://stackoverflow.com"));
while (defined (my $line = <$handle>)) {
    print $line;
}
close $handle;
Hope it works for you.
Paul

Retrieve a value from object in perl

I am getting
$VAR1 = bless( \*{'Fh::fh00001Screenshot.png'}, 'Fh' );
in a variable. But I need to retrieve fh00001Screenshot.png from it. How can I get it?
The Fh package is used internally by the CGI module to handle temporary files used for building multipart data. You shouldn't be using it directly.
Check carefully to make sure there is no better way before using this code, which comes from the CGI code for Fh::asString:
(my $name = $$VAR1) =~ s/^\*(\w+::fh\d{5})+//;
print $name;
output
Screenshot.png
Update
Rather than picking bits out of the CGI code, it looks like this package - which should really be a private one - is accessible from calling code. Just use $var->asString instead, like this:
use strict;
use warnings;
use CGI;
my $var = do {
    no strict 'refs';
    bless( \*{'Fh::fh00001Screenshot.png'}, 'Fh' );
};
print $var->asString;

using Perl to scrape a website

I am interested in writing a perl script that goes to the following link and extracts the number 1975: https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3ACalifornia%20%2Bevent_place_level_2%3A%22San%20Diego%22%20%2Bbirth_year%3A1923-1923~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219
That page shows the number of white men born in 1923 who lived in San Diego County, California, in 1940. I am trying to do this in a loop structure to generalize over multiple counties and birth years.
In the file, locations.txt, I put the list of counties, such as San Diego County.
The current code runs, but instead of the number 1975 it displays "unknown". The number 1975 should end up in $val.
I would very much appreciate any help!
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;

open(L, "locations26.txt") or die "Can't open locations: $!";

my $url = 'https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3A%22California%22%20%2Bevent_place_level_2%3A%22%LOCATION%%22%20%2Bbirth_year%3A%YEAR%-%YEAR%~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219';

open(O, ">out26.txt") or die "Can't open output file: $!";

my $oldh = select(O);
$| = 1;
select($oldh);

while (my $location = <L>) {
    chomp($location);
    $location =~ s/ /+/g;
    foreach my $year (1923..1923) {
        my $u = $url;
        $u =~ s/%LOCATION%/$location/;
        $u =~ s/%YEAR%/$year/g;    # /g: %YEAR% appears twice in the URL
        #print "$u\n";
        my $content = get($u);
        my $val = 'unknown';
        if ($content =~ / of .strong.([0-9,]+)..strong. /) {
            $val = $1;
        }
        $val =~ s/,//g;
        $location =~ s/\+/ /g;
        print "'$location',$year,$val\n";
        print O "'$location',$year,$val\n";
    }
}
Update: The API is not a viable solution. I have been in contact with the site developer. The API does not apply to that part of the webpage. Hence, any solution pertaining to JSON will not be applicable.
It would appear that your data is generated by Javascript and thus LWP cannot help you. That said, it seems that the site you are interested in has a developer API: https://familysearch.org/developers/
I recommend using Mojo::URL to construct your query and either Mojo::DOM or Mojo::JSON to parse XML or JSON results respectively. Of course other modules will work too, but these tools are very nicely integrated and let you get started quickly.
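As a minimal sketch of the Mojo::URL part (the parameter names are copied from the URL in the question; treat the exact query the API expects as an assumption):
use strict;
use warnings;
use Mojo::URL;

# Mojo::URL handles the URL-encoding of parameter values for you.
my $url = Mojo::URL->new('https://familysearch.org/search/collection/results');
$url->query(
    count         => 20,
    collection_id => 2000219,
);
print "$url\n";    # ...results?count=20&collection_id=2000219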
You could use WWW::Mechanize::Firefox to process any site that can be loaded by Firefox.
http://metacpan.org/pod/WWW::Mechanize::Firefox::Examples
You have to install the MozRepl plugin, and then you will be able to process the web page content via this module. Basically you will "remotely control" the browser.
Here is an example (maybe working)
use strict;
use warnings;
use WWW::Mechanize::Firefox;

my $mech = WWW::Mechanize::Firefox->new(
    activate => 1,    # bring the tab to the foreground
);
$mech->get('https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3ACalifornia%20%2Bevent_place_level_2%3A%22San%20Diego%22%20%2Bbirth_year%3A1923-1923~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219',
    ':content_file' => 'main.html');

my $retries = 10;
while ($retries-- and ! $mech->is_visible( xpath => '//*[@class="form-submit"]' )) {
    print "Sleep until we find the thing\n";
    sleep 2;
};
die "Timeout" if 0 > $retries;

# fill out the search form
my @forms = $mech->forms();
#<input id="census_bp" name="birth_place" type="text" tabindex="0"/>
# A selector prefixed with '#' must match the id attribute of the input.
# A selector prefixed with '.' matches the class attribute. A selector
# prefixed with '^' or with no prefix matches the name attribute.
$mech->field( birth_place => 'value_for_birth_place' );

# Click on the submit
$mech->click({xpath => '//*[@class="form-submit"]'});
If you use your browser's development tools, you can clearly see the JSON request that the page you link to uses to get the data you're looking for.
This program should do what you want. I've added a bunch of comments for readability and explanation, as well as made a few other changes.
use warnings;
use strict;
use LWP::UserAgent;
use JSON;
use CGI qw/escape/;

# Create an LWP User-Agent object for sending HTTP requests.
my $ua = LWP::UserAgent->new;

# Open data files
open(L, 'locations26.txt') or die "Can't open locations: $!";
open(O, '>', 'out26.txt') or die "Can't open output file: $!";

# Enable autoflush on the output file handle
my $oldh = select(O);
$| = 1;
select($oldh);

while (my $location = <L>) {
    # This regular expression is like chomp, but removes both Windows and
    # *nix line-endings, regardless of the system the script is running on.
    $location =~ s/[\r\n]//g;

    foreach my $year (1923..1923) {
        # If you need to add quotes around the location, use "\"$location\"".
        my %args = (LOCATION => $location, YEAR => $year);

        my $url = 'https://familysearch.org/proxy?uri=https%3A%2F%2Ffamilysearch.org%2Fsearch%2Frecords%3Fcount%3D20%26query%3D%252Bevent_place_level_1%253ACalifornia%2520%252Bevent_place_level_2%253A^LOCATION^%2520%252Bbirth_year%253A^YEAR^-^YEAR^~%2520%252Bgender%253AM%2520%252Brace%253AWhite%26collection_id%3D2000219';

        # Note that values need to be doubly-escaped because of the
        # weird way their website is set up (the "/proxy" URL we're
        # requesting is subsequently loading some *other* URL which
        # is provided to "/proxy" as a URL-encoded URL).
        #
        # This regular expression replaces any ^WHATEVER^ in the URL
        # with the double-URL-encoded value of WHATEVER in %args.
        # The /e flag causes the replacement to be evaluated as Perl
        # code. This way I can look data up in a hash and do URL-encoding
        # as part of the regular expression without an extra step.
        $url =~ s/\^([A-Z]+)\^/escape(escape($args{$1}))/ge;

        #print "$url\n";

        # Create an HTTP request object for this URL.
        my $request = HTTP::Request->new(GET => $url);

        # This HTTP header is required. The server outputs garbage if
        # it's not present.
        $request->push_header('Content-Type' => 'application/json');

        # Send the request and check for an error from the server.
        my $response = $ua->request($request);
        die "Error " . $response->code if !$response->is_success;

        # The response should be JSON.
        my $obj = from_json($response->content);

        my $str = "$args{LOCATION},$args{YEAR},$obj->{totalHits}\n";
        print O $str;
        print $str;
    }
}
What about this simple script, without Firefox? I investigated the site a bit to understand how it works, and I saw some JSON requests with the Firebug Firefox addon, so I know which URL to query to get the relevant stuff. Here is the code:
use strict;
use warnings;

use JSON::XS;
use LWP::UserAgent;
use HTTP::Request;

my $ua = LWP::UserAgent->new();

open my $fh, '<', 'locations2.txt' or die $!;
open my $fh2, '>>', 'out2.txt' or die $!;

# iterate over the locations from the locations2.txt file
while (my $place = <$fh>) {
    # remove the line ending
    chomp $place;
    # iterate over the years
    foreach my $year (1923..1925) {
        # build the URL from the variables
        my $url = "https://familysearch.org/proxy?uri=https%3A%2F%2Ffamilysearch.org%2Fsearch%2Frecords%3Fcount%3D20%26query%3D%252Bevent_place_level_1%253ACalifornia%2520%252Bevent_place_level_2%253A%2522$place%2522%2520%252Bbirth_year%253A$year-$year~%2520%252Bgender%253AM%2520%252Brace%253AWhite%26collection_id%3D2000219";
        my $request = HTTP::Request->new(GET => $url);
        # fake the referer (where we come from)
        $request->header('Referer', 'https://familysearch.org/search/collection/results');
        # set the expected format header for the response as JSON
        $request->header('content_type', 'application/json');
        my $response = $ua->request($request);
        if ($response->code == 200) {
            # this line converts the JSON to a Perl HASH
            my $hash = decode_json $response->content;
            my $val = $hash->{totalHits};
            print $fh2 "year $year, place $place : $val\n";
        }
        else {
            die $response->status_line;
        }
    }
}

END { close $fh; close $fh2; }
This seems to do what you need. Instead of waiting for the disappearance of the hourglass it waits - more obviously I think - for the appearance of the text node you're interested in.
use 5.010;
use warnings;
use WWW::Mechanize::Firefox;

STDOUT->autoflush;

my $url = 'https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3ACalifornia%20%2Bevent_place_level_2%3A%22San%20Diego%22%20%2Bbirth_year%3A1923-1923~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219';

my $mech = WWW::Mechanize::Firefox->new(tab => qr/FamilySearch\.org/, create => 1, activate => 1);
$mech->autoclose_tab(0);
$mech->get('about:blank');
$mech->get($url);

my $text;
while () {
    sleep 1;
    $text = $mech->xpath('//p[@class="num-search-results"]/text()', maybe => 1);
    last if defined $text;
}

my $results = $text->{nodeValue};
say $results;
if ($results =~ /([\d,]+)\s+results/) {
    (my $n = $1) =~ tr/,//d;
    say $n;
}
output
1-20 of 1,975 results
1975
Update
This update is with special thanks to @nandhp, who inspired me to look at the underlying data server that produces the data in JSON format.
Rather than making a request via the superfluous https://familysearch.org/proxy, this code accesses the server directly at https://familysearch.org/search/records, decodes the JSON, and digs the required data out of the resulting structure. This has the advantage of both speed (the requests are served about once a second - more than ten times faster than the equivalent request via the basic web site) and stability (as you note, the site is very flaky - in contrast, I have never seen an error using this method).
use strict;
use warnings;

use LWP::UserAgent;
use URI;
use JSON;
use autodie;

STDOUT->autoflush;

open my $fh, '<', 'locations26.txt';
my @locations = <$fh>;
chomp @locations;

open my $outfh, '>', 'out26.txt';

my $ua = LWP::UserAgent->new;

for my $county (@locations[36, 0..2]) {
    for my $year (1923 .. 1926) {
        my $total = familysearch_info($county, $year);
        print STDOUT "$county,$year,$total\n";
        print $outfh "$county,$year,$total\n";
    }
    print "\n";
}

sub familysearch_info {
    my ($county, $year) = @_;

    my $query = join ' ', (
        '+event_place_level_1:California',
        sprintf('+event_place_level_2:"%s"', $county),
        sprintf('+birth_year:%1$d-%1$d~', $year),
        '+gender:M',
        '+race:White',
    );

    my $url = URI->new('https://familysearch.org/search/records');
    $url->query_form(
        collection_id => 2000219,
        count         => 20,
        query         => $query);

    my $resp = $ua->get($url, 'Content-Type' => 'application/json');
    my $data = decode_json($resp->decoded_content);

    return $data->{totalHits};
}
output
San Diego,1923,1975
San Diego,1924,2004
San Diego,1925,1871
San Diego,1926,1908
Alameda,1923,3577
Alameda,1924,3617
Alameda,1925,3567
Alameda,1926,3464
Alpine,1923,1
Alpine,1924,2
Alpine,1925,0
Alpine,1926,1
Amador,1923,222
Amador,1924,248
Amador,1925,134
Amador,1926,67
I do not know how to post revised code from the solution above.
This code does not (yet) compile correctly. However, I have made some essential updates to head definitely in that direction.
I would very much appreciate help on this updated code. I do not know how to post this code and this follow-up such that it appeases the lords who run this site.
It gets stuck at the sleep line. Any advice on how to proceed past it would be much appreciated!
use strict;
use warnings;
use WWW::Mechanize::Firefox;

my $mech = WWW::Mechanize::Firefox->new(
    activate => 1,    # bring the tab to the foreground
);
$mech->get('https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3ACalifornia%20%2Bevent_place_level_2%3A%22San%20Diego%22%20%2Bbirth_year%3A1923-1923~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219',
    ':content_file' => 'main.html', synchronize => 0);

my $retries = 10;
while ($retries-- and $mech->is_visible( xpath => '//*[@id="hourglass"]' )) {
    print "Sleep until we find the thing\n";
    sleep 2;
};
die "Timeout while waiting for application" if 0 > $retries;
# Now the hourglass is not visible anymore

# fill out the search form
my @forms = $mech->forms();
#<input id="census_bp" name="birth_place" type="text" tabindex="0"/>
# A selector prefixed with '#' must match the id attribute of the input.
# A selector prefixed with '.' matches the class attribute. A selector
# prefixed with '^' or with no prefix matches the name attribute.
$mech->field( birth_place => 'value_for_birth_place' );

# Click on the submit
$mech->click({xpath => '//*[@class="form-submit"]'});
You should set the current form before accessing a field:
"Given the name of a field, set its value to the value specified. This applies to the current form (as set by the "form_name()" or "form_number()" method or defaulting to the first form on the page)."
$mech->form_name( 'census-search' );
$mech->field( birth_place => 'value_for_birth_place' );
Sorry, I am not able to try this code out, and thanks for opening a new question.

What is the easiest way in pure Perl to stream from another HTTP resource?

What is the easiest way (without opening a shell to curl and reading from stdin) in Perl to stream from another HTTP resource? I'm assuming here that the HTTP resource I'm reading from is a potentially infinite stream (or just really, really long).
Good old LWP allows you to process the result as a stream.
E.g., here's a callback to yourFunc, reading/passing byte_count bytes to each call of yourFunc (you can drop that param if you don't care how large the data passed to each call is, and just want to process the stream as fast as possible):
use LWP;
...
$browser  = LWP::UserAgent->new();
$response = $browser->get($url,
    ':content_cb'     => \&yourFunc,
    ':read_size_hint' => byte_count,);
...
sub yourFunc {
    my ($data, $response) = @_;
    # do your magic with $data
    # $response will be a response object created once/if get() returns
}
HTTP::Lite's request method allows you to specify a callback.
The $data_callback parameter, if used, is a way to filter the data as it is received or to handle large transfers. It must be a function reference, and will be passed: a reference to the instance of the http request making the callback, a reference to the current block of data about to be added to the body, and the $cbargs parameter (which may be anything). It must return either a reference to the data to add to the body of the document, or undef.
However, looking at the source, there seems to be a bug in sub request: it appears to ignore the passed callback. It seems safer to use set_callback:
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Lite;

my $http = HTTP::Lite->new;
$http->set_callback(\&process_http_stream);
$http->http11_mode(1);
$http->request('http://www.example.com/');

sub process_http_stream {
    my ($self, $phase, $dataref, $cbargs) = @_;
    warn $phase, "\n";
    return;
}
Output:
C:\Temp> ht
connect
content-length
done-headers
content
content-done
data
done
It looks like a callback passed to the request method is treated differently:
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Lite;

my $http = HTTP::Lite->new;
$http->http11_mode(1);

my $count = 0;
$http->request('http://www.example.com/',
    \&process_http_stream,
    \$count,
);

sub process_http_stream {
    my ($self, $data, $times) = @_;
    ++$$times;
    print "$$times====\n$$data\n===\n";
}
Wait, I don't understand. Why are you ruling out a separate process? This:
open my $stream, "-|", "curl $url" or die;
while (<$stream>) { ... }
sure looks like the "easiest way" to me. It's certainly easier than the other suggestions here...
Event::Lib will give you an easy interface to the fastest asynchronous IO method for your platform.
IO::Lambda is also quite nice for creating fast, responsive, IO applications.
Here is a version I ended up using via Net::HTTP
This is basically a copy of the example from the Net::HTTP man page / perl doc
use Net::HTTP;

my $s = Net::HTTP->new(Host => "www.example.com") || die $@;
$s->write_request(GET => "/somestreamingdatasource.mp3");
my ($code, $mess, %h) = $s->read_response_headers;

while (1) {
    my $buf;
    my $n = $s->read_entity_body($buf, 4096);
    die "read failed: $!" unless defined $n;
    last unless $n;
    print STDERR "got $n bytes\n";
    print STDOUT $buf;
}