How do I handle and send POST requests in Perl and FastCGI? - perl

Unfortunately, I'm not very familiar with Perl, so I'm asking here. I'm using FCGI with Perl.
I need to 1. accept a POST request -> 2. send it via POST to another URL -> 3. get the results -> 4. return the results to the first POST request (4 steps).
To accept a POST request (step 1) I use the following code (found somewhere on the Internet):
$ENV{'REQUEST_METHOD'} =~ tr/a-z/A-Z/;
if ($ENV{'REQUEST_METHOD'} eq "POST") {
    read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'});
}
else {
    print("some error");
}
@pairs = split(/&/, $buffer);
foreach $pair (@pairs) {
    ($name, $value) = split(/=/, $pair);
    $value =~ tr/+/ /;
    $value =~ s/%(..)/pack("C", hex($1))/eg;
    $FORM{$name} = $value;
}
The content of $name (it's a string) is the result of the first step. Now I need to send $name via a POST request to some_url (step 2), get another result back (step 3), and return that result as the response to the very first POST request (step 4).
Any help with this would be greatly appreciated.
Thank you.

To accept the POST, you can use the hand-rolled code you've shown, but a much better way is to use CGI (which is now a core module, so it should be in your Perl distribution). For passing the POST on to somewhere else, you can use LWP::UserAgent:
#!/usr/bin/perl
use strict;
use warnings;
use CGI;
use LWP::UserAgent;

my $cgi = CGI->new; # Will process the POST upon instantiation
my %params = $cgi->Vars;

my $ua = LWP::UserAgent->new;
my $postTo = 'http://www.somewhere.com/path/to/script';
my $response = $ua->post($postTo, \%params); # pass the form data on as a hash reference

print $cgi->header; # answer the original request
if ($response->is_success) {
    print $response->decoded_content; # or maybe $response->content in your case
} else {
    die $response->status_line;
}
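Since the question mentions FCGI specifically, here is a minimal sketch (untested, with a placeholder target URL) of the same four steps inside a CGI::Fast accept loop, which is the usual way to run CGI.pm-style code under FastCGI:
#!/usr/bin/perl
use strict;
use warnings;
use CGI::Fast;
use LWP::UserAgent;

my $ua      = LWP::UserAgent->new;
my $post_to = 'http://www.somewhere.com/path/to/script';  # placeholder URL

# CGI::Fast->new blocks until the next FastCGI request arrives.
while (my $cgi = CGI::Fast->new) {
    my %params = $cgi->Vars;                       # step 1: read the incoming POST

    my $response = $ua->post($post_to, \%params);  # steps 2 and 3: forward it and wait

    print $cgi->header;                            # step 4: answer the original request
    if ($response->is_success) {
        print $response->decoded_content;
    }
    else {
        print 'upstream request failed: ' . $response->status_line;
    }
}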

I highly recommend that you do not try to solve this problem yourself, but instead use existing libraries to make your life MUCH easier. The best part of Perl is the vast collection of existing libraries. See http://search.cpan.org/
Good starting places include CGI.pm or a web framework like Catalyst.
The code you've quoted is very buggy. Coincidentally, there was just a post by a popular Perl blogger dissecting this exact code.

Related

Web-crawler optimization

I am building a basic search engine using the vector-space model, and this is the crawler for returning 500 URLs and removing the SGML tags from the content. However, it is very slow (it takes more than 30 minutes just to retrieve the URLs). How can I optimize the code? I have used wikipedia.org as an example starting URL.
use warnings;
use LWP::Simple;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;
use HTML::LinkExtor;

my $starting_url = 'http://en.wikipedia.org/wiki/Main_Page';
my @urls = $starting_url;
my %alreadyvisited;

my $browser = LWP::UserAgent->new();
$browser->timeout(5);

my $url_count = 0;

while (@urls)
{
    my $url = shift @urls;
    next if $alreadyvisited{$url}; ## check if already visited

    my $request = HTTP::Request->new(GET => $url);
    my $response = $browser->request($request);

    if ($response->is_error())
    {
        print $response->status_line, "\n"; ## check for bad URL
    }

    my $contents = $response->content(); ## get contents from URL
    push @c, $contents;
    my @text = &RemoveSGMLtags(\@c);
    #print "@text\n";

    $alreadyvisited{$url} = 1; ## store URL in hash for future reference
    $url_count++;
    print "$url\n";

    if ($url_count == 500) ## exit if number of crawled pages exceed limit
    {
        exit 0;
    }

    my ($page_parser) = HTML::LinkExtor->new(undef, $url);
    $page_parser->parse($contents)->eof; ## parse page contents
    my @links = $page_parser->links;

    foreach my $link (@links)
    {
        $test = $$link[2];
        $test =~ s!^https?://(?:www\.)?!!i;
        $test =~ s!/.*!!;
        $test =~ s/[\?\#\:].*//;
        if ($test eq "en.wikipedia.org") ## check if URL belongs to unt domain
        {
            next if ($$link[2] =~ m/^mailto/);
            next if ($$link[2] =~ m/s?html?|xml|asp|pl|css|jpg|gif|pdf|png|jpeg/);
            push @urls, $$link[2];
        }
    }
    sleep 1;
}

sub RemoveSGMLtags
{
    my ($input) = @_;
    my @INPUTFILEcontent = @$input;
    my $j; my @raw_text;

    for ($j=0; $j<$#INPUTFILEcontent; $j++)
    {
        my $INPUTFILEvalue = $INPUTFILEcontent[$j];
        use HTML::Parse;
        use HTML::FormatText;

        my $plain_text = HTML::FormatText->new->format(parse_html($INPUTFILEvalue));
        push @raw_text, ($plain_text);
    }
    return @raw_text;
}
Always use strict
Never use the ampersand & on subroutine calls
Use URI to manipulate URLs (there is a short sketch of this after the WWW::Mechanize example below)
You have a sleep 1 in there, which I assume is to avoid hammering the site too much, which is good. But the bottleneck in almost any web-based application is the internet itself, and you won't be able to make your program any faster without requesting more from the site. That means removing your sleep and perhaps making parallel requests to the server using, for instance, LWP::Parallel::RobotUA. Is that a route you should be taking?
Use WWW::Mechanize, which handles all the URL parsing and extraction for you. It is so much easier than all the link parsing you're dealing with. It was created specifically for the sort of thing you're doing, and since it's a subclass of LWP::UserAgent you should just be able to change every LWP::UserAgent to WWW::Mechanize without changing any other code. The link extraction then simply becomes this:
my $mech = WWW::Mechanize->new();
$mech->get( 'http://someurl.com' );
my @links = $mech->links;
and then @links is an array of WWW::Mechanize::Link objects.
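For the URI suggestion above, here is a minimal sketch (the link value is purely illustrative) of replacing the three regex substitutions in the crawler with URI's host accessor:
use strict;
use warnings;
use URI;

# Decide whether a link stays inside the crawl domain without regex surgery.
my $link = 'http://en.wikipedia.org/wiki/Perl?action=history';
my $uri  = URI->new($link);

# host() already ignores the scheme, path, query and fragment for us;
# some schemes (e.g. mailto:) have no host method, hence the can() check.
my $host = $uri->can('host') ? $uri->host : '';
$host =~ s/^www\.//i;

if ($host eq 'en.wikipedia.org') {
    print "crawl: $link\n";
}
else {
    print "skip: $link\n";
}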

mojolicious script works three times, then crashes

The following script should demonstrate a problem I'm facing with Mojolicious on OpenBSD 5.2 using mod_perl.
The script works fine 4 times when called as a CGI under mod_perl. Additional runs of the script result in Mojolicious not returning the asynchronous POSTs. The subs that are usually called when data arrives just don't seem to be called anymore. Running the script from the command line works fine, since Perl is then started completely from scratch and everything is reinitialized, which is not the case under mod_perl. Stopping and starting Apache reinitializes mod_perl, so the script can be run another 4 times.
I have only tested this on OpenBSD 5.2 using the Mojolicious version provided in OpenBSD's ports tree (2.76). That is rather old, I think, but it's what OpenBSD comes with.
Am I doing something completely wrong here? Or is it possible that Mojolicious has a circular reference or something else that causes this issue?
I have no influence on the platform (OpenBSD) being used, so please don't suggest "use Linux and install the latest Mojolicious version". However, if you are sure that running a later version of Mojolicious will solve the problem, I might get permission to install it (though I don't yet know how to do that).
Thanks in advance!
T.
Here's the script:
#!/usr/bin/perl
use diagnostics;
use warnings;
use strict;
use feature qw(switch);
use CGI qw/:param/;
use CGI qw/:url/;
use CGI::Carp qw(fatalsToBrowser warningsToBrowser);
use Mojo::IOLoop;
use Mojo::JSON;
use Mojo::UserAgent;
my ($activeconnections, $md5, $cgi);
my $ua = Mojo::UserAgent->new;
$ua->max_redirects(0)->connect_timeout(3)->request_timeout(6); # Timeout 6 seconds of which 3 may be connecting
my $delay = Mojo::IOLoop->delay();
sub online{
    my $url = "http://www.backgroundtask.eu/Systeemtaken/Search.php";
    $delay->begin;
    $activeconnections++;
    my $response_bt = $ua->post_form($url, { 'ex' => $md5 }, sub {
        my ($ua, $tx) = @_;
        my $content = $tx->res->body;
        $content =~ m/(http:\/\/www\.backgroundtask\.eu\/Systeemtaken\/taakinfo\/.*$md5\/)/;
        if ($1){
            print "getting $1\n";
            my $response_bt2 = $ua->get($1, sub {
                $delay->end();
                $activeconnections--;
                print "got result, ActiveConnections: $activeconnections\n";
                ($ua, $tx) = @_;
                my $filename = $tx->res->dom->find('table.view')->[0]->find('tr.even')->[2]->td->[1]->all_text;
                print "fn = " . $filename . "\n";
            }
            )
        } else {
            print "query did not return a result\n";
            $activeconnections--;
            $delay->end;
        }
    });
}
$cgi = new CGI;
print $cgi->header(-cache_control=>"no-cache, no-store, must-revalidate") . "\n";
$md5 = lc($cgi->param("md5") || ""); # read param
$md5 =~ s/[^a-f0-9]*//g if (length($md5) == 32); # custom input filter for md5 values only
if (length $md5 != 32) {
    $md5 = lc($ARGV[0]);
    $md5 =~ s/[^a-f0-9]*//g;
    die "invalid MD5 $md5\n" if (length $md5 ne 32);
}
online;
if ($activeconnections) {
    print "waiting..., activeconnections: $activeconnections\n" for $delay->wait;
}
print "all pending requests completed, activeconnections is " . $activeconnections . "\n";
print "script done.\n md5 was $md5\n";
exit 0;
Well, I hate to say it, but there's a lot wrong here. The most glaring problem is your use of ... for $delay->wait, which doesn't make much sense. Also, you are comparing numbers with ne rather than !=. Not my-ing the arguments in the deeper callback also seems problematic for async-style code.
Then there are some code smells, like regexing for URLs and closing over the $md5 variable unnecessarily.
Lastly, why use CGI.pm when Mojolicious can operate under CGI just fine? When you do that, the IOLoop is already running, so some things get easier. And yes, I understand that you are using the system-provided Mojolicious; however, I feel I should mention that the current version is 3.93 :-)
Anyway, here is an example which strips out a lot of things but should still do pretty much the same as yours. Of course I can't test it without a valid md5 for the site (and I also can't get rid of the URL regex without sample data).
#!/usr/bin/perl
use Mojolicious::Lite;
use Mojo::UserAgent;
my $ua = Mojo::UserAgent->new;
$ua->max_redirects(0)->connect_timeout(3)->request_timeout(6); # Timeout 6 seconds of which 3 may be connecting
any '/' => sub {
    my $self = shift;
    $self->res->headers->cache_control("no-cache, no-store, must-revalidate");

    my $md5 = lc($self->param("md5") || ""); # read param
    $md5 =~ s/[^a-f0-9]*//g if (length($md5) == 32); # custom input filter for md5 values only
    if (length $md5 != 32) {
        $md5 = lc($ARGV[0]);
        $md5 =~ s/[^a-f0-9]*//g;
        die "invalid MD5 $md5\n" if (length $md5 != 32);
    }

    $self->render_later; # wait for ua

    my $url = "http://www.backgroundtask.eu/Systeemtaken/Search.php";
    $ua->post_form($url, { 'ex' => $md5 }, sub {
        my ($ua, $tx) = @_;
        my $content = $tx->res->body;
        $content =~ m{(http://www\.backgroundtask\.eu/Systeemtaken/taakinfo/.*$md5/)};
        return $self->render( text => 'Failed' ) unless $1;

        $ua->get($1, sub {
            my ($ua, $tx) = @_;
            my $filename = $tx->res->dom->find('table.view')->[0]->find('tr.even')->[2]->td->[1]->all_text;
            $self->render( text => "md5 was $md5, filename was $filename" );
        });
    });
};
app->start;
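As a usage note (assuming a standard Mojolicious installation), the Lite app above needs no extra code to run as a plain CGI script; app->start picks the appropriate backend, and the built-in commands let you exercise it from the shell. The md5 value below is just a placeholder:
perl script.pl get '/?md5=0123456789abcdef0123456789abcdef'
perl script.pl cgi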

using Perl to scrape a website

I am interested in writing a Perl script that goes to the following link and extracts the number 1975: https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3ACalifornia%20%2Bevent_place_level_2%3A%22San%20Diego%22%20%2Bbirth_year%3A1923-1923~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219
That page shows the number of white men born in 1923 who lived in San Diego County, California in 1940. I am trying to do this in a loop structure to generalize over multiple counties and birth years.
In the file locations.txt, I put the list of counties, such as San Diego County.
The current code runs, but instead of the number 1975 it displays "unknown". The number 1975 should end up in $val.
I would very much appreciate any help!
#!/usr/bin/perl
use strict;
use LWP::Simple;
open(L, "locations26.txt");
my $url = 'https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3A%22California%22%20%2Bevent_place_level_2%3A%22%LOCATION%%22%20%2Bbirth_year%3A%YEAR%-%YEAR%~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219';
open(O, ">out26.txt");
my $oldh = select(O);
$| = 1;
select($oldh);
while (my $location = <L>) {
    chomp($location);
    $location =~ s/ /+/g;
    foreach my $year (1923..1923) {
        my $u = $url;
        $u =~ s/%LOCATION%/$location/;
        $u =~ s/%YEAR%/$year/;
        #print "$u\n";
        my $content = get($u);
        my $val = 'unknown';
        if ($content =~ / of .strong.([0-9,]+)..strong. /) {
            $val = $1;
        }
        $val =~ s/,//g;
        $location =~ s/\+/ /g;
        print "'$location',$year,$val\n";
        print O "'$location',$year,$val\n";
    }
}
Update: the API is not a viable solution. I have been in contact with the site developer. The API does not apply to that part of the webpage. Hence, any solution pertaining to JSON will not be applicable.
It would appear that your data is generated by JavaScript, and thus LWP cannot help you. That said, it seems that the site you are interested in has a developer API: https://familysearch.org/developers/
I recommend using Mojo::URL to construct your query and either Mojo::DOM or Mojo::JSON to parse XML or JSON results respectively. Of course other modules will work too, but these tools are very nicely integrated and let you get started quickly.
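As a rough sketch of that recommendation (the endpoint and the totalHits field are borrowed from the other answers on this page, so treat them as assumptions), building the query with Mojo::URL and letting Mojo::UserAgent decode the JSON via Mojo::JSON could look like this:
use strict;
use warnings;
use Mojo::UserAgent;
use Mojo::URL;

my $ua = Mojo::UserAgent->new;

# Build the query URL piece by piece instead of hand-escaping it.
my $url = Mojo::URL->new('https://familysearch.org/search/records');
$url->query(
    collection_id => 2000219,
    count         => 20,
    query         => '+event_place_level_1:California '
                   . '+event_place_level_2:"San Diego" '
                   . '+birth_year:1923-1923~ +gender:M +race:White',
);

# Ask for JSON and decode the reply; res->json uses Mojo::JSON internally.
my $data = $ua->get($url, { 'Content-Type' => 'application/json' })->res->json;
print $data->{totalHits}, "\n";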
You could use WWW::Mechanize::Firefox to process any site that could be loaded by Firefox.
http://metacpan.org/pod/WWW::Mechanize::Firefox::Examples
You have to install the MozRepl plugin, and then you will be able to process the web page content via this module. Basically you will be "remotely controlling" the browser.
Here is an example (untested, so it may need tweaks):
use strict;
use warnings;
use WWW::Mechanize::Firefox;
my $mech = WWW::Mechanize::Firefox->new(
    activate => 1, # bring the tab to the foreground
);
$mech->get('https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3ACalifornia%20%2Bevent_place_level_2%3A%22San%20Diego%22%20%2Bbirth_year%3A1923-1923~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219',':content_file' => 'main.html');

my $retries = 10;
while ($retries-- and ! $mech->is_visible( xpath => '//*[@class="form-submit"]' )) {
    print "Sleep until we find the thing\n";
    sleep 2;
};
die "Timeout" if 0 > $retries;

# fill out the search form
my @forms = $mech->forms();

# <input id="census_bp" name="birth_place" type="text" tabindex="0"/>
# A selector prefixed with '#' must match the id attribute of the input. A selector prefixed with '.' matches the class attribute. A selector prefixed with '^' or with no prefix matches the name attribute.
$mech->field( birth_place => 'value_for_birth_place' );

# Click on the submit
$mech->click({xpath => '//*[@class="form-submit"]'});
If you use your browser's development tools, you can clearly see the JSON request that the page you link to uses to get the data you're looking for.
This program should do what you want. I've added a bunch of comments for readability and explanation, as well as made a few other changes.
use warnings;
use strict;
use LWP::UserAgent;
use JSON;
use CGI qw/escape/;
# Create an LWP User-Agent object for sending HTTP requests.
my $ua = LWP::UserAgent->new;
# Open data files
open(L, 'locations26.txt') or die "Can't open locations: $!";
open(O, '>', 'out26.txt') or die "Can't open output file: $!";
# Enable autoflush on the output file handle
my $oldh = select(O);
$| = 1;
select($oldh);
while (my $location = <L>) {
    # This regular expression is like chomp, but removes both Windows and
    # *nix line-endings, regardless of the system the script is running on.
    $location =~ s/[\r\n]//g;

    foreach my $year (1923..1923) {
        # If you need to add quotes around the location, use "\"$location\"".
        my %args = (LOCATION => $location, YEAR => $year);

        my $url = 'https://familysearch.org/proxy?uri=https%3A%2F%2Ffamilysearch.org%2Fsearch%2Frecords%3Fcount%3D20%26query%3D%252Bevent_place_level_1%253ACalifornia%2520%252Bevent_place_level_2%253A^LOCATION^%2520%252Bbirth_year%253A^YEAR^-^YEAR^~%2520%252Bgender%253AM%2520%252Brace%253AWhite%26collection_id%3D2000219';

        # Note that values need to be doubly-escaped because of the
        # weird way their website is set up (the "/proxy" URL we're
        # requesting is subsequently loading some *other* URL which
        # is provided to "/proxy" as a URL-encoded URL).
        #
        # This regular expression replaces any ^WHATEVER^ in the URL
        # with the double-URL-encoded value of WHATEVER in %args.
        # The /e flag causes the replacement to be evaluated as Perl
        # code. This way I can look data up in a hash and do URL-encoding
        # as part of the regular expression without an extra step.
        $url =~ s/\^([A-Z]+)\^/escape(escape($args{$1}))/ge;
        #print "$url\n";

        # Create an HTTP request object for this URL.
        my $request = HTTP::Request->new(GET => $url);

        # This HTTP header is required. The server outputs garbage if
        # it's not present.
        $request->push_header('Content-Type' => 'application/json');

        # Send the request and check for an error from the server.
        my $response = $ua->request($request);
        die "Error " . $response->code if !$response->is_success;

        # The response should be JSON.
        my $obj = from_json($response->content);

        my $str = "$args{LOCATION},$args{YEAR},$obj->{totalHits}\n";
        print O $str;
        print $str;
    }
}
What about this simple script, without Firefox? I investigated the site a bit to understand how it works, and I saw some JSON requests with the Firebug Firefox add-on, so I know which URL to query to get the relevant stuff. Here is the code:
use strict; use warnings;
use JSON::XS;
use LWP::UserAgent;
use HTTP::Request;
my $ua = LWP::UserAgent->new();
open my $fh, '<', 'locations2.txt' or die $!;
open my $fh2, '>>', 'out2.txt' or die $!;
# iterate over locations from locations2.txt file
while (my $place = <$fh>) {
    # remove line ending
    chomp $place;

    # iterate over years
    foreach my $year (1923..1925) {
        # build the URL from the variables
        my $url = "https://familysearch.org/proxy?uri=https%3A%2F%2Ffamilysearch.org%2Fsearch%2Frecords%3Fcount%3D20%26query%3D%252Bevent_place_level_1%253ACalifornia%2520%252Bevent_place_level_2%253A%2522$place%2522%2520%252Bbirth_year%253A$year-$year~%2520%252Bgender%253AM%2520%252Brace%253AWhite%26collection_id%3D2000219";
        my $request = HTTP::Request->new(GET => $url);
        # fake the referer (where we come from)
        $request->header('Referer', 'https://familysearch.org/search/collection/results');
        # set the expected response format header to JSON
        $request->header('content_type', 'application/json');
        my $response = $ua->request($request);
        if ($response->code == 200) {
            # this line converts the JSON into a Perl hash
            my $hash = decode_json $response->content;
            my $val = $hash->{totalHits};
            print $fh2 "year $year, place $place : $val\n";
        }
        else {
            die $response->status_line;
        }
    }
}
END{ close $fh; close $fh2; }
This seems to do what you need. Instead of waiting for the disappearance of the hourglass it waits - more obviously I think - for the appearance of the text node you're interested in.
use 5.010;
use warnings;
use WWW::Mechanize::Firefox;
STDOUT->autoflush;
my $url = 'https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3ACalifornia%20%2Bevent_place_level_2%3A%22San%20Diego%22%20%2Bbirth_year%3A1923-1923~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219';
my $mech = WWW::Mechanize::Firefox->new(tab => qr/FamilySearch\.org/, create => 1, activate => 1);
$mech->autoclose_tab(0);
$mech->get('about:blank');
$mech->get($url);
my $text;
while () {
    sleep 1;
    $text = $mech->xpath('//p[@class="num-search-results"]/text()', maybe => 1);
    last if defined $text;
}

my $results = $text->{nodeValue};
say $results;

if ($results =~ /([\d,]+)\s+results/) {
    (my $n = $1) =~ tr/,//d;
    say $n;
}
output
1-20 of 1,975 results
1975
Update
This update is with special thanks to @nandhp, who inspired me to look at the underlying data server that produces the data in JSON format.
Rather than making a request via the superfluous https://familysearch.org/proxy, this code accesses the server directly at https://familysearch.org/search/records, decodes the JSON, and pulls the required data out of the resulting structure. This has the advantage of both speed (the requests are served about once a second, more than ten times faster than the equivalent requests to the basic web site) and stability (as you note, the site is very flaky; in contrast, I have never seen an error using this method).
use strict;
use warnings;
use LWP::UserAgent;
use URI;
use JSON;
use autodie;
STDOUT->autoflush;
open my $fh, '<', 'locations26.txt';
my @locations = <$fh>;
chomp @locations;

open my $outfh, '>', 'out26.txt';

my $ua = LWP::UserAgent->new;

for my $county (@locations[36, 0..2]) {
    for my $year (1923 .. 1926) {
        my $total = familysearch_info($county, $year);
        print STDOUT "$county,$year,$total\n";
        print $outfh "$county,$year,$total\n";
    }
    print "\n";
}

sub familysearch_info {
    my ($county, $year) = @_;

    my $query = join ' ', (
        '+event_place_level_1:California',
        sprintf('+event_place_level_2:"%s"', $county),
        sprintf('+birth_year:%1$d-%1$d~', $year),
        '+gender:M',
        '+race:White',
    );

    my $url = URI->new('https://familysearch.org/search/records');
    $url->query_form(
        collection_id => 2000219,
        count => 20,
        query => $query);

    my $resp = $ua->get($url, 'Content-Type' => 'application/json');
    my $data = decode_json($resp->decoded_content);

    return $data->{totalHits};
}
output
San Diego,1923,1975
San Diego,1924,2004
San Diego,1925,1871
San Diego,1926,1908
Alameda,1923,3577
Alameda,1924,3617
Alameda,1925,3567
Alameda,1926,3464
Alpine,1923,1
Alpine,1924,2
Alpine,1925,0
Alpine,1926,1
Amador,1923,222
Amador,1924,248
Amador,1925,134
Amador,1926,67
I do not know how to post revised code from the solution above.
This code does not (yet) compile correctly. However, I have made some essential updates to definitely head in that direction.
I would very much appreciate help on this updated code. I do not know how to post this code and this follow-up in a way that appeases the powers that run this site.
It gets stuck at the sleep line. Any advice on how to proceed past it would be much appreciated!
use strict;
use warnings;
use WWW::Mechanize::Firefox;
my $mech = WWW::Mechanize::Firefox->new(
    activate => 1, # bring the tab to the foreground
);
$mech->get('https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3ACalifornia%20%2Bevent_place_level_2%3A%22San%20Diego%22%20%2Bbirth_year%3A1923-1923~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219',':content_file' => 'main.html', synchronize => 0);

my $retries = 10;
while ($retries-- and $mech->is_visible( xpath => '//*[@id="hourglass"]' )) {
    print "Sleep until we find the thing\n";
    sleep 2;
};
die "Timeout while waiting for application" if 0 > $retries;
# Now the hourglass is not visible anymore

# fill out the search form
my @forms = $mech->forms();

# <input id="census_bp" name="birth_place" type="text" tabindex="0"/>
# A selector prefixed with '#' must match the id attribute of the input. A selector prefixed with '.' matches the class attribute. A selector prefixed with '^' or with no prefix matches the name attribute.
$mech->field( birth_place => 'value_for_birth_place' );

# Click on the submit
$mech->click({xpath => '//*[@class="form-submit"]'});
You should set the current form before accessing a field:
"Given the name of a field, set its value to the value specified. This applies to the current form (as set by the "form_name()" or "form_number()" method or defaulting to the first form on the page)."
$mech->form_name( 'census-search' );
$mech->field( birth_place => 'value_for_birth_place' );
Sorry, I am not able to try this code out myself. And thanks for opening a new question for the follow-up.

How can I handle HTTP redirects in my Perl application?

I use this Perl subroutine to get a line from a webpage, which I then split to get the information I require to proceed. This has worked for a dozen years.
sub capture_line {
    my $page_to_get = $_[0];
    my $host_to_get_text = $_[1];
    my $port = 80;
    my $buf = &HTTPGet($page_to_get, $host_to_get_text, $port);
    my $image_capture_text;
    my @lines = split(/\n/, $buf);
    # print "$lines[1]\n";
    # print "$page_to_get, $host_to_get_text\n";
    # print "$buf\n";
    foreach (@lines) {
        if (/$text_to_find/i) {
            $image_capture_text = $_;
            print "in_loop";
            last;
        }
    }
    return $image_capture_text;
}
Unfortunately, $page_to_get is now always a 301 redirect, and $buf, when printed, gives me a 301 redirection page, which obviously does not contain the sought-after text. Is there an input name/value pair (for example) that I can use with HTTPGet to hop me over the redirection so that I get the page that I see when I type http://$host_to_get_text$page_to_get into my browser? Or is there a better way to accomplish the same thing (knowledge of an ever-changing filename in the source of a viewed webpage)?
Thank you for your time.
Greg Marsh
Where is the HTTPGet function coming from?
If you were to use LWP (http://search.cpan.org/dist/libwww-perl/) to do the HTTP fetching, that will automatically follow redirects (you can specify how many times you want it to follow a redirect before giving up).
e.g.:
use LWP::Simple qw();

sub capture_line {
    my ($page_to_get, $host_to_get_text) = @_;
    my $url = "http://$host_to_get_text$page_to_get";
    my $buf = LWP::Simple::get($url);
    my $image_capture_text;
    my @lines = split(/\n/, $buf);
    # ...
}
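If you need control over how many redirects are followed (LWP::Simple just uses libwww-perl's default), a minimal sketch with LWP::UserAgent could look like this; the limit of 5 and the host/page values are only placeholders:
use strict;
use warnings;
use LWP::UserAgent;

my ($page_to_get, $host_to_get_text) = ('/some/page.html', 'www.example.com');  # placeholders
my $url = "http://$host_to_get_text$page_to_get";

# max_redirect sets how many 301/302 hops are followed before giving up.
my $ua = LWP::UserAgent->new(max_redirect => 5);
my $response = $ua->get($url);
die "Failed to fetch $url: " . $response->status_line unless $response->is_success;

my @lines = split /\n/, $response->decoded_content;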

What is the easiest way in pure Perl to stream from another HTTP resource?

What is the easiest way (without opening a shell to curl and reading from stdin) in Perl to stream from another HTTP resource? I'm assuming here that the HTTP resource I'm reading from is a potentially infinite stream (or just really, really long)
Good old LWP allows you to process the result as a stream.
E.g., here's how to register a callback to yourFunc, reading and passing $byte_count bytes to each call of yourFunc (you can drop that parameter if you don't care how much data each call receives and just want to process the stream as fast as possible):
use LWP;
...
$browser = LWP::UserAgent->new();
$response = $browser->get($url,
    ':content_cb'     => \&yourFunc,
    ':read_size_hint' => $byte_count,
);
...
sub yourFunc {
    my ($data, $response) = @_;
    # do your magic with $data
    # $response will be the response object created once/if get() returns
}
HTTP::Lite's request method allows you to specify a callback.
The $data_callback parameter, if used, is a way to filter the data as it is received or to handle large transfers. It must be a function reference, and will be passed: a reference to the instance of the http request making the callback, a reference to the current block of data about to be added to the body, and the $cbargs parameter (which may be anything). It must return either a reference to the data to add to the body of the document, or undef.
However, looking at the source, there seems to be a bug in sub request in that it seems to ignore the passed callback. It seems safer to use set_callback:
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Lite;

my $http = HTTP::Lite->new;
$http->set_callback(\&process_http_stream);
$http->http11_mode(1);
$http->request('http://www.example.com/');

sub process_http_stream {
    my ($self, $phase, $dataref, $cbargs) = @_;
    warn $phase, "\n";
    return;
}
Output:
C:\Temp> ht
connect
content-length
done-headers
content
content-done
data
done
It looks like a callback passed to the request method is treated differently:
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Lite;

my $http = HTTP::Lite->new;
$http->http11_mode(1);

my $count = 0;
$http->request('http://www.example.com/',
    \&process_http_stream,
    \$count,
);

sub process_http_stream {
    my ($self, $data, $times) = @_;
    ++$$times;
    print "$$times====\n$$data\n===\n";
}
Wait, I don't understand. Why are you ruling out a separate process? This:
open my $stream, "-|", "curl $url" or die;
while(<$stream>) { ... }
sure looks like the "easiest way" to me. It's certainly easier than the other suggestions here...
Event::Lib will give you an easy interface to the fastest asynchronous IO method for your platform.
IO::Lambda is also quite nice for creating fast, responsive, IO applications.
Here is a version I ended up using via Net::HTTP
This is basically a copy of the example from the Net::HTTP man page / perl doc
use Net::HTTP;

my $s = Net::HTTP->new(Host => "www.example.com") || die $@;
$s->write_request(GET => "/somestreamingdatasource.mp3");
my ($code, $mess, %h) = $s->read_response_headers;

while (1) {
    my $buf;
    my $n = $s->read_entity_body($buf, 4096);
    die "read failed: $!" unless defined $n;
    last unless $n;
    print STDERR "got $n bytes\n";
    print STDOUT $buf;
}