perl html treebuilder how to handle error condition - perl

The task is quite simple: access a url and parse it based on the result. In case there is an error (404, 500 etc etc), take appropriate action. The last piece is the one that I am having issue with.
I have listed both the pieces of code that I currently use. The longer one (LWP+TreeBuilder) works for both conditions ; the shorter one (TreeBuilder) works for the first condition but does not work for the error condition. If I use TreeBuilder and the site returns a 404 or some other error, the script simply exits ! Any ideas ?
Longer code that works
use LWP::Simple;
use LWP::UserAgent;
use HTML::TreeBuilder;
$url="http://some_url.com/blahblah" ;
$response = LWP::UserAgent->new->request( HTTP::Request->new( GET => $url ));
if ($response->is_success) {
$p = HTML::TreeBuilder->new();
$p->parse($response->content);
} else {
warn "Couldn't get $url: ", $response->status_line, "\n";
}
Shorter one that does not
use HTML::TreeBuilder;
$url="http://some_url.com/blahblah" ;
$tree = HTML::TreeBuilder->new_from_url($url) ;

To quote the docs:
If LWP is unable to fetch the URL, or the response is not HTML (as determined by content_is_html in HTTP::Headers), then new_from_url dies, and the HTTP::Response object is found in $HTML::TreeBuilder::lwp_response.
Try this:
use strict;
use warnings;
use HTML::TreeBuilder 5; # need new_from_url
use Try::Tiny;
my $url="http://some_url.com/blahblah" ;
my $p = try { HTML::TreeBuilder->new_from_url($url) };
unless ($p) {
my $response = $HTML::TreeBuilder::lwp_response;
if ($response->is_success) {
warn "Content of $url is not HTML, it's " . $response->content_type . "\n";
} else {
warn "Couldn't get $url: ", $response->status_line, "\n";
}
}

the script simply exits
No, it throws an exception. You could always catch the exception with eval BLOCK if you so desired.
my $tree = eval { HTML::TreeBuilder->new_from_url($url) }
or warn($#);

Related

perl LWP::UserAgent gives a cryptic error message

Here's the code:
$vizFile ='https://docs.recipeinvesting.com/t.aaaf.html';
my $ua = LWP::UserAgent->new;
$ua->timeout(10);
$ua->env_proxy;
my $response = $ua->get($vizFile);
if ($response->is_success) {print $response->decoded_content;}
else {print"\nError= $response->status_line]n";}
I get the message:
Error= HTTP::Response=HASH(0x3a9b810)->status_line]n
The url works fine if I put it in a browser.
This was working consistently (with plain http, using LWP::Simple), until the site made some changes.
Could it be the https that's making the difference?
Is there some way to get a less cryptic error message?
You can't put code in string literals and expect it to get executed. Sure, you can place variables for interpolation, but the making method calls falls on the other side of what's supported.
Replace
print"\nError= $response->status_line]n";
with
print "\nError= " . $response->status_line . "\n";
or
use feature qw( say );
say "\nError= " . $response->status_line;
This will print the status line as desired.
Please see following demo code, it is encouraged to include use strict; and use warnings; in the code what would assist you to avoid many potential problems
use strict;
use warnings;
use feature 'say';
use LWP::UserAgent;
my $url ='https://docs.recipeinvesting.com/t.aaaf.html';
my $ua = LWP::UserAgent->new;
$ua->timeout(10);
$ua->env_proxy;
my $response = $ua->get($url);
if( $response->is_success ){
say $response->decoded_content;
} else {
die $response->status_line;
}
Documentation: LWP::UserAgent

How to handle errors in plack delayed response

Tried to handle errors in delayed response.
Every time i send [200, [ 'Content-Type', 'application/json' ]
and got error before flushing the other things like that
$w->write("MyData");
$w->close();
i've got a warning in stdout and error in stderr, but page keeps loading.
it'll be loading until i stop app or stop page loading by hand.
how i can stop loading page in code or how to correctly handle errors in such apps where i use delayed response?
Perl version 5.24
Kelp version 1.02
Running Plack with Corona.
We're handling error throwing Exception::Class.
Catching errors with Try::Tiny.
Also tried eval and others things, but it doesn't work.
But changed Try::Tiny -> TryCatc and return if got any error, but
i need write return for every catch block, it looks very bad
#!/usr/bin/perl
use strict;
use warnings;
use Kelp::Less;
get '/hello' => sub {
return sub {
my $res = shift;
my $w = $res->([200, [ 'Content-Type', 'application/json' ]]);
my $data = 10 / 0;
$w->write("MyData");
$w->close();
}
};
run;
I'm looking for correct error handling,
do i need try{} catch{}; on every code that might fail?
Thanks #ikegami for answer, but page still loading after tries with Object::Destoyer and Sub::ScopeFinalizer. As i understand $w(writer) doesn't cause page loading. After exiting scope, $w comes undef then there's nothing to close, here is code.
#!/usr/bin/perl
use strict;
use warnings;
use Object::Destroyer;
use Kelp::Less;
get '/hello' => sub {
return sub {
my $res = shift;
my $w = $res->([200, [ 'Content-Type', 'application/json' ]]);
my $g = Object::Destroyer->new( sub { $w->close if $w } );
my $zzz = 1 / 0;
$w->write("DATA");
$w->close();
}
};
run;
so i've come up with that solution, what do you think?
#!/usr/bin/perl
use strict;
use warnings;
use Try::Tiny;
use Object::Destroyer;
use Kelp::Less;
get '/hello' => sub {
return sub {
my $res = shift;
my $w = $res->([200, [ 'Content-Type', 'application/json' ]]);
my $g = Object::Destroyer->new( sub { $w->close if $w; } );
my $ans = try {
my $zzz = 1 / 0;
}
catch {
print $_;
return;
};
return unless $ans;
$w->write("DATA");
$w->close();
}
};
run;
Solve this problem with wrapping app with
Plack::Middleware::HTTPExceptions

Perl Web Login Script With CGI::Session

i'm on the same problem since almost two week ago.
i'm a newbie with Perl and Web :/
i followed the CGI::Session tutorial and Cookbook, the code seems to be good but... not working.
index.cgi
#!/usr/bin/perl
use CGI;
use CGI::Cookie;
use HTML::Template;
use strict;
use warnings;
use CGI::Session;
use CGI::Carp qw(fatalsToBrowser);
require "cgi-bin/web_base.pl";
require "cgi-bin/login.pl";
my $cgi = new CGI;
my $session = new CGI::Session("driver:File", undef, {Directory=>'/tmp'}) or die CGI::Session->errstr;
my $CGISESSID = $session->id();
print header();
print $cgi->header();
print my_topbar();
login_attempt($session, $cgi);
if ( $session->param("~login-trials") >= 3 ) {
print error("You failed 3 times in a row.\n" .
"Your session is blocked. Please contact us with ".
"the details of your action");
exit(0);
}
unless ( $session->param("~logged-in") ) {
print login_form($cgi, $session);
exit(0);
}
print footer();
login.cgi
#!/usr/bin/perl
use strict;
use warnings;
use CGI;
use CGI::Cookie;
use HTML::Template;
use CGI::Session;
use CGI::Carp qw(fatalsToBrowser);
use Fcntl;
my $cgi = new CGI;
my $session = new CGI::Session(undef, $cgi, {Directory=>'/tmp'});
sub login_attempt {
my ($session, $cgi) = #_;
if ( $session->param("~logged-in") ) {
return 1; # Verify if user is not logged.
}
my $username = $cgi->param("username") or return;
my $password=$cgi->param("password") or return;
# Form submited. Try to load profile.
if ( my $profile = load_profile($username, $password) ) {
$session->param("~profile", $profile);
$session->param("~logged-in", 1);
print "YOUPIIIII";
$session->clear(["~login-trials"]);
$session->redirect("dashboard.cgi");
return 1;
}
# Failed to login, wrong credentials.
my $trials = $session->param("~login-trials") || 0;
return $session->param("~login-trials", ++$trials);
}
return 1;
sub load_profile {
my ($username, $password) = #_;
local $/ = "\n";
unless (sysopen(PROFILE, "profile.txt", O_RDONLY) ) {
die ("Couldn't open profile.txt: $!");
}
while ( <PROFILE> ) {
/^(\n|#)/ and next;
chomp;
my ($n, $p) = split "\s+";
if ( ($n eq $username) && ($p eq $password) ) {
my $p_mask = "x" . length($p);
return {username=>$n, password=>$p_mask};
}
}
close(PROFILE);
return undef;
}
profile.txt
Formget 123
When i try to login, nothing happen, even when i try wrong crendentials it should block me after 3 attemps but it is not.
Can someone really help me on this ? i can't take it anymooooore.
feel free for any questions :)
EDIT :
-login_attempt() corrected
-load-profile wasn't working, made a new one, but still need improvement.
-Last Problem is session init
Are you sure that you don't get any errors? Have you checked the web server error log?
You call login_attempt() with two parameters ($session and $cgi) but in login.cgi, that subroutine is defined like this:
sub login_attempt() {
...
}
You're (probably accidentally) using a prototype on that subroutine, telling Perl that it takes no parameters. So I'd be surprised if you don't get an error saying:
Too many arguments for main::login_attempt
Remove the parentheses from that definition.
sub login_attempt {
...
}
Update: I think you're missing one very important step here. From the CGI::Session Tutorial:
There is one small, but very important thing your application needs to perform after creating CGI::Session object as above. It needs to drop Session ID as an HTTP cookie into the user's computer. CGI::Session will use this cookie to identify the user at his/her next request and will be able to load his/her previously stored session data.
To make sure CGI::Session will be able to read your cookie at next request you need to consult its name() method for cookie's suggested name:
$cookie = $query->cookie( -name => $session->name,
-value => $session->id );
print $query->header( -cookie=>$cookie );
name() returns CGISESSID by default. If you prefer a different cookie name, you can change it as easily too, but you have to do it before CGI::Session object is created:
CGI::Session->name("SID");
$session = CGI::Session->new();
Baking the cookie wasn't too difficult, was it? But there is an even easier way to send a cookie using CGI::Session:
print $session->header();
The above will create the cookie using CGI::Cookie and will return proper http headers using CGI.pm's CGI method. Any arguments to CGI::Session will be passed to CGI::header().
Without this, you'll be creating a brand new session for each request.

mojolicious script works three times, then crashes

The following script should demonstrate a problem I'm facing using Mojolicious on OpenBSD5.2 using mod_perl.
The script works fine 4 times being called as CGI under mod_perl. Additional runs of the script result in Mojolicious not returning the asynchronous posts. The subs that are usually called when data is arriving just don't seem to be called anymore. Running the script from command line works fine since perl is then completely started from scratch and everything is reinitialized, which is not the case under mod_perl. Stopping and starting Apache reinitializes mod_perl so that the script can be run another 4 times.
I only tested this on OpenBSD5.2 using Mojolicious in the version that's provided in OpenBSDs ports tree (2.76). This is kinda old I think but that's what OpenBSD comes with.
Am I doing something completely wrong here? Or is it possible that Mojolicious has some circular reference or something which causes this issue?
I have no influence on the platform (OpenBSD) being used. So please don't suggest to "use Linux and install latest Mojolicious version". However if you are sure that running a later version of Mojolicous will solve the problem, I might get the permission to install that (though I don't yet know how to do that).
Thanks in advance!
T.
Here's the script:
#!/usr/bin/perl
use diagnostics;
use warnings;
use strict;
use feature qw(switch);
use CGI qw/:param/;
use CGI qw/:url/;
use CGI::Carp qw(fatalsToBrowser warningsToBrowser);
use Mojo::IOLoop;
use Mojo::JSON;
use Mojo::UserAgent;
my ($activeconnections, $md5, $cgi);
my $ua = Mojo::UserAgent->new;
$ua->max_redirects(0)->connect_timeout(3)->request_timeout(6); # Timeout 6 seconds of which 3 may be connecting
my $delay = Mojo::IOLoop->delay();
sub online{
my $url = "http://www.backgroundtask.eu/Systeemtaken/Search.php";
$delay->begin;
$activeconnections++;
my $response_bt = $ua->post_form($url, { 'ex' => $md5 }, sub {
my ($ua, $tx) = #_;
my $content=$tx->res->body;
$content =~ m/(http:\/\/www\.backgroundtask\.eu\/Systeemtaken\/taakinfo\/.*$md5\/)/;
if ($1){
print "getting $1\n";
my $response_bt2 = $ua->get($1, sub {
$delay->end();
$activeconnections--;
print "got result, ActiveConnections: $activeconnections\n";
($ua, $tx) = #_;
my $filename = $tx->res->dom->find('table.view')->[0]->find('tr.even')->[2]->td->[1]->all_text;
print "fn = " . $filename . "\n";
}
)
} else {
print "query did not return a result\n";
$activeconnections--;
$delay->end;
}
});
}
$cgi = new CGI;
print $cgi->header(-cache_control=>"no-cache, no-store, must-revalidate") . "\n";
$md5 = lc($cgi->param("md5") || ""); # read param
$md5 =~ s/[^a-f0-9]*//g if (length($md5) == 32); # custom input filter for md5 values only
if (length $md5 != 32) {
$md5=lc($ARGV[0]);
$md5=~ s/[^a-f0-9]*//g;
die "invalid MD5 $md5\n" if (length $md5 ne 32);
}
online;
if ($activeconnections) {
print "waiting..., activeconnections: $activeconnections\n" for $delay->wait;
}
print "all pending requests completed, activeconnections is " . $activeconnections . "\n";
print "script done.\n md5 was $md5\n";
exit 0;
Well I hate to say it, but there's a lot wrong here. The most glaring is your use of ... for $delay->wait which doesn't make much sense. Also you are comparing numbers with ne rather than !=. Not my-ing the arguments in the deeper callback seems problematic for async style code.
Then there are some code smells, like regexing for urls and closing over the $md5 variable unnecessarily.
Lastly, why use CGI.pm when Mojolicious can operate under CGI just fine? When you do that, the IOLoop is already running, so some things get easier. And yes I understand that you are using the system provided Mojolicious, however I feel I should mention that the current version is 3.93 :-)
Anyway, here is an example, which strips out a lot of things but still should do pretty much the same thing as the example. Of course I can't test it without a valid md5 for the site (and I also can't get rid of the url regex without sample data).
#!/usr/bin/perl
use Mojolicious::Lite;
use Mojo::UserAgent;
my $ua = Mojo::UserAgent->new;
$ua->max_redirects(0)->connect_timeout(3)->request_timeout(6); # Timeout 6 seconds of which 3 may be connecting
any '/' => sub {
my $self = shift;
$self->res->headers->cache_control("no-cache, no-store, must-revalidate");
my $md5 = lc($self->param("md5") || ""); # read param
$md5 =~ s/[^a-f0-9]*//g if (length($md5) == 32); # custom input filter for md5 values only
if (length $md5 != 32) {
$md5=lc($ARGV[0]);
$md5=~ s/[^a-f0-9]*//g;
die "invalid MD5 $md5\n" if (length $md5 != 32);
}
$self->render_later; # wait for ua
my $url = "http://www.backgroundtask.eu/Systeemtaken/Search.php";
$ua->post_form($url, { 'ex' => $md5 }, sub {
my ($ua, $tx) = #_;
my $content=$tx->res->body;
$content =~ m{(http://www\.backgroundtask\.eu/Systeemtaken/taakinfo/.*$md5/)};
return $self->render( text => 'Failed' ) unless $1;
$ua->get($1, sub {
my ($ua, $tx) = #_;
my $filename = $tx->res->dom->find('table.view')->[0]->find('tr.even')->[2]->td->[1]->all_text;
$self->render( text => "md5 was $md5, filename was $filename" );
});
});
};
app->start;

Can't use an undefined value as a filehandle reference

first off I've searched the forums and didn't find exactly my issue.
I'm running Ubuntu with perl 5.10 installed.
I'm receiving the following error after executing my script:
"Can't use an undefined value as filehandle reference at scraper.pl line 17"
Here is my script....
#!/usr/bin/perl -W
use strict;
use warnings;
use WWW::Curl::Easy;
my $curl = WWW::Curl::Easy->new;
$curl->setopt(CURLOPT_HEADER, 1);
$curl->setopt(CURLOPT_URL, 'http://something.com');
my $response_body;
$curl->setopt(CURLOPT_WRITEDATA,\$response_body);
my $return_code = $curl->perform;
if ($return_code == 0)
{
my $response_code = $curl->getinfo(CURLINFO_HTTP_CODE);
print ("Success ".$response_code);
}
else
{
# Error Code
print ("An error occured: ".$return_code." ".$curl->strerror($return_code)." ".$curl->errbuf."\n");
}
# EOF
Any help here would be much appreciated.
Thanks,
Ben
In place of:
my $response_body;
$curl->setopt(CURLOPT_WRITEDATA,\$response_body);
do:
my $response_body = '';
open(my $fileb, ">", \$response_body);
$curl->setopt(CURLOPT_WRITEDATA,$fileb);
If you check the documentation for the version of WWW-Curl you actually have installed, I think you'll see it passes a filehandle, not a scalar reference.
Alternatively, upgrade WWW-Curl.
Also note that -W is not generally advisable; often modules will disable warnings for a particular scope and the capital W switch prevents that. Use -w instead (or just use warnings; for your own code, which you are already doing).
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Curl::Easy;
use File::Temp qw/tempfile/;
my $response_body = tempfile();
my $curl = WWW::Curl::Easy->new;
$curl->setopt(CURLOPT_HEADER, 1);
$curl->setopt(CURLOPT_URL, 'http://yiddele.com/categoryindex.aspx');
#$curl->setopt(CURLOPT_WRITEDATA,\$response_body);
$curl->setopt(CURLOPT_WRITEDATA, \$response_body);
my $return_code = $curl->perform;
if ($return_code == 0)
{
my $response_code = $curl->getinfo(CURLINFO_HTTP_CODE);
print ("Success ".$response_code);
}
else
{
# Error Code
print ("An error occured: ".$return_code." ".$curl->strerror($return_code)." ".$curl->errbuf."\n");
}
# EOF
Output is:
Success 200
There is bad code at:
print ("Success ".$response_code);
Look at the documentation for print: due to the way arguments are parsed when you use parentheses, the first argument will be interpreted to be a filehandle, which is not what you intended. In your code, the parentheses are unnecessary; just pass a concatenated string, or better, avoid the concatenation and pass a list of strings:
print 'Success ', $response_code;
Also, please please always include use strict; use warnings; at the top of every script you write. You will discover that many errors are highlighted that may otherwise remain hidden for quite some time, and it also saves everyone's time when you catch an error before ever having to ask on Stack Overflow. :)
my $response_body;
$curl->setopt(CURLOPT_WRITEDATA,\$response_body);
You've declared $response_body, but haven't assigned a value to it. I assume that this would work if you made it a string.
my $response_body = "";
That said, I can't be sure as I can't reproduce the error. Perhaps installing a newer version of the module would help too.
use Data::Printer ;
use URI::Encode qw(uri_encode uri_decode);
use JSON ();
use JSON::Parse ':all' ;
use WWW::Curl;
use HTTP::Response ;
use utf8 ;
use Carp ;
use Cwd qw ( abs_path ) ;
use Getopt::Long;
use WWW::Curl::Easy;
my $curl = WWW::Curl::Easy->new;
$curl->setopt(WWW::Curl::Easy::CURLOPT_HEADER(),1);
$curl->setopt(WWW::Curl::Easy::CURLOPT_URL(), 'https://www.pivotaltracker.com/services/v5/me?fields=%3Adefault');
$curl->setopt(WWW::Curl::Easy::CURLOPT_HTTPHEADER() , ['X-TrackerToken: ' . $TOKEN] );
#$curl->setopt(WWW::Curl::Easy::CURLOPT_POST(), 1);
# A filehandle, reference to a scalar or reference to a typeglob can be used here.
my $response_body;
$curl->setopt(WWW::Curl::Easy::CURLOPT_WRITEDATA(),\$response_body);
# Starts the actual request
my $ret = $curl->perform;
if ($ret == 0) {
print("Transfer went ok\n");
my $response_code = $curl->getinfo(CURLINFO_HTTP_CODE);
# judge result and next action based on $response_code
$response_body = HTTP::Response->parse($response_body);
print("Received response: $response_body\n");
p($response_body);
my $json_data = $response_body->content ;
$json_data = JSON->new->utf8->decode($json_data);
p($json_data);
} else {
# Error code, type of error, error message
print("An error happened: $ret ".$curl->strerror($ret)." ".$curl->errbuf."\n");
}
# my $cmd='curl -X GET -H "X-TrackerToken: ' . "$TOKEN" . '" "https://www.pivotaltracker.com/services/v5/me?fields=%3Adefault"' ;
# my $json_str = `$cmd`;
# p($json_str);
# my $json_data = JSON->new->utf8->decode($json_str);
# p($json_data);