perl get webpage error with LWP:Simple - perl

I have a project I'm working on for school but I seem to be getting an error here...
I get "Can't call method 'content' on an undefined value at line 5"
use LWP::Simple;
for(my $id=0;$id<55;$id++)
{
my $response = get("http://www.gamereplays.org/community/index.php?act=medals&CODE=showmedal&MDSID=" + $id );
my $content = $response->content;
for(my $id2=0;$id2<10;$id2++)
{
$content =~ /<img src="http:\/\/www\.gamereplays.org\/community\/style_medals\/(.*)$id2\.gif" alt=""\/>/;
$url = "http://www.gamereplays.org/community/style_medals/" . $1 . $id2 . ".gif";
getstore($url, $1 . $id2 . ".gif");
}
}

LWP::simple doesn't return a response object, it return directly a string containing the response body.
And your put some pause between each request to avoid pounding the targeted website.

Related

Perl: Scanning body text for URIs and remove using URI::Find::Simple

I'm trying to scan the article text in a joomla database for URIs that result in a 404. If a 404 is found, remove the anchor and just leave the resulting text.
The code below succeeds in locating the URIs, but I have no idea how to then strip them out of the body text, leaving just the text portion of the link.
The fetch_body() function returns a pointer to the article, where the id, fulltext, and other characteristics can be accessed.
I've also experimented with HTTP::Tiny to download the URL and check the status and it appears to work properly.
Is there an existing module I can use to strip the URL, leaving only the text? Can URI::Find:Simple be used to return the entire HTML (anchor?) surrounding the URI?
I don't know what more information I can provide to help me determine what to do next.
foreach my $ref (fetch_body($dbh)) {
print "checking body: $ref->{'id'} ";
my #uris = URI::Find::Simple::list_uris($ref->{fulltext});
foreach my $uri (#uris) {
print "current uri: $uri\n";
# check uri for status code here ***
my $response = HTTP::Tiny->new->get($url);
if ($response->{success}) {
print "status: $response->{'status'}\n";
} else {
print "Failed: $response->{status} $response->{reasons}\n";
}
}
}
These other two functions just open the database and return a pointer to the article in the joomla database.
sub db_connect() {
my %DB = (
'host' => 'db.example.com',
'db' => 'db5',
'user' => 'joomla',
'pass' => 'joomlapass',
);
return DBI->connect("DBI:mysql:database=$DB{'db'};host=$DB{'host'}",$DB{'user'}, $DB{'pass'});
}
sub fetch_body ($) {
$dbh = shift;
my $sql = "select id, title, alias, urls, \`fulltext\`
FROM xxx_content
WHERE state = 1";
my $sth = $dbh->prepare($sql);
my #rv = ();
$sth->execute();
while (my $ref = $sth->fetchrow_hashref()) {
push #rv, $ref;
}
$sth->finish();
return #rv;
}
You can use something like this:
use strict;
use warnings;
use Mojo::DOM qw( );
sub check_url {
my ($url) = #_;
# Replace with code to check of the URL is reachable.
return $url !~ /non-existant/;
}
# From your database or whatever.
my $html = '
<body>
<p>Google</p>
<p>Bad</p>
</body>
';
my $dom = Mojo::DOM->new($html);
for my $ele ($dom->find('a[href]')->each) {
my $url = $ele->attr('href');
if (!check_url($url)) {
delete $ele->attr->{href};
}
}
$html = "$dom";
print $html; # Do whatever you want with the modified HTML.

Getting Absolute URLs with module creating object outside loop

I have a doubt I've been trying to solve myself using CPAN modules documentation, but I'm a bit new and I'm confused with some terminology and sections within the different modules.
I'm trying to create the object in the code below, and get the absolute URL for relative links extracted from a website.
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use Digest::MD5 qw(md5_hex);
use URI;
my $url = $ARGV[0];
if ($url !~ m{^https?://[^\W]+-?\.com/?}i) {
exit(0);
}
my $ua = LWP::UserAgent->new;
$ua->timeout( 10 );
my $response = $ua->get( $url );
my $content = $response->decoded_content();
my $links = URI->new($content);
my $abs = $links->abs('http:', $content);
my $abs_links = $links->abs($abs);
while ($content =~ m{<a[^>]\s*href\s*=\s*"?([^"\s>]+)}gis) {
$abs_links = $1;
print "$abs_links\n";
print "Digest for the above URL is " . md5_hex($abs_links) . "\n";
}
The problem is when I try to add that part outside the While loop (the 3-line block preceding the loop), it does not work, whereas if I add the same part in the While loop, it will work fine. This one just gets the relative URLs from a given website, but instead of printing "Http://..." it prints "//...".
The script that works fine for me is the following:
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use Digest::MD5 qw(md5_hex);
use URI::URL;
my $url = $ARGV[0]; ## Url passed in command
if ($url !~ m{^https?://[\w]+-?[\w]+\.com/?}i) {
exit(0); ## Program stops if not valid URL
}
my $ua = LWP::UserAgent->new;
$ua->timeout( 10 );
my $response = $ua->get( $url ); ## Get response, not content
my $content = $response->decoded_content(); ## Now let's get the content
while ($content =~ m{<a[^>]\s*href\s*=\s*"?([^"\s>]+)}gis) { ## All links
my $links = $1;
my $abs = new URI::URL "$links";
my $abs_url = $abs->abs('http:', $links);
print "$abs_url\n";
print "Digest for the above URL is " . md5_hex($abs_url) . "\n";
}
Any ideas? Much appreciated.
I don't understand your code. There are a few weird bits:
[^\W] is the same as \w
The regex allows an optional - before and an optional / after .com, i.e. http://bitwise.complement.biz matches but http://cool-beans.com doesn't.
URI->new($content) makes no sense: $content is random HTML, not a URI.
$links->abs('http:', $content) makes no sense: $content is simply ignored, and $links->abs('http:') tries to make $links an absolute URL relative to 'http:', but 'http:' is not a valid URL.
Here's what I think you're trying to do:
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::LinkExtor;
use Digest::MD5 qw(md5_hex);
#ARGV == 1 or die "Usage: $0 URL\n";
my $url = $ARGV[0];
my $ua = LWP::UserAgent->new(timeout => 10);
my $response = $ua->get($url);
$response->is_success or die "$0: " . $response->request->uri . ": " . $response->status_line . "\n";
my $content = $response->decoded_content;
my $base = $response->base;
my #links;
my $p = HTML::LinkExtor->new(
sub {
my ($tag, %attrs) = #_;
if ($tag eq 'a' && $attrs{href}) {
push #links, "$attrs{href}"; # stringify
}
},
$base,
);
$p->parse($content);
$p->eof;
for my $link (#links) {
print "$link\n";
print "Digest for the above URL is " . md5_hex($link) . "\n";
}
I don't try to validate the URL passed in $ARGV[0]. Leave it to LWP::UserAgent. (If you don't like this, just add the check back in.)
I make sure $ua->get($url) was successful before proceeding.
I get the base URL for absolutifying relative links from $response->base.
I use HTML::LinkExtor for parsing the content, extracting links, and making them absolute.
I think your biggest mistake is trying to parse links out of HTML using a regular expression. You would be far better advised to use a CPAN module for this. I'd recommend WWW::Mechanize, which would make your code look something like this:
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use WWW::Mechanize;
use Digest::MD5 qw(md5_hex);
use URI;
my $url = $ARGV[0];
if ($url !~ m{^https?://[^\W]+-?\.com/?}i) {
exit(0);
}
my $ua = WWW::Mechanize->new;
$ua->timeout( 10 );
$ua->get( $url );
foreach ($ua->links) {
say $_->url;
say "Digest for the above URL is " . md5_hex($_->url) . "\n";
}
That looks a lot simpler to me.

Where can I find the request body in HTTP::Server::Simple

I have the following simple server:
And I am trying to locate where the request body (or content) is.
I have tried dumping $self and $cgi but they didn't contain the field (I am asuming because they don't carry any information regarding the request)
How can I get the request body ?
package MyWebServer;
use strict;
use HTTP::Server::Simple::CGI;
use base qw(HTTP::Server::Simple::CGI);
use Data::Dumper;
my %dispatch = (
'/hello' => \&resp_hello,
# ...
);
sub handle_request {
my $self = shift;
my $cgi = shift;
my $path = $cgi->path_info();
my $handler = $dispatch{$path};
print "printing self in request".Dumper($cgi);
my $req = $cgi->get_request;
if (ref($handler) eq "CODE") {
print "HTTP/1.0 200 OK\r\n";
$handler->($cgi, "asd");
} else {
print "HTTP/1.0 404 Not found\r\n";
print $cgi->header,
$cgi->start_html('Not found'),
$cgi->h1('Not found'),
$cgi->end_html;
}
}
sub resp_hello($$) {
my ($cgi, $asd) = #_; # CGI.pm object
my $who = $cgi->param('name');
print $cgi->header,
$cgi->start_html("Hello"),
$cgi->h1("Hello world!!"),
$cgi->h2("Azdh $asd");
$cgi->end_html;
}
# start the server on port 8080
my $pid = MyWebServer->new(8081)->background();
print "Use 'kill $pid' to stop server.\n";
EDIT: Here is an example request:
use strict;
require LWP::UserAgent;
my $ua = LWP::UserAgent->new;
my $req = HTTP::Request->new(GET => "http://localhost:8081/hello");
$req->content("<foo>3.14</foo>"); # the request body
my $resp = $ua->request($req);
if ($resp->is_success) {
my $message = $resp->decoded_content;
print "Received reply: $message\n";
}
else {
print "HTTP GET error code: ", $resp->code, "\n";
print "HTTP GET error message: ", $resp->message, "\n";
}
It's a bit old, but facing the same issue, here's the solution :
$cgi->param('POSTDATA');
That's all you need to retreive the Body contents.
cheers.
The request object you obtained using the line $req = $cgi->get_request is a CGI::Request object. Since this is a request object, it will have only attributes (parameters passed on to the request). Please note that only response objects will have content. So, to see all the parameters you have passed, you can use the as_string() object method as mentioned below.
print $req->as_string;
For more information about accessing individual parameters of the request object, please see CGI::Request documentation in http://search.cpan.org/~mrjc/cvswebedit-v2.0b1/cvs-web/lib/CGI/Request.pm.

Perl LWP:Simple Get URL String Varilable

#!/usr/bin/perl
use LWP::Simple;
use warnings;
$content = 0;
$find = "webvis.edgesuite.net";
open (HOSTLIST,"lists.hosts");
#hosts = <HOSTLIST>;
foreach $host(#hosts) {
$results = `nslookup www.$host`;
my $pos = index($results, $find);
if ($pos > -1 )
{
my $url = "http://www.$host";
$content = get ($url);
print $content;
my $pos1 = index($content, $url);
if($pos1 > -1) {
print "Content Match\n";
} else {
print "No Content Match\n";
}
$count++;
chomp ($host);
print "$count www.$host\n";
}
}
close (HOSTLIST);
exit($errorcode);
Using the code above, I always get the following error:
IO::Socket::INET: Bad hostname 'www.test.com
If change the $url to:
$url = 'http://www.test.com';
I get the content retrieval from the page.
So my question is how do I pass in a string variable to the get attribute so it doesn't produce
the bad hostname error?
Thank you in advance
When you read in the hosts from <HOSTLIST>, each line (except possibly the last) will have a newline at the end of it which does not belong in a domain name and thus has to be explicitly removed with the chomp function before trying to do anything important.

how to download `decoded_content`

***UPDATED CODE with resume functionality**
my $ua = LWP::UserAgent->new;
$ua->credentials('$ip:80', 'Realm', 'username', 'password');
my $response = $ua->mirror($url,$newfile);
if ($response->is_success) {
print "Download Successfull.";
}
else {
print "Error: " . $response->status_line;
}
********OLD CODE*****************
my $ua = LWP::UserAgent->new;
$ua->credentials('$ip:80', 'Realm', 'username', 'password');
my $response = $ua->get($url);
if ($response->is_success) {
print "Retrieved " .length($response->decoded_content) .
" bytes of data.";
}
else {
print "Error: " . $response->status_line;
}
open my $fh, '>encoding(UTF-8)', $tmp;
print {$fh} $response->decoded_content;
close $fh;
if ( -e $tmp ) {
my $filesize = ( stat $tmp )[9];
my $origsize = $queue[$rec][1];
if ( $filesize < $origsize) {
print "Resuming download";
******************************************
code for resuming the partly downloaded file...
*******************************************
}
else {
print "File downloaded correctly\n";
}
}
As i'm newbie to perl, could download decoded_content, though some errors persists.
Need to resume the file download, if we have a partial file.
This was the code i've tried, but am not able to know where to start with, hence any quick thoughts in this regard will be of great help indeed. Please help on this.
See method mirror in LWP::UserAgent. Documentation quote:
This method will get the document identified by $url and store it in file called $filename.
my $response = $ua->mirror($url, $filename); # no single quotes around variables!
See the source code for mirror, it deals correctly with truncated/partially downloaded files.