Download Specific Images - perl

I'm trying to search for and download specific images (the front and back cover) from a website, but whatever I do I only ever download one of them. What should I change in my code so that both are downloaded when they are found?
use strict;
use warnings;
use LWP;
use LWP::UserAgent;
use URI::Escape;
use HTTP::Status;
getCover(...);
sub getCover {
    ......
    while ($title_found =~ /'(http:\/\/images.blu-ray.com\/movies\/covers\/\d+_.*?)'/gis) {
        $url = getSite($1);
        if ($title_found =~ /front/) {
            $filename = 'front.jpg';
        }
        elsif ($title_found =~ /back/) {
            $filename = 'back.jpg';
        }
    }
    my $dir = 'somepath'.$filename;
    open F, ">", $dir;
    binmode F;
    print F $url;
    close F;
    return 0;
}
sub getSite {
    $url = shift;
    print "URL: $url\n";
    my $r;
    my $ua = new LWP::UserAgent();
    $ua->agent("Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.3) Gecko/20030312");
    my $req = new HTTP::Request GET => $url;
    $req->push_header("Accept-Language", "en");
    $req = $ua->prepare_request($req);
    my $res = $ua->request($req);
    my $rc = $res->code;
    if (is_success($rc)) {
        $r = $res->as_string();
        $r = $res->content();
    }
    else {
        print "Failed\n";
    }
    return $r;
}

Try putting the part that saves to 'somepath'.$filename inside the while loop instead of outside it.
Also, it appears that $title_found is supposed to contain multiple URLs. In that case, you need to save $1 to a temporary variable, and look for front/back in that instead of in $title_found. Otherwise, you'll wind up saving both covers to front.jpg.
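A minimal sketch of that combined change (assuming, as in the original, that getSite() returns the raw image data and that the front/back hint appears in the matched image URL itself; 'somepath' is the placeholder path from the question):

while ($title_found =~ /'(http:\/\/images.blu-ray.com\/movies\/covers\/\d+_.*?)'/gis) {
    my $image_url = $1;                   # keep the capture; $1 changes on the next match
    my $image     = getSite($image_url);  # raw image data
    next unless defined $image;

    # decide the filename from the URL we just matched, not from the whole page
    my $filename;
    if    ($image_url =~ /front/i) { $filename = 'front.jpg'; }
    elsif ($image_url =~ /back/i)  { $filename = 'back.jpg';  }
    else                           { next; }

    # save this cover before moving on to the next match
    open my $fh, '>', 'somepath' . $filename or die "Cannot write $filename: $!";
    binmode $fh;
    print {$fh} $image;
    close $fh;
}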

Getting Absolute URLs with module creating object outside loop

I have a question I've been trying to answer myself using the CPAN module documentation, but I'm fairly new to this and I get confused by some of the terminology and the sections within the different modules.
I'm trying to create the object in the code below, and get the absolute URL for relative links extracted from a website.
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use Digest::MD5 qw(md5_hex);
use URI;
my $url = $ARGV[0];
if ($url !~ m{^https?://[^\W]+-?\.com/?}i) {
    exit(0);
}
my $ua = LWP::UserAgent->new;
$ua->timeout( 10 );
my $response = $ua->get( $url );
my $content = $response->decoded_content();
my $links = URI->new($content);
my $abs = $links->abs('http:', $content);
my $abs_links = $links->abs($abs);
while ($content =~ m{<a[^>]\s*href\s*=\s*"?([^"\s>]+)}gis) {
    $abs_links = $1;
    print "$abs_links\n";
    print "Digest for the above URL is " . md5_hex($abs_links) . "\n";
}
The problem is that when I put that part outside the while loop (the 3-line block preceding the loop), it does not work, whereas if I put the same part inside the while loop it works fine. This version only gets the relative URLs from a given website: instead of printing "http://..." it prints "//...".
The script that works fine for me is the following:
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use Digest::MD5 qw(md5_hex);
use URI::URL;
my $url = $ARGV[0]; ## Url passed in command
if ($url !~ m{^https?://[\w]+-?[\w]+\.com/?}i) {
    exit(0); ## Program stops if not valid URL
}
my $ua = LWP::UserAgent->new;
$ua->timeout( 10 );
my $response = $ua->get( $url ); ## Get response, not content
my $content = $response->decoded_content(); ## Now let's get the content
while ($content =~ m{<a[^>]\s*href\s*=\s*"?([^"\s>]+)}gis) { ## All links
    my $links = $1;
    my $abs = new URI::URL "$links";
    my $abs_url = $abs->abs('http:', $links);
    print "$abs_url\n";
    print "Digest for the above URL is " . md5_hex($abs_url) . "\n";
}
Any ideas? Much appreciated.
I don't understand your code. There are a few weird bits:
[^\W] is the same as \w
The regex allows an optional - before and an optional / after .com, i.e. http://bitwise.complement.biz matches but http://cool-beans.com doesn't.
URI->new($content) makes no sense: $content is random HTML, not a URI.
$links->abs('http:', $content) makes no sense: $content is simply ignored, and $links->abs('http:') tries to make $links an absolute URL relative to 'http:', but 'http:' is not a valid URL.
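For reference, the usual way to absolutify one extracted link with URI looks more like this (just a sketch; $href stands for a single link pulled out of the page and $base for the URL of the page it came from):

use URI;

my $abs = URI->new_abs($href, $base);   # or: URI->new($href)->abs($base)
print "$abs\n";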
Here's what I think you're trying to do:
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::LinkExtor;
use Digest::MD5 qw(md5_hex);

@ARGV == 1 or die "Usage: $0 URL\n";
my $url = $ARGV[0];

my $ua = LWP::UserAgent->new(timeout => 10);
my $response = $ua->get($url);
$response->is_success or die "$0: " . $response->request->uri . ": " . $response->status_line . "\n";

my $content = $response->decoded_content;
my $base = $response->base;

my @links;
my $p = HTML::LinkExtor->new(
    sub {
        my ($tag, %attrs) = @_;
        if ($tag eq 'a' && $attrs{href}) {
            push @links, "$attrs{href}"; # stringify
        }
    },
    $base,
);
$p->parse($content);
$p->eof;

for my $link (@links) {
    print "$link\n";
    print "Digest for the above URL is " . md5_hex($link) . "\n";
}
I don't try to validate the URL passed in $ARGV[0]. Leave it to LWP::UserAgent. (If you don't like this, just add the check back in.)
I make sure $ua->get($url) was successful before proceeding.
I get the base URL for absolutifying relative links from $response->base.
I use HTML::LinkExtor for parsing the content, extracting links, and making them absolute.
I think your biggest mistake is trying to parse links out of HTML using a regular expression. You would be far better advised to use a CPAN module for this. I'd recommend WWW::Mechanize, which would make your code look something like this:
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use WWW::Mechanize;
use Digest::MD5 qw(md5_hex);
use URI;
my $url = $ARGV[0];
if ($url !~ m{^https?://[^\W]+-?\.com/?}i) {
    exit(0);
}
my $ua = WWW::Mechanize->new;
$ua->timeout( 10 );
$ua->get( $url );
foreach ($ua->links) {
    say $_->url;
    say "Digest for the above URL is " . md5_hex($_->url) . "\n";
}
That looks a lot simpler to me.

Using HTTP::Server::Simple::CGI, how do I get the headers?

Basically, my question is similar to
How do I access HTTP request headers in HTTP::Server::Simple::CGI?
The answer was to use parse_headers(), but there was no example of how to use it properly. I tried parse_headers(), but I don't get any result; it just stops at parse_headers() as if the program were stuck. I couldn't add a comment on the question above since I don't have enough rep, so I created this new question.
Below is my sample code: basically the example code from CPAN, with the parse_headers() part added:
#!/usr/bin/perl
{
    package MyWebServer;

    use HTTP::Server::Simple::CGI;
    our @ISA = qw(HTTP::Server::Simple::CGI);
    use Data::Dumper;

    my %dispatch = (
        '/hello.cgi' => \&resp_hello,
        # ...
    );

    sub handle_request {
        my $self = shift;
        my $cgi = shift;

        my $path = $cgi->path_info();
        my $handler = $dispatch{$path};

        my $header = $self->parse_headers();
        open F, qq{>>~/MyWebServer.log};
        my $dump = Data::Dumper->Dump([$header], [qw($header)]);
        print F $dump;
        close F;

        if (ref($handler) eq "CODE") {
            print "HTTP/1.0 200 OK\r\n";
            $handler->($cgi);
        } else {
            print "HTTP/1.0 404 Not found\r\n";
            print $cgi->header,
                  $cgi->start_html('Not found'),
                  $cgi->h1('Not found'),
                  $cgi->end_html;
        }
    }

    sub resp_hello {
        my $cgi = shift; # CGI.pm object
        return if !ref $cgi;
        my $who = $cgi->param('name');

        print $cgi->header,
              $cgi->start_html("Hello"),
              $cgi->h1("Hello $who!"),
              $cgi->end_html;
    }
} # end of package MyWebServer
# start the server on port 8080
my $pid = MyWebServer->new(8080)->background();
print "Use 'kill $pid' to stop server.\n";
I only added this part:
my $header = $self->parse_headers();
open F,qq{>>~/MyWebServer.log};
my $dump = Data::Dumper->Dump([$header], [qw($header)]);
print F $dump;
close F;
My objective is to get all the headers and dump them into a file.
Add
sub headers {
    my ($self, $headers) = @_;
    if ($headers) {
        $self->{__last_headers} = { @$headers };
    }
    return $self->{__last_headers};
}
Then inside handle_request use my $header = $self->headers();
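For example (a sketch based on the code above; the log path here is just an illustration, use whatever file you were writing to):

sub handle_request {
    my ($self, $cgi) = @_;

    # headers() now returns the hash captured by the override above
    my $header = $self->headers();

    open my $log, '>>', '/tmp/MyWebServer.log' or die $!;
    print {$log} Data::Dumper->Dump([$header], [qw($header)]);
    close $log;

    # ... dispatch to the %dispatch handlers as before ...
}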
FWIW, I'm curious why you're using HTTP::Server::Simple::CGI instead of Mojolicious or Dancer, or even HTTP::Server::Simple::PSGI. PSGI (https://metacpan.org/pod/PSGI) is all about portability.

Detect a broken link (web) in perl

I'm trying to detect whether a link is broken, that is, whether it's a web address I could paste into my browser and actually find a web page. I've tried two methods so far that I found online, and both give me false positives (LWP::UserAgent and LWP::Simple).
#!/usr/bin/perl -w
use strict;
use LWP::UserAgent;
my $url1 = 'http://www.gutenberg.org';
my $url2 = 'http://www.gooasdfzzzle.com.no/thisisnotarealsite';
my $ua = LWP::UserAgent->new;
$ua->agent("Mozilla/8.0"); # Pretend to be Mozilla
my $req = HTTP::Request->new(GET => "$url1");
my $res = $ua->request($req);
if ($res->is_success) {
    print "Success!\n";
} else {
    print "Error: " . $res->status_line . "\n";
}

$req = HTTP::Request->new(GET => "$url2");
$res = $ua->request($req);
if ($res->is_success) {
    print "Success!\n";
} else {
    print "Error: " . $res->status_line . "\n";
}
Which is giving me output of:
Success!
Success!
and then there's
#!/usr/bin/perl -w
use strict;
use LWP::Simple;
my $url1 = 'http://www.gutenberg.org';
my $url2 = 'http://www.gooasdfzzzle.com.no/thisisnotarealsite';
if (head("$url1")) {
print "Yes\n";
} else {
print "No\n";
}
if (head("$url2")) {
print "Yes\n";
} else {
print "No\n";
}
Which is giving me an output of:
Yes
Yes
Am I missing something here?
Your code worked fine for me; I can only see a problem if you're running behind a VPN or gateway, as previously stated. Always use strict and warnings. Here is an alternative approach so you are not initializing a new request object every time you want to check a link:
use strict;
use warnings;
use LWP::UserAgent;
sub check_url {
    my ($url) = @_;

    my $ua  = LWP::UserAgent->new;
    my $req = HTTP::Request->new(HEAD => $url);
    my $res = $ua->request($req);

    return $res->status_line if $res->is_error;
    return "Success: $url";
}
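Usage would then be as simple as, for example:

for my $url ('http://www.gutenberg.org', 'http://www.gooasdfzzzle.com.no/thisisnotarealsite') {
    print check_url($url), "\n";
}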

Web crawler using perl

I want to develop a web crawler that starts from a seed URL and then crawls 100 HTML pages it finds belonging to the same domain as the seed URL, keeping a record of the traversed URLs and avoiding duplicates. I have written the following, but the $url_count value does not seem to be incremented, and the retrieved URLs contain links even from other domains. How do I solve this? Here I have used stackoverflow.com as my starting URL.
use strict;
use warnings;
use LWP::Simple;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;

##open file to store links
open my $file1, ">>", "extracted_links.txt";
select($file1);

##starting URL
my @urls = 'http://stackoverflow.com/';

my $browser = LWP::UserAgent->new('IE 6');
$browser->timeout(10);
my %visited;
my $url_count = 0;

while (@urls)
{
    my $url = shift @urls;
    if (exists $visited{$url}) ##check if URL already exists
    {
        next;
    }
    else
    {
        $url_count++;
    }

    my $request = HTTP::Request->new(GET => $url);
    my $response = $browser->request($request);
    if ($response->is_error())
    {
        printf "%s\n", $response->status_line;
    }
    else
    {
        my $contents = $response->content();
        $visited{$url} = 1;
        my @lines = split(/\n/, $contents);
        foreach my $line (@lines)
        {
            $line =~ m#(((http\:\/\/)|(www\.))([a-z]|[A-Z]|[0-9]|[/.]|[~]|[-_]|[()])*[^'">])#g;
            print "$1\n";
            push @urls, $$line[2];
        }
        sleep 60;
        if ($visited{$url} == 100)
        {
            last;
        }
    }
}
close $file1;
Several points: your URL parsing is fragile, and you certainly won't get relative links. Also, you don't test for 100 links but for 100 matches of the current URL, which is almost certainly not what you mean. Finally, I'm not too familiar with LWP, so I'm going to show an example using the Mojolicious suite of tools.
This seems to work, perhaps it will give you some ideas.
#!/usr/bin/env perl
use strict;
use warnings;

use Mojo::UserAgent;
use Mojo::URL;

##open file to store links
open my $log, '>', 'extracted_links.txt' or die $!;

##starting URL
my $base = Mojo::URL->new('http://stackoverflow.com/');
my @urls = $base;

my $ua = Mojo::UserAgent->new;
my %visited;
my $url_count = 0;

while (@urls) {
    my $url = shift @urls;
    next if exists $visited{$url};

    print "$url\n";
    print $log "$url\n";

    $visited{$url} = 1;
    $url_count++;

    # find all <a> tags and act on each
    $ua->get($url)->res->dom('a')->each(sub{
        my $url = Mojo::URL->new($_->{href});
        if ( $url->is_abs ) {
            # skip links that leave the seed domain
            return unless $url->host eq $base->host;
        } else {
            # resolve relative links against the base so they can be fetched later
            $url = $url->to_abs($base);
        }
        push @urls, $url;
    });

    last if $url_count == 100;
    sleep 1;
}

How can I write a simple HTTP proxy in Perl?

I don't want to use the HTTP::Proxy package because I want to dump out a couple of requests. My one-liner looks like this, but it breaks when trying to pass the header in:
perl -MData::Dumper -MHTTP::Daemon -MHTTP::Status -MLWP::UserAgent -e 'my $ua = LWP::UserAgent->new;my $d=new HTTP::Daemon(LocalPort=>1999);print "Please contact me at: <", $d->url, ">\n";while (my $c = $d->accept) {while (my $r = $c->get_request) {if ($r->method eq 'GET' and $r->url->path eq "/uploader") {$c->send_response("whatever.");print Dumper($r);}else{$response=$ua->request($r->method,"http://localhost:1996".$r->uri,$r->headers,$r->content);$c->send_response($response);}}}'
formatted, that's:
#perl -e '
use Data::Dumper;
use HTTP::Daemon;
use HTTP::Status;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
my $d = new HTTP::Daemon(LocalPort => 1999);
print "Please contact me at: < ", $d->url, " >\n";

while (my $c = $d->accept) {
    while (my $r = $c->get_request) {
        if ($r->method eq 'GET' and $r->url->path eq "/uploader") {
            $c->send_response("whatever.");
            print Dumper($r);
        } else {
            $response = $ua->request(
                $r->method,
                "http://localhost:1996" . $r->uri,
                $r->headers,
                $r->content);
            $c->send_response($response);
        }
    }
}#'
So I can't just pass in the request, because I need to change the host, and it seems I can't just pass in the headers either... so what should I do to keep it short?
So can anyone make this a better one-liner?
Aw shoot, I fixed it with this:
#perl -e '
use Data::Dumper;
use HTTP::Daemon;
use HTTP::Status;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
my $d = new HTTP::Daemon(LocalPort => 1999);
print "Please contact me at: < ", $d->url, " >\n";

while (my $c = $d->accept) {
    while (my $r = $c->get_request) {
        if ($r->method eq "GET" and $r->url->path eq "/uploaded") {
            $c->send_response("whatever.");
            print Dumper($r);
        } else {
            $response = $ua->request( HTTP::Request->new(
                $r->method,
                "http://localhost:1996" . $r->uri,
                $r->headers,
                $r->content));
            $c->send_response($response);
        }
    }
}#'
Note the HTTP::Request->new. So it works; it's a tad slow, but that's okay.
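If you want to trim it further, one possible variation (an untested sketch, still assuming the same http://localhost:1996 backend as above) is to clone the incoming request and just repoint its URI instead of rebuilding it field by field:

# inside the else branch
my $fwd = $r->clone;                        # HTTP::Request inherits clone() from HTTP::Message
$fwd->uri("http://localhost:1996" . $r->uri->path_query);
$fwd->remove_header('Host');                # let LWP set the Host header for the new target
$c->send_response($ua->request($fwd));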