Perl module (WWW::Mechanize): select image from page

I tried to search SO but couldn't find close answers.
What I have is this (links and URLs will be changed, but concepts will be exactly the same)
#!/usr/bin/perl
#Some of the modules are going to be unused for now
use Win32::OLE;
use Win32::OLE::Variant;
use LWP::Simple;
use DBI;
use DBD::mysql;
use WWW::Mechanize qw();
$url = 'http://example.com';
$mechanize = WWW::Mechanize->new(autocheck => 1); #BTW what's autocheck=>1 for?
$mechanize->get($url);
$content = $mechanize->content();
print $content; #Shows the HTML (OK)
$mechanize->form_name('search');
$mechanize->field('level', '100');
$response = $mechanize->submit();
print $response->content(); #Shows the html of the submitted page (OK);
Now this new form has a randomly generated image whose src is not a .jpg or any other image extension. All I want to do is save that image (I know the name) to my folder. The image tag is <img src="someImage.php"> and I would like to save it as someImage.jpg in a folder.

It helps to read the documentation of the software you are using. You need the image methods.
use strictures;
use WWW::Mechanize qw();
my $m = WWW::Mechanize->new; # autocheck (die automatically on failed requests) is the default since v1.50 (2008)
$m->get('file:///tmp/so11184595.html');
for my $i ($m->images) {
    $m->mirror($i->url_abs, 'some/someImage.jpg')
        if 'someImage.php' eq $i->url;
}
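Adapted to the original script, the same two calls slot in right after the form submit (a sketch; the form name, field, and someImage.php come from the question, the output filename is arbitrary):
# ... after $mechanize->submit() in the question's script ...
# images() returns a WWW::Mechanize::Image object for every <img> on the current page;
# mirror() (inherited from LWP::UserAgent) saves the linked resource to a local file
for my $img ($mechanize->images) {
    $mechanize->mirror($img->url_abs, 'someImage.jpg')
        if $img->url eq 'someImage.php';
}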

HTML::TableExtract on an HTTPS site

I've created a perl script to use HTML::TableExtract to scrape data from tables on a site.
It works great for dumping out table data from unsecured sites (i.e. HTTP sites), but when I try HTTPS sites, it doesn't work (the tables_report line just prints nothing, when it should print a bunch of table data).
However, if I take the content of that HTTPS page, and save it to an html file and then post it on an unsecured HTTP site (and change my content to point to this HTTP page), this script works as expected.
Anyone know how I can get this to work over HTTPS?
#!/usr/bin/perl
use lib qw( ..);
use HTML::TableExtract;
use LWP::Simple;
use Data::Dumper;
# DOESN'T work:
my $content = get("https://datatables.net/");
# DOES work:
# my $content = get("http://www.w3schools.com/html/html_tables.asp");
my $te = HTML::TableExtract->new();
$te->parse($content);
print $te->tables_report(show_content=>1);
print "\n";
print "End\n";
The sites mentioned above for $content are just examples.. these aren't really the sites I'm extracting, but they work just like the site I'm really trying to scrape.
One option, I guess, is to use Perl to download the page locally first and extract from there, but I'd rather not if there's an easier way to do this (to anyone who helps: please don't spend any crazy amount of time coming up with a complicated solution!).
The problem is related to the user agent that LWP::Simple uses, which is blocked by that site. Use LWP::UserAgent and set an allowed user agent, like this:
use strict;
use warnings;
use LWP::UserAgent;
my $ua = LWP::UserAgent->new;
my $url = 'https://datatables.net/';
$ua->agent("Mozilla/5.0"); # set user agent
my $res = $ua->get($url); # send request
# check the outcome
if ($res->is_success) {
    # ok -> I simply print the content in this example, you should parse it
    print $res->decoded_content;
}
else {
    # ko
    print "Error: ", $res->status_line, "\n";
}
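To tie it back to the question, the decoded content can then go straight into HTML::TableExtract, reusing the parsing code from the question (a sketch):
use HTML::TableExtract;
my $te = HTML::TableExtract->new();
$te->parse($res->decoded_content);
print $te->tables_report(show_content => 1);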
This is because datatables.net is blocking LWP::Simple requests. You can confirm this by running the code below:
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
print is_success(getprint("https://datatables.net/"));
Output:
$ perl test.pl
403 Forbidden <URL:https://datatables.net/>
You could try using LWP::RobotUA. The code below works fine for me.
#!/usr/bin/perl
use strict;
use warnings;
use LWP::RobotUA;
use HTML::TableExtract;
my $ua = LWP::RobotUA->new( 'bot_chankey/1.1', 'chankeypathak@stackoverflow.com' );
$ua->delay(5/60); # 5 second delay between requests
my $response = $ua->get('https://datatables.net/');
if ( $response->is_success ) {
    my $te = HTML::TableExtract->new();
    $te->parse($response->content);
    print $te->tables_report(show_content=>1);
}
else {
    die $response->status_line;
}
In the end, a combination of Miguel's and Chankey's responses provided my solution. Miguel's made up most of my code, so I selected that as the answer, but here is my "final" code (I have a lot more to do, but this is all I couldn't figure out; the rest should be no problem).
I couldn't quite get either approach to work exactly as posted, but they got me 99% of the way; I just had to figure out how to get around the error "certificate verify failed". I found that answer right away with Miguel's method, so in the end I mostly used his code, but both responses were great!
#!/usr/bin/perl
use lib qw( ..);
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TableExtract;
use LWP::RobotUA;
use Data::Dumper;
my $ua = LWP::UserAgent->new(
ssl_opts => { SSL_verify_mode => 'SSL_VERIFY_PEER' },
);
my $url = 'https://WebsiteIUsedWasSomethingElse.com';
$ua->agent("Mozilla/5.0"); # set user agent
my $res = $ua->get($url); # send request
# check the outcome
if ($res->is_success)
{
    my $te = HTML::TableExtract->new();
    $te->parse($res->content);
    print $te->tables_report(show_content=>1);
}
else {
    # ko
    print "Error: ", $res->status_line, "\n";
}
my $url  = "https://ohsesfire01.summit.network/reports/slices";
my $user = 'xxxxxx';
my $pass = 'xxxxxx';
my $ua      = LWP::UserAgent->new;
my $request = HTTP::Request->new(GET => $url);
# authenticate with HTTP basic auth
$request->authorization_basic($user, $pass);
my $page = $ua->request($request);
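For reference, the "certificate verify failed" error mentioned above usually means LWP cannot find a CA bundle to check the server certificate against. (Also note that in the final code above, SSL_verify_mode is given the string 'SSL_VERIFY_PEER' rather than the IO::Socket::SSL constant, so it is probably not enabling verification at all.) Below is a minimal sketch of the two common fixes, assuming LWP::Protocol::https, IO::Socket::SSL, and Mozilla::CA are installed; disabling verification should be a last resort.
use LWP::UserAgent;
use Mozilla::CA;
use IO::Socket::SSL qw(SSL_VERIFY_NONE);
# Option 1: verify against the CA bundle shipped with Mozilla::CA
my $ua = LWP::UserAgent->new(
    ssl_opts => { verify_hostname => 1, SSL_ca_file => Mozilla::CA::SSL_ca_file() },
);
# Option 2 (last resort): skip certificate verification entirely
my $insecure_ua = LWP::UserAgent->new(
    ssl_opts => { verify_hostname => 0, SSL_verify_mode => SSL_VERIFY_NONE },
);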

XPath won't find id

I'm failing to get a node by its id.
The code is straightforward and should be self-explanatory.
#!/usr/bin/perl
use Encode;
use utf8;
use LWP::UserAgent;
use URI::URL;
use Data::Dumper;
use HTML::TreeBuilder::XPath;
my $url = 'https://www.airbnb.com/rooms/1976460';
my $browser = LWP::UserAgent->new;
my $resp = $browser->get( $url, 'User-Agent' => 'Mozilla/5.0' );
if ($resp->is_success) {
    my $base = $resp->base || '';
    print "-> base URL: $base\n";
    my $data = $resp->decoded_content;
    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse_content( $resp->decoded_content() );
    binmode STDOUT, ":encoding(UTF-8)";
    my $price_day = $tree->find('.//*[@id="price_amount"]/');
    print Dumper($price_day);
    $tree->delete();
}
The code above prints:
-> base URL: https://www.airbnb.com/rooms/1976460
$VAR1 = undef;
How can I select a node by its ID?
Thanks in advance.
Take that / off the end of that XPath.
.//*[@id="price_amount"]
should do. As it is, it's not valid XPath.
There is a trailing slash in your XPath that you need to remove:
my $price_day = $tree->find('.//*[@id="price_amount"]');
However, from my own testing, I believe that HTML::TreeBuilder::XPath is also having trouble parsing that specific URL. Perhaps because of the conditional comments?
As an alternative approach, I would recommend using Mojo::UserAgent and Mojo::DOM instead.
The following uses the CSS selector div#price_amount to easily find the desired element and print it out.
use strict;
use warnings;
use Mojo::UserAgent;
my $url = 'https://www.airbnb.com/rooms/1976460';
my $dom = Mojo::UserAgent->new->get($url)->res->dom;
my $price_day = $dom->at(q{div#price_amount})->all_text;
print $price_day, "\n";
Outputs:
$285
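One robustness note on the snippet above: at() returns undef when the selector matches nothing, so guarding the call avoids a method-call-on-undef error (a minimal sketch):
use strict;
use warnings;
use Mojo::UserAgent;
my $dom = Mojo::UserAgent->new->get('https://www.airbnb.com/rooms/1976460')->res->dom;
if ( my $el = $dom->at('div#price_amount') ) {
    print $el->all_text, "\n";
}
else {
    warn "no element with id price_amount found\n";
}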
Note, there is a helpful 8 minute introductory video to this set of modules at Mojocast Episode 5.

Downloading Text from Several Links using WWW::Mechanize

For an entire week I have been attempting to write code that will download the links from a webpage and then loop through each link to dump the content of each link's page. The original webpage I downloaded has 500 links to separate web pages, each of which contains important information for me. I only want to go one level down. However, I am having several issues.
RECAP: I want to download the links from a webpage and automatically have my program print the text contained in those linked pages. I would prefer to have it written to a file.
1) When I download the links from the original website, the useful ones are not written out fully (i.e. they say "/festevents.nsf/all?openform", which is not a usable URL).
2) I have been unable to print the text content of the page. I have been able to print the font details, but that is useless.
#Download all the modules I used#
use LWP::UserAgent;
use HTML::TreeBuilder;
use HTML::FormatText;
use WWW::Mechanize;
use Data::Dumper;
#Download original webpage and acquire 500+ Links#
$url = "http://wx.toronto.ca/festevents.nsf/all?openform";
my $mechanize = WWW::Mechanize->new(autocheck => 1);
$mechanize->get($url);
my $title = $mechanize->title;
print "<b>$title</b><br />";
my @links = $mechanize->links;
foreach my $link (@links) {
    # Retrieve the link URL
    my $href = $link->url_abs;
    #
    # $URL1= get("$link");
    #
    my $ua = LWP::UserAgent->new;
    my $response = $ua->get($href);
    unless($response->is_success) {
        die $response->status_line;
    }
    my $URL1 = $response->decoded_content;
    die Dumper($URL1);
    #This part of the code is just to "clean up" the text
    $Format=HTML::FormatText->new;
    $TreeBuilder=HTML::TreeBuilder->new;
    $TreeBuilder->parse($URL1);
    $Parsed=$Format->format($TreeBuilder);
    open(FILE, ">TorontoParties.txt");
    print FILE "$Parsed";
    close (FILE);
}
Please help me! I am desperate! If possible, please explain the logic behind each step; I have been frying my brain on this for a week and want to see other people's logic behind the problems.
Too much work. Study the WWW::Mechanize API to realise that almost all of that functionality is already built-in. Untested:
use strictures;
use WWW::Mechanize qw();
use autodie qw(:all);
open my $h, '>:encoding(UTF-8)', 'TorontoParties.txt';
my $mechanize = WWW::Mechanize->new;
$mechanize->get('http://wx.toronto.ca/festevents.nsf/all?openform');
foreach my $link (
    $mechanize->find_all_links(url_regex => qr'/festevents[.]nsf/[0-9a-f]{32}/[0-9a-f]{32}[?]OpenDocument')
) {
    $mechanize->get($link->url_abs);
    print {$h} $mechanize->content(format => 'text');
}
close $h;
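The two pieces doing the heavy lifting here: find_all_links(url_regex => ...) filters the 500+ links down to the event pages, and url_abs turns the relative "/festevents.nsf/..." hrefs into absolute URLs (your problem 1), while content(format => 'text') returns the current page with the HTML markup stripped (it needs HTML::TreeBuilder installed), replacing the manual HTML::FormatText step (your problem 2).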

How do I download a file using Perl?

I'm running Perl on Windows XP, and I need to download a file from the URL http://marinetraffic2.aegean.gr/ais/getkml.aspx.
How should I do this? I have attempted using WWW::Mechanize, but I can't get my head around it.
This is the code I used:
my $url = 'marinetraffic2.aegean.gr/ais/getkml.aspx';
my $mech = WWW::Mechanize->new;
$mech->get($url);
I'd use LWP::Simple for this.
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
my $url = 'http://marinetraffic2.aegean.gr/ais/getkml.aspx';
my $file = 'data.kml';
getstore($url, $file);
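getstore returns the HTTP status code of the request, so it is worth checking it; is_success is also exported by LWP::Simple (a small addition to the snippet above):
my $rc = getstore($url, $file);
die "Download failed with status $rc\n" unless is_success($rc);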
I used File::Fetch as this is a core Perl module (I didn't need to install any additional packages) and will try a number of different ways to download a file depending on what's installed on the system.
use File::Fetch;
my $url = 'http://www.example.com/file.txt';
my $ff = File::Fetch->new(uri => $url);
my $file = $ff->fetch() or die $ff->error;
Note that this module will in fact try to use LWP first if it is installed...
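By default fetch() saves into the current working directory and returns the full path of the downloaded file; pass to to choose the destination (a sketch, the directory name is made up and assumed to exist):
use File::Fetch;
my $ff   = File::Fetch->new(uri => 'http://www.example.com/file.txt');
my $path = $ff->fetch(to => '/tmp/downloads') or die $ff->error;
print "saved to $path\n";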
use WWW::Mechanize;
my $url = 'http://marinetraffic2.aegean.gr/ais/getkml.aspx';
my $local_file_name = 'getkml.aspx';
my $mech = WWW::Mechanize->new;
$mech->get( $url, ":content_file" => $local_file_name );
This in fact wraps around the LWP::UserAgent->get.
More details can be found on the WWW::Mechanize documentation page.
If downloading that file is all you actually do, you'd better go with @davorg's answer.
If this is part of a bigger process, you can access the resource you downloaded as a string using the content method on your $mech object.
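For example, continuing from the $mech snippet above (a sketch):
$mech->get('http://marinetraffic2.aegean.gr/ais/getkml.aspx');
my $kml = $mech->content;   # the downloaded body as a string
print length($kml), " bytes downloaded\n";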
Just in case someone needs a one-liner ;)
perl -e 'use File::Fetch;my $url = "http://192.168.1.10/myFile.sh";my $ff = File::Fetch->new(uri => $url);my $file = $ff->fetch() or die $ff->error;'
Just change the value of $url.

How can I send cookies with Perl's LWP::Simple?

use LWP::Simple;
use HTML::LinkExtor;
use Data::Dumper;
#my $url = shift @ARGV;
my $content = get('example.com?GET=whateverIwant');
my $parser = HTML::LinkExtor->new(); #create LinkExtor object with no callbacks
$parser->parse($content); #parse content
Now, if I want to send POST and cookie info as well with the HTTP header, how can I configure that with the get function? Or do I have to write my own method?
My main interest is cookies, then POST!
LWP::Simple is for very simple HTTP GET requests. If you need to do anything more complex (like cookies), you have to upgrade to a full LWP::UserAgent. The cookie_jar is an HTTP::Cookies object, and you can use its set_cookie method to add a cookie.
use LWP::UserAgent;
my $ua = LWP::UserAgent->new(cookie_jar => {}); # create an empty cookie jar
$ua->cookie_jar->set_cookie(...);
my $rsp = $ua->get('http://example.com?GET=whateverIwant');
die $rsp->status_line unless $rsp->is_success;
my $content = $rsp->decoded_content;
...
The LWP::UserAgent also has a post method.
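For example, a minimal sketch with made-up cookie and form values (the argument order for set_cookie is: version, key, value, path, domain, port, path_spec, secure, maxage, discard):
use LWP::UserAgent;
my $ua = LWP::UserAgent->new(cookie_jar => {});
# add a cookie to the jar; name, value and domain here are placeholders
$ua->cookie_jar->set_cookie(0, 'session', 'abc123', '/', 'example.com', 80, 0, 0, 86400, 0);
# POST form data (field names are placeholders too)
my $rsp = $ua->post('http://example.com/submit', { field1 => 'value1', field2 => 'value2' });
die $rsp->status_line unless $rsp->is_success;
print $rsp->decoded_content;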
You might want to use WWW::Mechanize instead. It already glues together most of the stuff that you want:
use WWW::Mechanize;
my $mech = WWW::Mechanize->new;
$mech->cookie_jar->set_cookie(...);
$mech->get( ... );
my @links = $mech->links;