How do I download a file using Perl?

I'm running Perl on Windows XP, and I need to download a file from the URL http://marinetraffic2.aegean.gr/ais/getkml.aspx.
How should I do this? I have attempted using WWW::Mechanize, but I can't get my head around it.
This is the code I used:
my $url = 'http://marinetraffic2.aegean.gr/ais/getkml.aspx';
my $mech = WWW::Mechanize->new;
$mech->get($url);

I'd use LWP::Simple for this.
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
my $url = 'http://marinetraffic2.aegean.gr/ais/getkml.aspx';
my $file = 'data.kml';
getstore($url, $file);
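getstore returns the HTTP status code of the response, so it's worth checking that the download actually succeeded. LWP::Simple also exports an is_success helper; a minimal addition to the script above:
my $status = getstore($url, $file);
die "Download failed: $status\n" unless is_success($status);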

I used File::Fetch, as it is a core Perl module (I didn't need to install any additional packages), and it will try a number of different ways to download a file depending on what's installed on the system.
use File::Fetch;
my $url = 'http://www.example.com/file.txt';
my $ff = File::Fetch->new(uri => $url);
my $file = $ff->fetch() or die $ff->error;
Note that this module will in fact try to use LWP first if it is installed...
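If you want the file saved somewhere other than the current directory, fetch accepts a target directory (a small sketch continuing from the code above; the /tmp path is just an example):
my $where = $ff->fetch( to => '/tmp' ) or die $ff->error;
print "Saved to $where\n"; # fetch returns the full path on success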

use WWW::Mechanize;
my $url = 'http://marinetraffic2.aegean.gr/ais/getkml.aspx';
my $local_file_name = 'getkml.aspx';
my $mech = WWW::Mechanize->new;
$mech->get( $url, ":content_file" => $local_file_name );
This in fact wraps LWP::UserAgent's get method.
More details can be found in the WWW::Mechanize documentation.

If downloading that file is all you actually do, you'd better go with @davorg's answer.
If this is part of a bigger process, you can access the downloaded resource as a string via the content method on your $mech object.
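For example, continuing from the question's code:
my $kml = $mech->content; # the downloaded document, as a string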

Just in case someone needs a one-liner ;)
perl -e 'use File::Fetch;my $url = "http://192.168.1.10/myFile.sh";my $ff = File::Fetch->new(uri => $url);my $file = $ff->fetch() or die $ff->error;'
Just change the value of $url.

Related

HTML::TableExtract an HTTPS site

I've created a perl script to use HTML::TableExtract to scrape data from tables on a site.
It works great to dump out table data for unsecured sites (i.e. HTTP sites), but when I try HTTPS sites, it doesn't work (the tables_report line just prints blank... it should print a bunch of table data).
However, if I take the content of that HTTPS page, save it to an HTML file, and then post it on an unsecured HTTP site (and point my script at this HTTP page), the script works as expected.
Anyone know how I can get this to work over HTTPS?
#!/usr/bin/perl
use lib qw( ..);
use HTML::TableExtract;
use LWP::Simple;
use Data::Dumper;
# DOESN'T work:
my $content = get("https://datatables.net/");
# DOES work:
# my $content = get("http://www.w3schools.com/html/html_tables.asp");
my $te = HTML::TableExtract->new();
$te->parse($content);
print $te->tables_report(show_content=>1);
print "\n";
print "End\n";
The sites mentioned above for $content are just examples... they aren't really the sites I'm extracting, but they behave just like the site I'm really trying to scrape.
One option, I guess, is to use Perl to download the page locally first and extract from there, but I'd rather not if there's an easier way (anyone who helps, please don't spend a crazy amount of time coming up with a complicated solution!).
The problem is related to the user agent that LWP::Simple uses, which is blocked by that site. Use LWP::UserAgent and set an allowed user agent, like this:
use strict;
use warnings;
use LWP::UserAgent;
my $ua = LWP::UserAgent->new;
my $url = 'https://datatables.net/';
$ua->agent("Mozilla/5.0"); # set user agent
my $res = $ua->get($url); # send request
# check the outcome
if ($res->is_success) {
    # ok: simply print the content in this example; you should parse it instead
    print $res->decoded_content;
}
else {
    # not ok: report the failure
    print "Error: ", $res->status_line, "\n";
}
This is because datatables.net is blocking LWP::Simple requests. You can confirm this with the code below:
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
print is_success(getprint("https://datatables.net/"));
Output:
$ perl test.pl
403 Forbidden <URL:https://datatables.net/>
You could try using LWP::RobotUA. The code below works fine for me.
#!/usr/bin/perl
use strict;
use warnings;
use LWP::RobotUA;
use HTML::TableExtract;
my $ua = LWP::RobotUA->new( 'bot_chankey/1.1', 'chankeypathak@stackoverflow.com' );
$ua->delay(5/60); # delay is given in minutes: 5/60 = 5 seconds between requests
my $response = $ua->get('https://datatables.net/');
if ( $response->is_success ) {
    my $te = HTML::TableExtract->new();
    $te->parse($response->content);
    print $te->tables_report(show_content => 1);
}
else {
    die $response->status_line;
}
In the end, a combination of Miguel's and Chankey's responses provided my solution. Miguel's made up most of my code, so I selected that as the answer, but here is my "final" code (there's a lot more to do, but this was all I couldn't figure out; the rest should be no problem).
I couldn't quite get either of the examples from Miguel/Chankey to work as posted, but they got me 99% of the way; then I just had to figure out how to get around the error "certificate verify failed". I found that answer right away with Miguel's method, so in the end I mostly used his code, but both responses were great!
#!/usr/bin/perl
use lib qw( ..);
use strict;
use warnings;
use LWP::UserAgent;
use IO::Socket::SSL;
use HTML::TableExtract;
use LWP::RobotUA;
use Data::Dumper;
# Skip certificate verification to get around "certificate verify failed".
# (Fine for testing; the proper fix is to install the site's CA chain, e.g. via Mozilla::CA.)
my $ua = LWP::UserAgent->new(
    ssl_opts => { SSL_verify_mode => IO::Socket::SSL::SSL_VERIFY_NONE, verify_hostname => 0 },
);
my $url = 'https://WebsiteIUsedWasSomethingElse.com';
$ua->agent("Mozilla/5.0"); # set user agent
my $res = $ua->get($url); # send request
# check the outcome
if ($res->is_success) {
    my $te = HTML::TableExtract->new();
    $te->parse($res->content);
    print $te->tables_report(show_content => 1);
}
else {
    # report the failure
    print "Error: ", $res->status_line, "\n";
}
my $url = "https://ohsesfire01.summit.network/reports/slices";
my $user = 'xxxxxx';
my $pass = 'xxxxxx';
my $ua = new LWP::UserAgent;
my $request = new HTTP::Request GET=> $url;
# authenticate
$request->authorization_basic($user, $pass);
my $page = $ua->request($request);
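As with the earlier examples, it's worth checking the response before using it; continuing from the snippet above:
die "Request failed: ", $page->status_line, "\n" unless $page->is_success;
my $html = $page->decoded_content;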

XPath won't find id

I'm failing to get a node by its id.
The code is straightforward and should be self-explanatory.
#!/usr/bin/perl
use Encode;
use utf8;
use LWP::UserAgent;
use URI::URL;
use Data::Dumper;
use HTML::TreeBuilder::XPath;
my $url = 'https://www.airbnb.com/rooms/1976460';
my $browser = LWP::UserAgent->new;
my $resp = $browser->get( $url, 'User-Agent' => 'Mozilla/5.0' );
if ($resp->is_success) {
    my $base = $resp->base || '';
    print "-> base URL: $base\n";
    my $data = $resp->decoded_content;
    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse_content($data);
    binmode STDOUT, ":encoding(UTF-8)";
    my $price_day = $tree->find('.//*[@id="price_amount"]/');
    print Dumper($price_day);
    $tree->delete();
}
The code above prints:
-> base URL: https://www.airbnb.com/rooms/1976460
$VAR1 = undef;
How can I select a node by its ID?
Thanks in advance.
Take that / off the end of that XPath.
.//*[#id="price_amount"]
should do. As it is, it's not valid XPath.
There is a trailing slash in your XPath that you need to remove:
my $price_day = $tree->find('.//*[@id="price_amount"]');
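If you only need the node's text rather than the node object, findvalue returns it directly (a small sketch, assuming the element exists in the parsed tree):
my $price_text = $tree->findvalue('.//*[@id="price_amount"]');
print "$price_text\n";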
However, from my own testing, I believe that HTML::TreeBuilder::XPath is also having trouble parsing that specific URL. Perhaps because of the conditional comments?
As an alternative approach, I would recommend using Mojo::UserAgent and Mojo::DOM instead.
The following uses the CSS selector div#price_amount to easily find your desired element and print it out.
use strict;
use warnings;
use Mojo::UserAgent;
my $url = 'https://www.airbnb.com/rooms/1976460';
my $dom = Mojo::UserAgent->new->get($url)->res->dom;
my $price_day = $dom->at(q{div#price_amount})->all_text;
print $price_day, "\n";
Outputs:
$285
Note, there is a helpful 8-minute introductory video on this set of modules at Mojocast Episode 5.
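One caveat with this approach: at returns undef when nothing matches, so a guard avoids a crash if the page layout changes (a sketch):
my $el = $dom->at('div#price_amount');
print $el ? $el->all_text : 'price element not found', "\n";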

My LWP Script Not Working

I am running
#!/usr/bin/perl -w
use strict;
use LWP::Simple;
my $site = get('http://www.google.com/');
The variable $site has the HTML code.
You can also use the getstore function to save the HTML data to a file, like:
my $http_code = getstore( 'http://www.google.com/', 'google.html' );
It would help you a lot if you could see the reason for the failure. I suggest you use the full LWP instead of the Simple version, like this:
#!/usr/bin/perl
use strict;
use warnings;
use LWP;
my $ua = LWP::UserAgent->new;
my $response = $ua->get('http://www.google.com/');
die "Couldn't get it: ", $response->status_line unless $response->is_success;
my $site = $response->decoded_content;
print 'Got it.';

Perl module (WWW::Mechanize) select image from page

I tried to search SO but couldn't find close answers.
What I have is this (links and URLs have been changed, but the concepts are exactly the same):
#!/usr/bin/perl
#Some of the modules are going to be unused for now
use Win32::OLE;
use Win32::OLE::Variant;
use LWP::Simple;
use DBI;
use DBD::mysql;
use WWW::Mechanize qw();
$url = 'http://example.com';
$mechanize = WWW::Mechanize->new(autocheck => 1); #BTW what's autocheck=>1 for?
$mechanize->get($url);
$content = $mechanize->content();
print $content; #Shows the HTML (OK)
$mechanize->form_name('search');
$mechanize->field('level', '100');
$response = $mechanize->submit();
print $response->content(); #Shows the html of the submitted page (OK);
Now this new form has a randomly generated image that is not a .jpg or any other standard image format. All I want to do is save that image (I know the name) to my folder. The image tag is <img src="someImage.php"> and I would like to save it as someImage.jpg in a folder.
It helps to read the documentation of the software you are using, which you didn't do. You need the image methods.
use strictures;
use WWW::Mechanize qw();
my $m = WWW::Mechanize->new; # autocheck is default since v1.50 (year 2008)
$m->get('file:///tmp/so11184595.html');
for my $i ($m->images) {
    $m->mirror($i->url_abs, 'some/someImage.jpg')
        if 'someImage.php' eq $i->url;
}
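If you already know the image's name, find_image can locate it directly instead of looping over all the images (a sketch; url_regex is one of several supported match criteria):
my $img = $m->find_image( url_regex => qr/someImage\.php/ );
$m->mirror( $img->url_abs, 'some/someImage.jpg' ) if $img;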

How can I send cookies with Perl's LWP::Simple?

use LWP::Simple;
use HTML::LinkExtor;
use Data::Dumper;
#my $url = shift @ARGV;
my $content = get('http://example.com?GET=whateverIwant');
my $parser = HTML::LinkExtor->new(); # create LinkExtor object with no callbacks
$parser->parse($content); # parse content
Now, if I want to send POST and cookie info with the HTTP header as well, how can I configure that with the get function? Or do I have to write my own method?
My main interest is cookies, then POST!
LWP::Simple is for very simple HTTP GET requests. If you need to do anything more complex (like cookies), you have to upgrade to a full LWP::UserAgent. Its cookie_jar is an HTTP::Cookies object, and you can use its set_cookie method to add a cookie.
use LWP::UserAgent;
my $ua = LWP::UserAgent->new(cookie_jar => {}); # create an empty cookie jar
$ua->cookie_jar->set_cookie(...);
my $rsp = $ua->get('http://example.com?GET=whateverIwant');
die $rsp->status_line unless $rsp->is_success;
my $content = $rsp->decoded_content;
...
LWP::UserAgent also has a post method.
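For reference, set_cookie takes positional arguments; a sketch with made-up example values:
$ua->cookie_jar->set_cookie(
    0, 'session', 'abc123',   # version, cookie name, value (example values)
    '/', 'example.com', 80,   # path, domain, port
    0, 0, 86400, 0,           # path_spec, secure, maxage, discard
);
my $rsp = $ua->post('http://example.com', { field => 'value' }); # sends a form-encoded POST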
You might want to use WWW::Mechanize instead. It already glues together most of the stuff that you want:
use WWW::Mechanize;
my $mech = WWW::Mechanize->new;
$mech->cookie_jar->set_cookie(...);
$mech->get( ... );
my @links = $mech->links;