Status 403 when using the LWP::Simple - perl

Below is the simple code that I used:
use strict;
use warnings;
use LWP::Simple;
my $url = "http://automanga.com/uploads/manga/bleach/chapters/12/09.jpg";
my $file = "09.jpg";
my $rc = getstore($url, $file);
if (is_error($rc))
{
print "getstore failed with $rc\n";
}
The link works when I try it in a browser, but getstore just returns a 403 status.
I'd appreciate any advice on this.

The book Perl & LWP (available legally for free online) is a great introduction to the LWP toolkit. In particular, the section Adding Extra Request Header Lines has a useful discussion of the kind of problem you're having here.
Unfortunately, LWP::Simple isn't up to the job. You'll want to switch to LWP::UserAgent and HTTP::Request instead. Then you can use the agent() method on your LWP::UserAgent object and header() on your HTTP::Request object to craft exactly the request that you need.
Update: I played with this a bit during my lunch break. Looks like they are blocking on the UserAgent string. Just changing that to anything will make it work.
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
my $ua = LWP::UserAgent->new;
$ua->agent('Foo');
my $url = 'http://automanga.com/uploads/manga/bleach/chapters/12/09.jpg';
my $file = '09.jpg';
my $resp = $ua->get($url);
if ($resp->is_error) {
die $resp->status_line, "\n";
}
open my $fh, '>', $file or die $!;
binmode $fh;
print $fh $resp->decoded_content;
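If the goal is to keep getstore's save-to-file behaviour while sending a custom User-Agent, one possible sketch is below. It relies on the ':content_file' option of LWP::UserAgent's get(), which writes the response body directly to a file. The helper name fetch_to_file and the command-line usage are illustrative assumptions, not part of the answer above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: download $url into $file, sending $agent as the
# User-Agent header. Returns the HTTP::Response so the caller can check it.
sub fetch_to_file {
    my ($url, $file, $agent) = @_;
    my $ua = LWP::UserAgent->new(agent => $agent);
    # ':content_file' tells LWP to write the body straight to $file
    return $ua->get($url, ':content_file' => $file);
}

# Usage (assumed): perl fetch.pl URL FILE
if (@ARGV == 2) {
    my $resp = fetch_to_file(@ARGV, 'Foo');
    die $resp->status_line, "\n" if $resp->is_error;
}
```

Because the body goes straight to disk, there is no need to open the file or call binmode yourself.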

Related

perl LWP::UserAgent gives a cryptic error message

Here's the code:
$vizFile ='https://docs.recipeinvesting.com/t.aaaf.html';
my $ua = LWP::UserAgent->new;
$ua->timeout(10);
$ua->env_proxy;
my $response = $ua->get($vizFile);
if ($response->is_success) {print $response->decoded_content;}
else {print"\nError= $response->status_line]n";}
I get the message:
Error= HTTP::Response=HASH(0x3a9b810)->status_line]n
The url works fine if I put it in a browser.
This was working consistently (with plain http, using LWP::Simple), until the site made some changes.
Could it be the https that's making the difference?
Is there some way to get a less cryptic error message?
You can't put code in a string literal and expect it to be executed. Variables are interpolated, but method calls are not: Perl interpolates $response (which stringifies to HTTP::Response=HASH(...)) and leaves ->status_line as literal text.
Replace
print"\nError= $response->status_line]n";
with
print "\nError= " . $response->status_line . "\n";
or
use feature qw( say );
say "\nError= " . $response->status_line;
This will print the status line as desired.
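To see the difference side by side without touching the network, here is a small sketch that builds an HTTP::Response by hand (the 403 status is only an example) and formats its status line in several ways. The @{[ ... ]} form is a common Perl idiom for interpolating an arbitrary expression; it is an addition here, not something the answer above used.

```perl
use strict;
use warnings;
use HTTP::Response;

# A hand-made response, standing in for what $ua->get would return
my $response = HTTP::Response->new(403, 'Forbidden');

# Only $response is interpolated; "->status_line" stays literal text
my $naive  = "Error= $response->status_line";

# Three ways that actually call the method
my $concat = 'Error= ' . $response->status_line;
my $inline = "Error= @{[ $response->status_line ]}";
my $format = sprintf 'Error= %s', $response->status_line;

print "$_\n" for $naive, $concat, $inline, $format;
```

The first line printed contains the HASH(0x...) address from the question; the other three contain the status line itself.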
See the demo code below. Including use strict; and use warnings; in your code is encouraged, as they help you avoid many potential problems.
use strict;
use warnings;
use feature 'say';
use LWP::UserAgent;
my $url ='https://docs.recipeinvesting.com/t.aaaf.html';
my $ua = LWP::UserAgent->new;
$ua->timeout(10);
$ua->env_proxy;
my $response = $ua->get($url);
if( $response->is_success ){
say $response->decoded_content;
} else {
die $response->status_line;
}
Documentation: LWP::UserAgent

How can I use a text file in LWP?

I need to send requests to an HTTP server using LWP. For example, I have a file with data, and I must send requests to server foobar.baz.
use LWP::UserAgent;
$ua = LWP::UserAgent->new;
$ua->agent("$0/0.1 " . $ua->agent);
$ua->agent("Mozilla/8.0");
$req = HTTP::Request->new(GET => 'http://www.foobar.baz');
$req->header('Accept' => 'text/html');
$res = $ua->request($req);
How can I use file.txt in
$req = HTTP::Request->new(GET => 'http://www.foobar.baz')
for every request?
For example file.txt contains
aaaa
bbbb
cccc
dddd
eeee
I need to send a request to
aaaa.foobar.baz
bbbb.foobar.baz
cccc.foobar.baz
and so on.
How can I do it?
This is a very simple question, and I wonder why you can't even attempt it yourself.
It's just a matter of reading the file and building the complete URL from each line of text.
use strict;
use warnings 'all';
use LWP::UserAgent;
my $ua = LWP::UserAgent->new;
$ua->agent("$0/0.1 " . $ua->agent);
$ua->agent("Mozilla/8.0");
open my $fh, '<', 'file.txt' or die $!;
while ( <$fh> ) {
next unless /\S/;
chomp;
my $res = $ua->get( "http://$_.foobar.baz/" ); # LWP needs an absolute URL, scheme included
}
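The URL-building step can be separated from the fetching so that it is easy to check on its own. The sketch below is one way to do that; the helper name urls_from_lines and the choice of the http scheme are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: turn raw lines (as read from file.txt) into one
# absolute URL per non-blank line, using each line as a subdomain of $domain.
sub urls_from_lines {
    my ($domain, @lines) = @_;
    my @urls;
    for my $line (@lines) {
        $line =~ s/^\s+|\s+$//g;      # strip the newline and stray whitespace
        next unless length $line;     # skip blank lines
        push @urls, "http://$line.$domain/";
    }
    return @urls;
}

print "$_\n" for urls_from_lines('foobar.baz', "aaaa\n", "\n", " bbbb \n");
```

Each returned URL can then be passed to $ua->get inside the loop.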
You might find App::SimpleScan on CPAN to be useful. I wrote it for just such an application back at Yahoo! in 2005. It handles combinatorial specifications of URLs, lets you snapshot the output, etc. Plugin-based with a fairly good set of plugins, so if it won't do exactly what you want out of the box, it shouldn't be hard for you to make it work.

HTML::TableExtract an HTTPS site

I've created a perl script to use HTML::TableExtract to scrape data from tables on a site.
It works great to dump out table data for unsecured sites (i.e. HTTP site), but when I try HTTPS sites, it doesn't work (the tables_report line just prints blank.. it should print a bunch of table data).
However, if I take the content of that HTTPS page, and save it to an html file and then post it on an unsecured HTTP site (and change my content to point to this HTTP page), this script works as expected.
Anyone know how I can get this to work over HTTPS?
#!/usr/bin/perl
use lib qw( ..);
use HTML::TableExtract;
use LWP::Simple;
use Data::Dumper;
# DOESN'T work:
my $content = get("https://datatables.net/");
# DOES work:
# my $content = get("http://www.w3schools.com/html/html_tables.asp");
my $te = HTML::TableExtract->new();
$te->parse($content);
print $te->tables_report(show_content=>1);
print "\n";
print "End\n";
The sites mentioned above for $content are just examples.. these aren't really the sites I'm extracting, but they work just like the site I'm really trying to scrape.
One option I guess is for me to use perl to download the page locally first and extract from there, but I'd rather not, if there's an easier way to do this (anyone that helps, please don't spend any crazy amount of time coming up with a complicated solution!).
The problem is the User-Agent string that LWP::Simple sends, which that site blocks. Use LWP::UserAgent and set a user agent that the site accepts, like this:
use strict;
use warnings;
use LWP::UserAgent;
my $ua = LWP::UserAgent->new;
my $url = 'https://datatables.net/';
$ua->agent("Mozilla/5.0"); # set user agent
my $res = $ua->get($url); # send request
# check the outcome
if ($res->is_success) {
# ok -> I simply print the content in this example, you should parse it
print $res->decoded_content;
}
else {
# ko
print "Error: ", $res->status_line, "\n";
}
This is because datatables.net is blocking LWP::Simple requests. You can confirm this with the code below:
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
print is_success(getprint("https://datatables.net/"));
Output:
$ perl test.pl
403 Forbidden <URL:https://datatables.net/>
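As a further diagnostic, you can print the User-Agent strings that the site would receive. This sketch relies on LWP::Simple exporting its internal LWP::UserAgent object as $ua when asked explicitly; the variable names and the replacement string 'Mozilla/5.0' are only examples, and whether a given site accepts it is not guaranteed.

```perl
use strict;
use warnings;
use LWP::Simple qw($ua get);
use LWP::UserAgent;

# What LWP::Simple and a plain LWP::UserAgent identify themselves as
my $simple_agent = $ua->agent;
my $plain_agent  = LWP::UserAgent->new->agent;
print "LWP::Simple sends:    $simple_agent\n";
print "LWP::UserAgent sends: $plain_agent\n";

# Changing the shared object affects later get()/getstore() calls
$ua->agent('Mozilla/5.0');
print "LWP::Simple now sends: ", $ua->agent, "\n";
```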
You could try using LWP::RobotUA. The code below works fine for me.
#!/usr/bin/perl
use strict;
use warnings;
use LWP::RobotUA;
use HTML::TableExtract;
my $ua = LWP::RobotUA->new( 'bot_chankey/1.1', 'chankeypathak@stackoverflow.com' );
$ua->delay(5/60); # 5 second delay between requests
my $response = $ua->get('https://datatables.net/');
if ( $response->is_success ) {
my $te = HTML::TableExtract->new();
$te->parse($response->content);
print $te->tables_report(show_content=>1);
}
else {
die $response->status_line;
}
In the end, a combination of Miguel and Chankey's responses provided my solution. Miguel's made up most of my code, so I selected that as the answer, but here is my "final" code (got a lot more to do, but this is all I couldn't figure out.. the rest should be no problem).
I couldn't quite get either mentioned by Miguel/Chankey to work, but they got me 99% of the way.. then I just had to figure out how to get around the error "certificate verify failed". I found that answer with Miguel's method right away, so in the end, I mostly used his code, but both responses were great!
#!/usr/bin/perl
use lib qw( ..);
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TableExtract;
use LWP::RobotUA;
use Data::Dumper;
my $ua = LWP::UserAgent->new(
ssl_opts => { SSL_verify_mode => 'SSL_VERIFY_PEER' },
);
my $url = 'https://WebsiteIUsedWasSomethingElse.com';
$ua->agent("Mozilla/5.0"); # set user agent
my $res = $ua->get($url); # send request
# check the outcome
if ($res->is_success)
{
my $te = HTML::TableExtract->new();
$te->parse($res->content);
print $te->tables_report(show_content=>1);
}
else {
# ko
print "Error: ", $res->status_line, "\n";
}
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;
my $url = "https://ohsesfire01.summit.network/reports/slices";
my $user = 'xxxxxx';
my $pass = 'xxxxxx';
my $ua = LWP::UserAgent->new;
my $request = HTTP::Request->new(GET => $url);
# authenticate with HTTP Basic credentials
$request->authorization_basic($user, $pass);
my $page = $ua->request($request);
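The snippet above uses HTTP Basic authentication. To check what authorization_basic actually adds to the request, without sending anything, a sketch like the following may help. The credentials and the example.com URL are placeholders.

```perl
use strict;
use warnings;
use HTTP::Request;
use MIME::Base64 qw(encode_base64);

# Placeholder credentials and URL, for illustration only
my ($user, $pass) = ('user', 'pass');
my $request = HTTP::Request->new(GET => 'https://example.com/reports');
$request->authorization_basic($user, $pass);

# The header is "Basic " followed by base64("user:pass")
my $header   = $request->header('Authorization');
my $expected = 'Basic ' . encode_base64("$user:$pass", '');
print "Authorization: $header\n";
print "matches manual encoding\n" if $header eq $expected;
```

Since the credentials are only base64-encoded, not encrypted, this scheme should be used over HTTPS.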

XPath won't find id

I'm failing to get a node by its id.
The code is straightforward and should be self-explanatory.
#!/usr/bin/perl
use Encode;
use utf8;
use LWP::UserAgent;
use URI::URL;
use Data::Dumper;
use HTML::TreeBuilder::XPath;
my $url = 'https://www.airbnb.com/rooms/1976460';
my $browser = LWP::UserAgent->new;
my $resp = $browser->get( $url, 'User-Agent' => 'Mozilla\/5.0' );
if ($resp->is_success) {
my $base = $resp->base || '';
print "-> base URL: $base\n";
my $data = $resp->decoded_content;
my $tree= HTML::TreeBuilder::XPath->new;
$tree->parse_content( $resp->decoded_content() );
binmode STDOUT, ":encoding(UTF-8)";
my $price_day = $tree->find('.//*[@id="price_amount"]/');
print Dumper($price_day);
$tree->delete();
}
The code above prints:
-> base URL: https://www.airbnb.com/rooms/1976460
$VAR1 = undef;
How can I select a node by its ID?
Thanks in advance.
Take that / off the end of that XPath.
.//*[@id="price_amount"]
should do. As it is, it's not valid XPath.
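For reference, here is a minimal offline sketch of selecting a node by id with HTML::TreeBuilder::XPath. The HTML string is a made-up stand-in for the real page, so it says nothing about whether the live page parses cleanly.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Stand-in markup; the real page is far larger
my $html = '<html><body><div id="price_amount">$285</div></body></html>';

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content($html);

# Attribute test uses @id, and there is no trailing slash after the predicate
my $price = $tree->findvalue('//*[@id="price_amount"]');
print "$price\n";

$tree->delete;
```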
There is a trailing slash in your XPath, that you need to remove
my $price_day = $tree->find('.//*[@id="price_amount"]');
However, from my own testing, I believe that HTML::TreeBuilder::XPath is also having trouble parsing that specific URL. Perhaps because of the conditional comments?
As an alternative approach, I would recommend using Mojo::UserAgent and Mojo::DOM instead.
The following uses the CSS selector div#price_amount to find your desired element and print it out.
use strict;
use warnings;
use Mojo::UserAgent;
my $url = 'https://www.airbnb.com/rooms/1976460';
my $dom = Mojo::UserAgent->new->get($url)->res->dom;
my $price_day = $dom->at(q{div#price_amount})->all_text;
print $price_day, "\n";
Outputs:
$285
Note, there is a helpful 8 minute introductory video to this set of modules at Mojocast Episode 5.

My LWP Script Not Working

I am running
#!/usr/bin/perl -w
use strict;
use LWP::Simple;
The variable $site has the html code.
Also you can use the function getstore to save the html data to a file, like:
my $http_code = getstore( 'http://www.google.com/', 'google.html' );
It would help you a lot if you could see the reason for the failure. I suggest you use the core LWP instead of the simple version. Like this:
#!/usr/bin/perl
use strict;
use warnings;
use LWP;
my $ua = LWP::UserAgent->new;
my $response = $ua->get('http://www.google.com/');
die "Couldn't get it: ", $response->status_line, "\n" unless $response->is_success;
my $site = $response->decoded_content;
print 'Got it.';