headless browsing - WWW::Mechanize and HTTP::Response

I'm playing with WWW::Mechanize, e.g.
my $mech = WWW::Mechanize->new(%opts);
$mech->get($url);
my $response = $mech->follow_link(url_regex => qr/some link/);
$response is returned as an HTTP::Response object. My question is: can I use $mech to continue following links in the response, submit forms, etc.? What can I do with the $response object?

The HTTP::Response object holds everything that came back from the other site:
$response->is_success() tells you whether the request was successful,
$response->code() returns the HTTP status code,
$response->header('Content-Type') returns the Content-Type HTTP header,
$response->content() gives you the response content,
etc. Check out the perldoc on HTTP::Response for more details.
As for $mech, it keeps track of the page it just fetched, so you can continue to use it to follow links, submit forms, etc.
Check out WWW::Mechanize::Examples for some good examples.
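A minimal sketch of that idea, assuming WWW::Mechanize and URI are installed. To stay runnable without network access it browses two throwaway local pages through file: URLs; the file names and link texts are made up for illustration. It follows a link, inspects the returned HTTP::Response, then keeps using the same $mech to follow another link.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use URI::file;
use WWW::Mechanize;

# Two tiny local pages (hypothetical content) that link to each other.
my $dir   = tempdir(CLEANUP => 1);
my %pages = (
    'index.html'  => '<html><head><title>Index</title></head>'
                   . '<body><a href="second.html">some link</a></body></html>',
    'second.html' => '<html><head><title>Second</title></head>'
                   . '<body><a href="index.html">back</a></body></html>',
);
for my $name (keys %pages) {
    open my $fh, '>', "$dir/$name" or die "Cannot write $name: $!";
    print {$fh} $pages{$name};
    close $fh;
}

my $mech = WWW::Mechanize->new;
$mech->get(URI::file->new_abs("$dir/index.html"));

# follow_link returns an HTTP::Response for the page it lands on.
my $response = $mech->follow_link(url_regex => qr/second/);
print $response->code, "\n";
print $response->header('Content-Type'), "\n";

# $mech now points at the new page, so it can keep browsing from there.
print $mech->title, "\n";
$mech->follow_link(text => 'back');
print $mech->uri, "\n";
```

Against a real site the same calls apply; only the starting URL changes.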

Related

Base URI while using Furl

I've checked the documentation at https://metacpan.org/pod/Furl
but can't find how to get a site's base URI while using Furl.
With LWP it's easy:
my $res = $ua->get($url);
my $base_uri = $res->base;
The base function tries to get the value from these header fields:
my $base = (
$self->header('Content-Base'), # used to be HTTP/1.1
$self->header('Content-Location'), # HTTP/1.1
$self->header('Base'), # HTTP/1.0
)[0];
But I couldn't do the same with Furl.

First: it seems you want an anonymous array in $base, in which case it should be:
my $base = [
$res->header('header1'),
$res->header('header2'),
$res->header('header3')
];
The code you had keeps only the first value in the list (the first of those headers that is present) and discards the rest; you can check that with Data::Dumper.
Maybe that's why it didn't work.
Second: after reading through the code of the Furl module, I found there's no exposed method for getting a URL's base. So unless your own code also checks for the HTML <base> tag and the URI you used for the request (even after redirects), it might break with some oldish sites. HTTP::Response does this checking, and that's what LWP uses.
Citation for hierarchy of base URIs: HTTP::Response - HTTP style response message
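As a rough illustration of that hierarchy, here is a sketch of a hypothetical helper, guess_base, that is not part of Furl or LWP. It assumes you have already copied the response headers into a plain hash ref with lower-cased names (how to get them out of a Furl response is not covered here) and that you know the URL that was finally requested. It does not look at the HTML <base> tag.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: pick a base URI from the same three headers as the
# snippet quoted in the question, falling back to the requested URL.
# $headers is assumed to be a hash ref keyed by lower-cased header names.
sub guess_base {
    my ($request_url, $headers) = @_;

    # The first header that is present wins.
    my ($base) = grep { defined($_) && length($_) }
                 @{$headers}{qw(content-base content-location base)};

    # No such header: the requested URL itself serves as the base.
    return URI->new($request_url) unless defined $base;

    # A header value may be relative, so resolve it against the request URL.
    return URI->new_abs($base, $request_url);
}

my $with_header = guess_base(
    'http://example.com/a/page.html',
    { 'content-location' => '/docs/index.html' },
);
my $without_header = guess_base('http://example.com/a/page.html', {});

print "$with_header\n";     # http://example.com/docs/index.html
print "$without_header\n";  # http://example.com/a/page.html
```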

How to check 302 http response

I'm trying to check the header of a redirection page and get the 302 status,
but with my code I get the 200 OK status of the forwarded page. What should I do to get the redirection page's 302 status? My code:
use LWP::UserAgent;
my $ua = LWP::UserAgent->new();
my $req = HTTP::Request->new('GET','http://host.com');
my $res = $ua->request($req);
print $res->status_line;

After initializing $ua, set its requests_redirectable property to undef:
$ua->requests_redirectable(undef);
That way LWP::UserAgent will not follow redirects and will instead stop after the first request.
Then you can get the code ("302", "301", etc.) using:
$res->code()
Here are the official docs for LWP::UserAgent.

$response->previous() will get you the previous response in the chain.
Or if you want to disable redirection, pass requests_redirectable => [] to LWP::UserAgent's constructor.
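To show what walking the chain with previous() looks like, here is a small offline sketch. The two responses are built by hand as stand-ins for what LWP::UserAgent would hand back after following one redirect; the Location value is a placeholder.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hand-built stand-ins: a 302 that led to a final 200.
my $redirect = HTTP::Response->new(
    302, 'Found', [ Location => 'http://host.com/home' ],
);
my $final = HTTP::Response->new(200, 'OK');
$final->previous($redirect);

# Walk back from the final response to the first one, oldest first.
my @statuses;
for (my $r = $final; defined $r; $r = $r->previous) {
    unshift @statuses, $r->status_line;
}
print "$_\n" for @statuses;
```

With a live request, $final would simply be the value returned by $ua->request($req), and the loop stays the same.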

OPTIONS HTTP Request in Perl

I need to send an HTTP OPTIONS request in Perl. I looked through several CPAN modules and read the docs, but found no mention of an OPTIONS request method, just GET, POST, PUT and DELETE.
Do I need to format this manually? Or is there possibly another library/module that my google-fu is missing out on?

The documentation for the HTTP::Request module says:
The method should be a short string like "GET", "HEAD", "PUT" or "POST".
So:
use v5.16;
use warnings;
use HTTP::Request;
use LWP::UserAgent;
my $ua = LWP::UserAgent->new;
my $request = HTTP::Request->new(OPTIONS => 'http://www.example.com/');
my $response = $ua->request($request);
I don't have a server that gives a useful response to an OPTIONS request to test the response with, but the request looks OK when I examine it after setting a proxy.
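The request can also be examined without a proxy or a server. The sketch below only builds the request object and prints its method and URI. The reply is a hand-built stand-in, assuming a server that answers with an Allow header, since no real response is available here.

```perl
use strict;
use warnings;
use HTTP::Request;
use HTTP::Response;

# HTTP::Request accepts any method name, including OPTIONS.
my $request = HTTP::Request->new(OPTIONS => 'http://www.example.com/');
print $request->method, "\n";
print $request->uri, "\n";

# Stand-in for a server reply (assumed, not a captured response).
my $response = HTTP::Response->new(
    204, 'No Content', [ Allow => 'GET, HEAD, OPTIONS' ],
);

# Split the Allow header into individual method names.
my @allowed = split /\s*,\s*/, ($response->header('Allow') // '');
print join(' ', @allowed), "\n";
```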

Trying to get source code of a webpage in perl

I'm trying to get the HTML source of a webpage using the Perl "get" function. I wrote the code 5 months ago and it was working fine, but yesterday I made a small edit and it has failed to work since, no matter what I tried.
Here is the code I tried.
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
my $link = 'www.google.com';
my $sou = get($link) or die "cannot retrieve code\n";
print $sou;
The code runs fine, but it's not able to retrieve the source; instead it displays
cannot retrieve code

The URL needs a scheme, otherwise get() cannot fetch it:
my $link = 'http://www.google.com';

This might be a bit late,
I have been struggling with the same problem and I think I have figured out why this occurs. I usually scrape websites with Python, and I have found that it helps to include some extra header info in the GET requests. This makes the website treat the bot like a regular browser, so it gives the bot access and does not respond with a 400 Bad Request status code.
So I applied this thinking to my Perl script, which was similar to yours, and just added some extra header info. The result gave me the source code for the website with no struggle.
Here is the code:
#!/usr/bin/perl
use strict;
use warnings;
# Load LWP (which provides LWP::UserAgent) and require at least version 5.64.
use LWP 5.64;
# Create the user agent that acts as a virtual browser.
my $browser = LWP::UserAgent->new;
# Extra header information sent to the website (e.g. Google) in case it would otherwise answer with a 400 Bad Request status code.
my @ns_headers = (
'User-Agent' => 'Mozilla/4.76 [en] (Win98; U)',
'Accept' => 'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*',
'Accept-Charset' => 'iso-8859-1,*,utf-8',
'Accept-Language' => 'en-US',
);
# The URL that the user agent will browse.
my $url = 'https://www.google.com/';
# Request the URL above, sending the extra headers. get() always returns a response object, so check is_success rather than the return value.
my $response = $browser->get($url, @ns_headers);
die "cannot retrieve code: ", $response->status_line, "\n" unless $response->is_success;
# Decode the response so the HTML source code is readable.
my $HTML = $response->decoded_content;
print($HTML);
I used LWP::UserAgent because it lets you add extra header information.
I hope this helped,
ME.
PS. Sorry if you already have the answer for this, just wanted to help.

How can I find the final URL after all redirections in Perl?

Let's say I have "http://www.ritzcarlton.com" and that redirects me to "http://www.ritzcarlton.com/en/Default.htm". Is there a way in Perl to find the end URL after all the redirects?

Using LWP will follow the redirections for you. You can then interrogate the HTTP::Request object to find out the URI it requested.
use LWP::UserAgent qw();
my $ua = LWP::UserAgent->new;
my $response = $ua->get('http://www.ritzcarlton.com');
print $response->request->uri . "\n";
Output is:
http://www.ritzcarlton.com/en/Default.htm

If you're issuing HTTP requests yourself, then the redirect URL will be in the returned Location: header. If you're using a proper HTTP client like LWP::UserAgent or WWW::Mechanize, which is what you should be doing, redirection is handled automatically.
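For a look at every hop rather than just the last one, the following offline sketch lists the URL requested at each step. The responses are hand-built stand-ins that use the URLs from the question; the 301 status is an assumption, and with a live LWP::UserAgent request the chain would be filled in for you.

```perl
use strict;
use warnings;
use HTTP::Request;
use HTTP::Response;

# Stand-in for the redirect response (status code assumed for illustration).
my $first = HTTP::Response->new(
    301, 'Moved Permanently',
    [ Location => 'http://www.ritzcarlton.com/en/Default.htm' ],
);
$first->request(HTTP::Request->new(GET => 'http://www.ritzcarlton.com'));

# Stand-in for the final response.
my $last = HTTP::Response->new(200, 'OK');
$last->request(
    HTTP::Request->new(GET => 'http://www.ritzcarlton.com/en/Default.htm'),
);
$last->previous($first);

# redirects() returns the earlier responses, oldest first.
my @hops      = map { $_->request->uri->as_string } $last->redirects, $last;
my $final_url = $hops[-1];

print "$_\n" for @hops;
print "final: $final_url\n";
```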
