I have the Perl & LWP book, but how do I set the user-agent string?
This is what I've got:
use LWP::UserAgent;
use LWP::Simple; # Used to download files
my $u = URI->new($url);
my $response_u = LWP::UserAgent->new->get($u);
die "Error: ", $response_u->status_line unless $response_u->is_success;
Any suggestions, if I want to use LWP::UserAgent like I do here?
From the LWP cookbook:
use LWP::UserAgent;
$ua = new LWP::UserAgent;
$ua->agent("$0/0.1 " . $ua->agent);
# $ua->agent("Mozilla/8.0") # pretend we are very capable browser
$req = new HTTP::Request 'GET' => 'http://www.sn.no/libwww-perl';
$req->header('Accept' => 'text/html');
# send request
$res = $ua->request($req);
I do appreciate the LWP cookbook solution which mentions the subclassing solution with a passing reference to lwp-request.
a wise perl monk once said: the ole subclassing LWP::UserAgent trick
package AgentP;
use base 'LWP::UserAgent';
sub _agent { "Mozilla/8.0" }
sub get_basic_credentials {
return 'admin', 'password';
}
package main;
use AgentP;
my $agent = AgentP->new;
my $response = $agent->get( 'http://127.0.0.1/hideout.html' );
print $agent->agent();
the entry has been revised with some poor humor, use statement, _agent override, and updated agent print line.
Bonus material for the interested: HTTP basic auth provided with get_basic_credentials override, which is how most people come to find the subclassing solution. _methods are sacred or something; but it does scratch an itch doesn't it?
Related
I've created a perl script to use HTML::TableExtract to scrape data from tables on a site.
It works great to dump out table data for unsecured sites (i.e. HTTP site), but when I try HTTPS sites, it doesn't work (the tables_report line just prints blank.. it should print a bunch of table data).
However, if I take the content of that HTTPS page, and save it to an html file and then post it on an unsecured HTTP site (and change my content to point to this HTTP page), this script works as expected.
Anyone know how I can get this to work over HTTPS?
#!/usr/bin/perl
use lib qw( ..);
use HTML::TableExtract;
use LWP::Simple;
use Data::Dumper;
# DOESN'T work:
my $content = get("https://datatables.net/");
# DOES work:
# my $content = get("http://www.w3schools.com/html/html_tables.asp");
my $te = HTML::TableExtract->new();
$te->parse($content);
print $te->tables_report(show_content=>1);
print "\n";
print "End\n";
The sites mentioned above for $content are just examples.. these aren't really the sites I'm extracting, but they work just like the site I'm really trying to scrape.
One option I guess is for me to use perl to download the page locally first and extract from there, but I'd rather not, if there's an easier way to do this (anyone that helps, please don't spend any crazy amount of time coming up with a complicated solution!).
The problem is related to the user agent that LWP::Simple uses, which is stopped at that site. Use LWP::UserAgent and set an allowed user agent, like this:
use strict;
use warnings;
use LWP::UserAgent;
my $ua = LWP::UserAgent->new;
my $url = 'https://datatables.net/';
$ua->agent("Mozilla/5.0"); # set user agent
my $res = $ua->get($url); # send request
# check the outcome
if ($res->is_success) {
# ok -> I simply print the content in this example, you should parse it
print $res->decoded_content;
}
else {
# ko
print "Error: ", $res->status_line, "\n";
}
This is because datatables.net is blocking LWP::Simple requests. You can confirm this by using below code:
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
print is_success(getprint("https://datatables.net/"));
Output:
$ perl test.pl
403 Forbidden <URL:https://datatables.net/>
You could try using LWP::RobotUA. Below code works fine for me.
#!/usr/bin/perl
use strict;
use warnings;
use LWP::RobotUA;
use HTML::TableExtract;
my $ua = LWP::RobotUA->new( 'bot_chankey/1.1', 'chankeypathak#stackoverflow.com' );
$ua->delay(5/60); # 5 second delay between requests
my $response = $ua->get('https://datatables.net/');
if ( $response->is_success ) {
my $te = HTML::TableExtract->new();
$te->parse($response->content);
print $te->tables_report(show_content=>1);
}
else {
die $response->status_line;
}
In the end, a combination of Miguel and Chankey's responses provided my solution. Miguel's made up most of my code, so I selected that as the answer, but here is my "final" code (got a lot more to do, but this is all I couldn't figure out.. the rest should be no problem).
I couldn't quite get either mentioned by Miguel/Chankey to work, but they got me 99% of the way.. then I just had to figure out how to get around the error "certificate verify failed". I found that answer with Miguel's method right away, so in the end, I mostly used his code, but both responses were great!
#!/usr/bin/perl
use lib qw( ..);
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TableExtract;
use LWP::RobotUA;
use Data::Dumper;
my $ua = LWP::UserAgent->new(
ssl_opts => { SSL_verify_mode => 'SSL_VERIFY_PEER' },
);
my $url = 'https://WebsiteIUsedWasSomethingElse.com';
$ua->agent("Mozilla/5.0"); # set user agent
my $res = $ua->get($url); # send request
# check the outcome
if ($res->is_success)
{
my $te = HTML::TableExtract->new();
$te->parse($res->content);
print $te->tables_report(show_content=>1);
}
else {
# ko
print "Error: ", $res->status_line, "\n";
}
my $url = "https://ohsesfire01.summit.network/reports/slices";
my $user = 'xxxxxx';
my $pass = 'xxxxxx';
my $ua = new LWP::UserAgent;
my $request = new HTTP::Request GET=> $url;
# authenticate
$request->authorization_basic($user, $pass);
my $page = $ua->request($request);
I'm using LWP::UserAgent to request a lot of page content. I already know the ip of the urls I am requesting so I'd like to be able to specify the ip address of where the url I am requesting is hosted, so that LWP does not have to spend time doing a dns lookup. I've looked through the documentation but haven't found any solutions. Does anyone know of a way to do this? Thanks!
So I found a module that does exactly what I'm looking for: LWP::UserAgent::DNS::Hosts
Here is an example script that I tested and does what I specified in my question:
#!/usr/bin/perl
use strict;
use LWP::UserAgent;
use LWP::UserAgent::DNS::Hosts;
LWP::UserAgent::DNS::Hosts->register_host(
'www.cpan.org' => '199.15.176.140',
);
my $ua = LWP::UserAgent->new;
$ua->timeout(10);
$ua->env_proxy;
#actually enforces new DNS settings as if they were in /etc/hosts
LWP::UserAgent::DNS::Hosts->enable_override;
my $response = $ua->get('http://www.cpan.org/');
if ($response->is_success) {
print $response->decoded_content; # or whatever
}
else {
die $response->status_line;
}
Hum, your system should already be caching DNS responses. Are you sure this optimisation would help?
Option 1.
Use
http://192.0.43.10/
instead of
http://www.example.org/
Of course, that will fail if the server does name-based virtual hosting.
Option 2.
Replace Socket::inet_aton (called from IO::Socket::INET called from LWP::Protocol::http) with a caching version.
use Socket qw( );
BEGIN {
my $original = \&Socket::inet_aton;
my %cache;
my $caching = sub {
return $cache{$_[0]} //= $original->($_[0]);
};
no warnings 'redefine';
*Socket::inet_aton = $caching;
}
Simply replace the domain name with the IP address in your URL:
use strict;
require LWP::UserAgent;
my $ua = LWP::UserAgent->new;
$ua->timeout(10);
$ua->env_proxy;
# my $response = $ua->get('http://stackoverflow.com/');
my $response = $ua->get('http://64.34.119.12/');
if ($response->is_success) {
print $response->decoded_content; # or whatever
}
else {
die $response->status_line;
}
I'm using XML::RSSLite for parsing RSS data I retrieved using LWP. LWP is correctly retrieving in the right encoding but when using RSSLite to parse the data, the encoding seems to be lost and characteres like é, è, à, etc. are deleted from the output. Is there an option to set in order to force the encoding?
Here is my script:
use strict;
use XML::RSSLite;
use LWP::UserAgent;
use HTTP::Headers;
use utf8;
my $ua = LWP::UserAgent->new;
$ua->timeout(10);
$ua->env_proxy;
my $URL = "http://www.boursier.com/syndication/rss/news/FR0004031839/FR";
my $response = $ua->get($URL);
if ($response->is_success) {
my $content = $response->decoded_content((charset => 'UTF-8'));
my %result;
parseRSS(\%result, \$content);
foreach my $item (#{ $result{items} }) {
print "ITEM: $item->{title}\n";
}
}
I tried to use XML::RSS as it seems to have more option that may be handy in my case but it failed to install unfortunately. :(
I like that Mojo::UserAgent along with Mojo::DOM already have the support I need without me tracking down the right combinations of modules to use, and it handles the UTF-8 bits without me doing anything special:
use v5.10;
use open qw( :std :utf8 );
use Mojo::UserAgent;
my $ua = Mojo::UserAgent->new;
my $URL = "http://www.boursier.com/syndication/rss/news/FR0004031839/FR";
my $response = $ua->get($URL)->res;
my #links = $response
->dom( 'item > title' )
->map( sub { $_->text } )
->each;
$" = "\n";
print "#links\n";
I have another example at Painless RSS processing with Mojo
The RSSLite documentation explicitely states:
Remove characters other than 0-9~!##$%^&*()-+=a-zA-Z[];',.:"<>?\s
Therefore, the module is hopelessly broken. Try again with XML::Feed
use LWP::Simple;
use HTML::LinkExtor;
use Data::Dumper;
#my $url = shift #ARGV;
my $content = get('example.com?GET=whateverIwant');
my $parser = HTML::LinkExtor->new(); #create LinkExtor object with no callbacks
$parser->parse($content); #parse content
now if I want to send POST and COOKIE info as well with the HTTP header how can I configure that with the get funciton? or do I have to customize my own method?
My main interest is Cookies! then Post!
LWP::Simple is for very simple HTTP GET requests. If you need to do anything more complex (like cookies), you have to upgrade to a full LWP::UserAgent. The cookie_jar is a HTTP::Cookies object, and you can use its set_cookie method to add a cookie.
use LWP::UserAgent;
my $ua = LWP::UserAgent->new(cookie_jar => {}); # create an empty cookie jar
$ua->cookie_jar->set_cookie(...);
my $rsp = $ua->get('example.com?GET=whateverIwant');
die $rsp->status_line unless $rsp->is_success;
my $content = $rsp->decoded_content;
...
The LWP::UserAgent also has a post method.
You might want to use WWW::Mechanize instead. It already glues together most of the stuff that you want:
use WWW::Mechanize;
my $mech = WWW::Mechanize->new;
$mech->cookie_jar->set_cookie(...);
$mech->get( ... );
my #links = $mech->links;
I want to make a program that communicates with http://www.md5crack.com/crackmd5.php. My goal is to send the site a hash (md5) and hopefully the site will be able to crack it. After, I would like to display the plaintext of the hash. My problem is sending the data to the site. I looked up articles about using LWP however I am still lost. Right now, the hash is not sending, some other junk data is. How would I go about sending a particular string of data to the site?
use HTTP::Request::Common qw(POST);
use LWP::UserAgent;
$ua = LWP::UserAgent->new();
my $req = POST 'http://www.md5crack.com/crackmd5.php', [
maxlength=> '2048',
name=> 'term',
size=>'55',
title=>'md5 hash to crack',
value=> '098f6bcd4621d373cade4e832627b4f6',
name=>'crackbtn',
type=>'submit',
value=>'Crack that hash baby!',
];
$content = $ua->request($req)->as_string;
print "Content-type: text/html\n\n";
print $content;
You are POSTing the wrong data because you're taking the HTML to specify the widget and conflating it with the data it actually sends. The corrected data would be to just send the widget name and its value:
term: 098f6bcd4621d373cade4e832627b4f6
Instead, the data that is getting POSTed currently is:
maxlength: 2048
name: term
size: 55
title: md5 hash to crack
value: 098f6bcd4621d373cade4e832627b4f6
name: crackbtn
type: submit
value: Crack that hash baby!
Corrected program:
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request::Common qw{ POST };
use CGI;
my $md5 = '098f6bcd4621d373cade4e832627b4f6';
my $url = 'http://www.md5crack.com/crackmd5.php';
my $ua = LWP::UserAgent->new();
my $request = POST( $url, [ 'term' => $md5 ] );
my $content = $ua->request($request)->as_string();
my $cgi = CGI->new();
print $cgi->header(), $content;
You can also use LWP::UserAgent's post() method:
use strict;
use warnings;
use LWP::UserAgent;
use CGI;
my $md5 = '098f6bcd4621d373cade4e832627b4f6';
my $url = 'http://www.md5crack.com/crackmd5.php';
my $ua = LWP::UserAgent->new();
my $response = $ua->post( $url, { 'term' => $md5 } );
my $content = $response->decoded_content();
my $cgi = CGI->new();
print $cgi->header(), $content;
Always remember to use strict and use warnings. It is considered good practice and will save your time.
It used to be that crackers would figure this sort of stuff out by reading. There are examples in HTTP::Request::Common, which LWP::UserAgent tells you to check out for sending POST data. You only need to send the form data, not the meta data that goes with it.
You might have an easier time using WWW::Mechanize since it has a much more human-centric interface.