WWW::Mechanize SSL cross-site login failure, cookie_jar not populating - Perl

The URL $url redirects to https://auth.outside.com/secure/login for authentication over SSL. The site sets some cookies as soon as you land on the page, and more on successful authentication. However, the cookie file is not being populated, even when I manage to land on the page. (This example uses Google, but the real URL is different.)
Code:
#!/usr/bin/perl
use warnings;
use strict;
use WWW::Mechanize;
use Crypt::SSLeay;
use HTTP::Cookies;

my $userAgent   = 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:35.0) Gecko/20100101 Firefox/35.0';
my $cookie_file = 'auth_cookies.txt';
$ENV{HTTPS_PROXY} = 'http://myproxy.net:8080';

my $google   = 'https://www.google.com';
my $url      = $google;
my $tempfile = 'download_details';

my $mech = WWW::Mechanize->new(
    noproxy    => 0,
    agent      => $userAgent,
    cookie_jar => HTTP::Cookies->new( file => $cookie_file ),
);

my $result = $mech->get( $url, ':content_file' => $tempfile );
print sprintf( "User-Agent %s\n redirects to: %s\n\n", $userAgent, $mech->uri() );
print "result=$result\n";
This outputs the following:
User-Agent Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:35.0) Gecko/20100101 Firefox/35.0
redirects to: https://www.google.com
result=HTTP::Response=HASH(0x3474ef0)
but it does not create any cookie file, even though I can see a bunch of cookies in Firebug.

After adding this code, though, the file does get populated:
$mech->cookie_jar->set_cookie(
    qw(
        3
        cat
        buster
        /
        .example.com
        0
        0
        0
    )
);
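A likely explanation, though it isn't confirmed anywhere in this thread: HTTP::Cookies only writes its file when save is called or when the jar was created with autosave => 1, and it skips session cookies (which login cookies usually are) unless ignore_discard => 1 is also set. A minimal sketch of the jar construction under that assumption:

my $mech = WWW::Mechanize->new(
    noproxy    => 0,
    agent      => $userAgent,
    cookie_jar => HTTP::Cookies->new(
        file           => $cookie_file,
        autosave       => 1,    # write the file automatically on destruction
        ignore_discard => 1,    # also save session ("discard") cookies
    ),
);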

Related

Read a web page with Perl

I am trying to read the content of a web page with Perl on Windows 10. The code does not work for the following site:
https://www.dividendinvestor.com/dividend-quote/intc/
Here is the code I am using:
use LWP::Simple qw(get);
my $url = 'https://www.dividendinvestor.com/dividend-quote/intc/';
my $html = get $url;
print $html;
Any idea why I cannot read that page?
LWP::Simple is pretty basic and doesn't let you do anything clever like actually looking at the details of the response. So let's change to LWP::UserAgent and see what the response is.
use LWP::UserAgent;
my $url = 'https://www.dividendinvestor.com/dividend-quote/intc/';
my $ua = LWP::UserAgent->new;
my $resp = $ua->get($url);
print $resp->status_line;
This prints:
403 Forbidden
So I think that Quentin's comment is correct and that the site's owners are blocking people who use technology like LWP.
So let's change the useragent string to look like Internet Explorer.
use LWP::UserAgent;
my $agent = ' Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; AS; rv:11.0) like Gecko';
my $url = 'https://www.dividendinvestor.com/dividend-quote/intc/';
my $ua = LWP::UserAgent->new;
$ua->agent($agent);
my $resp = $ua->get($url);
print $resp->status_line;
Now I get:
200 OK
So we should be ok to get the content.
use LWP::UserAgent;
my $agent = ' Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; AS; rv:11.0) like Gecko';
my $url = 'https://www.dividendinvestor.com/dividend-quote/intc/';
my $ua = LWP::UserAgent->new;
$ua->agent($agent);
my $resp = $ua->get($url);
if ($resp->is_success) {
    print $resp->content;
} else {
    print $resp->status_line;
}
And that seems to work fine.
Note: of course, changing the useragent string like this is rather dishonest. Presumably the site's owners have a good reason for wanting to dissuade people from accessing their site in this way, so don't annoy them by trying to get around their restrictions. Read the site's terms of service to see what they want you to do. Perhaps they have an API available that will give you the data you want.
As Dave Cross wrote, the problem is related to the user agent. It is possible to use the LWP::Simple module in this way:
use LWP::Simple qw/$ua get/;
$ua->agent('Mozilla/5.0');
my $url = 'https://www.dividendinvestor.com/dividend-quote/intc/';
my $html = get $url;
print $html;
As the documentation points out, the user agent created by this module (LWP::Simple) will identify itself as "LWP::Simple/#.##", so we can change it before making the GET request.

Scrape from .onion site using Web::Scraper

Problem: scrape from a Tor .onion site using Web::Scraper.
I would like to modify my code to connect to a .onion site. I believe I need to go through the SOCKS5 proxy, but I am unsure how to do that with Web::Scraper.
Existing code:
use Web::Scraper;

my $piratelink = $PIRATEBAYSERVER . '/search/' . $srstring . '%20' . 's' . $sval[1] . 'e' . $epinum . '/0/7/0';
my $purlToScrape = $piratelink;

my $ns = scraper {
    process "td>a",            'mag[]'     => '@href';
    process "td>div>a",        'tor[]'     => '@href';
    process "td font.detDesc", 'sizerow[]' => 'TEXT';
};

my $mres = $ns->scrape( URI->new($purlToScrape) );
Web::Scraper uses LWP if you pass a URI to scrape.
You can either fetch the HTML using some other HTTP library that speaks SOCKS (a sketch of that option follows the code below), or, via the shared UserAgent variable from Web::Scraper, set up LWP to use SOCKS and pass that in as the agent:
use strict;
use LWP::UserAgent;
use Web::Scraper;

# Set up an LWP object with the Tor SOCKS address
my $ua = LWP::UserAgent->new(
    agent => q{Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; YPC 3.2.0; .NET CLR 1.1.4322)},
);
$ua->proxy( [qw/ http https /] => 'socks://localhost:9050' );    # Tor proxy
$ua->cookie_jar( {} );

my $PIRATEBAYSERVER = 'http://uj3wazyk5u4hnvtk.onion';
my $srstring = 'photoshop';
my $piratelink = $PIRATEBAYSERVER . '/search/' . $srstring;    # . '%20'.'s'.$sval[1].'e'.$epinum.'/0/7/0';
my $purlToScrape = $piratelink;

my $ns = scraper {
    process "td>a",            'mag[]'     => '@href';
    process "td>div>a",        'tor[]'     => '@href';
    process "td font.detDesc", 'sizerow[]' => 'TEXT';
};

# Override Web::Scraper's shared UserAgent with our SOCKS-enabled LWP object
$Web::Scraper::UserAgent = $ua;

my $mres = $ns->scrape( URI->new($purlToScrape) );
print $mres;    # note: $mres is a hash reference of the scraped fields
Note: you will also need to install the CPAN module LWP::Protocol::socks.
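For completeness, here is a minimal sketch of the first alternative mentioned above: fetch the HTML yourself through the SOCKS-enabled $ua, then hand the content (plus the base URI) straight to scrape, which accepts an HTML string as well as a URI:

my $res = $ua->get($purlToScrape);
die $res->status_line unless $res->is_success;

# scrape() also takes raw HTML plus an optional base URI for resolving links
my $mres = $ns->scrape( $res->decoded_content, URI->new($purlToScrape) );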

Cannot login with UserAgent

I have managed to log in with the code below, but it only works once a day. After that I can't log in; I just get the login page back in the response. Yet when I print $reqstr from the code below and paste it into a browser (like Firefox), I can log in. Wget doesn't work either; only a normal browser does. Sometimes it seems that I'm logged in, but I only get content like this:
"<html>\cJ<head>\cJ\cI<meta http-equiv=\"content-type\" content=\"text/html; charset=ISO-8859-1\"><meta http-equiv=\"expires\" content=\"0\"><meta http-equiv=\"pragma\" content=\"no-cache\">\cJ\cI<meta http-equiv=\"refresh\" content=\"0; URL='https://www.address.com/'\">\cJ</head>\cJ</html>\cJ"
I also noticed that while I can't log in, I'm getting this part in the debugger:
_uri_canonical' => URI::https=SCALAR(0x17dad28)
-> REUSED_ADDRESS
'handlers' => HASH(0x22dc0c0)
'response_data' => ARRAY(0x22ee8b8)
0 HASH(0x22d9a48)
'callback' => CODE(0x22dba30)
-> &LWP::UserAgent::__ANON__[/usr/lib/perl5/vendor_perl/5.10.0/LWP/UserAgent.pm:682] in /usr/lib/perl5/vendor_perl/5.10.0/LWP/UserAgent.pm:679-682
1 HASH(0x22eea08)
'callback' => CODE(0x22d9cb8)
-> &LWP::Protocol::__ANON__[/usr/lib/perl5/vendor_perl/5.10.0/LWP/Protocol.pm:138] in /usr/lib/perl5/vendor_perl/5.10.0/LWP/Protocol.pm:135-138
Any clue?
Here is the code:
my $b = LWP::UserAgent->new(
    agent => 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.5) Gecko/20060719 Firefox/31.2.0',
);
my $cookie_jar = HTTP::Cookies->new(
    file           => 'lwp_cookies.txt',
    autosave       => 1,
    ignore_discard => 1,
);
$cookie_jar->clear;
$cookie_jar->clear_temporary_cookies;
$b->cookie_jar($cookie_jar);

my $url = "https://www.address.com";
my $r = $b->get($url);
$r->decoded_content =~ /FORM ACTION="(.*?)" METHOD/msgi;
my $a = "$url$1";
print $a . "\n";

my $reqstr = $a . "&LoginAction=Login&Number=55555&KPassword=passw&UserID=uid";
my $req = HTTP::Request->new( POST => $reqstr );
$req->header( 'Host',       'www.address.com' );
$req->header( 'User-Agent', 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0' );
$req->header( 'Connection', 'keep-alive' );
$req->header( 'Accept',     'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' );
my $c = $b->request($req);
You need to re-request that page with the referrer added via referer() for LWP::UserAgent (or see my second answer if you aren't wedded to that module):
sub login {    # Code not tested and not really compilable, just a stub for you
    my ($url, $referrer_url, @other_args) = @_;
    # Add your login code from the question, up to calling $b->request()
    $req->referer($referrer_url) if $referrer_url;
    my $c = $b->request($req);
    return $c;    # Or return the response?
}

my $result1 = login($original_login_url);    # first try
# Obtain the redirect_url from the response.
# If it was a 301 redirect, you can do it via:
#     my @redirects = $response->redirects();
my $referrer_url = $original_login_url;
my $result2 = login( $redirect_url, $referrer_url );
References:
http://forums.devshed.com/perl-programming-6/lwp-meta-refresh-tag-handling-63484.html
http://www.herongyang.com/Perl/LWP-UserAgent-Follow-HTTP-Redirects.html
If you aren't dead set on using LWP::UserAgent, use WWW::Mechanize instead.
Best approach: use WWW::Mechanize::Plugin::FollowMetaRedirect. The SYNOPSIS is pretty short and to the point:
use WWW::Mechanize;
use WWW::Mechanize::Plugin::FollowMetaRedirect;
my $mech = WWW::Mechanize->new;
$mech->get( $url );
$mech->follow_meta_redirect;
# Optionally, skip emulating the waiting time
$mech->follow_meta_redirect( ignore_wait => 1 );
If you don't have access to that module, you can create your own, similar to this: http://www.perlmonks.org/?node_id=487286
(Basically, parse the returned content with a regex (shudder) to extract the refresh URL, and get that URL. As per my other answer, you might need to add the referrer header.)
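A rough sketch of that manual approach; the regex is illustrative only (it assumes a meta tag shaped like the one quoted in the question) and is no substitute for a real HTML parser:

use WWW::Mechanize;

my $mech = WWW::Mechanize->new;
$mech->get($url);

# Follow <meta http-equiv="refresh" content="0; URL='...'"> by hand
if ( $mech->content =~ /http-equiv=["']refresh["'][^>]*URL='?([^'">]+)/i ) {
    $mech->add_header( Referer => $mech->uri );    # some sites check the referrer
    $mech->get($1);
}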

Send a plain string request with LWP

To get a response from a certain website, I have to send one exact HTTP/1.1 request string. I tried it with telnet, and it gives me the response I want (a redirect, but I need it).
But when I feed the same request string to HTTP::Request->parse(), I merely get the message 400 URL must be absolute.
I am not sure whether it's the website or LWP giving me that, because, as I said, the request worked with telnet.
This is the code:
my $req = "GET / HTTP/1.1\n".
"Host: www.example-site.de\n".
"User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1\n".
"Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\n".
"Accept-Language: en-us,en;q=0.5\n".
"Accept-Encoding: gzip, deflate\n".
"Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7\n".
"Keep-Alive: 115\n".
"Connection: keep-alive\n";
# Gives correct request string
print HTTP::Request->parse($req)->as_string;
my $ua = LWP::UserAgent->new( cookie_jar => {}, agent => '' );
my $response = $ua->request(HTTP::Request->parse($req));
# 400 error
print $response->as_string,"\n";
Can anyone help me here?
LWP::UserAgent fails with the error you are getting if no scheme is specified in the request; it needs an absolute URL to know where to send it.
So, to make it work, you need to specify the full URL in your request:
my $req_str = "GET http://www.example.de/\n".
"User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1\n".
"Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\n".
"Accept-Language: en-us,en;q=0.5\n".
"Accept-Encoding: gzip, deflate\n".
"Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7\n".
"Keep-Alive: 115\n".
"Connection: keep-alive\n";
OK, I did it using sockets. After all, I had the HTTP request and wanted the plain response. Here is the code for people who are interested:
use IO::Socket::INET;    # not "IO::Sockets", which does not exist

my $sock = IO::Socket::INET->new(
    PeerAddr => 'www.example-site.de',
    PeerPort => 80,
    Proto    => 'tcp',
);
die "Could not create socket: $!\n" unless $sock;

print $sock $req;    # no comma after the filehandle
while (<$sock>) {
    # Look for stuff I need
}
close $sock;
It's just important to remember to break out of the while loop yourself, as the HTTP response won't end with an EOF.
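One way around that, as a sketch of my own rather than anything from the original post: switch the Connection header to "close" so the server hangs up after the response; then EOF does arrive and the loop can end on its own.

# Ask the server to close the connection after responding
$req =~ s/Connection: keep-alive/Connection: close/;
print $sock $req;
print while <$sock>;    # safe now: terminates when the server closes the socket
close $sock;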
It looks to me like parsing a request isn't 100% round-trip safe, meaning you cannot simply feed a parsed request back into LWP.
It looks like a bug at first sight, but the module has been around for such a long time… On the other hand, I didn't even know you could use this module to parse a request, so maybe it's not so well tested.
The following test case should point you to the problem, which is that the URL isn't assembled into an absolute form before being fed to the $ua->request method.
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;
use Test::More;
my $host = 'www.example.com';
my $url = '/bla.html';
my $req = <<"EOS";
GET $url HTTP/1.1
Host: $host
EOS
# (1) parse the request
my $reqo = HTTP::Request->parse($req);
isa_ok $reqo, 'HTTP::Request';
diag explain $reqo;
diag $reqo->as_string;
# (2) construct the request
my $reqo2 = HTTP::Request->new( GET => "http://$host$url" );
isa_ok $reqo2, 'HTTP::Request';
diag explain $reqo2;
diag $reqo2->as_string;
is $reqo->uri, $reqo2->uri, 'both URLs are identical';
my $ua = LWP::UserAgent->new( cookie_jar => {}, agent => '' );
for ( $reqo, $reqo2 ) {
    my $response = $ua->request($_);
    diag $response->as_string, "\n";
}
done_testing;
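A possible workaround (my own suggestion, not something from the module's documentation): after parsing, promote the request's relative URI to an absolute one using its own Host header before handing it to LWP. For the $reqo object above, assuming plain HTTP:

# Rebuild an absolute URI from the parsed request's Host header
$reqo->uri( 'http://' . $reqo->header('Host') . $reqo->uri );
my $response = $ua->request($reqo);    # no more "400 URL must be absolute"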

A problem in socket programming in Perl

I wrote this code:
#!/usr/local/bin/perl
use strict;
use LWP::UserAgent;
my $ua = new LWP::UserAgent(agent => 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.5) Gecko/20060719 Firefox/1.5.0.5');
$ua->proxy([qw(http https)] => 'http://203.185.28.228:1080' #that is just socks:port);
my $response = $ua->get("http://www.google.com");
print $response->code,' ', $response->message,"\n";
but when I execute it, I get this error:
500 Can't connect to 203.185.28.228:1080 (connect: timeout)
What should I do?
I tested your script and it's fine; the only error I found is with
$ua->proxy([qw(http https)] => 'http://203.185.28.228:1080' #that is just socks:port);
The comment should be outside the parentheses, i.e.
$ua->proxy([qw(http https)] => 'http://203.185.28.228:1080'); #that is just socks:port
Also, please check your Internet connectivity. Below is the output I got from your script:
200 Assumed OK
Is it SOCKS5? Does it require you to authenticate? (Take a look at your Firefox/IE settings if they use the same proxy.)
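One more thing worth checking, echoing the Tor answer further up this page: if 203.185.28.228:1080 really is a SOCKS proxy (the port and your comment suggest it is), LWP cannot reach it through an http:// proxy URL. You would need the CPAN module LWP::Protocol::socks and a socks:// scheme instead, along these lines:

use LWP::UserAgent;

# Requires LWP::Protocol::socks from CPAN
my $ua = LWP::UserAgent->new(
    agent => 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.5) Gecko/20060719 Firefox/1.5.0.5',
);
$ua->proxy( [qw(http https)] => 'socks://203.185.28.228:1080' );    # SOCKS scheme, not http://

my $response = $ua->get('http://www.google.com');
print $response->code, ' ', $response->message, "\n";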