I use WWW::Mechanize::Shell to test stuff.
Since I couldn't manage to sign in to a web site I want to scrape, I thought I would use the browser's cookie (Chrome or Firefox) for that specific website with the 'cookie' command WWW::Mechanize::Shell has.
The question is: cookies are usually stored in a single file, which is not good. How do I get a cookie for only this specific site?
Why isn't storing cookies in a file good?
Since WWW::Mechanize is built on top of LWP::UserAgent, you handle cookies just like you do in LWP::UserAgent. You can make the cookie jar a file or an in-memory hash.
If you don't want to save the cookies in a file, use an empty hash reference when you construct the mech object:
use WWW::Mechanize;
my $mech = WWW::Mechanize->new( cookie_jar => {} );
If you want to use a new file, make a new HTTP::Cookies object:
use WWW::Mechanize;
use HTTP::Cookies;
my $mech = WWW::Mechanize->new(
    cookie_jar => HTTP::Cookies->new( file => "$ENV{HOME}/.cookies.txt" )
);
If you want to load a browser specific cookies file, use the right module for it:
use WWW::Mechanize;
use HTTP::Cookies::Netscape;
my $mech = WWW::Mechanize->new(
    cookie_jar => HTTP::Cookies::Netscape->new( file => $filename )
);
If you want no cookies at all, use undef explicitly:
use WWW::Mechanize;
my $mech = WWW::Mechanize->new( cookie_jar => undef );
All of this is in the docs.
HTTP::Cookies::Netscape and HTTP::Cookies::Microsoft load your existing browser cookies.
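If what you want is just that one site's cookies out of the browser's file, you can load the whole file and then copy only the matching entries into a fresh jar. A rough sketch, assuming a Netscape/Mozilla-format cookies.txt; the file path and domain below are placeholders:
use WWW::Mechanize;
use HTTP::Cookies;
use HTTP::Cookies::Netscape;

# Load the browser's full cookie file (placeholder path).
my $browser_jar = HTTP::Cookies::Netscape->new( file => "$ENV{HOME}/cookies.txt" );

# Copy only the cookies for the site of interest into an in-memory jar.
my $site_jar = HTTP::Cookies->new;
$browser_jar->scan( sub {
    my ( $version, $key, $val, $path, $domain, $port, $path_spec,
         $secure, $expires, $discard ) = @_;
    return unless $domain =~ /example\.com$/;    # placeholder domain
    # scan() reports an absolute expiry time; set_cookie() wants a max-age.
    $site_jar->set_cookie( $version, $key, $val, $path, $domain, $port,
                           $path_spec, $secure,
                           defined $expires ? $expires - time() : undef,
                           $discard );
} );

my $mech = WWW::Mechanize->new( cookie_jar => $site_jar );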
Related
I am using WWW::Mechanize to crawl a website and collect information on the Cookies being set. Here is the code I'm using:
#! /usr/bin/perl
use warnings;
use WWW::Mechanize;
use HTTP::Cookies;
my $cookie_jar = HTTP::Cookies->new;
my $mech = WWW::Mechanize->new( cookie_jar => $cookie_jar, autocheck => 1 );
my $response = $mech->get('http://assets.pinterest.com/images/PinExt.png');
print "Cookie:\n" . $cookie_jar->as_string;
When I use Chrome and check the resources, I can see a cookie getting set. However, when I run my code I get nothing. I've been having this issue on a number of websites. Why am I missing cookies?
Your code works (it prints cookies) for http://google.com/.
I used Firefox to visit http://assets.pinterest.com/images/PinExt.png, and no cookies were set, so there is nothing for your script to capture.
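One easy way to confirm that from the script side is to dump the response headers and look for Set-Cookie yourself, instead of relying on the jar; a small sketch using the $response you already have:
print $response->headers->as_string, "\n";
print "Set-Cookie header: ", ( $response->header('Set-Cookie') // 'none' ), "\n";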
I am attempting to log in to YouTube with WWW::Mechanize and use forms() to print out all the forms on the page after logging in. My script logs in successfully and also successfully navigates to youtube.com/inbox; however, for some reason Mechanize cannot see any forms at youtube.com/inbox. It just returns blank. Here is my code:
#!"C:\Perl64\bin\perl.exe" -T
use strict;
use warnings;
use CGI;
use CGI::Carp qw/fatalsToBrowser/;
use WWW::Mechanize;
use Data::Dumper;
my $q = CGI->new;
print $q->header();
my $url = 'https://www.google.com/accounts/ServiceLogin?uilel=3&service=youtube&passive=true&continue=http://www.youtube.com/signin%3Faction_handle_signin%3Dtrue%26nomobiletemp%3D1%26hl%3Den_US%26next%3D%252Findex&hl=en_US<mpl=sso';
my $mechanize = WWW::Mechanize->new(autocheck => 1);
$mechanize->agent_alias( 'Windows Mozilla' );
$mechanize->get($url);
$mechanize->submit_form(
form_id => 'gaia_loginform',
fields => { Email => 'myemail',Passwd => 'mypassword' },
);
die unless ($mechanize->success);
$url = 'http://www.youtube.com/inbox';
$mechanize->get($url);
$mechanize->form_id('comeposeform');
my $page = $mechanize->content();
print Dumper($mechanize->forms());
Mechanize is unable to see any forms at youtube.com/inbox; however, as I said, I can print all of the forms from the initial link, no matter what I change it to...
Thanks in advance.
As always, one of the best debugging approaches is to print what you get and check if it is what you were expecting. This applies to your problem too.
In your case, if you print $mechanize->content() you'll see that you didn't get the page you're expecting. YouTube wants you to follow a JavaScript redirect in order to complete your cross-domain login action. You have multiple options here:
parse the returned content manually, e.g. with /location\.replace\("(.+?)"/ (a rough sketch follows this list)
try to have your code parse JavaScript (have a look at WWW::Scripter)
[recommended] use the YouTube API for managing your inbox
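If you go the manual-parsing route from the first option, the idea is roughly this (a sketch only; it assumes the redirect is written as location.replace("...") with JavaScript string escapes, which can change at any time):
if ( $mechanize->content() =~ /location\.replace\("(.+?)"\)/ ) {
    my $next_url = $1;
    # The URL sits inside a JavaScript string literal, so undo common escapes.
    $next_url =~ s/\\x([0-9a-fA-F]{2})/chr(hex($1))/ge;
    $next_url =~ s{\\/}{/}g;
    $mechanize->get($next_url);    # follow the redirect the script asked for
}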
When I visit usatoday.com with IE, there are cookie files automatically created in my Temporary Internet Files folder. But why doesn't the following Perl script capture anything?
use WWW::Mechanize;
use strict;
use warnings;
use HTTP::Cookies;
my $browser = WWW::Mechanize->new();
my $response = $browser->get( 'http://www.usatoday.com' );
my $cookie_jar = $browser->cookie_jar(HTTP::Cookies->new());
$cookie_jar->extract_cookies( $response );
my $cookie_content = $cookie_jar->as_string;
print $cookie_content;
For some other sites like amazon.com, google.com and yahoo.com, the script works well, but at least it seems to me usatoday.com also sends cookie information to the browser, why am I having different results? Is there something I'm missing?
Any ideas? Thanks!
USA Today uses JavaScript to set the cookie. WWW::Mechanize does not parse or run JavaScript.
If you need to crawl the site with a cookie, you could analyze http://i.usatoday.net/_common/_scripts/gel/lib/core/core.js and other JS files and determine how exactly the cookie is created, and create one yourself programmatically.
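For example, once you have worked out from the JavaScript which cookie it sets, you can plant it in the jar yourself before requesting the page. A sketch only; the cookie name and value below are made up, not the ones USA Today actually uses:
use WWW::Mechanize;
use HTTP::Cookies;

my $cookie_jar = HTTP::Cookies->new;

# set_cookie($version, $key, $val, $path, $domain, $port, $path_spec,
#            $secure, $maxage, $discard)
$cookie_jar->set_cookie( 0, 'SomeCookieName', 'some-value', '/',
                         '.usatoday.com', undef, 0, 0, 86_400, 0 );

my $browser = WWW::Mechanize->new( cookie_jar => $cookie_jar );
my $response = $browser->get('http://www.usatoday.com');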
So I'm scraping a site that I have access to via HTTPS; I can log in and start the process, but each time I hit a new page (URL) the cookie session ID changes. How do I keep the logged-in cookie session ID?
#!/usr/bin/perl -w
use strict;
use warnings;
use WWW::Mechanize;
use HTTP::Cookies;
use LWP::Debug qw(+);
use HTTP::Request;
use LWP::UserAgent;
use HTTP::Request::Common;
my $un = 'username';
my $pw = 'password';
my $url = 'https://subdomain.url.com/index.do';
my $agent = WWW::Mechanize->new(cookie_jar => {}, autocheck => 0);
$agent->{onerror}=\&WWW::Mechanize::_warn;
$agent->agent('Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.3) Gecko/20100407 Ubuntu/9.10 (karmic) Firefox/3.6.3');
$agent->get($url);
$agent->form_name('form');
$agent->field(username => $un);
$agent->field(password => $pw);
$agent->click("Log In");
print "After Login Cookie: ";
print $agent->cookie_jar->as_string();
print "\n\n";
my $searchURL='https://subdomain.url.com/search.do';
$agent->get($searchURL);
print "After Search Cookie: ";
print $agent->cookie_jar->as_string();
print "\n";
The output:
After Login Cookie: Set-Cookie3: JSESSIONID=367C6D; path="/thepath"; domain=subdomain.url.com; path_spec; secure; discard; version=0
After Search Cookie: Set-Cookie3: JSESSIONID=855402; path="/thepath"; domain=subdomain.url.com; path_spec; secure; discard; version=0
Also I think the site requires a CERT (Well in the browser it does), would this be the correct way to add it?
$ENV{HTTPS_CERT_FILE} = 'SUBDOMAIN.URL.COM'; ## Insert this after the use HTTP::Request...
Also, for the CERT, I'm using the first option in this list; is this correct?
X.509 Certificate (PEM)
X.509 Certificate with chain (PEM)
X.509 Certificate (DER)
X.509 Certificate (PKCS#7)
X.509 Certificate with chain (PKCS#7)
When your user agent isn't doing something you think it should be doing, compare its requests with those of an interactive browser. A Firefox plugin is handy for this sort of thing.
You're probably missing part of the process that the server expects. You probably aren't logging in or interacting correctly, and that could be for all sorts of reasons. For instance, there might be JavaScript on the page that WWW::Mechanize isn't handling.
When you can pinpoint what an interactive browser is doing that you are not, you'll know where you need to improve your script.
In your script, you can also watch what is happening by turning on debugging in LWP, which Mech is built on:
use LWP::Debug qw(+);
rjh already answered the certificate part of your question.
If your session cookie changes every page load, then likely you are not logging in correctly. But you could try forcing the JSESSIONID to be the same for each request. Construct your own cookie jar and tell WWW::Mechanize to use it:
my $cookie_jar = HTTP::Cookies->new(file => 'cookies', autosave => 1, ignore_discard => 1);
my $agent = WWW::Mechanize->new(cookie_jar => $cookie_jar, autocheck => 0);
The ignore_discard => 1 means that even session cookies are saved to disk (normally they are discarded for security reasons).
Then, after logging in, call:
$cookie_jar->save;
Then, after each request:
$cookie_jar->revert; # re-loads the cookies from the last save
Alternately, you could sub-class HTTP::Cookies and override the set_cookie method to reject re-setting the session cookie if it already exists.
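A rough sketch of that subclass idea (untested, and the package name is made up); it simply refuses to overwrite JSESSIONID once one is already in the jar:
package My::StickyCookies;
use parent 'HTTP::Cookies';

sub set_cookie {
    my ( $self, $version, $key, $val, @rest ) = @_;
    if ( $key eq 'JSESSIONID' ) {
        my $have_one;
        # scan() visits every cookie in the jar; $_[1] is the cookie name.
        $self->scan( sub { $have_one = 1 if $_[1] eq 'JSESSIONID' } );
        return $self if $have_one;    # keep the session id from the login
    }
    return $self->SUPER::set_cookie( $version, $key, $val, @rest );
}

package main;
use WWW::Mechanize;
my $agent = WWW::Mechanize->new( cookie_jar => My::StickyCookies->new, autocheck => 0 );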
Also I think the site requires a CERT (Well in the browser it does), would this be the correct way to add it?
Some browsers (Internet Explorer for example) prompt for a security certificate even if one is not needed. If you are not getting any errors and the response content looks good, you probably don't need to set one.
If you do have a certificate file, check the POD for Crypt::SSLeay. Your certificate is PEM-encoded, so yes, you want to set $ENV{HTTPS_CERT_FILE} to the path of your cert. You might want to set $ENV{HTTPS_DEBUG} = 1 to see what's happening.
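Something like this, assuming Crypt::SSLeay is the SSL backend in use (the paths are placeholders):
$ENV{HTTPS_CERT_FILE} = '/path/to/client-cert.pem';   # client certificate (PEM)
$ENV{HTTPS_KEY_FILE}  = '/path/to/client-key.pem';    # private key, if kept in a separate file
$ENV{HTTPS_DEBUG}     = 1;                            # print low-level SSL chatter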
Set up the cookie jar, something akin to this:
my $cookie = HTTP::Cookies->new( file => 'cookie', autosave => 1 );
my $mech = WWW::Mechanize->new(cookie_jar => $cookie, ....);
I have a Perl script that uses WWW::Mechanize to read from a file and perform some automated tasks on a website. However, the website uses a 302 redirect after every time I request a certain page. I don't want to be redirected (the page that it redirects to takes too long to respond); I just want to loop through the file and call the first link over and over. I can't figure out how to make WWW::Mechanize NOT follow redirects. Any suggestions?
WWW::Mechanize is a subclass of LWP::UserAgent. So you can use any LWP::UserAgent methods.
my $mech = WWW::Mechanize->new();
$mech->requests_redirectable([]);
WWW::Mechanize is a subclass of LWP::UserAgent; you can set the max_redirect or requests_redirectable options in the constructor as you would with LWP::UserAgent.
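For example (a minimal sketch), either option can be given in the constructor and is passed through to LWP::UserAgent:
use WWW::Mechanize;

# Both of these disable automatic redirect following.
my $mech  = WWW::Mechanize->new( max_redirect => 0 );
my $mech2 = WWW::Mechanize->new( requests_redirectable => [] );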
You can use $agent->max_redirect( 0 ), like in this example:
#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;
my $agent = WWW::Mechanize->new( 'autocheck' => 1, 'onerror' => undef, );
$agent->max_redirect( 0 );
$agent->get('http://www.depesz.com/test/redirect');
printf("Got HTTP/%s from %s.\n", $agent->response->code, $agent->uri);
$agent->max_redirect( 1 );
$agent->get('http://www.depesz.com/test/redirect');
printf("Got HTTP/%s from %s.\n", $agent->response->code, $agent->uri);
When running it prints:
Got HTTP/302 from http://www.depesz.com/test/redirect.
Got HTTP/200 from http://www.depesz.com/.
So, with max_redirect(0), it clearly doesn't follow redirects.