Missing Cookies using WWW::Mechanize - Perl

I am using WWW::Mechanize to crawl a website and collect information on the Cookies being set. Here is the code I'm using:
#! /usr/bin/perl
use warnings;
use WWW::Mechanize;
use HTTP::Cookies;
my $cookie_jar = HTTP::Cookies->new;
my $mech = WWW::Mechanize->new( cookie_jar => $cookie_jar, autocheck => 1 );
my $response = $mech->get('http://assets.pinterest.com/images/PinExt.png');
print "Cookie:\n" . $cookie_jar->as_string;
When I use Chrome and check the resources, I can see a cookie getting set. However, when I run my code I get nothing. I've been having this issue on a number of websites. Why am I missing cookies?

Your code works (prints cookies) for http://google.com/
I have used Firefox to visit http://assets.pinterest.com/images/PinExt.png. No cookies have been set there either.
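One way to see what is actually happening is to dump the response headers: if the server never sends a Set-Cookie header for that image URL, the jar has nothing to store. A minimal sketch:
use warnings;
use WWW::Mechanize;
my $mech = WWW::Mechanize->new( autocheck => 1 );
my $response = $mech->get('http://assets.pinterest.com/images/PinExt.png');
# Dump every header the server returned; if there is no Set-Cookie
# header here, there is nothing for the cookie jar to record.
print $response->headers->as_string;
The cookie you see in Chrome may well be set by a different request (e.g. the containing page or another pinterest.com host), not by this image itself.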

Related

Unable to download PDFs with Perl and LWP

I'm trying to use LWP::Simple in Perl to download a number of PDF documents from the United Nations website (Security Council resolutions, etc.). Yet instead of returning PDFs, I am receiving an HTML error page. Consider this very simple example:
use LWP::Simple;
use strict;
my $url = 'https://documents-dds-ny.un.org/doc/UNDOC/GEN/N16/100/02/PDF/N1610002.pdf';
my $file = 'test.pdf';
getstore($url, $file);
If I then look at the contents of "test.pdf", I find that they are an HTML page.
I have also tried a number of tricks with LWP::UserAgent and even with cURL, but with no success. Any ideas?
OK, thanks to @Steffen Ullrich and @ikegami for putting me on the right track!
It is indeed a cookie issue. The fix? Open a cookie jar, access the homepage of the site first, then access the PDF once a cookie has been stored in the jar.
This can be done without using HTTP::Cookies. We need to use LWP::UserAgent instead of LWP::Simple, however.
Minimal working example below:
use strict;
use warnings 'all';
use LWP::UserAgent;
my $homeUrl = "https://documents.un.org/prod/ods.nsf/home.xsp";
my $pdfUrl = "https://documents-dds-ny.un.org/doc/UNDOC/GEN/N16/100/02/PDF/N1610002.pdf";
my $pdfOutputName = "test.pdf";
my $browser = LWP::UserAgent->new( cookie_jar => { } );
my $resp;
$resp = $browser->get( $homeUrl );
die $resp->status_line unless $resp->is_success;
$resp = $browser->get( $pdfUrl, ':content_file' => $pdfOutputName );
die $resp->status_line unless $resp->is_success;
This will produce a complete PDF file.

WWW::Mechanize Form Select

I am attempting to log in to YouTube with WWW::Mechanize and use forms() to print out all the forms on the page after logging in. My script logs in successfully and also successfully navigates to youtube.com/inbox; however, for some reason Mechanize cannot see any forms at youtube.com/inbox. It just returns blank. Here is my code:
#!"C:\Perl64\bin\perl.exe" -T
use strict;
use warnings;
use CGI;
use CGI::Carp qw/fatalsToBrowser/;
use WWW::Mechanize;
use Data::Dumper;
my $q = CGI->new;
print $q->header();
my $url = 'https://www.google.com/accounts/ServiceLogin?uilel=3&service=youtube&passive=true&continue=http://www.youtube.com/signin%3Faction_handle_signin%3Dtrue%26nomobiletemp%3D1%26hl%3Den_US%26next%3D%252Findex&hl=en_US&ltmpl=sso';
my $mechanize = WWW::Mechanize->new(autocheck => 1);
$mechanize->agent_alias( 'Windows Mozilla' );
$mechanize->get($url);
$mechanize->submit_form(
form_id => 'gaia_loginform',
fields => { Email => 'myemail',Passwd => 'mypassword' },
);
die unless ($mechanize->success);
$url = 'http://www.youtube.com/inbox';
$mechanize->get($url);
$mechanize->form_id('comeposeform');
my $page = $mechanize->content();
print Dumper($mechanize->forms());
Mechanize is unable to see any forms at youtube.com/inbox; however, as I said, I can print all of the forms from the initial link, no matter what I change it to...
Thanks in advance.
As always, one of the best debugging approaches is to print what you get and check if it is what you were expecting. This applies to your problem too.
In your case, if you print $mechanize->content() you'll see that you didn't get the page you're expecting. YouTube wants you to follow a JavaScript redirect in order to complete your cross-domain login action. You have multiple options here:
parse the returned content manually, e.g. with /location\.replace\("(.+?)"/ (see the sketch after this list)
try to have your code parse JavaScript (have a look at WWW::Scripter)
[recommended] use the YouTube API for managing your inbox
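For the first option, a minimal sketch that would continue the script above (the regex and the unescaping of \xNN / \/ sequences are assumptions about how the redirect URL is embedded, not something guaranteed by YouTube):
my $content = $mechanize->content();
if ( $content =~ /location\.replace\("(.+?)"/ ) {
my $next_url = $1;
# The URL is embedded in a JavaScript string, so undo common escapes.
$next_url =~ s/\\x([0-9a-fA-F]{2})/chr hex $1/ge;
$next_url =~ s{\\/}{/}g;
$mechanize->get($next_url);
}
print Dumper( $mechanize->forms() );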

WWW::Mechanize and Cookies

I use WWW::Mechanize::Shell to test stuff.
Since I couldn't manage to sign in to a web site I want to scrape, I thought I would use the browser's cookies (Chrome or Firefox) for that specific website with the 'cookie' command WWW::Mechanize::Shell has.
The problem is that the browser stores cookies for all sites in a single file, which is not what I want. How can I get the cookies for only this specific site?
Why isn't storing cookies in a file good?
Since WWW::Mechanize is built on top of LWP::UserAgent, you handle cookies just like you do in LWP::UserAgent. You can make the cookie jar a file or an in-memory hash.
If you don't want to save the cookies in a file, use an empty hash reference when you construct the mech object:
use WWW::Mechanize;
my $mech = WWW::Mechanize->new( cookie_jar => {} );
If you want to use a new file, make a new HTTP::Cookies object:
use WWW::Mechanize;
use HTTP::Cookies;
my $mech = WWW::Mechanize->new(
cookie_jar => HTTP::Cookies->new( file => "$ENV{HOME}/.cookies.txt", autosave => 1 )
);
If you want to load a browser-specific cookie file, use the right module for it:
use WWW::Mechanize;
use HTTP::Cookies::Netscape;
my $mech = WWW::Mechanize->new(
cookie_jar => HTTP::Cookies::Netscape->new( file => $filename )
);
If you want no cookies at all, use undef explicitly:
use WWW::Mechanize;
my $mech = WWW::Mechanize->new( cookie_jar => undef );
All of this is in the docs.
HTTP::Cookies::Netscape and HTTP::Cookies::Microsoft load your existing browser cookies.
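If you really only want the cookies for one site, one approach is to scan the browser jar and copy the matching cookies into a fresh in-memory jar. A sketch, assuming a Netscape-format cookie file in $filename and using example.com as a placeholder domain:
use WWW::Mechanize;
use HTTP::Cookies;
use HTTP::Cookies::Netscape;
my $all = HTTP::Cookies::Netscape->new( file => $filename ); # $filename: your browser's cookies.txt
my $site_jar = HTTP::Cookies->new; # in-memory jar holding just one site's cookies
$all->scan( sub {
my ( $version, $key, $val, $path, $domain, $port, $path_spec, $secure, $expires, $discard, $rest ) = @_;
return unless $domain =~ /(^|\.)example\.com$/; # keep only this site's cookies (placeholder domain)
$site_jar->set_cookie( $version, $key, $val, $path, $domain, $port, $path_spec, $secure, defined $expires ? $expires - time : undef, $discard, $rest );
} );
my $mech = WWW::Mechanize->new( cookie_jar => $site_jar );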

Trouble with downloading files

I am trying to download a file from a site using Perl. I chose not to use wget so that I can learn how to do it this way. I am not sure whether my page is not connecting or whether something is wrong in my syntax somewhere. Also, what is the best way to check whether you are getting a connection to the page?
#!/usr/bin/perl -w
use strict;
use LWP;
use WWW::Mechanize;
my $mech = WWW::Mechanize->new();
$mech->credentials( '********' , '********'); # if you need to supply server and realm, use credentials() as described in the LWP documentation
$mech->get('http://datawww2.wxc.com/kml/echo/MESH_Max_180min/');
$mech->success();
if (!$mech->success()) {
print "cannot connect to page\n";
exit;
}
$mech->follow_link( n => 8);
$mech->save_content('C:/Users/********/Desktop/');
I'm sorry, but almost everything is wrong.
You are mixing LWP::UserAgent and WWW::Mechanize in the wrong way. You can't call $mech->follow_link() after fetching the page with $browser->get(), because you are mixing functions from two modules: $mech doesn't know that you made a request.
The arguments to credentials() are not right; see the documentation.
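For reference, the LWP::UserAgent form of credentials() also takes the host:port and realm; a sketch with placeholder realm, username and password:
# LWP::UserAgent-style call; 'Some Realm', 'username' and 'password' are placeholders
$mech->credentials( 'datawww2.wxc.com:80', 'Some Realm', 'username', 'password' );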
You most probably want to do something like this:
use WWW::Mechanize;
my $mech = WWW::Mechanize->new();
$mech->credentials( '************' , '*************'); # if you need to supply server and realm, use credentials() as described in the LWP documentation
$mech->get('http://datawww2.wxc.com/kml/echo/MESH_Max_180min/');
$mech->follow_link( n => 8);
You can check the result of get() and follow_link() with $mech->success():
if (!$mech->success()) { warn "error"; ... }
After follow_link(), the data is available via $mech->content(); if you want to save it to a file, use $mech->save_content('/path/to/a/file').
A full example could be:
use strict;
use WWW::Mechanize;
my $mech = WWW::Mechanize->new();
$mech->credentials( '************' , '*************'); #
$mech->get('http://datawww2.wxc.com/kml/echo/MESH_Max_180min/');
die "Error: failled to load the web page" if (!$mech->success());
$mech->follow_link( n => 8);
die "Error: failled to download content" if (!$mech->success());
$mech->save_content('/tmp/mydownloadedfile')

How can I keep WWW::Mechanize from following redirects?

I have a Perl script that uses WWW::Mechanize to read from a file and perform some automated tasks on a website. However, the website uses a 302 redirect after every time I request a certain page. I don't want to be redirected (the page that it redirects to takes too long to respond); I just want to loop through the file and call the first link over and over. I can't figure out how to make WWW::Mechanize NOT follow redirects. Any suggestions?
WWW::Mechanize is a subclass of LWP::UserAgent, so you can use any of the LWP::UserAgent methods.
my $mech = WWW::Mechanize->new();
$mech->requests_redirectable([]);
WWW::Mechanize is a subclass of LWP::UserAgent; you can set the max_redirect or requests_redirectable options in the constructor as you would with LWP::UserAgent.
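If you would rather set this up front, both options can also be passed to the constructor; a minimal sketch:
use WWW::Mechanize;
# max_redirect and requests_redirectable are LWP::UserAgent options
# that WWW::Mechanize passes through to its parent class
my $mech = WWW::Mechanize->new( max_redirect => 0 );
# or: my $mech = WWW::Mechanize->new( requests_redirectable => [] );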
You can use $agent->max_redirect( 0 ), as in this example:
#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;
my $agent = WWW::Mechanize->new( 'autocheck' => 1, 'onerror' => undef, );
$agent->max_redirect( 0 );
$agent->get('http://www.depesz.com/test/redirect');
printf("Got HTTP/%s from %s.\n", $agent->response->code, $agent->uri);
$agent->max_redirect( 1 );
$agent->get('http://www.depesz.com/test/redirect');
printf("Got HTTP/%s from %s.\n", $agent->response->code, $agent->uri);
When run, it prints:
Got HTTP/302 from http://www.depesz.com/test/redirect.
Got HTTP/200 from http://www.depesz.com/.
So with max_redirect(0) it clearly doesn't follow redirects.