Why can't my Perl script print cookie values?

When I visit usatoday.com with IE, cookie files are automatically created in my Temporary Internet Files folder. But why doesn't the following Perl script capture anything?
use strict;
use warnings;
use WWW::Mechanize;
use HTTP::Cookies;    # loaded explicitly since we call HTTP::Cookies->new

my $browser  = WWW::Mechanize->new();
my $response = $browser->get( 'http://www.usatoday.com' );

# Attach a fresh cookie jar and pull any Set-Cookie headers
# from the response into it.
my $cookie_jar = $browser->cookie_jar( HTTP::Cookies->new() );
$cookie_jar->extract_cookies( $response );

my $cookie_content = $cookie_jar->as_string;
print $cookie_content;
For some other sites, like amazon.com, google.com, and yahoo.com, the script works well. As far as I can tell, usatoday.com also sends cookie information to the browser, so why am I getting different results? Is there something I'm missing?
Any ideas? Thanks!

USA Today uses JavaScript to set the cookie, and WWW::Mechanize does not parse or run JavaScript.
If you need to crawl the site with a cookie, you could analyze http://i.usatoday.net/_common/_scripts/gel/lib/core/core.js and the other JS files, determine exactly how the cookie is created, and create one yourself programmatically.
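Once you know what the JS sets, you can plant the cookie in Mechanize's jar by hand with HTTP::Cookies::set_cookie. A rough sketch; the cookie name, value, and expiry below are placeholders, not what the site actually sets:

# Hypothetical sketch: inject a hand-built cookie before crawling.
# Name and value are placeholders; read the site's JS for the real ones.
my $jar = $browser->cookie_jar;
$jar->set_cookie(
    0,                # version
    'usat_pref',      # cookie name (placeholder)
    'some-value',     # cookie value (placeholder)
    '/',              # path
    '.usatoday.com',  # domain
    undef,            # port
    0,                # path_spec
    0,                # secure
    86400,            # max age in seconds
    0,                # discard
);
$browser->get('http://www.usatoday.com');   # subsequent requests send the cookie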

Related

Unable to download PDFs with Perl and LWP

I'm trying to use LWP::Simple in Perl to download a number of PDF documents from the United Nations website (Security Council resolutions, etc.). Yet instead of PDFs, I am receiving an HTML error page. Consider this very simple example:
use LWP::Simple;
use strict;
my $url = 'https://documents-dds-ny.un.org/doc/UNDOC/GEN/N16/100/02/PDF/N1610002.pdf';
my $file = 'test.pdf';
getstore($url, $file);
If I then look at the contents of "test.pdf", I find that they are an HTML page.
I have also tried a number of tricks with LWP::UserAgent and even with cURL, but with no success. Any ideas?
OK, thanks to @SteffenUllrich and @ikegami for putting me on the right track!
It is indeed a cookie issue. The fix? Open a cookie jar and access the homepage of the site first, then access the PDF once a cookie has been stored in the jar.
This can be done without using HTTP::Cookies directly; we do need LWP::UserAgent instead of LWP::Simple, however.
Minimal working example below:
use strict;
use warnings 'all';
use LWP::UserAgent;

my $homeUrl       = "https://documents.un.org/prod/ods.nsf/home.xsp";
my $pdfUrl        = "https://documents-dds-ny.un.org/doc/UNDOC/GEN/N16/100/02/PDF/N1610002.pdf";
my $pdfOutputName = "test.pdf";

# An empty hashref enables an in-memory cookie jar.
my $browser = LWP::UserAgent->new( cookie_jar => { } );

# Hit the homepage first so the session cookie lands in the jar.
my $resp = $browser->get( $homeUrl );
die $resp->status_line unless $resp->is_success;

# The PDF request now carries the cookie; stream the body straight to disk.
$resp = $browser->get( $pdfUrl, ':content_file' => $pdfOutputName );
die $resp->status_line unless $resp->is_success;
This will produce a complete PDF file.
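As a quick sanity check (my addition, not part of the fix above), you can confirm the saved file really is a PDF rather than another HTML error page:

# A real PDF starts with the "%PDF-" magic bytes.
open my $fh, '<:raw', $pdfOutputName or die "Can't open $pdfOutputName: $!";
read $fh, my $magic, 5;
close $fh;
die "Got an HTML page, not a PDF\n" unless $magic eq '%PDF-';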

Perl CGI::Session - Migrating user sessions to a new server, without them knowing it

I need to move our Perl apps to a new server, but I don't want everyone to have to re-authenticate when the move is made. I'd like to have our "Home" script redirect over to the new server and pre-plant a cookie, with a corresponding session file in /tmp, for each user before I make the DNS change.
I thought it would be simple, but it's turning out not to be. That, or I'm just missing something right in front of my face.
Here is the code I put in the "Home" script on the current server:
my $has_session = $cgi->param("session") || "";
if ( $has_session eq "" ) {
    my $url = "http://111.222.333.444/cgi-bin/SetNewSession.cgi?back=http://"
            . $ENV{SERVER_NAME} . $ENV{SCRIPT_NAME};
    print "Location: $url\n\n";
}
And here is the code in a script on the new server...
use strict;
use warnings;
use CGI;
use CGI::Session;
use CGI::Carp qw/fatalsToBrowser warningsToBrowser/;

my $cgi        = CGI->new;
my $userid     = $cgi->param("userid");
my $redir_back = $cgi->param("back");

my $session = CGI::Session->load() or die $!;
my $session_userid = $session->param("userid");
if ( !defined $session_userid ) {
    $session->expire("9y");
    $session->param( "userid", $userid );
    $session->flush();
}

my $url = $redir_back . "?session=1";
print $session->header( -location => $url );
exit;
No cookie. No session file. Nothing.
P.S. Please, no flak for the 9y expiration. Management is "above" having to login. :)
Here is the code I put in the "Home" script on the current server..
That part doesn't seem to transmit the userid, and it doesn't use $cgi->url( -full => 1, -rewrite => 1 ) to build the URL value for back=. A sketch of the redirect is below.
You might also want to encrypt userid (and even back) with Crypt::CBC using Rijndael, the way Session::Storage::Secure does for its cookies (UTSL).
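A minimal sketch of that suggestion, assuming the Home script already knows $userid (the IP and script path are the placeholders from the question):

# Hypothetical sketch of the Home-script redirect: forward the userid
# along with a fully-qualified, URL-escaped return address.
use URI::Escape qw(uri_escape);

my $back = $cgi->url( -full => 1, -rewrite => 1 );
my $url  = "http://111.222.333.444/cgi-bin/SetNewSession.cgi"
         . "?userid=" . uri_escape($userid)
         . "&back="   . uri_escape($back);
print $cgi->redirect($url);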
And here is the code in a script on the new server... No cookie. No session file. Nothing.
The way I read your program, it should have at least thrown an error, because load() won't create a session. It's called load(), not new(); new() will create a session.
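A minimal sketch of the fix, following the load-then-promote pattern from the CGI::Session documentation:

# load() only resurrects an existing session; when there is none it
# returns an "empty" session object. Calling new() on that empty object
# creates the real session (and sends the cookie).
my $session = CGI::Session->load($cgi)
    or die CGI::Session->errstr;
if ( $session->is_empty ) {
    $session = $session->new
        or die CGI::Session->errstr;
}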

Processing an external page with Perl CGI, or acting as a reverse proxy

There is a page residing on a local server running Apache. I would like to submit the form via a GET request with a single name/value pair, like:
id=item1234
This GET request has to be processed by another server, which I don't have control over; that server then returns a page which I would like to transform with a CGI script. In other words:
User submits the form
MY Apache proxies the request to the external resource
EXTERNAL resource sends back a page
MY Apache transforms it with a CGI script (maybe another way?)
User gets the modified page
Again, this is more of an architectural question, so I'd be grateful for any hints; even a pointer to a relevant guide would help, as I wasn't able to phrase my Google query well enough to find anything related.
Thanks.
Pass the id "17929632" to this CGI code ("proxy.pl?id=17929632"), and you should see this exact page in your browser.
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use CGI::Pretty qw(:standard -any -no_xhtml -oldstyle_urls);

print header;
print "<html>\n";
print " <head><title>Proxy Demo</title></head>\n";
print " <body bgcolor=\"white\">\n";

my $id = param('id') || die "No CGI param 'id'\n";

my $ua = LWP::UserAgent->new;
$ua->agent("MyApp/0.1 ");

# Create a request
my $req = HTTP::Request->new( GET => "http://stackoverflow.com/questions/$id" );

# Pass request to the user agent and get a response back
my $response = $ua->request($req);

# Check the outcome of the response
if ( $response->is_success ) {
    my $content = $response->content;
    # Modify the original content here!
    print $content;
}
else {
    print $response->status_line;
}
print "</body></html>\n";
Vague question, vague answer: write your CGI program to include an HTTP user agent, e.g. LWP.

Get files matching a given pattern from a given URL, using Perl on Unix

I have been told that a given URL contains several XML and text files, and I need to download all the XML files whose names start with AAA (that is, AAA*.xml) from a given directory.
Credentials to access that URL have been provided to me.
Please note that the XML files could be several GB in size.
I have used the code below to try to achieve this:
use strict;
use warnings;
use LWP;

my $browser  = LWP::UserAgent->new;
my $username = 'scott';
my $password = 'tiger';

# Create HTTP request object
my $req = HTTP::Request->new( GET => "https://url.com/" );

# Authenticate the user
$req->authorization_basic( $username, $password );

my $res = $browser->request( $req, ':content_file' => '/fold/AAA1.xml' );
print $res->status_line, "\n";
It prints 200 OK status but I am not able to get the file. Any suggestions?
If the server doesn't allow directory listings (e.g. Apache without "Options +Indexes"), you will not be able to GET the collection of files.
But once you have the listing, you can filter it with a regex like /AAA.*\.xml/, and fetching each matching file is easy with the LWP::Simple module.
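A rough sketch along those lines, assuming the server does expose a directory listing (the base URL, credentials, and output directory are the placeholders from the question; LWP::UserAgent is used instead of LWP::Simple so that basic auth and streaming to disk both work):

use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

my $base = 'https://url.com/';
my $ua   = LWP::UserAgent->new;

# Fetch the directory listing.
my $req = HTTP::Request->new( GET => $base );
$req->authorization_basic( 'scott', 'tiger' );
my $res = $ua->request($req);
die $res->status_line unless $res->is_success;

# Pull AAA*.xml hrefs out of the listing. A bare regex is fragile;
# an HTML parser such as HTML::LinkExtor would be more robust.
my @files = $res->decoded_content =~ /href="(AAA[^"\/]*\.xml)"/g;

for my $file (@files) {
    my $get = HTTP::Request->new( GET => $base . $file );
    $get->authorization_basic( 'scott', 'tiger' );
    # ':content_file' streams the body straight to disk,
    # which matters for multi-GB files.
    my $r = $ua->request( $get, ':content_file' => "/fold/$file" );
    warn "$file: " . $r->status_line . "\n" unless $r->is_success;
}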

WWW::Mechanize Form Select

I am attempting to log in to YouTube with WWW::Mechanize and use forms() to print out all the forms on the page after logging in. My script logs in successfully, and also successfully navigates to youtube.com/inbox; however, for some reason Mechanize cannot see any forms at youtube.com/inbox. It just returns nothing. Here is my code:
#!"C:\Perl64\bin\perl.exe" -T
use strict;
use warnings;
use CGI;
use CGI::Carp qw/fatalsToBrowser/;
use WWW::Mechanize;
use Data::Dumper;
my $q = CGI->new;
$q->header();
my $url = 'https://www.google.com/accounts/ServiceLogin?uilel=3&service=youtube&passive=true&continue=http://www.youtube.com/signin%3Faction_handle_signin%3Dtrue%26nomobiletemp%3D1%26hl%3Den_US%26next%3D%252Findex&hl=en_US&ltmpl=sso';
my $mechanize = WWW::Mechanize->new(autocheck => 1);
$mechanize->agent_alias( 'Windows Mozilla' );
$mechanize->get($url);
$mechanize->submit_form(
form_id => 'gaia_loginform',
fields => { Email => 'myemail',Passwd => 'mypassword' },
);
die unless ($mechanize->success);
$url = 'http://www.youtube.com/inbox';
$mechanize->get($url);
$mechanize->form_id('comeposeform');
my $page = $mechanize->content();
print Dumper($mechanize->forms());
Mechanize is unable to see any forms at youtube.com/inbox; however, as I said, I can print all of the forms from the initial link, no matter what I change it to.
Thanks in advance.
As always, one of the best debugging approaches is to print what you get and check if it is what you were expecting. This applies to your problem too.
In your case, if you print $mechanize->content() you'll see that you didn't get the page you're expecting. YouTube wants you to follow a JavaScript redirect in order to complete your cross-domain login action. You have multiple options here:
parse the returned content manually – e.g. /location\.replace\("(.+?)"/ (see the sketch below)
try to have your code interpret the JavaScript (have a look at WWW::Scripter)
[recommended] use the YouTube API for managing your inbox
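For the first option, a minimal sketch; the regex is the one from the answer, and the \xNN unescaping is an assumption about how the target URL is embedded in the script:

# Hypothetical sketch: extract the JavaScript redirect target from the
# page Mechanize got back and follow it manually, since Mechanize
# doesn't run JavaScript.
my $content = $mechanize->content();
if ( $content =~ /location\.replace\("(.+?)"\)/ ) {
    my $target = $1;
    # The URL sits inside a JS string literal, so decode \xNN escapes.
    $target =~ s/\\x([0-9a-fA-F]{2})/chr hex $1/ge;
    $mechanize->get($target);
}
print Dumper( $mechanize->forms() );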