I'm trying to use LWP::Simple in Perl to download a number of PDF documents from the United Nations website (Security Council resolutions, etc.). Yet instead of returning PDFs, I am receiving an HTML error page. Consider this very simple example:
use LWP::Simple;
use strict;
my $url = 'https://documents-dds-ny.un.org/doc/UNDOC/GEN/N16/100/02/PDF/N1610002.pdf';
my $file = 'test.pdf';
getstore($url, $file);
If I then look at the contents of "test.pdf", I find that they are an HTML page.
I have also tried a number of tricks with LWP::UserAgent and even with cURL, but with no success. Any ideas?
Ok, thanks to #SteffenUllrich and # ikegami for putting me on the right track!
It is indeed a cookie issue. The fix? Open a cookie jar, access the homepage of the site first, then access the PDF once a cookie has been stored in the jar.
This can be done without using HTTP::Cookies. We need to use LWP::UserAgent instead of LWP::Simple, however.
Minimal working example below:
use strict;
use warnings 'all';
use LWP::UserAgent;
my $homeUrl = "https://documents.un.org/prod/ods.nsf/home.xsp";
my $pdfUrl = "https://documents-dds-ny.un.org/doc/UNDOC/GEN/N16/100/02/PDF/N1610002.pdf";
my $pdfOutputName = "test.pdf";
my $browser = LWP::UserAgent->new( cookie_jar => { } );
my $resp;
$resp = $browser->get( $homeUrl );
die $resp->status_line unless $resp->is_success;
$resp = $browser->get( $pdfUrl, ':content_file' => $pdfOutputName );
die $resp->status_line unless $resp->is_success;
This will produce a complete PDF file.
Related
There is a page residing on a local server running Apache. I would like to submit the form via a GET request with a single name/value pair, like:
id=item1234
This GET request has to be processed by another server which I don't have control over subsequently returning a page which I would like to transform with the CGI script. In other words:
User submits form
MY apache proxies to external resource
EXTERNAL resource throws back a page
MY apache transforms it with a CGI (maybe another way?)
User get a modified page
Again this more like an architectural question so I'd be grateful for any hints, even poking my nose into some guides will help as I wasn't able to structure my google request well enough to locate anything related.
Thanks.
Pass the id "17929632" to this CGI code ("proxy.pl?id=17929632"), and you should this exact page in your browser.
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use CGI::Pretty qw(:standard -any -no_xhtml -oldstyle_urls);
print header;
print "<html>\n";
print " <head><title>Proxy Demo</title></head>\n";
print " <body bgcolor=\"white\">\n";
my $id = param('id') || die "No CGI param 'id'\n";
my $ua = LWP::UserAgent->new;
$ua->agent("MyApp/0.1 ");
# Create a request
my $req = HTTP::Request->new(GET => "http://stackoverflow.com/questions/$id");
# Pass request to the user agent and get a response back
my $response = $ua->request($req);
# Check the outcome of the response
if ($response->is_success) {
my $content = $response->content;
# Modify the original content here!
print $content;
}
else {
print $response->status_line;
}
print "</body></html>\n";
Vague question, vague answer: write your CGI program to include a HTTP user agent, e.g. LWP.
I am attempting to login to Youtube with WWW:Mechanize and use forms() to print out all the forms on the page after logging in. My script is logging in successfully, and also successfully navigating to Youtube.com/inbox; However, for some reason Mechanize can not see any forms at Youtube.com/inbox. It just returns blank. Here is my code:
#!"C:\Perl64\bin\perl.exe" -T
use strict;
use warnings;
use CGI;
use CGI::Carp qw/fatalsToBrowser/;
use WWW::Mechanize;
use Data::Dumper;
my $q = CGI->new;
$q->header();
my $url = 'https://www.google.com/accounts/ServiceLogin?uilel=3&service=youtube&passive=true&continue=http://www.youtube.com/signin%3Faction_handle_signin%3Dtrue%26nomobiletemp%3D1%26hl%3Den_US%26next%3D%252Findex&hl=en_US<mpl=sso';
my $mechanize = WWW::Mechanize->new(autocheck => 1);
$mechanize->agent_alias( 'Windows Mozilla' );
$mechanize->get($url);
$mechanize->submit_form(
form_id => 'gaia_loginform',
fields => { Email => 'myemail',Passwd => 'mypassword' },
);
die unless ($mechanize->success);
$url = 'http://www.youtube.com/inbox';
$mechanize->get($url);
$mechanize->form_id('comeposeform');
my $page = $mechanize->content();
print Dumper($mechanize->forms());
Mechanize is unable to see any forms at youtube.com/inbox, however, like I said, I can print all of the forms from the initial link, no matter what I change it to...
Thanks in advance.
As always, one of the best debugging approaches is to print what you get and check if it is what you were expecting. This applies to your problem too.
In your case, if you print $mechanize->content() you'll see that you didn't get the page you're expecting. YouTube wants you to follow a JavaScript redirect in order to complete your cross-domain login action. You have multiple options here:
parse the returned content manually – i.e. /location\.replace\("(.+?)"/
try to have your code parse JavaScript (have a look at WWW::Scripter)
[recommended] use YouTube API for managing your inbox
I'm trying to figure out the right way to serve icon files for our site listings. Basically an icon for a listing can come from an image file from a handful of different services (Flickr, Picasa, Google Static Maps, our own internal image hosting service, etc). The URL of the icon is stored in our database so I'd like to enable each listing icon to be displayed by simply calling:
http://www.example.com/listing/1234/icon
Currently I have been using CGI.pm to do a redirect to the correct icon URL, however, I want the file to be directly displayed without having to do a 301 redirect. Here is the code for what we've been using:
my $url = "http://www.example-service.com/image-123.gif";
print $query->redirect(-url=>$url);
I would appreciate any suggestions and code examples of on how I could update this to serve the file via proxy without having to redirect the user. Thanks in advance for your help!
Use LWP to get the remote file and print it out.
#!/usr/local/bin/perl
use LWP::UserAgent;
use CGI;
my $q = CGI->new;
my $ua = LWP::UserAgent->new;
$ua->agent("MyApp/0.1");
my $url = 'http://www.example-service.com/image-123.gif';
# Create a request
my $req = HTTP::Request->new(GET => $url);
my $res = $ua->request($req);
if ($res->is_success) {
print $q->header( $res->content_type );
print $res->content;
} else {
print $q->header( 'text/plain', $res->status_line );
print $res->status_line, "\n";
}
Alternatively you could write a trigger for your database which downloads the image for the listing and stores it either in the webroot somewhere or in the database itself when you add a new listing.
I am trying to download a file from a site using perl. I chose not to use wget so that I can learn how to do it this way. I am not sure if my page is not connecting or if something is wrong in my syntax somewhere. Also what is the best way to check if you are getting a connection to the page.
#!/usr/bin/perl -w
use strict;
use LWP;
use WWW::Mechanize;
my $mech = WWW::Mechanize->new();
$mech->credentials( '********' , '********'); # if you do need to supply server and realms use credentials like in [LWP doc][2]
$mech->get('http://datawww2.wxc.com/kml/echo/MESH_Max_180min/');
$mech->success();
if (!$mech->success()) {
print "cannot connect to page\n";
exit;
}
$mech->follow_link( n => 8);
$mech->save_content('C:/Users/********/Desktop/');
I'm sorry but almost everything is wrong.
You use a mix of LWP::UserAgent and WWW::Mechanize in a wrong way. You can't do $mech->follow_link() if you use $browser->get() as you mix function from 2 module. $mech don't know that you did a request.
Arguments to credentials are not good, see the doc
You more probably want to do something like this:
use WWW::Mechanize;
my $mech = WWW::Mechanize->new();
$mech->credentials( '************' , '*************'); # if you do need to supply server and realms use credentials like in LWP doc
$mech->get('http://datawww2.wxc.com/kml/echo/MESH_Max_180min/');
$mech->follow_link( n => 8);
You can check result of get() and follow_link() by checking $mech->success() result
if (!$mech->success()) { warn "error"; ... }
After follow->link, data is available using $mech->content(), if you want to save it in a file use $mech->save_content('/path/to/a/file')
A full code could be :
use strict;
use WWW::Mechanize;
my $mech = WWW::Mechanize->new();
$mech->credentials( '************' , '*************'); #
$mech->get('http://datawww2.wxc.com/kml/echo/MESH_Max_180min/');
die "Error: failled to load the web page" if (!$mech->success());
$mech->follow_link( n => 8);
die "Error: failled to download content" if (!$mech->success());
$mech->save_content('/tmp/mydownloadedfile')
When I visit usatoday.com with IE, there're cookie files automatically created in my Temporary Internet Files folder. But why doesn't the following Perl script capture anything?
use WWW::Mechanize;
use strict;
use warnings;
my $browser = WWW::Mechanize->new();
my $response = $browser->get( 'http://www.usatoday.com' );
my $cookie_jar = $browser->cookie_jar(HTTP::Cookies->new());
$cookie_jar->extract_cookies( $response );
my $cookie_content = $cookie_jar->as_string;
print $cookie_content;
For some other sites like amazon.com, google.com and yahoo.com, the script works well, but at least it seems to me usatoday.com also sends cookie information to the browser, why am I having different results? Is there something I'm missing?
Any ideas? Thanks!
UsaToday uses Javascript to set the cookie. WWW::Mechanize does not parse or run Javascript.
If you need to crawl the site with a cookie, you could analyze http://i.usatoday.net/_common/_scripts/gel/lib/core/core.js and other JS files and determine how exactly the cookie is created, and create one yourself programmatically.