WWW::Mechanize timeout - all URLs timing out - Perl

I am having a problem using WWW::Mechanize. It seems that no matter what website I try to access, my script just sits there at the command prompt until it times out. The only things that come to mind as possibly relevant are the following:
I have IE7, Chrome, and Firefox installed. Firefox was my default browser, but I recently switched that to Chrome.
I seem to be able to access websites on port 8080 just fine.
I recently experimented with the cookie jar but stopped using it because, honestly, I'm not sure how it works. This may have caused the change.
Here is an example:
#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;
my $url = 'http://docstore.mik.ua/orelly/perl/learn/';
my $mech = WWW::Mechanize->new();
$mech->get( $url );
print $mech->content;

The code works as posted, so this is most likely a firewall or proxy issue. You can try setting a proxy explicitly:
$mech->proxy(['http', 'ftp'], 'http://your-proxy:8080/');
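If your proxy settings already live in environment variables such as http_proxy, you can have them picked up automatically instead. Here is a minimal sketch (the URL is the one from the question) using env_proxy, which WWW::Mechanize inherits from LWP::UserAgent:
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
$mech->env_proxy;    # read proxy settings from http_proxy, ftp_proxy, etc.
$mech->get('http://docstore.mik.ua/orelly/perl/learn/');
print $mech->content;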

Related

Unable to get page via HTTPS with LWP::Simple in Perl

I try to download a page from an HTTPS URL with Perl:
use LWP::Simple;
my $url = 'https://www.ferc.gov/xml/whats-new.xml';
my $content = get $url or die "Unable to get $url\n";
print $content;
The script dies without retrieving the page, and I can't figure out what the error is. Is the get request improperly coded? Do I need to use a user agent?
LWP::Protocol::https is needed to make HTTPS requests with LWP. It needs to be installed separately from the rest of LWP. It looks like you installed LWP, but not LWP::Protocol::https, so simply install it now.
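Note that LWP::Simple's get returns undef on any failure without saying why. A small diagnostic sketch with LWP::UserAgent (same URL as above) surfaces the underlying error:
use strict;
use warnings;
use LWP::UserAgent;

my $url  = 'https://www.ferc.gov/xml/whats-new.xml';
my $ua   = LWP::UserAgent->new;
my $resp = $ua->get($url);

# Without LWP::Protocol::https this prints something like
# "501 Protocol scheme 'https' is not supported".
die $resp->status_line, "\n" unless $resp->is_success;
print $resp->decoded_content;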

Failing to establish a session while trying to log into a website and opening the logged in page

I am new to Perl, so as an exercise I have been trying to log in to a web page and open the logged-in page later from the command line.
This is the code I wrote:
use HTTP::Cookies;
use lib "/home/tivo/Desktop/exp/WWW-Mechanize-1.80/lib/WWW";
use Mechanize;
$mech = WWW::Mechanize->new();
$mech->cookie_jar(HTTP::Cookies->new());
$url = "<url>";
$mech->credentials('username' => 'password');
$mech->get($url);
$mech->save_content("logged_in.html");
After I execute the script, I try to open the saved HTML page using the command
$ firefox logged_in.html
But I get the error:
BIG-IP can not find session information in the request. This can happen because your browser restarted after an add-on was installed. If this occurred, click the link below to continue. This can also happen because cookies are disabled in your browser. If so, enable cookies in your browser and start a new session.
The same code worked for Facebook login.
Here are the main issues:
You haven't installed WWW::Mechanize; it looks like you've just downloaded and unpacked it, and added use lib to point to where it's unpacked. You need to run cpan WWW::Mechanize from the command line to install it properly; then it will be in a directory where Perl looks for libraries by default, so there will be no need for a use lib at all.
You need to use WWW::Mechanize. Just use Mechanize won't do.
You must always start every Perl program with use strict and use warnings 'all', and declare all your variables with my.
Fixing those should get you a long way towards working.
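Putting those fixes together, a minimal corrected sketch might look like this (the <url> and the username/password strings are placeholders carried over from the question):
#!/usr/bin/perl
use strict;
use warnings 'all';

use WWW::Mechanize;    # installed via: cpan WWW::Mechanize
use HTTP::Cookies;

my $mech = WWW::Mechanize->new(
    cookie_jar => HTTP::Cookies->new,    # keep the session cookies BIG-IP expects
);

my $url = '<url>';
$mech->credentials( 'username', 'password' );
$mech->get($url);
$mech->save_content('logged_in.html');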

Perl LWP::Simple won't "get" a webpage when running from remote server

I'm trying to use Perl to scrape a publications list as follows:
use XML::XPath;
use XML::XPath::XMLParser;
use LWP::Simple;
my $url = "https://connects.catalyst.harvard.edu/Profiles/profile/xxxxxxx/xxxxxx.rdf";
my $content = get($url);
die "Couldn't get publications!" unless defined $content;
When I run it on my local (Windows 7) machine it works fine. When I try to run it on the Linux server where we are hosting some websites, it dies. I installed the XML and LWP modules using cpan, so those should be there. I'm wondering if the problem could be some sort of security or permission setting on the server (keeping it from accessing an external website), but I don't even know where to start with that. Any ideas?
It turns out I didn't have LWP::Protocol::https installed. I found this out by switching
LWP::Simple
to
LWP::UserAgent
and adding the following:
my $ua   = LWP::UserAgent->new;
my $resp = $ua->get('https://connects.catalyst.harvard.edu/Profiles/profile/xxxxxx/xxxxxxx.rdf');
print $resp->status_line;    # show why the request failed, not just that it did
It then returned an error telling me that it could not handle the https scheme without LWP::Protocol::https, so I installed it with
cpan LWP::Protocol::https
and all was good.
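As a quick sanity check that the module is now visible to your Perl, a one-liner like this helps (it should exit silently if the module loads, and die with an error if not):
perl -MLWP::Protocol::https -e 1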

Perl mechanize response is only "<HTML></HTML>" with https

I'm kind of new to Perl, and even newer to Mechanize. So far, fetching a site via http has been no problem.
Now I need to fetch a site with https. I've installed Crypt::SSLeay via PPM.
When I use $mech->get($url), though, this is the only response I get:
"<HTML></HTML>"
I checked the status and success, both were OK (200 and 1).
Here's my code:
use strict;
use warnings;
use WWW::Mechanize;
use Crypt::SSLeay;
$ENV{HTTPS_PROXY} = 'http://username:pw@host:port';
# I have the https_proxy env variable set globally too.
my $url = 'https://google.com';
# Every https site gives the same response,
# so I don't think google itself is the problem.
my $mech = WWW::Mechanize->new(noproxy => 0);
$mech->get($url) or die "Couldn't load page";
print "Content:\n" . $mech->response()->content() . "\n\n";
As you can see I'm behind a proxy. I tried setting
$mech->proxy($myproxy);
but to no avail. I even tried fetching the page into a file, but when I checked it, I saw the same response content.
Any kind of advice would be appreciated, since I'm just a beginner and there is still a lot to learn. Thanks!
I think the answer lies here: How do I force LWP to use Crypt::SSLeay for HTTPS requests?
use Net::SSL ();    # from Crypt-SSLeay
BEGIN {
    $Net::HTTPS::SSL_SOCKET_CLASS = "Net::SSL";    # force use of Net::SSL
    $ENV{HTTPS_PROXY} = 'http://10.0.3.1:3128';    # your proxy!
    $ENV{PERL_LWP_SSL_VERIFY_HOSTNAME} = 0;
}
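The BEGIN block matters because Net::HTTPS picks its socket class when it is first loaded, so the variable has to be set before anything pulls in LWP's HTTPS support. Applied to the Mechanize script above, a hedged sketch (the proxy address and credentials are placeholders) might be:
use strict;
use warnings;

use Net::SSL ();    # from Crypt-SSLeay
BEGIN {
    # Choose the SSL socket class before LWP loads its HTTPS support.
    $Net::HTTPS::SSL_SOCKET_CLASS = 'Net::SSL';
    $ENV{HTTPS_PROXY} = 'http://username:pw@host:port';    # placeholder
    $ENV{PERL_LWP_SSL_VERIFY_HOSTNAME} = 0;    # Net::SSL does not verify hostnames
}

use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
$mech->get('https://google.com');
print "Content:\n" . $mech->content() . "\n";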

Why do I need to explicitly output the HTTP header for IIS but not Apache?

I am trying to set up Apache instead of IIS because IIS needlessly crashes all the time, and it would be nice for each of us to have our own checkout of the source instead of all of us editing a common checkout.
In IIS we must do something like this at the beginning of each file:
use CGI;
my $input = new CGI();
print "HTTP/1.0 200 OK";
print $input->header();
whereas with Apache we must leave off the 200 OK line. The following works with both:
use CGI;
my $input = new CGI();
print $input->header('text/html','200 OK');
Can anyone explain why? I was under the impression that the CGI module was supposed to figure out these kinds of details automatically...
Thanks!
Update: brian is right, nph fixes the problem for IIS, but it is still broken for Apache. I don't think it's worth it to have conditionals all over the code so I will just stick with the last method, which works with and without nph.
HTTP and CGI are different things. The Perl CGI module calls what it does an "HTTP header", but it's really just a CGI header for the server to fix up before it goes back to the client. They look a lot alike, which is why people get confused and why the CGI.pm docs don't help by calling them the wrong thing.
Apache fixes up the CGI headers to make them into HTTP headers, including adding the HTTP status line and anything else it might need.
If your webserver isn't fixing up the header for you, it's probably expecting a "non-parsed header" (NPH) script, where you take responsibility for the entire header. To do that in CGI.pm, you have to add the -nph option to your call to header, and you have to produce the complete header yourself, including headers such as Expires and Last-Modified. See the docs under Creating a Standard HTTP Header. You can turn on NPH in three ways (a short sketch follows the list):
use CGI qw(-nph)
CGI::nph(1)
print header( -nph => 1, ...)
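For instance, a minimal NPH sketch using CGI.pm's documented header options (the body is just illustrative):
#!/usr/bin/perl
use strict;
use warnings;
use CGI qw(header);

# With -nph, CGI.pm emits the full HTTP status line itself,
# so the server passes the response through unparsed.
print header(
    -nph    => 1,
    -type   => 'text/html',
    -status => '200 OK',
);
print "<html><body>OK</body></html>\n";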
Are you using an older version of IIS? CGI.pm used to turn on the NPH feature for you automatically for IIS, but now that line is commented out in the source in CGI.pm:
# This no longer seems to be necessary
# Turn on NPH scripts by default when running under IIS server!
# $NPH++ if defined($ENV{'SERVER_SOFTWARE'}) && $ENV{'SERVER_SOFTWARE'}=~/IIS/;
I'm still experiencing this problem with ActivePerl 5.14 running under IIS 7 via ISAPI. The ActivePerl 5.10 FAQ claims the problem is fixed (the 5.14 FAQ doesn't even address the issue), but it doesn't appear to be fixed, and setting the registry key they suggest has no effect in this environment.
Using $ENV{PERLXS} eq 'PerlIS' to detect ISAPI and turn on the NPH key per the aforementioned FAQ seems to work. I hacked my CGI.pm to add the final two lines below under the old IIS handler:
# This no longer seems to be necessary
# Turn on NPH scripts by default when running under IIS server!
# $NPH++ if defined($ENV{'SERVER_SOFTWARE'}) && $ENV{'SERVER_SOFTWARE'}=~/IIS/;
# Turn on NPH scripts by default when running under IIS server via ISAPI!
$NPH++ if defined($ENV{PERLXS}) && $ENV{PERLXS} eq 'PerlIS';
I had a similar problem with Perl (it was a DOS/Unix/Mac newline issue!):
binmode(STDOUT);
my $CRLF = "\r\n";    # "\015\012"; CR (^M) is \x0D, LF (^J) is \x0A
print "HTTP/1.0 200 OK", $CRLF if $0 =~ m/nph-/;
print "Content-Type: text/plain", $CRLF;
print $CRLF;
print "OK!\n";