Getting "500 Internal Server Error" retrieving page with LWP::UserAgent - perl

I'm trying to retrieve a page using LWP::UserAgent but I keep getting a "500 Internal Server Error" as a response. Retrieving the exact same page in Firefox (using a fresh "Private Window" - so without any cookies set yet) succeeds without a problem.
I've duplicated the headers exactly as sent by Firefox, but that still does not make a difference. Here's my full code:
use strict;
use LWP::UserAgent;
my $browserObj = LWP::UserAgent->new();
$browserObj->cookie_jar( {} );
$browserObj->timeout(600);
my @header = (
    'Host' => 'www.somedomain.com',
    'User-Agent' => 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0',
    'Accept-Language' => 'en-US,en;q=0.5',
    'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Encoding' => 'gzip, deflate, br',
    'DNT' => '1',
    'Connection' => 'keep-alive',
    'Upgrade-Insecure-Requests' => '1'
);
my $URL = "https://www.somedomain.com";
my $response = $browserObj->get( $URL, @header );
if( $response->is_success ) {
    print "Success!\n";
} else {
    print "Error: " . $response->status_line . ".\n";
}
The real web address is something other than "www.somedomain.com". In fact, it's a URL to an online casino, but I don't want my question to be regarded as spam.
Does anyone have any idea what could be wrong?

On our corporate network, which has a proxy (and an out-of-date Perl version; there may be better options in newer versions), we tend to add the following for one-off scripts:
BEGIN {
    $ENV{HTTPS_DEBUG} = 1; # optional but can help if you get a response
    $ENV{HTTPS_PROXY} = 'https://proxy.server.here.net:8080';
}
If we don't do this, the script simply fails to connect, with no other information.
You may also want to add something like this if you want to inspect the messages:
$browserObj->add_handler("request_send", sub { shift->dump; return });
$browserObj->add_handler("response_done", sub { shift->dump; return });

Related

Perl: Some websites block non-browser requests. But how?

I'm writing a simple Perl script that fetches some pages from different sites. It's very non-intrusive: I don't hog a server's bandwidth, and it retrieves a single page without loading any extra JavaScript, images, or style sheets.
I use LWP::UserAgent to retrieve the pages. This works fine on most sites, but some sites return a "403 Forbidden" error. The same pages load perfectly fine in my browser. I have inspected the request headers from my web browser and copied them exactly when trying to retrieve the same page in Perl, and every single time I get a 403 error. Here's a code snippet:
use strict;
use LWP::UserAgent;
use HTTP::Cookies;
my $URL = "https://www.betsson.com/en/casino/jackpots";
my $browserObj = LWP::UserAgent->new(
ssl_opts => { verify_hostname => 0 }
);
# $browserObj->cookie_jar( {} );
my $cookie_jar = HTTP::Cookies->new();
$browserObj->cookie_jar( $cookie_jar );
$browserObj->agent( "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0");
$browserObj->timeout(600);
push @{ $browserObj->requests_redirectable }, 'POST';
my @header = (
    'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding' => 'gzip, deflate, br',
    'Accept-Language' => 'en-US,en;q=0.5',
    'Connection' => 'keep-alive',
    'DNT' => '1',
    'Host' => 'www.bettson.com',
    'Upgrade-Insecure-Requests' => '1'
);
my $response = $browserObj->get( $URL, @header );
if( $response->is_success ) {
    print "Success!\n";
} else {
    print "Unsuccessful...\n";
}
How do these servers distinguish between a real browser and my script? At first I thought they had some JavaScript trickery going on, but then I realized that, in order for that to work, the page has to be loaded by a browser first. Yet I immediately get this 403 error.
What can I do to debug this?
While 403 is a typical response for bot detection, in this case bot detection is not the cause of the problem. Instead, the cause is a typo in your code:
my $URL = "https://www.betsson.com/en/casino/jackpots";
...
'Host' => 'www.bettson.com',
In the URL the domain name is www.betsson.com, and this should be reflected in the Host header. But your Host header is slightly different: www.bettson.com. Since the Host header has the wrong name, the request is rejected with 403 Forbidden.
Actually, you don't even need to go through all this trouble, since it looks like no bot detection is done at all. In other words, there is no need to set the user agent and fiddle with the headers; plain code works:
my $browserObj = LWP::UserAgent->new();
my $response = $browserObj->get($URL);
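For reference, a minimal complete version of that (with a basic success check; nothing here is specific to that site) could be:
use strict;
use warnings;
use LWP::UserAgent;
my $URL = "https://www.betsson.com/en/casino/jackpots";
my $browserObj = LWP::UserAgent->new();
my $response = $browserObj->get($URL);
if ( $response->is_success ) {
    print $response->decoded_content;   # decoded_content also handles gzip/deflate
} else {
    print "Error: " . $response->status_line . "\n";
}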

Perl: Need an LWP & HTTP::Request POST code that actually works

I have been scratching my head trying to get LWP and HTTP::Request to actually pass a POST parameter to a web server. The web server can see that the request was a POST transaction, but it is not picking up the passed parameters. I have been searching all day on this, have tried different things, and have yet to find something that works. (The web server is working: I am able to manually send POST transactions, and when running the whole script I get a '200' status, but I am not seeing any posted elements.) Any help would be appreciated. Tnx.
my $ua2 = LWP::UserAgent->new;
$ua2->agent("Mozilla/5.0 (compatible; MSIE 6.0; Windows 98)");
my $req2 = HTTP::Request->new(POST => "$url", [ frm-advSearch => 'frmadvSearch' ]);
$req2->content_type('text/html');
my $res2 = $ua2->request($req2);
$http_stat = substr($res2->status_line,0,3);
To send the form parameters in the request body, pass them to LWP::UserAgent's post method:
my $res = $ua->post($url,
    Content => [
        'frm-advSearch' => 'frmadvSearch',
    ],
);
which is short for
use HTTP::Request::Common qw( POST );
my $req = POST($url,
    Content => [
        'frm-advSearch' => 'frmadvSearch',
    ],
);
my $res = $ua->request($req);
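Either way, you can dump the request before sending it to confirm that the form data really ends up in the body (just a quick sanity check, nothing server-specific):
print $req->as_string;   # the body should contain frm-advSearch=frmadvSearch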
Here's a Mojo::UserAgent example, which I find easier to debug:
use Mojo::UserAgent;
my $ua = Mojo::UserAgent->new;
$ua->transactor->name( 'Mozilla/5.0 (compatible; MSIE 6.0; Windows 98)' );
my $url = 'http://www.example.com/form/';
my $tx = $ua->post( $url, form => { 'frm-advSearch' => 'frmadvSearch' } );
say $tx->req->to_string;
The transaction in $tx knows about the request so I can look at that:
POST /form/ HTTP/1.1
Content-Type: application/x-www-form-urlencoded
User-Agent: Mozilla/5.0 (compatible; MSIE 6.0; Windows 98)
Accept-Encoding: gzip
Host: www.example.com
Content-Length: 26
frm-advSearch=frmadvSearch
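If you also want to see what came back, the response half of the transaction can be dumped the same way:
say $tx->res->to_string;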

I can't connect using an API

I'm quite new to APIs, so I don't know if this should be more straightforward.
I wrote the following Perl script:
use strict;
use LWP::UserAgent;
require HTTP::Request;
my $request = HTTP::Request->new(GET => 'http://api.elsevier.com/content/ev/results?apiKey=1234&query=stress&database=c&updateNumber=1&pageSize=1');
my $ua = LWP::UserAgent->new;
my $response = $ua->request($request);
Then, when I get my response and print it in the debugger, I get the following:
HTTP::Response=HASH(0x9aedff8)
'_content' => '{"service-error":{"status":{"statusCode":"AUTHENTICATION_ERROR","statusText":"Requestor configuration settings insufficient for access to this resource."}}}'
'_headers' => HTTP::Headers=HASH(0x9aedfe8)
'allow' => 'GET'
'client-date' => 'Wed, 29 Mar 2017 08:08:25 GMT'
'client-peer' => '198.185.19.118:80'
'client-response-num' => 1
'content-length' => 156
'content-type' => 'application/json;charset=UTF-8'
'date' => 'Wed, 29 Mar 2017 08:08:24 GMT'
'p3p' => 'CP="IDC DSP LAW ADM DEV TAI PSA PSD IVA IVD CON HIS TEL OUR DEL SAM OTR IND OTC"'
'server' => 'api.elsevier.com 9999'
'vary' => 'Origin'
'x-cnection' => 'close'
'x-els-apikey' => 'e688c9db4db0386581dbe4c4dda46164'
'x-els-reqid' => '0000015b190d89fe-a0d0'
'x-els-status' => 'AUTHENTICATION_ERROR(Requestor configuration settings insufficient for access to this resource.)'
'x-els-transid' => 'cbf787b4-d171-4e35-8237-8cab3c931205'
'x-re-ref' => '1 1490774904423414'
'_msg' => 'Forbidden'
'_protocol' => 'HTTP/1.1'
'_rc' => 403
'_request' => HTTP::Request=HASH(0x9fc3000)
'_content' => ''
'_headers' => HTTP::Headers=HASH(0x9ae73e0)
'user-agent' => 'libwww-perl/5.831'
'_method' => 'GET'
'_uri' => URI::http=SCALAR(0x9e25188)
-> 'http://api.elsevier.com/content/ev/results?apiKey=e688c9db4db0386581dbe4c4dda46164&query=stress&database=c&updateNumber=1&pageSize=1'
'_uri_canonical' => URI::http=SCALAR(0x9e25188)
-> REUSED_ADDRESS
One of the notable lines is:
'x-els-status' => 'AUTHENTICATION_ERROR(Requestor configuration settings insufficient for access to this resource.)'
I don't know how to get a proper response text. I tried searching their website for examples, but I can't seem to get it. Also, I'm not sure whether the key is only for Scopus and not for Engineering Village, which is what I'm trying to use.
Their website is here: https://dev.elsevier.com/index.html?utm_expid=89327795-0.AtRZzToKQ2u1mZEyQ3n7OQ.0&utm_referrer=https%3A%2F%2Fdev.elsevier.com%2Ftecdoc_ev_retrieval_request.html
Any help would be appreciated.
To get the text out of your response, you need to call the $response->decoded_content method. That will give you the JSON string that you can see in _content in your debug output. I've indented it to make it easier to read.
{
"service-error" : {
"status" : {
"statusCode" : "AUTHENTICATION_ERROR",
"statusText" : "Requestor configuration settings insufficient for access to this resource."
}
}
}
You can use the JSON module to decode this into a Perl data structure.
use JSON 'from_json';
my $res = $ua->request($req);
my $json = from_json( $res->decoded_content );
The error message you get back clearly states that you are not authenticated properly. I've looked at this guide from the documentation you mentioned. It seems that the apiKey URL param works if you have the right type of account. You should check with whoever made that account for you, or, if that was you and you're not sure, with the account manager at that service who is working with you. They'll tell you if you are using the right API key, and if this method of authentication works for you.
Since this API also offers a custom header X-ELS-APIKey: [apikey] for authentication, I would suggest using that. Your API key is a secret, and you shouldn't share it with anyone. It's like a password. If you put it into the URL, it might show up in log files, but as a header it usually does not.
This is how you add a custom header to an HTTP request. Make sure you don't have the apiKey URL param any more if you do this.
my $req = HTTP::Request->new( GET => $url ); # no apiKey=123 here!
$req->header( 'X-ELS-APIKey' => 123 );
Now, as a last step, you should check the HTTP response code of the response. A 200 (or most other codes that start with 2) means the request was successful. The 403 that you are getting back means Forbidden, which also hints that you are not authenticated correctly.
Since it seems that this API returns JSON in both success and failure cases, you might need to decode it for both. If you care to examine the failure response, that makes sense; if not, you can skip that part. To distinguish the two cases, use $res->is_success, which is also used in the synopsis of the LWP::UserAgent documentation.
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;
use JSON 'from_json';
my $ua = LWP::UserAgent->new;
my $req = HTTP::Request->new( GET => 'http://api.elsevier.com/content/ev/results?query=stress&database=c&updateNumber=1&pageSize=1' );
$req->header( 'X-ELS-APIKey' => 123 );
my $res = $ua->request($req);
if ($res->is_success) {
    my $json = from_json( $res->decoded_content );
    # ... do stuff with the response
} else {
    # something went wrong
}
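If you want to surface the API's own error message in the failure branch, you can decode the JSON there as well; the hash keys below are taken from the error document shown above and assume failures always have that shape:
my $error = from_json( $res->decoded_content );
warn $error->{'service-error'}{status}{statusText} // $res->status_line;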

How do I simulate this particular POST request in Mechanize

The POST request is as follows:
$bot->add_header(
'Host'=>'www.amazon.com',
'User-Agent'=>'application/json, text/javascript, */*',
'Accept'=>'application/json, text/javascript, */*',
'Accept Language'=>'en-us,en;q=0.5',
'Accept Encoding'=>'gzip, deflate',
'DNT'=>'1',
'Connection'=>'keep-alive',
'Content type'=>'application/x-www-form-urlencoded; charset=UTF-8',
'X-Requested with'=>'XMLHttpRequest',
'Referer'=>'https://www.amazon.com/gp/digital/fiona/manage?ie=UTF8&ref_=gno_yam_myk',
'Content length'=>'44',
'Cookie'=>'how do i put the cookie value');
POST parameters in my request:
sid - how do I get the session ID?
new email - mailhost@mail.com
My code to log on:
use WWW::Mechanize;
use HTTP::Cookies;
use HTML::Form;
use WWW::Mechanize::Link;
my $bot = WWW::Mechanize->new();
$bot->agent_alias( 'Linux Mozilla' );
# Create a cookie jar for the login credentials
$bot->cookie_jar( HTTP::Cookies->new( file => "cookies.txt",
autosave => 1,
ignore_discard => 1, ) );
# Connect to the login page
my $response = $bot->get( 'https://www.amazon.com/gp/css/homepage.html/' );
# Get the login form. You might need to change the number.
$bot->form_number(3);
# Enter the login credentials.
$bot->field( email => '' );
$bot->field( password => '' );
$response = $bot->click();
#print $response->decoded_content;
$bot->get( 'https://www.amazon.com/gp/yourstore/home?ie=UTF8&ref_=topnav_ys' );
print $bot->content();
$bot->post('https://www.amazon.com/gp/digital/fiona/du/add-whitelist.html/ref=kinw_myk_wl_add', [sid => 'id', email => 'v2@d.com']);
Data captured:
Host=www.amazon.com
User-Agent=Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0
Accept=application/json, text/javascript, */*
Accept-Language=en-us,en;q=0.5
Accept-Encoding=gzip, deflate
DNT=1
Connection=keep-alive
Content-Type=application/x-www-form-urlencoded; charset=UTF-8
X-Requested-With=XMLHttpRequest
Referer=https://www.amazon.com/gp/digital/fiona/manage?ie=UTF8&ref_=gno_yam_myk
Content-Length=39
Cookie=session-id-time=2082787201l; session-id
Pragma=no-cache
Cache-Control=no-cache
POSTDATA=sid=id&email=v%40d.com
Error message:
Error POSTing https://www.amazon.com/gp/digital/fiona/du/add-whitelist.html/ref=
kinw_myk_wl_add: InternalServerError at logon.pl line 81
See the post method in WWW::Mechanize:
$bot->post($url, [sid => 'id', email => 'v@d.com']);
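If the extra headers from your captured request turn out to matter, they can be passed after the form data. The sketch below is untested and keeps autocheck off so a failing response can be inspected instead of making the script die:
my $bot = WWW::Mechanize->new( autocheck => 0 );
my $response = $bot->post(
    $url,
    [ sid => 'id', email => 'v@d.com' ],
    'X-Requested-With' => 'XMLHttpRequest',
    'Referer'          => 'https://www.amazon.com/gp/digital/fiona/manage?ie=UTF8&ref_=gno_yam_myk',
);
print $response->request->as_string;   # what was actually sent
print $response->status_line, "\n";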

Cookies in Perl LWP

I once wrote a simple 'crawler' in Java to download HTTP pages for me.
Now I'm trying to rewrite the same thing in Perl, using the LWP module.
This is my Java code (which works fine):
String referer = "http://example.com";
String url = "http://example.com/something/cgi-bin/something.cgi";
String params= "a=0&b=1";
HttpState initialState = new HttpState();
HttpClient httpclient = new HttpClient();
httpclient.setState(initialState);
httpclient.getParams().setCookiePolicy(CookiePolicy.NETSCAPE);
PostMethod postMethod = new PostMethod(url);
postMethod.addRequestHeader("Referer", referer);
postMethod.addRequestHeader("User-Agent", " Mozilla/5.0 (Windows; U; Windows NT 6.1; pl; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13");
postMethod.addRequestHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,/;q=0.8");
postMethod.addRequestHeader("Content-Type", "application/x-www-form-urlencoded");
String length = String.valueOf(params.length());
postMethod.addRequestHeader("Content-Length", length);
postMethod.setRequestBody(params);
httpclient.executeMethod(postMethod);
And this is the Perl version:
my $referer = "http://example.com/something/cgi-bin/something.cgi?module=A";
my $url = "http://example.com/something/cgi-bin/something.cgi";
my @headers = (
    'User-Agent' => 'Mozilla/5.0 (Windows; U; Windows NT 6.1; pl; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13',
    'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Referer' => $referer,
    'Content-Type' => 'application/x-www-form-urlencoded',
);
my @params = (
    'a' => '0',
    'b' => '1',
);
my $browser = LWP::UserAgent->new( );
$browser->cookie_jar({});
$response = $browser->post($url, @params, @headers);
print $response->content;
The post request executes correctly, but I get a different (the main) webpage back, as if cookies were not working properly...
Any guesses as to what is wrong? Why am I getting different results from the Java and Perl programs?
You can also use WWW::Mechanize, which is a wrapper around LWP::UserAgent. It gives you the cookie jar automatically.
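For example, a minimal Mechanize version of the same POST could look like this (a sketch only; the extra headers from the Java code are omitted):
use WWW::Mechanize;
my $url  = "http://example.com/something/cgi-bin/something.cgi";
my $mech = WWW::Mechanize->new();   # sets up a cookie jar for you
my $response = $mech->post( $url, { a => 0, b => 1 } );
print $response->decoded_content;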
You want to be creating hashes, not arrays - e.g. instead of:
my @params = (
    'a' => '0',
    'b' => '1',
);
You should use:
my %params = (
    a => 0,
    b => 1,
);
When passing the params to the LWP::UserAgent post method, you need to pass a reference to the hash, e.g.:
$response = $browser->post($url, \%params, %headers);
You could also look at the request you're sending to the server with:
print $response->request->as_string;
You can also use a handler to automatically dump requests and responses for debugging purposes:
$ua->add_handler("request_send", sub { shift->dump; return });
$ua->add_handler("response_done", sub { shift->dump; return });
I believe it has to do with $response = $browser->post($url, @params, @headers);
From the documentation of LWP::UserAgent:
$ua->post( $url, \%form )
$ua->post( $url, \@form )
$ua->post( $url, \%form, $field_name => $value, ... )
$ua->post( $url, $field_name => $value,... Content => \%form )
$ua->post( $url, $field_name => $value,... Content => \@form )
$ua->post( $url, $field_name => $value,... Content => $content )
Since your params and headers are hashes, I would try this:
my $referer = "http://example.com/something/cgi-bin/something.cgi?module=A";
my $url = "http://example.com/something/cgi-bin/something.cgi";
my %headers = (
    'User-Agent' => 'Mozilla/5.0 (Windows; U; Windows NT 6.1; pl; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13',
    'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Referer' => $referer,
    'Content-Type' => 'application/x-www-form-urlencoded',
);
my %params = (
    'a' => '0',
    'b' => '1',
);
my $browser = LWP::UserAgent->new( );
$browser->cookie_jar({});
$response = $browser->post($url, \%params, %headers);
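If cookies are still the suspect after that change, you can dump the jar after the request to see what the server actually set (cookie_jar with no arguments returns the underlying HTTP::Cookies object):
print $browser->cookie_jar->as_string;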