Perl: Some websites block non-browser requests. But how? - perl

I'm writing a simple Perl script that fetches some pages from different sites. It's very non-intrusive. I don't hog a server's bandwidth. It retrieves a single page without loading any extra JavaScript, images, or style sheets.
I use LWP::UserAgent to retrieve the pages. This works fine on most sites, but some sites return a "403 - Bad Request" error. The same pages load perfectly fine in my browser. I have inspected the request headers from my web browser and copied them exactly when trying to retrieve the same page in Perl, and every single time I get a 403 error. Here's a code snippet:
use strict;
use LWP::UserAgent;
use HTTP::Cookies;
my $URL = "https://www.betsson.com/en/casino/jackpots";
my $browserObj = LWP::UserAgent->new(
ssl_opts => { verify_hostname => 0 }
);
# $browserObj->cookie_jar( {} );
my $cookie_jar = HTTP::Cookies->new();
$browserObj->cookie_jar( $cookie_jar );
$browserObj->agent( "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0");
$browserObj->timeout(600);
push @{ $browserObj->requests_redirectable }, 'POST';
my @header = ( 'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Accept-Encoding' => 'gzip, deflate, br',
'Accept-Language' => 'en-US,en;q=0.5',
'Connection' => 'keep-alive',
'DNT' => '1',
'Host' => 'www.bettson.com',
'Upgrade-Insecure-Requests' => '1'
);
my $response = $browserObj->get( $URL, @header );
if( $response->is_success ) {
print "Success!\n";
} else {
print "Unsuccessfull...\n";
}
How do these servers distinguish between a real browser and my script? At first I thought they had some JavaScript trickery going on, but then I realized in order for that to work, the page has to be loaded by a browser first. But I immediately get this 403 Error.
What can I do to debug this?

While 403 is a typical response when a bot is detected, bot detection is not the cause of the problem in this case. The cause is a typo in your code:
my $URL = "https://www.betsson.com/en/casino/jackpots";
...
'Host' => 'www.bettson.com',
In the URL the domain name is www.betsson.com, and the Host header should match it. But your Host header is slightly different: www.bettson.com. Since the Host header carries the wrong name, the request is rejected with 403 Forbidden.
In fact, none of this trouble is needed, since it looks like no bot detection is done at all. There is no need to set the user agent or fiddle with the headers; the plain version is enough:
my $browserObj = LWP::UserAgent->new();
my $response = $browserObj->get($URL);
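As a debugging aid for this kind of mismatch, the sketch below compares an explicit Host header with the host part of the request URL before anything is sent. The helper name host_header_mismatch is made up for illustration; it relies only on HTTP::Request and URI methods.

```perl
use strict;
use warnings;
use HTTP::Request;

# Hypothetical helper: returns the offending Host header value if it
# differs from the host part of the request URI, otherwise undef.
sub host_header_mismatch {
    my ($req) = @_;
    my $header = $req->header('Host');
    return undef unless defined $header;     # no explicit Host header set
    ( my $name = lc $header ) =~ s/:\d+\z//;  # drop an optional :port
    return $name eq lc $req->uri->host ? undef : $header;
}

my $req = HTTP::Request->new(
    GET => 'https://www.betsson.com/en/casino/jackpots',
    [ Host => 'www.bettson.com' ],            # the typo from the question
);

if ( defined( my $bad = host_header_mismatch($req) ) ) {
    printf "Host header '%s' does not match URL host '%s'\n",
        $bad, $req->uri->host;
}
```

LWP fills in the Host header from the URL when none is given, so leaving it out avoids this class of error entirely.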

Related

LWP::UserAgent loses content data when redirecting via POST

I am posting JSON data to Jira, and the request is hitting CAS first. A series of redirects occurs. However, after the initial request I noticed that the content is zeroed out on the first redirect. The end result is that my request reaches Jira with empty content and is unsuccessful.
Short of preventing LWP::UserAgent from redirecting and following the links myself, I'm not sure what else to try. My understanding is that this is supposed to be handled by the module.
This is vaguely representative with redactions...
use LWP::UserAgent ();
use HTTP::Request ();
use HTTP::Headers;
use HTTP::Cookies;
my $cookie_jar = HTTP::Cookies->new();
my $user_agent = LWP::UserAgent->new;
$user_agent->cookie_jar( $cookie_jar );
push @{ $user_agent->requests_redirectable }, 'POST';
$user_agent->ssl_opts( $ssl_cert_file_pem );
$user_agent->ssl_opts( $ssl_key_file_pem );
$user_agent->ssl_opts( $verify_hostname );
$user_agent->timeout( $timeout );
my $headers_obj = HTTP::Headers->new;
$headers_obj->header( 'Accept' => '*/*' );
$headers_obj->header( 'Accept-Encoding' => 'gzip, deflate, br' );
$headers_obj->header( 'Accept-Language' => 'en-US' );
$headers_obj->header( 'Connection' => 'Keep-Alive' );
$headers_obj->header( 'Host' => $host );
my $http_request_obj = HTTP::Request->new;
$http_request_obj->method( $method );
$http_request_obj->uri( $uri );
$http_request_obj->content_type( 'application/json' ); # content_type takes the value only
$http_request_obj->content( $content );
$user_agent->default_headers( $headers_obj );
$response_obj = $user_agent->request( $http_request_obj );
When I dump the response, I can see that the initial request returns a 302 which is then followed successfully... it's just that the content does NOT go with each redirect. How can I get LWP::UserAgent to forward content on redirect?
This is appropriate behaviour for a 302 response.
RFC 7231 (the HTTP/1.1 semantics specification, since superseded by RFC 9110) says the following about 302 responses:
Note: For historical reasons, a user agent MAY change the request
method from POST to GET for the subsequent request. If this
behavior is undesired, the 307 (Temporary Redirect) status code
can be used instead.
When LWP receives a 302 response to a POST made redirectable, it follows up with a GET request (which necessarily doesn't include the POST data of the original request).
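If the server cannot be changed to answer with 307, one possible workaround is to stop LWP from following redirects automatically and re-send the request yourself. The following is a minimal sketch under that assumption. repost_request is a hypothetical helper, not part of LWP, and the URLs are placeholders. Because re-sending a body to whatever Location names can leak data, the sketch refuses redirects that leave the original host.

```perl
use strict;
use warnings;
use HTTP::Request;
use HTTP::Response;
use URI;

# Hypothetical helper: given the request just sent and a redirect response,
# build a follow-up request with the same method, headers, and body.
# Returns nothing if the response is not a redirect or leaves the host.
sub repost_request {
    my ( $req, $res ) = @_;
    return unless $res->is_redirect;
    my $location = $res->header('Location');
    return unless defined $location;

    my $target = URI->new_abs( $location, $req->uri );   # resolve relative URLs
    return unless $target->can('host')
        && lc $target->host eq lc $req->uri->host;

    my $next = $req->clone;    # copies method, headers, and content
    $next->uri($target);
    return $next;
}

# Offline demonstration with a hand-built 302 response.
my $req = HTTP::Request->new(
    POST => 'https://jira.example.com/rest/api/2/issue',
    [ 'Content-Type' => 'application/json' ],
    '{"summary":"demo"}',
);
my $res  = HTTP::Response->new( 302, 'Found', [ Location => '/cas/login' ] );
my $next = repost_request( $req, $res );
print $next->method, ' ', $next->uri, "\n", $next->content, "\n";
```

With a live user agent, the same helper could be driven by $ua->simple_request($req), which sends a single request without following redirects, repeated while repost_request returns a request and a hop limit has not been reached.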

Perl: Need an LWP & HTTP::Request POST code that actually works

I have been scratching my head trying to get LWP and HTTP::Request to actually pass a POST parameter to a web server. The web server can see that the request was a POST transaction, but it is not picking up the passed parameters. I have been searching all day on this and have tried different things, and I have yet to find something that works. (The web server is working: I am able to manually send POST transactions, and when running the whole script I am getting a '200' status, but I am not seeing any posted elements.) Any help would be appreciated. Thanks.
my $ua2 = LWP::UserAgent->new;
$ua2->agent("Mozilla/5.0 (compatible; MSIE 6.0; Windows 98)");
my $req2 = HTTP::Request->new(POST => "$url", [ frm-advSearch => 'frmadvSearch' ]);
$req2->content_type('text/html');
my $res2 = $ua2->request($req2);
$http_stat = substr($res2->status_line,0,3);
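The third argument to HTTP::Request->new is a list of headers, not form fields, so nothing ends up in the request body; form data should also be sent as application/x-www-form-urlencoded rather than text/html. Let LWP build the request instead:
my $res = $ua->post($url,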
Content => [
'frm-advSearch' => 'frmadvSearch',
],
);
which is short for
use HTTP::Request::Common qw( POST );
my $req = POST($url,
Content => [
'frm-advSearch' => 'frmadvSearch',
],
);
my $res = $ua->request($req);
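To check what such a request looks like without sending it, the request object can be built and printed. This is a small sketch; the URL is a placeholder.

```perl
use strict;
use warnings;
use HTTP::Request::Common qw( POST );

my $url = 'http://www.example.com/form/';    # placeholder URL

# POST() form-encodes the Content array ref and sets the matching
# Content-Type and Content-Length headers.
my $req = POST( $url, Content => [ 'frm-advSearch' => 'frmadvSearch' ] );

print $req->as_string;    # request line, headers, then the encoded body
```

The body here is the form-encoded pair, which is what the server-side form handler expects to parse.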
Here's a Mojo::UserAgent example, which I find easier to debug:
use v5.10; # enables say
use Mojo::UserAgent;
my $ua = Mojo::UserAgent->new;
$ua->transactor->name( 'Mozilla/5.0 (compatible; MSIE 6.0; Windows 98)' );
my $url = 'http://www.example.com/form/';
my $tx = $ua->post( $url, form => { 'frm-advSearch' => 'frmadvSearch' } );
say $tx->req->to_string;
The transaction in $tx knows about the request so I can look at that:
POST /form/ HTTP/1.1
Content-Type: application/x-www-form-urlencoded
User-Agent: Mozilla/5.0 (compatible; MSIE 6.0; Windows 98)
Accept-Encoding: gzip
Host: www.example.com
Content-Length: 26
frm-advSearch=frmadvSearch

Getting "500 Internal Server Error" retrieving page with LWP::UserAgent

I'm trying to retrieve a page using LWP::UserAgent but I keep getting a "500 Internal Server Error" as a response. Retrieving the exact same page in Firefox (using a fresh "Private Window" - so without any cookies set yet) succeeds without a problem.
I've duplicated the headers exactly as sent by Firefox, but that still does not make a difference. Here's my full code:
use strict;
use LWP::UserAgent;
my $browserObj = LWP::UserAgent->new();
$browserObj->cookie_jar( {} );
$browserObj->timeout(600);
my @header = (
'Host' => 'www.somedomain.com',
'User-Agent' => 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0',
'Accept-Language' => 'en-US,en;q=0.5',
'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Encoding' => 'gzip, deflate, br',
'DNT' => '1',
'Connection' => 'keep-alive',
'Upgrade-Insecure-Requests' => '1'
);
my $URL = "https://www.somedomain.com";
my $response = $browserObj->get( $URL, @header );
if( $response->is_success ) {
print "Success!\n";
} else {
print "Error: " . $response->status_line . ".\n";
}
The real web address is something other than "www.somedomain.com". In fact, it's the URL of an online casino, but I don't want my question to be regarded as spam.
Does anyone have any idea what could be wrong?
On our corporate network which has a proxy (and an out of date perl version - there may be better options in newer versions) we tend to add the following for one-offs:
BEGIN {
$ENV{HTTPS_DEBUG} = 1; # optional but can help if you get a response
$ENV{HTTPS_PROXY} = 'https://proxy.server.here.net:8080';
}
If we don't do this the script simply fails to connect with no other information.
You may also want to add something like this if you want to inspect the messages:
$browserObj->add_handler("request_send", sub { shift->dump; return });
$browserObj->add_handler("response_done", sub { shift->dump; return });

I can't connect using an api

I'm quite new to APIs, so I don't know if this should be more straightforward.
I wrote the following Perl script:
use strict;
use LWP::UserAgent;
require HTTP::Request;
my $request = HTTP::Request->new(GET => 'http://api.elsevier.com/content/ev/results?apiKey=1234&query=stress&database=c&updateNumber=1&pageSize=1');
my $ua = LWP::UserAgent->new;
my $response = $ua->request($request);
When I get my response and print it in the debugger, I get the following:
HTTP::Response=HASH(0x9aedff8)
'_content' => '{"service-error":{"status":{"statusCode":"AUTHENTICATION_ERROR","statusText":"Requestor configuration settings insufficient for access to this resource."}}}'
'_headers' => HTTP::Headers=HASH(0x9aedfe8)
'allow' => 'GET'
'client-date' => 'Wed, 29 Mar 2017 08:08:25 GMT'
'client-peer' => '198.185.19.118:80'
'client-response-num' => 1
'content-length' => 156
'content-type' => 'application/json;charset=UTF-8'
'date' => 'Wed, 29 Mar 2017 08:08:24 GMT'
'p3p' => 'CP="IDC DSP LAW ADM DEV TAI PSA PSD IVA IVD CON HIS TEL OUR DEL SAM OTR IND OTC"'
'server' => 'api.elsevier.com 9999'
'vary' => 'Origin'
'x-cnection' => 'close'
'x-els-apikey' => 'e688c9db4db0386581dbe4c4dda46164'
'x-els-reqid' => '0000015b190d89fe-a0d0'
'x-els-status' => 'AUTHENTICATION_ERROR(Requestor configuration settings insufficient for access to this resource.)'
'x-els-transid' => 'cbf787b4-d171-4e35-8237-8cab3c931205'
'x-re-ref' => '1 1490774904423414'
'_msg' => 'Forbidden'
'_protocol' => 'HTTP/1.1'
'_rc' => 403
'_request' => HTTP::Request=HASH(0x9fc3000)
'_content' => ''
'_headers' => HTTP::Headers=HASH(0x9ae73e0)
'user-agent' => 'libwww-perl/5.831'
'_method' => 'GET'
'_uri' => URI::http=SCALAR(0x9e25188)
-> 'http://api.elsevier.com/content/ev/results?apiKey=e688c9db4db0386581dbe4c4dda46164&query=stress&database=c&updateNumber=1&pageSize=1'
'_uri_canonical' => URI::http=SCALAR(0x9e25188)
-> REUSED_ADDRESS
One of the notable lines is
'x-els-status' => 'AUTHENTICATION_ERROR(Requestor configuration settings insufficient for access to this resource.)'
I don't know how to get a proper response text. I tried searching their website for examples, but I can't seem to find one. I'm also not sure whether the key is only valid for Scopus and not for Engineering Village, which I'm trying to use.
Their website is here: https://dev.elsevier.com/index.html?utm_expid=89327795-0.AtRZzToKQ2u1mZEyQ3n7OQ.0&utm_referrer=https%3A%2F%2Fdev.elsevier.com%2Ftecdoc_ev_retrieval_request.html
Any help would be appreciated.
To get the text out of your response, you need to call the $response->decoded_content method. That will give you the JSON string that you can see in _content in your debug output. I've indented it to make it easier to read.
{
"service-error" : {
"status" : {
"statusCode" : "AUTHENTICATION_ERROR",
"statusText" : "Requestor configuration settings insufficient for access to this resource."
}
}
}
You can use the JSON module to decode this into a Perl data structure.
use JSON 'from_json';
my $res = $ua->request($req);
my $json = from_json( $res->decoded_content );
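As a concrete illustration, the sketch below decodes the error body from the debugger dump above and pulls out the two status fields. It uses JSON::PP (bundled with Perl) instead of the JSON module so that it runs without extra installs; that swap and the variable names are choices made for this example. decode_json expects UTF-8 bytes, which is fine for this ASCII-only literal.

```perl
use strict;
use warnings;
use JSON::PP qw( decode_json );

# Error body copied from the '_content' field in the dump.
my $body = '{"service-error":{"status":{"statusCode":"AUTHENTICATION_ERROR","statusText":"Requestor configuration settings insufficient for access to this resource."}}}';

my $data   = decode_json($body);
my $status = $data->{'service-error'}{status};

# Report the machine-readable code together with the human-readable text.
print "$status->{statusCode}: $status->{statusText}\n";
```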
The error message you get back clearly states that you are not authenticated properly. I've looked at this guide from the documentation you mentioned. It seems that the apiKey URL param works, if you have the right type of account. You should check with whoever made that account for you, or if that was you and you're not sure, the account manager at that service that is working with you. They'll tell you if you are using the right API key, and if this method of authentication works for you.
Since this API also offers a custom header X-ELS-APIKey: [apikey] for authentication, I would suggest using that. Your API key is a secret, and you shouldn't share it with anyone. It's like a password. If you put it into the URL, it might show up in log files. As a header, it usually does not.
This is how you add a custom header to an HTTP request. Make sure you don't have the apiKey URL param any more if you do this.
my $req = HTTP::Request->new( GET => $url ); # no apiKey=123 here!
$req->header( 'X-ELS-APIKey' => 123 );
Now as a last step, you should check the HTTP status code of the response. A 200 (or any other code that starts with 2) means the request was successful. The 403 that you are getting back means Forbidden: the server understood the request but refuses to serve it, which also hints that you are not authenticated correctly.
Since it seems that this API returns JSON in both success and failure cases, you might need to decode it for both. If you care to examine the failure response, that makes sense. If not, you can skip that part. To do this, use $res->is_success, which is also used in the synopsis of the LWP::UserAgent documentation.
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;
use JSON 'from_json';
my $ua = LWP::UserAgent->new;
my $req = HTTP::Request->new( GET => 'http://api.elsevier.com/content/ev/results?query=stress&database=c&updateNumber=1&pageSize=1' );
$req->header( 'X-ELS-APIKey' => 123 );
my $res = $ua->request($req); # actually send the request
if ($res->is_success) {
my $json = from_json( $res->decoded_content );
# ... do stuff with the response
} else {
# something went wrong; $res->status_line has the details
}

WWW::Mechanize returning forbidden url error

This is the URL:
https://trade.4over.com/orders/ajax/product_run_size.php?id_product=599983
I am trying to store its data using WWW::Mechanize. It returns a Forbidden error, although the same URL gives a response when I open it in a browser.
Here is the code that I am using:
my $mech = WWW::Mechanize->new;
$mech->add_header( 'User-Agent' => 'Mozilla/5.0 (Windows; U; Windows NT 6.1; nl; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13' );
$mech->cookie_jar( HTTP::Cookies->new() );
$mech->get($url);
my $result = $mech->submit_form(
form_number => 2,
fields =>
{
username => 'username', # Name of the input field and value
password => 'password',
},
button => 'log_in' # Name of the submit button
);
my $content = encode 'utf8',$mech->decoded_content;
return $content;
Just got the solution. I was doing it wrong.
I was trying to submit the form on this page, while the form is actually on the home page.
Now I submit the form on the home page and then use $mech->get for this URL.
It's working. Thanks for all your responses.
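The corrected flow can be sketched as follows. fetch_after_login is a hypothetical helper; the form number, field names, and button name are taken from the question and may differ on the real site, and the home page URL is not given in the post, so it is passed in as a parameter.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# Hypothetical helper: log in through the form on $home_url, then fetch
# $target_url with the session cookies collected by the same $mech object.
sub fetch_after_login {
    my ( $mech, $home_url, $target_url, $user, $pass ) = @_;

    $mech->get($home_url);                   # the login form lives here
    $mech->submit_form(
        form_number => 2,
        fields      => { username => $user, password => $pass },
        button      => 'log_in',
    );

    $mech->get($target_url);                 # requested as a logged-in user
    return $mech->content;
}
```

It would be called with a WWW::Mechanize->new object, the home page URL, the product URL from the question, and the account credentials. With the default autocheck setting, WWW::Mechanize dies on a failed request, so a Forbidden response would surface as an exception instead of passing silently.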