Perl: Need an LWP & HTTP::Request POST that actually works

I have been scratching my head trying to get LWP and HTTP::Request to actually pass a POST parameter to a web server. The web server can see that the request was a POST transaction, but it is not picking up the passed parameters. I have been searching all day, have tried different things, and have yet to find something that works. (The web server itself is fine: I can send POST transactions to it manually, and when running the whole script I get a '200' status, but I am not seeing any posted elements.) Any help would be appreciated. Thanks.
my $ua2 = LWP::UserAgent->new;
$ua2->agent("Mozilla/5.0 (compatible; MSIE 6.0; Windows 98)");
# Note: the second argument of HTTP::Request->new is a *headers* arrayref,
# not form data, so the form pair below never reaches the request body.
my $req2 = HTTP::Request->new(POST => $url, [ 'frm-advSearch' => 'frmadvSearch' ]);
$req2->content_type('text/html');    # a form POST would normally be application/x-www-form-urlencoded
my $res2 = $ua2->request($req2);
my $http_stat = substr($res2->status_line, 0, 3);

The form parameters have to go into the body of the request, which $ua->post does for you:
my $res = $ua->post($url,
    Content => [
        'frm-advSearch' => 'frmadvSearch',
    ],
);
which is short for
use HTTP::Request::Common qw( POST );

my $req = POST($url,
    Content => [
        'frm-advSearch' => 'frmadvSearch',
    ],
);
my $res = $ua->request($req);
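To confirm what LWP will actually send, you can build the request and dump it before sending it; as_string comes from HTTP::Message, which HTTP::Request inherits:
my $req = POST($url,
    Content => [ 'frm-advSearch' => 'frmadvSearch' ],
);
# Prints the request line, headers, a blank line, then the body;
# the Content-Type should be application/x-www-form-urlencoded.
print $req->as_string;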

Here's a Mojo::UserAgent example, which I find easier to debug:
use Mojo::Base -strict;    # strict, warnings, and the 'say' feature
use Mojo::UserAgent;

my $ua = Mojo::UserAgent->new;
$ua->transactor->name('Mozilla/5.0 (compatible; MSIE 6.0; Windows 98)');

my $url = 'http://www.example.com/form/';
my $tx  = $ua->post($url, form => { 'frm-advSearch' => 'frmadvSearch' });
say $tx->req->to_string;
The transaction in $tx knows about the request, so I can look at it:
POST /form/ HTTP/1.1
Content-Type: application/x-www-form-urlencoded
User-Agent: Mozilla/5.0 (compatible; MSIE 6.0; Windows 98)
Accept-Encoding: gzip
Host: www.example.com
Content-Length: 26

frm-advSearch=frmadvSearch
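The response can be inspected the same way once the transaction is done. A small sketch, using the same $tx as above:
# result returns the Mojo::Message::Response, or croaks on a connection error.
say $tx->result->code;      # e.g. 200
say $tx->res->to_string;    # raw response: status line, headers and body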

Related

Perl: Some websites block non-browser requests. But how?

I'm writing a simple Perl script that fetches some pages from different sites. It's very non-intrusive: I don't hog a server's bandwidth, and it retrieves a single page without loading any extra JavaScript, images, or style sheets.
I use LWP::UserAgent to retrieve the pages. This works fine on most sites, but some sites return a "403 - Bad Request" error. The same pages load perfectly fine in my browser. I inspected the request headers from my web browser and copied them exactly when trying to retrieve the same page in Perl, and every single time I get a 403 error. Here's a code snippet:
use strict;
use LWP::UserAgent;
use HTTP::Cookies;

my $URL = "https://www.betsson.com/en/casino/jackpots";
my $browserObj = LWP::UserAgent->new(
    ssl_opts => { verify_hostname => 0 }
);

# $browserObj->cookie_jar( {} );
my $cookie_jar = HTTP::Cookies->new();
$browserObj->cookie_jar( $cookie_jar );

$browserObj->agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0");
$browserObj->timeout(600);
push @{ $browserObj->requests_redirectable }, 'POST';

my @header = (
    'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding' => 'gzip, deflate, br',
    'Accept-Language' => 'en-US,en;q=0.5',
    'Connection' => 'keep-alive',
    'DNT' => '1',
    'Host' => 'www.bettson.com',
    'Upgrade-Insecure-Requests' => '1',
);

my $response = $browserObj->get( $URL, @header );
if( $response->is_success ) {
    print "Success!\n";
} else {
    print "Unsuccessful...\n";
}
How do these servers distinguish between a real browser and my script? At first I thought they had some JavaScript trickery going on, but then I realized that for that to work, the page would have to be loaded by a browser first. But I immediately get this 403 error.
What can I do to debug this?
While 403 is a typical response from bot detection, in this case bot detection is not the cause of the problem. Instead, the cause is a typo in your code:
my $URL = "https://www.betsson.com/en/casino/jackpots";
...
'Host' => 'www.bettson.com',
In the URL the domain name is www.betsson.com, and this should be reflected in the Host header. But your Host header is slightly different: www.bettson.com. Since the Host header carries the wrong name, the request is rejected with 403 Forbidden.
And actually, there is no need to go through all this trouble, since it looks like no bot detection is done at all. I.e., there is no need to set a user agent and fiddle with the headers; a plain request works:
my $browserObj = LWP::UserAgent->new();
my $response = $browserObj->get($URL);
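A mismatch like this is easy to spot by dumping the request exactly as LWP sends it, for instance with a request_send handler (a minimal sketch; the same handler trick appears in an answer below):
use strict;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
# Print every outgoing request, so a Host header that does not
# match the URL stands out immediately.
$ua->add_handler( request_send => sub { shift->dump; return } );
my $response = $ua->get('https://www.betsson.com/en/casino/jackpots');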

Why does LWP::UserAgent succeed and Mojo::UserAgent fail?

If I make a request like this:
my $mojo_ua = Mojo::UserAgent->new->max_redirects(5);
$mojo_ua->inactivity_timeout(60)->connect_timeout(60)->request_timeout(60);
$mojo_ua->transactor->name('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36');

my $headers = {
    'Accept' => 'application/json',
    'Accept-Language' => 'en-US,en;q=0.5',
    'Connection' => 'keep-alive',
    'User-Agent' => 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
    'x-csrf-token' => 'Fetch',
    'Accept-Encoding' => 'gzip, deflate, br',
    'DataServiceVersion' => '2.0',
    'MaxDataServiceVersion' => '2.0',
    'Referer' => 'https://blah.blas.com/someThing.someThing',
};

my $url = Mojo::URL->new('https://blah.blah.com/irj/go/sap/FOO_BAR_BAZ/');
my $tx = $mojo_ua->get($url, $headers);
$tx = $mojo_ua->start($tx);
my $res = $tx->result;
the request times out, but if I take the exact same request, built in the same way and do this:
my $lwp_ua = LWP::UserAgent->new;
my $req = HTTP::Request->parse( $tx->req->to_string );
$req->uri("$url");
my $res = $lwp_ua->request($req);
it succeeds.
In a few cases Mojo::UserAgent fails where LWP::UserAgent succeeds with exactly the same transaction, and I'm starting to get curious.
Any idea as to why?
Your call to
$mojo_ua->get($url, $headers)
has already sent the HTTP request and received the response from the server, errored, or timed out. You don't need to call
$mojo_ua->start($tx)
as well, and that statement should be removed.
If you really want to first build the transaction and then start it, you need
my $tx = $mojo_ua->build_tx(GET => $url, $headers);
$tx = $mojo_ua->start($tx);
but I don't see any reason why you should need to do it this way.
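When a Mojo transaction does fail, $tx->error tells you whether it was a connection problem (such as a timeout) or an HTTP error status. A sketch against the code above:
my $tx = $mojo_ua->build_tx(GET => $url, $headers);
$tx = $mojo_ua->start($tx);
if (my $err = $tx->error) {
    # Connection-level failures, including timeouts, carry a message
    # but no HTTP status code.
    die "Connection error: $err->{message}" unless $err->{code};
    warn "HTTP error: $err->{code} $err->{message}";
}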

Getting "500 Internal Server Error" retrieving page with LWP::UserAgent

I'm trying to retrieve a page using LWP::UserAgent but I keep getting a "500 Internal Server Error" as a response. Retrieving the exact same page in Firefox (using a fresh "Private Window" - so without any cookies set yet) succeeds without a problem.
I've duplicated the headers exactly as sent by Firefox, but that still does not make a difference. Here's my full code:
use strict;
use LWP::UserAgent;

my $browserObj = LWP::UserAgent->new();
$browserObj->cookie_jar( {} );
$browserObj->timeout(600);

my @header = (
    'Host' => 'www.somedomain.com',
    'User-Agent' => 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0',
    'Accept-Language' => 'en-US,en;q=0.5',
    'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Encoding' => 'gzip, deflate, br',
    'DNT' => '1',
    'Connection' => 'keep-alive',
    'Upgrade-Insecure-Requests' => '1',
);

my $URL = "https://www.somedomain.com";
my $response = $browserObj->get( $URL, @header );
if( $response->is_success ) {
    print "Success!\n";
} else {
    print "Error: " . $response->status_line . ".\n";
}
The real web address is something other than "www.somedomain.com"; in fact, it's the URL of an online casino, but I don't want my question to be regarded as spam. Does anyone have any idea what could be wrong?
On our corporate network, which has a proxy (and an out-of-date Perl version; there may be better options in newer versions), we tend to add the following for one-offs:
BEGIN {
    $ENV{HTTPS_DEBUG} = 1;    # optional, but can help if you get a response
    $ENV{HTTPS_PROXY} = 'https://proxy.server.here.net:8080';
}
If we don't do this, the script simply fails to connect with no other information.
You may also want to add something like this if you want to inspect the messages:
$browserObj->add_handler("request_send", sub { shift->dump; return });
$browserObj->add_handler("response_done", sub { shift->dump; return });

How do I simulate this particular POST request in WWW::Mechanize?

The POST request is as follows:
$bot->add_header(
    'Host' => 'www.amazon.com',
    'User-Agent' => 'application/json, text/javascript, */*',
    'Accept' => 'application/json, text/javascript, */*',
    'Accept-Language' => 'en-us,en;q=0.5',
    'Accept-Encoding' => 'gzip, deflate',
    'DNT' => '1',
    'Connection' => 'keep-alive',
    'Content-Type' => 'application/x-www-form-urlencoded; charset=UTF-8',
    'X-Requested-With' => 'XMLHttpRequest',
    'Referer' => 'https://www.amazon.com/gp/digital/fiona/manage?ie=UTF8&ref_=gno_yam_myk',
    'Content-Length' => '44',
    'Cookie' => 'how do I put the cookie value here?' );
POST parameters in my request:
sid: how do I get the session id?
new email: mailhost@mail.com
My code to log on:
use WWW::Mechanize;
use HTTP::Cookies;

my $bot = WWW::Mechanize->new();
$bot->agent_alias( 'Linux Mozilla' );

# Create a cookie jar for the login credentials
$bot->cookie_jar( HTTP::Cookies->new(
    file           => "cookies.txt",
    autosave       => 1,
    ignore_discard => 1,
) );

# Connect to the login page
my $response = $bot->get( 'https://www.amazon.com/gp/css/homepage.html/' );

# Get the login form. You might need to change the number.
$bot->form_number(3);

# Enter the login credentials.
$bot->field( email => '' );
$bot->field( password => '' );
$response = $bot->click();
#print $response->decoded_content;

$bot->get( 'https://www.amazon.com/gp/yourstore/home?ie=UTF8&ref_=topnav_ys' );
print $bot->content();

$bot->post('https://www.amazon.com/gp/digital/fiona/du/add-whitelist.html/ref=kinw_myk_wl_add',
    [ sid => 'id', email => 'v2@d.com' ]);
Data captured:
Host=www.amazon.com
User-Agent=Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0
Accept=application/json, text/javascript, */*
Accept-Language=en-us,en;q=0.5
Accept-Encoding=gzip, deflate
DNT=1
Connection=keep-alive
Content-Type=application/x-www-form-urlencoded; charset=UTF-8
X-Requested-With=XMLHttpRequest
Referer=https://www.amazon.com/gp/digital/fiona/manage?ie=UTF8&ref_=gno_yam_myk
Content-Length=39
Cookie=session-id-time=2082787201l; session-id
Pragma=no-cache
Cache-Control=no-cache
POSTDATA=sid=id&email=v%40d.com
Error message:
Error POSTing https://www.amazon.com/gp/digital/fiona/du/add-whitelist.html/ref=kinw_myk_wl_add: InternalServerError at logon.pl line 81
See the post method in WWW::Mechanize:
$bot->post($url, [ sid => 'id', email => 'v@d.com' ]);
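As for getting the session id, one option is to walk the cookie jar with the scan method from HTTP::Cookies. A rough sketch, assuming the cookie is the session-id that shows up in the captured data above:
my $sid;
$bot->cookie_jar->scan( sub {
    # scan passes (version, key, value, path, domain, ...) per cookie.
    my ( undef, $key, $value ) = @_;
    $sid = $value if $key eq 'session-id';
} );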

Parsing Cyrillic sites with Zend Framework: encoding issue

$url = "http://www.kinopoisk.ru/picture/791547/";
require_once 'Zend/Http/Client.php';
require_once 'Zend/Dom/Query.php';
$client = new Zend_Http_Client($url, array(
'maxredirects' => 0,
'timeout' => 30,
'useragent' => 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7'));
$response = $client->request();
$html = $response->getBody();
$dom = new Zend_Dom_Query($html);
$title = $dom->query('title');
$titleText = $title->current()->textContent;
echo $titleText;
// Locally returns "Постеры: ВАЛЛ·И (WALL·E)"
// Remotely returns "Ïîñòåðû: ÂÀËË·È (WALL·E)"
.htaccess setting: AddDefaultCharset utf-8
Response headers on the remote site:
Content-Encoding: gzip
Vary: Accept-Encoding
though there are no such headers locally.
So I think it's the server's fault? How do I resolve this?