How do I maintain cookies across many WWW::Mechanize runs? - perl

use strict;
use WWW::Mechanize;
my $agent = WWW::Mechanize->new(cookie_jar => {ignore_discard => 0});
$agent->add_header('User-Agent' => 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:20.0) Gecko/20100101 Firefox/20.0');
$agent->get($url);    # $url is defined elsewhere
my $content = $agent->content;

The cookie_jar attribute expects an HTTP::Cookies object:
use HTTP::Cookies;
WWW::Mechanize->new(
    cookie_jar => HTTP::Cookies->new(
        file     => 'lwp_cookies.dat',
        autosave => 1,
    )
);
Your mistake was to pass a plain hashref, which gives you a temporary in-memory cookie store that is destroyed when the Mechanize object goes away, so nothing survives into the next run.
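For the use case in the title, i.e. keeping cookies across separate runs of the script, a minimal sketch might look like this (the cookie file name and URL are placeholders; ignore_discard => 1 additionally saves session cookies that the server marks as discardable):
use strict;
use warnings;
use WWW::Mechanize;
use HTTP::Cookies;

# File-backed cookie jar: autosave writes it out when the jar is destroyed,
# ignore_discard keeps session cookies as well (placeholder file name).
my $jar = HTTP::Cookies->new(
    file           => 'lwp_cookies.dat',
    autosave       => 1,
    ignore_discard => 1,
);

my $agent = WWW::Mechanize->new(cookie_jar => $jar);
$agent->add_header('User-Agent' => 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:20.0) Gecko/20100101 Firefox/20.0');

my $url = 'http://example.com/';    # placeholder URL
$agent->get($url);
my $content = $agent->content;
# On the next run the same file is loaded first, so the stored cookies are sent again.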

Related

Perl: Some websites block non-browser requests. But how?

I'm writing a simple Perl script that fetches some pages from different sites. It's very non-intrusive. I don't hog a server's bandwidth. It retrieves a single page without loading any extra JavaScript, images, or style sheets.
I use LWP::UserAgent to retrieve the pages. This works fine on most sites, but some sites return a 403 Forbidden error. The same pages load perfectly fine in my browser. I have inspected the request headers from my web browser and copied them exactly when trying to retrieve the same page in Perl, and every single time I get a 403 error. Here's a code snippet:
use strict;
use LWP::UserAgent;
use HTTP::Cookies;

my $URL = "https://www.betsson.com/en/casino/jackpots";
my $browserObj = LWP::UserAgent->new(
    ssl_opts => { verify_hostname => 0 }
);
# $browserObj->cookie_jar( {} );
my $cookie_jar = HTTP::Cookies->new();
$browserObj->cookie_jar( $cookie_jar );
$browserObj->agent( "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0" );
$browserObj->timeout(600);
push @{ $browserObj->requests_redirectable }, 'POST';
my @header = (
    'Accept'                    => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding'           => 'gzip, deflate, br',
    'Accept-Language'           => 'en-US,en;q=0.5',
    'Connection'                => 'keep-alive',
    'DNT'                       => '1',
    'Host'                      => 'www.bettson.com',
    'Upgrade-Insecure-Requests' => '1',
);
my $response = $browserObj->get( $URL, @header );
if ( $response->is_success ) {
    print "Success!\n";
} else {
    print "Unsuccessful...\n";
}
How do these servers distinguish between a real browser and my script? At first I thought they had some JavaScript trickery going on, but then I realized in order for that to work, the page has to be loaded by a browser first. But I immediately get this 403 Error.
What can I do to debug this?
While a 403 is a typical response from bot detection, in this case bot detection is not the cause of the problem. Instead, a typo in your code is:
my $URL = "https://www.betsson.com/en/casino/jackpots";
...
'Host' => 'www.bettson.com',
In the URL the domain name is www.betsson.com, and this should be reflected in the Host header. But your Host header is slightly different: www.bettson.com. Since the Host header carries the wrong name, the request is rejected with 403 Forbidden.
In fact, you do not even need to go through all this trouble, since it looks like no bot detection is done at all. There is no need to set a user agent or fiddle with the headers; a plain request works:
my $browserObj = LWP::UserAgent->new();
my $response = $browserObj->get($URL);
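If you want to verify what actually goes over the wire (and catch mismatches like this Host typo), LWP can dump every request and response for you. A minimal debugging sketch using the standard handler hooks:
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new();

# Print each outgoing request and each finished response to STDOUT.
$ua->add_handler( "request_send",  sub { shift->dump; return } );
$ua->add_handler( "response_done", sub { shift->dump; return } );

my $response = $ua->get("https://www.betsson.com/en/casino/jackpots");
print $response->status_line, "\n";
Comparing that dump with the headers your browser sends usually makes the offending header obvious.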

Invalid character found in method name. HTTP method names must be tokens, persists even with http request

I am trying to warm up my controller so that the service gets hot during each deployment.
In order to do this I have written a Perl script as below:
#!perl
use strict;
use warnings;
use WWW::Mechanize;
use HTTP::Request;
my $ua = WWW::Mechanize->new();
my $r = HTTP::Request->new(
'GET' =>
'http://gaurav_setia.microsoft.com:8080/b2h/homepage?_encoding=UTF8&opf_redir=1&portalDebug=1',
[
'Connection' => 'Keep-Alive',
'Via' => 'HTTP/1.1 ShoppingSchedule',
'Accept' =>
'text/x-html-parts,text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-Charset' => 'UTF-8',
'Accept-Encoding' => 'identity',
'Accept-Language' => 'en-US',
'Host' => 'gaurav_setia.microsoft.com',
'User-Agent' =>
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36',
'Cookie' =>
'Original-X-Forwarded-For' => '10.45.103.166',
'X-MS-Internal-Ip-Class' => 'rfc1918',
'X-MS-Internal-Via' =>
'1.1 us-beta-opf-1a-1-67440dc2.us-east-1.ms.com (OPF)',
'X-MS-Urlspace' => 'NoPageType',
'X-MS-Portal-Customer-Id' => 'AMY4OD2PMM9T31',
'X-MS-Portal-Default-Merchant-Id' => 'BTLPDKIKX0DE41',
'X-MS-Portal-Device-Attr' => 'desktop',
'X-MS-Portal-Language' => 'en_US',
'X-MS-Portal-Marketplace-Id' => 'ATVPDKIKX0DER',
'X-MS-Portal-Page-Type' => 'AQGate',
'X-MS-Portal-Request-Attr' => 'internal, http, portal-debug',
'X-MS-Portal-Session-Id' => '1M0-493PO66-0596753',
'X-MS-Portal-Ubid' => '1P2-465OP632-8831161',
'X-MS-Portal-User-Attr' => 'business',
'X-MS-Rid' => 'G308MPK95BWTA69EY2MW',
'X-Forwarded-For' => '10.45.101.126',
'X-Forwarded-Host' => 'development.ms.com',
'X-Forwarded-Server' =>
'development.ms.com, b-hp-shpomnpng-na-2b-02af3555.us-west-2.amazon.com',
'X-Original-Args' => 'portalDebug=1',
'X-Original-Method' => 'GET',
'X-Original-Scheme' => 'http',
'X-Original-Uri' => '/',
],
);
my $res = $ua->request($r);
if ( $res->is_success() ) {
    print $res->is_success();
}
print $res->status_line;
This script should run after each deployment.
But in the catalina.out logs I am getting the following error:
Dec 13, 2018 9:08:11 AM org.apache.coyote.http11.AbstractHttp11Processor process
INFO: Error parsing HTTP request header
Note: further occurrences of HTTP header parsing errors will be logged at DEBUG level.
java.lang.IllegalArgumentException: Invalid character found in method name. HTTP method names must be tokens
at org.apache.coyote.http11.AbstractNioInputBuffer.parseRequestLine(AbstractNioInputBuffer.java:235)
at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1055)
at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:684)
at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1539)
at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:1495)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.lang.Thread.run(Thread.java:748)
I am unable to find the fix. Many answers say this is due to an https/http issue, but I am making a plain HTTP call here.
In amongst your pile of headers, you have this:
'User-Agent' =>
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36',
'Cookie' =>
'Original-X-Forwarded-For' => '10.45.103.166',
Notice that there's no value for the Cookie header. That means all of the headers after that will be wrong (the names and values will be muddled up).
Either remove the Cookie line completely or set its value to undef.
'Cookie' => undef,
(Removing it is probably best)
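For completeness, a trimmed-down sketch of the warm-up request with the dangling Cookie header removed (and the $res/$response mix-up fixed). Only a few of the original headers are kept here, and autocheck => 0 is an added assumption so Mechanize does not die on a non-2xx response and the status line can be printed:
use strict;
use warnings;
use WWW::Mechanize;
use HTTP::Request;

my $ua = WWW::Mechanize->new( autocheck => 0 );    # don't die on HTTP errors

my $r = HTTP::Request->new(
    'GET',
    'http://gaurav_setia.microsoft.com:8080/b2h/homepage?_encoding=UTF8&opf_redir=1&portalDebug=1',
    [
        'Connection'      => 'Keep-Alive',
        'Accept'          => 'text/x-html-parts,text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Encoding' => 'identity',
        'User-Agent'      => 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36',
        # No Cookie header at all; if one is needed, give it a real value.
        # Add the X-MS-* headers back as required.
    ],
);

my $res = $ua->request($r);
print $res->status_line, "\n";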

WWW::Mechanize returning forbidden url error

This is the URL
https://trade.4over.com/orders/ajax/product_run_size.php?id_product=599983
I am trying to fetch its data using Mechanize, but it returns a Forbidden error, while the same URL responds fine when I hit it in the browser.
I am using the WWW::Mechanize module.
Here is the code that I am using
use Encode qw(encode);
use WWW::Mechanize;
use HTTP::Cookies;

my $mech = WWW::Mechanize->new;
$mech->add_header( 'User-agent' => 'Mozilla/5.0 (Windows; U; Windows NT 6.1; nl; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13' );
$mech->cookie_jar( HTTP::Cookies->new() );
$mech->get($url);    # $url set elsewhere
my $result = $mech->submit_form(
    form_number => 2,
    fields      => {
        username => 'username',    # name of the input field and value
        password => 'password',
    },
    button => 'log_in',            # name of the submit button
);
my $content = encode 'UTF-8', $mech->response->decoded_content;
return $content;
Just got the solution. I was doing it wrong.
What I was doing was submitting the form on this page, while the form actually lives on the home page.
Now I am submitting the form on the home page and then using $mech->get for this URL.
It's working. Thanks for all your responses.
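In other words: load the home page first, submit the login form there, and only then request the AJAX URL with the same Mechanize object, which by then carries the session cookies. A minimal sketch of that order, reusing the form number and field names from the snippet above (the home-page URL is an assumption):
use WWW::Mechanize;
use HTTP::Cookies;

my $mech = WWW::Mechanize->new( cookie_jar => HTTP::Cookies->new() );
$mech->add_header( 'User-agent' => 'Mozilla/5.0 (Windows; U; Windows NT 6.1; nl; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13' );

# 1. Load the home page, which is where the login form actually lives.
$mech->get('https://trade.4over.com/');    # assumed home-page URL

# 2. Submit the login form; the session cookie ends up in the jar.
$mech->submit_form(
    form_number => 2,
    fields      => { username => 'username', password => 'password' },
    button      => 'log_in',
);

# 3. Only now fetch the AJAX URL; the session cookie is sent automatically.
$mech->get('https://trade.4over.com/orders/ajax/product_run_size.php?id_product=599983');
print $mech->content;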

How do I simulate this particular POST request in Mechanize

The POST request is as follows:
$bot->add_header(
    'Host'             => 'www.amazon.com',
    'User-Agent'       => 'application/json, text/javascript, */*',
    'Accept'           => 'application/json, text/javascript, */*',
    'Accept-Language'  => 'en-us,en;q=0.5',
    'Accept-Encoding'  => 'gzip, deflate',
    'DNT'              => '1',
    'Connection'       => 'keep-alive',
    'Content-Type'     => 'application/x-www-form-urlencoded; charset=UTF-8',
    'X-Requested-With' => 'XMLHttpRequest',
    'Referer'          => 'https://www.amazon.com/gp/digital/fiona/manage?ie=UTF8&ref_=gno_yam_myk',
    'Content-Length'   => '44',
    'Cookie'           => 'how do I put the cookie value?');
POST parameters in my request:
sid - how do I get the session id?
new email - mailhost@mail.com
My code to logon:
use WWW::Mechanize;
use HTTP::Cookies;
use HTML::Form;
use WWW::Mechanize::Link;
my $bot = WWW::Mechanize->new();
$bot->agent_alias( 'Linux Mozilla' );
# Create a cookie jar for the login credentials
$bot->cookie_jar( HTTP::Cookies->new( file => "cookies.txt",
autosave => 1,
ignore_discard => 1, ) );
# Connect to the login page
my $response = $bot->get( 'https://www.amazon.com/gp/css/homepage.html/' );
# Get the login form. You might need to change the number.
$bot->form_number(3);
# Enter the login credentials.
$bot->field( email => '' );
$bot->field( password => '' );
$response = $bot->click();
#print $response->decoded_content;
$bot->get( 'https://www.amazon.com/gp/yourstore/home?ie=UTF8&ref_=topnav_ys' );
print $bot->content();
$bot->post('https://www.amazon.com/gp/digital/fiona/du/add-whitelist.html/ref=kinw_myk_wl_add', [sid => 'id', email => 'v2@d.com']);
Data captured:
Host=www.amazon.com
User-Agent=Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0
Accept=application/json, text/javascript, */*
Accept-Language=en-us,en;q=0.5
Accept-Encoding=gzip, deflate
DNT=1
Connection=keep-alive
Content-Type=application/x-www-form-urlencoded; charset=UTF-8
X-Requested-With=XMLHttpRequest
Referer=https://www.amazon.com/gp/digital/fiona/manage?ie=UTF8&ref_=gno_yam_myk
Content-Length=39
Cookie=session-id-time=2082787201l; session-id
Pragma=no-cache
Cache-Control=no-cache
POSTDATA=sid=id&email=v%40d.com
Error Message-
Error POSTing https://www.amazon.com/gp/digital/fiona/du/add-whitelist.html/ref=
kinw_myk_wl_add: InternalServerError at logon.pl line 81
See the post method in WWW::Mechanize:
$bot->post($url, [sid => 'id', email => 'v@d.com']);
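As for the Cookie header the question asks about: you do not set it by hand. With a cookie jar attached, Mechanize stores whatever cookies the login response sets and sends them back automatically on the later POST. A minimal sketch reusing the jar and URLs from the script above (obtaining the sid value still means scraping it from the page, which is not shown here):
use WWW::Mechanize;
use HTTP::Cookies;

my $bot = WWW::Mechanize->new();
$bot->agent_alias('Linux Mozilla');
$bot->cookie_jar( HTTP::Cookies->new(
    file           => 'cookies.txt',
    autosave       => 1,
    ignore_discard => 1,
) );

# ... log in exactly as in the script above; the session cookies land in the jar ...

# The POST goes out with the stored cookies attached automatically,
# so only the form fields need to be supplied.
$bot->post(
    'https://www.amazon.com/gp/digital/fiona/du/add-whitelist.html/ref=kinw_myk_wl_add',
    [ sid => 'id', email => 'v2@d.com' ],
);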

Cookies in Perl LWP

I once wrote a simple 'crawler' in Java to download HTTP pages for me.
Now I'm trying to rewrite the same thing in Perl, using the LWP module.
This is my Java code (which works fine):
String referer = "http://example.com";
String url = "http://example.com/something/cgi-bin/something.cgi";
String params= "a=0&b=1";
HttpState initialState = new HttpState();
HttpClient httpclient = new HttpClient();
httpclient.setState(initialState);
httpclient.getParams().setCookiePolicy(CookiePolicy.NETSCAPE);
PostMethod postMethod = new PostMethod(url);
postMethod.addRequestHeader("Referer", referer);
postMethod.addRequestHeader("User-Agent", " Mozilla/5.0 (Windows; U; Windows NT 6.1; pl; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13");
postMethod.addRequestHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
postMethod.addRequestHeader("Content-Type", "application/x-www-form-urlencoded");
String length = String.valueOf(params.length());
postMethod.addRequestHeader("Content-Length", length);
postMethod.setRequestBody(params);
httpclient.executeMethod(postMethod);
And this is the Perl version:
my $referer = "http://example.com/something/cgi-bin/something.cgi?module=A";
my $url = "http://example.com/something/cgi-bin/something.cgi";
my @headers = (
'User-Agent' => 'Mozilla/5.0 (Windows; U; Windows NT 6.1; pl; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13',
'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Referer' => $referer,
'Content-Type' => 'application/x-www-form-urlencoded',
);
my @params = (
'a' => '0',
'b' => '1',
);
my $browser = LWP::UserAgent->new( );
$browser->cookie_jar({});
$response = $browser->post($url, @params, @headers);
print $response->content;
The POST request executes correctly, but I get a different (the main) webpage back, as if the cookies were not working properly...
Any guesses what is wrong?
Why am I getting different results from the Java and Perl programs?
You can also use WWW::Mechanize, which is a wrapper around LWP::UserAgent. It gives you the cookie jar automatically.
You want to be creating hashes, not arrays - e.g. instead of:
my @params = (
'a' => '0',
'b' => '1',
);
You should use:
my %params = (
a => 0,
b => 1,
);
When passing the params to the LWP::UserAgent post method, you need to pass a reference to the hash, e.g.:
$response = $browser->post($url, \%params, %headers);
You could also look at the request you're sending to the server with:
print $response->request->as_string;
You can also use a handler to automatically dump requests and responses for debugging purposes:
$ua->add_handler("request_send", sub { shift->dump; return });
$ua->add_handler("response_done", sub { shift->dump; return });
I believe it has to do with $response = $browser->post($url, @params, @headers);
From the doc of LWP::UserAgent
$ua->post( $url, \%form )
$ua->post( $url, \@form )
$ua->post( $url, \%form, $field_name => $value, ... )
$ua->post( $url, $field_name => $value,... Content => \%form )
$ua->post( $url, $field_name => $value,... Content => \@form )
$ua->post( $url, $field_name => $value,... Content => $content )
With your params and headers declared as hashes, I would try this:
my $referer = "http://example.com/something/cgi-bin/something.cgi?module=A";
my $url = "http://example.com/something/cgi-bin/something.cgi";
my %headers = (
'User-Agent' => 'Mozilla/5.0 (Windows; U; Windows NT 6.1; pl; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13',
'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Referer' => $referer,
'Content-Type' => 'application/x-www-form-urlencoded',
);
my %params = (
'a' => '0',
'b' => '1',
);
my $browser = LWP::UserAgent->new( );
$browser->cookie_jar({});
$response = $browser->post($url, \%params, %headers);
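Tying the two suggestions together, a complete sketch of the corrected script: params and headers as hashes, an explicit cookie jar, and the dump handlers from the first answer for debugging. Content-Type and Content-Length are left out because post() sets them itself when given a form hashref:
use strict;
use warnings;
use LWP::UserAgent;

my $referer = "http://example.com/something/cgi-bin/something.cgi?module=A";
my $url     = "http://example.com/something/cgi-bin/something.cgi";

my %headers = (
    'User-Agent' => 'Mozilla/5.0 (Windows; U; Windows NT 6.1; pl; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13',
    'Accept'     => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Referer'    => $referer,
);

my %params = ( a => 0, b => 1 );

my $browser = LWP::UserAgent->new();
$browser->cookie_jar({});    # in-memory jar; cookies persist for the life of this run

# Dump every request and response while debugging the cookie behaviour.
$browser->add_handler( "request_send",  sub { shift->dump; return } );
$browser->add_handler( "response_done", sub { shift->dump; return } );

# Form data goes as a hashref; the extra headers follow as key/value pairs.
my $response = $browser->post( $url, \%params, %headers );
print $response->content;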