Getting Error 500 when trying GET from website using LWP::UserAgent - perl

Can someone help me with this using LWP::UserAgent, please? I am using:
use WWW::Mechanize;
my $mech = WWW::Mechanize->new(autocheck => 0);
$mech->get($url);
my $content = $mech->content;
but I am getting Error 500 when trying to get https://camelcamelcamel.com/.

It seems that the site blocks requests from "bad" (undesired) user agents. You can make WWW::Mechanize (LWP::UserAgent) present itself as another user agent by passing the agent parameter to new or by calling the agent method. A full IE8 identification string fixed the problem.
I have tested it using Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1).
[As listed by the "User Agent Switcher" plugin for Firefox]
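For example, a minimal sketch of that fix (assuming the IE8 string above; any common browser identification should work):
use strict;
use warnings;
use WWW::Mechanize;

# Present a browser-like User-Agent so the site accepts the request.
my $mech = WWW::Mechanize->new(
    autocheck => 0,
    agent     => 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)',
);
$mech->get('https://camelcamelcamel.com/');
print $mech->status, "\n";   # expect 200 instead of 500 once the agent is accepted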
Short list of user agents (xml file)
Long list of user agents
WARNING
The site(s) may use or choose to use other means to block unwanted requests.

Related

Simple API request not working - 403 error

I am trying to run a simple API request from a Perl script, but it does not seem to work. The same request, if copied into a web browser, works without any problem.
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
my $query = 'http://checkdnd.com/api/check_dnd_no_api.php?mobiles=9944384761';
my $result = get($query);
print $result."\n";
When I use getprint($query), it gives a 403 error.
If you take a look at the body of the response (i.e. not only at the status code 403) you will find:
The owner of this website (checkdnd.com) has banned your access based on your browser's signature (2f988642c0f02798-ua22).
This means that it is blocking the client because it probably looks too much like a non-browser. For this site a simple fix is to include some User-Agent header. The following works for me:
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
$ua->default_header('User-Agent' => 'Mozilla/5.0');
my $resp = $ua->get('http://checkdnd.com/api/check_dnd_no_api.php?mobiles=9944384761');
my $result = $resp->decoded_content;
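To reproduce the ban message quoted above, a minimal inspection sketch (same URL, no User-Agent header) that prints the status line and the body even on failure:
use strict;
use warnings;
use LWP::UserAgent;

my $resp = LWP::UserAgent->new->get('http://checkdnd.com/api/check_dnd_no_api.php?mobiles=9944384761');
print $resp->status_line, "\n";      # e.g. "403 Forbidden"
print $resp->decoded_content, "\n";  # body containing the ban message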
The site in question seems to be served by Cloudflare which has some thing they call "Browser Integrity Check". From the support page for this feature:
... looks for common HTTP headers abused most commonly by spammers and denies access to your page. It will also challenge visitors that do not have a user agent or a non standard user agent (also commonly used by abuse bots, crawlers or visitors).

Mangled URL Parameters in IE9

I'm seeing mangled URL parameters coming from IE9 desktop clients. The links are sent via email, and all of the mangled URLs come from the plain-text version of the email.
I'm almost sure that it has nothing to do with my stack (django, nginx, mandrill). The values of the parameters have their characters shifted: the original character is the mangled one minus 13 places (e.g. rznvy_cynva = email_plain, ubgryfpbz = hotelscom).
Here is one example of a mangled request that came through:
GET /book/48465?sid=rznvy_cynva&order=q09362qs55-741722-442521-98n2-n88s4nnr87192n&checkOut=07-17-15&affiliate=ubgryfpbz&checkIn=07-16-15 HTTP/1.1" 302 5 "-" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"
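As a quick check (an illustrative decoding sketch, not part of the original report), applying rot13 to the mangled values recovers the originals:
use strict;
use warnings;

# rot13: map a-m <-> n-z (and A-M <-> N-Z), leaving other characters alone.
sub rot13 {
    my ($s) = @_;
    $s =~ tr/A-Za-z/N-ZA-Mn-za-m/;
    return $s;
}

print rot13('rznvy_cynva'), "\n";  # email_plain
print rot13('ubgryfpbz'), "\n";    # hotelscom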
All of the requests with mangled URLs have the same user-agent as the example.
The IP addresses associated with the mangled URLs aren't restricted to any location.
Looking up the user-agent, this seems to be restricted to desktop Windows 7, IE9 users.
It is anti-malware software on your recipients' computers. It gets the links and scans your pages for any possible vulnerabilities. It uses rot13 obfuscation to ensure that it doesn't take any unwanted actions ("buy now", etc.).
https://security.stackexchange.com/questions/48684/help-investigating-potential-website-attack-url-rewriting-and-rot-13-obfuscatio
The solution is to track down what anti-malware software / company is performing the scans, and get your site whitelisted if possible.
This is going into the realm of speculation, but I'm also guessing you cannot get any answers which don't, so here goes ...
The rot13 encryption does not look like an accident. I have two guesses to offer:
Somebody is sharing their email and obfuscating query parameters in links so as to break the "order now", "unsubscribe", etc. links while maintaining the overall integrity of the email messages. Maybe this is a feature of a spam-reporting tool or similar?
Alternatively, the queries are made from within a test network where users are not supposed to click on links, but the tools in there need pretty much unrestricted Internet access; so the admin set up an HTTP proxy which rewrites the query URLs to dismantle most GET transactions with parameters. (POST requests I guess would still probably work?)
Your observation that the IP addresses seem to be nonlocalized somewhat contradicts these hypotheses, but it could just mean that you are looking at TOR endpoints or similar.

Perl Mechanize - How to disable Kerberos?

I have a situation where I need to check certain conditions of an internal web application.
First, I need to check whether the application is loading or not. For this I have used the Perl WWW::Mechanize module, using the get method to load the URL. The problem I am facing is that it shows 401 Unauthorized, and if I send the username and password as parameters to the credentials method it works fine.
I just want to check whether the webpage is loading or not, without entering the credentials. Printing a message if it loads would be fine.
You can do a direct request with LWP and check the return code. If it is 401, you know that the server was responding. Whether this also means that your application is working depends on who is responsible for checking the authorization.
use strict;
use warnings;
use LWP::UserAgent;
my $resp = LWP::UserAgent->new->get('http://example.com');
# 401 means the server answered but wants credentials.
print "Application is responding\n" if $resp->code == 401;

Why is IE9 sending a user agent string of IE6?

I'm getting a bunch of errors on my application with the user agent string being:
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
Looking this up on useragentstring.com, this is supposed to be internet explorer 6 while the user claims he is using internet explorer 9.
I'm no expert in user agents. Can someone tell me why IE9 would be disguising itself as IE6, or what else I might be missing here? Is there a way to "really" detect the browser server-side? Can I do a redirect server-side (using ColdFusion) or in .htaccess?
Thanks!
This is what I could find from an archive of almost all user agent strings.
Explanation: This string has a bit of history. We originally published it as a EudoraWeb string, since it was self-identified by a site user as being that. However:
We got some email about this string suggesting that it was not eudora since it had no Eudora in it. To be fair the supplier of the string also voiced some doubt since it was left to the user to identify the string. If anyone can shed some more light on this topic - please email us and we'll publish.
We got some more comment which says it looks so much like a normal Win 2K that we've moved it. The suggestion is that both .NET strings are added when the MS WindowsUpdate system is used. Explanation from Matt Hair - thanks.
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
http://www.zytrax.com/tech/web/msie-history.html

HTTP error: 403 while parsing a website

So I'm trying to parse this website: http://dl.acm.org/dl.cfm. This website doesn't allow web scrapers, so I get an HTTP error: 403 Forbidden.
I'm using Python, so I tried mechanize to fill the form (to automate filling the form or clicking a button), but again I got the same error.
I can't even open the HTML page using the urllib2.urlopen() function; it gives the same error.
Can anyone help me with this problem?
If the website doesn't allow web scrapers/bots, you shouldn't be using bots on the site to begin with.
But to answer your question, I suspect the website is blocking urllib's default user-agent. You're probably going to have to spoof the user-agent to a known browser by crafting your own request.
import urllib2

# Spoof the User-Agent of a known browser so the site accepts the request.
headers = {"User-Agent": "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11"}
req = urllib2.Request("http://dl.acm.org/dl.cfm", headers=headers)
urllib2.urlopen(req)
EDIT: I tested this and it works. The site is actively blocking based on user agents to stop badly made bots from ignoring robots.txt.