HTTP error: 403 while parsing a website - postgresql

I'm trying to parse this website: http://dl.acm.org/dl.cfm . The website doesn't allow web scrapers, so I get an HTTP error: 403 Forbidden.
I'm using Python, so I tried mechanize to automate filling in the form and clicking a button, but I got the same error.
I can't even open the HTML page with urllib2.urlopen(); it gives the same error.
Can anyone help me with this problem?

If the website doesn't allow web scrapers/bots, you shouldn't be using bots on the site to begin with.
But to answer your question, I suspect the website is blocking urllib's default user-agent. You're probably going to have to spoof the user-agent to a known browser by crafting your own request.
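import urllib2  # Python 2 module; Python 3 moved this to urllib.request
# Send a browser-like User-Agent instead of urllib2's default one
headers = {"User-Agent": "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11"}
req = urllib2.Request("http://dl.acm.org/dl.cfm", headers=headers)
response = urllib2.urlopen(req)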
EDIT: I tested this and it works. The site is actively blocking based on user-agents to stop badly made bots from ignoring robots.txt.
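For readers on Python 3, the same idea can be sketched with urllib.request, where urllib2's functionality now lives. This is only an illustration under assumptions: the helper names (build_request, allowed_by_robots, fetch) are made up for this example, and the robots.txt check is an extra courtesy step that the answer above hints at but does not show.

```python
from urllib import request, robotparser

# Browser-like User-Agent string taken from the answer above.
BROWSER_UA = "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11"


def build_request(url, user_agent=BROWSER_UA):
    """Return a Request object that sends the given User-Agent header."""
    return request.Request(url, headers={"User-Agent": user_agent})


def allowed_by_robots(robots_lines, user_agent, url):
    """Check a URL against the lines of an already-downloaded robots.txt."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


def fetch(url):
    """Perform the request (needs network access) and return the raw body."""
    with request.urlopen(build_request(url)) as response:
        return response.read()
```

Calling fetch("http://dl.acm.org/dl.cfm") would then send the request with the spoofed header; whether the site still answers that way today is not something this sketch can promise.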

Related

Replicating a POST request from a website form using Postman returns 500 internal server error

I'm trying to replicate a POST request normally made by a website form using Postman, but the server returns a 500 error.
The URL of the form I'm dealing with is here.
What I have done so far is inspect the network request using the Chrome or Safari dev tools, copy the request as cURL, import the cURL command into Postman, and send the request.
What are the possible reasons for the failure, and what are alternative ways to achieve the same result?
Postman Headers:
Most probably you used an invalid request body. The browser shows the parsed JSON body, so you might have copied an incomplete request body.
To get the full body, click "view source" and copy the full content.
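As an alternative to Postman, the request can also be rebuilt in a short script, which makes it easy to see exactly which body is sent. The sketch below is hypothetical: the helper name build_json_post is invented here, and the real URL and fields must come from the complete request body copied from the browser.

```python
import json
from urllib import request


def build_json_post(url, payload):
    """Build a POST request whose body is the complete JSON document."""
    body = json.dumps(payload).encode("utf-8")
    return request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to request.urlopen() sends it; comparing the script's body with the browser's raw body byte for byte is a quick way to spot a truncated copy.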

Microsoft OAuth redirects with 302 instead of 200, which breaks deep link logic on mobile device

I am using OAuth to authenticate with Microsoft:
https://login.microsoftonline.com/common/oauth2/v2.0/authorize...&redirect_uri=MYURL
(I also use similar approach with google: https://accounts.google.com/o/oauth2/v2/auth...redirect_uri=MYURL)
MYURL is https://admin.myrealdomain.com/code
(MYURL is an empty 200 Ok page on my server)
However, Microsoft Graph returns a 302 redirect from https://login.live.com/oauth20_authorize.srf...
and this causes issues with deep link handling (the page is simply not intercepted by the app).
I don't have any such issues with Google, though (200 status code).
It also seems that this worked just fine with Microsoft until recently. I am not sure whether I am missing something or MS has recently changed that logic.
Does anyone have any idea how I can solve it? Thanks!
It seems that you are executing the OAUTH code flow behind the scenes. It doesn't work this way.
You should pop up a browser dialog to request the authorization code. See reference here.
The steps:
1. Pop up a browser dialog whose URL is https://login.microsoftonline.com/common/oauth2/v2.0/authorize?...
2. After the user signs in, the browser is redirected to the redirect URL, where the authorization code is returned.
3. POST to https://login.microsoftonline.com/common/oauth2/v2.0/token?... to get the access token you need to call the Microsoft Graph API.
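The three steps can be sketched roughly as follows. This is a minimal illustration, not the asker's code: the function names are invented, the parameter names follow the standard OAuth 2.0 authorization code flow, and the client secret is assumed to apply only to confidential (server-side) clients. The token parameters are sent as a form-encoded POST body.

```python
from urllib import parse, request

AUTHORIZE_URL = "https://login.microsoftonline.com/common/oauth2/v2.0/authorize"
TOKEN_URL = "https://login.microsoftonline.com/common/oauth2/v2.0/token"


def build_authorize_url(client_id, redirect_uri, scope, state):
    """Step 1: the URL to open in a browser dialog."""
    query = parse.urlencode({
        "client_id": client_id,
        "response_type": "code",
        "redirect_uri": redirect_uri,
        "scope": scope,
        "state": state,
    })
    return AUTHORIZE_URL + "?" + query


def extract_code(redirected_url):
    """Step 2: read the authorization code from the redirect URL."""
    query = parse.parse_qs(parse.urlsplit(redirected_url).query)
    return query["code"][0]


def build_token_request(client_id, client_secret, redirect_uri, code):
    """Step 3: the POST request that exchanges the code for a token."""
    form = parse.urlencode({
        "client_id": client_id,
        "client_secret": client_secret,  # assumption: confidential client
        "grant_type": "authorization_code",
        "code": code,
        "redirect_uri": redirect_uri,
    }).encode("ascii")
    return request.Request(TOKEN_URL, data=form, method="POST")
```

On a mobile device the redirect in step 2 is where the app would normally intercept the URL; the sketch only shows what each request looks like.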

isomorphic-unfetch 302 response working slow in IE11

I am using next.js and need to send an AJAX request (I am using isomorphic-unfetch). In some cases the response has a 302 HTTP status code and I need to redirect to the URL that comes in the response. For redirecting, I have tried both Router.push(url) and window.location.href=url, since it is a client-side redirection.
The problem is that an error appears in IE11 (other browsers work well).
The redirection happens, but it takes about 10 seconds. While googling I found "SCRIPT7002: XMLHttpRequest: Network Error 0x2ef3, Could not complete the operation due to error 00002ef3", which partly describes my problem, but none of the proposed solutions fixed it.
Other posts suggest changing the response status, but I have no control over the requested endpoint, so I can't change anything.
Any ideas on how to solve this?

apache security

I need to use a Facebook application, but my web page returns response code 206 instead of 200,
so the Facebook application returns HTTP code 500.
I tested with http://developers.facebook.com/tools/debug/og/object?q=http://adserver.leadhouse.net/test/test/index.php and it returns 206, whereas joomla.it returns 200,
even though both give the same curl -I response data.
I also tested with this Perl script: http://pastebin.com/NCDv9eTh
and my page is vulnerable, whereas joomla.it is fine.
I think my answer lies somewhere between
Facebook debugger : Response 206
and Apache Webserver security and optimization tips
but I don't understand how to change my Apache configuration.
The solution is on this page:
www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35.2
with code similar to this:
SetEnvIf Range (,.*?){5,} bad-range=1
RequestHeader unset Range env=bad-range
or
httpd.apache.org/docs/2.2/mod/core.html#limitrequestfieldsize
How can I make my web pages less vulnerable?
I have no idea what kind of “vulnerability” you are talking about here.
Facebook debugger showing a response status code 206 is normal – because the debugger tries to only request the first x (K)Bytes from your URL. If your server accepts such range requests and answers them correctly, then the response code will be 206.
There is no vulnerability in that.
If this causes you any other problems with your site – then please describe them in a manner that makes them comprehensible.
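The behaviour described in this answer can be reproduced with a small script that asks for only the first bytes of a page. This is a sketch under assumptions: the helper names and the byte range are chosen for illustration, and nothing here reflects how the Facebook debugger is actually implemented.

```python
from urllib import request


def build_range_request(url, first_byte, last_byte):
    """Build a GET request asking only for the given inclusive byte range."""
    byte_range = "bytes=%d-%d" % (first_byte, last_byte)
    return request.Request(url, headers={"Range": byte_range})


def describe_status(status):
    """Explain what the status code says about Range support."""
    if status == 206:
        return "partial content: the server honoured the Range header"
    if status == 200:
        return "full content: the server ignored the Range header"
    return "other status: %d" % status
```

Opening build_range_request(url, 0, 1023) with request.urlopen() and passing response.status to describe_status() shows which of the two cases a given server falls into.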
Yes, everything started with the Facebook debugger: the send dialog returns HTTP code 500 when my page returns HTTP code 206.
My curiosity is focused on the DoS vulnerability associated with HTTP code 206, which I tested with the Perl script http://pastebin.com/NCDv9eTh
Here are some significant passages from the Apache documentation:
This vulnerability concerns a 'Denial of Service' attack. This means
that a remote attacker, under the right circumstances, is able to slow
your service or server down to a crawl or exhausting memory available
to serve requests, leaving it unable to serve legitimate clients in a
timely manner.
There are no indications that this leads to a remote exploit; where a
third party can compromise your security and gain foothold of the
server itself. The result of this vulnerability is purely one of
denying service by grinding your server down to a halt and refusing
additional connections to the server.
Since the LimitRequestFieldSize workaround is insufficient,
you can change how the Range header is handled by following the Mitigation section
of the Apache wiki page: http://wiki.apache.org/httpd/CVE-2011-3192
This switches the returned HTTP code from 206 to 200.
It improves your Apache configuration, but you are still exposed to the DoS vulnerability.
I enabled mod_headers and added this line:
RequestHeader unset Range
and now my page returns HTTP code 200.
To limit the memory exhaustion caused by serving requests,
I limit connections per IP by adding mod_limitipconn with this setting:
MaxConnPerIP 10
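To see which Range headers the SetEnvIf rule quoted in the question would flag, the same pattern can be tried in Python. This is only a sketch: it assumes Python's re module treats this particular pattern the same way Apache's regex engine does, and the function name is_bad_range is made up for the example.

```python
import re

# Same pattern as in "SetEnvIf Range (,.*?){5,} bad-range=1":
# it matches once the header contains at least five commas,
# which means six or more byte ranges.
BAD_RANGE = re.compile(r"(,.*?){5,}")


def is_bad_range(range_header):
    """Return True if the rule would mark this Range header as bad."""
    return BAD_RANGE.search(range_header) is not None
```

A normal single-range request such as "bytes=0-499" passes untouched, while the many-range headers used by the attack script would have their Range header removed.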

Why do I get this error "Unknown host http:80"?

I'm developing an application for BlackBerry. I'm displaying a web page using Eclipse and the net.rim.device.api.browser.field.* API. When I click a submit button in a form, I get this error: "Unknown host http:80". Can anyone help me?
Don't know anything about Blackberries, but it looks like you're entering a URL where your program is only expecting a host name.
It sounds like the form on the web page is not set up properly, causing the post action to post to an invalid URL. It would help if you included the app code and the form HTML.
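One way such an error could arise is a form action with a doubled scheme, so that "http" itself is parsed as the host name. The Python sketch below only illustrates that parsing behaviour; it is a guess at the cause, not BlackBerry code, and extract_host is a hypothetical helper.

```python
from urllib.parse import urlsplit


def extract_host(value):
    """Return the host name from either a full URL or a bare host."""
    # A bare host has no scheme, so prefix "//" to make urlsplit
    # treat it as a network location rather than a path.
    candidate = value if "://" in value else "//" + value
    return urlsplit(candidate).hostname
```

With a well-formed URL the function returns the real host, but for "http://http://example.com/form" it returns "http", which a client would then try to reach on the default port 80.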
In this 2005 forum thread people complain about getting that kind of error on their Blackberries.
I'm on the server side, and I can see some proxy servers trying to access my server either with HTTP/1.0 and no HTTP_HOST (which my app requires) or with the wrong HTTP_HOST.
For example, I am getting requests for widgets.twimg.com , www.google-analytics.com , servedby.jumpdisplay.com . My server doesn't host those domains, so the response is obviously not any of the sites on the server; instead I give back an error.
So, it might be that your Blackberry is not providing the right HTTP_HOST to the server (or none at all) and the server doesn't know what to do with it.
To me, that is the fault of the Blackberry (or of whatever proxy might sit between you and the server).