Why do some servers respond with a 503 error unless my User-Agent starts with "Mozilla"? - webserver

I'm writing a client that grabs a page from a web server. On one particular server, it would work fine from my web browser, but my code was consistently getting the response:
HTTP/1.1 503 Service Unavailable
Content-Length: 62
Connection: close
Cache-Control: no-cache,no-store
Pragma: no-cache
<html><body><b>Http/1.1 Service Unavailable</b></body> </html>
I eventually narrowed this down to the User-Agent header I was sending: if it contains Mozilla, everything is fine (I tried many variations of this). If not, I get 503. As soon as I realized it was User-Agent, I remembered having this same issue in the past (different project, different servers), but I've never figured out why.
In this particular case, the web server I'm connecting to is running IIS 7.5, but I'm not sure if there are any proxies/firewalls/etc in front of it (I suspect there probably is something because of this behaviour).
There's an interesting history to User-Agents which you can read about on this question: Why do all browsers' user agents start with "Mozilla/"?
It's clearly no issue for me to have Mozilla in my User-Agent, but my question is simply: what configuration or server software causes this to happen, and why would anyone want this behaviour?

Here is an interesting history of this phenomenon: User Agent String History
The main reason this exists is that the internet, web, and browsers were not designed but evolved, with a high degree of backwards compatibility plus a lot of vendor-exclusive extensions. In particular, frames (which are widely considered a bad idea these days) were not well supported by Mosaic, but were by Netscape (which had Mozilla as its user agent).
Server administrators then had a choice: did they use the new hip cool frames and only support Netscape, or did they serve old boring pages that everyone could use? Their choice was a hack: if a client says it's Mozilla, send it frames; if not, don't.
This ruined everything. IE had to call itself Mozilla-compatible, everyone impersonated everyone else; it's all well detailed in the link at the top. But the problem more or less went away in the modern era, as everyone impersonated everyone and supported more and more of a common subset of features.
And then mobile and smartphone browsers became widespread. Suddenly, there weren't just a few main browsers with basically the same features, plus a few outliers you could easily ignore. Now there were dozens of small browsers, with less power, fewer abilities, and odd, disjoint sets of capabilities! So many servers took the easy road and simply did not send the proper data, or any data at all, to any browser they did not recognize.
Now, rather than a poorly rendered or inoperable website, you had no website at all on certain platforms, and a perfect one on others. This worked, but wasn't tolerable for many businesses; they wanted to work right on ALL platforms, because that's how the web was supposed to work.
Mobile versions, mobile first, responsive design, media queries, all these were designed to fill in those gaps. But for the most part, a lot of websites still just ignore less than modern browsers. And media queries were quickly subverted: no one wants to declare their browser is handheld, oh no. We're a real display browser, even if our screen is only 3 inches, yes sir!
In summary, some servers are configured to drop any browser which is not Mozilla compatible because they think it's better to serve no page than a poorly rendered one.
I've also seen arguments that this improves security, because then the server doesn't have to deal with rogue programs that aren't browsers (much like your own) connecting to it. Since the user agent is trivial to change, this holds no water for me; it's simply security through obscurity.

Many firewalls are configured to drop any request that does not carry a "proper" User-Agent header, since many DDoS tools do not bother to send one - it's an easy, reliable filter.
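To make the filtering concrete, here's a toy Python sketch (my own illustration, not the actual IIS or firewall logic, which we can't see) of the kind of rule the asker ran into: any request whose User-Agent doesn't contain "Mozilla" gets a 503 before the application ever sees it.

```python
def filter_request(user_agent):
    """Toy version of a 'Mozilla-compatible only' filter: requests
    with a missing or unrecognized User-Agent are rejected with 503
    before reaching the real application."""
    if user_agent is None or "Mozilla" not in user_agent:
        return 503
    return 200

# Matches the behaviour described in the question:
assert filter_request("Mozilla/5.0 (Windows NT 6.1) Gecko/20100101") == 200
assert filter_request("MyCustomClient/1.0") == 503
assert filter_request(None) == 503
```

Which is also why the asker's workaround (putting "Mozilla" anywhere in the User-Agent) satisfies such filters: they typically do a substring match, not a real parse.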

Related

Database info not showing when previewing site on mobile?

I have made a simple full-stack application that uses a PostgreSQL database. When previewing the site on desktop it works fine and retrieves all the information with no problem, as long as my backend server is running. When I try to preview the site on my phone using my IP address followed by the port number, the page comes up just fine, but only the frontend displays; I can't see any information from my backend or database. Does anyone know why that is, or how I can fix it to display on my phone (without hosting the site)?
1. Maybe it's just a caching issue.
Check your mobile browser's cache settings. Browsers generally use caching for performance: values you previously requested are stored locally and reused when a similar request comes in, instead of fetching fresh ones, so the phone may be showing stale data.
2. Maybe it's a front-end CSS problem.
If design-related elements such as CSS are off, content can be invisible on screen even though the server data was loaded normally.
3. Or maybe the front end can't get data from the server at all.
In that case, debug the server code to confirm the response is actually being sent, and use the browser's network tools to check whether the response is received normally.
Even if checking these three doesn't solve the problem, at least you'll know exactly what the problem is.
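For the third case, one very common cause in this exact setup (an assumption, since the question doesn't name the stack) is that the backend is bound to 127.0.0.1, or the frontend hardcodes http://localhost:PORT as its API base URL; both work on the desktop but fail from the phone. A minimal stdlib sketch of the binding difference:

```python
from http.server import HTTPServer, SimpleHTTPRequestHandler

# A backend bound to 127.0.0.1 is reachable only from the same
# machine, so the phone's requests never arrive. Binding to
# 0.0.0.0 (all interfaces) makes it reachable from other devices
# on the LAN via the computer's IP address.
server = HTTPServer(("0.0.0.0", 0), SimpleHTTPRequestHandler)  # port 0 = any free port
print("listening on", server.server_address)
# server.serve_forever()  # uncomment to actually serve requests
server.server_close()
```

The same idea applies to the frontend's fetch URLs: any hardcoded localhost needs to become the computer's LAN IP (or a relative URL) so the phone's requests actually reach the backend.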

Why are websites requiring referer headers (and failing silently)?

I've been noticing a very quirky trend lately and I'm baffled by it. In the past month or two, I've begun to notice sites breaking without a referer header.
As background: you'll of course remember the archaic days when referer headers were misused for everything from feature detection to some misguided appearance of security. There are still some legacy sites that depend on it, but for the most part referer headers have been relegated to shitty device detection.
Imagine my surprise when not one, but three modern websites are suddenly breaking without a referer.
Codepen: pen previews and full page views just break (i.imgur.com/3abXqsC.png). But editor view works perfectly.
Twitter: basically every interactive function breaks. If you try to tweet, retweet, favourite, etc. you get a generic, non-descriptive error (i.imgur.com/E6tIKFo.png). If you try to update a setting, it just flat out refuses (403) (i.imgur.com/51e2d0M.png).
Imgur: It just can't upload anything (i.imgur.com/xCWpkGX.png) and eventually gives up (i.imgur.com/iO2UlR6.png).
All three are modern websites. Codepen was already broken since I started using it so I'm not sure if it was always like that, but Twitter and Imgur used to work perfectly fine with no referer. In fact I had just noticed Imgur breaking.
Furthermore, all of them generate only non-descriptive error messages, if any at all, which don't identify the problem. It took a lot of trial and error for me to figure it out the first two times; now trying referer headers is one of the first things I do. But wait! There's more! All it takes to un-bork them is to send a generic referer that's the root of the host (i.e. twitter.com, codepen.io, imgur.com). You don't even need to use actual URLs with directory paths!
One website, I can chalk it up to shitty code. But three, major, modern websites - especially when they used to work - is a huge head scratcher.
Has anybody else noticed this trend or know wtf is going on?
While Referer headers don't "add security", they can be used to filter out requests from browsers (which play by referer rules) that shouldn't be invoking them. It's not making the site "secure" against arbitrary HTTP clients, but it is a fair filter for browsers (running on behalf of possibly unsuspecting users) acting as proxies.
Here are some possibilities:
It might prevent hijacked (or phished) users, and/or other injection attacks on form POSTs (non-idempotent requests), which are not constrained by the Same-Origin Policy.
Some requests can leak a little bit of information, even with the Same-Origin Policy.
Limit 3rd-party use of embedded content such as iframes, videos/images, and other hotlinking.
That is, while it definitely should not be considered a last line of defence (e.g. it should not replace proper authentication and CSRF tokens), it does help reduce some exposure to undesired access from browsers.
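A sketch of the strict check the question describes (hypothetical names; the actual logic of those sites is obviously unknown): the request is rejected when the Referer is missing entirely, and only the host part is compared, which is exactly why a bare root URL like twitter.com is enough to un-bork things.

```python
from urllib.parse import urlparse

def referer_allowed(referer, allowed_host):
    """Strict Referer check of the kind the question observes:
    no Referer at all -> rejected (the silent 403s), and only the
    host of the URL is compared, so a bare root URL with no path
    passes just as well as a real page URL."""
    if not referer:
        return False
    return urlparse(referer).hostname == allowed_host

assert referer_allowed("https://twitter.com/", "twitter.com")          # bare root is enough
assert referer_allowed("https://twitter.com/settings", "twitter.com")  # real path also fine
assert not referer_allowed(None, "twitter.com")                        # missing -> rejected
assert not referer_allowed("https://evil.example/", "twitter.com")     # wrong host -> rejected
```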

Recreate a site from a tcpdump?

It's a long story, but I am trying to save an internal website from the pointy hair bosses who see no value from it anymore and will be flicking the switch at some point in the future. I feel the information contained is important and future generations will want to use it. No, it's not some adult site, but since it's some big corp, I can't say any more.
The problem is, the site is a mess of ASP and Flash that only works under IE7, is buggy under IE8, and is 32-bit only even. All the URLs are session-style gibberish. The Flash objects themselves pull extra information with GET requests to ASP endpoints. It's really poorly designed for scraping. :)
So my idea is to run a tcpdump as I navigate the entire site, then somehow dump the result of every GET into an SQL database. Then, with a little messing with the hosts file, redirect every request to some CGI script that will look up the matching GET request in the database and return the data. The entire site would then live in an SQL database as URL/data key pairs. A flat file may also work.
In theory, I think this is the only way to go about this. The only problem I see is if they do some client side ActiveX/Flash stuff that generates session URLs that will be different each time.
Anyway, I know Perl, and the idea seems simple with the right modules, so I think I can do most of the work in that, but I'm open to any other ideas before I get started. Maybe this exists already?
Thanks for any input.
To capture, I wouldn't use tcpdump, but either the crawler itself or a web proxy that can be tweaked to save everything, e.g. Fiddler, Squid, or mod_proxy.
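Whatever does the capturing, the replay side the asker sketches (a URL to response-body store) is simple; here's a minimal Python/SQLite illustration of the key-pair idea (names are mine, not from any existing tool):

```python
import sqlite3

# Sketch of the URL -> response-body store the asker describes:
# a capturing proxy would INSERT rows as the site is browsed,
# and the replay CGI/handler would SELECT by exact request URL.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, body BLOB)")

def save(url, body):
    db.execute("INSERT OR REPLACE INTO pages VALUES (?, ?)", (url, body))

def replay(url):
    row = db.execute("SELECT body FROM pages WHERE url = ?", (url,)).fetchone()
    return row[0] if row else None   # None -> serve a 404

save("/page.asp?id=42", b"<html>archived</html>")
assert replay("/page.asp?id=42") == b"<html>archived</html>"
assert replay("/missing") is None
```

The caveat the asker already spotted still applies: this only works if URLs are stable between capture and replay, which session-generated URLs may break.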

SOP issue behind reverse proxy

I've spent the last 5 months developing a GWT app, and it's now time for third-party people to start using it. In preparation, one of them has set my app up behind a reverse proxy, and this immediately resulted in problems with the browser's same-origin policy. I guess there's a problem in the response headers, but I can't seem to rewrite them in any way that makes the problem go away. I've tried this
response.setHeader("Server", request.getRemoteAddress());
in some sort of naive attempt to mimic the behaviour I want. Didn't work (to the surprise of no-one).
Anyone knowing anything about this will most likely snicker and shake their heads when reading this, and I do not blame them. I would snicker too, if it was me... I know nothing at all about this, and that naturally makes this problem awfully hard to solve. Any help at all will be greatly appreciated.
How can I get the header rewrite to work and get away from the SOP issues I'm dealing with?
Edit: The exact problem I'm getting is a pop-up saying:
"SmartClient can't directly contact
URL
'https://localhost/app/resource?action='doStuffs'"
due to browser same-origin policy.
Remove the host and port number (even
if localhost) to avoid this problem,
or use XJSONDataSource protocol (which
allows cross-site calls), or use the
server-side HttpProxy included with
SmartClient Server."
But I shouldn't need the SmartClient HttpProxy, since I have a proxy on top of the server, should I? I've gotten no indication that this could be a serialisation problem, but maybe this message is hiding the real issue...
Solution
chris_l and saret both helped to find the solution, but since I can only mark one I marked the answer from chris_l. Readers are encouraged to bump them both up, they really came through for me here. The solution was quite simple, just remove any absolute paths to your server and use only relative ones, that did the trick for me. Thanks guys!
The SOP (for AJAX requests) applies, when the URL of the HTML page, and the URL of the AJAX requests differ in their "origin". The origin includes host, port and protocol.
So if the page is http://www.example.com/index.html, your AJAX request must also point to something under http://www.example.com. For the SOP, it doesn't matter if there is a reverse proxy - just make sure that the URL as it appears to the browser (including port and protocol) isn't different. The URL you use internally is irrelevant - but don't use that internal URL in your GWT app!
Note: The solution in the special case of SmartClient turned out to be using relative URLs (instead of absolute URLs to the same origin). Since relative URLs aren't an SOP requirement in browsers, I'd say that's a bug in SmartClient.
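The origin comparison itself can be sketched in a few lines (a plain-Python illustration of the rule, nothing GWT-specific):

```python
from urllib.parse import urlparse

DEFAULT_PORTS = {"http": 80, "https": 443}

def origin(url):
    """The 'origin' the SOP compares: scheme + host + port.
    The path plays no part in the comparison."""
    p = urlparse(url)
    return (p.scheme, p.hostname, p.port or DEFAULT_PORTS.get(p.scheme))

# Same origin: scheme, host and port all match, so AJAX is allowed.
assert origin("http://www.example.com/index.html") == origin("http://www.example.com/ajax/endpoint")
# Any difference in scheme, host or port is a different origin:
assert origin("http://www.example.com/") != origin("https://www.example.com/")
assert origin("http://www.example.com/") != origin("http://www.example.com:8080/")
```

This is also why relative URLs sidestep the whole problem: the browser resolves them against the page's own origin, so they can never mismatch.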
What issue are you having exactly?
Having previously had to write a reverse proxy for a GWT app, I can't remember hitting any SOP issues. One thing you need to do, though, is make sure response headers and URIs are rewritten to the reverse proxy's URL - this includes AJAX callback URLs.
One issue I hit (which you might also experience) when running behind a reverse proxy was with the serialization policy of the GWT server.
Fixing this required writing an implementation of RemoteServiceServlet. While this was in early/mid 2009, it seems the issue still exists.
It seems others have hit this as well - see this question for further details (the answer by Michele Renda in particular).

making LWP Useragent faster

I need to perform a large number of HTTP POST requests and ignore the responses. I'm currently doing this using LWP::UserAgent. It seems to run somewhat slowly, though I'm not sure whether it's waiting for a response or what. Is there any way to speed it up, and possibly just ignore the responses?
bigian's answer is probably the best for this, but another way to speed things up is to use LWP::ConnCache to allow LWP to re-use existing connections rather than build a new connection for every request.
Enabling it is simple if you're pounding on just one site:
my $conn_cache = LWP::ConnCache->new;
$conn_cache->total_capacity(1);   # keep one idle connection around for reuse
$ua->conn_cache($conn_cache);
(Note: total_capacity takes a plain number; the brackets in the LWP::ConnCache docs just mark the argument as optional.)
I've found this to double the speed of some operations against HTTP sites, and more than double it for HTTPS sites.
LWP::Parallel
http://metacpan.org/pod/LWP::Parallel
"Introduction
ParallelUserAgent is an extension to the existing libwww module. It allows you to take a list of URLs (it currently supports HTTP, FTP, and FILE URLs. HTTPS might work, too) and connect to all of them in parallel, then wait for the results to come in."
It's great, it's worked wonders for me...