URLRewriteFile and "#" char in URL string - gwt

I'm using Google's scheme for making my GWT app searchable (https://developers.google.com/webmasters/ajax-crawling/docs/getting-started), and it works fine. Unfortunately, it seems Bing does not follow the same pattern/rule.
I thought I'd add a URL filter, based on user-agent, to map all URLs of the form
http://www.example.com/#!blah=something
to
http://www.example.com/?_escaped_fragment_=blah=something
only for BingBot, so that my CrawlerServlet returns the same content as it does for GoogleBot requests. I have a URLRewrite rule like:
<rule>
    <condition name="user-agent">Firefox/8.0</condition>
    <from use-query-string="true">^(.*)#!(.*)$</from>
    <to type="redirect">?_escaped_fragment_=$2</to>
</rule>
(I'm using a user-agent of Firefox to test)
This never matches. If I change the rule to ^(.*)!(.*)$ and try to match on
http://www.example.com/!blah=something
it will work, but using the same rule
http://www.example.com/#!blah=something
will not work, because it seems the URL string the filter is using is truncated at the "#".
Can anyone tell me if it's possible to make this work?

The browser doesn't send the hash to the server, as you've discovered. Watching a given request, you'll see that it only sends along the URL before the # symbol.
GET / HTTP/1.1
Host: example.com
...
From the link you mentioned:
Hash fragments are never (by specification) sent to the server as part of an HTTP request. In other words, the crawler needs some way to let your server know that it wants the content for the URL www.example.com/ajax.html#!key=value (as opposed to simply www.example.com/ajax.html).
From the descriptions in the text, it is the server's job to translate the 'ugly' URL into the pretty one (with a hash) and to send back a snapshot of what that page would look like if loaded with the hash on the client. That page may have other links using hashes to load other documents; the crawler will automatically translate those back to ugly URLs and request more data from the server.
So in short, this is not a change you should need to make; the GoogleBot will make it automatically, provided you have opted into using hash fragments. As for other bots, apparently Bing now supports this idea as well, but that appears to be outside the scope of your question.
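For completeness, here is what the server-side half of that contract can look like. This is a minimal, hypothetical sketch (the class and the renderSnapshot helper are illustrative, not from the question); the point is that the crawler's ugly URL arrives as an ordinary query parameter, so no rewrite of the fragment is needed:
// Hypothetical sketch: the crawler requests /?_escaped_fragment_=blah=something,
// so the fragment arrives as a normal, already-decoded query parameter.
public class CrawlerServlet extends javax.servlet.http.HttpServlet {
    @Override
    protected void doGet(javax.servlet.http.HttpServletRequest req,
                         javax.servlet.http.HttpServletResponse resp)
            throws java.io.IOException {
        String fragment = req.getParameter("_escaped_fragment_");
        if (fragment != null) {
            // Render the HTML snapshot the page would show when loaded
            // with #!<fragment> on the client.
            resp.setContentType("text/html");
            resp.getWriter().write(renderSnapshot(fragment));
        }
    }

    // Illustrative placeholder for however you produce the snapshot.
    private String renderSnapshot(String fragment) {
        return "<html><body>Snapshot for " + fragment + "</body></html>";
    }
}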

Related

google oauth callback appending parameters multiple times

We had been successfully using Google OAuth for years, but it suddenly stopped working a few days ago. In looking into this, it appears that after the user clicks "Allow" to grant access to the requested scope, Google is redirecting to our callback page (as it always has), but now the code and scope parameters are being appended to the URL multiple times (example below). Given query string length limits on our web server, this is now throwing a 404.15 error.
Since we have made no recent code changes and have not made any updates in the Google API Console, I don't believe we have done anything to cause the parameters to be appended multiple times to the callback URL. Is this an issue with Google? Or am I missing something that may have caused this issue?
Example callback URL:
http://example.com/oauth/oauthcallback?code=4/XADj4OhPIwWZRA5TsZMgOkMIfmuBVdQidarK_MhSmkpxWubmprbySMBnY4huJaYATwzf8B798OcHLfD-LdBBtfQ&scope=https://googleapis.com/auth/gmail.readonly&code=4/XADj4OhPIwWZRA5TsZMgOkMIfmuBVdQidarK_MhSmkpxWubmprbySMBnY4huJaYATwzf8B798OcHLfD-LdBBtfQ&scope=https://googleapis.com/auth/gmail.readonly&[the same code and scope pair repeated another dozen-plus times]&scope=https://www.googleapis.com/auth/gmail.readonly
I have resolved this. In case it helps someone else: sometime between 9/12/2018 and 9/14/2018, Google started returning an additional parameter ("scope") in its OAuth callback, in addition to the only other parameter ("code") that was previously returned. The scope value includes "https://www.googleapis.com", which collided with an existing URL rewrite rule on our end that strips "www" from our URLs. The rule's very generic syntax, which simply looked for "www.", caused a redirect loop until the 404.15 was thrown. By making the rewrite rule specific to our own URL, the scope parameter is ignored by the rule and the redirect loop is avoided.
Posting because this may help others. #fzebra's answer applied in my case, but ALSO my auth library forwards all query parameters that the OAuth provider sends to my redirect_uri onto the request it makes to retrieve the access_token. Because of this, and because I think Google has a parsing bug, the new scope parameter blows up the request: Google responds with a 400 Bad Request, and inspecting the JSON response you get a redirect_uri_mismatch. My guess is they see their own scope URL parameter as the redirect URI and invalidate the request.
To solve this, I needed to chop the scope query parameter off the outgoing request to Google, so I did it via a URL rewrite rule.
<!-- See https://stackoverflow.com/questions/52372359/google-oauth-callback-appending-parameters-multiple-times -->
<rule name="Google Login - Remove scope parameter" stopProcessing="true">
    <match url="google/redirect/url(.*)?$" />
    <conditions trackAllCaptures="true">
        <add input="{QUERY_STRING}" pattern="(.*)(&amp;?scope=.+&amp;?)(.*)" />
    </conditions>
    <action type="Rewrite" url="google/redirect/url?{C:1}{C:3}" appendQueryString="false" />
</rule>
This cuts the scope parameter and its value out of the incoming query string and joins the two remaining parts back together. Note the &amp; is because this is XML; in plain regex the expression is just (.*)(&?scope=.+&?)(.*). It will leave a trailing & in some cases.
You should replace google/redirect/url with the path to your auth URL (that Google redirects to).
You could do this in application-layer code, but a URL rewrite avoids an extra server request. 👍
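If your stack does let you touch the callback handling in code, the same trim can be done there. A minimal sketch in plain Java, assuming nothing about your framework (the class and method names are illustrative):
// Illustrative helper: drop the "scope" parameter from a raw query string
// before passing it to code that forwards it verbatim. Mirrors what the
// rewrite rule above does, without the trailing-& quirk.
public final class QueryStrings {
    public static String stripScope(String queryString) {
        if (queryString == null || queryString.isEmpty()) return queryString;
        StringBuilder kept = new StringBuilder();
        for (String pair : queryString.split("&")) {
            if (pair.startsWith("scope=")) continue; // drop scope=...
            if (kept.length() > 0) kept.append('&');
            kept.append(pair);
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        // Prints: code=4/XADj
        System.out.println(stripScope("code=4/XADj&scope=https://www.googleapis.com/auth/gmail.readonly"));
    }
}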
This fixed it finally. Jeez!

Redirect or forward

Looking through some legacy code I have in front of me, using Struts 1, I see:
<global-forwards>
    ...
    <forward name="accessDenied" path="/www/jsp/AccessDeniedForm.do" redirect="true" />
</global-forwards>
So it's just a global forward that sends the user to an access-denied page.
I am curious about the decision to redirect as opposed to forward. What are the advantages and disadvantages of using it?
Before discussing the pros and cons of that forward element with redirect set to true, let's understand what actually happens with that configuration. When redirect is set to true in the forward element, a redirect instruction is issued to the user-agent so that a new request is made for the forward's resource. This link will probably provide the detailed information you need.
The default value for redirect is false: the forward element simply forwards to the specified path on the server side, and that's it. If you set redirect to true, the browser makes another request. With that said, you can probably already see the pros and cons.
With a redirect, control can be passed to a different server or even another domain. A redirect takes a round trip: when one is issued, it is sent back to the client, with the redirected URL in a header instructing the browser to move to the next URL. This acts as a new request, and all the previous request and response data is lost.
With a forward, the forwarding is done on the server side and the client's browser URL does not change; the request data is not lost either. It behaves like a browser page refresh: whatever data was posted in the first submit is resubmitted again, so use it with caution.
Forward and redirect suit different scenarios; the global forward here should be a redirect because it is an error situation.
A redirect is slower, since it needs a round trip; forwards are faster.
If you specify redirect="true", Struts uses a client-side redirect (response.sendRedirect()). The JSP will be invoked by a new browser request, and any data stored in the old request will be lost.
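For reference, here is the same contrast in plain servlet terms; Struts sits on top of exactly these two calls. The path is taken from the question, the rest is an illustrative sketch:
import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.*;

public class AccessDeniedExample extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        boolean redirect = true; // what redirect="true" in the forward element means

        if (redirect) {
            // A 302 goes back to the browser, which issues a fresh request.
            // Request attributes are lost and the address bar changes.
            response.sendRedirect("/www/jsp/AccessDeniedForm.do");
        } else {
            // Server-side forward: same request object, no extra round trip,
            // browser URL unchanged.
            request.getRequestDispatcher("/www/jsp/AccessDeniedForm.do")
                   .forward(request, response);
        }
    }
}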

How to disallow access to a URL called without parameters with robots.txt

I would like to deny web robots access to a URL like this:
http://www.example.com/export
while allowing this kind of URL instead:
http://www.example.com/export?foo=value1
A spider bot is calling /export without a query string, causing a lot of errors in my log.
Is there a way to manage this filter in robots.txt?
I am assuming you have problems with bots hitting the first URL in your example.
As said in the comment, this is probably not possible, because http://www.example.com/export is the resource's base URL. Even if it were possible per the standard, I wouldn't trust bots to understand it properly.
I would also not send a 401 Access Denied or similar status when the URL is called without a query string, for the same reason: a bot could conclude that the resource is entirely off limits.
What I would do in your situation is, if somebody arrives at
http://www.example.com/export
send a 301 Moved Permanently redirect to the same URL with a query string containing some default values, like
http://www.example.com/export?foo=0
This should keep the search engine index clean. (It won't fix the logging problem you state in your comment, though.)
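A minimal sketch of that answer as a servlet, in case it helps; the /export path and the foo=0 default are from the discussion above, the class itself is illustrative:
import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.*;

// Illustrative servlet mapped to /export: requests arriving without a
// query string get a 301 to a canonical URL carrying default parameters.
public class ExportServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        if (request.getQueryString() == null) {
            response.setStatus(HttpServletResponse.SC_MOVED_PERMANENTLY); // 301
            response.setHeader("Location", request.getRequestURI() + "?foo=0");
            return;
        }
        // ... normal export handling when parameters are present ...
    }
}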

Can I change the headers of the HTTP request sent by the browser?

I'm looking into a RESTful design and would like to use the HTTP methods (POST, GET, ...) and HTTP headers as much as possible. I've already found out that the HTTP methods PUT and DELETE are not supported by the browser.
Now I'm looking to get different representations of the same resource and would like to do this by changing the Accept header of the request. Depending on this Accept header, the server can serve a different view on the same resource.
The problem is that I haven't found a way to tell my browser to change this header.
The <a> tag has a type attribute that can hold a MIME type, which looked like a good candidate, but the header sent was still the browser default (in Firefox it can be changed in about:config with the network.http.accept.default key).
I would partially disagree with Milan's suggestion of embedding the requested representation in the URI.
If at all possible, URIs should be used only for addressing resources, not for tunneling HTTP methods/verbs. If necessary, a specific business action (edit, lock, etc.) could be embedded in the URI when create (POST) or update (PUT) alone do not serve the purpose:
POST http://shonzilla.com/orders/08/165;edit
Requesting a particular representation in the URI disrupts your URI design, eventually making it uglier, mixing two distinct REST concepts in the same place (the URI), and making it harder to process requests generically on the server side. What Milan is suggesting is exactly this, and many sites, including Flickr, do the same.
Instead, a more RESTful approach is to encode the preferred representation in a separate place: the Accept HTTP header, which is used for content negotiation. The client tells the server which content types it can handle/process, and the server tries to fulfill the request. This approach is part of the HTTP 1.1 standard and is supported by compliant software, including web browsers.
Compare this:
GET /orders/08/165.xml HTTP/1.1
or
GET /orders/08/165?format=xml HTTP/1.1
to this:
GET /orders/08/165 HTTP/1.1
Accept: application/xml
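On the server side, honouring the header comes down to branching on its value. A minimal sketch (the path and payloads are illustrative; real content negotiation also weighs q-values and wildcards):
import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.*;

// Illustrative handler for /orders/...: chooses the representation
// from the Accept header instead of from the URI.
public class OrderServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        String accept = request.getHeader("Accept");
        if (accept != null && accept.contains("application/xml")) {
            response.setContentType("application/xml");
            response.getWriter().write("<order id=\"165\"/>");
        } else {
            response.setContentType("text/html");
            response.getWriter().write("<html><body>Order 165</body></html>");
        }
    }
}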
From a web browser you can request any content type by using setRequestHeader method of XMLHttpRequest object. For example:
function getOrder(year, yearlyOrderId, contentType) {
    var client = new XMLHttpRequest();
    client.open("GET", "/order/" + year + "/" + yearlyOrderId);
    client.setRequestHeader("Accept", contentType);
    client.send(); // a GET request carries no body
}
To sum it up: the address, i.e. the URI of a resource, should be independent of its representation, and the XMLHttpRequest.setRequestHeader method allows you to request any representation using the Accept HTTP header.
Cheers!
Shonzilla
I was looking to do exactly the same thing (for a RESTful web service), and I stumbled upon this Firefox add-on, which lets you modify the Accept header (actually, any request headers) for requests. It works perfectly.
https://addons.mozilla.org/en-US/firefox/addon/967/
I don't think it's possible to do it in the way you are trying to do it.
Indicating the accepted data format is usually done by adding an extension to the resource name. So, if you have a resource like
/resources/resource
and GET /resources/resource returns its HTML representation, then to indicate that you want its XML representation instead, you can use the following pattern:
/resources/resource.xml
You then have to do the accepted-content-type determination magic on the server side.
Or use Javascript as James suggests.
The ModHeader extension for Google Chrome is also a good option. You just set the headers you want and enter the URL in the browser; it will automatically apply the headers from the extension when you hit the URL. The only thing is, it will send those headers for each and every URL you hit, so you have to disable or delete it after use.
Use some JavaScript!
var xmlhttp = new XMLHttpRequest();
xmlhttp.open('PUT', 'http://www.mydomain.org/documents/standards/browsers/supportlist');
xmlhttp.send("page content goes here");

Why does Fiddler break my site's redirects?

Why does using Fiddler sometimes break my site on page transitions?
After a server-side redirect, in the HTTP response (as seen in Fiddler) I get this:
Object moved
Object moved to here.
The site is an ASP.NET 1.1 / VB.NET 1.1 [sic] site.
Why doesn't Fiddler just go there for me? I don't get it.
I'm fine with this issue when developing, but I'm worried that other proxy servers might cause it for real customers. I'm not even clear on exactly what is going on.
That's actually what Response.Redirect does. It sends a 302 - Object moved response to the user-agent. The user-agent then automatically goes to the URL specified in the 302 response. If you need a real server-side redirect without round-tripping to the client, try Server.Transfer.
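What Fiddler is showing you raw, and what a browser normally hides, is roughly this exchange (paths illustrative):
GET /oldpage.aspx HTTP/1.1
Host: example.com

HTTP/1.1 302 Found
Location: /newpage.aspx

<html><head><title>Object moved</title></head>
<body><h2>Object moved to <a href="/newpage.aspx">here</a>.</h2></body>
</html>
The browser sees the 302 and silently issues a second GET for /newpage.aspx; a raw proxy view or a hand-built request just shows you the 302 itself.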
If you merely constructed the request using the request builder, you're not going to see Fiddler automatically follow the returned redirect.
In contrast, if you are using IE or another browser, it will generally check the redirect header and follow it.
For IE specifically, I believe there's a timing corner case where the browser will fail to follow the redirect in obscure situations. You can often fix this by clicking Tools / Fiddler Options, and enabling both the "Server" and "Client" socket reuse settings.
Thanks user15310, it works with Server.Transfer:
Server.Transfer("newpage.aspx", true);
Firstly, transferring to another page using Server.Transfer conserves server resources. Instead of telling the browser to redirect, it simply changes the "focus" on the Web server and transfers the request. This means you don't get quite as many HTTP requests coming through, which therefore eases the pressure on your Web server and makes your applications run faster.
But watch out: because the "transfer" process can work on only those sites running on the server, you can't use Server.Transfer to send the user to an external site. Only Response.Redirect can do that.
Secondly, Server.Transfer maintains the original URL in the browser. This can really help streamline data entry techniques, although it may make for confusion when debugging.
That's not all: the Server.Transfer method also has a second parameter, "preserveForm". If you set this to True, using a statement such as Server.Transfer("WebForm2.aspx", True), the existing query string and any form variables will still be available to the page you are transferring to.
Read more here:
http://www.developer.com/net/asp/article.php/3299641/ServerTransfer-Vs-ResponseRedirect.htm