Googlebot guesses URLs. How to avoid/handle this crawling - robots.txt

Googlebot is crawling our site. Based on our URL structure, it is guessing new possible URLs.
Our structure is of the kind /x/y/z/param1.value. Googlebot now swaps the values of x, y, z and value for tons of different keywords.
The problem is that each call triggers a very expensive operation, and it returns positive results only in very rare cases.
I tried to set a URL parameter in the crawling section of Webmaster Tools (param1. -> no crawling), but this does not seem to work, probably because of our inline URL format (would it be better to use the HTML GET format ?param1=...?).
As Disallow: */param1.* does not seem to be an allowed robots.txt entry, is there another way to disallow Google from crawling these pages?
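(Side note: Google's own robots.txt parser does support * and $ wildcards, even though they are not part of the original robots.txt standard, so a rule along the following lines should keep Googlebot away from any URL whose path contains "param1."; this is only a sketch for the /x/y/z/param1.value structure described above, and other crawlers may not honor the wildcard.)
User-agent: Googlebot
Disallow: /*param1.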
As another solution I thought of detecting Googlebot and returning it a special page, but I have heard that Google penalizes this (cloaking).
Currently we always return an HTTP status code 200 and a human-readable page which says: "No targets for your filter criteria found". Would it help to return another status code?

Note: This is probably not a general answer!
Joachim was right: it turned out that Googlebot is not guessing URLs.
Doing a bit of research, I found out that half a year ago I added a new DIV to my site containing those special URLs (which I had unfortunately forgotten about). A week ago Googlebot started crawling it.
My solution: I deleted the DIV and I now return a 404 status code for those URLs. I think that, sooner or later, Googlebot will stop crawling the URLs after revisiting my site.
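For anyone with a similar setup, here is a minimal sketch of that change in Python/Flask (the framework, the route shape and the find_targets helper are illustrative assumptions, not the original site's code): return a real 404 when a /x/y/z/param1.value URL matches nothing, instead of a 200 page saying no targets were found.

    from flask import Flask, abort

    app = Flask(__name__)

    def find_targets(x, y, z, value):
        # Placeholder for the expensive lookup described above; assume it returns a list.
        return []

    @app.route("/<x>/<y>/<z>/param1.<value>")
    def filtered(x, y, z, value):
        targets = find_targets(x, y, z, value)
        if not targets:
            abort(404)  # tell crawlers this guessed/linked combination does not exist
        return "Found %d targets" % len(targets)  # placeholder for the real results page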
Thanks for the help!

Related

Disallow certain URLs in robots.txt

I'm currently running a web service where people can browse products. The URL for that is basically just /products/product_pk/. However, we don't serve products with certain product_pks, e.g. nothing smaller than 200. Is there a way to discourage bots from hitting URLs like /products/10/ (because they will receive a 404)?
Thank you for your help :)
I am pretty sure that crawlers don't try autogenerated URLs just to see them fail. A crawler crawls your website and finds the next links to crawl. If you have any links that return 404, that is bad design on your site, since they should not be there.
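Building on that point, a quick way to find such links on your own pages is to scan the HTML for /products/ URLs below the cut-off; a rough sketch in Python (the requests library, the example.com URL and the regex are all assumptions for illustration):

    # Sketch: flag internal links to product IDs that are never served (< 200),
    # since such links are what actually attracts crawlers to those URLs.
    import re
    import requests

    PRODUCT_LINK = re.compile(r'href="(/products/(\d+)/)"')

    def bad_product_links(page_url, min_pk=200):
        html = requests.get(page_url, timeout=10).text
        return [path for path, pk in PRODUCT_LINK.findall(html) if int(pk) < min_pk]

    print(bad_product_links("https://example.com/"))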

Why robots.txt doesn't work when I do a redirection from http to https

Today I experienced a problem with Google search.
When I type "trakopolis" into Google it shows my page (so it is indexed by Google's robots), but the description of the page is not available. It is very important to have a description for my website.
The website is:
https://trakopolis.com
The robots.txt file is as follows, so I allow everything:
User-agent: *
Allow: /
https://www.google.com.ua/?gws_rd=cr#gs_rn=23&gs_ri=psy-ab&tok=O7cIXclKCSxtMd3uDVRVhg&cp=2&gs_id=h&xhr=t&q=trakopolis&es_nrs=true&pf=p&output=search&sclient=psy-ab&oq=tr&gs_l=&pbx=1&bav=on.2,or.r_qf.&bvm=bv.50165853,d.bGE&fp=d3f611552977418f&biw=1680&bih=949
But as you can see, the description is not available. I am confused :( Sorry if the question is stupid.
As far as I can see from Google Webmaster Tools, Google uses this robots.txt file, so maybe the issue is the redirection from http to https? The website doesn't allow http and we only use https. Also, on the main page I redirect to the Login.aspx page if the user is not authenticated.
Google shows a description when searching for "trakopolis".
It seems that your robots.txt disallowed crawling of your site some time ago, as some other search engines still display that they are not allowed to show your description, e.g. DuckDuckGo.
Note that your robots.txt uses Allow, which is not part of the original robots.txt specification (but many parsers understand it anyway). It’s equivalent to:
User-agent: *
Disallow:
(But because parsers have to ignore unknown fields, you should have no problem using Allow. An empty or non-existent robots.txt always allows crawling of everything.)
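As a quick way to rule out the http-to-https redirect as the culprit, you can check what a crawler actually receives when it asks for robots.txt over both schemes; a small sketch using Python and the requests library (the URLs are just the ones from the question):

    # Print the status code and any redirect target for robots.txt over http and https.
    import requests

    for url in ("http://trakopolis.com/robots.txt",
                "https://trakopolis.com/robots.txt"):
        resp = requests.get(url, allow_redirects=False, timeout=10)
        print(url, resp.status_code, resp.headers.get("Location", ""))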

Google couldn't follow your URL because it redirected too many times

I was fixing URLs on a website. One of the problems was that the URLs contained characters that were sometimes upper-case and other times lower-case; the server did not care, but Google did, and indexed the pages as duplicates.
Also, some URLs contained characters that are simply not allowed in that part of the URL, like commas "," and brackets "()", although round brackets are technically not reserved.
I still decided to get rid of them by encoding them.
I added a check that tests whether the URL is valid and, if not, does a 301 redirect to the correct URL.
For example,
http://www.example.com/articles/SomeGreatArticle(2012).html
would do a 301 redirect to
http://www.example.com/articles/somegreatarticle%282012%29.html
It works, and it does a single redirect to the correct URL.
But for a small fraction of the pages (which are possibly the only pages Google has indexed so far), Google Webmaster Tools started to give me the following error under the Crawl errors > Not followed tab:
Google couldn't follow your URL because it redirected too many
times.
Googling for this error in quotes gives me 0 results, and I'm sure I'm not the only one to ever get it, so I would like to know more about it, for example:
How many redirects can a single page do before Google decides that it's too many?
What are the other possible causes for such an error?
SOLUTION
According to this experiment http://www.monperrus.net/martin/google+url+encoding
Google has its own character encoding rules: Google will always encode some characters and always decode others.
The following characters are never encoded:
-,.#~_*)!$'(
So even if you give Google this URL:
http://www.example.com/articles/somegreatarticle%282012%29.html
where the round brackets () are encoded, Google will transform this URL, decode the brackets and follow this URL instead:
http://www.example.com/articles/somegreatarticle(2012).html
What happened in my situation:
http://www.example.com/articles/somegreatarticle(2012).html
my server would do a 301 redirect to
http://www.example.com/articles/somegreatarticle%282012%29.html
while Googlebot would ignore the encoded brackets and follow:
http://www.example.com/articles/somegreatarticle(2012).html
get redirected to
http://www.example.com/articles/somegreatarticle%282012%29.html
follow
http://www.example.com/articles/somegreatarticle(2012).html
get redirected to
http://www.example.com/articles/somegreatarticle%282012%29.html
and give up after a couple of tries and show the "Google couldn't follow your URL because it redirected too many times" error.
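One way to make such a canonicalization immune to this ping-pong is to leave the characters Google never encodes alone, and to redirect only when the normalized URL actually differs from the requested one. A small framework-agnostic sketch in Python (the character list is the one from the experiment above; the surrounding redirect plumbing is assumed):

    from urllib.parse import quote, unquote

    # Characters that, per the experiment above, Google never percent-encodes.
    GOOGLE_SAFE = "-,.#~_*)!$'("

    def canonical_path(path):
        # Lower-case and re-encode, but keep Google's "never encoded" characters literal,
        # so Googlebot's own decoding cannot undo the canonical form.
        return quote(unquote(path).lower(), safe="/" + GOOGLE_SAFE)

    def handle(request_path):
        target = canonical_path(request_path)
        if target != request_path:
            return 301, target   # one redirect to the stable canonical form
        return 200, None         # already canonical: serve the page, no further redirects

With this, /articles/SomeGreatArticle(2012).html redirects once to /articles/somegreatarticle(2012).html, which is already canonical, so Googlebot's decoding of the brackets no longer bounces it back to an encoded variant.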
I don't know about Google Webmaster Tools, but I have seen a similar error in PHP when there is an infinite loop of redirection. Make sure that none of the pages redirects to itself.
OK, first of all I would remove the () and , signs from the URLs; Googlebot has a harder time working with these, and they don't bring any SEO benefit either.
Readability for the client isn't an issue, so if I were you I would just use a - or _ dash.
Try not to use any other characters in your file/folder names.
You should also clean up your HTML; there are quite a few errors and issues to resolve.
A cleaner source is better for Google, browsers and your visitors.
I couldn't find any definitive problem that Google would have an issue with.

Get Google to index links from JavaScript-generated content

On my site I have a directory of things which is generated through jQuery AJAX calls, which subsequently create the HTML.
To my knowledge, Google and other bots aren't aware of DOM changes after the page load, and won't index the directory.
What I'd like to achieve, is to serve the search bots a dedicated page which only contains the links to the things.
Would adding a noscript tag to the directory page be a solution? (in the noscript section, I would link to a page which merely serves the links to the things.)
I've looked at both the robots.txt and the meta tag, but neither seem to do what I want.
It looks like you stumbled on the answer to this yourself, but I'll post the answer to this question anyway for posterity:
Implement Google's AJAX crawling specification. If links to your page contain #! (a URL fragment starting with an exclamation point), Googlebot will send everything after the ! to the server in the special query string parameter _escaped_fragment_.
You then look for the _escaped_fragment_ parameter in your server code, and if present, return static HTML.
(I went into a little more detail in this answer.)
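For concreteness, here is a minimal sketch of the server-side half in Python/Flask (the route and the list_things helper are illustrative assumptions, not part of the specification): when the _escaped_fragment_ parameter is present, return plain static HTML with real links; otherwise serve the normal JavaScript-driven page.

    from flask import Flask, request

    app = Flask(__name__)

    def list_things():
        # Placeholder: return the items that the jQuery code would normally render client-side.
        return ["thing-1", "thing-2"]

    @app.route("/directory")
    def directory():
        if "_escaped_fragment_" in request.args:
            # Googlebot rewrote a #! URL into ?_escaped_fragment_=...: serve a static snapshot.
            links = "".join('<a href="/things/%s">%s</a>' % (t, t) for t in list_things())
            return "<html><body>%s</body></html>" % links
        # Regular visitors get the shell page whose content is filled in by jQuery/AJAX.
        return '<html><body><div id="directory"></div><script src="/static/directory.js"></script></body></html>'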

SEO redirects for removed pages

Apologies if SO is not the right place for this, but there are 700+ other SEO questions on here.
I'm a senior developer for a travel site with 12k+ pages. We completely redeveloped the site and relaunched in January, and with the volatile nature of travel, there are many pages which are no longer on the site. Examples:
/destinations/africa/senegal.aspx
/destinations/africa/features.aspx
Of course, we have a 404 page in place (and it's a hard 404 page rather than a 30x redirect to a 404).
Our SEO advisor has asked us to 30x redirect all our 404 pages (as found in Webmaster Tools), his argument being that 404s are damaging to our PageRank. He'd want us to redirect our Senegal and features pages above to the Africa page (which doesn't contain the content previously found on Senegal.aspx or features.aspx).
An equivalent for SO would be taking a url for a removed question and redirecting it to /questions rather than showing a 404 'Question/Page not found'.
My argument is that, as these pages are no longer on the site, 404 is the correct status to return. I'd also argue that redirecting these to less relevant pages could damage our SEO (perhaps due to duplicate content?). It's also very time-consuming to redirect all 404s when our site takes some content from our in-house system, which adds/removes content at will.
Thanks for any advice,
Adam
The correct status to return is 410 Gone. I wouldn't want to speculate about what search engines will do if they are redirected to a page with entirely different content.
As far as I know, 404 is quite bad for SEO because your site won't get any PageRank for pages that are linked from somewhere but missing.
I would add another page which explains that, due to the redesign, the original pages are not available, and which offers links to the most relevant other pages (e.g. to Africa and the FAQ). That page sounds like a good 301 target for those URLs.
This is actually a good idea.
As described at http://www.seomoz.org/blog/url-rewrites-and-301-redirects-how-does-it-all-work
(which is a good resource for the non seo people here)
404 is obviously not good. A 301 tells spiders/users that this is a permanent redirect of a source. The content should not get flagged as duplicate because the old URL is not sending a 200 (good page) response, so there is nothing to spider/compare.
This IS kind of a grey-hat tactic though, so be careful. It would be much better to put actual 301 redirects in place where Google is looking for the page, and also to find out who posted the erroneous link and, if possible, correct it.
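As a sketch of how the two suggestions above can be combined (the Flask setup and the target path for the Africa page are assumptions; the real mapping would come from the in-house system): 301 the retired URLs that have a genuinely relevant replacement, and answer 410 Gone for everything else that has been removed.

    from flask import Flask, redirect, request

    app = Flask(__name__)

    # Hand-maintained map from retired URLs to the most relevant surviving page.
    REDIRECTS = {
        "/destinations/africa/senegal.aspx": "/destinations/africa.aspx",   # assumed target path
        "/destinations/africa/features.aspx": "/destinations/africa.aspx",  # assumed target path
    }

    @app.errorhandler(404)
    def retired_or_gone(error):
        target = REDIRECTS.get(request.path)
        if target:
            return redirect(target, code=301)      # a relevant replacement exists
        return "This page has been removed.", 410  # removed for good: say so explicitly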
I agree that 404 is the correct status, but then again you should take a step back and answer the following questions:
Do these old pages have any inbound links?
Did these old pages have any good, relevant content to the page you are 301'ing it to?
Is there any active traffic that is trying to reach these pages?
While the pages may no longer exist, I would investigate them with those three questions in mind, because you can steer incoming traffic and PageRank to other existing pages that either need the PR/traffic or are already high-traffic.
With regard to your in-house SEO saying you are losing PR: this can be true if those pages have inbound links, because those links will be met with a 404 status code and will not pass link juice, since nothing exists there any more. That's why 301s rock.
404s should not affect the overall PageRank of the other pages of a website.
If they are really gone, then 404/410 is appropriate. Check the official Google Webmasters blog.