Disallow certain URLs in robots.txt

I'm currently running a web service where people can browse products. The URL for that is basically just /products/product_pk/. However, we don't serve products with certain product_pks, e.g. nothing smaller than 200. Is there a way to discourage bots from hitting URLs like /products/10/ (since they will just receive a 404)?
Thank you for your help :)

I am pretty sure that crawlers don't guess and try auto-generated URLs. A crawler fetches your website and follows the links it finds to decide what to crawl next. If you have any links that return a 404, that is bad design on your site, since they should not be there.
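That said, robots.txt has no numeric range syntax, so if you want to keep well-behaved bots away from the low IDs explicitly, the ranges have to be spelled out rule by rule. A rough sketch of what that could look like for the /products/product_pk/ structure (single digits shown; the two- and three-digit IDs below 200 would need their own lines):

    User-agent: *
    # Single-digit IDs; the trailing slash keeps e.g. /products/10/ crawlable
    Disallow: /products/0/
    Disallow: /products/1/
    Disallow: /products/2/
    Disallow: /products/3/
    Disallow: /products/4/
    Disallow: /products/5/
    Disallow: /products/6/
    Disallow: /products/7/
    Disallow: /products/8/
    Disallow: /products/9/
    # ...and likewise for /products/10/ through /products/199/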

Related

Googlebot guesses URLs. How to avoid/handle this crawling

Googlebot is crawling our site. Based on our URL structure it is guessing new possible URLs.
Our structure is of the kind /x/y/z/param1.value. Now Googlebot exchanges the values of x, y, z and value with tons of different keywords.
The problem is that each call triggers a very expensive operation, and it returns positive results only in very rare cases.
I tried to set a URL parameter in the crawling section of Webmaster Tools (param1. -> no crawling). But this seems not to work, probably because of our inline URL format (would it be better to use the HTML GET format ?param1=..?).
As Disallow: */param1.* seems not to be an allowed robots.txt entry, is there another way to disallow Google from crawling these URLs?
As another solution I thought of detecting Googlebot and returning it a special page.
But I have heard that this will be penalized by Google.
Currently we always return an HTTP status code 200 and a human-readable page which says: "No targets for your filter criteria found". Would it help to return another status code?
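For what it's worth, Googlebot (and most major crawlers) does honor * and $ wildcards in robots.txt even though they are not part of the original standard, so a rule along these lines may be worth trying for the /x/y/z/param1.value structure described above (the exact pattern is an assumption based on that structure):

    User-agent: Googlebot
    # Block any URL whose path contains a segment starting with "param1."
    Disallow: /*param1.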
Note: This is probably not a general answer!
Joachim was right. It turned out that Googlebot is not guessing URLs.
Doing a bit of research I found out that half a year ago I added a new DIV to my site containing those special URLs (which I had unfortunately forgotten). A week ago Googlebot started crawling it.
My solution: I deleted the DIV, and I also return a 404 status code on those URLs. I think that, sooner or later, Googlebot will stop crawling the URLs after revisiting my site.
Thanks for the help!

How to prevent Google from indexing redirect URL I do not own

A domain name that I do not own is redirecting to my domain. I don't know who owns it or why it is redirecting to my domain.
This domain, however, is showing up in Google's search results. When doing a whois it also returns this message:
"Domain:http://[baddomain].com webserver returns 307 Temporary Redirect"
Since I do not own this domain I cannot set a 301 redirect, or disable it. When clicking the baddomain in Google it shows the content of my website but the baddomain.com stays visible in the URL bar.
My question is: How can I stop Google from indexing and showing this bad domain in the search results and only show my website instead?
Thanks.
Some thoughts:
You cannot directly stop Google from indexing other sites, but what you could do is add the canonical tag to your pages so Google can see that the original content is located on your domain and not on the "bad domain".
For example, check out: https://support.google.com/webmasters/answer/139394?hl=en
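For reference, the canonical tag is just a link element in the head of each page on your own site; example.com below is a placeholder for your actual domain and path:

    <!-- In the <head> of every page on your own domain -->
    <link rel="canonical" href="https://www.example.com/some-page/" />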
Other actions can be taken SEO-wise if 'baddomain' is outscoring you in the search rankings, because then it sounds like your site could use some optimizing.
The better your site and domain rank in the SERPs, the less likely it is that people will see the scraped content and 'baddomain'.
You could, however, also look at the referrer of the request, and if it is 'baddomain' you should be able to redirect to your own domain, change the content, etc., because the code is being run from your own server.
But that might be more trouble than it's worth, as you'd need to investigate how 'baddomain' is doing things and code accordingly (probably an iframe or something similar from what you describe, but that can still be circumvented using scripts).
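A rough sketch of the referrer idea in PHP (assuming a PHP stack; the domain and the check itself are illustrative, and the referrer header can of course be missing or spoofed):

    <?php
    // If the request was referred by the bad domain, send the visitor a
    // permanent redirect to the same path on our own domain (placeholder).
    $referrer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';

    if (stripos($referrer, 'baddomain.com') !== false) {
        header('HTTP/1.1 301 Moved Permanently');
        header('Location: https://www.example.com' . $_SERVER['REQUEST_URI']);
        exit;
    }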
Depending on what country you and 'baddomain' are located in, there are also legal options, such as DMCA complaints. This, however, can also be quite a task, and it's often not worth it because a new domain will just pop up.

Dealing with 301 redirects for a brand new website

I have seen multiple articles on redirecting URLs when a site has been redesigned or the URLs have simply changed to a standard format, but I need to know how to manage the case where a new URL has no correlation to the old one.
For instance, an old URL may have been www.mysite.com/index.php?product=12, but there is no way to map that URL to the new site.
I don't want search engines to think that the page is broken, so I assume the best thing to do is to 301 redirect to the home page, but I am not sure how I would do that effectively. Would I just change the 404 error page to do a 301 to the home page?
Also, would that then cause issues with duplicate content via different URLs?
Is it better to just not worry about these and let the search engines re-index the new URLs?
I am running IIS7 with the URL Rewrite module and ASP.NET 2.
Thanks.
Why do you say there is no way to map that URL to the new one? There probably is, since both should be unique identifiers for a given resource. If your site has good rankings, it may be worth the pain to work this out and have a 301 redirect to the right page. In this way, the ranks should be unchanged.
Redirecting everything to the new home page will probably have a negative effect. It really depends on how the bots are going to interpret this, but it may look like an artificial way to increase the rank of the home page, and you could get a penalty for it.
Doing nothing and waiting for the bots to index your new site will of course work, but often you cannot afford to lose the high rank you have gained.
All in all, I would advise you to ask a new question here on how to map the old URLs to the new ones, and do proper redirects.
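Since the asker is on IIS7 with the URL Rewrite module, an individual 301 of the kind suggested above could be expressed as a rewrite rule in web.config; the target path below is only a placeholder for wherever product 12 lives on the new site:

    <!-- web.config: 301 the old /index.php?product=12 to its new location -->
    <system.webServer>
      <rewrite>
        <rules>
          <rule name="Old product 12" stopProcessing="true">
            <match url="^index\.php$" />
            <conditions>
              <add input="{QUERY_STRING}" pattern="^product=12$" />
            </conditions>
            <action type="Redirect" url="/products/widget-12/"
                    redirectType="Permanent" appendQueryString="false" />
          </rule>
        </rules>
      </rewrite>
    </system.webServer>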
That product URL you supplied is obviously, well, a product. The best bet is to 301 redirect it to the new page that is most relevant to that old page. If there aren't any external links pointing to it at all, just let it die. Be sure to remove it from any sitemaps or old internal navigation links, though, or it will keep getting re-indexed, which is what you want to avoid.
Once you have your new site structure set up, visit a site like AuditMyPc.com and create a brand new sitemap of your new site setup. Then log in to Google Webmaster Tools and resubmit the new sitemap. This normally fixes the problem, but if that page is already indexed, expect it to stay in Google's index for a while. They don't clean themselves up too well.

SEO redirects for removed pages

Apologies if SO is not the right place for this, but there are 700+ other SEO questions on here.
I'm a senior developer for a travel site with 12k+ pages. We completely redeveloped the site and relaunched in January, and with the volatile nature of travel, there are many pages which are no longer on the site. Examples:
/destinations/africa/senegal.aspx
/destinations/africa/features.aspx
Of course, we have a 404 page in place (and it's a hard 404 page rather than a 30x redirect to a 404).
Our SEO advisor has asked us to 30x redirect all our 404 pages (as found in Webmaster Tools), his argument being that 404s are damaging to our PageRank. He'd want us to redirect our Senegal and features pages above to the Africa page (which doesn't contain the content previously found on Senegal.aspx or features.aspx).
An equivalent for SO would be taking the URL of a removed question and redirecting it to /questions rather than showing a 404 'Question/Page not found'.
My argument is that, as these pages are no longer on the site, 404 is the correct status to return. I'd also argue that redirecting these to less relevant pages could damage our SEO (due to duplicate content, perhaps). It's also very time-consuming to redirect all 404s when our site takes some content from our in-house system, which adds/removes content at will.
Thanks for any advice,
Adam
The correct status to return is 410 Gone. I wouldn't want to speculate about what search engines will do if they are redirected to a page with entirely different content.
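Assuming the site runs on IIS (which the .aspx URLs suggest) with the URL Rewrite module, a 410 for a retired page can be produced with a CustomResponse rule; the rule below is only a sketch for one of the example paths from the question:

    <!-- web.config sketch: answer a removed page with 410 Gone instead of 404 -->
    <system.webServer>
      <rewrite>
        <rules>
          <rule name="Retired Senegal page" stopProcessing="true">
            <match url="^destinations/africa/senegal\.aspx$" />
            <action type="CustomResponse" statusCode="410"
                    statusReason="Gone"
                    statusDescription="This page has been permanently removed." />
          </rule>
        </rules>
      </rewrite>
    </system.webServer>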
As far as I know, a 404 is quite bad for SEO, because your site won't get any PageRank for pages that are linked from somewhere but missing.
I would add another page which explains that, due to the redesign, the original pages are no longer available, and which offers links to the most relevant remaining pages (e.g. to Africa and the FAQ). That page then sounds like a good 301 target for those removed pages.
This is actually a good idea.
As described at http://www.seomoz.org/blog/url-rewrites-and-301-redirects-how-does-it-all-work
(which is a good resource for the non-SEO people here)
404 is obviously not good. A 301 tells spiders/users that this is a permanent redirect of a resource. The content should not get flagged as duplicate because you are not sending a 200 (good page) response, so there is nothing spidered/compared.
This IS kind of a grey-hat tactic though, so be careful; it would be much better to put actual 301 redirects in place for the URLs being requested, and also to find who posted the erroneous link and, if possible, correct it.
I agree that 404 is the correct status, but then again you should take a step back and answer the following questions:
Do these old pages have any inbound links?
Did these old pages have any good content relevant to the page you are 301'ing them to?
Is there any active traffic that is trying to reach these pages?
While the pages may not exist, I would investigate the pages in question with those three questions, because you can steer incoming traffic and PageRank to existing pages that either need the PR/traffic or that are already high-traffic.
With regard to your in-house SEO saying you are losing PR: this can be true if those pages have inbound links, because those links will be met with a 404 status code and will not pass link juice, since nothing exists there any more. That's why 301s rock.
404s should not affect the overall PageRank of the other pages of a website.
If they are really gone then 404/410 is appropriate. Check the official Google Webmasters blog.

Can RSS readers follow redirects if the URL of the feed changes?

We are migrating to a SharePoint solution and our URLs are changing slightly.
Are most RSS readers able to follow redirect links without breaking the feed and requiring a manual update?
Most of the documentation I'm reading says that this will work for the major RSS readers.
I have read in some places that a lot of RSS readers will treat a 301 as a temporary redirect and not update their stored URL. Is there any truth to this?
Assuming you are using a 301 redirect, I would say yes, since any reader worth its salt is built on a compliant HTTP library which will honor the 301 status code and follow the redirect.
Of course, it's not that hard to test with the reader of your choice.
Pretty much every RSS reader - major or minor - will update the feed URL when it encounters a 301 redirect.
In my (limited) experience, most applications will ignore the "permanent" part of a permanent redirect and execute the same logic they would use for a temporary redirect.
It may be necessary to move a well-indexed site. What should you do to preserve PageRank, link popularity and traffic?
As I understand it, the solution is called a 301 redirect. It tells search engines that the URL has been permanently moved. Such a redirect has to be done in a particular way; at this link there are different options depending on what kind of server technology you use:
http://www.webconfs.com/how-to-redirect-a-webpage.php
I just tried it in practice. I use PHP on all my sites, so I used the PHP instructions:
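(Presumably something along these lines; the exact snippet from that page may differ slightly, and the target address below is a placeholder.)

    <?php
    // Permanent (301) redirect from the old URL to the new one.
    header("HTTP/1.1 301 Moved Permanently");
    header("Location: https://www.example.com/new-page/");
    exit;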
I stripped all the tags and content out of my old page and put the small code snippet on the page, inserted the new URL for the page, and saved it. I tested the page by typing the old URL, and the redirect worked. To be absolutely sure that the redirects are search engine friendly, I used this "Search Engine Friendly Redirect Checker":
http://www.webconfs.com/redirect-check.php
There is some disagreement about how well the 301 redirect works and whether it can transfer an entire site to a new domain (http://www.webmasterworld.com/link_deve ... 135964.htm), but people's experience says that it is good enough. Just make sure that the new URL has the same content as the old page had.