Does modifying robots.txt take effect immediately?

I am trying to solve an issue where Googlebot seems to be eating up my CPU usage. To confirm my guess, I modified the robots.txt in my website's root folder, adding
Disallow: /
to it. I have two websites on different servers, and both of them are having this issue. For one of them, after I edited robots.txt the CPU usage dropped to a normal level; for the other, I can see from the Apache access log that Googlebot is still coming in.
So I went to Google Search Console to test robots.txt. For the first one, I can see that Google has already discovered the latest robots.txt and stopped crawling my website; for the second one, Google is still using an old version of robots.txt. So modifying robots.txt doesn't always take effect immediately, am I right? And if so, how do I notify Google that I have a new robots.txt?

You need to use this to disallow all user agents, though:
User-agent: *
Disallow: /
As for re-crawling, it might take anywhere from a few days to four weeks before Googlebot indexes a new site (reference).
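If you want to sanity-check the file locally, here is a minimal sketch using Python's standard-library robots.txt parser (the URL is just a placeholder). It shows why the User-agent line matters: a bare Disallow with no preceding User-agent line belongs to no rule group and is ignored.

from urllib.robotparser import RobotFileParser

# A "Disallow: /" without a User-agent line forms no valid rule group,
# so the parser ignores it and everything stays crawlable.
broken = RobotFileParser()
broken.parse(["Disallow: /"])
print(broken.can_fetch("Googlebot", "https://www.example.com/"))  # True

# With the User-agent line, the rule applies to all crawlers.
fixed = RobotFileParser()
fixed.parse(["User-agent: *", "Disallow: /"])
print(fixed.can_fetch("Googlebot", "https://www.example.com/"))  # False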

Related

Disallow URLs that end with "?m=0" in robots.txt

Hello, I want to disallow URLs like this one in robots.txt: "/2018/11/razones-para-ver-fallet.html?m=0". I mean the ones that end with "?m=0".
These URLs belong to the Blogger mobile view (I have now migrated to WordPress) and Googlebot is still indexing them (the sitemap in Search Console is the new one), causing some CPU problems.
I have tried Disallow: /*?m=0 but I'm still seeing them in the visit log.
Many thanks
What about this: Disallow: /*?*m=0
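For what it's worth, here is a rough sketch of Google-style wildcard matching using a hypothetical helper (an approximation, not Google's actual matcher): '*' matches any run of characters, '$' anchors the end, and a rule applies when the pattern matches a prefix of the URL.

import re

# Hypothetical approximation of Google-style robots.txt pattern matching:
# '*' matches any run of characters, '$' anchors the end of the URL,
# and a rule applies when the pattern matches a prefix of the URL.
def matches(pattern: str, url_path: str) -> bool:
    regex = re.escape(pattern).replace(r"\*", ".*").replace(r"\$", "$")
    return re.match(regex, url_path) is not None

print(matches("/*?m=0", "/2018/11/razones-para-ver-fallet.html?m=0"))   # True
print(matches("/*?*m=0", "/2018/11/razones-para-ver-fallet.html?m=0"))  # True

Under these approximate semantics both patterns match the example URL, so if the URLs keep showing up in the log it may simply be that already-indexed pages take a while to drop out: robots.txt stops crawling, not indexing.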

How to fix robots.txt

Full disclaimer: I am not a programmer, I am an SEO trying to learn how to not rely on my developer for every little question I have.
Currently my issue is this: I use Screaming Frog to crawl my sites to lay out the page titles, meta descriptions, h1, h2, etc., so I can more easily plan out my changes.
The other day I wanted to run a report for my client and my own company website and got the following back.
So I know robots.txt is a way to have pages on your site but not have Google crawl them. What I don't know is why an entire site would have this message as opposed to just some pages.
Can anyone give advice on how to fix this, or links to how-tos? I get this issue a lot and would like to educate myself so I don't have to wait for someone else. I get these as well when I try indexing websites in Google Search Console.
Many Thanks
What I don't know is why an entire site would have this message as opposed to just some pages.
The robots.txt for your website has not been written properly if the intention is to index its content.
Or Screaming Frog might have a bug, if the robots.txt file is indeed written properly.
Or some webmaster decided the content was not worth indexing on Google, or that bots would eat too much bandwidth (as in not being selective about restricting access).
Checking the current robots.txt file on that website, I see this content:
User-Agent: *
Disallow:
This means that any page of that website is allowed to be crawled by any crawler (here is an explanation of the file's syntax: https://moz.com/learn/seo/robotstxt).
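As a quick check, Python's standard-library parser agrees that an empty Disallow value allows everything (the URL below is just a placeholder):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# An empty Disallow value means "nothing is disallowed".
rp.parse(["User-Agent: *", "Disallow:"])
print(rp.can_fetch("ScreamingFrog", "https://www.example.com/any/page"))  # True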
So the current file should not cause the error the OP mentions. Seeing that this question is from June 30, 2017 and the robots.txt file was last modified on July 11, 2017, it seems that since this question was opened the OP may have already fixed whatever problem they had.

Robots.txt Disallow

I'm working with an e-commerce system at the moment that is throwing up hundreds of potential duplicate page URLs, and I'm trying to work out how to hide them via robots.txt until the developers are able to sort their ...... out.
I have managed to block most of them but got stuck on the last type, so the question is:
I have 4 URLs to the same product page with the structure below. How do I block the first one but not the others?
www.example.com/ProductPage
www.example.com/category/ProductPage
www.example.com/category/subcategory/ProductPage
www.example.com/category/subcategory/ProductPage/assessorypage
So far the only idea I can come up with is using:
Disallow: /*?id=*/
This, however, blocks everything…
EDIT: I believe I may have found a way to do it: set up a robots.txt file to disallow all, then allow just the specific paths I want below that, and then… once again disallow any specific paths after that.
Does anyone know if using disallow > allow > disallow like this has a negative effect on SEO?
You could add a rel="canonical" link tag to each page. This will help search engines know which URL is the 'right' one, so there is no more than one URL per product in search results.
Read here for more information
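For example, each duplicate variant of the product page would point at the preferred URL from its <head> (the domain and path here are placeholders):

<!-- On every variant of the product page, point to the preferred URL -->
<link rel="canonical" href="https://www.example.com/ProductPage" />

With this in place, search engines typically consolidate the duplicates onto the canonical URL instead of treating them as separate pages.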

Robots.txt: Allow a subfolder but not the parent

Can anybody please explain the correct robots.txt command for the following scenario.
I would like to allow access to:
/directory/subdirectory/..
But I would also like to restrict access to /directory/ notwithstanding the above exception.
Be aware that there is no real official standard and that any web crawler may happily ignore your robots.txt.
According to a Google Groups post, the following works, at least with Googlebot:
User-agent: Googlebot
Disallow: /directory/
Allow: /directory/subdirectory/
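Note that interpretation varies by parser: Googlebot applies the most specific (longest) matching rule, while some parsers apply the first rule that matches. At the time of writing, Python's standard-library parser is in the first-match camp, so it reads the file above the opposite way (the URL is a placeholder):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: Googlebot",
    "Disallow: /directory/",
    "Allow: /directory/subdirectory/",
])

# Python's parser applies the first matching rule, so "Disallow: /directory/"
# wins and this prints False. Googlebot uses the longest (most specific)
# match, so it WOULD crawl this URL.
print(rp.can_fetch("Googlebot", "https://www.example.com/directory/subdirectory/page"))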
I would recommend using Google's robots.txt tester in Google Webmaster Tools: https://support.google.com/webmasters/answer/6062598?hl=en
You can edit and test URLs right in the tool, and you get a wealth of other tools as well.
If these are truly directories then the accepted answer is probably your best choice. But, if you're writing an application and the directories are dynamically generated paths (a.k.a. contexts, routes, etc.), then you might want to use meta tags instead of defining them in the robots.txt. This gives you the advantage of not having to worry about how different crawlers may interpret/prioritize access to the subdirectory path.
You might try something like this in the code (pseudocode; emit the tag only when rendering a parent-directory path):
if is_parent_directory_path
  <meta name="robots" content="noindex, nofollow">
end

Prevent Google from indexing

Hi, what's the best way to prevent Google from showing a folder in the search engine? For example, www.example.com/support. What should I do if I want the support folder to disappear from Google?
The first thing I did was place a robots.txt file and include this code:
User-agent: *
Disallow: /support/etc
But the result is a total disaster: I am not able to use the support page anymore unless I remove the robots.txt.
What's the best thing to do?
robots.txt shouldn't affect the way your page functions. If in doubt, you can use tools to generate one, like http://www.searchenginepromotionhelp.com/m/robots-text-creator/simple-robots-creator.php or http://www.seochat.com/seo-tools/robots-generator/
When disallowing in a robots file, you can explicitly specify a file or subfolder rather than just a folder.
You can also use a meta tag in your document to tell the crawler not to index it:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
what's the best way to prevent Google from showing a folder in the search engine?
A robots.txt file is the right way to do this. Your example is correct for blocking the /support/etc directory and its descendants.
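Note that your rule only covers /support/etc and anything under it. If the goal is to hide the whole /support folder, the rules would look like this (a sketch):

User-agent: *
Disallow: /support/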
I am not able to use the support page anymore unless I remove the robots.txt
It doesn't make sense that a robots.txt file would affect the way your site functions, and it should certainly never affect which pages can be accessed by a human. I suspect something else is awry; check your server logs to see what kinds of errors are being recorded.
While not the preferred method of limiting robot access, Google talks about using a noindex meta tag here. This will also prevent the various pages from showing up if they are linked to by a site other than your own.
A good discussion of limiting bots that visit your site can be found here.