What is the "unique" keyword in robots.txt? - robots.txt

I have the following code in robots.txt to allow crawling from everyone for now
User-agent: *
Disallow:
Before I changed it, the file looked like the one below. I've been looking for details about "unique" and can't find any. Has anyone seen this before, and what exactly is "unique" doing?
User-agent: *
Disallow: /unique/

It's not a keyword; it's a directory (URL path) on your server that web crawlers are being told not to visit.
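A quick way to see the difference between the two files is to feed them to Python's standard urllib.robotparser module. This is only a sketch: example.com and the page paths are placeholders, not taken from the question.

```python
from urllib.robotparser import RobotFileParser


def parse_rules(lines):
    """Build a parser from robots.txt lines without fetching anything."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser


# The old file: everything under /unique/ is off limits.
old_rules = parse_rules(["User-agent: *", "Disallow: /unique/"])

# The current file: an empty Disallow value blocks nothing.
new_rules = parse_rules(["User-agent: *", "Disallow:"])

# example.com is a placeholder host used only for illustration.
blocked_before = not old_rules.can_fetch("*", "https://example.com/unique/page.html")
other_ok_before = old_rules.can_fetch("*", "https://example.com/about.html")
allowed_now = new_rules.can_fetch("*", "https://example.com/unique/page.html")

print(blocked_before, other_ok_before, allowed_now)
```

In other words, "unique" is matched as a plain path prefix, the same way any other directory name would be.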

Related

Why does Google not index my "robots.txt"?

I am trying to allow the Googlebot webcrawler to index my site. My robots.txt initially looked like this:
User-agent: *
Disallow: /
Host: www.sitename.com
Sitemap: https://www.sitename.com/sitemap.xml
And I changed it to:
User-agent: *
Allow: /
Host: www.sitename.com
Sitemap: https://www.sitename.com/sitemap.xml
But Google is still not indexing my links.
"I am trying to allow the Googlebot webcrawler to index my site."
Robots rules have nothing to do with indexing! They are ONLY about crawling. A page can be indexed even if crawling it is forbidden!
The Host directive is supported only by Yandex.
If you want all bots to be able to crawl your site, your robots.txt file should be located at https://www.sitename.com/robots.txt, be served with status code 200, and contain:
User-agent: *
Disallow:
Sitemap: https://www.sitename.com/sitemap.xml
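To check what a crawler would make of the old and new files, they can be parsed locally with Python's urllib.robotparser. This is a rough sketch, not a model of how Googlebot itself behaves; the /some/page path is made up for the example, and site_maps() needs Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser


def parse_rules(lines):
    """Build a parser from robots.txt lines without fetching anything."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser


# The original file from the question: everything is blocked.
blocked = parse_rules([
    "User-agent: *",
    "Disallow: /",
    "Host: www.sitename.com",
    "Sitemap: https://www.sitename.com/sitemap.xml",
])

# The file recommended in the answer: nothing is blocked.
open_to_all = parse_rules([
    "User-agent: *",
    "Disallow:",
    "Sitemap: https://www.sitename.com/sitemap.xml",
])

# /some/page is a hypothetical URL used only for illustration.
url = "https://www.sitename.com/some/page"
print(blocked.can_fetch("Googlebot", url))      # crawling was forbidden
print(open_to_all.can_fetch("Googlebot", url))  # crawling is now permitted
print(open_to_all.site_maps())                  # sitemap URLs found in the file
```

Being crawlable is only a precondition here: whether and when the pages are indexed is still decided by the search engine.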
From the docs:
Robots.txt syntax can be thought of as the “language” of robots.txt files. There are five common terms you’re likely to come across in a robots file. They include:
User-agent: The specific web crawler to which you’re giving crawl instructions (usually a search engine). A list of most user agents can be found here.
Disallow: The command used to tell a user-agent not to crawl a particular URL. Only one "Disallow:" line is allowed for each URL.
Allow (Only applicable for Googlebot): The command to tell Googlebot it can access a page or subfolder even though its parent page or subfolder may be disallowed.
Crawl-delay: How many seconds a crawler should wait before loading and crawling page content. Note that Googlebot does not acknowledge this command, but crawl rate can be set in Google Search Console.
Sitemap: Used to call out the location of any XML sitemap(s) associated with this URL. Note this command is only supported by Google, Ask, Bing, and Yahoo.
Try to specifically mention Googlebot in your robots.txt-directives such as:
User-agent: Googlebot
Allow: /
or allow all web crawlers access to all content
User-agent: *
Disallow:

Check for specific text in Robots.txt

My URL ends with &content=Search. I want to block all URLs that end with this. I have added following in robots.txt.
User-agent: *
Disallow:
Sitemap: http://local.com/sitemap.xml
Sitemap: http://local.com/en/sitemap.xml
Disallow: /*&content=Search$
But it's not working when I test /en/search?q=terms#currentYear=2015&content=search in https://webmaster.yandex.com/robots.xml. It is not working for me because content=search comes after the # character.
The Yandex Robots.txt analysis will block your example if you test for Search instead of search, as Robots.txt Disallow values are case-sensitive.
If your site uses case-insensitive URLs, you might want to use:
User-agent: *
Disallow: /*&content=Search$
Disallow: /*&content=search$
# and possibly also =SEARCH, =SEarch, etc.
Having said that, I don’t know if Yandex really supports this for URL fragments (it would be unusual, I guess), although their tool gives this impression.
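Both points (case sensitivity and the # fragment) can be illustrated with a small matcher. This is a sketch under assumptions: it follows the wildcard semantics Google documents (* matches any sequence of characters, a trailing $ anchors the end of the URL), it says nothing about what Yandex does internally, and the function names are made up for the example.

```python
import re
from urllib.parse import urlsplit


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern with * and a trailing $ into a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal pieces and let each * stand for "any characters".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def rule_matches(pattern, url):
    """Match a pattern against path + query; the #fragment is dropped,
    because browsers and crawlers do not send it to the server."""
    parts = urlsplit(url)
    target = parts.path + ("?" + parts.query if parts.query else "")
    return pattern_to_regex(pattern).match(target) is not None


rule = "/*&content=Search$"

# Matching is case-sensitive: "Search" matches, "search" does not.
exact_case = rule_matches(rule, "/en/search?q=terms&content=Search")
lower_case = rule_matches(rule, "/en/search?q=terms&content=search")

# In the URL from the question, content=search sits in the fragment,
# so it is not part of what gets matched at all.
in_fragment = rule_matches(
    "/*&content=search$",
    "/en/search?q=terms#currentYear=2015&content=search",
)

print(exact_case, lower_case, in_fragment)
```

Under these assumptions a crawler that strips the fragment only ever sees /en/search?q=terms, so no rule that mentions content= can apply to that URL.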

Disallow pages that end with a number only in robots.txt

Is it possible to tell Google not to crawl these pages
/blog/page/10
/blog/page/20
…
/blog/page/100
These are basically Ajax calls that bring blog posts data.
I created this in robots.txt:
User-agent: *
Disallow: /blog/page/*
But now I have another page that I want to allow, which is
/blog/page/start
Is there a way to tell robots to block only pages that end with a number,
e.g.
User-agent: *
Disallow: /blog/page/(:num)
I also got an error below when I tried to validate the robots.txt file:
Following the original robots.txt specification, this would work (for all conforming bots, including Google’s):
User-agent: *
Disallow: /blog/page/0
Disallow: /blog/page/1
Disallow: /blog/page/2
Disallow: /blog/page/3
Disallow: /blog/page/4
Disallow: /blog/page/5
Disallow: /blog/page/6
Disallow: /blog/page/7
Disallow: /blog/page/8
Disallow: /blog/page/9
This blocks all URLs whose path begins with /blog/page/ followed by a digit (/blog/page/9129831823, /blog/page/9.html, /blog/page/5/10/foo etc.), while /blog/page/start stays crawlable.
So you should not append the * character (it’s not a wildcard in the original robots.txt specification, and not even needed in your case for bots that interpret it as wildcard).
Google supports some features for robots.txt which are not part of the original robots.txt specification, and therefore are not supported by (all) other bots, e.g., the Allow field. But as the above robots.txt would work, there is no need for using it.
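The prefix-matching behaviour can be checked with Python's urllib.robotparser, which implements plain prefix matching without wildcards. This is a minimal sketch using the /blog/page/ path from the question; example.com is a placeholder host.

```python
from urllib.robotparser import RobotFileParser

# One Disallow line per leading digit, as in the answer above.
lines = ["User-agent: *"] + [f"Disallow: /blog/page/{digit}" for digit in range(10)]

parser = RobotFileParser()
parser.parse(lines)

# example.com is a placeholder host used only for illustration.
base = "https://example.com"
results = {
    path: parser.can_fetch("*", base + path)
    for path in ("/blog/page/10", "/blog/page/100", "/blog/page/9.html", "/blog/page/start")
}

for path, allowed in results.items():
    print(path, "allowed" if allowed else "blocked")
```

Only /blog/page/start comes out as allowed, because it is the only path whose segment after /blog/page/ does not begin with a digit.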

Url blocked by robots.txt message in Google Webmaster

I have a WordPress site in the root domain. Now I've added a forum in a subfolder, mydomain/forum,
which generates a sitemap as follows: mydomain/forum/sitemap_index.xml.
After I submitted that sitemap to Google, it seems Google can't access the sub-sitemaps; it reports "Url blocked by robots.txt" - Value: mydomain/forum/sitemap-forums.xml?page=1 --- Value: mydomain/forum/sitemap-index.xml?page=1.
This is my robots.txt:
User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes
Disallow: /trackback
Disallow: /feed
Disallow: /comments
Disallow: /category/*/*
Disallow: */trackback
Disallow: */feed
Disallow: */comments
Disallow: /*?*
Disallow: /*?
Allow: /wp-content/uploads
# Google Image
User-agent: Googlebot-Image
Disallow:
Allow: /*
Sitemap: mydomain/sitemap_index.xml
Sitemap: mydomain/forum/sitemap_index.xml
What should I add to robots.txt? Any help would be greatly appreciated.
Thanks in advance
Just to clarify, I'm assuming 'mydomain' in your example is a stand-in for the scheme plus fully qualified domain name, correct? (e.g. "http://www.whatever.com", not "whatever.com" or "www.whatever.com") I figure this must be the case because you have it in the Google error message in the same format.
The error message suggests that Google is getting the URL from somewhere other than your robots.txt file. The robots.txt file lists the sitemap URL as:
mydomain/forum/sitemap_index.xml
but the error message shows that Google is trying to load the URL:
mydomain/forum/sitemap-index.xml?page=1
This second URL is getting blocked, because your robots.txt file blocks any URL that contains a question mark:
Disallow: /*?*
Disallow: /*?
(Incidentally, these two lines do exactly the same thing. You can safely delete the first one.) Google should still be able to read the sitemap file using the simpler URL, however, so the pages will probably still be crawled. If you really want to get rid of the error message, you could always add:
Allow: /forum/sitemap-index.xml?page=1
This will override the disallows for just the sitemap URL. (This will work on Google at least - YMMV for any other search engines)
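Why the Allow line wins can be sketched in a few lines of Python. The sketch assumes the precedence rule Google documents (the longest matching pattern wins, and Allow wins a tie) together with * as a wildcard; the helper names are hypothetical and other crawlers may resolve conflicts differently.

```python
import re


def to_regex(pattern):
    """Translate a robots.txt pattern with * and a trailing $ into a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(rules, target):
    """rules is a list of (kind, pattern) pairs, kind being 'allow' or 'disallow'.
    The longest matching pattern decides; on equal length, allow wins."""
    best = None  # (pattern length, is_allow)
    for kind, pattern in rules:
        if pattern and to_regex(pattern).match(target):
            candidate = (len(pattern), kind == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]


rules = [
    ("disallow", "/*?*"),
    ("disallow", "/*?"),
    ("allow", "/forum/sitemap-index.xml?page=1"),
]

print(is_allowed(rules, "/forum/sitemap-index.xml?page=1"))   # longer Allow wins
print(is_allowed(rules, "/forum/sitemap-forums.xml?page=1"))  # still blocked by /*?
print(is_allowed(rules, "/forum/sitemap_index.xml"))          # no "?" so no rule matches
```

Note that the second URL from the error message, sitemap-forums.xml?page=1, would need its own Allow line under this model.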

robots.txt remove entire subdomain/directory

I have a subdomain forums.example.com
It is in public_html/forums
If I put the following robots.txt in public_html/forums will it remove all the forums from the index? (I migrated forums to a different provider and want to remove all the forum pages from google's index)
User-agent: *
Disallow: /
User-agent: *
Disallow: /forums/
If you don't include the folder name "forums", all of your content on that web host will be blocked from Google and other crawlers.