Robots.txt - prevent index of .html files

I want to prevent indexing of *.html files on our site, so that only the clean URLs are indexed.
So I would like www.example.com/en/login indexed but not www.example.com/en/login/index.html.
Currently I have:
User-agent: *
Disallow: /
Disallow: /**.html - not working
Allow: /$
Allow: /*/login*
I know I can just disallow them individually, e.g. Disallow: /*/login/index.html, but my issue is that I have a number of these .html files that I do not want indexed - so I wondered if there was a way to Disallow them all instead of doing them one by one?

First of all, you keep using the word "indexed", so I want to make sure you're aware that the robots.txt convention is only about suggesting to automated crawlers that they avoid certain URLs on your domain. Pages listed in a robots.txt file can still show up in search engine indexes if the engine has other data about them. For instance, Google explicitly states they will still index and list a URL, even if they're not allowed to crawl it. I just wanted you to be aware of that in case you are using the word "indexed" to mean "listed in a search engine" rather than "getting crawled by an automated program".
Secondly, there's no standard way to accomplish what you're asking for. Per "The Web Robots Pages":
Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "User-agent: bot", "Disallow: /tmp/*" or "Disallow: *.gif".
That being said, it's a common extension that many crawlers do support. For example, in Google's documentation of the directives they support, they describe pattern-matching support that does handle using * as a wildcard. So, you could add a Disallow: /*.html$ directive and then Google would not crawl URLs ending with .html, though they could still end up in search results.
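As a minimal sketch (assuming you are happy to rely on that non-standard extension and that everything else on the site may be crawled), the whole file could be as small as:
User-agent: *
Disallow: /*.html$
Crawlers that don't implement wildcards will treat /*.html$ as a literal path prefix that never matches anything, so for them the rule is effectively ignored.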
But, if your primary goal is telling search engines what URL you consider "clean" and preferred, then what you're actually looking for is specifying Canonical URLs. You can put a link rel="canonical" element on each page with your preferred URL for that page, and search engines that use that element will use it in order to determine which path to prefer when displaying that page.
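For example, the page served at www.example.com/en/login/index.html could declare the clean URL in its head element (the exact href is simply whichever URL you prefer):
<link rel="canonical" href="https://www.example.com/en/login">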


Noindex in a robots.txt

I've always stopped Google from indexing my website using a robots.txt file. Recently I've read an article from a Google employee where he stated you should do this using meta tags. Does this mean robots.txt won't work? Since I'm working with a CMS my options are very limited and it's a lot easier just using a robots.txt file. My question is: what's the worst that could happen if I proceed using a robots.txt file instead of meta tags?
Here's the difference in simple terms:
A robots.txt file controls crawling. It instructs robots (a.k.a. spiders) that are looking for pages to crawl to “keep out” of certain places. You place this file in your website’s root directory.
A noindex tag controls indexing. It tells spiders that the page should not be indexed. You place this tag in the code of the relevant web page.
Use the robots.txt file when you want control at the directory level or across your site. However, keep in mind that robots are not required to follow these directives. Most will, such as Googlebot, but it is safer to keep any highly sensitive information out of publicly-accessible areas of the site.
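For instance, a minimal robots.txt that asks every crawler to stay out of a hypothetical /private/ directory (the directory name here is just an example) looks like this:
User-agent: *
Disallow: /private/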
As with robots.txt files, noindex tags will exclude a page from search results. The page will still be crawled, but it won’t be indexed. Use these tags when you want control at the individual page level.
An aside on the difference between crawling and indexing: crawling is how a search engine's spider discovers and fetches your website's pages; the results of the crawling go into the search engine's index. Storing this information in an index speeds up the return of relevant search results: instead of scanning every page related to a search, the index (a smaller database) is searched to optimize speed.
If there were no index, the search engine would look at every single bit of data or info in existence related to the search term, and we'd all have time to make and eat a couple of sandwiches while waiting for search results to display. The index relies on spiders to keep its database up to date.
Here is an example of the tag:
<meta name="robots" content="noindex,follow"/>
Now that you read and understand the above information, I think you are able to answer your question on your own ;)
Indeed, Googlebot used to accept a few extra rules in robots.txt:
Noindex
Nofollow
Crawl-delay
But as announced on the Google blog, these rarely used rules (found in only about 0.001% of robots.txt files) are no longer supported as of September 2019. So to be safe for the future, you should only use meta tags for these on your pages.

how to escape $ in robots.txt Disallow directive?

robots.txt treats the $ as a special character that marks the end of the pattern.
However, Googlebot is parsing some hrefs from JS templates within script tags, e.g.:
${object.name}
After encoding it, Googlebot tries to reach mySite.com/$%7Bobject.path%7D, which ends in 404s.
To work around this I want to disallow such URLs from being crawled by adding a matching directive to my robots.txt.
But using the $ "as is" doesn't work:
Disallow: /$%7Bobject.path%7D$
The only working solution I found was to use the wildcard character:
Disallow: /*%7Bobject.path%7D$
Though, I'm really curious if there is a way to escape that particular $ sign?
Thanks.
EDIT:
Well, after some more testing with the Google robots.txt testing tool I got some strange results. According to this tool, the directive:
Disallow: /*%7Bobject.path%7D$
won't work for /$%7Bobject.path%7D while other tools tells me it matches (like https://technicalseo.com/seo-tools/robots-txt/).
What works in google's testing tool is putting the brackets unencoded in the directive:
Disallow: /*{object.path}$
I can't make any sense of it, so I've put both versions in my robots.txt.
Googlebot and other crawlers support the $ but it is not part of the Robots Exclusion Protocol standard.
The standard does not include any escape character and Google's documentation doesn't mention it either.
Have you tried percent-encoding the dollar sign as well?
Disallow: /%24%7Bobject.path%7D$

how to block multiple links in robots.txt with one line?

I have many pages whose links are as follow:
http://site.com/school_flower/
http://site.com/school_rose/
http://site.com/school_pink/
etc.
I can't block them manually.
How could I block these kinds of pages, given that I have hundreds of links of the above type and don't want to write a separate line for each one?
You can't.
robots.txt is a very simple format, but you can create a tool that will generate the file for you. That should be fairly easy: if you have a list of URLs to be blocked, one per line, you just have to prepend Disallow: to each line.
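A rough sketch of such a generator in Python (the input file name urls.txt is just an assumption; any plain-text list of paths, one per line, will do):
# generate_robots.py - prints a robots.txt that disallows each listed path for all user agents
with open("urls.txt") as f:
    paths = [line.strip() for line in f if line.strip()]
print("User-agent: *")
for path in paths:
    print("Disallow: " + path)
Redirect the output into your robots.txt and you're done.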
That said, the fact that you want to block many URLs is a warning sign that you are probably doing something wrong. You could ask a question about your ultimate goal and we would give you a better solution.
Continuing from my comment:
User-agent: *
Disallow: /folder/
Of course you'll have to place all files you don't want robots to access under a single directory, unless you block the entire site by Disallow: /
In response to your comment, kirelagin has provided the correct answer.

Robots.txt Disallow Certain Folder Names

I want to disallow robots from crawling any folder named this-folder, at any position in the URL.
Examples to disallow:
http://mysite.com/this-folder/
http://mysite.com/houses/this-folder/
http://mysite.com/some-other/this-folder/
http://mysite.com/no-robots/this-folder/
This is my attempt:
Disallow: /.*this-folder/
Will this work?
Officially, globbing and regular expressions are not supported:
http://www.robotstxt.org/robotstxt.html
but apparently some search engines support simple * wildcards (though not full regular expressions like your .*).
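For crawlers that do honour the * wildcard (Googlebot, for instance), a sketch like the following should cover both a root-level and a nested this-folder; crawlers that only implement the original standard will read these values as literal path prefixes and simply never match the second one:
User-agent: *
Disallow: /this-folder/
Disallow: /*/this-folder/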

how to disallow all dynamic urls robots.txt [closed]

Closed. This question is off-topic and is not currently accepting answers.
Closed 10 years ago.
how to disallow all dynamic urls in robots.txt
Disallow: /?q=admin/
Disallow: /?q=aggregator/
Disallow: /?q=comment/reply/
Disallow: /?q=contact/
Disallow: /?q=logout/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/register/
Disallow: /?q=user/login/
I want to disallow everything that starts with /?q=
The answer to your question is to use
Disallow: /?q=
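In full, and assuming the rule should apply to every crawler, the file would simply be:
User-agent: *
Disallow: /?q=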
The best (currently accessible) source on robots.txt I could find is on Wikipedia. (The supposedly definitive source is http://www.robotstxt.org, but the site is down at the moment.)
According to the Wikipedia page, the standard defines just two fields: User-agent: and Disallow:. The Disallow: field does not allow explicit wildcards; rather, each "disallowed" path is actually a path prefix, i.e. it matches any path that starts with the specified value.
The Allow: field is a non-standard extension, and any support for explicit wildcards in Disallow would be a non-standard extension. If you use these, you have no right to expect that a (legitimate) web crawler will understand them.
This is not a matter of crawlers being "smart" or "dumb": it is all about standards compliance and interoperability. For example, any web crawler that did "smart" things with explicit wildcard characters in a "Disallow:" would be bad for (hypothetical) robots.txt files where those characters were intended to be interpreted literally.
As Paul said, a lot of robots.txt interpreters are not too bright and might not interpret wildcards in the path the way you intend.
That said, some crawlers try to skip dynamic pages on their own, worrying they might get caught in infinite loops on links with varying urls. I am assuming you are asking this question because you face a courageous crawler who is trying hard to access those dynamic paths.
If you have issues with specific crawlers, you can try to investigate specifically how that crawler works by searching its robots.txt capacity and specifying a specific robots.txt section for it.
If you generally just want to disallow such access to your dynamic pages, you might want to rethink your robots.txt design.
More often than not, pages that handle dynamic parameters live under a specific directory or a specific set of directories. This is why it is normally enough to simply Disallow: /cgi-bin or /app and be done with it.
In your case you seem to have mapped the root to an area that handles parameters. You might want to reverse the logic of robots.txt and say something like:
User-agent: *
Allow: /index.html
Allow: /offices
Allow: /static
Disallow: /
This way your Allow list will override your Disallow list by adding specifically what crawlers should index. Note that not all crawlers are created equal, and you may want to refine that robots.txt later, adding a specific section for any crawler that still misbehaves.