Noindex in a robots.txt

I've always stopped Google from indexing my website using a robots.txt file. Recently I read an article by a Google employee in which he stated you should do this using meta tags instead. Does this mean robots.txt won't work? Since I'm working with a CMS my options are very limited, and it's a lot easier to just use a robots.txt file. My question is: what's the worst that could happen if I proceed with a robots.txt file instead of meta tags?

Here's the difference in simple terms:
A robots.txt file controls crawling. It instructs robots (a.k.a. spiders) that are looking for pages to crawl to “keep out” of certain places. You place this file in your website’s root directory.
A noindex tag controls indexing. It tells spiders that the page should not be indexed. You place this tag in the code of the relevant web page.
Use the robots.txt file when you want control at the directory level or across your site. However, keep in mind that robots are not required to follow these directives. Most will, such as Googlebot, but it is safer to keep any highly sensitive information out of publicly-accessible areas of the site.
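For example, a file like this (the /private/ directory name is just a placeholder) asks every crawler to stay out of that directory:
User-agent: *
Disallow: /private/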
Unlike robots.txt rules, noindex tags will reliably exclude a page from search results. The page will still be crawled, but it won't be indexed. Use these tags when you want control at the individual page level.
An aside on the difference between crawling and indexing: crawling is how a search engine's spider discovers and reads the pages of your website; the results of the crawl go into the search engine's index. Storing this information in an index speeds up the return of relevant search results: instead of scanning every page related to a search, the engine queries the index (a smaller database).
If there were no index, the search engine would have to look at every single bit of data in existence related to the search term, and we'd all have time to make and eat a couple of sandwiches while waiting for the results to display. The search engine uses spiders to keep its index up to date.
Here is an example of the tag:
<meta name="robots" content="noindex,follow"/>
Now that you've read and understood the above, I think you can answer your question on your own ;)

Indeed, Googlebot used to honour the following undocumented directives in robots.txt:
Noindex
Nofollow
Crawl-delay
But as announced on Google's Webmaster Central blog, these directives (used by roughly 0.001% of sites) are no longer supported as of September 2019. So to be safe for the future, you should only use meta tags for this on your pages.

Related

How to get one robots.txt for each store

I have a Magento 2 website with two stores. At the moment, I can edit the global website and its content is applied to both stores.
What I want to do is replace that behaviour in order to get one robots.txt file per store.
But I really have no idea how I should do that.
Currently, if I go to the back office under Content > Design > Configuration > (Store Edit) > Search Engine Robots,
all the fields are disabled for the stores and can't be modified.
But if I go to the global Content > Design > Configuration > (Global Edit) > Search Engine Robots, I can of course modify them.
I also have 3 robots.txt files in my storage, but none of them seems to match the information saved in the global Search Engine Robots configuration:
src/robots.txt
src/app/design/frontend/theme1/jeunesse/robots.txt
src/app/design/frontend/theme2/jeunesse/robots.txt
I found these two links, but neither of them helped me: https://inchoo.net/online-marketing/editing-robots-txt-in-magento-2-admin/ and https://support.hypernode.com/knowledgebase/create-robots-txt-magento-2/
The first one says that if I have a robots.txt file in my storage it should override the configuration, but that doesn't seem to be the case: I do have robots.txt files, yet they aren't what is served when I go to website/robots.txt. I only get the one from the global configuration again.
The second one says that saving the configuration should write the robots.txt file to storage, but once again, that's not what is happening.
Thanks for your help; let me know if there are pieces of code I can show. I really don't know which ones would be relevant at this point.
I'm the author of the first link. It's a two-year-old article; Magento 2 has since introduced a few improvements to the built-in robots.txt functionality.
The robots.txt content you save under Content > Design > Configuration has "website" scope, meaning you can edit it at the website level; if you need it to vary through this configuration, you can only do so if you have multiple websites.
It is unclear from the question itself whether you have multiple websites or whether you have set up multiple stores and/or store views under the same website.
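If you do have separate websites, a quick way to check what is actually saved per scope is the config CLI. This is only a sketch, assuming Magento 2.2+ where the value lives under the design/search_engine_robots/custom_instructions path; replace base with your own website code:
bin/magento config:show design/search_engine_robots/custom_instructions
bin/magento config:show --scope=websites --scope-code=base design/search_engine_robots/custom_instructions
The first command shows the default (global) value, the second the value saved for one website; if the latter is empty, that website simply falls back to the default.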

Robots.txt - prevent indexing of .html files

I want to prevent indexing of *.html files on our site, so that only clean URLs are indexed.
So I would like www.example.com/en/login indexed but not www.example.com/en/login/index.html
Currently I have:
User-agent: *
Disallow: /
Disallow: /**.html - not working
Allow: /$
Allow: /*/login*
I know I can just disallow individual files, e.g. Disallow: /*/login/index.html, but my issue is that I have a number of these .html files that I do not want indexed, so I wondered whether there is a way to disallow them all instead of doing them individually?
First of all, you keep using the word "indexed", so I want to ensure that you're aware that the robots.txt convention is only about suggesting to automated crawlers that they avoid certain URLs on your domain, but pages listed in a robots.txt file can still show up on search engine indexes if they have other data about the page. For instance, Google explicitly states they will still index and list a URL, even if they're not allowed to crawl it. I just wanted you to be aware of that in case you are using the word "indexed" to mean "listed in a search engine" rather than "getting crawled by an automated program".
Secondly, there's no standard way to accomplish what you're asking for. Per "The Web Robots Pages":
Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "User-agent: bot", "Disallow: /tmp/*" or "Disallow: *.gif".
That being said, it's a common addition that many crawlers do support. For example, in Google's documentation of the directives they support, they describe pattern-matching support that does handle using * as a wildcard. So, you could add a Disallow: /*.html$ directive and then Google would not crawl URLs ending with .html, though they could still end up in search results.
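For crawlers that honour Google's wildcard and end-of-URL ($) extensions (they are not part of the original standard), a rule like the following blocks crawling only of URLs ending in .html while leaving the clean URLs crawlable:
User-agent: *
Disallow: /*.html$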
But, if your primary goal is telling search engines what URL you consider "clean" and preferred, then what you're actually looking for is specifying Canonical URLs. You can put a link rel="canonical" element on each page with your preferred URL for that page, and search engines that use that element will use it in order to determine which path to prefer when displaying that page.
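Using the URLs from the question, the page at www.example.com/en/login/index.html would carry something like this in its <head> (the absolute URL is only an illustration):
<link rel="canonical" href="https://www.example.com/en/login"/>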

Find all <forms> used on a site

Is there, for example, a crawler that can find (and list the form action etc. of) all pages that have forms on my site?
I'd like to log all pages with unique actions to then audit further.
Norconex HTTP Collector is an open source web crawler that can certainly help you. Its "Importer" module has a "TextBetweenTagger" feature to extract text between any start and end text and store it in a metadata field of your choice. You can then filter out those that have no such text extracted (look at the EmptyMetadataFilter option for this).
You can do this without writing code. As far as storing the results goes, the product uses "Committers". A few committers are readily available (including a filesystem one), but you may want to write your own to "commit" your crawled data wherever you like (e.g. in a database).
Check its configuration page for ideas.
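If you'd rather script something small yourself instead of configuring a full crawler, here is a rough sketch in Python (assuming the requests and beautifulsoup4 packages, a small site, and that www.example.com stands in for your start page):

import urllib.parse
from collections import deque

import requests
from bs4 import BeautifulSoup

START_URL = "https://www.example.com/"  # hypothetical start page

def crawl_forms(start_url, max_pages=200):
    """Breadth-first crawl of one host, printing each unique form action/method pair."""
    host = urllib.parse.urlparse(start_url).netloc
    seen_pages, seen_actions = {start_url}, set()
    queue = deque([start_url])
    while queue and len(seen_pages) <= max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        if "text/html" not in resp.headers.get("Content-Type", ""):
            continue
        soup = BeautifulSoup(resp.text, "html.parser")
        # Record every form's action (resolved to an absolute URL) and method.
        for form in soup.find_all("form"):
            action = urllib.parse.urljoin(url, form.get("action") or url)
            method = (form.get("method") or "get").upper()
            if (action, method) not in seen_actions:
                seen_actions.add((action, method))
                print(f"{url} -> {method} {action}")
        # Queue same-host links we haven't visited yet.
        for a in soup.find_all("a", href=True):
            link = urllib.parse.urljoin(url, a["href"]).split("#")[0]
            if urllib.parse.urlparse(link).netloc == host and link not in seen_pages:
                seen_pages.add(link)
                queue.append(link)

if __name__ == "__main__":
    crawl_forms(START_URL)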

How can I override robots.txt in a subfolder?

I have a sub-domain for testing purposes. I have set robots.txt to disallow this folder.
Some of the results are still showing for some reason. I thought it may be because I hadn't set up the robots.txt originally and Google hadn't removed some of them yet.
Now I'm worried that the robots.txt files within the individual joomla sites in this folder are causing Google to keep indexing them. Ideally I would like to stop that from happening because I don't want to have to remember to turn robots.txt back to follow when they go live (just in case).
Is there a way to override these explicitly with a robots.txt in a folder above this folder?
As far as a crawler is concerned, robots.txt exists only in the site's root directory. There is no concept of a hierarchy of robots.txt files.
So if you have http://example.com and http://foo.example.com, then you would need two different robots.txt files: one for example.com and one for foo.example.com. When Googlebot reads the robots.txt file for foo.example.com, it does not take into account the robots.txt for example.com.
When Googlebot is crawling example.com, it will not under any circumstances interpret the robots.txt file for foo.example.com. And when it's crawling foo.example.com, it will not interpret the robots.txt for example.com.
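As an illustration with those hostnames (the paths are just placeholders), each host serves its own file and neither affects the other:
# Served at http://example.com/robots.txt
User-agent: *
Disallow: /private/
# Served at http://foo.example.com/robots.txt
User-agent: *
Disallow: /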
Does that answer your question?
More info
When Googlebot crawls foo.com, it will read foo.com/robots.txt and use the rules in that file. It will not read and follow the rules in foo.com/portfolio/robots.txt or foo.com/portfolio/mydummysite.com/robots.txt. See the first two sentences of my original answer.
I don't fully understand what you're trying to prevent, probably because I don't fully understand your site hierarchy. But you can't change a crawler's behavior on mydummysite.com by changing the robots.txt file at foo.com/robots.txt or foo.com/portfolio/robots.txt.

Where to store the complete URL in a CMS?

I'm creating a CMS and have not yet settled on where to store the complete URL for a given page in the structure.
Every page has a slug (a URL-friendly name for the page), a nullable parent (null for top-level pages), and children.
Where do I store the complete URL (/first-page/sub-page) for a given page? Should this go in the database along with the other properties of the page, or in some cache?
Update
It's not the database design I'm asking about, but rather where to store the complete URL to a given page so I don't need to traverse the entire URL to find the page the user requested (/first-page/sub-page).
Update 2
I need to find which page belongs to the currently requested URL. If the requested URL is /first-page/sub-page, I don't want to split the URL and loop through the database (obviously).
I'd rather have the entire URL in the table so that I can just do a single query (WHERE url = '/first-page/sub-page'), but this does not seem ideal: what if I change the slug of the parent page? Then I also need to update the URL field for all descendants.
How do other people solve this issue? Are they putting it in the database? In a cache that maps /first-page/sub-page to the id of the page? Or are they splitting the requested URL and looping through the database?
Thanks
Anders
Store it in a cache, because the web servers will need to look up URLs constantly. Unless you expect the URLs of pages to change very rapidly, caching will greatly reduce load on the database, which is usually your bottleneck in database-driven web sites.
Basically, you want a dictionary that maps URL -> whatever you need to render the page. Many web servers will automatically use the operating system's file system as that dictionary and will often have a built-in cache that can recognize when a file changes in the file system. This would probably be much more efficient than anything you can write in your CMS. It might be better, therefore, to have your CMS implement the structure directly in the file system and handle additional mapping with hard or soft links.
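A minimal sketch of that dictionary idea in Python; load_all_pages is a hypothetical stand-in for whatever query your CMS uses to fetch (id, parent_id, slug) rows:

class UrlCache:
    def __init__(self, load_all_pages):
        # load_all_pages() -> iterable of (page_id, parent_id, slug) rows
        self._load = load_all_pages
        self._by_url = {}
        self.rebuild()

    def rebuild(self):
        # Rebuild the whole map whenever a slug or parent changes.
        pages = {pid: (parent, slug) for pid, parent, slug in self._load()}

        def full_path(pid):
            parent, slug = pages[pid]
            return slug if parent is None else full_path(parent) + "/" + slug

        self._by_url = {"/" + full_path(pid): pid for pid in pages}

    def resolve(self, path):
        # O(1) lookup of the requested URL, e.g. "/first-page/sub-page"
        return self._by_url.get(path)

cache = UrlCache(lambda: [(1, None, "first-page"), (2, 1, "sub-page")])
assert cache.resolve("/first-page/sub-page") == 2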
I just did this for MvcCms. I went with the idea of content categories/subcategories and content pages. When a content category or subcategory is created, I recursively walk up through the parents, build the entire route, and store it in the category table. Then, when a page is requested, I can find the correct content page, and while building a nav structure I can tell whether the nav item being built is the current or active route.
This approach requires some rules about what happens when a category is edited. The approach right now is that once the full path is set for a subcategory, it can't be changed later with the normal tools.
The source is at mvccms.codeplex.com.
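A rough sketch of that write-side logic, using SQLite and illustrative table and column names (a pages table with id, parent_id, slug, url), not MvcCms's actual schema:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (id INTEGER PRIMARY KEY, parent_id INTEGER, slug TEXT, url TEXT)")
conn.executemany("INSERT INTO pages (id, parent_id, slug) VALUES (?, ?, ?)",
                 [(1, None, "first-page"), (2, 1, "sub-page")])

def build_route(conn, page_id):
    # Recursively assemble the full route from the parent chain.
    parent_id, slug = conn.execute(
        "SELECT parent_id, slug FROM pages WHERE id = ?", (page_id,)).fetchone()
    return "/" + slug if parent_id is None else build_route(conn, parent_id) + "/" + slug

def set_slug(conn, page_id, new_slug):
    # Change a slug, then refresh the stored route for the page and all of its descendants.
    conn.execute("UPDATE pages SET slug = ? WHERE id = ?", (new_slug, page_id))
    stack = [page_id]
    while stack:
        pid = stack.pop()
        conn.execute("UPDATE pages SET url = ? WHERE id = ?", (build_route(conn, pid), pid))
        stack.extend(row[0] for row in conn.execute(
            "SELECT id FROM pages WHERE parent_id = ?", (pid,)))
    conn.commit()

set_slug(conn, 1, "start-page")
print(conn.execute("SELECT url FROM pages WHERE id = 2").fetchone()[0])  # /start-page/sub-page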
