How can I override robots.txt in a subfolder?

I have a subdomain that I use for testing purposes. I have set up robots.txt to disallow this folder.
Some of the results are still showing for some reason. I thought it might be because I hadn't set up robots.txt originally and Google hadn't removed some of the results yet.
Now I'm worried that the robots.txt files within the individual Joomla sites in this folder are causing Google to keep indexing them. Ideally I would like to stop that from happening, because I don't want to have to remember to switch robots.txt back to allow crawling when the sites go live (just in case).
Is there a way to override these explicitly with a robots.txt in a folder above this folder?

As far as a crawler is concerned, robots.txt exists only in the site's root directory. There is no concept of a hierarchy of robots.txt files.
So if you have http://example.com and http://foo.example.com, then you would need two different robots.txt files: one for example.com and one for foo.example.com. When Googlebot reads the robots.txt file for foo.example.com, it does not take into account the robots.txt for example.com.
When crawling example.com, Googlebot will not under any circumstances interpret the robots.txt file for foo.example.com, and vice versa.
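For illustration, that means two separate files, one at each host's root (hostnames from the example above; the rules themselves are just placeholders):
# Served at http://example.com/robots.txt (applies only to example.com)
User-agent: *
Disallow: /testing/
# Served at http://foo.example.com/robots.txt (applies only to foo.example.com)
User-agent: *
Disallow: /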
Does that answer your question?
More info
When Googlebot crawls foo.com, it will read foo.com/robots.txt and use the rules in that file. It will not read and follow the rules in foo.com/portfolio/robots.txt or foo.com/portfolio/mydummysite.com/robots.txt. See the first two sentences of my original answer.
I don't fully understand what you're trying to prevent, probably because I don't fully understand your site hierarchy. But you can't change a crawler's behavior on mydummysite.com by changing the robots.txt file at foo.com/robots.txt or foo.com/portfolio/robots.txt.

Related

Noindex in a robots.txt

I've always stopped Google from indexing my website using a robots.txt file. Recently I read an article by a Google employee in which he stated you should do this using meta tags. Does this mean robots.txt won't work? Since I'm working with a CMS my options are very limited, and it's a lot easier to just use a robots.txt file. My question is: what's the worst that could happen if I proceed using a robots.txt file instead of meta tags?
Here's the difference in simple terms:
A robots.txt file controls crawling. It instructs robots (a.k.a. spiders) that are looking for pages to crawl to “keep out” of certain places. You place this file in your website’s root directory.
A noindex tag controls indexing. It tells spiders that the page should not be indexed. You place this tag in the code of the relevant web page.
Use the robots.txt file when you want control at the directory level or across your site. However, keep in mind that robots are not required to follow these directives. Most will, such as Googlebot, but it is safer to keep any highly sensitive information out of publicly-accessible areas of the site.
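For example, a directory-level rule in a robots.txt file looks like this (the /private/ directory is purely illustrative):
User-agent: *
Disallow: /private/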
Noindex tags also keep pages out of search results. The page will still be crawled, but it won't be indexed. Use these tags when you want control at the individual page level.
An aside on the difference between crawling and indexing: crawling is how a search engine's spiders discover and read your website; the results of the crawl go into the search engine's index. Storing this information in an index speeds up the return of relevant search results: instead of scanning every page related to a search, the engine queries the index (a much smaller database).
If there were no index, the search engine would have to look at every single bit of data in existence related to the search term, and we'd all have time to make and eat a couple of sandwiches while waiting for the results to display. Search engines use spiders to keep the index up to date.
Here is an example of the tag:
<meta name="robots" content="noindex,follow"/>
Now that you've read and understood the above information, I think you are able to answer your question on your own ;)
Indeed, Googlebot used to accept a few unofficial directives in robots.txt:
Noindex
Nofollow
Crawl-delay
But as announced on the Google Webmaster Central Blog, these rules (used on only about 0.001% of sites) are no longer supported as of September 2019. So to be safe for the future, you should only use meta tags for these on your pages.

Robots.txt for application

Is it possible for an application within a website to have its own robots.txt file?
For example, I have a site running under http://www.example.com and this has its robots.txt file.
We then have a separate site running as an application under this domain: http://www.example.com/website-app
Is it possible to keep the robots.txt file separate for the application, or do I need to put all the rules for the application into the main root robots.txt?
The robots.txt file needs to reside at /robots.txt; there is no way to tell a crawler that it can be found anywhere else (as there is for favicons, for example). So if you can, you should add the application's rules to your root robots.txt (or put the application on a subdomain instead, where it can have its own file).
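For illustration, if you wanted the whole application blocked, the root robots.txt could simply contain (path taken from the question above):
User-agent: *
Disallow: /website-app/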
If you want to control specific pages individually, you can use <meta> tags instead, as described at robotstxt.org. Since the tag has to be put on every page, the crawler will visit (but not index) at least one page, though it won't follow links on to other pages (unless you tell it to). For a small application in a subdirectory this might be an acceptable solution.
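A sketch of the tag you would put on each page of the application (noindex,nofollow matches the behavior described above):
<meta name="robots" content="noindex,nofollow">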

Where to place the robots.txt file with G-WAN?

I want to disallow robots from crawling the csp folder and plan to use the following robots.txt file:
User-agent: *
Disallow: /csp
So my question is twofold:
Is the syntax correct for G-WAN?
With G-WAN, where should I place this file?
The well-documented robots.txt file should be placed in the /www G-WAN folder, if you want to use this feature. robots.txt is only a hint for robots; many of them do not respect your wishes (so it is much safer to define file-system permissions, or to use an index.html file in the folders that you don't want browsed).
The /csp directory cannot be crawled by any HTTP client (including robots). Only the /www directory can.
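As a rough illustration (the listener and host directory names depend on your setup), a G-WAN tree looks something like this:
0.0.0.0_8080/           the listener
    #0.0.0.0/           the host
        www/            served to clients; place robots.txt here
        csp/            C script servlets; never served as files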
This separation has worked pretty well in terms of simplicity, design and security so far, avoiding the pitfall of deciding what is executable and what is the presentation layer.

Multiple 301 redirects in one line

This is probably easy for people who deal with these regularly, but I'm not sure what kind of code I need to achieve what I want. I know how to redirect individual URLs to other URLs, but when it comes to redirecting multiple at once I can't do it.
Basically I set up my site structure kinda bad when I built my website. I have a bunch of URLs named:
crafting-alchemist-level-1-10.php
all in the root directory, where alchemist-level-1-10 is the page name and crafting is the site section. I have about 50 of these URLs, and I would like to put them all in a /crafting directory with the crafting- prefix cut off the file names.
I could do this individually, but there must be a way to do them all with a single line. Is there?
These redirects also need to preserve any query-string parameters after the .php.
Use mod_rewrite in your .htaccess
RewriteEngine On
# Internally rewrite requests for /one/two/three to /one-two-three.php
RewriteRule ^(.*)/(.*)/(.*)$ $1-$2-$3.php [L]
For more information (you will need to customize it a bit):
http://httpd.apache.org/docs/current/rewrite/intro.html#regex
EDIT
This will rewrite one/two/three to one-two-three.php. Note that this is an internal rewrite, not a 301 redirect.
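For the scenario in the question, a sketch of an actual 301 rule, assuming the files have already been moved into /crafting/ and renamed without the crafting- prefix:
RewriteEngine On
# Permanently redirect /crafting-<name>.php to /crafting/<name>.php.
# The query string is preserved automatically.
RewriteRule ^crafting-(.+)\.php$ /crafting/$1.php [R=301,L]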

How to add RESTful type routes in Jekyll

The root of the site, http://example.com, correctly resolves to index.html and renders it. In a similar manner, I want http://example.com/foo to fetch foo.html, present in the root of the directory. A site that uses this functionality is www.zachholman.com. I've seen his code on GitHub, but I still can't work out how it is done. Please help.
This feature is actually available in Jekyll. Just add the following line to your _config.yml:
permalink: pretty
This will enable links to posts and pages without .html extension, e.g.
/about/ instead of /about.html
/YYYY/MM/DD/my-first-post/ instead of YYYY-MM-DD-my-first-post.html
However, you lose the ability to customize permalinks... and the trailing slash is pretty ugly.
Edit: The trailing slash seems to be there by design
It's actually the server that needs adjusting, not Jekyll. By default, Jekyll is going to produce files with .html extensions. There may be a way around that, but it's unlikely that you really want to go that route. Instead, you need to let your web server know that you want those files served when a URL is requested with the file's basename (and no extension).
If your site is served by an Apache web server, you can enable the "MultiViews" option. In most cases, you can do that by creating an .htaccess file at your site root with the following line:
Options +MultiViews
With this option enabled, when Apache receives a request for:
http://example.com/foo
It will serve the file:
/foo.html
Note that the Apache server must be set up to allow the option to be set in the .htaccess file. If not, you would need to do it in the Apache config file itself. If your site is hosted on another web server, you'll need to look for an equivalent setting.
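If you do have access to the Apache configuration, a sketch of the equivalent setting (the directory path is illustrative):
<Directory /var/www/example.com>
    Options +MultiViews
    AllowOverride Options
</Directory>
The AllowOverride Options line is what permits an .htaccess file to set Options in the first place.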