Robots.txt for application

Is it possible for an application within a website to have its own robots.txt file?
For example, I have a site running under http://www.example.com and this has its robots.txt file.
We then have a separate site running as an application under this domain: http://www.example.com/website-app
Is it possible to keep a separate robots.txt file for the application, or do I need to put all of the application's rules into the main root robots.txt?

The robots.txt file must reside at /robots.txt; there is no way to tell a crawler that it can be found anywhere else (unlike a favicon, for example, which can be declared at an arbitrary location). So if you can, you should add the application's rules to your root robots.txt (or put your application on a subdomain instead, where it can have its own file).
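For illustration, the combined root robots.txt might look like this (the /website-app path comes from the question; the rules themselves are only an assumption about what you want to block):
User-agent: *
Disallow: /website-app/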
If you want to control specific pages individually, you can use <meta> tags instead, as described at robotstxt.org. Since the tag has to be placed on every page, the crawler will still visit (but not index) at least one page, though it won't follow links to further pages (unless you tell it to). For a small application in a subdirectory this might be an acceptable solution.
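For example, the standard robots <meta> tag that blocks both indexing and link-following on a given page is:
<meta name="robots" content="noindex, nofollow">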

Related

IIS8 use web.config to rewrite URL

I have a website that lives within a folder one level off the root of the domain. This was done because the domain used to host multiple web applications, but the other application has been retired and the domain is now used for just this site. We want to move it out of the folder and into the root of the domain.
Current: website.com/main/page.php
Want: website.com/page.php
The issue is all the links out there that point to the old location. I would like to have a .config file that lives in the old directory and redirects to the new location by simply removing "main" from the URL. What is the best way to go about doing this?
One way of doing this is by using an HTTP redirect. This method is explained in this video: https://www.youtube.com/watch?v=wC3kJnhlofw
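As a concrete sketch, a rule in the root web.config could issue the redirects; this is untested and assumes the IIS URL Rewrite module is installed (the rule name is arbitrary):
<configuration>
  <system.webServer>
    <rewrite>
      <rules>
        <!-- Permanently redirect /main/anything to /anything -->
        <rule name="RemoveMainFolder" stopProcessing="true">
          <match url="^main/(.*)$" />
          <action type="Redirect" url="/{R:1}" redirectType="Permanent" />
        </rule>
      </rules>
    </rewrite>
  </system.webServer>
</configuration>
Note this lives in the web.config at the site root rather than in the old directory, since the rule has to match incoming /main/... requests.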

DNN - Redirecting specific file types

I've taken on the webmaster role for a website that uses DNN version 07.02.02. Most of the links to my PDF files are broken. The PDFs were in a folder called "/pdfs"; now they're in a new folder, "/docs/pdfs".
A few quick things:
I only have ftp access to the web site files. No access to web.config so rewrite rules are out.
I don't want to copy the old files back to "/pdfs" because it would mean managing two different pdf copies (there are over 500 pdfs).
Creating directories with a .pdf extension and then adding an index.asp file with a redirect (e.g. "/pdfs/file_1001.pdf/index.asp") led to an error page, because an override prevents site directory pages from being exposed.
Using a DNN module where I'd have to enter 500 file redirects seems excessive when I only want to move a directory.
Any solutions to try?
In DNN if you have HOST level access you can modify Config files through the Host/Configuration manager page.
There you could modify the web.config file.
You might also look at the siteurls.config file (also accessible there), in which you can define URL rules; it might be as easy as adding a rule like this inside the existing <Rules> element:
<RewriterRule>
  <LookFor>/pdfs/(.*)</LookFor>
  <SendTo>/docs/pdfs/$1</SendTo>
</RewriterRule>
The above rule is completely untested, not positive if it will do what you need or not.
I did a little more testing, and it looks like this won't work out of the box: there appears to be a default setting that tells DNN NOT to rewrite PDF files, but I can't find the relevant source code at the moment.

how can I override robots in a sub folder

I have a sub-domain for testing purposes, and I have set robots.txt to disallow this folder.
Some of the results are still showing for some reason. I thought it might be because I hadn't set up the robots.txt originally and Google hadn't removed some of them yet.
Now I'm worried that the robots.txt files within the individual Joomla sites in this folder are causing Google to keep indexing them. Ideally I would like to stop that from happening, because I don't want to have to remember to switch robots.txt back to allow crawling when the sites go live (just in case).
Is there a way to override these explicitly with a robots.txt in a folder above this folder?
As far as a crawler is concerned, robots.txt exists only in the site's root directory. There is no concept of a hierarchy of robots.txt files.
So if you have http://example.com and http://foo.example.com, you need two different robots.txt files: one for example.com and one for foo.example.com. When Googlebot crawls example.com, it will not under any circumstances consult the robots.txt file for foo.example.com; likewise, when it crawls foo.example.com, it will not consult the robots.txt for example.com.
Does that answer your question?
More info
When Googlebot crawls foo.com, it will read foo.com/robots.txt and use the rules in that file. It will not read and follow the rules in foo.com/portfolio/robots.txt or foo.com/portfolio/mydummysite.com/robots.txt. See the first two sentences of my original answer.
I don't fully understand what you're trying to prevent, probably because I don't fully understand your site hierarchy. But you can't change a crawler's behavior on mydummysite.com by changing the robots.txt file at foo.com/robots.txt or foo.com/portfolio/robots.txt.
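For completeness: if mydummysite.com is served as its own hostname, the only file that matters for it is the robots.txt at its own root. To block crawling of that host entirely, it would contain the standard disallow-all rules:
User-agent: *
Disallow: /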

Multiple 301 redirects in one line

This is probably easy for people who deal with these regularly, but I'm not sure what kind of code I need to achieve what I want. I know how to redirect individual URLs to other URLs, but I haven't managed to redirect multiple at once.
Basically, I set up my site structure kind of badly when I built my website. I have a bunch of URLs named:
crafting-alchemist-level-1-10.php
all in the root directory, where alchemist-level-1-10 is the page name and crafting is the site section. I have about 50 of these URLs and I would like to put them all in a /crafting directory with the crafting- cut off the file names.
I could do this individually, but there must be a way to do them all with a single line. Is there?
These URL redirects need to be compatible with any parameters after the .php too.
Use mod_rewrite in your .htaccess
RewriteEngine On
RewriteRule ^(.*)/(.*)/(.*)$ $1-$2-$3.php
For more information (you will need to customize it a bit):
http://httpd.apache.org/docs/current/rewrite/intro.html#regex
EDIT
This will rewrite one/two/three to one-two-three.php.
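Note that this rule maps directory-style URLs back onto the old flat file names (an internal rewrite). If the goal is the reverse, i.e. 301-redirecting the old flat URLs to files moved into a /crafting directory as the question asks, a sketch along these lines might work (untested):
RewriteEngine On
# Permanently redirect crafting-<page>.php to /crafting/<page>.php;
# any query string is passed through to the target automatically.
RewriteRule ^crafting-(.+)\.php$ /crafting/$1.php [R=301,L]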

How to add RESTful type routes in Jekyll

The root of the site http://example.com correctly identifies index.html and renders it. In a similar manner, I want http://example.com/foo to fetch foo.html present in the root of the directory. A site that uses this functionality is www.zachholman.com. I've seen his code on GitHub, but I'm still not able to find how it is done. Please help.
This feature is actually available in Jekyll. Just add the following line to your _config.yml:
permalink: pretty
This will enable links to posts and pages without the .html extension, e.g.
/about/ instead of /about.html
/YYYY/MM/DD/my-first-post/ instead of YYYY-MM-DD-my-first-post.html
However, you lose the ability to customize permalinks... and the trailing slash is pretty ugly.
Edit: The trailing slash seems to be there by design
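For reference, per Jekyll's documentation, pretty is shorthand for the following permalink template, so a custom template can start from this and adjust it:
permalink: /:categories/:year/:month/:day/:title/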
It's actually the server that needs adjusting, not Jekyll. By default, Jekyll produces files with .html extensions. There may be a way around that, but it's unlikely that you really want to go that route. Instead, you need to let your web server know that you want those files served when a URL is requested with the file's basename (and no extension).
If your site is served by an Apache web server, you can enable the "MultiViews" option. In most cases, you can do that by creating an .htaccess file at your site root with the following line:
Options +MultiViews
With this option enabled, when Apache receives a request for:
http://example.com/foo
It will serve the file:
/foo.html
Note that the Apache server must be set up to allow this option to be set in an .htaccess file. If not, you would need to enable it in the Apache config itself. If your site is hosted on another web server, you'll need to look for an equivalent setting.
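On nginx, for instance (not mentioned in the thread; shown purely as an illustrative equivalent), the usual approach is a try_files directive in the server block:
location / {
    # Try the exact path, then the same path with .html appended, then a directory index
    try_files $uri $uri.html $uri/ =404;
}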