Can I use a `robots.txt` file for a subdirectory on my school's domain?

I own some webspace that is registered with a university. Unfortunately, Google has found my CV (resume) on the site but has mis-indexed it as a scholarly publication, which is screwing up things like citation counts on Google Scholar. I tried uploading a robots.txt into my own subdirectory. The problem is that Google ignores this file and instead uses the rules listed for the school's domain.
That is, the URL looks like
www.someschool.edu/~myusername/mycv.pdf
I have uploaded a robots.txt, which can be found here
www.someschool.edu/~myusername/robots.txt
And Google is ignoring it and instead using the robots.txt for the school's domain
www.someschool.edu/robots.txt
How can I make Googlebot ignore my CV?

Sadly, crawlers only ever look for robots.txt at the root of a host, i.e. whatever is returned for GET /robots.txt, so a robots.txt in your subdirectory is never read.
What you can do instead, if you are allowed to use custom .htaccess files, is send the X-Robots-Tag HTTP header for that file. Google's documentation on X-Robots-Tag describes the supported values.
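A minimal sketch of such an .htaccess file, assuming the server is Apache with mod_headers enabled and that the university's configuration lets per-user .htaccess files set headers (AllowOverride FileInfo). None of this is confirmed by the question. The file would live in the same directory as mycv.pdf:

```apache
# ~myusername/.htaccess  (assumed location: the directory that holds mycv.pdf)
# Requires mod_headers; the IfModule guard avoids a 500 error if it is missing.
<IfModule mod_headers.c>
  # Apply the header only to the CV, not to everything in the directory.
  <Files "mycv.pdf">
    # Ask crawlers not to index this file.
    Header set X-Robots-Tag "noindex"
  </Files>
</IfModule>
```

To check the result, request the file's headers (for example with curl -I) and look for the X-Robots-Tag line. The crawler has to be able to fetch the file to see the header, so the school's robots.txt must not disallow that URL.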

Related

How to embed a custom url path to a file in Google sites

I have a Google Sites website with a custom domain, say www.mysite.net. I want to put a dataset file (say file.csv) on the site so that it can be downloaded from a link such as www.mysite.net/datasets/file.csv, but I do not know how to do that. I can insert a file from Drive into the site, but I want the download to have a custom URL like the one above. How can this be done? Can it be done from within the Google Sites domain, or do I have to do something special?
Thanks!

Apache2: Redirect from a custom link to a html file

This is quite a simple question: I would like to create a redirection for my website using Apache2 and virtual hosts.
It should happen when a user visits this link: https://somewebsite.com/test
That link should serve an HTML file in my website directory (for example test.html), with the address still showing https://somewebsite.com/test
I think I should use mod_rewrite or aliases, but I don't know how to use them.
Thanks for your time!

How to set robots.txt files for subdomains?

I have a subdomain, e.g. blog.example.com, and I don't want it to be indexed by Google or any other search engine. I put my robots.txt file in the 'blog' folder on the server with the following configuration:
User-agent: *
Disallow: /
Is this enough to keep Google from indexing it?
A few days ago a site:blog.example.com search showed 931 links, but now it shows 1320 pages. If my robots.txt file is correct, why is Google still indexing my subdomain?
If I am doing anything wrong, please correct me.
Rahul,
Not sure if your robots.txt is verbatim, but generally the directives are on TWO lines:
User-agent: *
Disallow: /
This file must be accessible from http://blog.example.com/robots.txt - if it is not accessible from that URL, the search engine spider will not find it.
If you have pages that have already been indexed by Google, you can also try using Google Webmaster Tools to manually remove pages from the index.
This question is actually about how to prevent indexing of a subdomain. Here, your robots.txt file is actually preventing your site from being noindexed.
"Don’t use a robots.txt file as a means to hide your web pages from Google search results."
Source: "Introduction to robots.txt: What is a robots.txt file used for?", Google Search Central Documentation
"For the noindex directive to be effective, the page or resource must not be blocked by a robots.txt file, and it has to be otherwise accessible to the crawler. If the page is blocked by a robots.txt file or the crawler can’t access the page, the crawler will never see the noindex directive, and the page can still appear in search results, for example if other pages link to it."
Source: "Block Search indexing with noindex", Google Search Central Documentation
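Following the quoted guidance, one possible approach is to stop disallowing crawling in robots.txt and send a noindex header with every response from the subdomain instead. This is only a sketch, assuming blog.example.com is served by Apache with mod_headers enabled and that .htaccess overrides are allowed, which the question does not state:

```apache
# .htaccess in the document root of blog.example.com
# (assumed to be the 'blog' folder mentioned in the question)
<IfModule mod_headers.c>
  # Sent with every response served from this directory tree.
  Header set X-Robots-Tag "noindex"
</IfModule>
```

For this to work, the `Disallow: /` rule has to be removed. Otherwise the crawler never requests the pages and never sees the header.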

How come when I block a directory in robots.txt, its contents are still coming up?

This is what I've got in my robots.txt, placed in the base directory, of course:
User-Agent: *
Disallow: /foo/
But then, in Google, I have no index of /foo/, but for some reason, I still have /foo/foo.php showing up as a link in Google.
How come? Did I write something incorrectly? Do I need to write something else?
If you added robots.txt after your site went live, Google may already have indexed files under /foo/.
You can remove already indexed files with a removal request in Google Webmaster Tools.
robots.txt does not prevent Google from linking to your blocked pages. Google won't crawl your blocked pages (so it can't show the page title/description/snippet), but if it finds a link to a blocked page, it might still list that URL in its search results.
If you also want to forbid this linking, you could use the meta element with robots and noindex, i.e. <meta name="robots" content="noindex">. Note that Google has to crawl the page to see this element, so the page must then not be blocked in robots.txt.

How to redirect specifically for one directory?

I'm trying to redirect the forum link on my website
example.com/forum to example.net/forum
Most code I find is either for a single page or for an entire site. I'm trying to figure out how to do it specifically for ONE directory.
You want to tell browsers that your forum has moved permanently. This is a "301 redirect".
Assuming your server is Apache, the simplest option is to use an .htaccess file:
http://kb.mediatemple.net/questions/242/How+do+I+redirect+my+site+using+a+.htaccess+file%3F
Here is a list of other methods to do the same:
http://www.webconfs.com/how-to-redirect-a-webpage.php
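For reference, a minimal sketch of the .htaccess approach. It assumes Apache with mod_alias enabled, the file placed in the document root of example.com, and that the new forum is served over plain http (the question does not give a scheme):

```apache
# .htaccess in the document root of example.com
# Redirect (mod_alias) matches /forum and everything beneath it, and appends
# the rest of the path, so /forum/topic/1 goes to http://example.net/forum/topic/1.
Redirect 301 /forum http://example.net/forum
```

Because the rule only matches the /forum path segment, the rest of example.com is left untouched.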