robots.txt file - robots.txt

I have a URL that I would like to stop search engines from visiting.
Is the following acceptable:
User-agent: *
Disallow: https://mysite.com/
or do I need to put something more like:
User-agent: *
Disallow: https://mysite.com/index.aspx
or would I just put:
User-agent: *
Disallow: /index.aspx

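The last one is enough. You don't need the whole URL: a Disallow value is a path, not a URL, so crawlers generally will not match the first two versions at all. Hard-coding the domain would also force you to edit the file whenever you move your robots.txt to another website, and that's not what you want. The path does need to start with a /, which stands for the root of your domain.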
Or you can do this:
Disallow ALL
User-agent: *
Disallow: /
Disallow 1 page
User-agent: *
Disallow: /index.php
Disallow 1 directory
User-agent: *
Disallow: /dirname/
Disallow 2 pages and 2 directories
User-agent: *
Disallow: /index.php
Disallow: /subpage.php
Disallow: /dirname/
Disallow: /otherdirname/
Allow 1 page (only index.php)
User-agent: *
Allow: /index.php
Disallow: /
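One way to sanity-check rules like these before deploying them is Python's standard-library urllib.robotparser. This is only a minimal sketch: the helper name allowed, the agent name ExampleBot, and the sample paths are made up for illustration, and this parser does plain prefix matching, so real crawlers may treat edge cases differently.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, path, agent="ExampleBot"):
    """Parse robots.txt text and report whether `agent` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


path_rule = "User-agent: *\nDisallow: /index.aspx\n"
full_url_rule = "User-agent: *\nDisallow: https://mysite.com/index.aspx\n"
directory_rule = "User-agent: *\nDisallow: /dirname/\n"

# A rule that starts with / is compared against the start of the requested path.
print(allowed(path_rule, "/index.aspx"))
# A full URL is not a prefix of any path, so with this parser it blocks nothing.
print(allowed(full_url_rule, "/index.aspx"))
# A directory rule covers everything underneath that directory and nothing else.
print(allowed(directory_rule, "/dirname/page.html"))
print(allowed(directory_rule, "/other/page.html"))
```

Only the path-style rules change the answer for /index.aspx and /dirname/page.html; the full-URL rule leaves the page crawlable.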

Related

Write the correct robots txt file

I have a webshop on my test domain that no one knows about. I log every search made on the site to a search SQL table, and searches for the same words keep showing up. Maybe a robot?
How can I fix this? What should I write in the robots.txt? Which folders or links should I disallow in the file?
My robots.txt looks like:
User-agent: *
Disallow: /cms
Sitemap: http://www.my-domain.hu/sitemap.xml
Host: www.my-domain.hu
Sorry for my bad English, I hope you understand what I wrote. :)
Update:
And what about this robots.txt file? Is it correct? What is MJ12bot?
User-agent: *
Disallow: /admin/
Disallow: /index.php?route=checkout*
Disallow: /cache/*/block/
Disallow: /custom/*/cache/block/
Disallow: /cib.php
Disallow: /cib_facebook.php
Disallow: /index.php?route=product/relatedproducts/
Disallow: /index.php?route=product/similar_products/
Disallow: /index.php?route=module/upsale/
Disallow: /.well-known/
Allow: /
Sitemap: http://mydomain.hu/sitemap.xml
User-Agent: MJ12bot
Disallow: /
To exclude all robots from accessing anything under the root
User-agent: *
Disallow: /
To allow all crawlers complete access
User-agent: *
Disallow:
Alternatively, you can skip creating a robots.txt file, or create one with empty content.
To exclude a single robot
User-agent: Googlebot
Disallow: /
This will disallow Google’s crawler from the entire website.
To allow just Google crawler
User-agent: Googlebot
Disallow:
User-agent: *
Disallow: /
That is the basic idea. Does that work for you?
Source: https://www.wst.space/allow-disallow-robots-txt/
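The per-crawler groups above can be tried out with Python's standard-library urllib.robotparser. This is a rough sketch rather than a statement of how any particular search engine evaluates the file; the agent name SomeOtherBot and the path /products/ are invented for the example.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot matches its own group, whose empty Disallow permits everything.
print(parser.can_fetch("Googlebot", "/products/"))
# Any other agent (the name here is made up) falls back to the * group.
print(parser.can_fetch("SomeOtherBot", "/products/"))
```

A crawler uses the group that names it and only falls back to the * group when no group matches, which is why the two groups do not conflict.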

Website Description not showing in Google search engine

My website's description is not showing in Google search results.
I wrote the description meta tag inside the head tag,
and my robots.txt file is as below:
User-agent: *
Disallow: /
When I search on Google I get the message below:
A description for this result is not available because of this site's robots.txt – learn more.
User-agent: *
Disallow: /
You're blocking all robots from crawling your website.
Simply remove that line to enable robots to crawl your site, or use:
User-agent: *
Disallow:
robots.txt examples:
To exclude all robots from the entire server (what you have!)
User-agent: *
Disallow: /
To allow all robots complete access (or just create an empty "/robots.txt" file, or don't use one at all)
User-agent: *
Disallow:
To exclude all robots from part of the server
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
To exclude a single robot
User-agent: BadBot
Disallow: /
To allow a single robot
User-agent: Googlebot
Disallow:
User-agent: *
Disallow: /
Learn more about robots.txt at http://www.robotstxt.org
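The difference between the blocking file and the two permissive alternatives can be checked locally. The sketch below uses Python's standard-library urllib.robotparser; the function name can_crawl, the agent name ExampleBot, and the path /index.html are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt, path="/"):
    """Return True if a generic crawler may fetch `path` under this robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("ExampleBot", path)


block_all = "User-agent: *\nDisallow: /\n"   # what the question uses
allow_all = "User-agent: *\nDisallow:\n"     # empty Disallow value
empty_file = ""                              # robots.txt with no content

for name, text in [("block_all", block_all),
                   ("allow_all", allow_all),
                   ("empty_file", empty_file)]:
    print(name, can_crawl(text, "/index.html"))
```

Only the first variant stops the crawler from reading the page, and a crawler that cannot read the page cannot see its description meta tag either.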

Will a robots.txt file with "disallow / " stop all crawling of my website?

I know the following will stop all bots from crawling my site
User-agent: *
Disallow: /
But what about something like this:
User-agent: *
Crawl-delay: 10
# Directories
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /profiles/
Disallow: /scripts/
Disallow: /themes/
# Files
Disallow: /CHANGELOG.txt
Disallow: /cron.php
Disallow: /INSTALL.mysql.txt
Disallow: /INSTALL.pgsql.txt
Disallow: /INSTALL.sqlite.txt
Disallow: /install.php
Disallow: /INSTALL.txt
Disallow: /LICENSE.txt
Disallow: /MAINTAINERS.txt
Disallow: /update.php
Disallow: /UPGRADE.txt
Disallow: /xmlrpc.php
# Paths (clean URLs)
Disallow: /admin/
Disallow: /comment/reply/
Disallow: /filter/tips/
Disallow: /node/add/
Disallow: /search/
Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/login/
Disallow: /user/logout/
# Paths (no clean URLs)
Disallow: /?q=admin/
Disallow: /?q=comment/reply/
Disallow: /?q=filter/tips/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/register/
Disallow: /?q=user/login/
Disallow: /?q=user/logout/
Disallow: /
I didn't want to comment out the entire file, and logic told me that the final Disallow: / line should override all the previous rules. However, the client reported that a form was submitted from the site this robots.txt file belongs to, which leads us to believe the site was indexed. Is there something I'm missing here?
Thanks, y'all!
As mentioned in the comments, the robots.txt file is no more than a request.
Polite web-crawlers will honor it, and potentially evil ones could ignore it or use it as a treasure map.
What you propose will work (to the extent that robots.txt works): every rule in your file is a Disallow, so adding Disallow: / blocks every path no matter where the line sits.
Here are the "rules":
It needs to be readable by your webserver (duh, huh?).
It needs to be at the root level of your webserver (e.g. http://www.example.com/robots.txt).
If you have multiple websites, each one needs a /robots.txt URL (they can share the actual file, if appropriate). Note that http://www.example.com and https://www.example.com are two different websites for these purposes, as are http://www.example.com and http://example.com, even if they deliver the same content.
Under the original standard the first matching rule applies, while Google and RFC 9309 use the most specific (longest) matching rule. This mostly matters if you are using the Allow directive, which began as a non-standard but widely implemented extension.
You can find some additional information here: https://en.wikipedia.org/wiki/Robots_exclusion_standard
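To make the precedence question concrete, here is a small hand-written sketch of two conventions: first match (the original behaviour) and longest match (the rule described in RFC 9309). It is a simplified illustration, not any crawler's actual code; it ignores wildcards and user-agent groups, and the function names, the rule tuples, and the paths /contact-form and /public/ are invented for the example.

```python
def first_match(rules, path):
    """Original convention: the first rule whose prefix matches decides."""
    for directive, prefix in rules:
        if prefix and path.startswith(prefix):
            return directive == "allow"
    return True  # no rule matched, so crawling is permitted


def longest_match(rules, path):
    """RFC 9309 convention: the longest matching prefix decides; Allow wins ties."""
    best_len, allowed = -1, True
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len, allowed = len(prefix), is_allow
    return allowed


# Only Disallow rules, as in the question: every match says "no".
only_disallow = [("disallow", "/admin/"), ("disallow", "/search/"), ("disallow", "/")]
# Mixing in an Allow rule is where the two conventions can disagree.
mixed = [("disallow", "/"), ("allow", "/public/")]

for path in ("/contact-form", "/admin/users"):
    print(path, first_match(only_disallow, path), longest_match(only_disallow, path))

print(first_match(mixed, "/public/page"), longest_match(mixed, "/public/page"))
```

With only Disallow rules both functions agree that every path is blocked, so the position of Disallow: / does not matter. The two disagree only for the mixed list, where first match blocks /public/page and longest match permits it.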

Robots.txt: Disallow subdirectory but allow directory

I want to allow crawling of files in:
/directory/
but not crawling of files in:
/directory/subdirectory/
Is the correct robots.txt instruction:
User-agent: *
Disallow: /subdirectory/
I'm afraid that if I disallowed /directory/subdirectory/
that I would be disallowing crawling of all files in /directory/ which I do not want to do, so am I correct in using:
User-agent: *
Disallow: /subdirectory/
You're overthinking it:
User-agent: *
Disallow: /directory/subdirectory/
is correct.
User-agent: *
Disallow: /directory/subdirectory/
Spiders aren't stupid, they can parse a path :)
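The two candidate rules can be compared with Python's standard-library urllib.robotparser. This is a minimal sketch under stated assumptions: the helper name parser_for, the agent name ExampleBot, and the page.html file names are made up for the example, and the parser only does simple prefix matching.

```python
from urllib.robotparser import RobotFileParser


def parser_for(disallow_path):
    """Build a parser for a one-rule robots.txt that applies to all agents."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {disallow_path}"])
    return parser


full_path = parser_for("/directory/subdirectory/")
short_path = parser_for("/subdirectory/")

for url_path in ("/directory/page.html", "/directory/subdirectory/page.html"):
    print(url_path,
          full_path.can_fetch("ExampleBot", url_path),
          short_path.can_fetch("ExampleBot", url_path))
```

The full-path rule blocks only the subdirectory and leaves /directory/page.html crawlable, while the rule /subdirectory/ matches neither URL because rules are compared from the start of the path.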

My robots.txt is User-agent: * Disallow: /feeds/ Disallow: /*/_/ What does it mean?

I have a Google site, and today I found google generated a new robots.txt file:
User-agent: *
Disallow: /feeds/
Disallow: /*/_/
What does it mean? Your kind reply is appreciated.
Here is the breakdown:
User-agent: * -- Apply to all robots
Disallow: /feeds/ -- Do not crawl the /feeds/ directory
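Disallow: /*/_/ -- Do not crawl any URL whose path contains a directory named _ below the first path segment (* is a wildcard that matches any sequence of characters)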
For more information, see www.robotstxt.org.
The User-Agent header is how browsers and robots identify themselves.
The Disallow lines define the rules the robots are supposed to follow - in this case what they shouldn't crawl.
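To see which paths a wildcard rule such as /*/_/ covers, the pattern can be translated into a regular expression. The sketch below assumes the common wildcard semantics (* matches any run of characters, a trailing $ anchors the end of the path); the function names and sample paths are hypothetical, and Python's own urllib.robotparser does not interpret these wildcards.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regular expression."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and let each * match any run of characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern, path):
    """Return True if `path` is covered by `pattern`, matching from its start."""
    return pattern_to_regex(pattern).match(path) is not None


for path in ("/feeds/posts", "/mysite/_/rsrc/logo.png", "/mysite/home", "/a/b/_/c"):
    print(path, matches("/feeds/", path), matches("/*/_/", path))
```

Because * can span several path segments, /*/_/ covers a directory named _ at any depth below the first segment, not only directly under it.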