Disallow rule in the robots file - robots.txt

Disallow: /*?
A website has this in its robots.txt file. I am presuming that everything before the ? will be blocked.
Is this true? Does it block all levels/folders before the /?

That rule would block every URL that contains a query string. So it would block http://www.example.com/foo.html?name=bar, but it would not block http://www.example.com/foo.html.
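For context, a Disallow line only takes effect inside a User-agent block, so a minimal sketch of the complete file would look like this (the User-agent: * line is assumed, not quoted from the question):

User-agent: *
# block any URL containing a query string (for crawlers that support wildcards)
Disallow: /*?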

Related

Constructing proper regex redirect for multiple pages with variable directories

Something happened to our WordPress site and we ended up with thousands of 404s to basically the same page (probably a plugin conflict or deactivation).
The actual page is at example.com/c/circlek.htm.
The 404 errors follow this pattern:
example.com/movie-review/page/18/c/circlek.htm
example.com/movie-review/page/17/c/circlek.htm
example.com/movie-review/page/155/c/circlek.htm
example.com/movie-review/page/188/c/circlek.htm
example.com/movie-review/page/191/c/circlek.htm
and so on and on and on...
The only variation is in the numerical folders, none of which exist. I could create 2,000 separate redirects, one for each URL, but I'm sure there's a way to create a single regex redirect to address all possible variations. I believe the expression (.*) should match the numerical folder regardless of the number of digits, but I'm stuck otherwise.
So, anything like https://example.com/movie-review/page/18/c/circlek.htm or https://example.com/movie-review/page/191/c/circlek.htm, etc. should redirect to https://example.com/c/circlek.htm
This is what I'm coming up with and I know it must be incomplete:
RewriteRule ^movie-review/page/(.*)/ https://example.com/ [R=301,L]
Suggestions?
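One possible completion of that rule, as a sketch assuming Apache mod_rewrite in the site's root .htaccess (the \d+ pattern, which limits the match to purely numeric folders, is my assumption rather than something stated in the question):

RewriteEngine On
# send any numeric page-folder variant back to the real page
RewriteRule ^movie-review/page/\d+/c/circlek\.htm$ https://example.com/c/circlek.htm [R=301,L]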

Can I block search engines from scanning files starting with a certain letter using robots.txt?

I know I can block search engines from accessing certain types of files using a wildcard like this:
Disallow: /*.gif$
That disallows access to GIFs, or more precisely, to files whose names end in .gif.
But is there a way to prevent search engines from accessing, for example, all files starting with "_"?
Would something like this work?
Disallow: /_*.*$
Or at least perhaps this (if I absolutely need to set an extension)?
Disallow: /_*.php$
As per the "official" docs
Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines.
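For crawlers that do support Google-style wildcards, a sketch like the following could cover files beginning with an underscore (the Googlebot-only block and both patterns are my illustration; note that in Google's implementation * also matches across slashes):

User-agent: Googlebot
# underscore-prefixed files at the root
Disallow: /_
# underscore-prefixed files in any subdirectory
Disallow: /*/_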

robots.txt special characters disallow

Example link: upload.php?id=46. I want to disallow all such links, i.e. id=1, 2, 3, and so on.
How can I do that using a special character?
Will this work for me?
disallow:/upload.php?id=*
Your example will work fine for the major search engines, but the final * is unnecessary and will cause the line to be ignored by older robots that don't support wildcards. The Disallow directive basically means "block anything that starts with the following", so putting a wildcard at the end is redundant and has no effect on what will be matched. Wildcards are not part of the original robots.txt specification; all of the major search engines support them, but many older robots do not.
The following does exactly the same thing as your example, but without wildcards:
User-agent: *
Disallow: /upload.php?id=
Why not just use a header in the upload.php file? I.e. put:
header("X-Robots-Tag: noindex, nofollow", true);
At the top of upload.php. If you're using Apache to serve your files, you can also set up rule-based headers in your configuration file.
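A sketch of that Apache approach, assuming mod_headers is enabled and upload.php is served from the document root (the <FilesMatch> block is my illustration, not quoted from the answer):

<FilesMatch "^upload\.php$">
    # send the same noindex/nofollow signal the PHP header would
    Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>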

Blocking files in robots.txt with [possibly] more than one file extension

Is this correct syntax?
Disallow: /file_name.*
If not, is there a way to accomplish this without listing each file twice [multiple times]?
OK, according to http://tool.motoricerca.info/robots-checker.phtml:
The "*" wildchar in file names is not supported by (all) the user-agents addressed by this block of code. You should use the wildchar "*" in a block of code exclusively addressed to spiders that support the wildchar (Eg. Googlebot).
So, I just use:
<meta name="robots" content="noindex,nofollow">
in each page that I wanted to block from search engines.
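It may also be worth noting that, because Disallow matches URL prefixes, a wildcard-free line such as the following covers every extension of that file name even in robots that only follow the original spec (file_name is of course a placeholder):

User-agent: *
Disallow: /file_name.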

robots.txt file disallow option

I want to prevent the robots from accessing URLs that end with /new. I am modifying my robots.txt file as follows:
Disallow: /*/new
Is this the correct pattern to use to disallow access to all URLs terminating in /new?
Yes, it seems Google supports this.
Try "Site Configuration > Crawler access" at https://www.google.com/webmasters to test your robots.txt file against given URLs.
Here is the result from that tool:
http://www.***.com/new/asdasd
Allowed by line 6: Allow: /
http://www.***.com/asdasd/new/
Blocked by line 8: Disallow: /*/new
Detected as a directory; specific files may have different restrictions
http://www.***.com/asdasd/new
Blocked by line 8: Disallow: /*/new
If you only want to match URLs that actually terminate in /new (and not, for example, /asdasd/new/), add the end-of-URL anchor:
Disallow: /*/new$
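Putting it together, a minimal sketch of such a robots.txt (the Allow: / line mirrors the tester output above; the line numbers reported by the tester refer to the tested file, not to this sketch):

User-agent: *
Allow: /
# block any URL that ends in /new, at any depth
Disallow: /*/new$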