Robots.txt: blocking URLs with a page parameter higher than 10

I have already checked for similar questions, but I don't think this specific case has been asked and answered yet.
I'd like to block all URLs with a page parameter higher than 10 (I will probably choose a value lower than 10).
Disallow: /events/world-wide/all-event-types/all?page=11
Allow: /events/world-wide/all-event-types/all?page=3
I have a lot of similar URLs where the other parts of the path can change; some of these lists have up to almost 150 pages.
Disallow: /events/germany/triathlon/all?page=13
Allow: /events/germany/triathlon/all?page=4
How can I accomplish this without listing all the URLs (which is basically impossible)?
To emphasize again: the page parameter is the important thing here.
I can probably do something like this:
Disallow: *?page=
Allow: *?page=(1-10)
What's the proper approach here?

The robots.txt "regEx" syntax is fairly limited, so unfortunately it can result in unnecessarily large robots.txt files. Although the other answers address the primary use case, you might also want to add some variants to account for the page parameter appearing after other query parameters (i.e. &page= instead of ?page=).
Disallow: *?page=
Disallow: *&page=
Allow: *?page=1$
Allow: *?page=2$
Allow: *?page=3$
...
Allow: *?page=1&
Allow: *?page=2&
Allow: *?page=3&
...
Allow: *&page=1&
Allow: *&page=2&
Allow: *&page=3&
....

You can do it this way:
Allow: /*?page=1
Allow: /*?page=2
Allow: /*?page=3
Allow: /*?page=4
Allow: /*?page=5
Allow: /*?page=6
Allow: /*?page=7
Allow: /*?page=8
Allow: /*?page=9
Allow: /*?page=10
Disallow: /*?page=1*
Disallow: /*?page=2*
Disallow: /*?page=3*
Disallow: /*?page=4*
Disallow: /*?page=5*
Disallow: /*?page=6*
Disallow: /*?page=7*
Disallow: /*?page=8*
Disallow: /*?page=9*
So we allow pages 1 to 10
and disallow pages higher than 10.
You can read the details in Google's robots.txt documentation.

Thanks @Bazzilio for the nice try, but we programmers are lazy and try to avoid writing code as much as possible. The best I can come up with for now is the following (which works):
Disallow: *?page=
Allow: *?page=1$
Allow: *?page=2$
Allow: *?page=3$
Allow: *?page=4$
....
But isn't there a way to combine the Allow statements?
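There is no range syntax in robots.txt, so the Allow lines cannot be combined; each permitted value needs its own rule. If typing them out is the objection, you can generate the block with a small script instead. A minimal PHP sketch (the cut-off of 10 and the output file name are just assumptions):
<?php
// Build a robots.txt that blocks every URL with a page parameter,
// then re-allows page=1 through page=10 with end-of-URL anchors.
$maxPage = 10; // assumed cut-off, adjust as needed

$lines = ["User-agent: *", "Disallow: *?page="];
for ($page = 1; $page <= $maxPage; $page++) {
    $lines[] = "Allow: *?page={$page}\$"; // "$" anchors the rule to the end of the URL
}

file_put_contents('robots.txt', implode("\n", $lines) . "\n");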

Related

TYPO3 10 StaticRoute returns 404 status code

I'm using a staticText route for my robots.txt in TYPO3 v10 (with the default .htaccess file).
The text is delivered as expected, but the status code in the header is 404. I have no idea how to fix that, since there's no option in the staticText route to set the status code.
This is my code for the route (as described in the docs: https://docs.typo3.org/m/typo3/reference-coreapi/10.4/en-us/ApiOverview/SiteHandling/StaticRoutes.html):
routes:
  -
    route: robots.txt
    type: staticText
    content: |
      Sitemap: https://example.com/sitemap.xml
      User-agent: *
      Allow: /
      Disallow: /forbidden/
I don't know if there's something wrong within the docs, but here's what I use:
routes:
  -
    route: robots.txt
    type: staticText
    content: "User-agent: *\r\nDisallow: /typo3/\r\n\r\nSitemap: https://example.com/sitemap.xml"
This is automatically generated when you use the Sites backend module.
Note: the content of the robots.txt is a single string with \r\n for the line breaks.
Take a closer look at the documentation.
In YAML files, indentation is key. You have to indent your multiline content (beginning with "Sitemap:...").
routes:
  -
    route: robots.txt
    type: staticText
    content: |
      Sitemap: https://example.com/sitemap.xml
      User-agent: *
      Allow: /
      Disallow: /forbidden/
https://www.w3schools.io/file/yaml-multiline-strings/
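If you want to see what the block scalar actually contains after parsing, you can run the site configuration through the Symfony YAML component (which TYPO3 itself uses). A minimal sketch, assuming a Composer-based setup and that the site config lives at config/sites/main/config.yaml:
<?php
// Parse the site config and print the staticText content of the first route,
// so you can verify that the multiline value survived with its line breaks.
use Symfony\Component\Yaml\Yaml;

require 'vendor/autoload.php';

$config = Yaml::parseFile('config/sites/main/config.yaml');
echo $config['routes'][0]['content'];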

How to block fake Googlebots?

I guess a fake Googlebot visited my site. Here is the log entry:
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
66.249.73.72
I think so because it crawled some addresses that do not exist; I never created them at all.
The fake bot follows a pattern: it prepends a specific word to my URLs.
For instance, this page exists:
https://stackoverflow.com/user
but the bot crawled:
https://stackoverflow.com/some-word-user
https://stackoverflow.com/some-word-jobs
And here is my robots.txt:
User-agent: *
Disallow: /search?q=*
Disallow: *?replytocom
Disallow: /*add-to-cart=*
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Sitemap: -----
First, you should know that Googlebot crawls non-existing addresses too,
e.g. when it tries to discover new content.
Second, I personally would rather live with fake Googlebots than risk
excluding the real Googlebot by its IP. Google keeps adding new IPs for
Googlebot. Again: don't risk it.
In my experience, Googlebot requests always come from a Googlebot IP address that reverse-resolves to something like crawl-xx-xxx-xxx-xxx.googlebot.com.
So a possible method is to check that if the user agent includes Googlebot/2.1 AND the remote host includes googlebot.com, then it is valid; if not, it is a fake.
Here is the code:
$agent  = $_SERVER['HTTP_USER_AGENT'];
$remote = isset($_SERVER['REMOTE_HOST']) ? $_SERVER['REMOTE_HOST'] : gethostbyaddr($_SERVER['REMOTE_ADDR']);
$value  = "googlebot";
$pos1 = strpos(strtolower($remote), $value); // "googlebot" in the reverse-resolved host name?
$pos2 = strpos(strtolower($agent), $value);  // "googlebot" in the user agent?
// The user agent claims to be Googlebot, but the host does not resolve to googlebot.com: treat it as fake.
if ($pos1 === false && $pos2 !== false) {
    require_once($_SERVER['DOCUMENT_ROOT'].'/errorpage.php');
    exit();
}
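A stricter check, and the one Google itself documents for verifying Googlebot, is a forward-confirmed reverse DNS lookup: resolve the IP to a host name, make sure it ends in googlebot.com or google.com, then resolve that host name back and confirm it matches the original IP. A minimal sketch, assuming the real client IP is in $_SERVER['REMOTE_ADDR'] (i.e. no proxy or CDN in front):
<?php
// Forward-confirmed reverse DNS check for Googlebot.
function isRealGooglebot(string $ip): bool
{
    $host = gethostbyaddr($ip); // reverse lookup, e.g. crawl-66-249-73-72.googlebot.com
    if ($host === false || $host === $ip) {
        return false; // no PTR record for this IP
    }
    if (!preg_match('/\.(googlebot|google)\.com$/i', $host)) {
        return false; // host name does not belong to Google's crawler domains
    }
    // The forward lookup must point back to the same IP.
    return gethostbyname($host) === $ip;
}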

Robots.txt: disallow a specific page but not other pages starting with it

I want to disallow a specific page:
example.com/10
but not other pages starting with /10:
example.com/101
example.com/102
example.com/103
How can I do this?
You can use the Allow keyword to achieve it:
User-agent: *
Allow: /10*
Disallow: /10$
Results from http://tools.seobook.com/robots-txt/analyzer/:
Url: /10
Multiple robot rules found
Robots disallowed: All robots
Url: /101
Robots allowed: All robots
Url: /102
Robots allowed: All robots
Url: /103
Robots allowed: All robots
However, older robots may not interpret this correctly; some of them, for example, only apply the first matching rule.

Robots.txt disallowing particular type of URL

I want to exclude this URL from bots:
test.com/p/12345/qwerty
But allow this URL:
test.com/p/12345
Will this line work?
User-Agent: *
Disallow: */p/*/*
Thanks
According to this tutorial there is no need for the * at the beginning. It should be:
User-Agent: *
Disallow: /p/*/*
Note that this will have the side effect of blocking bots on an address like test.com/p/abc/def
To get exactly the functionality you are asking for and nothing more (i.e. no side effects), use this:
User-Agent: *
Disallow: /p/12345/qwerty
test.com/p/12345 will be allowed by default.

Disallow dynamic urls using robots.txt

I have URLs like example.com/post/alai-fm-sri-lanka-listen-online-1467/
I want to block all URLs which have the word "post" in them using robots.txt.
So which is the correct format?
Disallow: /post-*
Disallow: /?page=post
Disallow: /*page=post
(Note that the file has to be called robots.txt; I corrected it in your question.)
You only included one example URL, where "post" is the first path segment. If all your URLs look like that, the following robots.txt should work:
User-agent: *
Disallow: /post/
It would block the following URLs:
http://example.com/post/
http://example.com/post/foobar
http://example.com/post/foo/bar
…
The following URLs would still be allowed:
http://example.com/post
http://example.com/foo/post/
http://example.com/foo/bar/post
http://example.com/foo?page=post
http://example.com/foo?post=1
…
Googlebot and Bingbot both handle limited wildcarding, so this will work:
Disallow: /*post
Of course, that will also disallow any URL that contains "compost", "outpost", "poster", or anything else containing the substring "post".
You could try to make it a little better. For example:
Disallow: /*/post  # any path segment that starts with "post"
Disallow: /*?post=  # the post query parameter
Disallow: /*=post   # any query value that starts with "post"
Understand, though, that not all bots support wildcards, and of those that do some are buggy. Bing and Google handle them correctly. There's no guarantee if other bots do.