Robots.txt: allow URL if it ends with something

I have user-created URLs which all end with /post. I want search engines to be able to crawl those URLs; I have disallowed all other URLs. Here are some examples:
www.website.com/john-smith/my-blog/post
www.website.com/jim-thomas/my-skiing-blog/post
www.website.com/matt-jones/blog-about-gaming/post
How would I do this? Thanks.

The solution is actually very simple, but note that robots.txt rules are matched against the URL path from its beginning, so matching URLs that end with /post needs a leading wildcard as well as the $ end anchor:
Allow: /*/post$
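Since all other URLs are disallowed, a complete file could look like the sketch below. Keep in mind that the * and $ wildcards are extensions honoured by Google, Bing and the other major engines, not part of the original robots.txt standard; the Allow rule is listed first so that crawlers using a first-match algorithm see it before the blanket Disallow:
User-agent: *
Allow: /*/post$
Disallow: /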

Is our robots.txt file formatted correctly?

I'm trying to make sure our robots.txt file is correct and would greatly appreciate some info. We want all bots to be able to crawl and index the homepage and the 'sample triallines' but that's it. Here's the file:
User-agent: *
Allow: /$
Allow: /sample-triallines$
Disallow: /
Can anyone please let me know if this is correct?
Thanks in advance.
You can test your robots.txt file directly with a robots testing tool or within the webmaster tools of most major search engines (e.g. Google Search Console). Your current robots.txt file will work for most crawlers for the exact URLs you mentioned (e.g. https://www.example.com/ and https://www.example.com/sample-triallines).
However, note that if a URL deviates from these exact URLs at all (e.g. through tracking parameters), it will be blocked to crawlers. For example, the URLs below will be blocked with the current robots.txt setup, which may or may not be acceptable for what you're working on.
https://www.example.com/index.html
https://www.example.com/?marketing=promo
https://www.example.com/sample-triallines/
https://www.example.com/sample-triallines?marketing=promo
If any of the above URLs need to be crawled, you'll just need to add additional directives to the robots.txt file as needed and test them with the robots testing tools. Additional information on robots.txt directives can be found in the major search engines' documentation.
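For instance, a variant that also permits those four URLs might look like the sketch below, assuming index.html and the marketing parameter are the only variants you need (anything else stays blocked):
User-agent: *
Allow: /$
Allow: /index.html$
Allow: /?marketing=
Allow: /sample-triallines
Disallow: /
The bare Allow: /sample-triallines (without the $ anchor) is a prefix match, so it also covers /sample-triallines/ and /sample-triallines?marketing=promo.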
Hope this helps

Stop web.archive.org from saving the site's pages

I tried accessing facebook.com webpages from a previous point in time via web.archive.org.
The site showed me an error saying it cannot save pages because of the site's robots.txt.
Can anyone tell me which statements in the robots.txt are making the site inaccessible to web.archive.org?
I guess it is because of the #permission statement, as seen here: http://facebook.com/robots.txt
Is there any way I can do this for my site as well?
I also don't want woorank.com or builtwith.com to analyze my site.
Note: search engine bots should face no problems crawling and indexing my site if I add some statements to robots.txt to achieve the results mentioned above.
The Internet Archive (archive.org) crawler uses the User-Agent value ia_archiver (see their documentation).
So if you want to target this bot in your robots.txt, use
User-agent: ia_archiver
And this is exactly what Facebook does in its robots.txt:
User-agent: ia_archiver
Allow: /about/privacy
Allow: /full_data_use_policy
Allow: /legal/terms
Allow: /policy.php
Disallow: /
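For your own site, a minimal sketch could look like the following. The ia_archiver token comes from the Internet Archive's documentation, but the tokens used below for woorank.com and builtwith.com are placeholders, so confirm the actual User-Agent values each service sends against their documentation:
# Block the Internet Archive's crawler
User-agent: ia_archiver
Disallow: /
# Placeholder tokens; confirm the real ones with each service
User-agent: WooRankBot
Disallow: /
User-agent: BuiltWithBot
Disallow: /
# Everyone else, including search engine bots, may crawl everything
User-agent: *
Disallow:
Bear in mind that robots.txt is purely advisory; tools that ignore it would have to be blocked at the server level instead.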
To get pages that have already been archived excluded, the Internet Archive's help page says:
If you would like to submit a request for archives of your site or account to be excluded from web.archive.org, send us a request to info@archive.org and indicate: [...]
https://help.archive.org/help/how-do-i-request-to-remove-something-from-archive-org/

Disallow dynamic pages in robots.txt

How would I disallow all dynamic pages within my robots.txt?
E.g.
page.php?hello=there
page.php?hello=everyone
page.php?thank=you
I would like page.php AND all possible dynamic versions to be disallowed.
At the moment I have
User-Agent: *
Disallow: /page.php
But this still allows e.g. page.php?hello=there
Thanks
What you've already got should block all access to /page.php for all search engines that respect robots.txt, no matter what query string parameters are provided: robots.txt rules are prefix matches, so Disallow: /page.php also matches /page.php?hello=there (and even /page.php/anything).
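If you ever needed the opposite, i.e. keeping /page.php itself crawlable while blocking only its dynamic versions, a sketch using the wildcard extensions supported by Google, Bing and the other major engines would be:
User-agent: *
Allow: /page.php$
Disallow: /page.php?
Here Allow: /page.php$ matches only the bare page, and Disallow: /page.php? catches every query-string variant.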
Don't forget robots.txt is only for robots :-) If you're trying to block users from accessing the page, you'll need to use .htaccess or similar.

robots.txt deny access to specific URL parameters

I have been trying to get an answer to this question on various Google forums, but no one answers, so I'll try here at SO.
I had an old site that used different URL parameters like
domain.com/index.php?showimage=166
domain.com/index.php?x=googlemap&showimage=139
How can I block access to these pages for these parameters, without my domain.com/index.php page itself being blocked? Can this be done in robots.txt?
EDIT: I found a post here: Ignore urls in robot.txt with specific parameters?
User-agent: *
Disallow: /index.php?showimage=
Disallow: /index.php?x=
Note that robots.txt matching is case-sensitive, so the parameter has to be written showimage (as in your URLs), not showImage. The trailing * is redundant, since every rule is already a prefix match, and an Allow line isn't needed because everything is allowed by default. Your domain.com/index.php page stays crawlable because no rule matches it.
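Because these are plain prefix matches, they only catch a parameter when it is the first one in the query string. If the parameters can also appear after other parameters, a sketch using the * wildcard supported by the major engines would be:
User-agent: *
Disallow: /index.php?showimage=
Disallow: /index.php?*&showimage=
Disallow: /index.php?x=
Disallow: /index.php?*&x=
The ?*& variants match the parameter wherever it appears after another one, without accidentally matching parameter names that merely end in x or showimage.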

How to disallow bots from a single page or file

How do I disallow bots from a single page, while allowing all other content to be crawled?
It's so important not to get this wrong that I am asking here; I can't find a definitive answer elsewhere.
Is this correct?
User-agent: *
Disallow: /dir/mypage.html
Allow: /
The Disallow line is all that's needed. It will block access to anything that starts with "/dir/mypage.html".
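If you wanted to block only that exact URL, the $ end anchor supported by Google, Bing and the other major engines could be used, as a sketch:
User-agent: *
Disallow: /dir/mypage.html$
Be aware, though, that the anchored rule no longer matches query-string variants such as /dir/mypage.html?foo=bar, so the plain prefix rule is usually the safer choice.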
The Allow line is superfluous. The default for robots.txt is Allow: /. In general, Allow is not required. It's there so that you can override access to something that would be disallowed. For example, say you want to disallow access to the "/images" directory, except for images in the "public" subdirectory. You would write:
Allow: /images/public
Disallow: /images
Note that order can be important here. Crawlers following the original robots.txt standard use a "first match" algorithm, so if you wrote the Disallow first, such a crawler would assume that access to "/images/public" was blocked. Google and Bing instead apply the most specific (longest) matching rule, so for them the order doesn't matter; putting the Allow first is safe either way.