robots.txt: deny access to specific URL parameters

I have been trying to get an answer to this question on various Google forums, but no one answers, so I'll try here on SO.
I had an old site that used different URL parameters like
domain.com/index.php?showimage=166
domain.com/index.php?x=googlemap&showimage=139
How can I block crawler access to these pages when they are requested with these parameters, without blocking my domain.com/index.php page itself?
Can this be done in robots.txt?
EDIT: I found a post here: "Ignore urls in robot.txt with specific parameters?"

Allow: *
Disallow: /index.php?showImage=*
Disallow: /index.php?x=*
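One way to sanity-check rules like these before deploying them is a small matcher. The sketch below assumes Google-style matching, where `*` stands for any run of characters and the path comparison is case-sensitive. The helper name `rule_matches` is made up for illustration, and it only handles `*`, not `$`. Under those assumptions it suggests that `showImage` (capital I) would not match the lowercase `showimage` used in the URLs.

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Hypothetical helper: prefix match where '*' means any run of characters."""
    # Escape each literal piece, then join the pieces with '.*' for the wildcards.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    # re.match anchors at the start of the path and is case-sensitive.
    return re.match(regex, path) is not None

rules = [
    "/index.php?showImage=*",  # as written in the rules above
    "/index.php?showimage=*",  # lowercase, as in the actual URLs
    "/index.php?x=*",
]
paths = [
    "/index.php",
    "/index.php?showimage=166",
    "/index.php?x=googlemap&showimage=139",
]

for rule in rules:
    for path in paths:
        print(f"{rule!r} matches {path!r}: {rule_matches(rule, path)}")
```

None of the rules matches the bare /index.php, so that page stays crawlable in this sketch.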

Related

Is our robots.txt file formatted correctly?

I'm trying to make sure our robots.txt file is correct and would greatly appreciate some info. We want all bots to be able to crawl and index the homepage and the 'sample triallines' but that's it. Here's the file:
User-agent: *
Allow: /$
Allow: /sample-triallines$
Disallow: /
Can anyone please let me know if this is correct?
Thanks in advance.
You can test your robots.txt file directly with a robots.txt testing tool or within the webmaster tools of most major search engines (e.g. Google Search Console). Your current robots.txt file will work for most crawlers for the exact URLs you mentioned (e.g. https://www.example/ and https://www.example/sample-triallines).
Note, however, that any URL that deviates from these exact URLs (for example, by carrying tracking parameters) will be blocked for crawlers. The URLs below are blocked with the current robots.txt setup, which may or may not be acceptable for what you're working on.
https://www.example/index.html
https://www.example/?marketing=promo
https://www.example/sample-triallines/
https://www.example/sample-triallines?marketing=promo
If any of the above URLs need to be crawled, add further directives to the robots.txt file and test them with a robots.txt testing tool. Additional information on robots directives can be found here.
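To see why those URLs end up blocked, here is a rough sketch of the evaluation. It assumes Google-style semantics: `*` is a wildcard, a trailing `$` anchors the end of the URL, the longest matching rule wins, and Allow wins a tie. Other crawlers may evaluate rules differently, and the helper names `rule_matches` and `is_allowed` are invented for this example.

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """'*' matches any run of characters; a trailing '$' anchors the end of the URL."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    matcher = re.fullmatch if anchored else re.match
    return matcher(regex, path) is not None

def is_allowed(rules, path: str) -> bool:
    """Pick the longest matching pattern; on equal length, Allow beats Disallow."""
    best = None  # (pattern length, is_allow)
    for directive, pattern in rules:
        if rule_matches(pattern, path):
            candidate = (len(pattern), directive == "allow")
            if best is None or candidate > best:
                best = candidate
    # No matching rule at all means the URL may be crawled.
    return True if best is None else best[1]

RULES = [
    ("allow", "/$"),
    ("allow", "/sample-triallines$"),
    ("disallow", "/"),
]

for path in [
    "/",
    "/sample-triallines",
    "/index.html",
    "/?marketing=promo",
    "/sample-triallines/",
    "/sample-triallines?marketing=promo",
]:
    print(path, "allowed" if is_allowed(RULES, path) else "blocked")
```

Only the first two paths come out as allowed, which is consistent with the list of blocked URLs above.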
Hope this helps

Stop web.archive.org from saving the site's pages

I tried to access earlier versions of facebook.com pages on web.archive.org.
The site showed an error saying that it cannot save pages because of the site's robots.txt.
Can anyone tell me which statements in the robots.txt make the site inaccessible to web.archive.org?
I guess it is because of the #permission statement, as mentioned here (http://facebook.com/robots.txt).
Is there any way I can do this for my site as well?
I also don't want woorank.com or builtwith.com to analyze my site.
Note: whatever statements I add to robots.txt to achieve the above, search engine bots should still have no problems crawling and indexing my site.
The Internet Archive (archive.org) crawler uses the User-Agent value ia_archiver (see their documentation).
So if you want to target this bot in your robots.txt, use
User-agent: ia_archiver
And this is exactly what Facebook does in its robots.txt:
User-agent: ia_archiver
Allow: /about/privacy
Allow: /full_data_use_policy
Allow: /legal/terms
Allow: /policy.php
Disallow: /
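As a quick check of what this group does, Python's standard `urllib.robotparser` can evaluate it. This is only a sketch: the module does plain prefix matching and takes the first matching rule, which happens to give the same result here because the Allow lines come before `Disallow: /`. The path `/some-profile` is a made-up example.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: ia_archiver
Allow: /about/privacy
Allow: /full_data_use_policy
Allow: /legal/terms
Allow: /policy.php
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# An explicitly allowed path for the archive crawler.
print(parser.can_fetch("ia_archiver", "/about/privacy"))
# Any other path falls through to "Disallow: /" for ia_archiver.
print(parser.can_fetch("ia_archiver", "/some-profile"))
# A bot with no matching group (and no "*" group) is not restricted by this file.
print(parser.can_fetch("Googlebot", "/some-profile"))
```

Because the group names only ia_archiver, search engine bots are unaffected by it.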
If you would like to submit a request for archives of your site or
account to be excluded from web.archive.org, send us a request to
info@archive.org and indicate:
https://help.archive.org/help/how-do-i-request-to-remove-something-from-archive-org/

Proper wildcard Disallow for robots.txt

I am trying to disallow a specific page and its parameters along with a parameter on the entire site. Below I have the exact examples.
We now have a page that redirects to and tracks external URLs. Any external URL we want to track is linked like /redirect?u=http://example.com. We do not want to add rel="nofollow" to every link.
Last but not least (our biggest SEO and indexing issue), every single page has an auto-generated URL to disable or enable the mobile version. It can appear on any page, like /?mobileVersion=off (or on) or /accounts?login_to=%2Fdashboard&mobileVersion=off.
Basically the easy way to disallow the two parameters would be to disallow mobileVersion and u from any page. (u is the parameter needed to redirect the URL and is only valid on /redirect)
My current robots.txt config:
User-Agent: *
Disallow: /redirect
Disallow: /*?*mobileVersion=off
If you want to see our full robots.txt file, it's located at http://spicethymeinc.com/robots.txt.
You could change
Disallow: /*?*mobileVersion=off
to
Disallow: /*mobileVersion=off
but your current rule looks like it should work as it is.
I'm going off the wildcard section and examples on this page:
http://tools.seobook.com/robots-txt/
Edit: I have tested with Googlebot and Googlebot-Mobile. The URLs are blocked by both your current robots.txt and my suggested change. Google Webmaster Tools has a handy robots.txt checker you can use to test.
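If you want a rough local check in addition to Google's tester, the wildcard rules can be approximated with a few lines of Python. This sketch assumes Google-style `*` wildcards with case-sensitive prefix matching and rules written with no space after the leading slash. The helper `rule_matches` is hypothetical, and the broader `/*mobileVersion=` pattern is my own assumption for also covering `mobileVersion=on`.

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Hypothetical helper: prefix match where '*' means any run of characters."""
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.match(regex, path) is not None

patterns = [
    "/redirect",
    "/*?*mobileVersion=off",  # current rule
    "/*mobileVersion=off",    # suggested shorter rule
    "/*mobileVersion=",       # assumption: would also cover mobileVersion=on
]
paths = [
    "/redirect?u=http://example.com",
    "/?mobileVersion=off",
    "/accounts?login_to=%2Fdashboard&mobileVersion=off",
    "/?mobileVersion=on",
    "/accounts",
]

for path in paths:
    hits = [p for p in patterns if rule_matches(p, path)]
    print(path, "blocked by", hits if hits else "nothing")
```

Both `=off` rules catch the same example URLs here, so the shorter form is a simplification rather than a behavior change.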

Why doesn't robots.txt work when I redirect from http to https?

Today I ran into a problem with Google search.
When I type "trakopolis" into Google, it shows my page (so it is indexed by Google's robots), but the description of the page is not available. It is very important to have a description for my website.
The website is:
https://trakopolis.com
The robots.txt file is as follows, so I allow everything:
User-agent: *
Allow: /
https://www.google.com.ua/?gws_rd=cr#gs_rn=23&gs_ri=psy-ab&tok=O7cIXclKCSxtMd3uDVRVhg&cp=2&gs_id=h&xhr=t&q=trakopolis&es_nrs=true&pf=p&output=search&sclient=psy-ab&oq=tr&gs_l=&pbx=1&bav=on.2,or.r_qf.&bvm=bv.50165853,d.bGE&fp=d3f611552977418f&biw=1680&bih=949
But as you can see, the description is not available. I'm confused :( Sorry if the question is stupid.
As I can see from Google Webmaster Tools, Google uses this robots.txt file, so maybe the issue is the redirection from http to https? The website doesn't allow http, and we use https. On the main page I redirect to the Login.aspx page if the user is not authenticated.
Google shows a description when searching for "trakopolis":
It seems that your robots.txt disallowed crawling of your site some time ago, as some other search engines still display that they are not allowed to show your description, e.g. DuckDuckGo.
Note that your robots.txt uses Allow, which is not part of the original robots.txt specification (but many parsers understand it anyway). It’s equivalent to:
User-agent: *
Disallow:
(But because parsers have to ignore unknown fields, you should have no problem using Allow. An empty or nonexistent robots.txt always allows crawling of everything.)
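The equivalence can be illustrated with Python's standard `urllib.robotparser`, which is one of the parsers that understands Allow. This is a sketch only: the bot name `ExampleBot` is a placeholder, and other parsers may differ in details.

```python
from urllib.robotparser import RobotFileParser

def allows(robots_txt: str, path: str) -> bool:
    """Parse the given robots.txt text and ask whether a placeholder bot may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("ExampleBot", path)

variants = {
    "Allow: /": "User-agent: *\nAllow: /\n",
    "empty Disallow": "User-agent: *\nDisallow:\n",
    "empty file": "",
}

for name, text in variants.items():
    print(name, allows(text, "/Login.aspx"))
```

All three variants permit crawling in this parser, so the missing description is unlikely to come from the current file.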

Disallow dynamic pages in robots.txt

How would I disallow all dynamic pages within my robots.txt?
E.g.
page.php?hello=there
page.php?hello=everyone
page.php?thank=you
I would like page.php AND all possible dynamic versions to be disallowed.
At the moment I have
User-Agent: *
Disallow: /page.php
But this still allows e.g. page.php?hello=there
Thanks
What you've already got should block all access to /page.php for all search engines that respect robots.txt, no matter whether any query string parameters are provided.
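You can convince yourself of the prefix behavior with Python's standard `urllib.robotparser`. This is a sketch using the rules from the question; `ExampleBot` and `/other.php` are placeholders.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-Agent: *",
    "Disallow: /page.php",
])

# Disallow is a prefix match, so query-string variants of /page.php are covered too.
for path in [
    "/page.php",
    "/page.php?hello=there",
    "/page.php?hello=everyone",
    "/page.php?thank=you",
    "/other.php",
]:
    verdict = "allowed" if parser.can_fetch("ExampleBot", path) else "blocked"
    print(path, verdict)
```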
Don't forget robots.txt is only for robots :-) If you're trying to block users from accessing the page, you'll need to use .htaccess or similar.