Why is robots.txt still not blocking the URLs I have mentioned?

It's been 48 hours since I updated my robots.txt file. You can have a look at it here:
https://www.qapaper.com/robots.txt
I disallowed URLs like:
/SRM-abc+nksf
/SRM-anything+here
That's why I have added:
Disallow: /SRM-
But I want URLs like the following to stay crawlable:
/SRM-inc/joid-lkn
That's why I have added:
Allow: /SRM-*/
I have checked with a robots.txt tester and it gives the desired result, but my URLs aren't being removed from Google's search results.
Have a look:
site:qapaper.com
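
For reference: the two rules can coexist, because Google resolves conflicting rules by the most specific (longest) match, and wildcard patterns in Allow are a Google extension. A sketch of the combined file, using the example paths from the question:

User-agent: *
Disallow: /SRM-
Allow: /SRM-*/

Keep in mind, though, that even a correct robots.txt only stops crawling; as the related answers below explain, it does not by itself remove already-indexed URLs from search results.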

Related

How to set robots.txt files for subdomains?

I have a subdomain, e.g. blog.example.com, and I want this domain not to be indexed by Google or any other search engine. I put my robots.txt file in the 'blog' folder on the server with the following configuration:
User-agent: *
Disallow: /
Would this be enough to stop Google from indexing it?
A few days ago, site:blog.example.com showed 931 links, but now it displays 1320 pages. I am wondering: if my robots.txt file is correct, why is Google still indexing my domain?
If I am doing anything wrong, please correct me.
Rahul,
Not sure if your robots.txt is verbatim, but generally the directives are on TWO lines:
User-agent: *
Disallow: /
This file must be accessible from http://blog.example.com/robots.txt - if it is not accessible from that URL, the search engine spider will not find it.
If you have pages that have already been indexed by Google, you can also try using Google Webmaster Tools to manually remove pages from the index.
This question is actually about how to prevent indexing of a subdomain, and here your robots.txt file is actually preventing your site from being noindexed: because crawling is blocked, Google can never see a noindex directive on the pages.
Don’t use a robots.txt file as a means to hide your web pages from Google search results.
Introduction to robots.txt: What is a robots.txt file used for? Google Search Central Documentation
For the noindex directive to be effective, the page or resource must not be blocked by a robots.txt file, and it has to be otherwise accessible to the crawler. If the page is blocked by a robots.txt file or the crawler can’t access the page, the crawler will never see the noindex directive, and the page can still appear in search results, for example if other pages link to it.
Block Search indexing with noindex Google Search Central Documentation
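Putting those two quotes together: to get the subdomain out of the index, robots.txt must stop blocking it, and each page should carry a noindex signal instead. A minimal sketch of the page-level form (the standard robots meta tag, placed in each page's head):

<!-- robots.txt must NOT block this page, or crawlers will never see the tag -->
<meta name="robots" content="noindex">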

Why doesn't robots.txt work when I redirect from HTTP to HTTPS?

Today I experienced a problem with Google search.
When I type "trakopolis" into Google, it shows my page (so it is indexed by Google's robots), but the description of the page is not available. It is very important for my website to have a description.
The website is:
https://trakopolis.com
The robots.txt file is as follows, so I allow everything:
User-agent: *
Allow: /
https://www.google.com.ua/?gws_rd=cr#gs_rn=23&gs_ri=psy-ab&tok=O7cIXclKCSxtMd3uDVRVhg&cp=2&gs_id=h&xhr=t&q=trakopolis&es_nrs=true&pf=p&output=search&sclient=psy-ab&oq=tr&gs_l=&pbx=1&bav=on.2,or.r_qf.&bvm=bv.50165853,d.bGE&fp=d3f611552977418f&biw=1680&bih=949
But as you can see, the description is not available. I'm confused :( Sorry if the question is stupid.
As far as I can see from Google Webmaster Tools, Google uses this robots.txt file, so maybe the issue is the redirection from HTTP to HTTPS? The website doesn't allow HTTP; we use HTTPS. Also, the main page redirects to a Login.aspx page if the user isn't authenticated.
Google now shows a description when searching for "trakopolis".
It seems that your robots.txt disallowed crawling of your site some time ago, as some other search engines still display that they are not allowed to show your description, e.g. DuckDuckGo.
Note that your robots.txt uses Allow, which is not part of the original robots.txt specification (but many parsers understand it anyway). It’s equivalent to:
User-agent: *
Disallow:
(But because parsers have to ignore unknown fields, you should have no problem using Allow. An empty or nonexistent robots.txt always allows crawling of everything.)
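
On the HTTP-to-HTTPS question itself: Googlebot follows redirects when fetching robots.txt, so a site-wide redirect is normally harmless. A sketch of the usual Apache form, assuming mod_rewrite is available:

# Redirect all HTTP requests (including /robots.txt) to HTTPS
RewriteEngine On
RewriteCond %{HTTPS} off
RewriteRule ^(.*)$ https://%{HTTP_HOST}/$1 [R=301,L]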

How come when I block a directory in robots.txt, its contents are still coming up?

This is what I've got in my robots.txt, placed in the base directory, of course:
User-Agent: *
Disallow: /foo/
But in Google, /foo/ itself is not indexed, yet /foo/foo.php still shows up as a link in Google's results.
How come? Did I write something incorrectly? Do I need to write something else?
If you put robots.txt in place after your site went live, Google could already have indexed files under /foo/.
You can remove already-indexed files via a Google Webmaster Tools removal request.
robots.txt does not prevent Google from linking to your blocked pages. Google won't index your blocked pages (so it won't show the page title/description/snippet), but if it finds a link to a blocked page, it might still show that link in its search results.
If you also want to forbid this linking, you can use the meta element with robots and noindex, as sketched below.
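
A sketch of that fix for this case: let the crawler back in to the one page (Allow is a widely supported extension, and the longest matching rule wins), then noindex the page itself. The paths are the examples from the question:

User-agent: *
Disallow: /foo/
Allow: /foo/foo.php

<!-- then, in /foo/foo.php -->
<meta name="robots" content="noindex">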

Can I use a `robots.txt` file for a subdirectory on my school's domain?

I own some webspace registered with a university. Google has unfortunately found my CV (resume) on the site, but has mis-indexed it as a scholarly publication, which is screwing up things like citation counts on Google Scholar. I tried uploading a robots.txt into my local subdirectory. The problem is that Google ignores this file and instead uses the rules listed for the school's domain.
That is, the url looks like
www.someschool.edu/~myusername/mycv.pdf
I have uploaded a robots.txt, which can be found here
www.someschool.edu/~myusername/robots.txt
And Google is ignoring it and instead using the robots.txt for the school's domain
www.someschool.edu/robots.txt
How can I make Googlebot ignore my CV?
Sadly, robots.txt is defined to be whatever you get when you GET /robots.txt, so you can't use it for your subdirectory.
What you can do is use the X-Robots-Tag HTTP header, if you can use custom .htaccess files. Here's Google's documentation on X-Robots-Tag.
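
If the webspace runs Apache and honors .htaccess overrides (and mod_headers is enabled), a sketch of that header for the CV, using the placeholder file name from the question:

<Files "mycv.pdf">
  Header set X-Robots-Tag "noindex"
</Files>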

Will Googlebot index my site?

In my robots.txt file, I have the following lines:
User-agent: Googlebot-Mobile
Disallow: /
User-agent: Googlebot
Disallow: /
Sitemap: http://mydomain.com/sitemapindex.xml
I know that with the first four lines Googlebot won't index the site, but if I also include the last line, Sitemap: http://mydomain.com/sitemapindex.xml, will Googlebot be able to index the site?
Thanks,
I tested your robots.txt against my own domain (which has a sitemap entry for every page), and both Googlebot and Googlebot-Mobile reported that they were disallowed access.
Based on this, I would say the robots.txt file takes precedence over any sitemap.
Plus, logically speaking - if you block the entire domain, the bot is disallowed access to the sitemap. The sitemap entry just tells crawlers where to find your sitemap - not their authorization to access it.
Even if you allowed the sitemap, I don't think bots would crawl your site - sitemaps are designed more for telling the bot how often to crawl your site, not what they are allowed to crawl.
No, I don't think Google will do that. It's really a question of good bots versus bad bots: even if you add a robots.txt file to restrict some area, bots can still crawl it if they choose to. robots.txt is just a warning board, not a security wall.
Googlebot will not even be able to touch sitemapindex.xml: robots.txt is a crawler directive, and the sitemap is fetched via the Googlebot crawler, so Googlebot will not access sitemapindex.xml. No crawl coverage means no indexing and no SERP listing.
You can test this with the Google Webmaster Tools robots.txt verification tool and the Fetch as Googlebot feature (in the Labs section).
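
For completeness: if the goal were to block every page but still let Google fetch the sitemap file itself, a sketch like this would leave only the sitemap crawlable (longest-match Allow is a Google extension). As the answers above note, though, that still would not get the blocked pages indexed:

User-agent: Googlebot
Disallow: /
Allow: /sitemapindex.xml
Sitemap: http://mydomain.com/sitemapindex.xml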