robots.txt content / selenium web scraping

robots.txt content / selenium web scraping - robots.txt

I am trying to run web scraping using selenium
What does this robot.txt content mean?
User-Agent: *
Disallow: /go/
Disallow: /launch-announcement/
Can i run web scraping in all folders except go and launch-announcement?

What is a robots.txt file?
Robots.txt is a text file webmasters create to instruct web robots (typically search engine robots) how to crawl pages on their website. The robots.txt file is part of the the robots exclusion protocol (REP), a group of web standards that regulate how robots crawl the web, access and index content, and serve that content up to users. The REP also includes directives like meta robots, as well as page-, subdirectory-, or site-wide instructions for how search engines should treat links (such as “follow” or “nofollow”).
In practice, robots.txt files indicate whether certain user agents (web-crawling software) can or cannot crawl parts of a website. These crawl instructions are specified by “disallowing” or “allowing” the behavior of certain (or all) user agents. view more...
The Disallow: tells the robot that it should not visit the mentioned page on the site.
Can i run web scraping in all folders except go and launch-announcement?
Yes you can scrape the other page except these 2.

According to the basic robots.txt guide, the rule-
User-Agent: *
Disallow: /go/
Disallow: /launch-announcement/
means crawling /go/ and /launch-announcement/ (and their subdirectories) is disallowed for all user agents.

Related

Robots.txt: Allow everything but the root directory

I have a site that is meant to have http://domain.com/blog as the root directory, and any traffic to http://domain.com is redirected to http://domain.com/blog.
This causes a problem cause when I go to Google and do site:domain.com, I see the root directory with the title of one of the first articles on the page. How can I block the root from being crawled, thus not showing up in search?
In webmaster tools I added the site as http://domain.com but I only fetch as google on the /blog directory and other static pages. Is that correct?
I usually know how to do this but this time the site has a sub-directory as the intended root so it's a bit different.
Can someone verify if this will do what I am trying to achieve?
User-agent: *
Allow: /$
Disallow: /

Robots.txt does NOT block a crawler from crawling certain webpages. Robots.txt is simply a text file with a set of guidelines that you ask the crawler to follow it does not at any time block a crawler. If you want to block a certain webpage from being crawl/visited - you will then have to block all access to that page, this includes other users that are not crawlers. But since you have already have it to redirect I see no issue.
Also the $ is not a unified standard, neither is Allow(technically). Try to make it focused on specific bots. Google and Bing recognise the Allow keyword, but many other bots does not.
Also your current robots.txt says this: Do not crawl any pages, but the root
I recommend this as your robots.txt
User-agent: *
Disallow: /
User-agent: googlebot
Disallow: /$
This tells all other bots, but google to not crawl your webpage. And it tells the google crawler not to crawl in root, but everything else is allowed.

How to deny Googlebot only for a specific set of page variables?

I have a page that is https://www.somedomain.com and then under that page I have the option for users to change the language, like
https://www.somedomain.com/?change_language=en&random_id=123
https://www.somedomain.com/?change_language=de&random_id=123
https://www.somedomain.com/?change_language=fr&random_id=123
etc.
Is it possible to deny Googlebot from crawling these links, but still crawl the https://www.somedomain.com/ main page?

You can use robots.txt to target just the query parameter:
User-agent: *
Disallow: /?change_language
This will prevent Google or other good bots from crawling the language options on the homepage. If you want to make it more universal to all pages:
User-agent: *
Disallow: ?change_language
However, you might want to consider letting those language changes to be crawled and instead utilize the rel="alternate" hreflang specification that Google and Bing support.
This way you can indiciate to the engines that the content is in multiple languages allowing your site to get indexed in all the different country specific versions of Google, Bing, and Yahoo.

how to set Robots.txt files for subdomains?

I have a subdomain eg blog.example.com and i want this domain not to index by Google or any other search engine. I put my robots.txt file in 'blog' folder in the server with following configuration:
User-agent: *
Disallow: /
Would it be fine to not to index by Google?
A few days before my site:blog.example.com shows 931 links but now it is displaying 1320 pages. I am wondering if my robots.txt file is correct then why Google is indexing my domain.
If i am doing anything wrong please correct me.

Rahul,
Not sure if your robots.txt is verbatim, but generally the directives are on TWO lines:
User-agent: *
Disallow: /
This file must be accessible from http://blog.example.com/robots.txt - if it is not accessible from that URL, the search engine spider will not find it.
If you have pages that have already been indexed by Google, you can also try using Google Webmaster Tools to manually remove pages from the index.

This question is actually about how to prevent indexing of a subdomain, here your robots file is actually preventing your site from being noindexed.
Don’t use a robots.txt file as a means to hide your web pages from Google search results.
Introduction to robots.txt: What is a robots.txt file used for? Google Search Central Documentation
For the noindex directive to be effective, the page or resource must not be blocked by a robots.txt file, and it has to be otherwise accessible to the crawler. If the page is blocked by a robots.txt file or the crawler can’t access the page, the crawler will never see the noindex directive, and the page can still appear in search results, for example if other pages link to it.
Block Search indexing with noindex Google Search Central Documentation

Proper wildcard Disallow for robots.txt

I am trying to disallow a specific page and its parameters along with a parameter on the entire site. Below I have the exact examples.
We now have a page that will redirect and track exteral urls. Any external URL we want to track will be linked like /redirect?u=http://example.com We do not want to add rel="nofollow" to every link.
Last but not least (our biggest seo and index issue) is every single page has an auto generate URL to disable or enable mobile. So it can be on any page like /?mobileVersion=off (or on) or /accounts?login_to=%2Fdashboard&mobileVersion=off
Basically the easy way to disallow the two parameters would be to disallow mobileVersion and u from any page. (u is the parameter needed to redirect the URL and is only valid on /redirect)
My current robots.txt config:
User-Agent: *
Disallow: /redirect
Disallow: / *?*mobileVersion=off
If you want to see our full robots.txt files its located at http://spicethymeinc.com/robots.txt.

you could change
Disallow: / *?*mobileVersion=off
to
Disallow: /*mobileVersion=off
but it looks like it should work.
I'm going off the wildcard section and examples on this page:
http://tools.seobook.com/robots-txt/
edit: I have tested with the googlebot and googlebot mobile. The are blocked by both your current robots.txt and my suggested change. Google webmaster tools has a handy robots checker you can use to test.

will googlebot index my site?

in my robots.txt file, I have the following line
User-agent: Googlebot-Mobile
Disallow: /
User-agent:GoogleBot
Disallow: /
Sitemap: http://mydomain.com/sitemapindex.xml
I know that if I put the first 4 lines , googlebot won't index the sites, but what if I put the last line Sitemap: http://mydomain.com/sitemapindex.xml, will googlebot be able to index the site?
Thanks,

I tested your robots.txt against my own domain (which has a sitemap entry for every page) and Googlebot and Googlebot-Mobile returned that they were Disallowed access.
Based on this - I would say the robots.txt file takes precedence over any sitemaps.
Plus, logically speaking - if you block the entire domain, the bot is disallowed access to the sitemap. The sitemap entry just tells crawlers where to find your sitemap - not their authorization to access it.
Even if you allowed the sitemap, I don't think bots would crawl your site - sitemaps are designed more for telling the bot how often to crawl your site, not what they are allowed to crawl.

No I dont think Google will do that. Its actually a question of Good bot and Bad bot. Even if you add a robots.txt file to restrict some area Bots can still crawl. Its actually a question of Yes or No. robots.txt is just like a warning board and not a security wall.

googlebot will not even be able to touch the sitemapindex.xml
the robots.txt is a crawler directive.
the sitemap.xml is fetched via the googlebot crawler.
googlebot will not access the sitemapindex.xml
no crawl coverage, no indexing, no SERP listing
you can test this with google webmaster tools robots.txt verification tool and fetch as googlebot (in the labs section) feature.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

robots.txt content / selenium web scraping - robots.txt

I am trying to run web scraping using selenium What does this robot.txt content mean? User-Agent: * Disallow: /go/ Disallow: /launch-announcement/ Can i run web scraping in all folders except go and launch-announcement?

According to the basic robots.txt guide, the rule- User-Agent: * Disallow: /go/ Disallow: /launch-announcement/ means crawling /go/ and /launch-announcement/ (and their subdirectories) is disallowed for all user agents.

Related

Robots.txt: Allow everything but the root directory

How to deny Googlebot only for a specific set of page variables?

how to set Robots.txt files for subdomains?

Proper wildcard Disallow for robots.txt

will googlebot index my site?

Categories

Resources