I just want to know whether you need a robots.txt on your website for the crawler to index the pages, or whether it is just to disallow any content that you don't want indexed.
You don't need a robots.txt for crawlers to index your pages. It's the opposite: it is used to keep certain robots away from some parts of the site.
All you need to know is here: http://www.robotstxt.org/
I disagree: a robots.txt can (and should) be used when one or more sitemaps are in use. Although it is not mandatory, it is highly recommended to reference them there to facilitate indexing.
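As an illustration, a minimal robots.txt that allows every crawler, blocks one directory, and advertises a sitemap could look like this (example.com and /private/ are placeholders):

    User-agent: *
    Disallow: /private/

    Sitemap: https://www.example.com/sitemap.xml

The Sitemap line is what makes the file useful even when you have nothing to disallow.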
I am new to GitHub Pages and was just trying out the links to some pages.
username.github.io works, but www.username.github.io does not. Why is that? I understand the answer is probably in some corner of the internet, but I searched and failed to find an explanation.
That is due to the simple fact that GitHub has not configured its DNS records to support this naming scheme.
While this would be entirely possible using wildcards (see Wildcard DNS record), the web has been shifting away from the www convention for some time now.
The reasons for that shift are largely subjective, so they are out of scope for Stack Overflow, but given the ubiquitous nature of the World Wide Web it can be inferred that the shorter a URL is, the better, if only because people can remember and type it more easily.
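Purely as an illustration of what such a wildcard would look like, a BIND-style zone entry (hypothetical; GitHub does not actually publish this record) might read:

    ; hypothetical wildcard covering www.username.github.io and any other label
    *.username.github.io.   3600  IN  CNAME  username.github.io.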
I know Tumblr makes it very easy to point your Tumblr blog to a specific subdomain by changing the "CNAME" or "A" record in DNS. I wanted to know how I can point the Tumblr blog to website.com/blog rather than the subdomain blog.website.com.
If you simply prefer website.com/blog over blog.website.com, you could set up the former URL as a redirect to blog.website.com. That has the added benefit of allowing people to access the site in two different ways.
However, if the problem is that you can't change your DNS settings, you could use an iframe. That probably isn't a very good idea, though, because it messes with the browser's back and forward buttons and doesn't display the updated URL when you navigate within the frame.
Another non-DNS solution, if you have some kind of scripting available, is to use Tumblr's API to recreate your blog in your own page.
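As a rough sketch of that last idea (the blog name and API key below are placeholders for your own), you could fetch posts from Tumblr's v2 API and render them under /blog yourself:

    # Sketch: pull posts from the Tumblr v2 API so they can be rendered on your own page.
    # "myblog.tumblr.com" and "YOUR_API_KEY" are placeholders.
    import requests

    API_URL = "https://api.tumblr.com/v2/blog/myblog.tumblr.com/posts"

    def fetch_posts(api_key, limit=10):
        # The v2 posts endpoint returns JSON with the posts under response["response"]["posts"].
        resp = requests.get(API_URL, params={"api_key": api_key, "limit": limit})
        resp.raise_for_status()
        return resp.json()["response"]["posts"]

    if __name__ == "__main__":
        for post in fetch_posts("YOUR_API_KEY"):
            # Text posts carry "title" and "body"; other post types have different fields.
            print(post.get("title") or post["type"], post["post_url"])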
Your options are really dependent on your situation, which you haven't told us much about.
I'm maintaining the website http://www.totalworkflow.co.uk and am not sure whether HTTrack follows the instructions given in the robots.txt file. If there is a way to keep HTTrack away from the website, please suggest how to implement it, or just tell me the robot name so I can block it from crawling my website. If this is not possible with robots.txt, please recommend another way to keep this robot away from the website.
You are right that there is no necessity for spam crawlers to follow the guidelines given in the robots.txt file. I know that robots.txt is really only respected by genuine search engines. However, HTTrack could behave like a genuine crawler if its developers hard-coded it not to skip the robots.txt guidelines when they are provided; if that option existed, the application would be really useful for its intended purpose. Coming back to my issue: what I would like to find is a solution that keeps the HTTrack crawlers away without hard-coding anything on the web server. I want to try solving this at the webmaster level first. Still, your idea is a great one to consider in the future. Thank you.
It should obey robots.txt, but robots.txt is something a crawler doesn't have to obey (and, for spam bots, it is actually a handy map of what you don't want other people to see), so what is the guarantee that, even if it obeys robots.txt now, some future version won't have an option to ignore all robots.txt rules and meta tags? I think a better way is to configure your server-side application to detect and block user agents. There is a chance that the user-agent string is hard-coded somewhere in the crawler's source code and the user won't be able to change it to stop you from blocking that crawler. All you have to do is write a server script that records user-agent information (or check your server logs) and then create blocking rules from that information. Alternatively, you can just google a list of known "bad agents". To block user agents on a server that supports .htaccess, have a look at this thread for one way of doing it:
Block by useragent or empty referer
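One possible .htaccess approach along those lines (a sketch assuming Apache with mod_rewrite enabled; "HTTrack" is the substring the stock client puts in its user-agent string, although a user can change it):

    # Refuse any request whose User-Agent contains "HTTrack" (case-insensitive)
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} HTTrack [NC]
    RewriteRule .* - [F,L]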
I'm setting up another website and have run into this issue again. Once and for all, I would like to know what the pros and cons of each method are. No external resource seems to provide a decent answer, so I hoped fellow coders could help me. I don't want to know HOW to do it, that is fairly easy to find out; I just want to know which one (www. to null or null to www.) would be better for my site.
Redirect www.domain.com to domain.com.
Advertising the www. part of a domain is now rather antiquated, and besides, without the www., it's shorter. When we post a domain, the general public assumes that we mean the web service unless we specify otherwise.
If there's a question as to whether it's a domain (for less common Top Level Domains like .cc), I'd rather include http:// than www..
The main reason not to include the www. is that it's shorter (i.e. the www. is not necessary).
Put yourself in the reader's shoes, with their short attention spans. Since the www. does not differentiate your website, you want the reader to see and recognize the differentiating part of your domain immediately. The best way to do this is to put the unique part first (without the www.). Plus, social networking and the mobile space like shorter links.
To summarize, the trend is to no longer use www., but to redirect for the people who are in the habit of typing www..
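For reference, a typical way to do that redirect on Apache (a sketch assuming mod_rewrite and the placeholder domain.com) is:

    # 301-redirect www.domain.com to domain.com, preserving the requested path
    RewriteEngine On
    RewriteCond %{HTTP_HOST} ^www\.domain\.com$ [NC]
    RewriteRule ^(.*)$ http://domain.com/$1 [R=301,L]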
As far as I know there is no technical or SEO advantage either way, as long as the redirects work properly.
I prefer no-www, because the 'www.' is simply unnecessary.
Is there a setting I can toggle, or a DownloaderMiddleware I can use, that will enforce the Crawl-delay setting of robots.txt? If not, how do I implement rate limiting within a scraper?
There is a feature request (#892) to support this in Scrapy, but it is not currently implemented.
However, #892 includes a link to a code fragment that you could use as a starting point to create your own implementation.
If you do, and you are up to the task, consider sending a pull request to Scrapy to integrate your changes.
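In the meantime, a rough workaround (a sketch for your project's settings.py; the numeric values are arbitrary examples, not anything read from robots.txt) is to approximate the Crawl-delay yourself with Scrapy's built-in delay and AutoThrottle settings:

    # settings.py (sketch): approximate a robots.txt Crawl-delay by hand.
    ROBOTSTXT_OBEY = True            # still honour Allow/Disallow rules
    DOWNLOAD_DELAY = 5               # seconds between requests to the same domain
    RANDOMIZE_DOWNLOAD_DELAY = True  # add jitter so requests are less mechanical
    CONCURRENT_REQUESTS_PER_DOMAIN = 1

    # Or let AutoThrottle adjust the delay based on observed server latency:
    AUTOTHROTTLE_ENABLED = True
    AUTOTHROTTLE_START_DELAY = 5
    AUTOTHROTTLE_MAX_DELAY = 60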
A spider may or may not respect the crawl delay in robots.txt; parsing robots.txt is not mandatory for bots!
You can use a firewall that will ban an IP that is crawling your website aggressively.
Do you know which bots are causing you trouble? Googlebot and the other big search engines use bots that try not to overwhelm your server.