How can I keep HTTrack crawlers away from my website through robots.txt?

I maintain the website http://www.totalworkflow.co.uk and I'm not sure whether HTTrack follows the instructions given in the robots.txt file. If there is any way to keep HTTrack away from the website, please suggest how to implement it, or just tell me the robot name so I can block it from crawling my website. If this is not possible with robots.txt, please recommend another way to keep these robots away from the website.
You are right that there is no necessity for spam crawlers to follow the guidelines in the robots.txt file; I know robots.txt is really only respected by genuine search engines. However, HTTrack may behave like a genuine crawler if its developers hard-coded it not to skip the robots.txt guidelines when they are provided; if that option is there, the application is really useful for its intended purpose. Anyway, back to my issue: what I would like to find is a solution that keeps the HTTrack crawlers away without hard-coding anything on the web server. I want to try to solve this at the webmaster level first. Your idea is a good one to consider in the future, though. Thank you.

HTTrack should obey robots.txt, but robots.txt is something a crawler doesn't have to obey (and for spam bots it's actually a handy index of exactly what you don't want other people to see). So even if it obeys robots.txt now, what's the guarantee that some future version won't add an option to ignore all robots.txt rules and meta tags?

I think a better way is to configure your server-side application to detect and block user agents. There is a chance that the user-agent string is hard-coded somewhere in the crawler's source code and the user won't be able to change it to stop you from blocking that crawler. All you have to do is write a server script to log user-agent information (or check your server logs) and then create blocking rules from that information. Alternatively, you can just google a list of known "bad agents".

To block user agents on a server that supports .htaccess, have a look at this thread for one way of doing it:
Block by useragent or empty referer
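For example, on Apache with mod_rewrite this can be done with a rewrite rule keyed on the User-Agent header. The snippet below is only a minimal sketch under that assumption: HTTrack's default user-agent string contains "HTTrack", but the user can change it, so treat this as a first line of defence rather than a guarantee.

    # .htaccess (assumes Apache with mod_rewrite enabled)
    RewriteEngine On
    # Return 403 Forbidden to anything identifying itself as HTTrack
    RewriteCond %{HTTP_USER_AGENT} httrack [NC]
    RewriteRule .* - [F,L]

The same pattern extends to any other unwanted user agent you spot in your logs.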

Related

Securing usage of a REST API when using an SPA without authentication

After reading all the threads on Stack Overflow and other platforms, I still wasn't able to find an answer that satisfies me.
The task:
I want to create a single page application (SPA) which receives data from a REST API. In this SPA, NO authentication should be used. It's a public site.
But the REST API should only be accessible from people who loaded the SPA from my webserver.
I assume this is only solvable with something on the server side like sessions, cookies, etc. Otherwise I'm open to your suggestions and solutions.
Thanks in advance!
There's no reasonably easy way to do this. You can easily prevent other domains (in browsers) from accessing an API on your domain (via CORS), but it's significantly harder to prevent scripts from doing this.
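For the browser side of that, restricting which origins may call the API is just a matter of response headers. The following is a minimal sketch, assuming a Flask backend and a hypothetical allowed origin (neither is from the question):

    # Minimal CORS sketch; Flask and the origin URL are illustrative assumptions.
    from flask import Flask, jsonify

    app = Flask(__name__)

    @app.route("/api/items")
    def items():
        return jsonify({"items": []})

    @app.after_request
    def add_cors_header(response):
        # Browsers enforce this header; command-line scripts simply ignore it,
        # which is exactly the limitation described here.
        response.headers["Access-Control-Allow-Origin"] = "https://app.example.com"
        return response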
The issue lies in 'how do you tell legitimate browser traffic from a script?'. It turns out that this is not easy. You could try to detect 'unusual behavior' as much as possible (for example, a large number of requests in a short time), but that doesn't stop clients that are slower.
Ultimately if people want your data, they will find some way around whatever restrictions you come up with. You should reevaluate this and use one of the following options:
Don't build an SPA plus API. Although one could argue that if the data exists in HTML, it can still be crawled.
Add authentication. But obviously this won't help you in any way if anyone can authenticate.
Re-evaluate why you have this restriction. What are you worried about? If you're worried about people taking your data and using it elsewhere, how does only showing it in a browser from 1 domain help with that? If you're worried about copyright theft, why not use a legal approach to this?
I've seen a lot of these types of questions, but in my opinion I haven't yet seen one that has a legitimate good reason to want this. But, maybe you're the first.
I believe I answered my own question in a comment 30 minutes ago... I think that with a captcha I'm able to secure the REST API against unwanted access.

Redirect www.domain.com to domain.com, or the other way round?

I'm setting up another website and keep running into this issue. Once and for all, I would like to know the pros and cons of each method. No external resource seems to provide a decent answer, so I hoped fellow coders could help me. I don't want to know HOW to do it (that is fairly easy to find out); I just want to know which one (www. to null, or null to www.) would be better for my site.
Redirect www.domain.com to domain.com.
Advertising the www. part of a domain is now rather antiquated, and besides, without the www., it's shorter. When we post a domain, the general public assumes that we mean the web service unless we specify otherwise.
If there's a question as to whether it's a domain (for less common Top Level Domains like .cc), I'd rather include http:// than www..
The main reason not to include the www. is that it's shorter (i.e., the www. is not necessary).
Put yourself in the reader's shoes, with their short attention span. Since the www. does not differentiate your website, you want the reader to see and recognize the differentiating part of your domain immediately. The best way to do this is to put the unique part first (without the www.). Plus, social networks and the mobile space favor shorter links.
To summarize, the trend is to no longer use www., but to redirect for the people who are in the habit of typing www.
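For readers who do want the mechanics, a minimal sketch of that redirect for Apache with mod_rewrite (the domain is a placeholder) looks like this:

    # .htaccess (assumes Apache with mod_rewrite; example.com is a placeholder)
    RewriteEngine On
    # Permanently redirect www.example.com/... to example.com/...
    RewriteCond %{HTTP_HOST} ^www\.example\.com$ [NC]
    RewriteRule ^(.*)$ http://example.com/$1 [R=301,L]

Swapping the condition and the target gives the opposite (non-www to www) redirect.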
As far as I know there is no technical or SEO advantage either way, as long as the redirects work properly.
I prefer no-www, because the 'www.' is simply unnecessary.

Which is the easiest way for Scrapy scrapers to respect Crawl-Delay in robots.txt?

Is there a setting I can toggle, or a DownloaderMiddleware I can use, that will enforce the Crawl-Delay setting of robots.txt? If not, how do I implement rate limiting within a scraper?
There is a feature request (#892) to support this in Scrapy, but it is not currently implemented.
However, #892 includes a link to a code fragment that you could use as a starting point to create your own implementation.
If you do, and you are up to the task, consider sending a pull request to Scrapy to integrate your changes.
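In the meantime, one workaround is to read the Crawl-delay value yourself when the spider starts and map it onto Scrapy's per-spider download_delay attribute. The sketch below is only a starting point under those assumptions (it uses Python's standard urllib.robotparser; the spider and domain names are illustrative) and is not the code linked from #892:

    # Sketch: honour robots.txt Crawl-delay via Scrapy's per-spider download_delay.
    import urllib.robotparser

    import scrapy


    def robots_crawl_delay(robots_url, user_agent="*"):
        """Return the Crawl-delay declared in robots.txt, or None if absent."""
        parser = urllib.robotparser.RobotFileParser()
        parser.set_url(robots_url)
        parser.read()
        return parser.crawl_delay(user_agent)


    class PoliteSpider(scrapy.Spider):
        name = "polite"
        start_urls = ["https://example.com/"]

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            delay = robots_crawl_delay("https://example.com/robots.txt")
            if delay is not None:
                # Scrapy's downloader reads this attribute when scheduling requests.
                self.download_delay = float(delay)

        def parse(self, response):
            yield {"url": response.url}

This applies a single delay to the whole spider rather than per domain; for broad, multi-domain crawls the fragment linked from #892 is a better model.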
A spider may or may not respect the crawl delay in robots.txt; parsing robots.txt is not mandatory for bots!
You can use a firewall that will ban an IP that is crawling your website aggressively.
Do you know which bots cause you trouble? Googlebot and the other big search engines' bots try not to overwhelm your server.

Ethics of robots.txt [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 9 years ago.
I have a serious question. Is it ever ethical to ignore the presence of a robots.txt file on a website? These are some of the considerations I've got in mind:
If someone puts a website up, they're expecting some visits. Granted, web crawlers use bandwidth without clicking on the ads that may support the site, but the site owner is putting their site on the web, so how reasonable is it for them to expect that they'll never get visited by a bot?
Some sites apparently use a robots.txt exactly in order to keep their site from being crawled by Google or some other utility that might grab prices and therefore allow people to do price comparisons easily. They have private search engines on the site so they obviously want people to be able to search the site; apparently they just don't want people to be able to easily compare their information with other vendors.
As I said, I'm not trying to be argumentative; I would just like to know if anyone has ever come up with a case where it's ethically permissible to ignore the presence of a robots.txt file. I cannot think of one, mainly because people (or businesses) are paying money to put up their web sites, so they should be able to tell the Googles/Yahoos/other SEs of the world that they don't want to be in their indices.
To put this discussion in context, I'd like to create a price comparison website and one of the major vendors has a robots.txt that basically prevents anyone from grabbing their prices. I'd like to be able to get their information but, as I said, I can't justify simply ignoring the wishes of the site owner.
I have seen some very sharp discussion here and that's why I would like to hear the opinions of developers that follow Stack Overflow.
By the way, there is some discussion of this topic on a Hacker News question but they seem to mainly focus on the legal aspects of this.
Arguments:
A robots.txt file is an implied license, especially since you are aware of it. Thus, continuing to scrape their site could be seen as unauthorized access (i.e., hacking). Sucks, but arguments like this have been made in other legal cases recently (not directly related to robots.txt, but in relation to other "passive controls").
Grabbing prices violates no copyright law, including the DMCA, since copyright does not cover factual information, only creative work.
Ethically, you should not grab prices because the vendor should have the ability to change prices without worrying about being accused of a bait/switch by people coming from your site.
Have you taken the high road, explaining the site to them and saying you'd love to include them in your list of vendors? Maybe they will love the idea and actually expose the data in a way that is easy for you to consume and less resource-intensive for them to produce.
There are no laws written directly about robots.txt because netiquette is generally followed. Don't be one of the "bad guys."
Some people filter robots because they use URL links to perform "actions" like adding things to carts, and robots leave them with massive numbers of abandoned shopping carts in their database.
Some people filter robots because they have exclusive prices that they can't advertise openly based on agreements with their vendors. You could be putting them in a bad position by exposing those prices on your site.
In this economy, if a company doesn't want to do everything possible to advertise themselves, it's their own fault that you don't include them.
The other use of robots.txt is to help protect web spiders from themselves. It's relatively easy for a web spider to get mired in an infinitely deep forest of links, and a properly constructed robots.txt file will tell the spider that "you don't need to go here".
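For instance, a hypothetical robots.txt that steers well-behaved spiders away from an auto-generated calendar section (a classic infinite-link trap) could be as short as:

    # Hypothetical robots.txt: /calendar/ generates endless "next month" links
    User-agent: *
    Disallow: /calendar/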
Many people have tried to build businesses off building "price comparison" engines that scraped major sites.
Once you start getting any sort of traffic/revenue to speak of, you will receive a cease and desist. It's happened to dozens, if not hundreds of projects. I even worked on a small project that received a C&D from Craigslist.
You know how they say "It's easier to ask forgiveness than it is to get permission"? It doesn't hold true with page scraping. Get permission, or you will be hearing from their lawyers.
If you're lucky, it'll be early on, when you've got nothing to lose. If it's late, you may lose your business and all your work overnight, with a single letter.
Getting permission shouldn't be hard. Unless you're doing something sneaky, you're likely going to drive them additional traffic. Hell, once your product takes off, sites may be begging you, or even paying you to add their data.
One reason we allow robots to dig through the web without complaint is that we have a way to stop them if we want to. Protects both sides.
Remember the uproar when Cuil's robots were accused of going over-the-top, apparently acting like a DoS attack in some cases and using up bandwidth allowances of some small sites?
If too many people violate robots.txt we might get something worse.
"No" means "no".
To answer the narrow question: for the price comparison website, you're probably best off grabbing the price in real time, rather than scraping the database in advance. It's hard to imagine that being a problem.
An interesting IRL version of this story, involving The Harvard Coop:
Coop Calls Cops On ISBN Copiers.
Short answer: No.
On the narrow issue: If a seller says that their prices are secret, I think you have to respect that. I'd contact them and ask if they really don't want price comparison engines like yours to include them, or if the "no trespassing" sign is for technical reasons. If the latter, perhaps they'll provide you with an alternative. If the former, then I'd say too bad, they don't get included, they lose some business, and it's their problem.
Tangential rant: Personally, I get pretty annoyed with companies that make me jump through hoops to find out the price of their products, places that make me call and talk to a salesman so he can give me a hard-sell pitch, or worse, make me give them my phone number so their salesman can call and harass me. I figure that if they're afraid to tell me the price, it probably means that it's too high.
In general: A robots.txt file is like a "No Trespassing" sign. It's the owner's right to say who is allowed on their property. If you think their reasons are dumb, you can politely suggest they take the sign down. But you don't have the right to disregard their wishes. If someone puts a No Trespassing sign on his yard, and I say, "Hey, I just want to take a quick short cut, what's the big deal?" -- Maybe I'm stepping on his prized Bulgarian violet bulbs and destroying a valuable investment. Maybe I'm crossing his people's sacred burial ground and offending their religious sensibilities. Or maybe he's just an ornery jerk. But it's still his property and his right. Oh, and if I fall into the dangerous sinkhole after ignoring the No Trespassing sign, who's to blame? (In America, I could probably still sue him for all he's worth despite the fact that he warned me, but is that right?)
I'm showing some ignorance here, but I always thought a bot was something only sent out by a search engine. Like Google or Yahoo.
Thus, if you wrote an application that searched content on the internet, I wouldn't consider that a search engine bot, which to my knowledge is what robots.txt is trying to block.
But this may just be selective ignorance, because I might do it until the webmaster of that site contacted me and asked me to stop :)
If people make something available for public access, they shouldn't try to put limits on it. Adding a robots.txt file to your site is the equivalent of putting a sign on your lawn that says "Please don't look at me."

Email obfuscation question

Yes, I realize this question was asked and answered, but I have specific questions about this that I feel were not clear on that thread and I'd prefer not to get lost in the shuffle on another thread as well.
Previous threads said that rendering the email address to an image, the way Facebook does, is overkill and an unprofessional user experience for business/professional websites. And it seems that the general consensus is to use a JavaScript document.write solution using HTML entities or some other method that breaks up and/or makes the string unreadable by a simple bot. The application I'm building doesn't even need the "mailto:" functionality; I just need to display the email address. Also, this is a business web application, so it needs to look/act as professional as possible. Here are my questions:
If I go the document.write route and pass the html entity version of each character, are there no web crawlers sophisticated enough to execute the javascript and pull the rendered text anyway? Or is this considered best practice and completely (or almost completely) spammer proof?
What's so unprofessional about the image solution? If Facebook is one of the highest trafficked applications in the world and not at all run by amateurs, why is their method completely dismissed in the other thread about this subject?
If your answer (as in the other thread) is to not bother myself with this issue and let the users' spam filters do all the work, please explain why you feel this way. We are displaying our users' email addresses that they have given us, and I feel responsible to protect them as much as I can. If you feel this is unnecessary, please explain why.
Thanks.
It is not spammer proof. If someone looks at the code for your site and determines the pattern that you are using for your email addresses, then specific code can be written to try and decipher that.
I don't know that I would say it is unprofessional, but it prevents copy-and-paste functionality, which is quite a big deal. With images, you simply don't get that functionality. What if you want to copy a relatively complex email address to your address book in Outlook? You have to resort to typing it out which is prone to error.
Moving the responsibility to the users' spam filters is really a poor response. While I believe that users should be diligent in guarding against spam, that doesn't absolve the person publishing the address of responsibility.
To that end, trying to do this in an absolutely secure manner is nearly impossible. The only way to do that would be to have a shared secret which the code uses to decipher the encoded email address. The problem is that, because the JavaScript is interpreted on the client side, there isn't anything you can keep secret from scrapers.
Encoders for email addresses nowadays generally work because most email-harvesting bots aren't going to concern themselves with coding specifically for every site. They are going to use a minimal algorithm that gets maximum results (the payoff isn't worth it otherwise). Because of this, simple encoders will defeat most bots. But if someone REALLY wants to get at the emails on your site, then they can, and probably quite easily, since the code that writes the addresses is publicly available.
Taking all this into consideration, it makes sense that Facebook went the image route. Because they can alter the image to make OCR all but impossible, they can virtually guarantee that email addresses won't be harvested. Given that they are probably one of the largest email address repositories in the world, it could be argued that they carry a heavier burden than any of us, and while inconvenient, are forced down that route to ensure security and privacy for their vast user base.
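As a concrete illustration of the document.write / HTML-entity approach the question describes, here is a minimal sketch (the address is purely illustrative). It defeats harvesters that regex-scan the raw HTML, but not a bot that actually executes JavaScript, which is exactly the trade-off described above:

    <!-- The raw HTML never contains the literal address, so a regex looking
         for "name@domain" in the page source finds nothing. -->
    <script>
      var user = 'jane' + '.' + 'doe';      // illustrative address parts
      var host = 'example' + '.' + 'com';
      // "&#64;" is the HTML entity for "@"; document.write output is parsed
      // as HTML, so the visitor sees a normal "@" on the page.
      document.write(user + '&#64;' + host);
    </script>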
There are quite a few reasons JavaScript is a good solution for now (that may change as the landscape evolves).
JavaScript obfuscation is a better mousetrap for now.
You just need to outrun the others. As long as there is low-hanging fruit, spammers will go for that. So unless everyone starts moving to JavaScript, you're okay, for now at least.
Most spammers use HTTP-based scripts which GET the page and parse it with regexes. Using a JavaScript engine to parse instead is certainly possible, but it will slow things down.
Regarding the Facebook solution, I don't consider it unprofessional, but I can clearly see why purists may disagree:
It breaks accessibility standards (it cannot be parsed by browsers or screen readers, or be clicked).
It breaks the semantic structure (it's an image, not a mailto: link anymore).
It breaks the presentational layer. If you increase the browser's default font size or use high-contrast custom CSS, it won't apply to the email.
Here is a nice blog post comparing a few methods, with benchmarks.
http://techblog.tilllate.com/2008/07/20/ten-methods-to-obfuscate-e-mail-addresses-compared/