What is the easiest way for Scrapy scrapers to respect Crawl-Delay in robots.txt?

Is there a setting I can toggle, or a DownloaderMiddleware I can use, that will enforce the Crawl-Delay setting from robots.txt? If not, how do I implement rate limiting within a scraper?

There is a feature request (#892) to support this in Scrapy, but it is not currently implemented.
However, #892 includes a link to a code fragment that you could use as a starting point to create your own implementation.
If you do, and you are up to the task, consider sending a pull request to Scrapy to integrate your changes.
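In the meantime, you can approximate a site's Crawl-Delay with Scrapy's built-in settings; a minimal sketch (the delay value of 5 seconds is a placeholder you would copy by hand from the target site's robots.txt):

    # settings.py -- a minimal sketch; set DOWNLOAD_DELAY to the site's Crawl-Delay value
    ROBOTSTXT_OBEY = True               # honours Allow/Disallow rules, but not Crawl-Delay
    DOWNLOAD_DELAY = 5                  # fixed delay in seconds between requests to the same site
    RANDOMIZE_DOWNLOAD_DELAY = True     # jitter the delay between 0.5x and 1.5x
    CONCURRENT_REQUESTS_PER_DOMAIN = 1  # never hit the same domain in parallel

    # Or let AutoThrottle adapt the delay to server response times instead:
    AUTOTHROTTLE_ENABLED = True
    AUTOTHROTTLE_START_DELAY = 5
    AUTOTHROTTLE_MAX_DELAY = 60

If you want to read the value programmatically rather than by hand, Python's urllib.robotparser.RobotFileParser exposes a crawl_delay() method you could call at start-up to set DOWNLOAD_DELAY.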

A spider may or may not respect the crawl delay in robots.txt; bots are not required to parse robots.txt at all.
You can use a firewall that bans any IP that is crawling your website aggressively.
Do you know which bots are causing you trouble? Googlebot and the crawlers of other big search engines try not to overwhelm your server.
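If you do go the firewall/banning route, the detection side can be as simple as counting requests per IP over a sliding window; a rough sketch (the 60-second window and 120-request threshold are arbitrary assumptions, not recommendations):

    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 60   # assumed sliding window
    MAX_REQUESTS = 120    # assumed threshold (~2 requests/second sustained)

    hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def is_aggressive(ip, now=None):
        """Record a request from `ip` and return True once it exceeds the threshold."""
        now = now if now is not None else time.time()
        q = hits[ip]
        q.append(now)
        while q and now - q[0] > WINDOW_SECONDS:  # drop timestamps outside the window
            q.popleft()
        return len(q) > MAX_REQUESTS

The IPs this flags would then be handed to whatever actually does the banning (fail2ban, an iptables drop list, your hosting provider's firewall, and so on).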

Related

Is building your own CDN worth it?

I've recently been looking at how I could speed up page loading on my website, and specifically at reducing the response time between my server and the CDNs I use (FontAwesome, jQuery, BootstrapCDN, and CloudFlare), since I figured that it was highly dependent on the traffic on those big CDNs. I thought that if I built my own CDN (via a subdomain on my server), the traffic would be a lot smaller and hence more fluid. However, since I'm not an expert at all on that matter, I'd like to know if I'm right about that, and whether it would be worth doing in terms of performance.
Thanks!
If you had to ask, then no.
The first strike against the idea is CloudFlare. By using CloudFlare, most of the cacheable traffic of your website should already be flowing between the user's browser (which can be anywhere in the world) and the nearest CloudFlare endpoint. Unless you have mirrors all over the globe, CloudFlare should be faster than your own CDN.
By using BootstrapCDN (which includes FontAwesome) and the jQuery CDN, if the user's browser has visited any other site powered by BootstrapCDN or the jQuery CDN in the recent past, and assuming it uses the same resources, those resources will not be re-downloaded. This means using your own CDN would always add traffic.

Securing usage of a REST API when using an SPA without authentication

After reading all the threads on Stack Overflow and other platforms, I still wasn't able to find an answer that satisfies me.
The task:
I want to create a single page application (SPA) which receives data from a REST API. In this SPA, NO authentication should be used. It's a public site.
But the REST API should only be accessible to people who loaded the SPA from my web server.
I assume this is only solvable with something on the server side, like sessions, cookies, etc. Otherwise, I'm open to your suggestions and solutions.
Thanks in advance!
There's no reasonably easy way to do this. You can easily prevent other domains (in browsers) from accessing an API on your domain (via CORS), but it's significantly harder to prevent scripts from doing this.
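A minimal sketch of that CORS restriction, using Flask purely as an example (the allowed origin below is a placeholder for wherever the SPA is served from):

    from flask import Flask, jsonify

    app = Flask(__name__)

    @app.after_request
    def add_cors_headers(response):
        # Browsers will refuse to let pages served from other origins read this
        # response. It does nothing against curl, scripts, or other non-browser clients.
        response.headers["Access-Control-Allow-Origin"] = "https://spa.example.com"
        response.headers["Vary"] = "Origin"
        return response

    @app.route("/api/data")
    def data():
        return jsonify({"items": []})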
The issue lies in how you distinguish legitimate browser traffic from a script. It turns out that this is not easy. You could try to detect 'unusual behavior' as much as possible (for example, a large number of requests in a short time), but this doesn't stop clients that simply go slower.
Ultimately if people want your data, they will find some way around whatever restrictions you come up with. You should reevaluate this and use one of the following options:
Don't build an SPA and API at all. Although, as one could point out, if the data ends up in HTML it can still be crawled.
Add authentication. But obviously this won't help you in any way if anyone can authenticate.
Re-evaluate why you have this restriction. What are you worried about? If you're worried about people taking your data and using it elsewhere, how does only showing it in a browser from one domain help with that? If you're worried about copyright theft, why not take a legal approach instead?
I've seen a lot of these types of questions, but in my opinion I haven't yet seen one with a genuinely good reason to want this. But maybe you're the first.
I believe I answered my own question in a comment 30 minutes ago... I think that with a captcha I'm able to secure the REST API against unwanted access.

CQ5 Dispatcher: is it a must or optional?

We are getting a lot of problems with the Dispatcher. As per the CQ5 documentation, the Dispatcher is a caching and/or load-balancing tool, so per my analysis we can also go without the Dispatcher. Am I correct? I want to integrate a Squid or Varnish web cache with my Apache server and shut down the Dispatcher. Would that be a good option?
Any views/help is appreciated.
Yes, it's perfectly possible to run a website without the Dispatcher in front. Your options would then seem to come down to:
No caching
Implementing a cache in front of the Publish instance (e.g. Squid/Varnish, as you mentioned; configuration required)
Integrate a caching solution in Java that you can apply to parts of your templates/components individually (development required)
Also, you'd need to check with Adobe what level of support they'd give you for any of the above solutions before undertaking them. If you like, you could post specific questions to SO around the problems you're facing with the Dispatcher and you may get some resolutions too.
I was told that you should use Dispatcher servers in front of your Publish instance because it really helps loading times. There was also documentation with a table showing how much it affects performance depending on the number of documents served.
To avoid caching problems, you can specify files, folders, or file types that should never be cached. You can also specify caching behaviour in the source code of the pages. In addition, making changes to content on your Author instance triggers a flush on the Dispatcher for the affected content, to make sure that no cached old version is being served.
Last but not least, using an Apache server also allows you to handle virtual hosts and rewrite rules easily.
It's a must.
If you are having problems with the Dispatcher, that could be a sign that you are using the wrong platform for your development needs, seeing as you need to resort to technologies that are not required for AEM.

How to keep HTTrack Crawlers away from my website through robots.txt?

I'm maintaining the website http://www.totalworkflow.co.uk and I'm not sure whether HTTrack follows the instructions given in the robots.txt file. If there is a way to keep HTTrack away from the website, please suggest how to implement it, or just tell me the robot name so I can block it from crawling my website. If this is not possible with robots.txt, please recommend some other way to keep these robots away from the website.
You are right that there is no requirement for spam crawlers to follow the guidelines given in the robots.txt file; I know that robots.txt is really only for genuine search engines. However, HTTrack could behave genuinely if its developers hard-coded it not to skip the robots.txt guidelines when they are provided; if that option exists, the application would be really useful for its intended purpose. To come back to my issue: what I would like to find is a solution that keeps the HTTrack crawlers away without hard-coding anything on the web server. I'm trying to solve this at the webmaster level first. However, your idea is a good one to consider in the future. Thank you.
It should obey robots.txt, but robots.txt is something you don't have to obey (and it is actually a pretty good way for spam bots to find what you don't want other people to see), so what's the guarantee that, even if it obeys robots.txt now, there won't be an option to ignore all robots.txt and meta tags at some point in the future? I think a better way is to configure your server-side application to detect and block user agents. There is a chance that the user agent string is hard-coded somewhere in the crawler's source code and the user won't be able to change it to stop you from blocking that crawler. All you have to do is write a server script to spit out user agent information (or check the server logs) and then create blocking rules based on that information; a rough sketch of the log approach follows the link below. Alternatively, you can just google a list of known "bad agents". To block user agents on a server that supports .htaccess, have a look at this thread for one way of doing it:
Block by useragent or empty referer
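As a rough illustration of the "check the server logs" idea, here is a sketch that tallies user-agent strings from a combined-format Apache access log so you can see what to block; the log path and the "httrack" substring check are assumptions about your setup:

    import re
    from collections import Counter

    LOG_PATH = "/var/log/apache2/access.log"   # adjust to your server
    # Combined log format ends with: "referer" "user-agent"
    UA_RE = re.compile(r'"[^"]*" \d+ \S+ "[^"]*" "([^"]*)"$')

    counts = Counter()
    with open(LOG_PATH, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            match = UA_RE.search(line.strip())
            if match:
                counts[match.group(1)] += 1

    for agent, n in counts.most_common(20):
        flag = "  <-- looks like HTTrack" if "httrack" in agent.lower() else ""
        print(f"{n:6d}  {agent}{flag}")

Anything flagged here (HTTrack's default user agent contains "HTTrack") can then go into your .htaccess or firewall blocking rules.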

Send a request with GWT to a different domain

Is there a way I can make a request to a different server than the one that's being used for development using a RequestBuilder?
I keep getting
com.google.gwt.http.client.RequestPermissionException: The URL
http://127.0.0.1:4321/getSellers is invalid or violates the same-origin
security restriction
while I am sending the request from 127.0.0.1:8888.
GWT currently doesn't support cross-domain AJAX calls, but this can be worked around if you are willing to do a bit of JSNI. I also heard a rumour some time ago that there is a GWT patch with a solution, but it's not perfect. See this thread: http://groups.google.com/group/Google-Web-Toolkit-Contributors/browse_thread/thread/94c18c4ec158070c/
To work around it using JSNI, you can use what's called the window.name transport; see this blog post: http://www.sitepen.com/blog/2008/07/22/windowname-transport/. I haven't been able to find a GWT library that automates this, but I don't think it's too hard to do yourself in JSNI (and don't be misled by the blog post being about Dojo; it's a general technique).
There is a detailed explanation on the topic of the Same Origin Policy and its consequences for developing with GWT here:
http://code.google.com/p/google-web-toolkit-doc-1-5/wiki/FAQ_SOP
The simple answer is: No, that's something that is disallowed for security reasons.
However, it should be possible to work around this limitation with all kinds of techniques (proxy servers, Yahoo Pipes, etc.). As I'm no AJAX expert, I will leave the explanation of those to others.
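As a rough sketch of the proxy-server idea: the browser only ever talks to your own origin, and a small server-side proxy forwards the API calls to the other host. Flask and the requests library are used purely for illustration here, and the backend URL is taken from the question:

    from flask import Flask, Response, request
    import requests

    app = Flask(__name__)
    BACKEND = "http://127.0.0.1:4321"   # the other-origin service from the question

    @app.route("/proxy/<path:path>", methods=["GET", "POST"])
    def proxy(path):
        # Forward the incoming request to the backend and relay its response.
        upstream = requests.request(
            method=request.method,
            url=f"{BACKEND}/{path}",
            params=request.args,
            data=request.get_data(),
            headers={k: v for k, v in request.headers if k.lower() != "host"},
            timeout=10,
        )
        return Response(
            upstream.content,
            status=upstream.status_code,
            content_type=upstream.headers.get("Content-Type", "application/octet-stream"),
        )

The GWT RequestBuilder would then call /proxy/getSellers on its own origin (127.0.0.1:8888) instead of http://127.0.0.1:4321/getSellers directly, and the same-origin check is satisfied.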