What would be the best distributed storage solution for a heavy-use web scraper/crawler?

I'm implementing a web scraper that needs to scrape and store about 15GB+ of HTML files a day. The amount of daily data will likely grow as well.
I intend on storing the scraped data as long as possible, but would also like to store the full HTML file for at least a month for every page.
My first implementation wrote the HTML files directly to disk, but that quickly ran into inode limit problems.
The next thing I tried was using Couchbase 2.0 as a key/value store, but the Couchbase server would start to return Temp_OOM errors after 5-8 hours of web scraping writes. Restarting the Couchbase server is the only route for recovery.
Would MongoDB be a good solution? This article makes me worry, but it does sound like their requirements are beyond what I need.
I've also looked a bit into Cassandra and HDFS, but I'm not sure if those solutions are overkill for my problem.
As for querying the data, as long as I can get the page data for a specific URL and date, that's good enough. The data is mostly write once, read once, and then stored for possible reads in the future.
Any advice pertaining to storing such a large amount of HTML files would be helpful.

Assuming 50 kB per HTML page, 15 GB daily gives us 300,000+ pages per day, about 10 million monthly.
MongoDB will definitely work well with this data volume. Concerning its limitations, it all depends on how you plan to read and analyze the data. You may be able to take advantage of its map/reduce features given that amount of data.
However, if your problem may scale further, you may want to consider other options. It's worth noting that Google's search engine uses BigTable to store HTML data; in that sense, Cassandra could be a good fit for your use case. Cassandra offers excellent, durable write/read performance and scales horizontally well beyond your data volume.

I'm not sure what deployment scenario gave you those errors when you used Couchbase; more investigation is probably needed to find the cause. You need to trace the errors back to their source, because, given the requirements described above, it should work fine and shouldn't fall over after 5-8 hours (unless you have a storage problem).
I suggest that you give MongoDB a try; it is very powerful, well suited to what you need, and shouldn't struggle with the requirements you mentioned above.
You can use HDFS, but you don't really need it when MongoDB (or even Cassandra) can do the job.
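If you do go the MongoDB route, a minimal sketch of the URL-plus-date lookup pattern might look like the following (pymongo, with hypothetical database/collection/field names; gzip keeps ~50 kB pages far below the 16 MB document limit):

    import gzip
    from datetime import datetime, timezone

    from pymongo import ASCENDING, MongoClient

    client = MongoClient("mongodb://localhost:27017")
    pages = client.scraper.pages

    # Compound index so "page for this URL on this date" lookups stay fast.
    pages.create_index([("url", ASCENDING), ("fetched_at", ASCENDING)])

    def store_page(url, html):
        pages.insert_one({
            "url": url,
            "fetched_at": datetime.now(timezone.utc),
            # Compressed HTML is stored as BSON binary data.
            "html_gz": gzip.compress(html.encode("utf-8")),
        })

    def load_page(url, day_start, day_end):
        doc = pages.find_one(
            {"url": url, "fetched_at": {"$gte": day_start, "$lt": day_end}}
        )
        return gzip.decompress(doc["html_gz"]).decode("utf-8") if doc else None

Expiring documents older than a month is then a delete_many on fetched_at, or a TTL index if you want MongoDB to handle it automatically.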

Related

Redis / Memcached ReST caching for an external service

Question here about caching data from calls to an external ReST API.
There is currently a ReST service set up to generate and retrieve some specific types of reports that the UI must consume. However, this service is not meant for high-volume usage or to be exposed to the public, and the reports are fairly static, possibly changing only every 10-20 minutes. The web application resides on a separate server.
What I would like to do, using Memcached or Redis, is this: when a request for data comes in from the UI to the web back-end, the web application back-end calls the report server for the specified report, transforms the data into the format the UI consumes, caches it with a timestamp, and returns it to the UI, so that subsequent requests can be served from memory on the web application's back-end without re-requesting from the report server. I would also need to check this timestamp and make a new request if the cached report has been held longer than the specified time. The data to be cached is fairly minuscule, just some smallish JSON objects with only a handful of values holding the information the UI needs, and there are not many of these objects; I would not be surprised if they could all easily be held in memory at once, so timestamp-based expiry is the only invalidation that should be necessary.
I have almost zero experience when it comes to caching / Memcached / Redis. Are there advantages to one or the other? Is something like this possible? How would I go about implementing it? Are there other options?
Appreciate the help!
Server-side caching of these kinds of RESTful query responses is very possible and quite common.
With any server based caching, you should also think hard about whether you really need it, as it does add complexity. It can certainly make a huge improvement, but since your usage volume is low, it might actually be overkill. You may also be able to use HTTP caching protocols to avoid the need for caching on the server. If the data doesn't change very often and you use eTags or modified dates correctly, along with an intermediary proxy like AWS CloudFront, users will rarely experience that delay.
Also, if you are finding your database to be a bottleneck, you might be able to get away with just configuring it to cache more aggressively.
Assuming you do want to cache in memory ...
For server-side caching, the normal approach is to cache results for some time period or manually clear them from the cache. A more modern and, in my opinion, better approach is Russian-doll caching, where you key items according to the time their inputs changed. Then you never need to worry about manually clearing them; you just make sure the timestamps are correct and synchronised.
Memcached versus Redis versus something else? For this usage, Memcached is probably best: it's extremely simple, and you don't need persistence, which is Redis's big advantage over Memcached. Redis is well engineered and would work fine too, but I don't see the benefit of using something considerably more feature-rich and complex when you don't need it and there's a good alternative. That said, the one big advantage of Redis is that it now has excellent built-in clustering support, so it's easy to scale and stay online; but that would be overkill for your use case.
Something else? There are plenty of other in-memory databases, but I think Memcached and Redis are probably best if you want to avoid the problems of relying on cutting-edge frameworks without too much support. However, there is something else: boring old files. If you're generating reports, you might want to consider just generating them as temporary files. If your OS is doing its job, the files will end up being cached anyway.
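To make the in-memory option concrete, here is a rough sketch of the expiry-based approach described above, assuming Memcached via the pymemcache client; the report-service URL, key scheme and transform step are all hypothetical:

    import json

    import requests
    from pymemcache.client.base import Client

    cache = Client(("localhost", 11211))
    REPORT_TTL_SECONDS = 15 * 60  # reports only change every 10-20 minutes

    def transform_for_ui(raw):
        # Placeholder for whatever reshaping the UI needs.
        return raw

    def get_report(report_id):
        key = "report:{}".format(report_id)
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)

        # Cache miss: call the low-volume report service, transform the
        # result, and let Memcached handle expiry instead of hand-rolled
        # timestamp checks.
        resp = requests.get(
            "http://reports.internal/api/reports/{}".format(report_id), timeout=10
        )
        resp.raise_for_status()
        report = transform_for_ui(resp.json())
        cache.set(key, json.dumps(report), expire=REPORT_TTL_SECONDS)
        return report

Swapping in Redis would only change the client calls (setex/get); the shape of the code stays the same.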

Best practice for modeling analytics data

I am working on a product with which a user can create his/her own mobile site. As this is a mobile-site-creation platform, there are lots of sites created in the application. I need to keep all the visitor data in the database so that the product can show analytics to each user for his/her site.
When there were fewer sites, everything worked fine, but the data is now growing fast as there are lots of requests on the server. I use MongoDB as the NoSQL DBMS to keep all the data. In a collection named "analytics", I insert a document with the site id for each visit so that the data can be shown to the user. As the data has grown, showing analytics to a user has become slow, and disk usage keeps growing.
What is the best way to model this kind of big data?
Should I create a separate collection per site and store each site's data there?
Should I also split collections by date?
What should the data-cleanup procedure be? What are the best practices adopted by industry leaders?
Please help
I would strongly suggest reading through the MongoDB optimization strategies at http://docs.mongodb.org/manual/administration/optimization/. That page covers various ways to identify slow-performing queries/operations and suggestions for improving them, which should hopefully help you resolve your slow queries and performance issues.
If you haven't already, I would also suggest taking a look at the use cases at http://docs.mongodb.org/ecosystem/use-cases/, how the data is modeled for those scenarios, and whether any of them resembles what you are trying to achieve.
After following the optimization strategies and making appropriate changes, if you still have performance issues, I would suggest posting the following information for further suggestions:
What is your current state in terms of performance and what is the planned target state?
What does your system look like, i.e. its hardware/software characteristics?
Once you have the needed performance characteristics, the following questions may help you reach your targets:
What are the common query patterns and which ones are slow?
Potentially look for adding indexes that can enhance query performance
Potentially look for schema refactoring based on access patterns
Potentially look for schema refactoring that rolls up / pre-aggregates analytics data based on how it will be used (see the sketch after this list).
Are writes also slow and is that a concern as well?
Potentially plan for sharding, which provides write as well as read scaling. Sharding is a topic in itself; I would suggest reading about it at http://docs.mongodb.org/manual/sharding/
How big is the data, and how is it growing or intended to grow?
This would give further insight into what else could be suggested.
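As an illustration of the roll-up point above, here is a minimal pymongo sketch that keeps one pre-aggregated document per site per day instead of one raw document per visit (collection and field names are hypothetical):

    from datetime import datetime, timezone

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    daily_stats = client.analytics.daily_stats

    def record_visit(site_id, path):
        day = datetime.now(timezone.utc).strftime("%Y-%m-%d")
        # Upsert: the first visit of the day creates the document,
        # later visits only increment counters in place.
        # Dots are replaced because "." would be treated as a nested path.
        daily_stats.update_one(
            {"site_id": site_id, "day": day},
            {"$inc": {"total_visits": 1, "paths." + path.replace(".", "_"): 1}},
            upsert=True,
        )

    def visits_for_day(site_id, day):
        doc = daily_stats.find_one({"site_id": site_id, "day": day})
        return doc["total_visits"] if doc else 0

Dashboard reads then touch a handful of small documents per site instead of scanning millions of raw events, and the raw events can be expired or archived separately.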

Content revision history: moving it to another database?

I keep a content revision history for a certain content type. It's stored in MongoDB, but since the data is not frequently accessed, I don't really need it there taking up memory; I'd rather put it in a slower, disk-based database.
Which database should I put it in? I'm looking for something that's really cheap, with cloud hosting available, and I don't need speed. I'm looking at SimpleDB, but it doesn't seem very popular. An RDBMS doesn't seem easy to use here, since my data is structured as documents. What are my options?
Thanks
Depends on how often you want to look at this old data:
Why don't you mongodump it to your local disk and mongorestore it when you want it back?
Documentation here
OR
Set up a local mongo instance and clone the database using the information here.
Based on your questions and comments, you might not find the perfect solution. You want free or dirt cheap storage, and you want to have your data available online.
There is only one solution that I see as feasible:
Stick with MongoDB. SimpleDB does not allow you to store documents, only key-value pairs.
You could create a separate collection for your history and use a cloud service that gives you a free tier; for example, http://MongoLab.com offers a 240 MB free tier.
If you exceed the free tier, you can look at discarding the oldest data, moving it to offline storage, or start paying for what you are using.
If your data grows a lot, you will have to decide whether to pay for it, keep it available online or offline, or discard it.
If you are dealing with a lot of large objects (BLOBs or CLOBs), you can also store the 'non-indexed' data separately from the database. This keeps the database both cheap and fast, and the large objects can be retrieved from any cheap storage when needed.
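If you stay on MongoDB, here is a rough sketch of the "move the oldest data out" idea: copy revisions older than a cutoff into a cheaper archive collection (which could live on a different cluster) and delete them from the hot one. The cluster/collection names and the 90-day cutoff are hypothetical.

    from datetime import datetime, timedelta, timezone

    from pymongo import MongoClient

    live = MongoClient("mongodb://live-cluster:27017").cms.revisions
    archive = MongoClient("mongodb://archive-cluster:27017").cms.revisions_archive

    def archive_old_revisions(days=90):
        cutoff = datetime.now(timezone.utc) - timedelta(days=days)
        moved = 0
        for doc in live.find({"created_at": {"$lt": cutoff}}):
            # Idempotent copy first, delete only after the copy succeeds.
            archive.replace_one({"_id": doc["_id"]}, doc, upsert=True)
            live.delete_one({"_id": doc["_id"]})
            moved += 1
        return moved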
Cloudant.com is pretty cool for hosting your DB in the cloud, and it uses BigCouch, which is a NoSQL database. I'm using it for a social site I have in the works, as CouchDB (BigCouch) has an open-ended structure and you talk to it via JSON. It's pretty awesome stuff, though weird to move from SQL to map/reduce, but once you do it's worth it. I did some research because I was a .NET guy for a long time but am moving to Linux and Node.js, partly out of boredom and the love of JavaScript. These things just fit together, because Node.js is all JavaScript on the back-end and talks seamlessly to CouchDB, and the whole thing scales like crazy.

I need suggestions for a distributed media storage data store

I want to develop a multimedia system that needs to store millions of videos and images, so I want to select a distributed storage subsystem. Can anyone give me some suggestions? Thanks!
I guess the best option for 'millions of videos and images' is a content distribution/delivery network (CDN):
A CDN is a server setup which allows for faster, more efficient delivery of your media files. It does this by maintaining copies of your media at different points of presence (POPs) along a global network, to ensure quick client access and the fastest delivery possible.
If you use a CDN, you don't need to worry about many of these problems (distribution, fast access), and integration with a CDN should also be very simple.
#yi_H
You can configure your writes to be replicated to multiple nodes before the call returns to the client. Whether or not that is needed depends, of course, on the use case, and it definitely involves a performance hit: if you are implementing a write-heavy analytical database, it will have a significant impact on write throughput.
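In MongoDB, for instance, that knob is the write concern; a minimal pymongo sketch (database/collection/field names are hypothetical):

    from pymongo import MongoClient, WriteConcern

    client = MongoClient("mongodb://localhost:27017")

    # w=2: the insert does not return until two replica-set members
    # have acknowledged the write -- safer, but it costs throughput.
    media_meta = client.media.get_collection(
        "metadata", write_concern=WriteConcern(w=2, wtimeout=5000)
    )

    media_meta.insert_one({"file": "clip-0001.mp4", "length_s": 42})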
As for all the other points you make about the question's lack of requirements etc., I second that.
Having a replicated file system with the metadata in a NoSQL database is a very common way of doing things. Why not consider that kind of approach?
Have you taken a look at MongoDB GridFS? I have never used it, but it is something I would look at to see if it gives you any ideas.
You gave us (near) zero information about what your requirements are. E.g.:
Do you want atomic transactions?
Is the system read or write heavy?
Do you need fast queries or want to batch-process the data set?
How big are the videos?
Do you want to distribute data locally (on a LAN) or spanning multiple data centers / continents?
How are we supposed to pick the right tool if we don't know what it needs to support?
Without any knowledge of the system, I would advise using some kind of FS replication for the videos and images, and then storing the metadata associated with the items in MongoDB, MySQL Master-Master, or MySQL Cluster.
Distributed related to what?
If you are talking about replication to distribute the data:
MongoDB is restricted to master-slave replication, so only one node accepts writes, which leaves you with a single point of failure for a truly distributed system.
CouchDB is able to replicate peer-to-peer.
You can find a very good comparison here, and here it is also compared with HBase.
With CouchDB you should also be aware that you talk HTTP to the database, which comes with built-in web services.
Regards,
Chris
An alternative is to use MongoDB's GridFS, serving as a (very easily manageable) redundant and distributed filesystem.
Some will say that it's slow on reads (and it is, mostly because of the nature of its design), but that doesn't have to be a dealbreaker for your system as a whole, because if you need performance later on you could always put Varnish or Squid in front of the filesystem tier.
As far as I know, Squid also supports an on-disk cache for the less-hot files.
Sources:
http://www.mongodb.org/display/DOCS/GridFS
http://www.squid-cache.org/Doc/config/cache_dir/
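To make the GridFS suggestion concrete, here is a minimal sketch of storing a file and streaming it back with pymongo's gridfs module (file and database names are hypothetical):

    import gridfs
    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017").media
    fs = gridfs.GridFS(db)

    # Store a video; GridFS splits it into chunks (255 kB by default)
    # across the fs.files / fs.chunks collections.
    with open("intro.mp4", "rb") as f:
        file_id = fs.put(f, filename="intro.mp4")

    # Stream it back later, e.g. from behind Varnish/Squid as suggested.
    grid_out = fs.get(file_id)
    data = grid_out.read()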

How to collect statistics about data usage in an enterprise application?

For optimization purposes, I would like to collect statistics about data usage in an enterprise Java application. In practice, I would like to know which database tables, and moreover which individual records, are accessed most frequently.
What I thought I could do is write an aspect that records all data access and asynchronously writes the results to a database, but I feel I would be reinventing the wheel by doing so. So, are there any existing open-source frameworks that already tackle this problem, or is it somehow possible to squeeze this information directly from MySQL?
This might be useful - have you seen the UserTableMonitoring project?
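On the "squeeze it directly from MySQL" idea: table-level access counts (though not per-record counts) can be read from the performance_schema summary tables in MySQL 5.6+. A rough sketch, with hypothetical connection details and schema name:

    import mysql.connector

    conn = mysql.connector.connect(
        host="localhost", user="stats", password="secret", database="appdb"
    )
    cur = conn.cursor()
    # Per-table read/write counts accumulated since server start.
    cur.execute(
        """
        SELECT object_schema, object_name, count_read, count_write
        FROM performance_schema.table_io_waits_summary_by_table
        WHERE object_schema = 'appdb'
        ORDER BY count_read + count_write DESC
        LIMIT 20
        """
    )
    for schema, table, reads, writes in cur.fetchall():
        print("{}.{}: {} reads, {} writes".format(schema, table, reads, writes))
    cur.close()
    conn.close()

Record-level hot spots would still need application-side instrumentation, for example the aspect described above.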