Content revision history: moving databases? - MongoDB

I keep a content revision history for a certain content type. It's stored in MongoDB, but since the data is not frequently accessed I don't really need it there, taking up memory. I'd rather put it in a slower, disk-based database.
Which database should I put it in? I'm looking for something really cheap, with cloud hosting available, and I don't need speed. I'm looking at SimpleDB, but it doesn't seem very popular. An RDBMS doesn't seem easy enough to handle, since my data is structured as documents. What are my options?
Thanks

It depends on how often you want to look at this old data:
Why don't you mongodump it to your local disk and mongorestore it when you want it back?
Documentation here
OR
Set up a local Mongo instance and clone the database using the information here
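
For the dump-and-restore route, here is a minimal sketch driving the tools from Python; the database name (mydb), collection name (revisions), and backup path are assumptions, not taken from the question:

```python
import subprocess

# Dump just the revision-history collection to local disk.
subprocess.run(
    ["mongodump", "--db", "mydb",
     "--collection", "revisions",
     "--out", "/backups/revisions"],
    check=True,
)

# Later, when the old data is needed again:
subprocess.run(
    ["mongorestore", "--db", "mydb", "/backups/revisions/mydb"],
    check=True,
)
```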

Based on your questions and comments, you might not find the perfect solution: you want free or dirt-cheap storage, and you want your data available online.
There is only one solution I can see feasible:
Stick with MongoDB. SimpleDB does not allow you to store documents, only key-value pairs.
You could create a separate collection for your history and use a cloud service that gives you a free tier. For example, http://MongoLab.com gives you a 240 MB free tier.
If you exceed the free tier, you can look at discarding the oldest data, moving it to offline storage, or starting to pay for what you use.
If your data grows a lot, you will have to decide whether to pay for it, keep it available online or offline, or discard it.
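
As a sketch of the separate-collection idea with pymongo; the collection names (revisions, revisions_archive), the created_at field, and the 180-day cutoff are all hypothetical:

```python
from datetime import datetime, timedelta, timezone
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["mydb"]  # or your hosted URI
cutoff = datetime.now(timezone.utc) - timedelta(days=180)

# Copy revisions older than the cutoff into the archive collection,
# then remove them from the hot collection.
old_docs = list(db.revisions.find({"created_at": {"$lt": cutoff}}))
if old_docs:
    db.revisions_archive.insert_many(old_docs)
    db.revisions.delete_many({"_id": {"$in": [d["_id"] for d in old_docs]}})
```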

If you are dealing with a lot of large objects (BLOBs or CLOBs), you can also store the 'non-indexed' data separately from the database. This keeps the database both cheap and fast; the large objects can be retrieved from any cheap storage when needed.
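
A sketch of that split, assuming Amazon S3 (via boto3) as the cheap storage and MongoDB for the metadata; the bucket and collection names are made up:

```python
import boto3
from pymongo import MongoClient

s3 = boto3.client("s3")
db = MongoClient()["mydb"]

def save_document(doc_id, blob, metadata):
    # The large, non-indexed payload goes to cheap object storage...
    key = f"blobs/{doc_id}"
    s3.put_object(Bucket="my-archive-bucket", Key=key, Body=blob)
    # ...while only the small, queryable metadata stays in the database.
    db.documents.insert_one({**metadata, "_id": doc_id, "blob_key": key})

def load_document(doc_id):
    meta = db.documents.find_one({"_id": doc_id})
    obj = s3.get_object(Bucket="my-archive-bucket", Key=meta["blob_key"])
    return obj["Body"].read()
```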

Cloudant.com is pretty cool for hosting your DB in the cloud; it uses BigCouch, a NoSQL store. I'm using it for a social site I have in the works, as CouchDB (and BigCouch) has an open-ended structure and you talk to it via JSON. It's pretty awesome stuff, though it's weird at first to move from SQL to map/reduce, but once you do it's worth it. I did some research because I've been a .NET guy for a long time, but I'm moving to Linux and Node.js, partly out of boredom and the love of JavaScript. These things just fit together: Node.js is all JavaScript on the backend and talks seamlessly to CouchDB, and the whole thing scales like crazy.
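
To give a flavor of the 'talk to it via JSON' part, here is a sketch against a CouchDB-style HTTP API; the endpoint and database name are placeholders, and a hosted service like Cloudant would additionally need credentials:

```python
import requests

BASE = "http://localhost:5984"  # placeholder endpoint

requests.put(f"{BASE}/posts")  # create the database (a 412 means it already exists)

# Documents are plain JSON with an open-ended structure.
resp = requests.post(f"{BASE}/posts", json={
    "type": "status_update",
    "author": "alice",
    "text": "hello from a schemaless world",
})
doc_id = resp.json()["id"]

# Reading it back is again just HTTP and JSON.
print(requests.get(f"{BASE}/posts/{doc_id}").json())
```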

Related

What would be the best distributed storage solution for a heavy use web scraper/crawler?

I'm implementing a web scraper that needs to scrape and store about 15GB+ of HTML files a day. The amount of daily data will likely grow as well.
I intend on storing the scraped data as long as possible, but would also like to store the full HTML file for at least a month for every page.
My first implementation wrote the HTML files directly to disk, but that quickly ran into inode limit problems.
The next thing I tried was using Couchbase 2.0 as a key/value store, but the Couchbase server would start to return Temp_OOM errors after 5-8 hours of web scraping writes. Restarting the Couchbase server is the only route for recovery.
Would MongoDB be a good solution? This article makes me worry, but it does sound like their requirements are beyond what I need.
I've also looked a bit into Cassandra and HDFS, but I'm not sure if those solutions are overkill for my problem.
As for querying the data, as long as I can get the specific page data for a URL and a date, it will be good. The data is mostly written once, read once, and then stored for possible future reads.
Any advice pertaining to storing such a large amount of HTML files would be helpful.
Assuming 50 kB per HTML page, 15 GB daily gives us 300,000+ pages per day, about 10 million monthly.
MongoDB will definitely work well with this data volume. Concerning its limitations, it all depends on how you plan to read and analyze the data; you may take advantage of its map/reduce features given that amount of data.
However, if your problem may scale further, you may want to consider other options. It might be worth noting that Google's search engine uses BigTable as storage for HTML data. In that sense, Cassandra could be a good fit for your use case: it offers excellent, persistent write/read performance and scales horizontally well beyond your data volume.
I'm not sure what deployment scenario produced those errors when you used Couchbase; more investigation may be needed to track down what is causing the problem. Trace the errors back to their source, because, per the requirements described above, such a store should work fine and shouldn't give up after 5 hours (unless you have a storage problem).
I suggest you give MongoDB a try; it is very powerful, well suited to what you need, and shouldn't complain about the requirements you mentioned above.
You could use HDFS, but you don't really need it when MongoDB (or even Cassandra) can do the job.
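
If you do try MongoDB, here is a sketch of the URL-plus-date access pattern with pymongo; the collection layout, field names, and compression choice are assumptions:

```python
import zlib
from datetime import datetime, timezone
from pymongo import MongoClient, ASCENDING

pages = MongoClient()["scraper"]["pages"]

# Compound index so "this URL on this date" is a single indexed lookup.
pages.create_index([("url", ASCENDING), ("fetched_at", ASCENDING)])

def store_page(url, html):
    pages.insert_one({
        "url": url,
        "fetched_at": datetime.now(timezone.utc),
        # Compressed, a ~50 kB page stays far below MongoDB's 16 MB document cap.
        "html": zlib.compress(html.encode("utf-8")),
    })

def get_page(url, start, end):
    return pages.find_one({"url": url, "fetched_at": {"$gte": start, "$lt": end}})
```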

I need suggestions for a distributed media storage data store

I want to develop a multimedia system. The system needs to store millions of videos and images, so I want to select a distributed storage subsystem. Can anyone give me some suggestions? Thanks!
I guess the best option for 'millions of videos and images' is a content distribution/delivery network (CDN):
A CDN is a server setup which allows for faster, more efficient delivery of your media files. It does this by maintaining copies of your media at different points of presence (POPs) along a global network to ensure quick client access and the fastest delivery possible.
If you use a CDN, you don't need to worry about many problems (distribution, fast access). Integration with a CDN should also be very simple.
@yi_H:
You can configure your writes to be replicated to multiple nodes before the call returns to the client. Whether that is needed is of course up to the use case, and it definitely involves a performance hit; if you are implementing a write-heavy analytical database, it will have a significant impact on write throughput.
All the other points you make about the question, in terms of the lack of requirements etc., I second.
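
In MongoDB terms, that knob is the write concern. A minimal pymongo sketch, where the replica-set URI and collection name are made up:

```python
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")

# Writes to this collection handle are not acknowledged until a majority
# of replica-set members have them -- safer, but each write now pays the
# replication round-trip, which matters for write-heavy workloads.
events = client["mydb"].get_collection(
    "events", write_concern=WriteConcern(w="majority", j=True)
)
events.insert_one({"type": "example", "value": 42})
```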
Having a replicated file system with the metadata in a NoSQL database is a very common way of doing things. Did you consider that kind of approach?
Have you taken a look at MongoDB GridFS? I have never used it, but it is something I would look at to see if it gives you any ideas.
You gave us (near) zero information about what your requirements are, e.g.:
Do you want atomic transactions?
Is the system read or write heavy?
Do you need fast queries or want to batch-process the data set?
How big are the videos?
Do you want to distribute data locally (on a LAN) or spanning multiple data centers / continents?
How are we supposed to pick the right tool if we don't know what it needs to support?
Without any knowledge of the system I would advise using some kind of FS replication for the videos and images and then storing the metadata associated with the items either in MongoDB, MySQL Master-Master or MySQL Cluster.
Distributed related to what?
If you are talking about replication to distribute:
MongoDB is restricted to master-slave replication, so only one node is able to accept writes, which leaves you with a single point of failure for a really distributed system.
CouchDB is able to replicate peer-to-peer.
You can find a very good comparison here, and here also compared with HBase.
With CouchDB you also have to be aware that you are going to talk HTTP to the database, which has built-in web services.
Regards,
Chris
An alternative is to use MongoDB's GridFS, which serves as a (very easily manageable) redundant and distributed filesystem.
Some will say that it's slow on reads (and it is, mostly because of the nature of its design), but that doesn't have to be a dealbreaker for your system as a whole, because if you need performance later on you could always put Varnish or Squid in front of the filesystem tier.
For all I know, Squid also supports an on-disk cache for all the less-hot files.
Sources:
http://www.mongodb.org/display/DOCS/GridFS
http://www.squid-cache.org/Doc/config/cache_dir/
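
For a feel of how little management GridFS needs, a minimal pymongo sketch (database and file names are invented):

```python
import gridfs
from pymongo import MongoClient

db = MongoClient()["media"]
fs = gridfs.GridFS(db)

# Store a video; GridFS splits it into chunk documents automatically,
# and a replica set provides the redundancy.
with open("clip.mp4", "rb") as f:
    file_id = fs.put(f, filename="clip.mp4")

# The read path -- the tier you might front with Varnish or Squid.
data = fs.get(file_id).read()
```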

How to collect statistics about data usage in an enterprise application?

For optimization purposes, I would like to collect statistics about data usage in an enterprise Java application. In practice, I would like to know which database tables, and moreover which individual records, are accessed most frequently.
What I thought I could do is write an aspect that records all data access and asynchronously writes the results to a database, but I feel I would be re-inventing the wheel by doing so. Are there any existing open-source frameworks that already tackle this problem, or is it somehow possible to squeeze this information directly from MySQL?
This might be useful - have you seen the UserTableMonitoring project?
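
On the 'squeeze it directly from MySQL' side, the performance_schema keeps per-table I/O counters; note this is table-level only, so per-record stats would still need something like the aspect approach. A sketch, with made-up credentials:

```python
import mysql.connector  # any MySQL client would do

conn = mysql.connector.connect(user="stats", password="secret", host="localhost")
cur = conn.cursor()

# List the most frequently touched tables without changing application code.
cur.execute("""
    SELECT OBJECT_SCHEMA, OBJECT_NAME, COUNT_READ, COUNT_WRITE
    FROM performance_schema.table_io_waits_summary_by_table
    WHERE OBJECT_SCHEMA NOT IN ('mysql', 'performance_schema')
    ORDER BY COUNT_STAR DESC
    LIMIT 20
""")
for schema, table, reads, writes in cur.fetchall():
    print(f"{schema}.{table}: {reads} reads, {writes} writes")
```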

Non-relational databases (NoSQL) for small to medium sized applications

The benefits of a non-relational database (such as key-value storage) are evident when used with large-scale datasets (Google, Facebook, LinkedIn). How do you think small to medium-sized applications can benefit from using non-relational databases?
IBM mainframes have had "non-relational" databases since the 60s (hierarchical databases such as IMS and variants). These databases are still in use because they are extremely fast and handle huge scale well.
The point of relational databases was to provide a regular, relatively abstract method for storing and retrieving data in which the tuning can be done relatively independently of the data model (not true for IMS). They were designed largely in reaction to the inability to reorganize hierarchical databases easily. The upside is nice organization; the downside is medium, not high, performance.
Google provides scalable storage and MapReduce to handle scale. It isn't relational.
There was a huge push early in the last decade to store data in XML, in essentially hierarchical form, because XML is implicitly hierarchical. That was a huge mistake IMHO, because it repeated the inconvenience of hierarchical databases but had none of the performance. I'm not very surprised this movement seems to have pretty much died.
Most of the practical push to non-relational seems to me to be towards performance and scale. I don't see how this helps "small" applications much.
People have proposed, but not done, a lot of practical data management using knowledge-based schemes. Doug Lenat's CYC comes to mind here. The ability of the database to help an application draw non-obvious conclusions strikes me as very interesting for "small" applications that are trying to be "smart". But there aren't a lot of these yet.
The sweet spot for using a NoSQL database at that scale is when the database model (key-value, document, etc.) is a good match for the application's needs and advanced relational functionality is not needed.
At the small end of the spectrum, performance is a non-issue because just about everything is fast. Storage engines are a non-issue, and if you don't need a sophisticated query engine, the lack of SQL support is a non-issue.
You are left with how well it fits and how easy it is to use. Honestly, though, tooling does become an issue: relational database tooling is mature, while NoSQL tooling is less feature-rich and less battle-hardened. Too often it is roll-your-own tooling. Definitely consider what tools you'd be giving up and how much you need them.
There is an additional slate of advantages for smaller projects when considering a NoSQL service (like Amazon SimpleDB and Microsoft Azure) as compared to a product. If you only have to pay for what you use and you don't use much, it can be cheaper than running a dedicated server, going all the way down to free for something like the SimpleDB free usage tier.
You also avoid some of the server and database maintenance costs. This can be a big win if you don't have a DBA, or when your DBAs are already overworked. Of course you'll still have admin work to do, but it is significantly reduced, and typically simpler.
When it comes to graph databases (like Neo4j - a project I'm involved in), they excel at scaling to complexity. That means they provide "better substrates for modeling business domains" (see The State of NoSQL by Ben Scofield). As I see it, this is very important in small to medium-sized apps.
This may be better explained through examples, so here are some links to example apps/domain modeling:
Access control lists the graph database way
Social networks in the database: using a graph database
Domain modeling gallery
The question perhaps requires a bit more context... Assuming a Python environment, consider the tutorial at the y_serial project: http://yserial.sourceforge.net/
NoSQL is not adopted merely for reasons of scalability. Serialization (of any arbitrary Python object) and persistence are very convenient at any scale, so consider a key-value system as one approach.
Well, one of the problems with an RDBMS is that you need to spend effort mapping your programming language's domain models to the relational schema of your RDBMS. This effort is usually spent configuring your ORM layer.
With NoSQL databases you are not forced to map your objects to a relational model, and in most cases your objects are serialized as-is. Because of the lack of an intermediary schema, data migrations and versioning become easier.
Another benefit is scalability and performance. Since most of the time your data is retrieved by key, effectively everything uses an index. Trivial sharding is possible by taking the key modulo the number of available NoSQL instances, which gives you natural data partitioning.
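
A sketch of that MOD-based sharding, here against hypothetical Redis nodes (any key-value client would look much the same):

```python
import zlib
import redis  # redis-py; hosts below are placeholders

NODES = [redis.Redis(host=h) for h in ("10.0.0.1", "10.0.0.2", "10.0.0.3")]

def node_for(key):
    # A stable hash of the key, MOD the instance count, picks the shard.
    # (zlib.crc32 rather than hash(), which is randomized per process.)
    return NODES[zlib.crc32(key.encode()) % len(NODES)]

node_for("user:42").set("user:42", '{"name": "Alice"}')
print(node_for("user:42").get("user:42"))
```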
If you're interested in seeing how developing with a NoSQL differs from a RDBMS, I have a tutorial where I show how to go about designing a simple blog application using Redis.
If you match up a few common PaaS cloud services, like a key-value store, a BLOB store, and a message-queue store, you have some handy tools that can free small-application developers from the tyranny of the DBA and the infrastructure folks.
Today, small developers often resort to Jet MDBs. Why? Easy: shared access is as simple as storing the MDB file on a file share visible to the entire application community. When they can get away with it (i.e. get the necessary support from the gatekeepers) they might use SQL Server Express, MySQL, etc.
Sadly those gatekeepers can be pretty hostile to deal with in a large organization. Mention a "database" and suddenly you face the DBA gang and associated delays, application reviews, prioritization, etc. Mention needing a server and you face that other firing squad.
Using a NoSQL solution and related cloud services can eliminate a ton of this if you don't need an RDBMS.
For one thing, all that's really required is an account with a public cloud provider. This becomes fairly easy once the concept has been approved, and easier for you as a developer once you've been approved and assigned an account, though of course there are the usual bookkeeping issues.
But let's even set that aside. What if your organization implemented a private cloud for such uses? Lots of the issues of outside billing go away, data insecurity worries go away, etc.
Such a thing could be implemented and provisioned in a semi-anonymous fashion, almost as easily as administering file shares. The anonymity comes in because once you've been approved to develop on the in-house cloud nobody needs to nitpick the details of your activities using it any more than they need to examine a request before you can create a file on an existing file share.
Obviously there would be storage and CPU quotas to manage. Nobody can afford to just keep scaling up indefinitely, and rogue applications might consume vast quantities of resources, so you need some sort of quota system to cap usage. Whether this is monitored by infrastructure folks is an implementation decision; it might also be treated just like file-share use: run out, and somebody yells at the programmer, who in turn looks into it and requests more if appropriate (or fixes their bugs).
But you end up with "utility computing" and by "using no SQL" you don't incur the cost (and issues) of dealing with DBAs. They can still sit quietly surfing the Web in their big offices while you get some work done.
Amazon SimpleDB can be useful for those who need a non-relational database for storing smaller, unstructured data. It limits storage to 10 GB per domain, offers simplicity and flexibility, automatically indexes all data, and is priced on actual usage. You can store any UTF-8 string data in it.

Is NoSQL reliable for a small-business app?

I'm deciding between a NoSQL engine and a regular SQL one for a document management system for small businesses.
I have experience with Firebird/SQL Server and have found a good track record of reliability (especially with Firebird).
This market is full of crappy "servers" (clone-built PCs, the majority), cheap hard disks, rarely any RAID or anything like that; some are in locations where a power cut is normal, and some don't have a UPS, etc. (I will include off-site auto-backup to external servers, but that doesn't change the internal setup.) (I know about educating end users on proper setups, but it's unwise to depend on that, so let's stick to the point.)
From the design point of view, a schema-less database is the way to go for my system, but I worry whether any of the current solutions (MongoDB, Tokyo Cabinet, etc.) are like Firebird and survive crashes, malfunctions and abuse, so that data corruption is very rare.
The plan is to store the office documents there and provide a central repository.
Check out Neo4j. It is a graph database (schema-free) that can be used like a document or key/value store.
Neo4j has been in production for many years in environments like you describe. Unlike many other NoSQL databases, Neo4j actually flushes data to disk and uses a transaction log to recover from an inconsistent state. It also has real transactions (full ACID) that can span multiple operations and treat them as a single unit (a feature that is frequently left out of many other NoSQL stores).
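A sketch of such a multi-statement transaction using the official Neo4j Python driver; the URI, credentials, and data model are invented, and the exact session API varies between driver versions:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

def file_document(tx, doc_id, title):
    # Both statements commit or roll back together as one ACID unit.
    tx.run("CREATE (d:Document {id: $id, title: $title})", id=doc_id, title=title)
    tx.run("MATCH (d:Document {id: $id}) SET d.filed = true", id=doc_id)

with driver.session() as session:
    session.execute_write(file_document, "doc-1", "Quarterly report")
driver.close()
```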
-Johan
(Disclaimer: I am part of the Neo4j team)
CouchDB has the reliability you need:
The CouchDB file layout and commitment system features all Atomic Consistent Isolated Durable (ACID) properties. On-disk, CouchDB never overwrites committed data or associated structures, ensuring the database file is always in a consistent state.
Look at the ACID Properties section here for more info.
With CouchDB you also get easy backup and replication.
I've no code in production using CouchDB yet, but so far I'm very happy with the tests and the development process with CouchDB.
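
As an illustration of that replication story, a one-shot replication request against CouchDB's HTTP _replicate endpoint; hosts and database names here are placeholders:

```python
import requests

# Replicate the local "docs" database to an off-site CouchDB for backup.
resp = requests.post(
    "http://localhost:5984/_replicate",
    json={
        "source": "docs",
        "target": "http://backup.example.com:5984/docs",
        "create_target": True,
    },
)
print(resp.json())
```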