MongoDB/CouchDB for storing files + replication? - mongodb

if I would like to store a lot of files + replicate the db, what NoSql databse would be the best for this kind of job?
I was testing MongoDB and CouchDB and these DBs are really nice and easy to use. If it would be possible I would use one of them for storing files. Now I see the difference between Mongo and Couch, but I cannot explain which one is better for storing files. And if Im talking about storing files I mean files with 10-50MB but also maybe files with 50-500MB - and maybe a lot of updates.
I found here a nice table:
http://weblogs.asp.net/britchie/archive/2010/08/17/document-databases-compared-mongodb-couchdb-and-ravendb.aspx
Still not sure which of these properties are the best for filestoring and replication. But maybe I should choose another NoSql DB?

That table is way out of date:
Master-Slave replication has been deprecated in favour of replica sets for starters and also consistency is wrong there as well. You will want to completely re-read this section on the MongoDB docs.
Map/Reduce is only JavaScript, there is no others.
I have no idea what that table means by attachments but GridFS is a storage standard built into the drivers to help make storing large files in MongoDB easier. Meta-data is also supported through this method.
MongoDB is on version 2.2 so anything it mentions about versions before is now obsolete (i.e. sharding and single server durability).
I do not have personal experience with CouchDBs interface for storing files however I wouldn't be surprised if there was hardly any differences between the two. I would think this part is too subjective for us to answer and you will need to just go for which one suites you better.
It is actually possible to build MongoDB clusters multi-regional (which S3 buckets are not and cannot be replicated as such without work) and replicate the most accessed files in a specific part of the world through MongoDB to these clusters.
I mean the main upshot I have found at times is that MongoDB can act like S3 and Cloudfront put together which is great since you have the redundant storage and the ability to distribute your data.
However that being said S3 is very valid option here and I would seriously give it a try, you might not be looking for the same stuff as me in a content network.
Database storage of files do not come without their serious downsides, however speed shouldn't be a huge problem here since you should get the same speed from a none Cloudfront fronted S3 as you should get from MongoDB really (remember S3 is a redundant storage network, not a CDN).
If you were to use S3 you would then store a row in your database that points to the file and houses meta-data about it.

There is a project called CBFS by Dustin Sallings (one of the Couchbase founders, and creator of spymemcached and core contributor of memcached) and Marty Schoch that uses Couchbase and Go.
It's an Infinite Node file store with redundancy and replication. Basically your very own S3 that supports lots of different hardware and sizes. It uses REST HTTP PUT/GET/DELETE, etc. so very easy to use. Very fast, very powerful.
CBFS on Github: https://github.com/couchbaselabs/cbfs
Protocol: https://github.com/couchbaselabs/cbfs/wiki/Protocol
Blog Post: http://dustin.github.com/2012/09/27/cbfs.html
Diverse Hardware: https://plus.google.com/105229686595945792364/posts/9joBgjEt5PB
Other Cool Visuals:
http://www.youtube.com/watch?v=GiFMVfrNma8
http://www.youtube.com/watch?v=033iKVvrmcQ
Contact me if you have questions and I can put you in touch.

Have you considered Amazon S3 as an option? It's highly available, proven and has redundant storage etc....
CouchDB, even though I personally like it a lot as it works very well with node.js, has the disadvantage that you need to compact it regularly if you don't want to waste too much diskspace. In your case if you are going to be doing a lot of updates to the same documents, that might be an issue.
I can't really commment on MongoDB as I haven't used it, but again, if file storage is your main concern, then have a look at S3 and similar as they are completely focused on filestorage.
You could combine the two where you store your meta data in a NoSql or Sql datastore and your actual files in a separate file store but keeping those 2 stores in sync and replicated might be tricky.

Related

Disadvantages of CouchDB

I've very recently fallen in love with CouchDB. I'm pretty excited by its enormous benefits and by its beauty. Now I want to make sure that I haven't missed any show-stopping disadvantages.
What comes to your mind? Attached is a list of points that I have collected. Is there anything to add?
Blog posts from as late as 2010 claim "not mature enough" (whatever that's worth).
Slower than in-memory DBMS.
In-place updates require server-side logic (update handlers).
Trades disk vs. speed: Databases can become huge compared to other DBMS (compaction functionality exists, though).
"Only" eventual consistency.
Temporary views on large datasets are very slow.
Replication of large databases may fail.
Map/reduce paradigm requires rethinking (only for completeness).
The only point that worries me is #3 (in-place updates), because it's quite inconvenient.
The data is in JSON
Which means that documents are quite large (BigData, network bandwidth, speed), and having descriptive key names actually hurts, since they add up to the document size.
No built in full text search
Although there are ways: couchdb-lucene, elasticsearch
plus some more:
It doesn't support transactions
It means that enforcing uniqueness of one field across all documents is not safe, for example, enforcing that a username is unique. Another consequence of CouchDB's inability to support the typical notion of a transaction is that things like inc/decrementing a value and saving it back are also dangerous. There aren't many instances that we would want to simply inc/decrement some value where we couldn't just store the individual documents separately and aggregate them with a view.
Relational data
If the data makes a lot of sense to be in 3rd normal form, and we try to follow that form in CouchDB, we are going to run into a lot of trouble. A possible way to solve this problem is with view collations, but we might constantly going to be fighting with the system. If the data can be reformatted to be much more denormalized, then CouchDB will work fine.
Data warehouse
The problem with this is that temporary views in CouchDB on large datasets are really slow. Using CouchDB and permanent views could work quite well. However, in most of cases, a Column-Oriented Database of some sort is a much better tool for the data warehousing job.
But CouchDB Rocks!
But don't let it discorage you: NoSQL DBs that are written in Erlang (CouchDB, Riak) are the best, since Erlang is meant for distributed systems. Have fun with Couch!
2 more things, which make me cry when using CouchDB (though it's awesome):
It is not designed for frequently updated data
It doesn't have built-in fulltext search
Lack of reader ACLs (does exist for writers, however)
As an old Lotus Domino pro I was looking to CouchDB as an alternative for a new project I'm kicking off and found the limits on readers to be very weak in Couch vs. Domino. In my app security is an important consideration and Couch would require a middleware layer to handle reader security.
If you have database in which it's okay that all defined users can see all the documents, then Couch looks like an interesting platform.
If restricting reads is needed then you'll need to look to a middleware solution or consider another alternative.
Note to CouchDB developers: Improve the platform security options. I realize they will diminish performance when used but note that and make the option available.
Now back to determining which database to use...
currently no support for ad-hoc queries (might change with advent of UnQL)
lack of binary protocol support for faster communication
It's nothing to do with CouchDB itself, but being a relative newcomer on the scene means that most sysadmins are still unfamiliar with it and won't allow it anywhere near "their" data centers. If you're in a situation where you're deploying to an environment you don't control yourself, this can be quite the battle.
Lack of support for data archiving - No official support for data
archiving is provided with couch db open source distribution.
Deleting records from db is not straightforward
No option to set a expire (TTL) flag for documents

Difference between Memcached and Hadoop?

What is the basic difference between Memcached and Hadoop? Microsoft seems to do memcached with the Windows Server AppFabric.
I know memcached is a giant key value hashing function using multiple servers. What is hadoop and how is hadoop different from memcached? Is it used to store data? objects? I need to save giant in memory objects, but it seems like I need some kind of way of splitting this giant objects into "chunks" like people are talking about. When I look into splitting the object into bytes, it seems like Hadoop is popping up.
I have a giant class in memory with upwards of 100 mb in memory. I need to replicate this object, cache this object in some fashion. When I look into caching this monster object, it seems like I need to split it like how google is doing. How is google doing this. How can hadoop help me in this regard. My objects are not simple structured data. It has references up and down the classes inside, etc.
Any idea, pointers, thoughts, guesses are helpful.
Thanks.
memcached [ http://en.wikipedia.org/wiki/Memcached ] is a single focused distributed caching technology.
apache hadoop [ http://hadoop.apache.org/ ] is a framework for distributed data processing - targeted at google/amazon scale many terrabytes of data. It includes sub-projects for the different areas of this problem - distributed database, algorithm for distributed processing, reporting/querying, data-flow language.
The two technologies tackle different problems. One is for caching (small or large items) across a cluster. And the second is for processing large items across a cluster. From your question it sounds like memcached is more suited to your problem.
Memcache wont work due to its limit on the value of object stored.
memcache faq . I read some place that this limit can be increased to 10 mb but i am unable to find the link.
For your use case I suggest giving mongoDB a try.
mongoDb faq . MongoDB can be used as alternative to memcache. It provides GridFS for storing large file systems in the DB.
You need to use pure Hadoop for what you need (no HBASE, HIVE etc). The Map Reduce mechanism will split your object into many chunks and store it in Hadoop. The tutorial for Map Reduce is here. However, don't forget that Hadoop is, in the first place, a solution for massive compute and storage. In your case I would also recommend checking Membase which is implementation of Memcached with addition storage capabilities. You will not be able to map reduce with memcached/membase but those are still distributed and your object may be cached in a cloud fashion.
Picking a good solution depends on requirements of the intended use, say the difference between storing legal documents forever to a free music service. For example, can the objects be recreated or are they uniquely special? Would they be requiring further processing steps (i.e., MapReduce)? How quickly does an object (or a slice of it) need to be retrieved? Answers to these questions would affect the solution set widely.
If objects can be recreated quickly enough, a simple solution might be to use Memcached as you mentioned across many machines totaling sufficient ram. For adding persistence to this later, CouchBase (formerly Membase) is worth a look and used in production for very large game platforms.
If objects CANNOT be recreated, determine if S3 and other cloud file providers would not meet requirements for now. For high-throuput access, consider one of the several distributed, parallel, fault-tolerant filesystem solutions: DDN (has GPFS and Lustre gear), Panasas (pNFS). I've used DDN gear and it had a better price point than Panasas. Both provide good solutions that are much more supportable than a DIY BackBlaze.
There are some mostly free implementations of distributed, parallel filesystems such as GlusterFS and Ceph that are gaining traction. Ceph touts an S3-compatible gateway and can use BTRFS (future replacement for Lustre; getting closer to production ready). Ceph architecture and presentations. Gluster's advantage is the option for commercial support, although there could be a vendor supporting Ceph deployments. Hadoop's HDFS may be comparable but I have not evaluated it recently.

What NoSQL solution to choose for domain names database?

I have a project that stores several millions of domain names in database and perform search requests to find if domain is present in DB. The only operation I need - check if given value exists. No range queries, no additional information, nothing.
The number of queries that I make to database is rather big, for example 100'000 per one user session.
I have new database once a day and even it's possible to check what records were deleted and what added - I don't think that it's worth it. So, I am importing database to a new table and point script to a new name.
Looking for solution that can make the whole things faster, as I don't use any SQL features. Name search and import time are important for me.
My server can't store this database in memory, even half of it, so I think some NoSQL solution working from hard drive can help me.
Can you suggest something?
A much smaller and faster solution would be to use Berkeley DB with the key-value pair API. Berkeley DB is a database library that links into your application, so there is no client/server overhead nor separate server to install and manage. Berkeley DB is very straightforward and provides, among several APIs, a simple key-value (NoSQL) API that provides all of the basic data management routines that you would expect to find in a much larger, more complex RDBMS (indexing, secondary indexes, foreign keys), but without the overhead of a SQL engine.
Disclaimer: I am the Product Manager for Berkeley DB, so I am a little biased. That said, it was designed to do exactly what you're asking for -- straightforward, fast, scalable key-value data management without unnecessary overhead.
In fact, there are many "database domain" type application services that use Berkeley DB as their primary data store. Most of the open source and/or commercial LDAP implementations use Berkeley DB (including OpenLDAP, Redhat's LDAP, Sun Directory Server, etc.). Cisco, Juniper, AT&T, Alcatel, Mitel, Motorola and many others use Berkeley DB to manage their They use Berkeley DB for their gateway, authentication, and configuration management systems, They use BDB because it does exactly what they need, it's very fast, scalable and reliable.
You could get by quite nicely with just a Bloom filter if you can accept a very small false positive rate (assuming you use a large enough filter).
On the other hand, you could certainly use Cassandra. It makes heavy use of bloom filters, so asking for something that doesn't exist is quick, and you don't have to worry about false positives. It's designed to handle data sets that do not fit into memory, so performance degredation there is quite smooth.
Importing any amount of data should be quick -- on a normal machine, Cassandra can handle about 15k writes per second.
Many options here. Berkeley DB certainly does the job and is probably one of the simplest solutions. Just as simple: store everything in memcached, then you have the option of splitting the cache of the values across several machines if needed (if query load or data size grows).

Storing millions of log files - Approx 25 TB a year

As part of my work we get approx 25TB worth log files annually, currently it been saved over an NFS based filesystem. Some are archived as in zipped/tar.gz while others reside in pure text format.
I am looking for alternatives of using an NFS based system. I looked at MongoDB, CouchDB. The fact that they are document oriented database seems to make it the right fit. However the log files content needs to be changed to JSON to be store into the DB. Something I am not willing to do. I need to retain the log files content as is.
As for usage we intend to put a small REST API and allow people to get file listing, latest files, and ability to get the file.
The proposed solutions/ideas need to be some form of distributed database or filesystem at application level where one can store log files and can scale horizontally effectively by adding more machines.
Ankur
Since you dont want queriying features, You can use apache hadoop.
I belive HDFS and HBase will be nice fit for this.
You can see lot of huge storage stories inside Hadoop powered by page
Take a look at Vertica, a columnar database supporting parallel processing and fast queries. Comcast used it to analyze about 15GB/day of SNMP data, running at an average rate of 46,000 samples per second, using five quad core HP Proliant servers. I heard some Comcast operations folks rave about Vertica a few weeks ago; they still really like it. It has some nice data compression techniques and "k-safety redundancy", so they could dispense with a SAN.
Update: One of the main advantages of a scalable analytics database approach is that you can do some pretty sophisticated, quasi-real time querying of the log. This might be really valuable for your ops team.
Have you tried looking at gluster? It is scalable, provides replication and many other features. It also gives you standard file operations so no need to implement another API layer.
http://www.gluster.org/
I would strongly disrecommend using a key/value or document based store for this data (mongo, cassandra, etc.). Use a file system. This is because the files are so large, and the access pattern is going to be linear scan. One thing problem that you will run into is retention. Most of the "NoSQL" storage systems use logical delete, which means that you have to compact your database to remove deleted rows. You'll also have a problem if your individual log records are small and you have to index each one of them - your index will be very large.
Put your data in HDFS with 2-3 way replication in 64 MB chunks in the same format that it's in now.
If you are to choose a document database:
On CouchDB you can use the _attachement API to attach the file as is to a document, the document itself could contain only metadata (like timestamp, locality and etc) for indexing. Then you will have a REST API for the documents and the attachments.
A similar approach is possible with Mongo's GridFs, but you would build the API yourself.
Also HDFS is a very nice choice.

HBase cassandra couchdb mongodb..any fundamental difference?

I just wanted to know if there is a fundamental difference between hbase, cassandra, couchdb and monogodb ? In other words, are they all competing in the exact same market and trying to solve the exact same problems. Or they fit best in different scenarios?
All this comes to the question, what should I chose when. Matter of taste?
Thanks,
Federico
Those are some long answers from #Bohzo. (but they are good links)
The truth is, they're "kind of" competing. But they definitely have different strengths and weaknesses and they definitely don't all solve the same problems.
For example Couch and Mongo both provide Map-Reduce engines as part of the main package. HBase is (basically) a layer over top of Hadoop, so you also get M-R via Hadoop. Cassandra is highly focused on being a Key-Value store and has plug-ins to "layer" Hadoop over top (so you can map-reduce).
Some of the DBs provide MVCC (Multi-version concurrency control). Mongo does not.
All of these DBs are intended to scale horizontally, but they do it in different ways. All of these DBs are also trying to provide flexibility in different ways. Flexible document sizes or REST APIs or high redundancy or ease of use, they're all making different trade-offs.
So to your question: In other words, are they all competing in the exact same market and trying to solve the exact same problems?
Yes: they're all trying to solve the issue of database-scalability and performance.
No: they're definitely making different sets of trade-offs.
What should you start with?
Man, that's a tough question. I work for a large company pushing tons of data and we've been through a few years. We tried Cassandra at one point a couple of years ago and it couldn't handle the load. We're using Hadoop everywhere, but it definitely has a steep learning curve and it hasn't worked out in some of our environments. More recently we've tried to do Cassandra + Hadoop, but it turned out to be a lot of configuration work.
Personally, my department is moving several things to MongoDB. Our reasons for this are honestly just simplicity.
Setting up Mongo on a linux box takes minutes and doesn't require root access or a change to the file system or anything fancy. There are no crazy config files or java recompiles required. So from that perspective, Mongo has been the easiest "gateway drug" for getting people on to KV/Document stores.
CouchDB and MongoDB are document stores
Cassandra and HBase are key-value based
Here is a detailed comparison between HBase and Cassandra
Here is a (biased) comparison between MongoDB and CouchDB
Short answer: test before you use in production.
I can offer my experience with both HBase (extensive) and MongoDB (just starting).
Even though they are not the same kind of stores, they solve the same problems:
scalable storage of data
random access to the data
low latency access
We were very enthusiastic about HBase at first. It is built on Hadoop (which is rock-solid), it is under Apache, it is active... what more could you want? Our experience:
HBase is fragile
administrator's nightmare (full of configuration settings where default ones are less than perfect, nontransparent configuration, changes from version to version,...)
loses data (unless you have set the X configuration and changed Y to... you get the point :) - we found that out when HBase crashed and we lost 2 hours (!!!) of data because WAL was not setup properly
lacks secondary indexes
lacks any way to perform a backup of database without shutting it down
All in all, HBase was a nightmare. Wouldn't recommend it to anyone except to our direct competitors. :)
MongoDB solves all these problems and many more. It is a delight to setup, it makes administrating it a simple and transparent job and the default configuration settings actually make sense. You can perform (hot) backups, you can have secondary indexes. From what I read, I wouldn't recommend MapReduce on MongoDB (JavaScript, 1 thread per node only), but you can use Hadoop for that.
And it is also VERY active when compared to HBase.
Also:
http://www.google.com/trends?q=HBase%2CMongoDB
Need I say more? :)
UPDATE: many months later I must say MongoDB delivered on all accounts and more. The only real downside is that hosting companies do not offer it the way they offer MySQL. ;)
It also looks like MapReduce is bound to become multi-threaded in 2.2. Still, I wouldn't use MR this way. YMMV.
Cassandra is good for writing the data. it has advantage of "writes never fail". It has no single point failure.
HBase is very good for data processing. HBase is based on Hadoop File System (HDFS) so HBase dosen't need to worry for data replication, data consistency. HBase has the single point of failure. I am not really sure that what does it's mean if it has single point of failure then it is somhow similar to RDBMS where we have single point of failure. I might be wrong in sense since I am quite new.
How abou RIAK ? Does someone has experience using RIAK. I red some where that you need to pay, I am not sure. Need explanation.
One more thing which one you will prefer to use when you are only concern to reading a lot of data. You don't have any concern with writing. Just imagine you have database with pitabyte and you want to make fast search which NOSQL database would you prefer ?