What is the recommended approach towards multi-tenant databases in MongoDB? - mongodb

I'm thinking of creating a multi-tenant app using MongoDB. I don't have any guesses in terms of how many tenants I'd have yet, but I would like to be able to scale into the thousands.
I can think of three strategies:
All tenants in the same collection, using tenant-specific fields for security
1 Collection per tenant in a single shared DB
1 Database per tenant
The voice in my head is suggesting that I go with option 2.
Thoughts and implications, anyone?

I have the same problem to solve and also considering variants.
As I have years of experience creating SaaS multi-tenant applicatios I also was going to select the second option based on my previous experience with the relational databases.
While making my research I found this article on mongodb support site (way back added since it's gone):
https://web.archive.org/web/20140812091703/http://support.mongohq.com/use-cases/multi-tenant.html
The guys stated to avoid 2nd options at any cost, which as I understand is not particularly specific to mongodb. My impression is that this is applicable for most of the NoSQL dbs I researched (CoachDB, Cassandra, CouchBase Server, etc.) due to the specifics of the database design.
Collections (or buckets or however they call it in different DBs) are not the same thing as security schemas in RDBMS despite they behave as container for documents they are useless for applying good tenant separation. I couldn't find NoSQL database that can apply security restrictions based on collections.
Of course you can use mongodb role based security to restrict the access on database/server level. (http://docs.mongodb.org/manual/core/authorization/)
I would recommend 1st option when:
You have enough time and resources to deal with the complexity of the
design, implementation and testing of this scenario.
If you are not going to have much differences in structure and
functionality in the database for different tenants.
Your application design will allow tenants to make only minimal
customizations at runtime.
If you want to optimize space and minimize usage of hardware
resources.
If you are going to have thousands of tenants.
If you want to scale out fast and at good cost.
If you are NOT going to backup data based on tenants (keep separate
backups for each tenant). It is possible to do that even in this
scenario but the effort will be huge.
I would go for variant 3 if:
You are going to have small list of tenants (several hundred).
The specifics of the business requires you to be able to support big differences in the database structure for different tenants (e.g. integration with 3rd-party systems, import-export of data).
Your application design will allow customers (tenants) to make significant changes in the application runtime (adding modules, customizing the fields etc.).
If you have enough resources to scale out with new hardware nodes quickly.
If you are required to keep versions/backups of data per tenant. Also the restore will be easy.
There are legal/regulatory restrictions that forces you to keep different tenants in different databases (even data centers).
If you want to fully utilize the out-of-the-box security features of mongodb such as roles.
There are big differences in matter of size between tenants (you have many small tenants and few very large tenants).
If you post additional details about your application, perhaps I can give you more detailed advice.

I found a good answer in the comments in this link:
http://blog.boxedice.com/2010/02/28/notes-from-a-production-mongodb-deployment/
Basically option #2 seems to be the best way to go.
Quote from David Mytton's comment:
We decided not to have a database per
customer because of the way MongoDB
allocates its data files. Each
database uses it’s own set of files:
The first file for a database is
dbname.0, then dbname.1, etc. dbname.0
will be 64MB, dbname.1 128MB, etc., up
to 2GB. Once the files reach 2GB in
size, each successive file is also
2GB.
Thus if the last datafile present is
say, 1GB, that file might be 90% empty
if it was recently reached.
from the manual.
As users sign up to the trial and give
things a go, we’d get more and more
databases that were at least 2GB in
size, even if the whole of the data
file wasn’t use. We found this used a
massive amount of disk space compared
to having several databases for all
customers where the disk space can be
used to maximum efficiency.
Sharding will be on a per collection
basis as standard which presents a
problem where the collection never
reaches the minimum size to start
sharding, as is the case for quite a
few of ours (e.g. collections just
storing user login details). However,
we have requested that this will also
be able to be done on a per database
level. See
http://jira.mongodb.org/browse/SHARDING-41
There are no performance tradeoffs
using lots of collections. See
http://www.mongodb.org/display/DOCS/Using+a+Large+Number+of+Collections

There is a reasonable article on MSDN about multi-tenant data architecture which you might wish to refer to. Some key topics touched on by this article:
Economic considerations
Security
Tenant considerations
Regulatory (legal)
Skill set concerns
Also touched upon are some patterns for Software as a Service (SaaS) configuration.
Additionally, worth a gander is an interesting write-up from the SQL Anywhere guys.
My own personal take - unless you are certain of enforced security / trust, I would go with option 3, or if scalability concerns prohibit fallback to option 2 at a minimum. That said... I'm no pro with MongoDB. I get pretty nervous using a shared "schema" - but I will happily defer to more experienced practitioners.

I would go for option 2.
However you could set mongod.exe command line option --smallfiles. This means that the biggest file size of an extent will be 0.5 gigabyte and not 2 gigabyte. I tested this with mongo 1.42 . So option 3 is not impossible.

According to my research in MongoDB. Trucos y consejos. Aplicaciones multitenant.
that option is not recommended if you do not know how many tenants you can have, it could be thousands and it would be complicated when it comes to sharding, also imagine having thousands of collections in a single database ... So in your case it is recommended to use option one. Now if you are going to have a limited number of users, it is already different and yes, you could use option two as you thought.

While the discussion here is on NoSQL and primarily MongoDB, we at Citus are using PostgreSQL and building a distributed/sharded multi-tenant database.
Our use-case guide walks through an example app, covering the schema and various multi-tenant specific features.
For more unstructured data, we use PostgreSQL's JSONB column to store such and tenant-specific data.

Related

Is it better to use multiple databases when you are managing independent sets of things in MongoDB?

If, as an example, you have a blogging website done with MongoDB to store data
Is it better to have a database per blogger? given that their blogs and comments are completely independent from other bloggers. Or just lump everything together? or it doesn't make too much difference?
I'm imagining the same web app (not independent webs/urls per blogger) is used by all bloggers. So when someone logs in / accesses the blog the code would find the right database to use and haul data out it.
Does this have any downsides? is this normal for handling these kinds of things?
I am making plenty of assumptions about your needs. But, generally, there are 3 paths to multi-tenant apps in MongoDB:
Single collection per customer; never, ever do this.
Single database per customer. Good. You will trade off free space if your product is on the freemium model. Either way, you will want to run with "smallfiles" option. As stated, you will build the routing system for your environment. Thus, you will want to connect to the proper database for the proper customer.
customer_id key per document + path slug. Good. The trade off here is recovery of free space. Traditionally, MongoDB does not recover space used by deleted documents. Thus customers creating and deleting blog posts would create unused space. By using 'usePowerOf2Sizes' collections, you will recover disk space of deleted documents. However, 'usePowerOf2Sizes' creates bloated padding space.
To get over the disk space padding, take a look at the compression used here: http://blog.appsignal.com/blog/2013/07/30/taming-mongodb-disk-usage.html
Recap, I would recommend using customer_id plus the compression. It gives you the best of both worlds.
As stated in the comments under the original question, there's really no performance benefit to splitting up your MongoDB store into separate databases per blogger, due to the overhead of having each database and minimum storage.
On the flipside: You are going to make some cross-user analysis more difficult for yourself. As a very simple example, based on your blogging example: Imagine you want to look at average post count per user. This is pretty simple if your users (and posts) are in the same database (typically in the same collections), and you can likely use the aggregation framework for this task. This task will not be so straightforward with an unbounded number of databases, where you'll need to first enumerate all databases, then perform your aggregations/averaging once per database. This could end up being a slower operation than within a single-database architecture.
Having said all that: You still might have some reason to split data across databases. Maybe you have to separate data due to legal reasons, or to ensure customers that their sensitive data won't be commingled with other companies' data. Maybe your customer needs full read/write access to their database, and so you use per-database configuration as a security boundary. I'm sure there are other reasons as well...
It is perfectly normal to allocate 100's of databases if that is all you will see.
Database separation can have many benefits. They can be sharded independantly, since sharding occurs on database level. Databases also have the upside of being completely isolated instances (including locks) of the data within them (good example: space allocation occurs on database level).
This means they can be moved around the network as users data is accessed more and since a single users data might not be that big it would be easier than moving all of your users data to a more powerful node.
However, you must consider the problematic sides in the application of managing the connections to each database. There will be over head on it and you will need to have far more complex coding than what is considered standard.
Considering space, you will not see a drastic usage of space. The most problematic part of using separate databases is the journal allocation. Every collection you use in separate databases will also, of course, pre-allocate itself but this is actually considered one of the upsides to using database separation (movement of databases between nodes, isolation).
So the space problem is really only a problem if your scenario makes it one.
is this normal for handling these kinds of things?
For a normal blogger site, no, and I do not know enough about the complexities of your scenario to say any different. Normal operation would be to lump everything together, since you could see into the region of 1,000's maybe 1,000,000's of users and database separation just won't scale over that very well.

When to use CouchDB over MongoDB and vice versa

I am stuck between these two NoSQL databases.
In my project, I will be creating a database within a database. For example, I need a solution to create dynamic tables.
So users can create tables with columns and rows. I think either MongoDB or CouchDB will be good for this, but I am not sure which one. I will also need efficient paging as well.
Of C, A & P (Consistency, Availability & Partition tolerance) which 2 are more important to you? Quick reference, the Visual Guide To NoSQL Systems
MongodB : Consistency and Partition Tolerance
CouchDB : Availability and Partition Tolerance
A blog post, Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase vs Membase vs Neo4j comparison has 'Best used' scenarios for each NoSQL database compared. Quoting the link,
MongoDB: If you need dynamic queries. If you prefer to define indexes, not map/reduce functions. If you need good performance on a big DB. If you wanted CouchDB, but your data changes too much, filling up disks.
CouchDB : For accumulating, occasionally changing data, on which pre-defined queries are to be run. Places where versioning is important.
A recent (Feb 2012) and more comprehensive comparison by Riyad Kalla,
MongoDB : Master-Slave Replication ONLY
CouchDB : Master-Master Replication
A blog post (Oct 2011) by someone who tried both, A MongoDB Guy Learns CouchDB commented on the CouchDB's paging being not as useful.
A dated (Jun 2009) benchmark by Kristina Chodorow (part of team behind MongoDB),
I'd go for MongoDB.
The answers above all overcomplicate the story.
If you plan to have a mobile component, or need desktop users to work offline and then sync their work to a server you need CouchDB.
If your code will run only on the server then go with MongoDB
That's it. Unless you need CouchDB's (awesome) ability to replicate to mobile and desktop devices, MongoDB has the performance, community and tooling advantage at present.
Very old question but it's on top of Google and I don't quite like the answers I see so here's my own.
There's much more to Couchdb than the ability to develop CouchApps. Most people use CouchDb in a classical 3-tiers web architecture.
In practice the deciding factor for most people will be the fact that MongoDb allows ad-hoc querying with a SQL like syntax while CouchDb doesn't (you've got to create map/reduce views which turns some people off even though creating these views is Rapid Application Development friendly - they have nothing to do with stored procedures).
To address points raised in the accepted answer : CouchDb has a great versionning system, but it doesn't mean that it is only suited (or more suited) for places where versionning is important. Also, couchdb is heavy-write friendly thanks to its append-only nature (writes operations return in no time while guaranteeing that no data will ever be lost).
One very important thing that is not mentioned by anyone is the fact that CouchDb relies on b-tree indexes. This means that whether you have 1 "row" or 20 billions, the querying time will always remain below 10ms. This is a game changer which makes CouchDb a low-latency and read-friendly database, and this really shouldn't be overlooked.
To be fair and exhaustive the advantage MongoDb has over CouchDb is tooling and marketing. They have first-class citizen tools for all major languages and platforms making the on-boarding easy and this added to their adhoc querying makes the transition from SQL even easier.
CouchDb doesn't have this level of tooling - even though there are many libraries available today - but CouchDb is exposed as an HTTP API and it is therefore quite easy to create a wrapper in your favorite language to talk with it. I personally like this approach as it avoids bloat and allows you to only take what you want (interface segregation principle).
So I'd say using one or the other is largely a matter of comfort and preference with their paradigms. CouchDb approach "just fits", for certain people, but if after learning about the database features (in the exhaustive official guide) you don't have your "hell yeah" moment, you should probably move on.
I'd discourage using CouchDb if you just want to use "the right tool for the right job". because you'll find out that you can't just use it that way and you'll end up being pissed and writing blog posts such as "Where are joins in CouchDb ?" and "Where is transaction management ?". Indeed Couchdb is - paradoxically - very transparent but at the same time requires a paradigm shift and a change in the way you approach problems to really shine (and really work).
But once you've done that it really pays off. I'd personally need very strong reasons or a major deal breaker on a project to choose another database, but so far I haven't met any.
Update December 2022:
Since this post is still getting a lot of views, I felt important to inform people that I have recently moved to using MongoDB as my daily driver, while keeping CouchDB in my toolbelt for specialized cases where this database makes more sense (namely cases where views are not needed). There were multiple reasons for this choice, the most important ones were:
Performance: While precomputed indexes are a powerful asset, the main limitation of CouchDB is its QueryServer architecture. Every time a document is updated, it has to be serialized and processed by every view (even though this happens in a deferred manner, namely when the view is accessed). But more importantly, every time a view is updated (for example to add filtering logic for a new field added as part of the implementation of a new feature), ALL documents of the database must be sent to the view. This becomes a big deal when you have millions of documents in the database. You start worrying about the impact of updating your views and it becomes a distraction. Should you decide to create one database per data type to bypass this limitation, you'd then lose the ability to map/reduce across all your documents since views are scoped per database. MongoDB avoids this by segmenting documents into collections (ie. data types) so that when an index is updated only a subset of the data of the database is impacted. Moreover, MongoDB uses a binary format making these operations way more performant (while CouchDB uses JSON sent to the view server in plain text). This point may not be important if you do not design products needing to operate at large scale (hundreds of thousands of daily users or more).
the tooling available with MongoDB is comprehensive and mature, whether we are talking about the drivers officially supported for various programming languages, or integration with IDEs.
Advanced querying: A wide range of data types and advanced query capabilities are available out of the box (geo types, GridFS allowing one to store files of arbitrary size directly in the DB etc...). Having easy access to powerful query aggregation capabilities made me realize how much CouchDB had been inhibiting my productivity.
Seamless support for resharding: resharding is easy with MongoDB, while it is a dangerous operation involving moving files by hands with CouchDB.
Many other small items that improve quality of life and really add up.
I have been a big CouchDB fan but I have to admit that moving to MongoDB as a daily driver felt a lot like moving back to civilization in terms of productivity and quality of life improvement. Now I only consider CouchDB for key-value store scenarios (in which no map-reduce views are required and all that is needed is getting a document by key - CouchDB shines quite a lot for this), and advanced situations in which having per-user like databases is needed (for example to support advanced synchronization between devices).
The only drawback I see with MongoDB is that it consumes a lot of memory to the point that I cannot install it on development machines having low specs (while by comparison couchdb is launched at startup without me noticing and consumes almost no resource). However I feel this is worth it considering the time saved and the features provided.
As a long-time CouchDB user, the value I see in MongoDB is quite different from the items highlighted in the other answers promoting MongoDB so I felt it was important for me to provide this update (and also out of intellectual honestly when I remembered this post). CouchDB gave me quite a boost in productivity back in the days compared to the SQL products and ORMs I had been using, and at that time there were a lot of horror stories circulating regarding the reliability of MongoDB.
However, as of now, the few concerns I could have (and that were probably given disproportionate importance by internet folks - they essentially all boiled down to defaults whose reliability tradeoffs may surprise new users in a number of scenarios) no longer stand.
At this point, as a long-time CouchDB user in a great position to compare both products, I would recommend MongoDB to people needing a productive and scalable software development experience for their web app and advise to only pick CouchDB for specific needs.
CouchDB had momentum back in the days which probably influenced my perception, but development has stalled, no meaningful features have been introduced for a long-time, otherwise it would probably have caught up with MongoDB in terms of quality of life. I see two possible reasons for this: the way a now aborted rewrite of CouchDB has diverted resources for a long-time, and maybe early architectural decisions (such as the Query Server architecture) that may very well have restricted its future from the start. None of these aspects seem to be the priority of the core team.
I do not totally regret choosing CouchDB because it has been massively helpful and the mindset it has taught me is extremely helpful to allow me to write performant code in MongoDB (writing performant code in MongoDB is a breeze compared to the discipline one has to observe to solve business problems using CouchDB). However if I had to do it again today, I would have transitioned to MongoDB as my daily driver MUCH sooner. I'm usually quite good at picking the winning horse when technologies popup, but this time it seems I haven't played the game that well. Hope this helps.
Ask this questions yourself? And you will decide your DB selection.
Do you need master-master? Then CouchDB. Mainly CouchDB supports master-master replication which anticipates nodes being disconnected for long periods of time. MongoDB would not do well in that environment.
Do you need MAXIMUM R/W throughput? Then MongoDB
Do you need ultimate single-server durability because you are only going to have a single DB server? Then CouchDB.
Are you storing a MASSIVE data set that needs sharding while maintaining insane throughput? Then MongoDB.
Do you need strong consistency of data? Then MongoDB.
Do you need high availability of database? Then CouchDB.
Are you hoping multi databases and multi tables/ collections? Then MongoDB
You have a mobile app offline users and want to sync their activity data to a server? Then you need CouchDB.
Do you need large variety of querying engine? Then MongoDB
Do you need large community to be using DB? Then MongoDB
I summarize the answers found in that article:
http://www.quora.com/How-does-MongoDB-compare-to-CouchDB-What-are-the-advantages-and-disadvantages-of-each
MongoDB: Better querying, data storage in BSON (faster access), better data consistency, multiple collections
CouchDB: Better replication, with master to master replication and conflict resolution, data storage in JSON (human-readable, better access through REST services), querying through map-reduce.
So in conclusion, MongoDB is faster, CouchDB is safer.
Also: http://nosql.mypopescu.com/post/298557551/couchdb-vs-mongodb
Be aware of an issue with sparse unique indexes in MongoDB. I've hit it and it is extremely cumbersome to workaround.
The problem is this - you have a field, which is unique if present and you wish to find all the objects where the field is absent. The way sparse unique indexes are implemented in Mongo is that objects where that field is missing are not in the index at all - they cannot be retrieved by a query on that field - {$exists: false} just does not work.
The only workaround I have come up with is having a special null family of values, where an empty value is translated to a special prefix (like null:) concatenated to a uuid. This is a real headache, because one has to take care of transforming to/from the empty values when writing/quering/reading. A major nuisance.
I have never used server side javascript execution in MongoDB (it is not advised anyway) and their map/reduce has awful performance when there is just one Mongo node. Because of all these reasons I am now considering to check out CouchDB, maybe it fits more to my particular scenario.
BTW, if anyone knows the link to the respective Mongo issue describing the sparse unique index problem - please share.
I'm sure you can with Mongo (more familiar with it), and pretty sure you can with couch too.
Both are documented oriented (JSON-based) so there would be no "columns" but rather fields in documents -- but they can be fully dynamic.
They both do it you may want to look at other factors on which to use: other features you care about, popularity, etc. Google insights and indeed.com job posts would be ways to look at popularity.
You could just try it I think you should be able to have mongo running in 5 minutes.

Disadvantages of CouchDB

I've very recently fallen in love with CouchDB. I'm pretty excited by its enormous benefits and by its beauty. Now I want to make sure that I haven't missed any show-stopping disadvantages.
What comes to your mind? Attached is a list of points that I have collected. Is there anything to add?
Blog posts from as late as 2010 claim "not mature enough" (whatever that's worth).
Slower than in-memory DBMS.
In-place updates require server-side logic (update handlers).
Trades disk vs. speed: Databases can become huge compared to other DBMS (compaction functionality exists, though).
"Only" eventual consistency.
Temporary views on large datasets are very slow.
Replication of large databases may fail.
Map/reduce paradigm requires rethinking (only for completeness).
The only point that worries me is #3 (in-place updates), because it's quite inconvenient.
The data is in JSON
Which means that documents are quite large (BigData, network bandwidth, speed), and having descriptive key names actually hurts, since they add up to the document size.
No built in full text search
Although there are ways: couchdb-lucene, elasticsearch
plus some more:
It doesn't support transactions
It means that enforcing uniqueness of one field across all documents is not safe, for example, enforcing that a username is unique. Another consequence of CouchDB's inability to support the typical notion of a transaction is that things like inc/decrementing a value and saving it back are also dangerous. There aren't many instances that we would want to simply inc/decrement some value where we couldn't just store the individual documents separately and aggregate them with a view.
Relational data
If the data makes a lot of sense to be in 3rd normal form, and we try to follow that form in CouchDB, we are going to run into a lot of trouble. A possible way to solve this problem is with view collations, but we might constantly going to be fighting with the system. If the data can be reformatted to be much more denormalized, then CouchDB will work fine.
Data warehouse
The problem with this is that temporary views in CouchDB on large datasets are really slow. Using CouchDB and permanent views could work quite well. However, in most of cases, a Column-Oriented Database of some sort is a much better tool for the data warehousing job.
But CouchDB Rocks!
But don't let it discorage you: NoSQL DBs that are written in Erlang (CouchDB, Riak) are the best, since Erlang is meant for distributed systems. Have fun with Couch!
2 more things, which make me cry when using CouchDB (though it's awesome):
It is not designed for frequently updated data
It doesn't have built-in fulltext search
Lack of reader ACLs (does exist for writers, however)
As an old Lotus Domino pro I was looking to CouchDB as an alternative for a new project I'm kicking off and found the limits on readers to be very weak in Couch vs. Domino. In my app security is an important consideration and Couch would require a middleware layer to handle reader security.
If you have database in which it's okay that all defined users can see all the documents, then Couch looks like an interesting platform.
If restricting reads is needed then you'll need to look to a middleware solution or consider another alternative.
Note to CouchDB developers: Improve the platform security options. I realize they will diminish performance when used but note that and make the option available.
Now back to determining which database to use...
currently no support for ad-hoc queries (might change with advent of UnQL)
lack of binary protocol support for faster communication
It's nothing to do with CouchDB itself, but being a relative newcomer on the scene means that most sysadmins are still unfamiliar with it and won't allow it anywhere near "their" data centers. If you're in a situation where you're deploying to an environment you don't control yourself, this can be quite the battle.
Lack of support for data archiving - No official support for data
archiving is provided with couch db open source distribution.
Deleting records from db is not straightforward
No option to set a expire (TTL) flag for documents

Example of a task that a NoSQL database can't handle (if any)

I would like to test the NoSQL world. This is just curiosity, not an absolute need (yet).
I have read a few things about the differences between SQL and NoSQL databases. I'm convinced about the potential advantages, but I'm a little worried about cases where NoSQL is not applicable. If I understand NoSQL databases essentially miss ACID properties.
Can someone give an example of some real world operation (for example an e-commerce site, or a scientific application, or...) that an ACID relational database can handle but where a NoSQL database could fail miserably, either systematically with some kind of race condition or because of a power outage, etc ?
The perfect example will be something where there can't be any workaround without modifying the database engine. Examples where a NoSQL database just performs poorly will eventually be another question, but here I would like to see when theoretically we just can't use such technology.
Maybe finding such an example is database specific. If this is the case, let's take MongoDB to represent the NoSQL world.
Edit:
to clarify this question I don't want a debate about which kind of database is better for certain cases. I want to know if this technology can be an absolute dead-end in some cases because no matter how hard we try some kind of features that a SQL database provide cannot be implemented on top of nosql stores.
Since there are many nosql stores available I can accept to pick an existing nosql store as a support but what interest me most is the minimum subset of features a store should provide to be able to implement higher level features (like can transactions be implemented with a store that don't provide X...).
This question is a bit like asking what kind of program cannot be written in an imperative/functional language. Any Turing-complete language and express every program that can be solved by a Turing Maching. The question is do you as a programmer really want to write a accounting system for a fortune 500 company in non-portable machine instructions.
In the end, NoSQL can do anything SQL based engines can, the difference is you as a programmer may be responsible for logic in something Like Redis that MySQL gives you for free. SQL databases take a very conservative view of data integrity. The NoSQL movement relaxes those standards to gain better scalability, and to make tasks that are common to Web Applications easier.
MongoDB (my current preference) makes replication and sharding (horizontal scaling) easy, inserts very fast and drops the requirement for a strict scheme. In exchange users of MongoDB must code around slower queries when an index is not present, implement transactional logic in the app (perhaps with three phase commits), and we take a hit on storage efficiency.
CouchDB has similar trade-offs but also sacrifices ad-hoc queries for the ability to work with data off-line then sync with a server.
Redis and other key value stores require the programmer to write much of the index and join logic that is built in to SQL databases. In exchange an application can leverage domain knowledge about its data to make indexes and joins more efficient then the general solution the SQL would require. Redis also require all data to fit in RAM but in exchange gives performance on par with Memcache.
In the end you really can do everything MySQL or Postgres do with nothing more then the OS file system commands (after all that is how the people that wrote these database engines did it). It all comes down to what you want the data store to do for you and what you are willing to give up in return.
Good question. First a clarification. While the field of relational stores is held together by a rather solid foundation of principles, with each vendor choosing to add value in features or pricing, the non-relational (nosql) field is far more heterogeneous.
There are document stores (MongoDB, CouchDB) which are great for content management and similar situations where you have a flat set of variable attributes that you want to build around a topic. Take site-customization. Using a document store to manage custom attributes that define the way a user wants to see his/her page is well suited to the platform. Despite their marketing hype, these stores don't tend to scale into terabytes that well. It can be done, but it's not ideal. MongoDB has a lot of features found in relational databases, such as dynamic indexes (up to 40 per collection/table). CouchDB is built to be absolutely recoverable in the event of failure.
There are key/value stores (Cassandra, HBase...) that are great for highly-distributed storage. Cassandra for low-latency, HBase for higher-latency. The trick with these is that you have to define your query needs before you start putting data in. They're not efficient for dynamic queries against any attribute. For instance, if you are building a customer event logging service, you'd want to set your key on the customer's unique attribute. From there, you could push various log structures into your store and retrieve all logs by customer key on demand. It would be far more expensive, however, to try to go through the logs looking for log events where the type was "failure" unless you decided to make that your secondary key. One other thing: The last time I looked at Cassandra, you couldn't run regexp inside the M/R query. Means that, if you wanted to look for patterns in a field, you'd have to pull all instances of that field and then run it through a regexp to find the tuples you wanted.
Graph databases are very different from the two above. Relations between items(objects, tuples, elements) are fluid. They don't scale into terabytes, but that's not what they are designed for. They are great for asking questions like "hey, how many of my users lik the color green? Of those, how many live in California?" With a relational database, you would have a static structure. With a graph database (I'm oversimplifying, of course), you have attributes and objects. You connect them as makes sense, without schema enforcement.
I wouldn't put anything critical into a non-relational store. Commerce, for instance, where you want guarantees that a transaction is complete before delivering the product. You want guaranteed integrity (or at least the best chance of guaranteed integrity). If a user loses his/her site-customization settings, no big deal. If you lose a commerce transation, big deal. There may be some who disagree.
I also wouldn't put complex structures into any of the above non-relational stores. They don't do joins well at-scale. And, that's okay because it's not the way they're supposed to work. Where you might put an identity for address_type into a customer_address table in a relational system, you would want to embed the address_type information in a customer tuple stored in a document or key/value. Data efficiency is not the domain of the document or key/value store. The point is distribution and pure speed. The sacrifice is footprint.
There are other subtypes of the family of stores labeled as "nosql" that I haven't covered here. There are a ton (122 at last count) different projects focused on non-relational solutions to data problems of various types. Riak is yet another one that I keep hearing about and can't wait to try out.
And here's the trick. The big-dollar relational vendors have been watching and chances are, they're all building or planning to build their own non-relational solutions to tie in with their products. Over the next couple years, if not sooner, we'll see the movement mature, large companies buy up the best of breed and relational vendors start offering integrated solutions, for those that haven't already.
It's an extremely exciting time to work in the field of data management. You should try a few of these out. You can download Couch or Mongo and have them up and running in minutes. HBase is a bit harder.
In any case, I hope I've informed without confusing, that I have enlightened without significant bias or error.
RDBMSes are good at joins, NoSQL engines usually aren't.
NoSQL engines is good at distributed scalability, RDBMSes usually aren't.
RDBMSes are good at data validation coinstraints, NoSQL engines usually aren't.
NoSQL engines are good at flexible and schema-less approaches, RDBMSes usually aren't.
Both approaches can solve either set of problems; the difference is in efficiency.
Probably answer to your question is that mongodb can handle any task (and sql too). But in some cases better to choose mongodb, in others sql database. About advantages and disadvantages you can read here.
Also as #Dmitry said mongodb open door for easy horizontal and vertical scaling with replication & sharding.
RDBMS enforce strong consistency while most no-sql are eventual consistent. So at a given point in time when data is read from a no-sql DB it might not represent the most up-to-date copy of that data.
A common example is a bank transaction, when a user withdraw money, node A is updated with this event, if at the same time node B is queried for this user's balance, it can return an outdated balance. This can't happen in RDBMS as the consistency attribute guarantees that data is updated before it can be read.
RDBMs are really good for quickly aggregating sums, averages, etc. from tables. e.g. SELECT SUM(x) FROM y WHERE z. It's something that is surprisingly hard to do in most NoSQL databases, if you want an answer at once. Some NoSQL stores provide map/reduce as a way of solving the same thing, but it is not real time in the same way it is in the SQL world.

When should I use a NoSQL database instead of a relational database? Is it okay to use both on the same site?

What are the advantages of using NoSQL databases? I've read a lot about them lately, but I'm still unsure why I would want to implement one, and under what circumstances I would want to use one.
Relational databases enforces ACID. So, you will have schema based transaction oriented data stores. It's proven and suitable for 99% of the real world applications. You can practically do anything with relational databases.
But, there are limitations on speed and scaling when it comes to massive high availability data stores. For example, Google and Amazon have terabytes of data stored in big data centers. Querying and inserting is not performant in these scenarios because of the blocking/schema/transaction nature of the RDBMs. That's the reason they have implemented their own databases (actually, key-value stores) for massive performance gain and scalability.
NoSQL databases have been around for a long time - just the term is new. Some examples are graph, object, column, XML and document databases.
For your 2nd question: Is it okay to use both on the same site?
Why not? Both serves different purposes right?
NoSQL solutions are usually meant to solve a problem that relational databases are either not well suited for, too expensive to use (like Oracle) or require you to implement something that breaks the relational nature of your db anyway.
Advantages are usually specific to your usage, but unless you have some sort of problem modeling your data in a RDBMS I see no reason why you would choose NoSQL.
I myself use MongoDB and Riak for specific problems where a RDBMS is not a viable solution, for all other things I use MySQL (or SQLite for testing).
If you need a NoSQL db you usually know about it, possible reasons are:
client wants 99.999% availability on
a high traffic site.
your data makes
no sense in SQL, you find yourself
doing multiple JOIN queries for
accessing some piece of information.
you are breaking the relational
model, you have CLOBs that store
denormalized data and you generate
external indexes to search that data.
If you don't need a NoSQL solution keep in mind that these solutions weren't meant as replacements for an RDBMS but rather as alternatives where the former fails and more importantly that they are relatively new as such they still have a lot of bugs and missing features.
Oh, and regarding the second question it is perfectly fine to use any technology in conjunction with another, so just to be complete from my experience MongoDB and MySQL work fine together as long as they aren't on the same machine
Martin Fowler has an excellent video which gives a good explanation of NoSQL databases. The link goes straight to his reasons to use them, but the whole video contains good information.
You have large amounts of data - especially if you cannot fit it all on one physical server as NoSQL was designed to scale well.
Object-relational impedance mismatch - Your domain objects do not fit well in a relaitional database schema. NoSQL allows you to persist your data as documents (or graphs) which may map much more closely to your data model.
NoSQL is a database system where data is organized into the document (MongoDB), key-value pair (MemCache, Redis), and graph structure form(Neo4J).
Maybe there are possible questions and answer for "When to go for NoSQL":
Require flexible schema or deal with tree-like data?
Generally, in agile development we start designing systems without knowing all requirements upfront, whereas later on throughout the development database system may need to accommodate frequent design changes, showcasing MVP (Minimal Viable product).
Or you are dealing with a data schema that is dynamic in nature.
e.g. System logs, very precise example is AWS cloudtrail logs.
Data set is vast/big?
Yes NoSQL databases are the better candidate for applications where the database needs to manage millions or even billions of records without compromising performance and availability while may be trading for inconsistency(though modern databases are exception here where it allows tunable consistency over availability e.g. Casandra, Cloud provider databases CosmosDB, DynamoDB).
Trade-off between scaling over consistency
Unlike RDMS, NoSQL databases may make the dataset consistent across other nodes eventually which is the default behavior, but it's easy to scale in terms of performance and availability.
Example: This may be good for storing people who are online in the instant messaging app, API tokens in DB, and logging website traffic stats.
Performing Geolocation Operations:
MongoDB hash rich support for doing GeoQuerying & Geolocation operations. I really loved this feature of MongoDB. So does the PostresSQL but ease of implementation is something that depends on the use case
In nutshell, MongoDB is a great fit for applications where you can store dynamic structured data on a large scale.
Edits:
Updated the answer about the consistency of the database.
Some essential information is missing to answer the question: Which use cases must the database be able to cover? Do complex analyses have to be performed from existing data (OLAP) or does the application have to be able to process many transactions (OLTP)? What is the data structure? That is far from the end of question time.
In my view, it is wrong to make technology decisions on the basis of bold buzzwords without knowing exactly what is behind them. NoSQL is often praised for its scalability. But you also have to know that horizontal scaling (over several nodes) also has its price and is not free. Then you have to deal with issues like eventual consistency and define how to resolve data conflicts if they cannot be resolved at the database level. However, this applies to all distributed database systems.
The joy of the developers with the word "schema less" at NoSQL is at the beginning also very big. This buzzword is quickly disenchanted after technical analysis, because it correctly does not require a schema when writing, but comes into play when reading. That is why it should correctly be "schema on read". It may be tempting to be able to write data at one's own discretion. But how do I deal with the situation if there is existing data but the new version of the application expects a different schema?
The document model (as in MongoDB, for example) is not suitable for data models where there are many relationships between the data. Joins have to be done on application level, which is additional effort and why should I program things that the database should do.
If you make the argument that Google and Amazon have developed their own databases because conventional RDBMS can no longer handle the flood of data, you can only say: You are not Google and Amazon. These companies are the spearhead, some 0.01% of scenarios where traditional databases are no longer suitable, but for the rest of the world they are.
What's not insignificant: SQL has been around for over 40 years and millions of hours of development have gone into large systems such as Oracle or Microsoft SQL. This has to be achieved by some new databases. Sometimes it is also easier to find an SQL admin than someone for MongoDB. Which brings us to the question of maintenance and management. A subject that is not exactly sexy, but that is a part of the technology decision.
Handling A Large Number Of Read Write Operations
Look towards NoSQL databases when you need to scale fast. And when do you generally need to scale fast?
When there are a large number of read-write operations on your website & when dealing with a large amount of data, NoSQL databases fit best in these scenarios. Since they have the ability to add nodes on the fly, they can handle more concurrent traffic & big amount of data with minimal latency.
Flexibility With Data Modeling
The second cue is during the initial phases of development when you are not sure about the data model, the database design, things are expected to change at a rapid pace. NoSQL databases offer us more flexibility.
Eventual Consistency Over Strong Consistency
It’s preferable to pick NoSQL databases when it’s OK for us to give up on Strong consistency and when we do not require transactions.
A good example of this is a social networking website like Twitter. When a tweet of a celebrity blows up and everyone is liking and re-tweeting it from around the world. Does it matter if the count of likes goes up or down a bit for a short while?
The celebrity would definitely not care if instead of the actual 5 million 500 likes, the system shows the like count as 5 million 250 for a short while.
When a large application is deployed on hundreds of servers spread across the globe, the geographically distributed nodes take some time to reach a global consensus.
Until they reach a consensus, the value of the entity is inconsistent. The value of the entity eventually gets consistent after a short while. This is what Eventual Consistency is.
Though the inconsistency does not mean that there is any sort of data loss. It just means that the data takes a short while to travel across the globe via the internet cables under the ocean to reach a global consensus and become consistent.
We experience this behaviour all the time. Especially on YouTube. Often you would see a video with 10 views and 15 likes. How is this even possible?
It’s not. The actual views are already more than the likes. It’s just the count of views is inconsistent and takes a short while to get updated.
Running Data Analytics
NoSQL databases also fit best for data analytics use cases, where we have to deal with an influx of massive amounts of data.
I came across this question while looking for convincing grounds to deviate from RDBMS design.
There is a great post by Julian Brown which sheds lights on constraints of distributed systems. The concept is called Brewer's CAP Theorem which in summary goes:
The three requirements of distributed systems are : Consistency, Availability and Partition tolerance (CAP in short). But you can only have two of them at a time.
And this is how I summarised it for myself:
You better go for NoSQL if Consistency is what you are sacrificing.
I designed and implemented solutions with NoSQL databases and here is my checkpoint list to make the decision to go with SQL or document-oriented NoSQL.
DON'Ts
SQL is not obsolete and remains a better tool in some cases. It's hard to justify use of a document-oriented NoSQL when
Need OLAP/OLTP
It's a small project / simple DB structure
Need ad hoc queries
Can't avoid immediate consistency
Unclear requirements
Lack of experienced developers
DOs
If you don't have those conditions or can mitigate them, then here are 2 reasons where you may benefit from NoSQL:
Need to run at scale
Convenience of development (better integration with your tech stack, no need in ORM, etc.)
More info
In my blog posts I explain the reasons in more details:
7 reasons NOT to NoSQL
2 reasons to NoSQL
Note: the above is applicable to document-oriented NoSQL only. There are other types of NoSQL, which require other considerations.
Ran into this thread and wanted to add my experience.. Many SQL databases support json data in columns and support querying of this json. So what I have used is a hybrid using a relational database with columns containing json..