Which NoSQL DB is best fitted for OLTP financial systems? - mongodb

We're designing an OLTP financial system. it should be able to support 10.000 transactions per second and have reporting features.
So we have come to the idea of using:
a NoSQL DB as our main storage
a MySQL DB (Percona server actually) making some ETLs from the NoSQL DB for store reporting data
We're considering MongoDB and Riak for the NoSQL job. we have read that Riak scales more smoothly than MongoDB. And we would like to listen your opinion.
Which NoSQL DB would you use for a
OLTP financial system?
How has been
your experience scaling MongoDB/Riak?

There is no conceivable circumstance where I would use a NOSQl database for anything to do with finance. You simply don't have the data integrity needed or the internal controls. Dow Jones uses SQL Server to do its transactions and if they can properly design a high performance, high transaction Relational datbase so can you. You will have to invest in some people who know what they are doing though.

One has to think about the problem differently. The notion of transaction consistency stems from the UD (update) in CRUD (Create, Read, Update, Delete). noSQL DBs are CRAP (Create, Replicate, Append, Process) oriented, working by accretion of time-stamped data. With the right domain model, there is no reason that auditability and the equivalent of referential integrity can't be achieved.

The global-storage based NoSQL databases - Cache from InterSystems and GT.M from FIS - are used extensively in financial services and have been for many years. Cache in particular is used for both the core database and for OLTP.

I can answer regarding my experience with scaling Riak.
Riak scales smoothly to the extreme. Scaling is as easy as adding nodes to the cluster, which is a very simple operation in itself. You can achieve near linear scalability by simply adding nodes. Our experience with Riak as far as scaling is concerned has been amazing.
The flip side is that it is lacking in many respects. Some examples:
You can't do something like count(*) or list keys on a production cluster. That would require a work around if you want to do ETL from Riak into MySQL - or how would you know what to (E)xtract?
(One possible work around would be to maintain a bucket with a known key sequence that map to values that contain the keys you inserted into your other buckets).
The free version of Riak comes with no management console that lets you know what's going on, and the one that's included in the Enterprise version isn't much of an improvement.
You'll need the Enterprise version of you're looking to replicate your data over WAN (e.g. for DR / high availability). That's alright if you don't mind paying, but keep in mind that Basho pricing is very high.

I work with the Starcounter (so I’m biased), but I think I can safely say that for a system processing financial transactions you have to worry about transaction consistency. Unfortunately, this is what the engines used for Facebook and Twitter had to give up allow their scale-out strategy to offer performance. This is not because engines such as MongoDb or Cassandra are poorly designed; rather, it follows naturally from the CAP theorem (http://en.wikipedia.org/wiki/CAP_theorem). Simply put, changes you make in your database will overwrite other changes if they occur close in time. Ok for status updates and new tweets, but disastrous if you deal with money or other quantities. The amounts will simply end up wrong when many reads and writes are being done in parallel. So for the throughput you need, a memory centric NoSQL database with ACID support is probably the way to go.

You can use some NoSQL databases (Cassandra, EventStore) as a storage for financial service if you implement your app using event sourcing and concepts from DDD. I recommend you to read this minibook http://www.oreilly.com/programming/free/reactive-microservices-architecture.html

OLTP can be achieved using NoSQL with a custom implementation,
there are two things,
1. How are you going to achieve ACID properties that an RDBMS gives.
2. Provide a custom blocking or non blocking concurrency and transaction handling mechanism.
To take you closer to solution,
Apache Phoenix,apache trafodion or Splice machine.

Trafodion has full ACID support over HBase, you should take a look.

Cassandra can be used for both OLTP and OLAP. Good replication and eventual data consistency gives you the choice in your hand. Need to design the system properly. And after all it's free of cost but not free of developer, give it a try

Related

Is there a reliable (single server) MongoDB alternative?

I like the idea of document databases, especially MongoDB. It allows for faster development as we don't have to adjust database schema's. However MongoDB doesn't support multi-document transactions and doesn't guarantee that modifications get written to disk immediately like normal databases (I know that you can make the time between flushes quite small, but it's still no guarantee).
Most of our projects are not that big that they need things like multi-server environments. So keeping that in mind. Are there any single server MongoDB-like document databases that support multi-document transactions and reliable flushing to disk?
It might be worthwhile to look at ArangoDB. It is a multi model database with a flexible data model for documents, graphs, and key-values. With respect to your specific requirements, ArangoDB database has full ACID transactions which can span over multiple documents in the same collection as well as over multiple collections (see Transactions in ArangoDB). That is, you can execute a group of manipulations to your documents together in a transaction and have guaranteed atomicity and isolation. If you additionally set waitForSync: true
(as described further down on said page), you get a guaranteed sync to disk before your transaction reports completion. Note that this happens automatically if your transaction spans multiple collections.
A very short answer to your specific (but brief) requirements:
Are there any single server MongoDB-like document databases that support multi-document transactions and reliable flushing to disk?
RavenDB [1] provides support for multi-doc transactions [2]. Unfortunately I don't know it handles durability.
CouchDB [3] provides durable writes, but no multi-doc transactions
RethinkDB [4] provides durable writes, but no multi-doc transactions.
So you might wonder what's different about these 3 solutions? Most of the time is their querying support (I'd say RethinkDB has the most advanced one covering pretty much all types of queries: sub-queries, JOINs, aggregations, etc.), their history (read: production readiness -- here I'd probably say CouchDB is in the lead), their distribution model (you mentioned that's not interesting for you), their licensing (RavenDB: commercial, CouchDB: Apache License, Rethinkdb: AGPL).
The next step would be for you to briefly look over their feature set and figure out which one comes close to your needs and give it a try.
I have some experience with CouchDB and ArangoDB which I can share:
You can run CouchDB with durability turned on (delayed_commits = false) so it will also sync your data to disk.
However, this is a global setting so it affects all writes. AFAIK you cannot set it on a per-collection level (the CouchDB term for "collection" would be "database").
Regarding multi-document operations: CouchDB has MVCC, so reading multiple documents from the same database provides a consistent result even in the face of parallel writers.
Writing multiple documents to the same database can also be made transactional for special cases, e.g. when using the bulk documents API.
But there is no way to execute cross-database operations in CouchDB. This is just not intended.
On ArangoDB: in ArangoDB you can turn on immediate syncing to disk on a per-collection level: you can turn it on for collections which you cannot tolerate any data loss in. You can turn immediate syncing off for not-so-important collections for performance reasons. It will then still sync modifications to disk frequently, but not immediately. It provides multi-document and multi-collection transactions.
Checkout the following:
arangodb
rethinkdb
I would suggest you look at Couchbase.
Couchbase can be run single server & you can add nodes later if you want.
Couchbase has memcached integrated so you have fast caching of common data, with a reliable method of writing updates to disk.
They also have a new query language (in development but you can use it now) called NQL ("Nickel") that gives you SQL like access, if that's important to you.
With cross-datacenter replication, you can keep two DBs on different machines or data centers in sync, which is good for having an offsite backup. This also allows you to add elastic search if you wish to have a full text search engine for those types of queries.
In short, Couchbase is a pretty complete solution, all open source and has intelligent (in my opinion) architecture for addressing the typical problems with distributed databases (e.g.: every document is "owned" by a given node, so all changes go to that node, and then the updates are replicated, this is better, I think, than say Riak where you can have updates go to two nodes and then have to be reconciled.)
You can use Couchbase on one node to run the database for many projects by separating the projects into different buckets.
there are so many nosql databases and definitely its hard to choose one. You will have to come up with proper requirements and know exactly what you want.
Following link compared almost all the popular nosql databases
http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
I hope this helps.
Berkeley DB is one we used. It supports ACID. It does have transactions, but as to your term "multidocument" applies, I'm not entirely sure. I imagine so long as each database (i.e. individual document) shares the same BDB environment (i.e. where transactions are stored) then maybe that gets what you want. BDB does have other tradeoffs though. With fully durability and high concurrency, commits are pretty slow.
Give a try to: http://www.orientdb.org/
"OrientDB has the flexibility of the Document databases and the power of the Graph databases to manage relationships. It can work in schema-less mode, schema-full or a mix of both. Supports advanced features such as ACID Transactions, Fast Indexes, Native and SQL queries. It imports and exports documents in JSON. OrientDB uses a new indexing algorithm called MVRB-Tree, derived from the Red-Black Tree and from the B+Tree with benefits of both: fast insertion and ultra fast lookup".
You do not have to adjust schemas in document data stores, but that does not mean you do not need some sort of schema as you probably want to do something meaningful with your data. It appears you would like an ACID database. If you have relational data, and you need transactions with that data, well it sounds very much like you need a relational database.
With "NoSQL" databases like Mongo, you are giving up ACID for features like many writable replicas, sharding, and quick accessing of document data. Sounds like you do not benefit from that so why take the tradeoff? A lot of people have been doing hybrid approaches lately with PostgreSQL by storing documents in a relational table as blobs of JSON. With this, you can have the advantage of storing your data as not strictly structured columns where it is not needed.
So if you have multiple documents that you need to be transactional on update, you can column out the keys, and have a column "document" or something where it is simply a blob of JSON where you serialize and deserialize it. This is not criticizing Mongo or other document stores as a database but it is just not really a good choice for transactional multidocument data. MarkLogic I believe does ACID over multiple documents too.
I think a lot of people find appeal with mongodb due to the schema-less-ness but I think in the end they get bit by trying to shoehorn a relational model into it. So as always the DB choice depends on how your data is.
If I were you I would take a close look at Solr. The underlying data-layer (Lucene) is by far the most mature of the NoSQL databases, and Solr makes installing, configuring, and integrating a single-host lucene store trivial.
In answer to your question, it supports user-delineated transactions. The read-optimised nature of Lucene can make it unsuitable for many applications, but most of those are well suited to Solr/Lucene+[SQL,Cassandra,CouchDB,RDF] depending on the requirements.
Personally I tend to start with Solr+SQL or Solr+RDF, but I know some people who love the whole NodeJS+CouchDB style, and I am convinced of the value of the flexibility that provides.
The bottom line is that there are enough NoSQL and SQL-extensions out there that care about data integrity to satisfy any requirement you have without you having to compromise you or your users' data.
Personally I believe you really need to check what your requirements are.
Due to the dynamics of how the OS of your server works it is complicated to say that everything "immediately" goes to disk even when you tell it to. certainly I know ACID techs like SQL are vulnerable to partial corruption through unfinished business and losing operations within a specific window when a single server goes down, unfortunately this is one of the problems of using a single server; you have no choice but to accept it.
I should note that a transaction does not ensure that your server will receive the entire data before failure ( http://en.wikipedia.org/wiki/Database_transaction ), I mean what if the server dies part way through a transaction?
You can perform a safe rollback based on constraints with transactions but few databases will provide the ability to continue playing the transaction unless they have already received all necessary data for it (which isn't normally the case), by which time the data might even be stale anyway.
In fact due to the weight of some transactions and the amount of queries performed within them I reckon you might get a greater window of operational loss using transactions than you might from the 60ms write to disk window on MongoDB at times. But of course that depends upon abuse, however, just like stored procedures, this abuse is common place.
Transactions shine on cascading deletes and typical scenarios like transferring money in a bank account, however, cascadable deletes are normally better done (as most sites do) by a cronjob with the application marking the row as deleted (to avoid the rollback of a transaction showing the deleted data back to the user again); this way you can do a lot of stuff to ensure consistency that you cannot in real-time do while the user is using your application.
So you should really question why you need a tech and what it will succeed in doing, atm the brevity of your question tells me your not sure about your requirements completely.

SQL vs NoSQL for an inventory management system

I am developing a JAVA based web application. The primary aim is to have inventory for products being sold on multiple websites called channels. We will act as manager for all these channels.
What we need is:
Queues to manage inventory updates for each channel.
Inventory table which has a correct snapshot of allocation on each channel.
Keeping Session Ids and other fast access data in a cache.
Providing a facebook like dashboard(XMPP) to keep the seller updated asap.
The solutions i am looking at are postgres(our db till now in a synchronous replication mode), NoSQL solutions like Cassandra, Redis, CouchDB and MongoDB.
My constraints are:
Inventory updates cannot be lost.
Job Queues should be executed in order and preferably never lost.
Easy/Fast development and future maintenance.
I am open to any suggestions. thanks in advance.
Queues to manage inventory updates for each channel.
This is not necessarily a database issue. You might be better off looking at a messaging system(e.g. RabbitMQ)
Inventory table which has a correct snapshot of allocation on each channel.
Keeping Session Ids and other fast access data in a cache.
session data should probably be put in a separate database more suitable for the task(e.g. memcached, redis, etc)
There is no one-size-fits-all DB
Providing a facebook like dashboard(XMPP) to keep the seller updated asap.
My constraints are:
1. Inventory updates cannot be lost.
There are 3 ways to answer this question:
This feature must be provided by your application. The database can guarantee that a bad record is rejected and rolled back, but not guarantee that every query will get entered.
The app will have to be smart enough to recognize when an error happens and try again.
some DBs store records in memory and then flush memory to disk peridocally, this could lead to data loss in the case of a power failure. (e.g Mongo works this way by default unless you enable journaling. CouchDB always appends to the records(even a delete is a flag appended to the record so data loss is extremely difficult))
Some DBs are designed to be extremely reliable, even if an earthquake, hurricane or other natural disaster strikes, they remain durable. these include Cassandra, Hbase, Riak, Hadoop, etc
Which type of durability are your referring to?
Job Queues should be executed in order and preferably never lost.
Most noSQL solutions prefer to run in parallel. so you have two options here.
1. use a DB that locks the entire table for every query(slower)
2. build your app to be smarter or evented(client side sequential queuing)
Easy/Fast development and future maintenance.
generally, you will find that SQL is faster to develop at first, but changes can be harder to implement
noSQL may require a little more planning, but is easier to do ad hoc queries or schema changes.
The questions you probably need to ask yourself are more like:
"Will I need to have intense queries or deep analysis that a Map/Reduce is better suited to?"
"will I need to my change my schema frequently?
"is my data highly relational? in what way?"
"does the vendor behind my chosen DB have enough experience to help me when I need it?"
"will I need special feature such as GeoSpatial indexing, full text search, etc?"
"how close to realtime will I need my data? will it hurt if I don't see the latest records show up in my queries until 1sec later? what level of latency is acceptable?"
"what do I really need in terms of fail-over"
"how big is my data? will it fit in memory? will it fit on one computer? is each individual record large or small?
"how often will my data change? is this an archive?"
If you are going to have multiple customers(channels?) each with their own inventory schemas, a document based DB might have it's advantages. I remember one time I looked at an ecommerce system with inventory and it had almost 235 tables!
Then again, if you have certain relational data, a SQL solution can really have some advantages too.
I can certainly see how I could build a solution using mongo, couch, riak or orientdb with the given constraints. But as for which is the best? I would try talking directly DB vendors, and maybe watch the nosql tapes
Addressing your constraints:
Most NoSQL solutions give you a configurable tradeoff of consistency vs. performance. In MongoDB, for instance, you can decide how durable a write should be. If you want to, you can force the write to be fsync'ed on all your replica set servers. On the other extreme, you can choose to send the command and don't even wait for the server's response.
Executing job queues in order seems to be an application code issue. I'd say a timestamp in the db and an order by type of query should do for most applications. If you have multiple application servers and your queues need to be perfect, you'd have to use a truly distributed algorithm that provides ordering, but that is not a typical requirement, and it's very tricky indeed.
We've been using MongoDB for some time now, and I'm convinced this gives your app development speed a real boost. There's no big difference in maintenance, maintaining data is a pain either way. Not having a schema gives you added flexibility (lazy migrations), but it's more elaborate and requires some care.
In summary, I'd say you can do it both ways. The NoSQL is more code driven, and transactions and relational integrity are mostly managed by your code. If you're uncomfortable with that, go for a relational DB.
However, if you're data grows huge, you'll have to code some of this logic manually because you probably wouldn't want to do real-time joins on a 10B row database. Still, you can implement that with SQL as well.
A good way to find the boundary for different databases is to consider what you can cache. Data that can be cached and reconstructed at any time are a great way to start introducing a new layer, because there's no big risks there. Also, cached data usually doesn't keep any relations so you're not sacrificing any consistency here.
NoSQL is not correct for this application.
I mean, you can use it sure, but you will end up re-implementing a lot of what SQL offers for you. For example I see a lot of relations there. You also want ACID (although some NoSQL solutions do offer that).
There is no reason you can't use both - keep relational data in relational databases, and non-relational data in key/value stores.

MongoDB vs. Cassandra vs. MySQL for real-time advertising platform

I'm working on a real-time advertising platform with a heavy emphasis on performance. I've always developed with MySQL, but I'm open to trying something new like MongoDB or Cassandra if significant speed gains can be achieved. I've been reading about both all day, but since both are being rapidly developed, a lot of the information appears somewhat dated.
The main data stored would be entries for each click, incremented rows for views, and information for each campaign (just some basic settings, etc). The speed gains need to be found in inserting clicks, updating view totals, and generating real-time statistic reports. The platform is developed with PHP.
Or maybe none of these?
There are several ways to achieve this with all of the technologies listed. It is more a question of how you use them. Your ideal solution may use a combination of these, with some consideration for usage patterns. I don't feel that the information out there is that dated because the concepts at play are very fundamental. There may be new NoSQL databases and fixes to existing ones, but your question is primarily architectural.
NoSQL solutions like MongoDB and Cassandra get a lot of attention for their insert performance. People tend to complain about the update/insert performance of relational databases but there are ways to mitigate these issues.
Starting with MySQL you could review O'Reilly's High Performance MySQL, optimise the schema, add more memory perhaps run this on different hardware from the rest of your app (assuming you used MySQL for that), or partition/shard data. Another area to consider is your application. Can you queue inserts and updates at the application level before insertion into the database? This will give you some flexibility and is probably useful in all cases. Depending on how your final schema looks, MySQL will give you some help with extracting the data as long as you are comfortable with SQL. This is a benefit if you need to use 3rd party reporting tools etc.
MongoDB and Cassandra are different beasts. My understanding is that it was easier to add nodes to the latter but this has changed since MongoDB has replication etc built-in. Inserts for both of these platforms are not constrained in the same manner as a relational database. Pulling data out is pretty quick too, and you have a lot of flexibility with data format changes. The tradeoff is that you can't use SQL (a benefit for some) so getting reports out may be trickier. There is nothing to stop you from collecting data in one of these platforms and then importing it into a MySQL database for further analysis.
Based on your requirements there are tools other than NoSQL databases which you should look at such as Flume. These make use of the Hadoop platform which is used extensively for analytics. These may have more flexibility than a database for what you are doing. There is some content from Hadoop World that you might be interested in.
Characteristics of MySQL:
Database locking (MUCH easier for financial transactions)
Consistency/security (as above, you can guarantee that, for instance, no changes happen between the time you read a bank account balance and you update it).
Data organization/refactoring (you can have disorganized data anywhere, but MySQL is better with tables that represent "types" or "components" and then combining them into queries -- this is called normalization).
MySQL (and relational databases) are more well suited for arbitrary datasets and requirements common in AGILE software projects.
Characteristics of Cassandra:
Speed: For simple retrieval of large documents. However, it will require multiple queries for highly relational data – and "by default" these queries may not be consistent (and the dataset can change between these queries).
Availability: The opposite of "consistency". Data is always available, regardless of being 100% "correct".[1]
Optional fields (wide columns): This CAN be done in MySQL with meta tables etc., but it's for-free and by-default in Cassandra.
Cassandra is key-value or document-based storage. Think about what that means. TYPICALLY I give Cassandra ONE KEY and I get back ONE DATASET. It can branch out from there, but that's basically what's going on. It's more like accessing a static file. Sure, you can have multiple indexes, counter fields etc. but I'm making a generalization. That's where Cassandra is coming from.
MySQL and SQL is based on group/set theory -- it has a way to combine ANY relationship between data sets. It's pretty easy to take a MySQL query, make the query a "key" and the response a "value" and store it into Cassandra (e.g. make Cassandra a cache). That might help explain the trade-off too, MySQL allows you to always rearrange your data tables and the relationships between datasets simply by writing a different query. Cassandra not so much. And know that while Cassandra might PROVIDE features to do some of this stuff, it's not what it was built for.
MongoDB and CouchDB fit somewhere in the middle of those two extremes. I think MySQL can be a bit verbose[2] and annoying to deal with especially when dealing with optional fields, and migrations if you don't have a good model or tools. Also with scalability, I'm sure there are great technologies for scaling a MySQL database, but Cassandra will always scale, and easily, due to limitations on its feature set. MySQL is a bit more unbounded. However, NoSQL and Cassandra do not do joins, one of the critical features of SQL that allows one to combine multiple tables in a single query. So, complex relational queries will not scale in Cassandra.
[1] Consistency vs. availability is a trade-off within large distributed dataset. It takes a while to make all nodes aware of new data, and eg. Cassandra opts to answer quickly and not to check with every single node before replying. This can causes weird edge cases when you base you writes off previously read data and overwriting data. For more information look into the CAP Theorem, ACID database (in particular Atomicity) as well as Idempotent database operations. MySQL has this issue too, but the idea of high availability over correctness is very baked into Cassandra and gives it many of its scalability and speed advantages.
[2] SQL being "verbose" isn't a great reason to not use it – plus most of us aren't going to (and shouldn't) write plain-text SQL statements.
Nosql solutions are better than Mysql, postgresql and other rdbms techs for this task. Don't waste your time with Hbase/Hadoop, you've to be an astronaut to use it. I recommend MongoDB and Cassandra. Mongo is better for small datasets (if your data is maximum 10 times bigger than your ram, otherwise you have to shard, need more machines and use replica sets). For big data; cassandra is the best. Mongodb has more query options and other functionalities than cassandra but you need 64 bit machines for mongo. There are some works around for analytics in both sides. There is atomic counters in both sides. Both can scale well but cassandra is much better in scaling and high availability. Both have php clients, both have good support and community (mongo community is bigger).
Cassandra analytics project sample:Rainbird http://www.slideshare.net/kevinweil/rainbird-realtime-analytics-at-twitter-strata-2011
mongo sample: http://www.slideshare.net/jrosoff/scalable-event-analytics-with-mongodb-ruby-on-rails
http://axonflux.com/how-superfeedr-built-analytics-using-mongodb
doubleclick developers developed mongo http://www.informationweek.com/news/software/info_management/224200878
Cassandra vs. MongoDB
Are you considering Cassandra or MongoDB as the data store for your next project? Would you like to compare the two databases? Cassandra and MongoDB are both “NoSQL” databases, but the reality is that they are very different. They have very different strengths and value propositions – so any comparison has to be a nuanced one. Let’s start with initial requirements… Neither of these databases replaces RDBMS, nor are they “ACID” databases. So If you have a transactional workload where normalization and consistency are the primary requirements, neither of these databases will work for you. You are better off sticking with traditional relational databases like MySQL, PostGres, Oracle etc. Now that we have relational databases out of the way, let’s consider the major differences between Cassandra and MongoDB that will help you make the decision. In this post, I am not going to discuss specific features but will point out some high-level strategic differences to help you make your choice.
Expressive Object Model
MongoDB supports a rich and expressive object model. Objects can have properties and objects can be nested in one another (for multiple levels). This model is very “object-oriented” and can easily represent any object structure in your domain. You can also index the property of any object at any level of the hierarchy – this is strikingly powerful! Cassandra, on the other hand, offers a fairly traditional table structure with rows and columns. Data is more structured and each column has a specific type which can be specified during creation.
Verdict: If your problem domain needs a rich data model then MongoDB is a better fit for you.
Secondary Indexes
Secondary indexes are a first-class construct in MongoDB. This makes it easy to index any property of an object stored in MongoDB even if it is nested. This makes it really easy to query based on these secondary indexes. Cassandra has only cursory support for secondary indexes. Secondary indexes are also limited to single columns and equality comparisons. If you are mostly going to be querying by the primary key then Cassandra will work well for you.
Verdict: If your application needs secondary indexes and needs flexibility in the query model then MongoDB is a better fit for you.
High Availability
MongoDB supports a “single master” model. This means you have a master node and a number of slave nodes. In case the master goes down, one of the slaves is elected as master. This process happens automatically but it takes time, usually 10-40 seconds. During this time of new leader election, your replica set is down and cannot take writes. This works for most applications but ultimately depends on your needs. Cassandra supports a “multiple master” model. The loss of a single node does not affect the ability of the cluster to take writes – so you can achieve 100% uptime for writes.
Verdict: If you need 100% uptime Cassandra is a better fit for you.
Write Scalability
MongoDB with its “single master” model can take writes only on the primary. The secondary servers can only be used for reads. So essentially if you have three node replica set, only the master is taking writes and the other two nodes are only used for reads. This greatly limits write scalability. You can deploy multiple shards but essentially only 1/3 of your data nodes can take writes. Cassandra with its “multiple master” model can take writes on any server. Essentially your write scalability is limited by the number of servers you have in the cluster. The more servers you have in the cluster, the better it will scale.
Verdict: If write scalability is your thing, Cassandra is a better fit for you.
Query Language Support
Cassandra supports the CQL query language which is very similar to SQL. If you already have a team of data analysts they will be able to port over a majority of their SQL skills which is very important to large organizations. However CQL is not full blown ANSI SQL – It has several limitations (No join support, no OR clauses) etc. MongoDB at this point has no support for a query language. The queries are structured as JSON fragments.
Verdict: If you need query language support, Cassandra is the better fit for you.
Performance Benchmarks
Let’s talk performance. At this point, you are probably expecting a performance benchmark comparison of the databases. I have deliberately not included performance benchmarks in the comparison. In any comparison, we have to make sure we are making an apples-to-apples comparison.
Database model - The database model/schema of the application being tested makes a big difference. Some schemas are well suited for MongoDB and some are well suited for Cassandra. So when comparing databases it is important to use a model that works reasonably well for both databases.
Load characteristics – The characteristics of the benchmark load are very important. E.g. In write-heavy benchmarks, I would expect Cassandra to smoke MongoDB. However, in read-heavy benchmarks, MongoDB and Cassandra should be similar in performance.
Consistency requirements - This is a tricky one. You need to make sure that the read/write consistency requirements specified are identical in both databases and not biased towards one participant. Very often in a number of the ‘Marketing’ benchmarks, the knobs are tuned to disadvantage the other side. So, pay close attention to the consistency settings.
One last thing to keep in mind is that the benchmark load may or may not reflect the performance of your application. So in order for benchmarks to be useful, it is very important to find a benchmark load that reflects the performance characteristics of your application. Here are some benchmarks you might want to look at:
- NoSQL Performance Benchmarks
- Cassandra vs. MongoDB vs. Couchbase vs. HBase
Ease of Use
If you had asked this question a couple of years ago MongoDB would be the hands-down winner. It’s a fairly simple task to get MongoDB up and running. In the last couple of years, however, Cassandra has made great strides in this aspect of the product. With the adoption of CQL as the primary interface for Cassandra, it has taken this a step further – they have made it very simple for legions of SQL programmers to use Cassandra very easily.
Verdict: Both are fairly easy to use and ramp up.
Native Aggregation
MongoDB has a built-in Aggregation framework to run an ETL pipeline to transform the data stored in the database. This is great for small to medium jobs but as your data processing needs become more complicated the aggregation framework becomes difficult to debug. Cassandra does not have a built-in aggregation framework. External tools like Hadoop, Spark are used for this.
Schema-less Models
In MongoDB, you can choose to not enforce any schema on your documents. While this was the default in prior versions in the newer version you have the option to enforce a schema for your documents. Each document in MongoDB can be a different structure and it is up to your application to interpret the data. While this is not relevant to most applications, in some cases the extra flexibility is important. Cassandra in the newer versions (with CQL as the default language) provides static typing. You need to define the type of very column upfront.
I'd also like to add Membase (www.couchbase.com) to this list.
As a product, Membase has been deployed at a number of Ad Agencies (AOL Advertising, Chango, Delta Projects, etc). There are a number of public case studies and examples of how these companies have used Membase successfully.
While it's certainly up for debate, we've found that Membase provides better performance and scalability than any other solution. What we lack in indexing/querying, we are planning on more than making up for with the integration of CouchDB as our new persistence backend.
As a company, Couchbase (the makers of Membase) has a large amount of knowledge and experience specifically serving the needs of Ad/targeting companies.
Would certainly love to engage with you on this particular use case to see if Membase is the right fit.
Please shoot me an email (perry -at- couchbase -dot- com) or visit us on the forums: http://www.couchbase.org/forums/
Perry Krug
I would look at New Relic as an example of a similar workload. They capture over 200 Billion data points a day to disk and are using MySQL 5.6 (Percona) as a backend.
A blog post is available here:
http://blog.newrelic.com/2014/06/13/store-200-billion-data-points-day-disk/

Example of a task that a NoSQL database can't handle (if any)

I would like to test the NoSQL world. This is just curiosity, not an absolute need (yet).
I have read a few things about the differences between SQL and NoSQL databases. I'm convinced about the potential advantages, but I'm a little worried about cases where NoSQL is not applicable. If I understand NoSQL databases essentially miss ACID properties.
Can someone give an example of some real world operation (for example an e-commerce site, or a scientific application, or...) that an ACID relational database can handle but where a NoSQL database could fail miserably, either systematically with some kind of race condition or because of a power outage, etc ?
The perfect example will be something where there can't be any workaround without modifying the database engine. Examples where a NoSQL database just performs poorly will eventually be another question, but here I would like to see when theoretically we just can't use such technology.
Maybe finding such an example is database specific. If this is the case, let's take MongoDB to represent the NoSQL world.
Edit:
to clarify this question I don't want a debate about which kind of database is better for certain cases. I want to know if this technology can be an absolute dead-end in some cases because no matter how hard we try some kind of features that a SQL database provide cannot be implemented on top of nosql stores.
Since there are many nosql stores available I can accept to pick an existing nosql store as a support but what interest me most is the minimum subset of features a store should provide to be able to implement higher level features (like can transactions be implemented with a store that don't provide X...).
This question is a bit like asking what kind of program cannot be written in an imperative/functional language. Any Turing-complete language and express every program that can be solved by a Turing Maching. The question is do you as a programmer really want to write a accounting system for a fortune 500 company in non-portable machine instructions.
In the end, NoSQL can do anything SQL based engines can, the difference is you as a programmer may be responsible for logic in something Like Redis that MySQL gives you for free. SQL databases take a very conservative view of data integrity. The NoSQL movement relaxes those standards to gain better scalability, and to make tasks that are common to Web Applications easier.
MongoDB (my current preference) makes replication and sharding (horizontal scaling) easy, inserts very fast and drops the requirement for a strict scheme. In exchange users of MongoDB must code around slower queries when an index is not present, implement transactional logic in the app (perhaps with three phase commits), and we take a hit on storage efficiency.
CouchDB has similar trade-offs but also sacrifices ad-hoc queries for the ability to work with data off-line then sync with a server.
Redis and other key value stores require the programmer to write much of the index and join logic that is built in to SQL databases. In exchange an application can leverage domain knowledge about its data to make indexes and joins more efficient then the general solution the SQL would require. Redis also require all data to fit in RAM but in exchange gives performance on par with Memcache.
In the end you really can do everything MySQL or Postgres do with nothing more then the OS file system commands (after all that is how the people that wrote these database engines did it). It all comes down to what you want the data store to do for you and what you are willing to give up in return.
Good question. First a clarification. While the field of relational stores is held together by a rather solid foundation of principles, with each vendor choosing to add value in features or pricing, the non-relational (nosql) field is far more heterogeneous.
There are document stores (MongoDB, CouchDB) which are great for content management and similar situations where you have a flat set of variable attributes that you want to build around a topic. Take site-customization. Using a document store to manage custom attributes that define the way a user wants to see his/her page is well suited to the platform. Despite their marketing hype, these stores don't tend to scale into terabytes that well. It can be done, but it's not ideal. MongoDB has a lot of features found in relational databases, such as dynamic indexes (up to 40 per collection/table). CouchDB is built to be absolutely recoverable in the event of failure.
There are key/value stores (Cassandra, HBase...) that are great for highly-distributed storage. Cassandra for low-latency, HBase for higher-latency. The trick with these is that you have to define your query needs before you start putting data in. They're not efficient for dynamic queries against any attribute. For instance, if you are building a customer event logging service, you'd want to set your key on the customer's unique attribute. From there, you could push various log structures into your store and retrieve all logs by customer key on demand. It would be far more expensive, however, to try to go through the logs looking for log events where the type was "failure" unless you decided to make that your secondary key. One other thing: The last time I looked at Cassandra, you couldn't run regexp inside the M/R query. Means that, if you wanted to look for patterns in a field, you'd have to pull all instances of that field and then run it through a regexp to find the tuples you wanted.
Graph databases are very different from the two above. Relations between items(objects, tuples, elements) are fluid. They don't scale into terabytes, but that's not what they are designed for. They are great for asking questions like "hey, how many of my users lik the color green? Of those, how many live in California?" With a relational database, you would have a static structure. With a graph database (I'm oversimplifying, of course), you have attributes and objects. You connect them as makes sense, without schema enforcement.
I wouldn't put anything critical into a non-relational store. Commerce, for instance, where you want guarantees that a transaction is complete before delivering the product. You want guaranteed integrity (or at least the best chance of guaranteed integrity). If a user loses his/her site-customization settings, no big deal. If you lose a commerce transation, big deal. There may be some who disagree.
I also wouldn't put complex structures into any of the above non-relational stores. They don't do joins well at-scale. And, that's okay because it's not the way they're supposed to work. Where you might put an identity for address_type into a customer_address table in a relational system, you would want to embed the address_type information in a customer tuple stored in a document or key/value. Data efficiency is not the domain of the document or key/value store. The point is distribution and pure speed. The sacrifice is footprint.
There are other subtypes of the family of stores labeled as "nosql" that I haven't covered here. There are a ton (122 at last count) different projects focused on non-relational solutions to data problems of various types. Riak is yet another one that I keep hearing about and can't wait to try out.
And here's the trick. The big-dollar relational vendors have been watching and chances are, they're all building or planning to build their own non-relational solutions to tie in with their products. Over the next couple years, if not sooner, we'll see the movement mature, large companies buy up the best of breed and relational vendors start offering integrated solutions, for those that haven't already.
It's an extremely exciting time to work in the field of data management. You should try a few of these out. You can download Couch or Mongo and have them up and running in minutes. HBase is a bit harder.
In any case, I hope I've informed without confusing, that I have enlightened without significant bias or error.
RDBMSes are good at joins, NoSQL engines usually aren't.
NoSQL engines is good at distributed scalability, RDBMSes usually aren't.
RDBMSes are good at data validation coinstraints, NoSQL engines usually aren't.
NoSQL engines are good at flexible and schema-less approaches, RDBMSes usually aren't.
Both approaches can solve either set of problems; the difference is in efficiency.
Probably answer to your question is that mongodb can handle any task (and sql too). But in some cases better to choose mongodb, in others sql database. About advantages and disadvantages you can read here.
Also as #Dmitry said mongodb open door for easy horizontal and vertical scaling with replication & sharding.
RDBMS enforce strong consistency while most no-sql are eventual consistent. So at a given point in time when data is read from a no-sql DB it might not represent the most up-to-date copy of that data.
A common example is a bank transaction, when a user withdraw money, node A is updated with this event, if at the same time node B is queried for this user's balance, it can return an outdated balance. This can't happen in RDBMS as the consistency attribute guarantees that data is updated before it can be read.
RDBMs are really good for quickly aggregating sums, averages, etc. from tables. e.g. SELECT SUM(x) FROM y WHERE z. It's something that is surprisingly hard to do in most NoSQL databases, if you want an answer at once. Some NoSQL stores provide map/reduce as a way of solving the same thing, but it is not real time in the same way it is in the SQL world.

When should I use a NoSQL database instead of a relational database? Is it okay to use both on the same site?

What are the advantages of using NoSQL databases? I've read a lot about them lately, but I'm still unsure why I would want to implement one, and under what circumstances I would want to use one.
Relational databases enforces ACID. So, you will have schema based transaction oriented data stores. It's proven and suitable for 99% of the real world applications. You can practically do anything with relational databases.
But, there are limitations on speed and scaling when it comes to massive high availability data stores. For example, Google and Amazon have terabytes of data stored in big data centers. Querying and inserting is not performant in these scenarios because of the blocking/schema/transaction nature of the RDBMs. That's the reason they have implemented their own databases (actually, key-value stores) for massive performance gain and scalability.
NoSQL databases have been around for a long time - just the term is new. Some examples are graph, object, column, XML and document databases.
For your 2nd question: Is it okay to use both on the same site?
Why not? Both serves different purposes right?
NoSQL solutions are usually meant to solve a problem that relational databases are either not well suited for, too expensive to use (like Oracle) or require you to implement something that breaks the relational nature of your db anyway.
Advantages are usually specific to your usage, but unless you have some sort of problem modeling your data in a RDBMS I see no reason why you would choose NoSQL.
I myself use MongoDB and Riak for specific problems where a RDBMS is not a viable solution, for all other things I use MySQL (or SQLite for testing).
If you need a NoSQL db you usually know about it, possible reasons are:
client wants 99.999% availability on
a high traffic site.
your data makes
no sense in SQL, you find yourself
doing multiple JOIN queries for
accessing some piece of information.
you are breaking the relational
model, you have CLOBs that store
denormalized data and you generate
external indexes to search that data.
If you don't need a NoSQL solution keep in mind that these solutions weren't meant as replacements for an RDBMS but rather as alternatives where the former fails and more importantly that they are relatively new as such they still have a lot of bugs and missing features.
Oh, and regarding the second question it is perfectly fine to use any technology in conjunction with another, so just to be complete from my experience MongoDB and MySQL work fine together as long as they aren't on the same machine
Martin Fowler has an excellent video which gives a good explanation of NoSQL databases. The link goes straight to his reasons to use them, but the whole video contains good information.
You have large amounts of data - especially if you cannot fit it all on one physical server as NoSQL was designed to scale well.
Object-relational impedance mismatch - Your domain objects do not fit well in a relaitional database schema. NoSQL allows you to persist your data as documents (or graphs) which may map much more closely to your data model.
NoSQL is a database system where data is organized into the document (MongoDB), key-value pair (MemCache, Redis), and graph structure form(Neo4J).
Maybe there are possible questions and answer for "When to go for NoSQL":
Require flexible schema or deal with tree-like data?
Generally, in agile development we start designing systems without knowing all requirements upfront, whereas later on throughout the development database system may need to accommodate frequent design changes, showcasing MVP (Minimal Viable product).
Or you are dealing with a data schema that is dynamic in nature.
e.g. System logs, very precise example is AWS cloudtrail logs.
Data set is vast/big?
Yes NoSQL databases are the better candidate for applications where the database needs to manage millions or even billions of records without compromising performance and availability while may be trading for inconsistency(though modern databases are exception here where it allows tunable consistency over availability e.g. Casandra, Cloud provider databases CosmosDB, DynamoDB).
Trade-off between scaling over consistency
Unlike RDMS, NoSQL databases may make the dataset consistent across other nodes eventually which is the default behavior, but it's easy to scale in terms of performance and availability.
Example: This may be good for storing people who are online in the instant messaging app, API tokens in DB, and logging website traffic stats.
Performing Geolocation Operations:
MongoDB hash rich support for doing GeoQuerying & Geolocation operations. I really loved this feature of MongoDB. So does the PostresSQL but ease of implementation is something that depends on the use case
In nutshell, MongoDB is a great fit for applications where you can store dynamic structured data on a large scale.
Edits:
Updated the answer about the consistency of the database.
Some essential information is missing to answer the question: Which use cases must the database be able to cover? Do complex analyses have to be performed from existing data (OLAP) or does the application have to be able to process many transactions (OLTP)? What is the data structure? That is far from the end of question time.
In my view, it is wrong to make technology decisions on the basis of bold buzzwords without knowing exactly what is behind them. NoSQL is often praised for its scalability. But you also have to know that horizontal scaling (over several nodes) also has its price and is not free. Then you have to deal with issues like eventual consistency and define how to resolve data conflicts if they cannot be resolved at the database level. However, this applies to all distributed database systems.
The joy of the developers with the word "schema less" at NoSQL is at the beginning also very big. This buzzword is quickly disenchanted after technical analysis, because it correctly does not require a schema when writing, but comes into play when reading. That is why it should correctly be "schema on read". It may be tempting to be able to write data at one's own discretion. But how do I deal with the situation if there is existing data but the new version of the application expects a different schema?
The document model (as in MongoDB, for example) is not suitable for data models where there are many relationships between the data. Joins have to be done on application level, which is additional effort and why should I program things that the database should do.
If you make the argument that Google and Amazon have developed their own databases because conventional RDBMS can no longer handle the flood of data, you can only say: You are not Google and Amazon. These companies are the spearhead, some 0.01% of scenarios where traditional databases are no longer suitable, but for the rest of the world they are.
What's not insignificant: SQL has been around for over 40 years and millions of hours of development have gone into large systems such as Oracle or Microsoft SQL. This has to be achieved by some new databases. Sometimes it is also easier to find an SQL admin than someone for MongoDB. Which brings us to the question of maintenance and management. A subject that is not exactly sexy, but that is a part of the technology decision.
Handling A Large Number Of Read Write Operations
Look towards NoSQL databases when you need to scale fast. And when do you generally need to scale fast?
When there are a large number of read-write operations on your website & when dealing with a large amount of data, NoSQL databases fit best in these scenarios. Since they have the ability to add nodes on the fly, they can handle more concurrent traffic & big amount of data with minimal latency.
Flexibility With Data Modeling
The second cue is during the initial phases of development when you are not sure about the data model, the database design, things are expected to change at a rapid pace. NoSQL databases offer us more flexibility.
Eventual Consistency Over Strong Consistency
It’s preferable to pick NoSQL databases when it’s OK for us to give up on Strong consistency and when we do not require transactions.
A good example of this is a social networking website like Twitter. When a tweet of a celebrity blows up and everyone is liking and re-tweeting it from around the world. Does it matter if the count of likes goes up or down a bit for a short while?
The celebrity would definitely not care if instead of the actual 5 million 500 likes, the system shows the like count as 5 million 250 for a short while.
When a large application is deployed on hundreds of servers spread across the globe, the geographically distributed nodes take some time to reach a global consensus.
Until they reach a consensus, the value of the entity is inconsistent. The value of the entity eventually gets consistent after a short while. This is what Eventual Consistency is.
Though the inconsistency does not mean that there is any sort of data loss. It just means that the data takes a short while to travel across the globe via the internet cables under the ocean to reach a global consensus and become consistent.
We experience this behaviour all the time. Especially on YouTube. Often you would see a video with 10 views and 15 likes. How is this even possible?
It’s not. The actual views are already more than the likes. It’s just the count of views is inconsistent and takes a short while to get updated.
Running Data Analytics
NoSQL databases also fit best for data analytics use cases, where we have to deal with an influx of massive amounts of data.
I came across this question while looking for convincing grounds to deviate from RDBMS design.
There is a great post by Julian Brown which sheds lights on constraints of distributed systems. The concept is called Brewer's CAP Theorem which in summary goes:
The three requirements of distributed systems are : Consistency, Availability and Partition tolerance (CAP in short). But you can only have two of them at a time.
And this is how I summarised it for myself:
You better go for NoSQL if Consistency is what you are sacrificing.
I designed and implemented solutions with NoSQL databases and here is my checkpoint list to make the decision to go with SQL or document-oriented NoSQL.
DON'Ts
SQL is not obsolete and remains a better tool in some cases. It's hard to justify use of a document-oriented NoSQL when
Need OLAP/OLTP
It's a small project / simple DB structure
Need ad hoc queries
Can't avoid immediate consistency
Unclear requirements
Lack of experienced developers
DOs
If you don't have those conditions or can mitigate them, then here are 2 reasons where you may benefit from NoSQL:
Need to run at scale
Convenience of development (better integration with your tech stack, no need in ORM, etc.)
More info
In my blog posts I explain the reasons in more details:
7 reasons NOT to NoSQL
2 reasons to NoSQL
Note: the above is applicable to document-oriented NoSQL only. There are other types of NoSQL, which require other considerations.
Ran into this thread and wanted to add my experience.. Many SQL databases support json data in columns and support querying of this json. So what I have used is a hybrid using a relational database with columns containing json..