I'm working on a real-time advertising platform with a heavy emphasis on performance. I've always developed with MySQL, but I'm open to trying something new like MongoDB or Cassandra if significant speed gains can be achieved. I've been reading about both all day, but since both are being rapidly developed, a lot of the information appears somewhat dated.
The main data stored would be entries for each click, incremented rows for views, and information for each campaign (just some basic settings, etc). The speed gains need to be found in inserting clicks, updating view totals, and generating real-time statistic reports. The platform is developed with PHP.
Or maybe none of these?
There are several ways to achieve this with all of the technologies listed. It is more a question of how you use them. Your ideal solution may use a combination of these, with some consideration for usage patterns. I don't feel that the information out there is that dated because the concepts at play are very fundamental. There may be new NoSQL databases and fixes to existing ones, but your question is primarily architectural.
NoSQL solutions like MongoDB and Cassandra get a lot of attention for their insert performance. People tend to complain about the update/insert performance of relational databases but there are ways to mitigate these issues.
Starting with MySQL you could review O'Reilly's High Performance MySQL, optimise the schema, add more memory perhaps run this on different hardware from the rest of your app (assuming you used MySQL for that), or partition/shard data. Another area to consider is your application. Can you queue inserts and updates at the application level before insertion into the database? This will give you some flexibility and is probably useful in all cases. Depending on how your final schema looks, MySQL will give you some help with extracting the data as long as you are comfortable with SQL. This is a benefit if you need to use 3rd party reporting tools etc.
MongoDB and Cassandra are different beasts. My understanding is that it was easier to add nodes to the latter but this has changed since MongoDB has replication etc built-in. Inserts for both of these platforms are not constrained in the same manner as a relational database. Pulling data out is pretty quick too, and you have a lot of flexibility with data format changes. The tradeoff is that you can't use SQL (a benefit for some) so getting reports out may be trickier. There is nothing to stop you from collecting data in one of these platforms and then importing it into a MySQL database for further analysis.
Based on your requirements there are tools other than NoSQL databases which you should look at such as Flume. These make use of the Hadoop platform which is used extensively for analytics. These may have more flexibility than a database for what you are doing. There is some content from Hadoop World that you might be interested in.
Characteristics of MySQL:
Database locking (MUCH easier for financial transactions)
Consistency/security (as above, you can guarantee that, for instance, no changes happen between the time you read a bank account balance and you update it).
Data organization/refactoring (you can have disorganized data anywhere, but MySQL is better with tables that represent "types" or "components" and then combining them into queries -- this is called normalization).
MySQL (and relational databases) are more well suited for arbitrary datasets and requirements common in AGILE software projects.
Characteristics of Cassandra:
Speed: For simple retrieval of large documents. However, it will require multiple queries for highly relational data – and "by default" these queries may not be consistent (and the dataset can change between these queries).
Availability: The opposite of "consistency". Data is always available, regardless of being 100% "correct".[1]
Optional fields (wide columns): This CAN be done in MySQL with meta tables etc., but it's for-free and by-default in Cassandra.
Cassandra is key-value or document-based storage. Think about what that means. TYPICALLY I give Cassandra ONE KEY and I get back ONE DATASET. It can branch out from there, but that's basically what's going on. It's more like accessing a static file. Sure, you can have multiple indexes, counter fields etc. but I'm making a generalization. That's where Cassandra is coming from.
MySQL and SQL is based on group/set theory -- it has a way to combine ANY relationship between data sets. It's pretty easy to take a MySQL query, make the query a "key" and the response a "value" and store it into Cassandra (e.g. make Cassandra a cache). That might help explain the trade-off too, MySQL allows you to always rearrange your data tables and the relationships between datasets simply by writing a different query. Cassandra not so much. And know that while Cassandra might PROVIDE features to do some of this stuff, it's not what it was built for.
MongoDB and CouchDB fit somewhere in the middle of those two extremes. I think MySQL can be a bit verbose[2] and annoying to deal with especially when dealing with optional fields, and migrations if you don't have a good model or tools. Also with scalability, I'm sure there are great technologies for scaling a MySQL database, but Cassandra will always scale, and easily, due to limitations on its feature set. MySQL is a bit more unbounded. However, NoSQL and Cassandra do not do joins, one of the critical features of SQL that allows one to combine multiple tables in a single query. So, complex relational queries will not scale in Cassandra.
[1] Consistency vs. availability is a trade-off within large distributed dataset. It takes a while to make all nodes aware of new data, and eg. Cassandra opts to answer quickly and not to check with every single node before replying. This can causes weird edge cases when you base you writes off previously read data and overwriting data. For more information look into the CAP Theorem, ACID database (in particular Atomicity) as well as Idempotent database operations. MySQL has this issue too, but the idea of high availability over correctness is very baked into Cassandra and gives it many of its scalability and speed advantages.
[2] SQL being "verbose" isn't a great reason to not use it – plus most of us aren't going to (and shouldn't) write plain-text SQL statements.
Nosql solutions are better than Mysql, postgresql and other rdbms techs for this task. Don't waste your time with Hbase/Hadoop, you've to be an astronaut to use it. I recommend MongoDB and Cassandra. Mongo is better for small datasets (if your data is maximum 10 times bigger than your ram, otherwise you have to shard, need more machines and use replica sets). For big data; cassandra is the best. Mongodb has more query options and other functionalities than cassandra but you need 64 bit machines for mongo. There are some works around for analytics in both sides. There is atomic counters in both sides. Both can scale well but cassandra is much better in scaling and high availability. Both have php clients, both have good support and community (mongo community is bigger).
Cassandra analytics project sample:Rainbird http://www.slideshare.net/kevinweil/rainbird-realtime-analytics-at-twitter-strata-2011
mongo sample: http://www.slideshare.net/jrosoff/scalable-event-analytics-with-mongodb-ruby-on-rails
http://axonflux.com/how-superfeedr-built-analytics-using-mongodb
doubleclick developers developed mongo http://www.informationweek.com/news/software/info_management/224200878
Cassandra vs. MongoDB
Are you considering Cassandra or MongoDB as the data store for your next project? Would you like to compare the two databases? Cassandra and MongoDB are both “NoSQL” databases, but the reality is that they are very different. They have very different strengths and value propositions – so any comparison has to be a nuanced one. Let’s start with initial requirements… Neither of these databases replaces RDBMS, nor are they “ACID” databases. So If you have a transactional workload where normalization and consistency are the primary requirements, neither of these databases will work for you. You are better off sticking with traditional relational databases like MySQL, PostGres, Oracle etc. Now that we have relational databases out of the way, let’s consider the major differences between Cassandra and MongoDB that will help you make the decision. In this post, I am not going to discuss specific features but will point out some high-level strategic differences to help you make your choice.
Expressive Object Model
MongoDB supports a rich and expressive object model. Objects can have properties and objects can be nested in one another (for multiple levels). This model is very “object-oriented” and can easily represent any object structure in your domain. You can also index the property of any object at any level of the hierarchy – this is strikingly powerful! Cassandra, on the other hand, offers a fairly traditional table structure with rows and columns. Data is more structured and each column has a specific type which can be specified during creation.
Verdict: If your problem domain needs a rich data model then MongoDB is a better fit for you.
Secondary Indexes
Secondary indexes are a first-class construct in MongoDB. This makes it easy to index any property of an object stored in MongoDB even if it is nested. This makes it really easy to query based on these secondary indexes. Cassandra has only cursory support for secondary indexes. Secondary indexes are also limited to single columns and equality comparisons. If you are mostly going to be querying by the primary key then Cassandra will work well for you.
Verdict: If your application needs secondary indexes and needs flexibility in the query model then MongoDB is a better fit for you.
High Availability
MongoDB supports a “single master” model. This means you have a master node and a number of slave nodes. In case the master goes down, one of the slaves is elected as master. This process happens automatically but it takes time, usually 10-40 seconds. During this time of new leader election, your replica set is down and cannot take writes. This works for most applications but ultimately depends on your needs. Cassandra supports a “multiple master” model. The loss of a single node does not affect the ability of the cluster to take writes – so you can achieve 100% uptime for writes.
Verdict: If you need 100% uptime Cassandra is a better fit for you.
Write Scalability
MongoDB with its “single master” model can take writes only on the primary. The secondary servers can only be used for reads. So essentially if you have three node replica set, only the master is taking writes and the other two nodes are only used for reads. This greatly limits write scalability. You can deploy multiple shards but essentially only 1/3 of your data nodes can take writes. Cassandra with its “multiple master” model can take writes on any server. Essentially your write scalability is limited by the number of servers you have in the cluster. The more servers you have in the cluster, the better it will scale.
Verdict: If write scalability is your thing, Cassandra is a better fit for you.
Query Language Support
Cassandra supports the CQL query language which is very similar to SQL. If you already have a team of data analysts they will be able to port over a majority of their SQL skills which is very important to large organizations. However CQL is not full blown ANSI SQL – It has several limitations (No join support, no OR clauses) etc. MongoDB at this point has no support for a query language. The queries are structured as JSON fragments.
Verdict: If you need query language support, Cassandra is the better fit for you.
Performance Benchmarks
Let’s talk performance. At this point, you are probably expecting a performance benchmark comparison of the databases. I have deliberately not included performance benchmarks in the comparison. In any comparison, we have to make sure we are making an apples-to-apples comparison.
Database model - The database model/schema of the application being tested makes a big difference. Some schemas are well suited for MongoDB and some are well suited for Cassandra. So when comparing databases it is important to use a model that works reasonably well for both databases.
Load characteristics – The characteristics of the benchmark load are very important. E.g. In write-heavy benchmarks, I would expect Cassandra to smoke MongoDB. However, in read-heavy benchmarks, MongoDB and Cassandra should be similar in performance.
Consistency requirements - This is a tricky one. You need to make sure that the read/write consistency requirements specified are identical in both databases and not biased towards one participant. Very often in a number of the ‘Marketing’ benchmarks, the knobs are tuned to disadvantage the other side. So, pay close attention to the consistency settings.
One last thing to keep in mind is that the benchmark load may or may not reflect the performance of your application. So in order for benchmarks to be useful, it is very important to find a benchmark load that reflects the performance characteristics of your application. Here are some benchmarks you might want to look at:
- NoSQL Performance Benchmarks
- Cassandra vs. MongoDB vs. Couchbase vs. HBase
Ease of Use
If you had asked this question a couple of years ago MongoDB would be the hands-down winner. It’s a fairly simple task to get MongoDB up and running. In the last couple of years, however, Cassandra has made great strides in this aspect of the product. With the adoption of CQL as the primary interface for Cassandra, it has taken this a step further – they have made it very simple for legions of SQL programmers to use Cassandra very easily.
Verdict: Both are fairly easy to use and ramp up.
Native Aggregation
MongoDB has a built-in Aggregation framework to run an ETL pipeline to transform the data stored in the database. This is great for small to medium jobs but as your data processing needs become more complicated the aggregation framework becomes difficult to debug. Cassandra does not have a built-in aggregation framework. External tools like Hadoop, Spark are used for this.
Schema-less Models
In MongoDB, you can choose to not enforce any schema on your documents. While this was the default in prior versions in the newer version you have the option to enforce a schema for your documents. Each document in MongoDB can be a different structure and it is up to your application to interpret the data. While this is not relevant to most applications, in some cases the extra flexibility is important. Cassandra in the newer versions (with CQL as the default language) provides static typing. You need to define the type of very column upfront.
I'd also like to add Membase (www.couchbase.com) to this list.
As a product, Membase has been deployed at a number of Ad Agencies (AOL Advertising, Chango, Delta Projects, etc). There are a number of public case studies and examples of how these companies have used Membase successfully.
While it's certainly up for debate, we've found that Membase provides better performance and scalability than any other solution. What we lack in indexing/querying, we are planning on more than making up for with the integration of CouchDB as our new persistence backend.
As a company, Couchbase (the makers of Membase) has a large amount of knowledge and experience specifically serving the needs of Ad/targeting companies.
Would certainly love to engage with you on this particular use case to see if Membase is the right fit.
Please shoot me an email (perry -at- couchbase -dot- com) or visit us on the forums: http://www.couchbase.org/forums/
Perry Krug
I would look at New Relic as an example of a similar workload. They capture over 200 Billion data points a day to disk and are using MySQL 5.6 (Percona) as a backend.
A blog post is available here:
http://blog.newrelic.com/2014/06/13/store-200-billion-data-points-day-disk/
Related
My client has an existing PostgreSQL database with around 100 tables and most every table has one or more relationships to other tables. He's got around a thousand customers who use an app that hits that database.
Recently he hired a new frontend web developer, and that person is trying to tell him that we should throw out the PostgreSQL database and replace it with a MongoDB solution. That seems odd to me, but I don't have experience with MongoDB.
Is there any clear reasons why he should, or should not, make the change? Obviously I'm arguing against it and the other guy for it, but I would like to remove the "I like this one better" from the argument and really hear from the community on their experience with such things.
1) Performance
During last years, there were several benchmarks comparing Postgres and Mongo.
Here you can find the most recent performance benchmark (Yahoo): https://www.slideshare.net/profyclub_ru/postgres-vs-mongo-postgres-professional (start with slide #58, where some overview of the past becnhmarks is given).
Notice, that traditionally, MongoDB provided benchmarks, where they didn't turn on write ahead logging or even turned fsync off, so their benchmarks were unfair -- in such states the database system doesn't wait for filesystem, so TPS are high but probability to lose data is also very high.
2) Flexibility – JSON
Postgres has non-structured and semistructured data types since 2003 (hstore, XML, array data types). And now has very strong JSON support with indexing (jsonb data type), you can create partial indexes, functional indexes, index only part of JSON documents, index whole documents in different manners (you can tweek index to reduce it's size and speed).
More interestingly, with Postgres, you can combine relational approach and non-relational JSON data – see this talk again https://www.slideshare.net/profyclub_ru/postgres-vs-mongo-postgres-professional for details. This gives you a lot of flexibility and power (I wouldn't keep money-related or basic accounts-related data in JSON format).
3) Standards and costs of support
SQL experiences new born now -- NoSQL products started to add SQL dialects, there is a lot of people making big data analysis with SQL, you can even run machine learning algorithms inside RDBMS (see MADlib project http://madlib.incubator.apache.org).
When you need to work with data, SQL was, is and will be for long time the best language – there are such many things included to it, so all other languages are lagging too much. I recommend http://modern-sql.com/ to learn modern SQL features and https://use-the-index-luke.com (from the same author) to learn how reach the best performance using SQL.
When Mongo needed to create "BI connector", they also needed to speak SQL, so guess what they chose? https://www.linkedin.com/pulse/mongodb-32-now-powered-postgresql-john-de-goes
SQL will go nowhere, it's extended with SQL/JSON now and this means that for future, Postgres is an excellent choice.
4) Scalability
If you data size is up to several terabytes -- it's easy to live on "single master - multiple replicas" architectuyre either on your own installation or in clouds (Amazon RDS, Heroku, Google Cloud Platform, and since recently, Azure – all them support Postgres). There is an increasing number of solutions which help you to work with microservice architecture, have automatic failover, and/or shard your data. Here is only few of them, which are actively developed and supported, without specific order:
https://wiki.postgresql.org/wiki/PL/Proxy
https://github.com/zalando/spilo and https://github.com/zalando/patroni
https://github.com/dalibo/PAF
https://github.com/postgrespro/postgres_cluster
https://www.2ndquadrant.com/en/resources/bdr/
https://www.postgresql.org/docs/10/static/postgres-fdw.html
5) Extensibility
There are much more additional projects built to work with Postgres than with Mongo. You can work with literally any data type (including but not limited to time ranges, geospatial data, JSON, XML, arrays), have index support for it, ACID and manipulate with it using standard SQL. You can develop your own functions, data types, operators, index structures and much more!
If your data is relational (and it appears that it is), it makes no sense whatsoever to use a non-relational db (like mongodb). You can't underestimate the power and expressiveness of standard SQL queries.
On top of that, postgres has full ACID. And it can handle free-form JSON reasonably well, if that is that guy's primary motivation.
I am stuck between these two NoSQL databases.
In my project, I will be creating a database within a database. For example, I need a solution to create dynamic tables.
So users can create tables with columns and rows. I think either MongoDB or CouchDB will be good for this, but I am not sure which one. I will also need efficient paging as well.
Of C, A & P (Consistency, Availability & Partition tolerance) which 2 are more important to you? Quick reference, the Visual Guide To NoSQL Systems
MongodB : Consistency and Partition Tolerance
CouchDB : Availability and Partition Tolerance
A blog post, Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase vs Membase vs Neo4j comparison has 'Best used' scenarios for each NoSQL database compared. Quoting the link,
MongoDB: If you need dynamic queries. If you prefer to define indexes, not map/reduce functions. If you need good performance on a big DB. If you wanted CouchDB, but your data changes too much, filling up disks.
CouchDB : For accumulating, occasionally changing data, on which pre-defined queries are to be run. Places where versioning is important.
A recent (Feb 2012) and more comprehensive comparison by Riyad Kalla,
MongoDB : Master-Slave Replication ONLY
CouchDB : Master-Master Replication
A blog post (Oct 2011) by someone who tried both, A MongoDB Guy Learns CouchDB commented on the CouchDB's paging being not as useful.
A dated (Jun 2009) benchmark by Kristina Chodorow (part of team behind MongoDB),
I'd go for MongoDB.
The answers above all overcomplicate the story.
If you plan to have a mobile component, or need desktop users to work offline and then sync their work to a server you need CouchDB.
If your code will run only on the server then go with MongoDB
That's it. Unless you need CouchDB's (awesome) ability to replicate to mobile and desktop devices, MongoDB has the performance, community and tooling advantage at present.
Very old question but it's on top of Google and I don't quite like the answers I see so here's my own.
There's much more to Couchdb than the ability to develop CouchApps. Most people use CouchDb in a classical 3-tiers web architecture.
In practice the deciding factor for most people will be the fact that MongoDb allows ad-hoc querying with a SQL like syntax while CouchDb doesn't (you've got to create map/reduce views which turns some people off even though creating these views is Rapid Application Development friendly - they have nothing to do with stored procedures).
To address points raised in the accepted answer : CouchDb has a great versionning system, but it doesn't mean that it is only suited (or more suited) for places where versionning is important. Also, couchdb is heavy-write friendly thanks to its append-only nature (writes operations return in no time while guaranteeing that no data will ever be lost).
One very important thing that is not mentioned by anyone is the fact that CouchDb relies on b-tree indexes. This means that whether you have 1 "row" or 20 billions, the querying time will always remain below 10ms. This is a game changer which makes CouchDb a low-latency and read-friendly database, and this really shouldn't be overlooked.
To be fair and exhaustive the advantage MongoDb has over CouchDb is tooling and marketing. They have first-class citizen tools for all major languages and platforms making the on-boarding easy and this added to their adhoc querying makes the transition from SQL even easier.
CouchDb doesn't have this level of tooling - even though there are many libraries available today - but CouchDb is exposed as an HTTP API and it is therefore quite easy to create a wrapper in your favorite language to talk with it. I personally like this approach as it avoids bloat and allows you to only take what you want (interface segregation principle).
So I'd say using one or the other is largely a matter of comfort and preference with their paradigms. CouchDb approach "just fits", for certain people, but if after learning about the database features (in the exhaustive official guide) you don't have your "hell yeah" moment, you should probably move on.
I'd discourage using CouchDb if you just want to use "the right tool for the right job". because you'll find out that you can't just use it that way and you'll end up being pissed and writing blog posts such as "Where are joins in CouchDb ?" and "Where is transaction management ?". Indeed Couchdb is - paradoxically - very transparent but at the same time requires a paradigm shift and a change in the way you approach problems to really shine (and really work).
But once you've done that it really pays off. I'd personally need very strong reasons or a major deal breaker on a project to choose another database, but so far I haven't met any.
Update December 2022:
Since this post is still getting a lot of views, I felt important to inform people that I have recently moved to using MongoDB as my daily driver, while keeping CouchDB in my toolbelt for specialized cases where this database makes more sense (namely cases where views are not needed). There were multiple reasons for this choice, the most important ones were:
Performance: While precomputed indexes are a powerful asset, the main limitation of CouchDB is its QueryServer architecture. Every time a document is updated, it has to be serialized and processed by every view (even though this happens in a deferred manner, namely when the view is accessed). But more importantly, every time a view is updated (for example to add filtering logic for a new field added as part of the implementation of a new feature), ALL documents of the database must be sent to the view. This becomes a big deal when you have millions of documents in the database. You start worrying about the impact of updating your views and it becomes a distraction. Should you decide to create one database per data type to bypass this limitation, you'd then lose the ability to map/reduce across all your documents since views are scoped per database. MongoDB avoids this by segmenting documents into collections (ie. data types) so that when an index is updated only a subset of the data of the database is impacted. Moreover, MongoDB uses a binary format making these operations way more performant (while CouchDB uses JSON sent to the view server in plain text). This point may not be important if you do not design products needing to operate at large scale (hundreds of thousands of daily users or more).
the tooling available with MongoDB is comprehensive and mature, whether we are talking about the drivers officially supported for various programming languages, or integration with IDEs.
Advanced querying: A wide range of data types and advanced query capabilities are available out of the box (geo types, GridFS allowing one to store files of arbitrary size directly in the DB etc...). Having easy access to powerful query aggregation capabilities made me realize how much CouchDB had been inhibiting my productivity.
Seamless support for resharding: resharding is easy with MongoDB, while it is a dangerous operation involving moving files by hands with CouchDB.
Many other small items that improve quality of life and really add up.
I have been a big CouchDB fan but I have to admit that moving to MongoDB as a daily driver felt a lot like moving back to civilization in terms of productivity and quality of life improvement. Now I only consider CouchDB for key-value store scenarios (in which no map-reduce views are required and all that is needed is getting a document by key - CouchDB shines quite a lot for this), and advanced situations in which having per-user like databases is needed (for example to support advanced synchronization between devices).
The only drawback I see with MongoDB is that it consumes a lot of memory to the point that I cannot install it on development machines having low specs (while by comparison couchdb is launched at startup without me noticing and consumes almost no resource). However I feel this is worth it considering the time saved and the features provided.
As a long-time CouchDB user, the value I see in MongoDB is quite different from the items highlighted in the other answers promoting MongoDB so I felt it was important for me to provide this update (and also out of intellectual honestly when I remembered this post). CouchDB gave me quite a boost in productivity back in the days compared to the SQL products and ORMs I had been using, and at that time there were a lot of horror stories circulating regarding the reliability of MongoDB.
However, as of now, the few concerns I could have (and that were probably given disproportionate importance by internet folks - they essentially all boiled down to defaults whose reliability tradeoffs may surprise new users in a number of scenarios) no longer stand.
At this point, as a long-time CouchDB user in a great position to compare both products, I would recommend MongoDB to people needing a productive and scalable software development experience for their web app and advise to only pick CouchDB for specific needs.
CouchDB had momentum back in the days which probably influenced my perception, but development has stalled, no meaningful features have been introduced for a long-time, otherwise it would probably have caught up with MongoDB in terms of quality of life. I see two possible reasons for this: the way a now aborted rewrite of CouchDB has diverted resources for a long-time, and maybe early architectural decisions (such as the Query Server architecture) that may very well have restricted its future from the start. None of these aspects seem to be the priority of the core team.
I do not totally regret choosing CouchDB because it has been massively helpful and the mindset it has taught me is extremely helpful to allow me to write performant code in MongoDB (writing performant code in MongoDB is a breeze compared to the discipline one has to observe to solve business problems using CouchDB). However if I had to do it again today, I would have transitioned to MongoDB as my daily driver MUCH sooner. I'm usually quite good at picking the winning horse when technologies popup, but this time it seems I haven't played the game that well. Hope this helps.
Ask this questions yourself? And you will decide your DB selection.
Do you need master-master? Then CouchDB. Mainly CouchDB supports master-master replication which anticipates nodes being disconnected for long periods of time. MongoDB would not do well in that environment.
Do you need MAXIMUM R/W throughput? Then MongoDB
Do you need ultimate single-server durability because you are only going to have a single DB server? Then CouchDB.
Are you storing a MASSIVE data set that needs sharding while maintaining insane throughput? Then MongoDB.
Do you need strong consistency of data? Then MongoDB.
Do you need high availability of database? Then CouchDB.
Are you hoping multi databases and multi tables/ collections? Then MongoDB
You have a mobile app offline users and want to sync their activity data to a server? Then you need CouchDB.
Do you need large variety of querying engine? Then MongoDB
Do you need large community to be using DB? Then MongoDB
I summarize the answers found in that article:
http://www.quora.com/How-does-MongoDB-compare-to-CouchDB-What-are-the-advantages-and-disadvantages-of-each
MongoDB: Better querying, data storage in BSON (faster access), better data consistency, multiple collections
CouchDB: Better replication, with master to master replication and conflict resolution, data storage in JSON (human-readable, better access through REST services), querying through map-reduce.
So in conclusion, MongoDB is faster, CouchDB is safer.
Also: http://nosql.mypopescu.com/post/298557551/couchdb-vs-mongodb
Be aware of an issue with sparse unique indexes in MongoDB. I've hit it and it is extremely cumbersome to workaround.
The problem is this - you have a field, which is unique if present and you wish to find all the objects where the field is absent. The way sparse unique indexes are implemented in Mongo is that objects where that field is missing are not in the index at all - they cannot be retrieved by a query on that field - {$exists: false} just does not work.
The only workaround I have come up with is having a special null family of values, where an empty value is translated to a special prefix (like null:) concatenated to a uuid. This is a real headache, because one has to take care of transforming to/from the empty values when writing/quering/reading. A major nuisance.
I have never used server side javascript execution in MongoDB (it is not advised anyway) and their map/reduce has awful performance when there is just one Mongo node. Because of all these reasons I am now considering to check out CouchDB, maybe it fits more to my particular scenario.
BTW, if anyone knows the link to the respective Mongo issue describing the sparse unique index problem - please share.
I'm sure you can with Mongo (more familiar with it), and pretty sure you can with couch too.
Both are documented oriented (JSON-based) so there would be no "columns" but rather fields in documents -- but they can be fully dynamic.
They both do it you may want to look at other factors on which to use: other features you care about, popularity, etc. Google insights and indeed.com job posts would be ways to look at popularity.
You could just try it I think you should be able to have mongo running in 5 minutes.
I am working on a project were we are batch loading and storing huge volume of data in Oracle database which is constantly getting queried via Hibernate against this 100+ million records table (the reads are much more frequent than writes).
To speed things up we are using Lucene for some of queries (especially geo bounding box queries) and Hibernate second level cache but thats still not enough. We still have bottleneck in Hibernate queries against Oracle (we dont cache 100+ million table entities in Hibernate second level cache due to lack of that much memory).
What additional NoSQL solutions (apart from Lucene) I can leverage in this situation?
Some options I am thinking of are:
Use distributed ehcache (Terracotta) for Hibernate second level to leverage more memory across machines and reduce duplicate caches (right now each VM has its own cache).
To completely use in memory SQL database like H2 but unfortunately those solutions require loading 100+ mln tables into single VM.
Use Lucene for querying and BigTable (or distributed hashmap) for entity lookup by id.
What BigTable implementation will be suitable for this? I was considering HBase.
Use MongoDB for storing data and for querying and lookup by id.
recommending Cassandra with ElasticSearch for a scalable system (100 million is nothing for them). Use cassandra for all your data and ES for ad hoc and geo queries. Then you can kill your entire legacy stack. You may need a MQ system like rabbitmq for data sync between Cass. and ES.
It really depends on your data sets. The number one rule to NoSQL design is to define your query scenarios first. Once you really understand how you want to query the data then you can look into the various NoSQL solutions out there. The default unit of distribution is key. Therefore you need to remember that you need to be able to split your data between your node machines effectively otherwise you will end up with a horizontally scalable system with all the work still being done on one node (albeit better queries depending on the case).
You also need to think back to CAP theorem, most NoSQL databases are eventually consistent (CP or AP) while traditional Relational DBMS are CA. This will impact the way you handle data and creation of certain things, for example key generation can be come trickery.
Also remember than in some systems such as HBase there is no indexing concept. All your indexes will need to be built by your application logic and any updates and deletes will need to be managed as such. With Mongo you can actually create indexes on fields and query them relatively quickly, there is also the possibility to integrate Solr with Mongo. You don’t just need to query by ID in Mongo like you do in HBase which is a column family (aka Google BigTable style database) where you essentially have nested key-value pairs.
So once again it comes to your data, what you want to store, how you plan to store it, and most importantly how you want to access it. The Lily project looks very promising. THe work I am involved with we take a large amount of data from the web and we store it, analyse it, strip it down, parse it, analyse it, stream it, update it etc etc. We dont just use one system but many which are best suited to the job at hand. For this process we use different systems at different stages as it gives us fast access where we need it, provides the ability to stream and analyse data in real-time and importantly, keep track of everything as we go (as data loss in a prod system is a big deal) . I am using Hadoop, HBase, Hive, MongoDB, Solr, MySQL and even good old text files. Remember that to productionize a system using these technogies is a bit harder than installing Oracle on a server, some releases are not as stable and you really need to do your testing first. At the end of the day it really depends on the level of business resistance and the mission-critical nature of your system.
Another path that no one thus far has mentioned is NewSQL - i.e. Horizontally scalable RDBMSs... There are a few out there like MySQL cluster (i think) and VoltDB which may suit your cause.
Again it comes to understanding your data and the access patterns, NoSQL systems are also Non-Rel i.e. non-relational and are there for better suit to non-relational data sets. If your data is inherently relational and you need some SQL query features that really need to do things like Cartesian products (aka joins) then you may well be better of sticking with Oracle and investing some time in indexing, sharding and performance tuning.
My advice would be to actually play around with a few different systems. Look at;
MongoDB - Document - CP
CouchDB - Document - AP
Redis - In memory key-value (not column family) - CP
Cassandra - Column Family - Available & Partition Tolerant (AP)
HBase - Column Family - Consistent & Partition Tolerant (CP)
Hadoop/Hive
VoltDB - A really good looking product, a relation database that is distributed and might work for your case (may be an easier move). They also seem to provide enterprise support which may be more suited for a prod env (i.e. give business users a sense of security).
Any way thats my 2c. Playing around with the systems is really the only way your going to find out what really works for your case.
As you suggest MongoDB (or any similar NoSQL persistence solution) is an appropriate fit for you. We've run tests with significantly larger data sets than the one you're suggesting on MongoDB and it works fine. Especially if you're read heavy MongoDB's sharding and/or distributing reads across replicate set members will allow you to speed up your queries significantly. If your usecase allows for keeping your indexes right balanced your goal of getting close to 20ms queries should become feasable without further caching.
You should also check out the Lily project (lilyproject.org). They have integrated HBase with Solr. Internally they use message queues to keep Solr in sync with HBase. This allows them to have the speed of solr indexing (sharding and replication), backed by a highly reliable data storage system.
you could group requests & split them specific to a set of data & have a single (or a group of servers) process that, here you can have the data available in the cache to improve performance.
e.g.,
say, employee & availability data are handled using 10 tables, these can be handled b a small group of server (s) when you configure hibernate cache to load & handle requests.
for this to work you need a load balancer (which balances load by business scenario).
not sure how much of it can be implemented here.
At the 100M records your bottleneck is likely Hibernate, not Oracle. Our customers routinely have billions of records in the individual fact tables of our Oracle-based data warehouse and it handles them fine.
What kind of queries do you execute on your table?
What are the advantages of using NoSQL databases? I've read a lot about them lately, but I'm still unsure why I would want to implement one, and under what circumstances I would want to use one.
Relational databases enforces ACID. So, you will have schema based transaction oriented data stores. It's proven and suitable for 99% of the real world applications. You can practically do anything with relational databases.
But, there are limitations on speed and scaling when it comes to massive high availability data stores. For example, Google and Amazon have terabytes of data stored in big data centers. Querying and inserting is not performant in these scenarios because of the blocking/schema/transaction nature of the RDBMs. That's the reason they have implemented their own databases (actually, key-value stores) for massive performance gain and scalability.
NoSQL databases have been around for a long time - just the term is new. Some examples are graph, object, column, XML and document databases.
For your 2nd question: Is it okay to use both on the same site?
Why not? Both serves different purposes right?
NoSQL solutions are usually meant to solve a problem that relational databases are either not well suited for, too expensive to use (like Oracle) or require you to implement something that breaks the relational nature of your db anyway.
Advantages are usually specific to your usage, but unless you have some sort of problem modeling your data in a RDBMS I see no reason why you would choose NoSQL.
I myself use MongoDB and Riak for specific problems where a RDBMS is not a viable solution, for all other things I use MySQL (or SQLite for testing).
If you need a NoSQL db you usually know about it, possible reasons are:
client wants 99.999% availability on
a high traffic site.
your data makes
no sense in SQL, you find yourself
doing multiple JOIN queries for
accessing some piece of information.
you are breaking the relational
model, you have CLOBs that store
denormalized data and you generate
external indexes to search that data.
If you don't need a NoSQL solution keep in mind that these solutions weren't meant as replacements for an RDBMS but rather as alternatives where the former fails and more importantly that they are relatively new as such they still have a lot of bugs and missing features.
Oh, and regarding the second question it is perfectly fine to use any technology in conjunction with another, so just to be complete from my experience MongoDB and MySQL work fine together as long as they aren't on the same machine
Martin Fowler has an excellent video which gives a good explanation of NoSQL databases. The link goes straight to his reasons to use them, but the whole video contains good information.
You have large amounts of data - especially if you cannot fit it all on one physical server as NoSQL was designed to scale well.
Object-relational impedance mismatch - Your domain objects do not fit well in a relaitional database schema. NoSQL allows you to persist your data as documents (or graphs) which may map much more closely to your data model.
NoSQL is a database system where data is organized into the document (MongoDB), key-value pair (MemCache, Redis), and graph structure form(Neo4J).
Maybe there are possible questions and answer for "When to go for NoSQL":
Require flexible schema or deal with tree-like data?
Generally, in agile development we start designing systems without knowing all requirements upfront, whereas later on throughout the development database system may need to accommodate frequent design changes, showcasing MVP (Minimal Viable product).
Or you are dealing with a data schema that is dynamic in nature.
e.g. System logs, very precise example is AWS cloudtrail logs.
Data set is vast/big?
Yes NoSQL databases are the better candidate for applications where the database needs to manage millions or even billions of records without compromising performance and availability while may be trading for inconsistency(though modern databases are exception here where it allows tunable consistency over availability e.g. Casandra, Cloud provider databases CosmosDB, DynamoDB).
Trade-off between scaling over consistency
Unlike RDMS, NoSQL databases may make the dataset consistent across other nodes eventually which is the default behavior, but it's easy to scale in terms of performance and availability.
Example: This may be good for storing people who are online in the instant messaging app, API tokens in DB, and logging website traffic stats.
Performing Geolocation Operations:
MongoDB hash rich support for doing GeoQuerying & Geolocation operations. I really loved this feature of MongoDB. So does the PostresSQL but ease of implementation is something that depends on the use case
In nutshell, MongoDB is a great fit for applications where you can store dynamic structured data on a large scale.
Edits:
Updated the answer about the consistency of the database.
Some essential information is missing to answer the question: Which use cases must the database be able to cover? Do complex analyses have to be performed from existing data (OLAP) or does the application have to be able to process many transactions (OLTP)? What is the data structure? That is far from the end of question time.
In my view, it is wrong to make technology decisions on the basis of bold buzzwords without knowing exactly what is behind them. NoSQL is often praised for its scalability. But you also have to know that horizontal scaling (over several nodes) also has its price and is not free. Then you have to deal with issues like eventual consistency and define how to resolve data conflicts if they cannot be resolved at the database level. However, this applies to all distributed database systems.
The joy of the developers with the word "schema less" at NoSQL is at the beginning also very big. This buzzword is quickly disenchanted after technical analysis, because it correctly does not require a schema when writing, but comes into play when reading. That is why it should correctly be "schema on read". It may be tempting to be able to write data at one's own discretion. But how do I deal with the situation if there is existing data but the new version of the application expects a different schema?
The document model (as in MongoDB, for example) is not suitable for data models where there are many relationships between the data. Joins have to be done on application level, which is additional effort and why should I program things that the database should do.
If you make the argument that Google and Amazon have developed their own databases because conventional RDBMS can no longer handle the flood of data, you can only say: You are not Google and Amazon. These companies are the spearhead, some 0.01% of scenarios where traditional databases are no longer suitable, but for the rest of the world they are.
What's not insignificant: SQL has been around for over 40 years and millions of hours of development have gone into large systems such as Oracle or Microsoft SQL. This has to be achieved by some new databases. Sometimes it is also easier to find an SQL admin than someone for MongoDB. Which brings us to the question of maintenance and management. A subject that is not exactly sexy, but that is a part of the technology decision.
Handling A Large Number Of Read Write Operations
Look towards NoSQL databases when you need to scale fast. And when do you generally need to scale fast?
When there are a large number of read-write operations on your website & when dealing with a large amount of data, NoSQL databases fit best in these scenarios. Since they have the ability to add nodes on the fly, they can handle more concurrent traffic & big amount of data with minimal latency.
Flexibility With Data Modeling
The second cue is during the initial phases of development when you are not sure about the data model, the database design, things are expected to change at a rapid pace. NoSQL databases offer us more flexibility.
Eventual Consistency Over Strong Consistency
It’s preferable to pick NoSQL databases when it’s OK for us to give up on Strong consistency and when we do not require transactions.
A good example of this is a social networking website like Twitter. When a tweet of a celebrity blows up and everyone is liking and re-tweeting it from around the world. Does it matter if the count of likes goes up or down a bit for a short while?
The celebrity would definitely not care if instead of the actual 5 million 500 likes, the system shows the like count as 5 million 250 for a short while.
When a large application is deployed on hundreds of servers spread across the globe, the geographically distributed nodes take some time to reach a global consensus.
Until they reach a consensus, the value of the entity is inconsistent. The value of the entity eventually gets consistent after a short while. This is what Eventual Consistency is.
Though the inconsistency does not mean that there is any sort of data loss. It just means that the data takes a short while to travel across the globe via the internet cables under the ocean to reach a global consensus and become consistent.
We experience this behaviour all the time. Especially on YouTube. Often you would see a video with 10 views and 15 likes. How is this even possible?
It’s not. The actual views are already more than the likes. It’s just the count of views is inconsistent and takes a short while to get updated.
Running Data Analytics
NoSQL databases also fit best for data analytics use cases, where we have to deal with an influx of massive amounts of data.
I came across this question while looking for convincing grounds to deviate from RDBMS design.
There is a great post by Julian Brown which sheds lights on constraints of distributed systems. The concept is called Brewer's CAP Theorem which in summary goes:
The three requirements of distributed systems are : Consistency, Availability and Partition tolerance (CAP in short). But you can only have two of them at a time.
And this is how I summarised it for myself:
You better go for NoSQL if Consistency is what you are sacrificing.
I designed and implemented solutions with NoSQL databases and here is my checkpoint list to make the decision to go with SQL or document-oriented NoSQL.
DON'Ts
SQL is not obsolete and remains a better tool in some cases. It's hard to justify use of a document-oriented NoSQL when
Need OLAP/OLTP
It's a small project / simple DB structure
Need ad hoc queries
Can't avoid immediate consistency
Unclear requirements
Lack of experienced developers
DOs
If you don't have those conditions or can mitigate them, then here are 2 reasons where you may benefit from NoSQL:
Need to run at scale
Convenience of development (better integration with your tech stack, no need in ORM, etc.)
More info
In my blog posts I explain the reasons in more details:
7 reasons NOT to NoSQL
2 reasons to NoSQL
Note: the above is applicable to document-oriented NoSQL only. There are other types of NoSQL, which require other considerations.
Ran into this thread and wanted to add my experience.. Many SQL databases support json data in columns and support querying of this json. So what I have used is a hybrid using a relational database with columns containing json..
What scenario makes more sense - host several EC2 instances with MongoDB installed, or much rather use the Amazon SimpleDB webservice?
When having several EC2 instances with MongoDB I have the problem of setting the instance up by myself.
When using SimpleDB I have the problem of locking me into Amazons data structure right?
What differences are there development-wise? Shouldn't I be able to just switch the DAO of my service layers, to either write to MongoDB or AWS SimpleDB?
SimpleDB has some scalability limitations. You can only scale by sharding and it has higher latency than mongodb or cassandra, it has a throughput limit and it is priced higher than other options. Scalability is manual (you have to shard).
If you need wider query options and you have a high read rate and you don't have so much data mongodb is better. But for durability, you need to use at least 2 mongodb server instances as master/slave. Otherwise you can lose the last minute of your data. Scalability is manual. It's much faster than simpledb. Autosharding is implemented in 1.6 version.
Cassandra has weak query options but is as durable as postgresql. It is as fast as mongo and faster on higher data size. Write operations are faster than read operations on cassandra. It can scale automatically by firing ec2 instances, but you have to modify config files a bit (if I remember correctly). If you have terabytes of data cassandra is your best bet. No need to shard your data, it was designed distributed from the 1st day. You can have any number of copies for all your data and if some servers are dead it will automatically return the results from live ones and distribute the dead server's data to others. It's highly fault tolerant. You can include any number of instances, it's much easier to scale than other options. It has strong .net and java client options. They have connection pooling, load balancing, marking of dead servers,...
Another option is hadoop for big data but it's not as realtime as others, you can use hadoop for datawarehousing. Neither cassandra or mongo have transactions, so if you need transactions postgresql is a better fit. Another option is Amazon RDS, but it's performance is bad and price is high. If you want to use databases or simpledb you may also need data caching (eg: memcached).
For web apps, if your data is small I recommend mongo, if it is large cassandra is better. You don't need a caching layer with mongo or cassandra, they are already fast. I don't recommend simpledb, it also locks you to Amazon as you said.
If you are using c#, java or scala you can write an interface and implement it for mongo, mysql, cassandra or anything else for data access layer. It's simpler in dynamic languages (eg rub,python,php). You can write a provider for two of them if you want and can change the storage maybe in runtime by a only a configuration change, they're all possible. Development with mongo,cassandra and simpledb is easier than a database, and they are free of schema, it also depends on the client library/connector you're using. The simplest one is mongo. There's only one index per table in cassandra, so you've to manage other indexes yourself, but with the 0.7 release of cassandra secondary indexes will bu possible as I know. You can also start with any of them and replace it in the future if you have to.
I think you have both a question of time and speed.
MongoDB / Cassandra are going to be much faster, but you will have to invest $$$ to get them going. This means you'll need to run / setup server instances for all them and figure out how they work.
On the other hand, you don't have to per a "per transaction" cost directly, you just pay for the hardware which is probably more efficient for larger services.
In the Cassandra / MongoDB fight here's what you'll find (based on testing I'm personally involved with over the last few days).
Cassandra:
Scaling / Redundancy is very core
Configuration can be very intense
To do reporting you need map-reduce, for that you need to run a hadoop layer. This was a pain to get configured and a bigger pain to get performant.
MongoDB:
Configuration is relatively easy (even for the new sharding, this week)
Redundancy is still "getting there"
Map-reduce is built-in and it's easy to get data out.
Honestly, given the configuration time required for our 10s of GBs of data, we went with MongoDB on our end. I can imagine using SimpleDB for "must get these running" cases. But configuring a node to run MongoDB is so ridiculously simple that it may be worth skipping the "SimpleDB" route.
In terms of DAO, there are tons of libraries already for Mongo. The Thrift framework for Cassandra is well supported. You can probably write some simple logic to abstract away connections. But it will be harder to abstract away things more complex than simple CRUD.
Now 5 years later it is not hard to set up Mongo on any OS. Documentation is easy to follow, so I do not see setting up Mongo as a problem. Other answers addressed the questions of scalability, so I will try to address the question from the point of view of a developer (what limitations each system has):
I will use S for SimpleDB and M for Mongo.
M is written in C++, S is written in Erlang (not the fastest language)
M is open source, installed everywhere, S is proprietary, can run only on amazon AWS. You should also pay for a whole bunch of staff for S
S has whole bunch of strange limitations. M limitations are way more reasonable. The most strange limitations are:
maximum size of domain (table) is 10 GB
attribute value length (size of field) is 1024 bytes
maximum items in Select response - 2500
maximum response size for Select (the maximum amount of data S can return you) - 1Mb
S supports only a few languages (java, php, python, ruby, .net), M supports way more
both support REST
S has a query syntax very similar to SQL (but way less powerful). With M you need to learn a new syntax which looks like json (also it is straight-forward to learn the basics)
with M you have to learn how you architect your database. Because many people think that schemaless means that you can throw any junk in the database and extract this with ease, they might be surprised that Junk in, Junk out maxim works. I assume that the same is in S, but can not claim it with certainty.
both do not allow case insensitive search. In M you can use regex to somehow (ugly/no index) overcome this limitation without introducing the additional lowercase field/application logic.
in S sorting can be done only on one field
because of 5s timelimit count in S can behave strange. If 5 seconds passed and the query has not finished, you end up with a partial number and a token which allows you to continue query. Application logic is responsible for collecting all this data an summing up.
everything is a UTF-8 string, which makes it a pain in the ass to work with non string values (like numbers, dates) in S. M type support is way richer.
both do not have transactions and joins
M supports compression which is really helpful for nosql stores, where the same field name is stored all-over again.
S support just a single index, M has single, compound, multi-key, geospatial etc.
both support replication and sharding
One of the most important things you should consider is that SimpleDB has a very rudimentary query language. Even basic things like group by, sum average, distinct as well as data manipulation is not supported, so the functionality is not really way richer than Redis/Memcached. On the other hand Mongo support a rich query language.