Storing IoT data in MongoDB

I am currently streaming IoT data to my MongoDB instance, which is running in a Docker container (hosted in AWS). I am getting a couple of thousand data points per day.
I will be using the gathered data for some intensive data analysis and ML, which will run on a day-to-day basis.
So is this normally how big data is stored? What are the industry standards and best practices?

It depends on a lot of factors, for example, the type of data you are analyzing, how much data you have and how quickly you need it.
For applications such as user behavior analysis, a relational DB is best.
Well, if the data fits into a spreadsheet, then it is better suited for a SQL-type database such as Postgres or BigQuery, as relational databases are good at analyzing data in rows and columns.
For semi-structured data, think social media, texts or geographical data which requires a large amount of text mining or image processing, a NoSQL-type database such as MongoDB or CouchDB works best.
On the other hand, relational databases can be queried with SQL. SQL as a language is well-known among data analysts and engineers and is also easier to learn than most programming languages.
Databases that are commonly used in the industry to store big data are:
Relational Database Management Systems: the storage engine uses a B-Tree structure. B-Tree concepts are used to organize the indexes and data, and reads and writes take logarithmic time.
MongoDB: you can use this platform if you need to de-normalize tables. It is apt if you want documents that keep all the related nested structures in a single document for maintaining consistency (see the sketch after this list).
Cassandra: this database platform is perfect for known-upfront queries and fast writes. Ad-hoc query performance is somewhat weaker, but the write-optimized design makes it ideal for time-series data. Cassandra uses the Log-Structured Merge-Tree (LSM-tree) format in its storage engine.
Apache HBase: this data management platform is similar to Cassandra in its storage format, and comes with broadly similar performance characteristics.
OpenTSDB: this platform is perfect for IoT use cases where thousands of data points arrive within seconds and the collected data feeds dashboards.
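To make the MongoDB point concrete, here is a minimal sketch, assuming a local MongoDB and a hypothetical "devices" collection, of a de-normalized document that keeps the related nested structures together:
from pymongo import MongoClient

devices = MongoClient("mongodb://localhost:27017")["demo"]["devices"]
# All related nested structure lives in one document, so a single read
# returns the device and its data with no joins.
devices.insert_one({
    "device_id": "sensor-42",
    "location": {"site": "plant-1", "rack": 7},   # nested, not a joined table
    "readings": [                                  # embedded, not a child table
        {"ts": "2020-01-01T00:00:00Z", "temp_c": 21.5},
        {"ts": "2020-01-01T00:01:00Z", "temp_c": 21.7},
    ],
})
doc = devices.find_one({"device_id": "sensor-42"})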
Hope it helps.

Related

Hadoop with MongoDB and Hadoop vs MongoDB

I'm trying to understand the key differences between MongoDB and Hadoop.
I understand that MongoDB is a database, while Hadoop is an ecosystem that contains HDFS. There are some similarities in the way data is processed using either technology, but major differences as well.
I'm confused as to why someone would use MongoDB over a Hadoop cluster; mainly, what advantages does MongoDB offer over Hadoop? Both perform parallel processing, and both can be used with Spark for further data analytics, so what is the value-add of one over the other?
Now, if you were to combine both, why would you want to store data in MongoDB as well as HDFS? MongoDB has map/reduce, so why would you want to send data to Hadoop for processing, when, again, both are compatible with Spark?
First let's look at what we're talking about:
Hadoop - an ecosystem. Its two main components are HDFS and MapReduce.
MongoDB - a document-oriented NoSQL database.
Let's compare them on two types of workloads:
High latency, high throughput (batch processing) - dealing with the question of how to process and analyze large volumes of data. Processing is done in a parallel and distributed way in order to finalize and retrieve results as efficiently as possible. Hadoop is the best way to deal with such a problem, managing and processing data in a distributed and parallel way across several servers.
Low latency, low throughput (immediate access to data, real-time results, a lot of users) - when you need to show immediate results in the quickest way possible, or run small parallel computations producing near-real-time (NRT) results for several concurrent users, a NoSQL database is the best way to go.
A simple example in a stack would be to use Hadoop to process and analyze massive amounts of data, then store your end results in MongoDB (sketched below) in order for you to:
Access them in the quickest way possible
Reprocess them now that they are on a smaller scale
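A minimal sketch of that stack's hand-off step, assuming the batch layer (e.g. a Hadoop MapReduce job) has already written its aggregated results to a reducer output file; the file name and collection are hypothetical:
from pymongo import MongoClient

results = MongoClient()["analytics"]["daily_results"]
# Load the batch layer's output into MongoDB so the serving layer can
# read the (now much smaller) end results with low latency.
with open("part-r-00000") as f:          # typical MapReduce reducer output
    for line in f:
        key, value = line.rstrip("\n").split("\t")
        results.update_one({"_id": key},
                           {"$set": {"count": int(value)}}, upsert=True)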
The bottom line is that you shouldn't look at Hadoop and MongoDB as competitors. Each one has its own best use case and approach to data; they complement and complete each other in your work with data.
Hope this makes sense.
Firstly, we should know what these two terms mean.
HADOOP
Hadoop is an open-source tool for Big Data analytics developed by the Apache Software Foundation. It is the most popular tool for both storing and analyzing Big Data, and it uses a clustered architecture to do so.
Hadoop has a vast ecosystem, and this ecosystem comprises some robust tools.
MongoDB
MongoDB is an open-source, general-purpose, document-based, distributed NoSQL database built for storing Big Data. MongoDB has a very rich query language, which results in high performance. Being a document-based database, it stores data in JSON-like documents.
DIFFERENCES
Both these tools are good enough for harnessing Big Data; it depends on your requirements. For some projects Hadoop is a good option, and for others MongoDB fits well.
Hope this helps you to distinguish between the two.

Timeseries NoSQL databases

I have read about various so-called time-series NoSQL databases. But NoSQL has only 4 types: key-value, document, columnar and graph.
For example, InfluxDB does not state which NoSQL type it is, but from the documentation it seems like a simple key-value store to me.
Are these time-series databases only specialized databases from one of those 4 types, or are they a new type of NoSQL database?
So to make it short: you can find both pure time-series databases and engines that run on top of a more generic engine like Redis, HBase, Cassandra, Elasticsearch...
Time-series databases (TSDBs) are data engines that focus on saving and retrieving time-based information very efficiently.
Very often, since these databases capture "events" (system, device/IoT, application ticks), they have to support highly concurrent writes, and they usually do a lot more writes than reads.
TSDBs store data points within a time series, and the timestamp is usually the main index/key, allowing very efficient time-range queries (give me the data points from this time to this time).
Data points can be multi-dimensional, with tags/labels attached.
TSDBs provide mathematical operations on data points: SUM, DIV, AVG, ... to combine data over time.
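To illustrate that access pattern on a generic engine, here is a minimal sketch, assuming a local MongoDB and a hypothetical "metrics" collection: timestamp-indexed points, a time-range query, and a SUM/AVG-style rollup:
from datetime import datetime, timedelta
from pymongo import MongoClient, ASCENDING

points = MongoClient()["demo"]["metrics"]
points.create_index([("ts", ASCENDING)])   # timestamp as the main index

points.insert_one({"ts": datetime.utcnow(), "value": 21.5,
                   "tags": {"sensor": "s1", "room": "lab"}})

# Time-range query: give me the data points from t1 to t2.
t2 = datetime.utcnow()
t1 = t2 - timedelta(hours=1)
recent = list(points.find({"ts": {"$gte": t1, "$lt": t2}}))

# SUM/AVG-style combination of data over that range, per tag.
pipeline = [
    {"$match": {"ts": {"$gte": t1, "$lt": t2}}},
    {"$group": {"_id": "$tags.sensor",
                "avg": {"$avg": "$value"}, "total": {"$sum": "$value"}}},
]
for row in points.aggregate(pipeline):
    print(row)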
So based on these characteristics you can find databases that offer this. As you mention, you can use specialized solutions like InfluxDB, Druid, Prometheus; or you can find more generic database engines that provide native time-series support or an extension. Let me list some of them:
Redis TimeSeries
Elasticsearch
OpenTSDB: runs on top of Apache HBase
Warp10: runs on top of Apache HBase

What are NoSQL key/value databases used for

I have been hearing a lot about NoSQL key/value databases online now. Can you give an example of what one is used for? What kind of real-world data is best for these kinds of databases?
I think 'What the heck are you actually using NoSQL for' is an excellent read for real life usage cases for NoSQL databases. Let me quote some of them here:
Managing large streams of non-transactional data: Apache logs, application logs, MySQL logs, clickstreams, etc.
Syncing online and offline data. This is a niche CouchDB has targeted.
Fast response times under all loads.
Avoiding heavy joins when the query load for complex joins becomes too large for an RDBMS.
Soft real-time systems where low latency is critical. Games are one example.
Applications where a wide variety of different write, read, query, and consistency patterns need to be supported. There are systems optimized for 50% reads 50% writes, 95% writes, or 95% reads.
Read-only applications needing extreme speed and resiliency, with simple queries, that can tolerate slightly stale data.
Applications requiring moderate performance, read/write access, simple queries, and completely authoritative data.
Read-only applications with complex query requirements.
Load balancing to accommodate data and usage concentrations and to help keep microprocessors busy.
Real-time inserts, updates, and queries.
Hierarchical data like threaded discussions and parts explosion.
Dynamic table creation.
Two tier applications where low latency data is made available through a fast NoSQL interface, but the data itself can be calculated and updated by high latency Hadoop apps or other low priority apps.
Sequential data reading. The right underlying data storage model needs to be selected. A B-tree may not be the best model for sequential reads.
Slicing off part of a service that may need better performance/scalability onto its own system. For example, user logins may need to be high performance, and this feature could use a dedicated service to meet those goals.
Caching. A high-performance caching tier for web sites and other applications (see the sketch after this list). An example is a cache for the Data Aggregation System used by the Large Hadron Collider.
Voting.
Real-time page view counters.
User registration, profile, and session data.
Document, catalog management and content management systems. These are facilitated by the ability to store complex documents as a whole rather than organized as relational tables. Similar logic applies to inventory, shopping carts, and other structured data types.
Archiving. Storing a large continual stream of data that is still accessible on-line.
Document-oriented databases with a flexible schema that can handle schema changes over time.
Analytics. Use MapReduce, Hive, or Pig to perform analytical queries and scale-out systems that support high write loads.
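As a concrete illustration of the caching and real-time counter items above, here is a minimal sketch using a key-value store (Redis via redis-py); the key names and TTL are assumptions:
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def get_user_profile(user_id, loader):
    # Serve from the cache; fall back to the slow source on a miss.
    key = f"user:{user_id}:profile"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    profile = loader(user_id)               # e.g. an RDBMS query or batch job
    r.setex(key, 300, json.dumps(profile))  # cache for 5 minutes
    return profile

# Real-time page view counter: a single atomic increment.
r.incr("pageviews:/home")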

Advantages of databases like Greenplum or Vertica compared to MongoDB or Cassandra [closed]

I am currently working on a few projects with MongoDB and Apache Cassandra respectively. I am also using Solr a lot, and I am handling "lots" of data with them (approx. 1-2 TB). I heard of Greenplum and Vertica for the first time last week, and I am not really sure where to put them in my brain. They seem to me like data warehouse (DWH) solutions, and I haven't really worked with DWH. And they seem to cost lots of money (e.g. $60k for 1 TB of storage in Greenplum). I am currently not handling petabytes of data and don't think I will, but products like Cassandra also seem to be able to handle this:
Cassandra is the acknowledged NoSQL leader when it comes to comfortably scaling to terabytes or petabytes of data.
via http://www.datastax.com/why-cassandra
So my question: Why should people use Greenplum & Co? Is there a huge advantage in comparison to these other products?
Thanks.
Cassandra, Greenplum and Vertica all handle huge amounts of data but in very different ways.
Some made-up use cases where each database has its strengths:
Use Cassandra for:
tweets.insert(key:user, data:blob);
tweets.get(key:user)
Use Greenplum for:
begin;
update account set balance = balance - 10 where account_id = 1;
update account set balance = balance + 10 where account_id = 2;
commit;
Use Vertica for:
select sum(balance)
over (partition by region order by account rows unbounded preceding)
from transactions;
I work in the telecom industry. We deal with large datasets and complex EDW (enterprise data warehouse) models. We started with Teradata and it was good for a few years. Then the data increased exponentially, and as you know, expansion in Teradata is expensive. So we evaluated EMC Greenplum, Oracle Exadata, HP Vertica and IBM Netezza.
In speed, generation of 20 reports went like this: 1. Vertica, 2. Netezza, 3. Greenplum, 4. Oracle.
In compression ratio: Vertica had a natural advantage. Among the others, IBM is good too.
The worst per the benchmarks were EMC and Oracle; as expected, since both want to sell tons of storage and hardware.
Scalability: all scale well.
Loading time: EMC is the best here; the others (Teradata, Vertica, Oracle, IBM) are good too.
Concurrent user queries: Vertica, EMC, Greenplum, then IBM. Oracle Exadata is comparatively slow for any type of query, but much better than its old-school 10g.
Price: Teradata > Oracle > IBM > HP > EMC.
Note: you need to compare apples to apples: the same number of cores, RAM, data volume, and reports.
We chose Vertica for its hardware-independent pricing model, lower pricing and good performance. Now all 40+ users are happy to generate reports without waiting, and it all fits on low-cost HP DL380 servers. It is great for the OLAP/EDW use case.
All this analysis is only for the EDW/analytics/OLAP case. I am still an Oracle fanboy for all OLTP, rich PL/SQL, connectivity etc. on any hardware or system. Exadata handles a mixed workload decently, but the price/performance ratio is unreasonable, and you still need to migrate 10g code to Exadata best practice (sort of MPP-like, bulk processing etc.), which is more time-consuming than they claim.
We've been working in Hadoop for 4 years, and Vertica for 2. We had massive loading and indexing problems with our tables in MySQL. We were running on fumes with our home-grown sharding solution. We could have invested heavily in developing a more sophisticated sharding solution, which would have been quite painful, imo. We could have thought harder about what data we absolutely needed to keep in a SQL database.
But at the end of the day, switching from MySQL to Vertica was what we chose. Vertica performance patterns are quite different from MySQL's, which comes with its own headaches. But it can load a lot of data very quickly, and it is good at heavy duty queries that would make MySQL's head spin.
The way I see it, Vertica is a solution when you are already invested in SQL and need a heavier-duty SQL database. I'm not an expert, so I couldn't tell you what a transition to Oracle or DB2 would have been like compared to Vertica, either in terms of integration effort or monetary cost.
Vertica offers a lot of features we've barely looked into. Those might be very attractive to others with use cases different to ours.
I'm a Vertica DBA and prior to that was a developer with Vertica. Michael Stonebraker (the guy behind Ingres, Vertica, and other databases) has some critiques of NoSQL that are worth listening to.
Basically, here are the advantages of Vertica as I see them:
it's rather fast on large amounts of data
its performance is (as far as I can gather) similar to other data warehousing solutions, but its advantage is clustering on commodity hardware, so you can scale by adding more commodity hardware. It looks cheap in terms of overall cost per TB. (Going from memory, not an exact quote.)
Again, it's for data warehousing.
You get to use traditional SQL and tables. It's under the hood that it's different.
I can't speak to the other products, but I'm sure a lot of them are fine too.
Edit: Here's a talk from Stonebraker: http://www.slideshare.net/Dataversity/newsql-vs-nosql-for-new-oltp-michael-stonebraker-voltdb
Pivotal, formerly Greenplum, is the well-funded spinoff from EMC, VMware and GE. Pivotal's market is enterprises (and Homeland Cybersecurity agencies) with multi-petabyte databases needing complex analytics and high-speed ETL. Greenplum's origin is a PostgreSQL DB redesigned for MapReduce-style MPP, with later additions for columnar support and HDFS. It marries the best of SQL + NoSQL, making NewSQL.
Features:
In 2015H1 most of their code, including Greenplum DB & HAWQ, will go Open Source. Some advanced management & performance features at the top of the stack will remain proprietary.
MPP (Massively Parallel Processing) shared-nothing RDBMS designed for multi-terabyte to multi-petabyte environments.
Full SQL Compliance - supporting all versions of SQL: ‘92, ‘99, 2003 OLAP, etc. 100% compatible with PostgreSQL 8.2.
The only SQL-over-Hadoop engine capable of handling all 99 queries used by the TPC-DS benchmark standard without rewriting. The competition cannot run many of them and is significantly slower. See the SIGMOD whitepaper.
ACID compliance.
Supports data stored in HDFS, Hive, HBase, Avro, ProtoBuf, Delimited Text and Sequence Files.
Solr/Lucene integration for multi-lingual full-text search embedded in the SQL.
Incorporates Open Source Software: Spring, Cloud Foundry, Redis.io, RabbitMQ, Grails, Groovy, Open Chorus, Pig, ZooKeeper, Mahout, MADlib, MapR. Some of these are used at EBSCO.
Native connectivity to HBase, which is a popular column-store-like technology for Hadoop.
VMware's participation in a $150m investment in MongoDB will likely lead to integration of petabyte-scale XML files.
Table-by-table specification of distribution keys allows you to design your table schemas to take advantage of node-local joins and group-bys, but it will perform well even without this.
Row and/or Column-oriented data storage. It is the only database where a table can be polymorphic with both columnar and row-based partitions as defined by the DBA.
A column-store table can have a different compression algorithm per column because different datatypes have different compression characteristics to optimize their storage.
Advanced Map-Reduce-like CBO Query Optimizer – queries can be run on hundreds of thousands of nodes.
It is the only database with a dynamic distributed pipeline execution model for query processing. While older databases rely on materialized execution, Greenplum doesn't have to write data to disk with every intermediate query step. It streams data to the next stage of a query plan in memory, and never has to materialize the data to disk, so it's much faster than what anybody has demonstrated on Hadoop.
Complex queries on large data sets are solved in seconds or even sub-seconds.
Data management – provides table statistics, table security.
Deep analytics – including data mining or machine learning algorithms using MADlib. Deep Semantic Textual Analytics using GPText.
Graphical Analysis - billion edge distributed in-memory graph database and algorithms using GraphLab.
Integration of SQL, Solr indexes, GPText, MADlib and GraphLab in a single query for massive syntactical parsing and graph/matrix affinity analysis for deep search analytics.
Fully ODBC/JDBC compliant.
Distributed ETL rate of 16 TB/hr! Integration with Talend available.
Cloud support: Pivotal plans to package its Cloud Foundry software so that it can be used to host Pivotal atop other clouds as well, including Amazon Web Services' EC2. Pivotal data management will be available for use in a variety of cloud settings and will not be dependent on a proprietary VMware system. Will target OpenStack, vSphere, vCloud Director, or private brands. IBM announced it has standardized on Cloud Foundry for its PaaS. Confluence page.
Two hardware "appliance" offerings: Isilon NAS & Greenplum DCA.
There is a lot of confusion about when to use a row database like MySQL or Oracle or a columnar DB like Infobright or Vertica or a NoSQL variant or Hadoop. We wrote a white paper to try to help sort out which technologies are best suited for which use cases - you can download Emerging Database Landscape (scroll half way down) or watch an on-demand webinar on the same topic.
Hope either is useful for you.

MongoDB vs. Cassandra vs. MySQL for real-time advertising platform

I'm working on a real-time advertising platform with a heavy emphasis on performance. I've always developed with MySQL, but I'm open to trying something new like MongoDB or Cassandra if significant speed gains can be achieved. I've been reading about both all day, but since both are being rapidly developed, a lot of the information appears somewhat dated.
The main data stored would be entries for each click, incremented rows for views, and information for each campaign (just some basic settings, etc). The speed gains need to be found in inserting clicks, updating view totals, and generating real-time statistic reports. The platform is developed with PHP.
Or maybe none of these?
There are several ways to achieve this with all of the technologies listed. It is more a question of how you use them. Your ideal solution may use a combination of these, with some consideration for usage patterns. I don't feel that the information out there is that dated because the concepts at play are very fundamental. There may be new NoSQL databases and fixes to existing ones, but your question is primarily architectural.
NoSQL solutions like MongoDB and Cassandra get a lot of attention for their insert performance. People tend to complain about the update/insert performance of relational databases but there are ways to mitigate these issues.
Starting with MySQL, you could review O'Reilly's High Performance MySQL, optimise the schema, add more memory, perhaps run it on different hardware from the rest of your app (assuming you used MySQL for that), or partition/shard the data. Another area to consider is your application: can you queue inserts and updates at the application level before insertion into the database (sketched below)? This will give you some flexibility and is probably useful in all cases. Depending on how your final schema looks, MySQL will give you some help with extracting the data as long as you are comfortable with SQL. This is a benefit if you need to use 3rd-party reporting tools etc.
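A minimal sketch of that application-level queueing idea (the class name and batch size are assumptions, not from the post): clicks are buffered in memory and flushed as one bulk write:
import threading

class ClickBuffer:
    def __init__(self, flush_fn, max_size=500):
        self._buf = []
        self._lock = threading.Lock()
        self._flush_fn = flush_fn
        self._max = max_size

    def add(self, click):
        batch = None
        with self._lock:
            self._buf.append(click)
            if len(self._buf) >= self._max:
                batch, self._buf = self._buf, []
        if batch:
            self._flush_fn(batch)  # one bulk write instead of 500 single ones

# flush_fn would issue a multi-row INSERT in MySQL, or insert_many in MongoDB.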
MongoDB and Cassandra are different beasts. My understanding is that it was easier to add nodes to the latter but this has changed since MongoDB has replication etc built-in. Inserts for both of these platforms are not constrained in the same manner as a relational database. Pulling data out is pretty quick too, and you have a lot of flexibility with data format changes. The tradeoff is that you can't use SQL (a benefit for some) so getting reports out may be trickier. There is nothing to stop you from collecting data in one of these platforms and then importing it into a MySQL database for further analysis.
Based on your requirements there are tools other than NoSQL databases which you should look at such as Flume. These make use of the Hadoop platform which is used extensively for analytics. These may have more flexibility than a database for what you are doing. There is some content from Hadoop World that you might be interested in.
Characteristics of MySQL:
Database locking (MUCH easier for financial transactions)
Consistency/security (as above, you can guarantee that, for instance, no changes happen between the time you read a bank account balance and you update it).
Data organization/refactoring (you can have disorganized data anywhere, but MySQL is better with tables that represent "types" or "components" and then combining them into queries -- this is called normalization).
MySQL (and relational databases in general) are better suited for the arbitrary datasets and requirements common in agile software projects.
Characteristics of Cassandra:
Speed: For simple retrieval of large documents. However, it will require multiple queries for highly relational data – and "by default" these queries may not be consistent (and the dataset can change between these queries).
Availability: The opposite of "consistency". Data is always available, regardless of being 100% "correct".[1]
Optional fields (wide columns): This CAN be done in MySQL with meta tables etc., but it's for-free and by-default in Cassandra.
Cassandra is key-value or document-based storage. Think about what that means. TYPICALLY I give Cassandra ONE KEY and I get back ONE DATASET. It can branch out from there, but that's basically what's going on. It's more like accessing a static file. Sure, you can have multiple indexes, counter fields etc. but I'm making a generalization. That's where Cassandra is coming from.
MySQL and SQL are based on group/set theory; SQL has a way to combine ANY relationship between data sets. It's pretty easy to take a MySQL query, make the query a "key" and the response a "value", and store it in Cassandra (e.g. make Cassandra a cache). That might help explain the trade-off too: MySQL allows you to always rearrange your data tables and the relationships between datasets simply by writing a different query; Cassandra not so much. And know that while Cassandra might PROVIDE features to do some of this stuff, it's not what it was built for.
MongoDB and CouchDB fit somewhere in the middle of those two extremes. I think MySQL can be a bit verbose[2] and annoying to deal with especially when dealing with optional fields, and migrations if you don't have a good model or tools. Also with scalability, I'm sure there are great technologies for scaling a MySQL database, but Cassandra will always scale, and easily, due to limitations on its feature set. MySQL is a bit more unbounded. However, NoSQL and Cassandra do not do joins, one of the critical features of SQL that allows one to combine multiple tables in a single query. So, complex relational queries will not scale in Cassandra.
[1] Consistency vs. availability is a trade-off within large distributed datasets. It takes a while to make all nodes aware of new data, and e.g. Cassandra opts to answer quickly rather than check with every single node before replying. This can cause weird edge cases when you base your writes on previously read data and overwrite other data. For more information look into the CAP theorem, ACID databases (in particular atomicity) as well as idempotent database operations. MySQL has this issue too, but the idea of high availability over correctness is very baked into Cassandra and gives it many of its scalability and speed advantages.
[2] SQL being "verbose" isn't a great reason to not use it – plus most of us aren't going to (and shouldn't) write plain-text SQL statements.
NoSQL solutions are better than MySQL, PostgreSQL and other RDBMS techs for this task. Don't waste your time with HBase/Hadoop; you have to be an astronaut to use it. I recommend MongoDB and Cassandra. Mongo is better for small datasets (if your data is at most 10 times bigger than your RAM; otherwise you have to shard, need more machines and use replica sets). For big data, Cassandra is the best. MongoDB has more query options and other functionality than Cassandra, but you need 64-bit machines for Mongo. There are workarounds for analytics on both sides. There are atomic counters on both sides. Both can scale well, but Cassandra is much better at scaling and high availability. Both have PHP clients, and both have good support and a community (Mongo's community is bigger).
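To make the atomic-counter point concrete for this advertising use case, a minimal pymongo sketch (the collection and field names are assumptions):
from pymongo import MongoClient

views = MongoClient()["ads"]["views"]
# Atomically increment a per-campaign, per-day view total; upsert=True
# creates the counter document the first time the key is seen.
views.update_one({"campaign_id": 7, "day": "2011-06-01"},
                 {"$inc": {"views": 1}}, upsert=True)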
Cassandra analytics project sample: Rainbird http://www.slideshare.net/kevinweil/rainbird-realtime-analytics-at-twitter-strata-2011
Mongo sample: http://www.slideshare.net/jrosoff/scalable-event-analytics-with-mongodb-ruby-on-rails
http://axonflux.com/how-superfeedr-built-analytics-using-mongodb
DoubleClick developers developed Mongo: http://www.informationweek.com/news/software/info_management/224200878
Cassandra vs. MongoDB
Are you considering Cassandra or MongoDB as the data store for your next project? Would you like to compare the two databases? Cassandra and MongoDB are both "NoSQL" databases, but the reality is that they are very different. They have very different strengths and value propositions, so any comparison has to be a nuanced one.
Let's start with initial requirements. Neither of these databases replaces RDBMS, nor are they "ACID" databases. So if you have a transactional workload where normalization and consistency are the primary requirements, neither of these databases will work for you. You are better off sticking with traditional relational databases like MySQL, Postgres, Oracle etc.
Now that we have relational databases out of the way, let's consider the major differences between Cassandra and MongoDB that will help you make the decision. In this post, I am not going to discuss specific features but will point out some high-level strategic differences to help you make your choice.
Expressive Object Model
MongoDB supports a rich and expressive object model. Objects can have properties, and objects can be nested in one another (for multiple levels). This model is very "object-oriented" and can easily represent any object structure in your domain. You can also index the property of any object at any level of the hierarchy – this is strikingly powerful (see the sketch below)! Cassandra, on the other hand, offers a fairly traditional table structure with rows and columns. Data is more structured, and each column has a specific type which can be specified during creation.
Verdict: If your problem domain needs a rich data model then MongoDB is a better fit for you.
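A small sketch of that nested-model point (pymongo; the "orders" collection and its shape are hypothetical):
from pymongo import MongoClient

orders = MongoClient()["shop"]["orders"]
orders.insert_one({
    "order_id": 1001,
    "customer": {"name": "Ada", "address": {"city": "London"}},
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
})
# Index a property nested two levels deep, then query it directly.
orders.create_index("customer.address.city")
print(orders.count_documents({"customer.address.city": "London"}))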
Secondary Indexes
Secondary indexes are a first-class construct in MongoDB. This makes it easy to index any property of an object stored in MongoDB even if it is nested. This makes it really easy to query based on these secondary indexes. Cassandra has only cursory support for secondary indexes. Secondary indexes are also limited to single columns and equality comparisons. If you are mostly going to be querying by the primary key then Cassandra will work well for you.
Verdict: If your application needs secondary indexes and needs flexibility in the query model then MongoDB is a better fit for you.
High Availability
MongoDB supports a “single master” model. This means you have a master node and a number of slave nodes. In case the master goes down, one of the slaves is elected as master. This process happens automatically but it takes time, usually 10-40 seconds. During this time of new leader election, your replica set is down and cannot take writes. This works for most applications but ultimately depends on your needs. Cassandra supports a “multiple master” model. The loss of a single node does not affect the ability of the cluster to take writes – so you can achieve 100% uptime for writes.
Verdict: If you need 100% uptime Cassandra is a better fit for you.
Write Scalability
MongoDB with its "single master" model can take writes only on the primary. The secondary servers can only be used for reads. So essentially, if you have a three-node replica set, only the master takes writes and the other two nodes are only used for reads. This greatly limits write scalability. You can deploy multiple shards, but essentially only 1/3 of your data nodes can take writes. Cassandra with its "multiple master" model can take writes on any server. Essentially, your write scalability is limited by the number of servers you have in the cluster: the more servers you have in the cluster, the better it will scale.
Verdict: If write scalability is your thing, Cassandra is a better fit for you.
Query Language Support
Cassandra supports the CQL query language, which is very similar to SQL. If you already have a team of data analysts, they will be able to port over a majority of their SQL skills, which is very important to large organizations. However, CQL is not full-blown ANSI SQL – it has several limitations (no join support, no OR clauses, etc.). MongoDB at this point has no support for a query language; queries are structured as JSON fragments.
Verdict: If you need query language support, Cassandra is the better fit for you.
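A hedged side-by-side sketch of that difference (a hypothetical "users" table/collection, queried with the DataStax Python driver and pymongo):
from cassandra.cluster import Cluster
from pymongo import MongoClient

# Cassandra: CQL reads like SQL.
session = Cluster(["127.0.0.1"]).connect("demo")
rows = session.execute(
    "SELECT name, email FROM users WHERE user_id = %s", (42,))

# MongoDB: the query is a JSON-like document, not a query language.
users = MongoClient()["demo"]["users"]
doc = users.find_one({"user_id": 42}, {"name": 1, "email": 1})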
Performance Benchmarks
Let’s talk performance. At this point, you are probably expecting a performance benchmark comparison of the databases. I have deliberately not included performance benchmarks in the comparison. In any comparison, we have to make sure we are making an apples-to-apples comparison.
Database model - The database model/schema of the application being tested makes a big difference. Some schemas are well suited for MongoDB and some are well suited for Cassandra. So when comparing databases it is important to use a model that works reasonably well for both databases.
Load characteristics – the characteristics of the benchmark load are very important, e.g. in write-heavy benchmarks I would expect Cassandra to smoke MongoDB. However, in read-heavy benchmarks, MongoDB and Cassandra should be similar in performance.
Consistency requirements - This is a tricky one. You need to make sure that the read/write consistency requirements specified are identical in both databases and not biased towards one participant. Very often in a number of the ‘Marketing’ benchmarks, the knobs are tuned to disadvantage the other side. So, pay close attention to the consistency settings.
One last thing to keep in mind is that the benchmark load may or may not reflect the performance of your application. So in order for benchmarks to be useful, it is very important to find a benchmark load that reflects the performance characteristics of your application. Here are some benchmarks you might want to look at:
- NoSQL Performance Benchmarks
- Cassandra vs. MongoDB vs. Couchbase vs. HBase
Ease of Use
If you had asked this question a couple of years ago MongoDB would be the hands-down winner. It’s a fairly simple task to get MongoDB up and running. In the last couple of years, however, Cassandra has made great strides in this aspect of the product. With the adoption of CQL as the primary interface for Cassandra, it has taken this a step further – they have made it very simple for legions of SQL programmers to use Cassandra very easily.
Verdict: Both are fairly easy to use and ramp up.
Native Aggregation
MongoDB has a built-in aggregation framework to run an ETL pipeline that transforms the data stored in the database. This is great for small to medium jobs, but as your data-processing needs become more complicated, the aggregation framework gets difficult to debug. Cassandra does not have a built-in aggregation framework; external tools like Hadoop and Spark are used for this.
Schema-less Models
In MongoDB, you can choose not to enforce any schema on your documents. While this was the default in prior versions, in newer versions you have the option to enforce a schema for your documents. Each document in MongoDB can have a different structure, and it is up to your application to interpret the data. While this is not relevant to most applications, in some cases the extra flexibility is important. Cassandra in the newer versions (with CQL as the default language) provides static typing: you need to define the type of every column upfront.
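A quick illustration of that schema-less point (pymongo; the "events" collection is hypothetical): two differently shaped documents living in the same collection:
from pymongo import MongoClient

events = MongoClient()["demo"]["events"]
# Two documents with different structures in one collection; the
# application decides how to interpret each shape on read.
events.insert_one({"type": "click", "campaign_id": 7, "cost": 0.05})
events.insert_one({"type": "view", "page": "/home", "ua": "Mozilla/5.0"})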
I'd also like to add Membase (www.couchbase.com) to this list.
As a product, Membase has been deployed at a number of Ad Agencies (AOL Advertising, Chango, Delta Projects, etc). There are a number of public case studies and examples of how these companies have used Membase successfully.
While it's certainly up for debate, we've found that Membase provides better performance and scalability than any other solution. What we lack in indexing/querying, we are planning on more than making up for with the integration of CouchDB as our new persistence backend.
As a company, Couchbase (the makers of Membase) has a large amount of knowledge and experience specifically serving the needs of Ad/targeting companies.
Would certainly love to engage with you on this particular use case to see if Membase is the right fit.
Please shoot me an email (perry -at- couchbase -dot- com) or visit us on the forums: http://www.couchbase.org/forums/
Perry Krug
I would look at New Relic as an example of a similar workload. They capture over 200 billion data points a day to disk and are using MySQL 5.6 (Percona) as a backend.
A blog post is available here:
http://blog.newrelic.com/2014/06/13/store-200-billion-data-points-day-disk/