Will MongoDB handle several TB of data? I've read posts saying that Mongo does well with < 1TB of data, for larger sets I should go with HBase. Is that true?
I need to store and later process several TB of text data.
These may be of interest to you:
Wordnik: data set in the >3TB range
Craiglist: shard cluster designed to support 10TB of data.
You'll find some additional case studies on 10gen's website, although not all of them provide specific numbers on data set sizes. There are also some older discussions on Stack Overflow about this very question (see here for a blurb about a user with 12TB of data from March 2010), and you'll likely find more case studies scattered among presentations on Speaker Deck or Slideshare. In short, MongoDB can certainly handle that amount of data (people are using it to that effect today), but you'll want to heed best practices, which is where existing presentations can come in handy.
MongoDB
Tens of thousands of organizations use MongoDB to build high-performance systems at scale. Over a third of the Fortune 100 and many of the most successful and innovative web companies rely on MongoDB. They've grown from single server deployments to clusters with over 1,000 nodes, delivering millions of operations per second on over 100 billion documents and petabytes of data.
Scalability is not just about speed. It's about 3 different metrics, which often work together:
Cluster Scale. Distributing the database across 100+ nodes, often in multiple data centers
Performance Scale. Sustaining 100,000+ database read and writes per second while maintaining strict latency SLAs
Data Scale. Storing 1 billion+ documents in the database
There are many examples of MongoDB users who are pushing the limits to scalability. Here are a few, organized around each scaling dimension.
You can find reference about MongoDB: Bringing Online Big Data to
Business Intelligence & Analytics at this article
Related
I have a MongoDB Atlas cluster that serves many customers. Each customer has its own database on the cluster.
I would like to reduce my application's impact on MongoDB data transfer costs, which have been increasing for the last few days, but the billing info provided by Atlas does not break down prices per database. Therefore, I have no way of knowing which customers are costly and what are the most costly queries in terms of data transfer.
Moreover, using the prices on a daily basis and a few queries, I cannot correlate insertion of resources in my application with prices. For example, let's say my resources are Cats, one day it will cost 5$ of data transfer with 5000 Cats inserted in total in the databases, but the next day, it's going to cost 13$ with 1500 Cats inserted.
Do you know of tools or something in the Atlas dashboard I might've missed that could help me better track costs per customer, or say, a cost per Cat (in my example) so that I build a pricing model for my customers?
Thank you
You are most likely going to need separate projects and deployments.
A MongoDB client instance is generally capable of using any database on the server (subject to authorization rules and APIs provided in the language in question), therefore to get a breakdown of data transfer by database would require the server to track bytes transferred per operation and then aggregate those counts. As far as I know this isn't a feature that currently exists.
The most practical way of tracking this today is probably writing a layer on top of the driver on the client side that would look at data actually received.
I am still not able to relate in real-time how nosql is beneficial whereas we have indexes too in traditional RDBMS's. If someone can suggest columnar databases advantages in real application particularly in terms of using structure, semistructured or unstructured data.
Largely, it depends on what you want your datastore to do. If you want to be able to scale to meet storage or operational demands, a RDBMS can only take you so far.
It comes down how you can scale to meet demand. A RDBMS is really only capable of scaling vertically. That is, add more RAM, add more disk, etc. A distributed (NoSQL) database makes scaling easier by allowing you to add more machine instances. This is known as scaling horizontally.
Here's an example using Cassandra:
Let's say I have a 3 node cluster, and my keyspace (database) is also configured with a replication factor (RF) of 3. This means that each node is responsible for 100% of the data. I load my data, and it takes up 100GB of disk space (on each node). Now, while I might have 300GB of data total in my cluster, a single copy of my data is 100GB.
So my product team comes to me and says they need to double the amount of data they have. I know that I built their 3 node cluster with 200GB drives. If I did nothing, those drives would pretty much fill-up (and if they didn't they wouldn't leave room for much else).
Now it's up to me to scale the cluster to meet their space demands. I'll start by adding 3 new nodes to the cluster (for a total of 6), but I'll leave my RF at 3. This makes each node responsible for 50% of the data, or 50GB. When my product team loads more data to meet their "doubling" requirement, each node should climb back up to about 100GB. A single copy of the data is now 200GB. But with each node responsible for 50%, each 200GB drive still only has 100GB.
Example #2:
Let's say that the cluster above with 6 nodes is capable of supporting an operational load of 10,000 operations per second (ops). My product team comes to me again, saying that for the holiday season they project needing to support 20,000 ops. As the current cluster can only support half of that, it will choke under the intense throughput, and one or more nodes may crash.
As Cassandra scales linearly, the way to achieve this is to (again) double the size of the cluster. So I increase it from 6 nodes to 12 nodes, while still maintaining my RF of 3. After running some performance testing, they verify that it can indeed support 20,000 ops. As a single copy of my data is 200GB, the total data footprint remains 600GB. With 12 nodes, each node is now responsible for only 25% of the data, or 50GB.
So scalability is the advantage. But how about modeling the data? The main idea in distributed database modeling, is two-fold:
Build a table structure which is keyed to distribute well. We don't want uneven amounts of data on each node.
Build the key on the table so that it matches our query requirements.
One of the drawbacks of a NoSQL database, is that your query patterns become restricted. In an effort to cut down on network time, you want to ensure that your query can be served by a single node.
This usually means using natural keys, as those are more in-line with what you are asking of your data. Surrogate keys (alpha, numerical, or both) distribute well, but aren't really useful for querying. User "Bob Jones" might be id "3582346556230" in my system. But when I want to query Bob's data, I'll probably never want to ask for it by "3582346556230," because that doesn't mean anything to the application or the context in which the data is used.
Also, you want your data to have structure. Unstructured data is un-queryable data. Simple as that. If you want unstructured data to be queryable, you need to parse-out its identifying aspects to be used as keys. You don't want to "search" or run SELECT * FROM queries. Full table scans in NoSQL databases are even more resource consuming than their RDBMS counterparts, because they have to check each node, sort through replicas, and thus incurs extra network time.
NoSQL databases give you the ability to scale (for increases in data or demand). But it's important to note that their scalability can make some things (which a RDBMS might be good at), more difficult than you're used to.
The R in RDBMS, relational, is the biggest thing missing from Mongo. There's very little to no way to make the database understand how entries in different tables collections relate to each other. One of the big strengths of RDBMSs is the ability to define constraints which the database will enforce, most typically foreign key constraints which ensure that an id in one table refers to an existing id in another table.
One requirement for the database to be able to enforce such constraints is obviously that everything needs to go through one source of truth and there needs to be one central entity cross-checking the data; it cannot be decentralised since discrepancies between two different primary sources can lead to data inconsistencies.
In Mongo, each data blob is pretty much independent. It doesn't refer to other entries in any way enforced by the database. Mongo also has weak to no ACID guarantees, meaning there's little protection against race condition inserts or updates. In a word: Mongo makes little guarantees with regards to data consistency and mostly offloads these kinds of concerns to the application layer. That allows it to work more decentralised.
E.g. a good way to scale Mongo is to have many secondary servers which replicate a primary server for read-only access. There's no guarantee that the primary and secondaries will be in sync at any given time, it may take a couple of seconds for data written to the primary to trickle to the secondaries. But this allows you to have a virtually unlimited number of secondary read-only servers, which is great for scaling a database under heavy read load.
The way specifically Mongo handles its clusters also allows it to have a very high uptime, as the cluster will reorganise itself into primaries and secondaries automatically if a server goes down. This even allows for rolling maintenance without any client downtime.
Not having to enforce complex constraints or transaction consistency during writing also allows a more fire-and-forget style of writing to the database, which can be much faster. Again, at the cost of allowing inconsistent data. Which is why most writing pretty much means atomically updating a single document in a collection with no guarantees about other documents, which is something of a different paradigm than RDBMS transactional updates across many tables.
I would not recommend Mongo for storing things like a financial ledger, which heavily relies on transactional guarantees for consistency. However, things like Twitter are a perfect case for it: many independent snippets of data which must be read by a massive number of clients.
What is the objective reason fo why don't most NoSQL storage solutions have some kind of "pointers" for ultra-efficient joins, like the pre-relational DBMSs had?
I mean, I partially understand the theoretical reasons for why classical RDBMSs have ditched pointers (need to update them and double sync them for memory and disk, no "disks" fast enough to be treatable like random-access for some use cases, like modern SSDs can, etc.).
But of the many NoSQL solutions out there, why do just so few of them realize that this model would be awesome (exception I know of would be OrientDB and Neo4j) for many practical cases, not only ones that need graph traversals. I mean, when you need things like multi-joins, you need to ping pong Mongo and do N queries instead of one.
Isn't the use case of a NoSQL document-db overlapping enough with the one of graph DBs that such a feature would make sense and would just provide all the practical features of SQL-joins to the NoSQL solutions with not much extra cost, and for most queries would make indexes useless, and take up much less space for huge datasets?
(...and as a bonus any NoSQL solution would be ready to use as a graph db, and doing a ~100 nodes path length traversal of a graph stored in Mongo would just automagically work fast enough)
I believe the key problem is data locality and horizontal scalability. A premise of NoSQL is that the read-heavy models of RBDMSs, i.e. those that require joins, lead to bottlenecks.
Think of Twitter: the original data model was read-heavy, but the joins you need to make are insanely large (billions of tweets x hundreds of millions of users x tens of billions of follower-followee relations that are wildly varying in size [1-10M, or whatever aplusk has these days]).
When even the ids you'll want to join don't fit in a reasonable machine's RAM, calculating the overlap of ids becomes terribly expensive. If you take the actual data into account, horizontal scalability becomes next to impossible because there's no a priori knowledge which shards / machines will need to be hit. Storing all follower pointers in every follower-list would require insane bookkeeping for trivial changes, while not exploiting creation-time locality (or at least, creation-time locality per feed).
In a multi-tenant application, you can always shard by the tenants, or by the sales region or by agents or maybe even by time: You can find some locality criterion that is good for like > 95% of the cases.
With graphs, that becomes a lot more complicated, especially those which have certain connection properties (scale-free networks with small diameter / small world phenomenon): A simple post, say by a celebrity, can quickly spread through a large portion of the entire network, meaning that practically every query must hit the one node that holds the post.
Sure, the post itself would be cached by the web servers, but add likes and comments, or favorites and retweets and the story becomes a nightmare (writes!) Add in notification emails, content ranking and filtering and you're in true horror.
doing a ~100 nodes path length traversal of a graph stored in Mongo would just automagically work fast enough
If that data happens to be on 100 different nodes, the sheer network overhead will be in the range of 50ms, even in a single datacenter with no congestion and idle machines. If this spreads across the world or individual queries take a little longer, you'll quickly end up at 5000ms. Also, the query would fail if only one machine is down.
This depends too much on the details of the network, which is why the problem should be solved by application code, not by the data store.
when you need things like multi-joins, you need to ping pong Mongo and do N queries instead of one
When you need multi-joins in MongoDB, you're using the wrong tool for your data model, or vice versa. Multi-Join means normalized means read-heavy which battles the key concept of MongoDB. However, you can store quite large association lists even in MongoDB. But the tool becomes almost irrelevant here: If you look at Facebook TAO, for instance, there's little technology dependence in that.
I am currently testing Redshift for a SaaS near-realtime analytics application.
The queries performance are fine on a 100M rows dataset.
However, the concurrency limit of 15 queries per cluster will become a problem when more users will be using the application at the same time.
I cannot cache all aggregated results since we authorize to customize filters on each query (ad-hoc querying)
The requirements for the application are:
queries must return results within 10s
ad-hoc queries with filters on more than 100 columns
From 1 to 50 clients connected at the same time on the application
dataset growing at 10M rows / day rate
typical queries are SELECT with aggregated function COUNT, AVG with 1 or 2 joins
Is Redshift not correct for this use case? What other technologies would you consider for those requirements?
This question was also posted on the Redshift Forum. https://forums.aws.amazon.com/thread.jspa?messageID=498430񹫾
I'm cross-posting my answer for others who find this question via Google. :)
In the old days we would have used an OLAP product for this, something like Essbase or Analysis Services. If you want to look into OLAP there is an very nice open source implementation called Mondrian that can run over a variety of databases (including Redshift AFAIK). Also check out Saiku for an OSS browser based OLAP query tool.
I think you should test the behaviour of Redshift with more than 15 concurrent queries. I suspect that it will not be user noticeable as the queries will simply queue for a second or 2.
If you prove that Redshift won't work you could test Vertica's free 3-node edition. It's a bit more mature than Redshift (i.e. it will handle more concurrent users) and much more flexible about data loading.
Hadoop/Impala is overly complex for a dataset of your size, in my opinion. It is also not designed for a large number of concurrent queries or short duration queries.
Shark/Spark is designed for the case where you data is arriving quickly and you have a limited set of metrics that you can pre-calculate. Again this does not seem to match your requirements.
Good luck.
Redshift is very sensitive to the keys used in joins and group by/order by. There are no dynamic indexes, so usually you define your structure to suit the tasks.
What you need to ensure is that your joins match the structure 100%. Look at the explain plans - you should not have any redistribution or broadcasting, and no leader node activities (such as Sorting). It sounds like the most critical requirement considering the amount of queries you are going to have.
The requirement to be able to filter/aggregate on arbitrary 100 columns can be a problem as well. If the structure (dist keys, sort keys) don't match the columns most of the time, you won't be able to take advantage of Redshift optimisations. However, these are scalability problems - you can increase the number of nodes to match your performance, you just might be surprised of the costs of the optimal solution.
This may not be a serious problem if the number of projected columns is small, otherwise Redshift will have to hold large amounts of data in memory (and eventually spill) while sorting or aggregating (even in distributed manner), and that can again impact performance.
Beyond scaling, you can always implement sharding or mirroring, to overcome some queue/connection limits, or contact AWS support to have some limits lifted
You should consider pre-aggregation. Redshift can scan billions of rows in seconds as long as it does not need to do transformations like reordering. And it can store petabytes of data - so it's OK if you store data in excess
So in summary, I don't think your use case is not suitable based on just the definition you provided. It might require work, and the details depend on the exact usage patterns.
I have been hearing alot about nosql key/value databases online now. Can you give an example of what one is used for. What kind of real world data is best for these kind of databases?
I think 'What the heck are you actually using NoSQL for' is an excellent read for real life usage cases for NoSQL databases. Let me quote some of them here:
Managing large streams of non-transactional data: Apache logs, application logs, MySQL logs, clickstreams, etc.
Syncing online and offline data. This is a niche CouchDB has targeted.
Fast response times under all loads.
Avoiding heavy joins for when the query load for complex joins become too large for a RDBMS.
Soft real-time systems where low latency is critical. Games are one example.
Applications where a wide variety of different write, read, query, and consistency patterns need to be supported. There are systems optimized for 50% reads 50% writes, 95% writes, or 95% reads.
Read-only applications needing extreme speed and resiliency, simple
queries, and can tolerate slightly stale data.
Applications requiring moderate performance, read/write access, simple queries, completely
authoritative data.
Read-only application which complex query requirements.
Load balance to accommodate data and usage concentrations and to help keep microprocessors busy.
Real-time inserts, updates, and queries.
Hierarchical data like threaded discussions and parts explosion.
Dynamic table creation.
Two tier applications where low latency data is made available through a fast NoSQL interface, but the data itself can be calculated and updated by high latency Hadoop apps or other low priority apps.
Sequential data reading. The right underlying data storage model needs to be selected. A B-tree may not be the best model for sequential reads.
Slicing off part of service that may need better performance/scalability onto it's own system. For example, user logins may need to be high performance and this feature could use a dedicated service to meet those goals.
Caching. A high performance caching tier for web sites and other applications. Example is a cache for the Data Aggregation System used by the Large Hadron Collider.
Voting.
Real-time page view counters.
User registration, profile, and session data.
Document, catalog management and content management systems. These are facilitated by the ability to store complex documents has a whole rather than organized as relational tables. Similar logic applies to inventory, shopping carts, and other structured data types.
Archiving. Storing a large continual stream of data that is still accessible on-line.
Document-oriented databases with a flexible schema that can handle schema changes over time.
Analytics. Use MapReduce, Hive, or Pig to perform analytical queries and scale-out systems that support high write loads.