MongoDB: BIllions of documents in a collection - mongodb

I need to load 6.6 billion bigrams into a collection but I can't find any information on the best way to do this.
Loading that many documents onto a single primary key index would take forever but as far as I'm aware mongo doesn't support the equivalent of partitioning?
Would sharding help? Should I try and split the data set over many collections and build that logic into my application?

It's hard to say what the optimal bulk insert is -- this partly depends on the size of the objects you're inserting and other immeasurable factors. You could try a few ranges and see what gives you the best performance. As an alternative, some people like using mongoimport, which is pretty fast, but your import data needs to be json or csv. There's obviously mongodrestore, if the data is in BSON format.
Mongo can easily handle billions of documents and can have billions of documents in the one collection but remember that the maximum document size is 16mb. There are many folk with billions of documents in MongoDB and there's lots of discussions about it on the MongoDB Google User Group. Here's a document on using a large number of collections that you may like to read, if you change your mind and want to have multiple collections instead. The more collections you have, the more indexes you will have also, which probably isn't what you want.
Here's a presentation from Craigslist on inserting billions of documents into MongoDB and the guy's blogpost.
It does look like sharding would be a good solution for you but typically sharding is used for scaling across multiple servers and a lot of folk do it because they want to scale their writes or they are unable to keep their working set (data and indexes) in RAM. It is perfectly reasonable to start off with a single server and then move to a shard or replica-set as your data grows or you need extra redundancy and resilience.
However, there are other users use multiple mongods to get around locking limits of a single mongod with lots of writes. It's obvious but still worth saying but a multi-mongod setup is more complex to manage than a single server. If your IO or cpu isn't maxed out here, your working set is smaller than RAM and your data is easy to keep balanced (pretty randomly distributed), you should see improvement (with sharding on a single server). As a FYI, there is potential for memory and IO contention. With 2.2 having improved concurrency with db locking, I suspect that there will be much less of a reason for such a deployment.
You need to plan your move to sharding properly, i.e. think carefully about choosing your shard key. If you go this way then it's best to pre-split and turn off the balancer. It will be counter-productive to be moving data around to keep things balanced which means you will need to decide up front how to split it. Additionally, it is sometimes important to design your documents with the idea that some field will be useful for sharding on, or as a primary key.
Here's some good links -
Choosing a Shard Key
Blog post on shard keys
Overview presentation on sharding
Presentation on Sharding Best Practices

You can absolutely shard data in MongoDB (which partitions across N servers on the shard key). In fact, that's one of it's core strengths. There is no need to do that in your application.
For most use cases, I would strongly recommend doing that for 6.6 billion documents. In my experience, MongoDB performs better with a number of mid-range servers rather than one large one.

Related

Should data be clustered as databases or collections [duplicate]

I am designing a system with MongoDb (64 bit version) to handle a large amount of users (around 100,000) and each user will have large amounts of data (around 1 million records).
What is the best strategy of design?
Dump all records in single collection
Have a collection for each user
Have a database for each user.
Many Thanks,
So you're looking at somewhere in the region of 100 billion records (1 million records * 100,000 users).
The preferred way to deal with large amounts of data is to create a sharded cluster that splits the data out over several servers that are presented as single logical unit via the mongo client.
Therefore the answer to your question is put all your records in a single sharded collection.
The number of shards required and configuration of the cluster is related to the size of the data and other factors such as the quantity and distribution of reads and writes. The answers to those questions are probably very specific to your unique situation, so I won't attempt to guess them.
I'd probably start by deciding how many shards you have the time and machines available to set up and testing the system on a cluster of that many machines. Based on the performance of that, you can decide whether you need more or fewer shards in your cluster
So you are looking for 100,000,000 detail records overall for 100K users?
What many people don't seem to understand is that MongoDB is good at horizontal scaling. Horizontal scaling is normally classed as scaling huge single collections of data across many (many) servers in a huge cluster.
So already if you use a single collection for common data (i.e. one collection called user and one called detail) you are suiting MongoDBs core purpose and build.
MongoDB, as mentioned, by others is not so good at scaling vertically across many collections. It has a nssize limit to begin with and even though 12K initial collections is estimated in reality due to index size you can have as little as 5K collections in your database.
So a collection per user is not feasible at all. It would be using MongoDB against its core principles.
Having a database per user involves the same problems, maybe more, as having singular collections per user.
I have never encountered some one not being able to scale MongoDB to the billions or even close to the 100s of billions (or maybe beyond) on a optimised set-up, however, I do not see why it cannot; after all Facebook is able to make MySQL scale into the 100s of billions per user (across 32K+ shards) for them and the sharding concept is similar between the two databases.
So the theory and possibility of doing this is there. It is all about choosing the right schema and shard concept and key (and severs and network etc etc etc etc).
If you were to witness problems you could go for splitting archive collections, or deleted items away from the main collection but I think that is overkill, instead you want to make sure that MongoDB knows where each segment of your huge dataset is at any given point in time on the master and ensure that this data is always hot, that way queries that don't do a global and scatter OP should be quite fast.
About a collection on each users:
By default configuration, MongoDB is limited to 12k collections. You can increase the size of this with --nssize but it's not unlimited.
And you have to count index into this 12k. (check "namespaces" concept on mongo documentation).
About a database for each user:
For a model point of view, that's very curious.
For technical, there is no limit on mongo, but you probably have a limit with file descriptor (limit from you OS/settings).
So as #Rohit says, the two last are not good.
Maybe you should explain more about your case.
Maybe you can cut users into different collections (ex: one for each first letter of name etc., or for each service of the company...).
And, of course use sharding.
Edit: maybe MongoDb is not the best database for your use case.

Is it worth splitting one collection into many in MongoDB to speed up querying records?

I have a query for a collection. I am filtering by one field. I thought, I can speed up query, if based on this field I make many separate collections, which collection's name would contain that field name, in previous approach I filtered with. Practically I could remove filter component in a query, because I need only pick the right collection and return documents in it as response. But in this way ducoments will be stored redundantly, a document earlier was stored only once, now document might be stored in more collections. Is this approach worth to follow? I use Heroku as cloud provider. By increasing of the number of dynos, it is easy to serve more user request. As I know read operations in MongoDB are highly mutual, parallel executed. Locking occure on document level. Is it possible gain any advantage by increasing redundancy? Of course index exists for that field.
If it's still within the same server, I believe there may be little parallelization gain (from the database side) in doing it this way, because for a single server, it matters little how your document is logically structured.
All the server cares about is how many collection and indexes you have, since it stores those collections and associated indexes in a number of files. It will need to load these files as the collection is accessed.
What could potentially be an issue is if you have a massive number of collections as a result, where you could hit the open file limit. Note that the open file limit is also shared with connections, so with a lot of collections, you're indirectly reducing the number of possible connections.
For illustration, let's say you have a big collection with e.g. 5 indexes on them. The WiredTiger storage engine stores the collection as:
1 file containing the collection data
1 file containing the _id index
5 files containing the 5 secondary indexes
Total = 7 files.
Now you split this one collection across e.g. 100 collections. Assuming the collections also requires 5 secondary indexes, in total they will need 700 files in WiredTiger (vs. of the original 7). This may or may not be desirable from your ops point of view.
If you require more parallelization if you're hitting some ops limit, then sharding is the recommended method. Sharding the busy collection across many different shards (servers) will immediately give you better parallelization vs. a single server/replica set, given a properly chosen shard key designed to maximize parallelization.
Having said that, sharding also requires more infrastructure and may complicate your backup/restore process. It will also require considerable planning and testing to ensure your design is optimal for your use case, and will scale well into the future.

MongoDB - Can I shard all new db's (Created by application) automatically?

My team will deploy a new version of our app (Capture social media posts, hashtags etc.) they create a different DB for each user and we may have thousands of collections on each DB. I read all mongoDB shard documentation and I saw that I can only shard an collection or one DB at time, I'm missing something ?
We will start this new version fresh, without any databases and we will grow from 0 again (For now, we have 23k users) but we will escalate this number really quickly (100.000+ at the end of the year)
My question is: I really need a Shard cluster ? (My test setup have 3 shards with 3 microshards, 3 config servers and 2 mongos) for now, in production, i have a large server doing all the hard work but i dont want to scale to top, the horizontal scale is the best choice, i think.
Can I shard all my databases automatically or I really need to do that one by one doing the shard key procedure and so. ?
Thanks in advance
You are reading correctly. What you intend to do is so far away from what any sensible person would do that MongoDB doesn't offer any tools to support this. If you really want to go with this WTF solution, your application will be responsible to set up sharding for each collection it creates. This forces you to give administration permission to the application (despite what any security guides recommend).
"Will you really need a sharded cluster" - that depends on how much data you will have and how often you query it with what kind of query. But it is unlikely to work anyway, because your sharded cluster will have to manage (100,000 databases* 1.000 collections) = a hundred million collections. MongoDB is not designed for scaling in that direction. The cluster will likely be so busy with bookkeeping that you won't really see any notable performance gain.
It is also questionable if clustering would even theoretically make sense. Clustering is usually only useful when you have very large collections. But in your scenario where your data is so heavily fragmented into a million collections, each individual collection is unlikely to be very large.
If you really want to go this route, it might in fact be a better solution to separate the databases physically by assigning each user to a database server.
Or you could just build a database architecture like a normal team would with one database for all users and one collection per type of document. You would then speed up lookups by creating a compound index on user and whatever criteria you used to tell which database a document belonged to. This index might also be a good shard key.

120 mongodb collections vs single collection - which one is more efficient?

I'm new to mongodb and I'm facing a dilemma regarding my DB Schema design:
Should I create one single collection or put my data into several collections (we could call these categories I suppose).
Now I know many such questions have been asked, but I believe my case is different for 2 reasons:
If I go for many collections, I'll have to create about 120 and that's it. This won't grow in the future.
I know I'll never need to query or insert into multiple collections. I will always have to query only one, since a document in collection X is not related to any document stored in the other collections. Documents may hold references to other parts of the DB though (like userId etc).
So my question is: could the 120 collections improve query performance? Is this a useful optimization in my case?
Or should I just go for single collection + sharding?
Each collection is expected hold millions of documents. If use only one, it will store billions of docs.
Thanks in advance!
------- Edit:
Thanks for the great answers.
In fact the 120 collections is only a self made limit, it's not really optimal:
The data in the collections is related to web publishers. There could be millions of these (any web site can join).
I guess the ideal situation would be if I could create a collection for each publisher (to hold their data only). But obviously, this is not possible due to mongo limitations.
So I came up with the idea of a fixed number of collections to at least distribute the data somehow. Like: collection "A_XX" would hold XX Platform related data for publishers whose names start with "A".. etc. We'll only support a few of these platforms, so 120 collections should be more than enough.
On another website someone suggested using many databases instead of many collections. But this means overhead and then I would have to use / manage many different connections.
What do you think about this? Is there a better solution?
Sorry for not being specific enough in my original question.
Thanks in advance
Single Sharded Collection
The edited version of the question makes the actual requirement clearer: you have a collection that can potentially grow very large and you want an approach to partition the data. The artificial collection limit is your own planned partitioning scheme.
In that case, I think you would be best off using a single collection and taking advantage of MongoDB's auto-sharding feature to distribute the data and workload to multiple servers as required. Multiple collections is still a valid approach, but unnecessarily complicates your application code & deployment versus leveraging core MongoDB features. Assuming you choose a good shard key, your data will be automatically balanced across your shards.
You can do not have to shard immediately; you can defer the decision until you see your workload actually requiring more write scale (but knowing the option is there when you need it). You have other options before deciding to shard as well, such as upgrading your servers (disks and memory in particular) to better support your workload. Conversely, you don't want to wait until your system is crushed by workload before sharding so you definitely need to monitor the growth. I would suggest using the free MongoDB Monitoring Service (MMS) provided by 10gen.
On another website someone suggested using many databases instead of many collections. But this means overhead and then I would have to use / manage many different connections.
Multiple databases will add significantly more administrative overhead, and would likely be overkill and possibly detrimental for your use case. Storage is allocated at the database level, so 120 databases would be consuming much more space than a single database with 120 collections.
Fixed number of collections (original answer)
If you can plan for a fixed number of collections (120 as per your original question description), I think it makes more sense to take this approach rather than using a monolithic collection.
NOTE: the design considerations below still apply, but since the question was updated to clarify that multiple collections are an attempted partitioning scheme, sharding a single collection would be a much more straightforward approach.
The motivations for using separate collections would be:
Your documents for a single large collection will likely have to include some indication of the collection subtype, which may need to be added to multiple indexes and could significantly increase index sizes. With separate collections the subtype is already implicit in the collection namespace.
Sharding is enabled at the collection level. A single large collection only gives you an "all or nothing" approach, whereas individual collections allow you to control which subset(s) of data need to be sharded and choose more appropriate shard keys.
You can use the compact to command to defragment individual collections. Note: compact is a blocking operation, so the normal recommendation for a HA production environment would be to deploy a replica set and use rolling maintenance (i.e. compact the secondaries first, then step down and compact the primary).
MongoDB 2.4 (and 2.2) currently have database-level write lock granularity. In practice this has not proven a problem for the vast majority of use cases, however multiple collections would allow you to more easily move high activity collections into separate databases if needed.
Further to the previous point .. if you have your data in separate collections, these will be able to take advantage of future improvements in collection-level locking (see SERVER-1240 in the MongoDB Jira issue tracker).
The main problem here is that you will gain very little performance in the current MongoDB versions if you separate out collections into the same database. To get any sort of extra performance over a single collection setup you would need to move the collections out into separate databases, then you will have operational overhead for judging what database you should query etc.
So yes, you could go for 120 collections easily however, you won't really gain anything currently due to: https://jira.mongodb.org/browse/SERVER-1240 not being implemented (anytime soon).
Housing billions of documents in a single collection isn't too bad. I presume that even if you was to house this in separate collections it probably would not be on a single server either, just like sharding a single collection, so any speed reduction due to multi server setup will also not matter in this case.
In my personal opinion, using a single collection is easier on everything.

MongoDB sharding scalability - performance of queries hitting a single chunk?

In doing some preliminary tests of MongoDB sharding, I hoped and expected that the time to execute queries that hit only a single chunk of data on one shard/machine would remain relatively constant as more data was loaded. But I found a significant slowdown.
Some details:
For my simple test, I used two machines to shard and tried queries on similar collections with 2 million rows and 7 million rows. These are obviously very small collections that don’t even require sharding, yet I was surprised to already see a significant consistent slowdown for queries hitting only a single chunk. Queries included the sharding key, were for result sets ranging from 10s to 100000s of rows, and I measured the total time required to scroll through the entire result sets. One other thing: since my application will actually require much more data than can fit into RAM, all queries were timed based on a cold cache.
Any idea why this would be? Has anyone else observed the same or contradictory results?
Further details (prompted by Theo):
For this test, the rows were small (5 columns including _id), and the key was not based on _id, but rather on a many-valued text column that almost always appears in queries.
The command db.printShardingStatus() shows how many chunks there are as well as the exact key values used to split ranges for chunks. The average chunk contains well over 100,000 rows for this dataset and inspection of key value splits verifies that the test queries are hitting a single chunk.
For the purpose of this test, I was measuring only reads. There were no inserts or updates.
Update:
Upon some additional research, I believe I determined the reason for the slowdown: MongoDB chunks are purely logical, and the data within them is NOT physically located together (source: "Scaling MongoDB" by Kristina Chodorow). This is in contrast to partitioning in traditional databases like Oracle and MySQL. This seems like a significant limitation, as sharding will scale horizontally with the addition of shards/machines, but less well in the vertical dimension as data is added to a collection with a fixed number of shards.
If I understand this correctly, if I have 1 collection with a billion rows sharded across 10 shards/machines, even a query that hits only one shard/machine is still querying from a large collection of 100 million rows. If values for the sharding key happen to be located contiguously on disk, then that might be OK. But if not and I'm fetching more than a few rows (e.g. 1000s), then this seems likely to lead to lots of I/O problems.
So my new question is: why not organize chunks in MongoDB physically to enable vertical as well as horizontal scalability?
What makes you say the queries only touched a single chunk? If the result ranged up to 100 000 rows it sounds unlikely. A chunk is max 64 Mb, and unless your objects are tiny that many won't fit. Mongo has most likely split your chunks and distributed them.
I think you need to tell us more about what you're doing and the shape of your data. Were you querying and loading at the same time? Do you mean shard when you say chunk? Is your shard key something else than _id? Do you do any updates while you query your data?
There are two major factors when it comes to performance in Mongo: the global write lock and it's use of memory mapped files. Memory mapped files mean you really have to think about your usage patterns, and the global write lock makes page faults hurt really badly.
If you query for things that are all over the place the OS will struggle to page things in and out, this can be especially hurting if your objects are tiny because whole pages have to be loaded just to access a small pieces, lots of RAM will be wasted. If you're doing lots of writes that will lock reads (but usually not that badly since writes happen fairly sequentially) -- but if you're doing updates you can forget about any kind of performance, the updates block the whole database server for significant amounts of time.
Run mongostat while you're running your tests, it can tell you a lot (run mongostat --discover | grep -v SEC to see the metrics for all you shard masters, don't forget to include --port if your mongos is not running on 27017).
Addressing the questions in your update: it would be really nice if Mongo did keep chunks physically together, but it is not the case. One of the reasons is that sharding is a layer on top of mongod, and mongod is not fully aware of it being a shard. It's the config servers and mongos processes that know of shard keys and which chunks that exist. Therefore, in the current architecture, mongod doesn't even have the information that would be required to keep chunks together on disk. The problem is even deeper: Mongo's disk format isn't very advanced. It still (as of v2.0) does not have online compaction (although compaction got better in v2.0), it can't compact a fragmented database and still serve queries. Mongo has a long way to go before it's capable of what you're suggesting, sadly.
The best you can do at this point is to make sure you write the data in order so that chunks will be written sequentially. It probably helps if you create all chunks beforehand too, so that data will not be moved around by the balancer. Of course this is only possible if you have all your data in advance, and that seems unlikely.
Disclaimer: I work at Tokutek
So my new question is: why not organize chunks in MongoDB physically to enable vertical as well as horizontal scalability?
This is exactly what is done in TokuMX, a replacement server for MongoDB. TokuMX uses Fractal Tree indexes which have high write throughput and compression, so instead of storing data in a heap, data is clustered with the index. By default, the shard key is clustered, so it does exactly what you suggest, it organizes the chunks physically, by ensuring all documents are ordered by the shard key on disk. This makes range queries on the shard key fast, just like on any clustered index.