Does a MongoDB collection's document count affect performance? - mongodb

I'm not sure why I can't find any information on this, but I'd like to find out what performance consequences might exist when a MongoDB collection has a huge number of documents. I see a lot of answers about the size of documents, but not the document count.
I have a collection of small documents (about 600b each). There are about 2.3m documents in this collection at the time of this writing. They're indexed on two fields each.
I'm concerned about the scalability of this collection. Depending on how many users sign up, this collection could theoretically hit 875+ billion documents.
Will this impact query performance or the index?

875B documents at 600b each will definitely give you scaling challenges. 2.3M shouldn't be a problem on even modest hardware.
As for "I see a lot of answers about the size of documents, but not the document count": I would think about this in terms of the total collection size (which the document count is a major factor in) to get an idea of its scalability. A higher collection size means more RAM is required to hold the indexes. MongoDB tries to keep indexes in RAM, which makes sense because indexes perform much better when they're in RAM.
Not sure whether you meant 600b as in bits or bytes, but at 875 billion documents that's either roughly 65 TB or 525 TB. Even with an index or two on only one field each, those are going to be large indexes that are difficult to fit in memory. Actual performance will probably depend entirely on your query patterns: whether you keep accessing the same documents (faster, since they stay in cache) or whether queries are spread relatively evenly across the documents (slower, since more memory is needed to perform well).
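If you want a quick way to watch this as the collection grows, you can compare index size against available RAM from the shell. A minimal sketch, assuming a placeholder collection named users:
// Total size of all indexes on the collection, in bytes
db.users.totalIndexSize()
// Per-index breakdown plus data size, from collection stats
var s = db.users.stats()
printjson(s.indexSizes)
print("data: " + s.size + " bytes across " + s.count + " documents")
Once totalIndexSize() approaches the RAM you can dedicate to the server, index lookups start paging and latency becomes unpredictable.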

Related

Is it worth splitting one collection into many in MongoDB to speed up querying records?

I have a query against a collection that filters on one field. I thought I could speed the query up by splitting the data into many separate collections based on that field, with each collection's name encoding the field value I previously filtered on. That way I could drop the filter from the query entirely: I would just pick the right collection and return all of its documents as the response. But this stores documents redundantly; a document that used to be stored only once might now live in several collections. Is this approach worth following? I use Heroku as my cloud provider, and by increasing the number of dynos it is easy to serve more user requests. As far as I know, read operations in MongoDB are highly concurrent and executed in parallel, and locking occurs at the document level. Is it possible to gain any advantage by increasing redundancy? Of course an index exists on that field.
If it's still within the same server, I believe there may be little parallelization gain (from the database side) in doing it this way, because for a single server, it matters little how your document is logically structured.
All the server cares about is how many collections and indexes you have, since it stores those collections and their associated indexes in a number of files. It will need to load these files as the collections are accessed.
What could become an issue is if you end up with a massive number of collections as a result, since you could hit the open file limit. Note that the open file limit is also shared with connections, so with a lot of collections you're indirectly reducing the number of possible connections.
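You can check how much connection headroom you currently have from the shell; a minimal sketch using serverStatus:
// Current vs. available connections on this mongod
var conn = db.serverStatus().connections
print("current: " + conn.current + ", available: " + conn.available)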
For illustration, let's say you have a big collection with e.g. 5 secondary indexes on it. The WiredTiger storage engine stores the collection as:
1 file containing the collection data
1 file containing the _id index
5 files containing the 5 secondary indexes
Total = 7 files.
Now you split this one collection across e.g. 100 collections. Assuming each collection also requires 5 secondary indexes, in total they will need 700 files in WiredTiger (vs. the original 7). This may or may not be desirable from your ops point of view.
If you require more parallelization because you're hitting some ops limit, then sharding is the recommended method. Sharding the busy collection across many different shards (servers) will immediately give you better parallelization vs. a single server/replica set, given a properly chosen shard key designed to maximize parallelization.
Having said that, sharding also requires more infrastructure and may complicate your backup/restore process. It will also require considerable planning and testing to ensure your design is optimal for your use case, and will scale well into the future.
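For reference, the sharding itself is only a couple of shell commands once a sharded cluster is up. A sketch with placeholder database, collection, and shard key names; a hashed key is one common choice for spreading load evenly:
// Enable sharding for the database, then shard the busy collection
sh.enableSharding("mydb")
// A hashed shard key distributes documents evenly across shards
sh.shardCollection("mydb.busyCollection", { fieldYouFilterOn: "hashed" })
The hard part is not these commands but choosing the shard key, which is what the planning and testing above is for.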

Using nested document structure in mongodb

I am planning to use a nested document structure for my MongoDB schema design, since I don't want to go for a flat schema design: in my case I will need to fetch my result in one query only.
But MongoDB has a size limit for a document. From MongoDB Limits and Thresholds:
A MongoDB document has a size limit of 16 MB (an amount of data). If your subcollection can grow without limits, go flat.
I don't need to fetch my nested data; I will only need it for filtering and querying purposes.
I want to know whether I will still be bound by MongoDB's size limit even if I use my embedded data only for querying and filtering and never fetch the nested data, because as per my understanding, in this case MongoDB won't load the complete document into memory but only the selected fields?
Nested schema design example:
{
    clinicName: "XYZ Hospital",
    clinicAddress: "ABC place.",
    doctorsWorking: {
        "doctorId1": {
            "doctorJoined": ISODate("2017-03-15T10:47:47.647Z")
        },
        "doctorId2": {
            "doctorJoined": ISODate("2017-04-15T10:47:47.647Z")
        },
        "doctorId3": {
            "doctorJoined": ISODate("2017-05-15T10:47:47.647Z")
        },
        ...
        // up to 30,000-40,000 more entries, suppose
    }
}
I don't think your understanding is correct when you say "because as per my understanding, in this case, MongoDB won't load the complete document in memory but only the selected fields?".
If we look at the MongoDB documentation, it reads:
The maximum BSON document size is 16 megabytes. The maximum document size helps ensure that a single document cannot use excessive amount of RAM or, during transmission, excessive amount of bandwidth. To store documents larger than the maximum size, MongoDB provides the GridFS API.
So the limit is clearly on document size: MongoDB will stop you from saving any document that is greater than this size.
Suppose for a moment that your understanding were correct: the server would let you save a document of any size, but would refuse to hold more than 16 MB of it in RAM. On the other hand, while storing the data it can't know what queries will later be run against it, so you could end up inserting big documents that can't be used later (we don't declare the query pattern at insert time, and we can always try to fetch the full document in a single shot).
If the limit were on transmission (hypothetically speaking), there are lots of ways developers could bring data into RAM in chunks and never cross the 16 MB limit (that's how I/O on large files is done anyway). That would render the limit useless, and I expect MongoDB's creators knew this and didn't want it to happen.
Also, if the limit were on transmission, there would be no need for separate collections at all. We could put everything in a single collection, write smart queries, and fetch the data; if the fetched data crossed 16 MB, we would fetch it in parts and forget about the limit. But it doesn't work that way.
So the limit must be on document size, otherwise it would create all of these issues.
In my opinion, if you only need the "doctorsWorking" data for filtering or querying purposes (and if you think "doctorsWorking" may push the document past the 16 MB limit), then it's better to keep it in a separate collection.
Ultimately everything depends on your query and data patterns. If a doctor can serve in multiple hospitals in shifts, then it would be better to keep doctors in a separate collection anyway.
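As a sketch of what that separate collection could look like (collection and field names here are invented to match the example above), store one document per doctor-per-clinic assignment, so the clinic document can never outgrow the 16 MB limit:
// One document per doctor working at a clinic
db.doctorsWorking.insertOne({
    clinicId: "clinic123",    // reference back to the clinic document
    doctorId: "doctorId1",
    doctorJoined: ISODate("2017-03-15T10:47:47.647Z")
})
// Supports the filtering/querying the question describes
db.doctorsWorking.createIndex({ clinicId: 1, doctorId: 1 })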

Performance trade-offs of breaking up a document or keeping it whole

My question is about performance optimization with respect to data structure in MongoDB/NoSQL.
I have a collection with very large documents. I will need to iterate through the entire collection for data analytics several times per minute.
Assumptions:
- The number of documents will be < 10,000
- only a small portion of the document is used for the number crunching
- the documents will not change often
My question is: would I significantly boost performance by creating a cache collection with only the fields needed for the number crunching? Doing so would add the overhead of maintaining that cache collection.
I guess it depends on whether the documents fit in memory (i.e. whether your RAM is large enough to cache them).
If not, a cache collection will boost performance significantly.
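One hedged way to build and refresh such a cache collection is an aggregation that projects only the crunching fields and writes the result out. A sketch with invented field and collection names; note that $out replaces the target collection on each run:
// Rebuild the cache from the large source collection
db.bigDocs.aggregate([
    { $project: { metricA: 1, metricB: 1 } },  // keep only the analytics fields
    { $out: "analyticsCache" }                 // overwrite the cache collection
])
// The analytics job then iterates the small cache instead
db.analyticsCache.find()
Alternatively, just passing a projection to find() (e.g. db.bigDocs.find({}, { metricA: 1, metricB: 1 })) cuts network transfer, but the server still pulls whole documents into its cache, so a true cache collection helps more when RAM is the bottleneck.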

Optimizing for random reads

First of all, I am using MongoDB 3.0 with the new WiredTiger storage engine. Also using snappy for compression.
The use case I am trying to understand and optimize for from a technical point of view is the following;
I have a fairly large collection, with about 500 million documents that take about 180 GB including indexes.
Example document:
{
    _id: 123234,
    type: "Car",
    color: "Blue",
    description: "bla bla"
}
Queries consist of finding documents with a specific field value, like so:
thing.find( { type: "Car" } )
In this example the type field should obviously be indexed. So far so good. However, the access pattern for this data will be completely random: at a given time I have no idea what range of documents will be accessed. I only know that they will be queried on indexed fields, returning at most 100,000 documents at a time.
What this means in my mind is that the caching in MongoDB/WiredTiger is pretty much useless. The only thing that needs to fit in the cache are the indexes. An estimation of the working set is hard if not impossible?
What I am looking for is mostly tips on what kinds of indexes to use and how to configure MongoDB for this kind of use case. Would other databases work better?
Currently I find MongoDB to work quite well on somewhat limited hardware (16 GB RAM, non-SSD disk). Queries return in decent time, and obviously instantly if the result set is already in the cache. But as already stated, this will most likely not be the typical case. It is not critical that the queries are lightning fast, more so that they are dependable and that the database runs in a stable manner.
EDIT:
Guess I left out some important things. The database will be mostly for archival purposes. As such, data arrives from another source in bulk, say once a day. Updates will be very rare.
The example I used was a bit contrived but in essence that is what queries look like. When I mentioned multiple indexes I meant the type and color fields in that example. So documents will be queried on using these fields. As it is now, we only care about returning all documents that have a specific type, color etc. Naturally, the plan we have is to only query on fields that we have an index for. So ad-hoc queries are off the table.
Right now the index sizes are quite manageable. For the 500 million documents each of these indexes are about 2.5GB and fit easily in RAM.
Regarding the average data size of an operation, I can only speculate at this point. As far as I know, typical operations return about 20k documents, with an average object size in the range of 1200 bytes. This is the stat reported by db.stats(), so I guess it is for the compressed data on disk, and not how much it actually takes up once in RAM.
Hope this bit of extra info helped!
Basically, if you have a consistent rate of reads that are uniformly random over type (which is how I read "at a given time I have no idea what range of documents will be accessed"), then you will see stable performance from the database. It will serve some stable proportion of reads from cache, just by good luck, and another stable proportion by reading from disk, especially if the number and size of documents are about the same between different type values. I don't think there's a special index or anything else to help you besides just better hardware. Indexes should remain in RAM because they'll be in constant use.
I suppose more information would help, as you mention only one simple query on type but then talk about having multiple indexes to worry about keeping in RAM. How much data does the average operation return? Do you ever care to return a subset of docs of certain type or only all of them? What do inserts and updates to this collection look like?
Also, if the documents being read are truly completely random over the dataset, then the working set is all of the data.
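If the queries do combine fields (e.g. type and color), a compound index is the standard tool; a minimal sketch using the names from the question:
// Serves find({ type: "Car" }) as well as find({ type: "Car", color: "Blue" })
db.thing.createIndex({ type: 1, color: 1 })
// A covered query: if the projection only includes indexed fields,
// MongoDB can answer from the index alone without loading documents
db.thing.find({ type: "Car", color: "Blue" }, { _id: 0, type: 1, color: 1 })
Covered queries are particularly attractive here, since the whole point is that the documents themselves won't fit in cache.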

MongoDB -- large number of documents

This is related to my last question.
We have an app where we are storing large amounts of data per user. Because of the nature of the data, we previously decided to create a new database for each user. This would have required a large no. of databases (probably millions), and as someone pointed out in a comment, that indicated a wrong design.
So we changed the design and now we are thinking about storing each user's entire information in one collection. This means one collection exactly maps to one user. Since there are 12,000 collections available per database, we can store 12,000 users per DB (and this limit could be increased).
But now my question is: is there any limit on the no. of documents a collection can have? Because of the way we need to store data per user, we expect to have a huge (tens of millions in extreme cases) no. of documents per collection. Is that OK for MongoDB, and design-wise?
EDIT
Thanks for the answers. I guess then it's OK to use a large no. of documents per collection.
The app is a specialized inventory control system. Each user has a large no. of little pieces of information related to them. Each piece of information has a category and some related stuff under that category. Moreover, no two collections need to see each other's data, hence an index that touches more than one collection is not needed.
To adjust the number of collections/indexes you can have (~24k namespaces is the limit; ~12k is the figure usually quoted for collections because each has the _id index by default, and keep in mind that any additional indexes on your collections use up namespaces as well), you can use the --nssize option when you start up mongod.
There are plenty of implementations around with billions of documents in a collection (and I'm sure there are several with trillions), so "tens of millions" should be fine. There are some numbers such as counts returned that have constraints of 64 bits, so after you hit 2^64 documents you might find some issues.
What sort of query and update load are you going to be looking at?
Your design still doesn't make much sense. Why store each user in a separate collection?
What indexes do you have on the data? If you are indexing by some field that has content that's common across all the users you'll get a significant saving in total index size by having a single collection with one index.
Index size is often the limiting factor not total database size when it comes to performance.
Why do you have so many documents per user? How large are they?
Craigslist put 2+ billion documents in MongoDB so that shouldn't be an issue if you have the hardware to support it and aren't being inefficient with your indexes.
If you posted more of your schema here you'd probably get better advice.
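To make the single-collection suggestion concrete, here is a hedged sketch (collection and field names are invented for illustration): keep every user's items in one collection, tagged with the owner, and let one shared compound index serve them all:
// One collection for all users; each item carries its owner
db.items.insertOne({
    userId: "user42",
    category: "spare-parts",
    info: { }    // the piece of information under that category
})
// One shared index instead of thousands of per-collection indexes
db.items.createIndex({ userId: 1, category: 1 })
// One user's items in one category
db.items.find({ userId: "user42", category: "spare-parts" })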