Does Azure Cosmos DB for MongoDB RU consumption depend on collection size?

I am using Azure Cosmos DB for MongoDB. A collection in the DB has 5 million+ documents. Can this cause increased Request Unit consumption? Will the cost decrease if I remove unwanted documents from the collection?
I am doing read and write queries on this collection.
Please suggest.

The main immediate cost benefit of removing unwanted documents is that you pay less for elements of the bill directly related to storage size.
Ideally you want to be using a TTL index to age out unwanted documents, so the service can use "spare" RU to do the deletes automatically.
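For example, a 30-day TTL could be set up roughly like this (a sketch only: the collection name and expiry value are placeholders, and on Azure Cosmos DB for MongoDB the collection-wide TTL is typically created on the system _ts field):

db.myCollection.createIndex(
    { "_ts": 1 },
    { expireAfterSeconds: 2592000 }   // 30 days; older documents are purged in the background
)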
I saw good drops in continuous backup and data storage costs after recently doing a bit of a purge of these myself.
The effects on RU consumption are more limited. If you are already at the minimum level the system allows you to scale down to, and the limiting factor for you is the "current storage in GB * 10 RU/s" floor, then reducing storage size may allow you to scale down further.
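As a worked example (taking the 10 RU/s-per-GB floor quoted above at face value): with 200 GB of storage that floor would be 200 * 10 = 2,000 RU/s, and purging half the data would bring it down to 1,000 RU/s, allowing a lower provisioned throughput.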
Apart from that, if these documents are just sitting in the collection, are never returned by any queries, and you aren't aggregating over every item in the collection or running queries without supporting indexes, then their existence won't really be impacting your RU consumption.

Related

MongoDB Atlas performance / collection and index limits

I am building a multi-tenant app, where each tenant will have its own database with its own collections. All databases are stored in the same M10 cluster.
For now, a tenant represents around 56 collections and 208 indexes.
I have seen there is a recommended maximum for an M10 cluster of 5000 collections and indexes (https://www.mongodb.com/docs/atlas/reference/atlas-limits/).
So if my understanding is correct, an M10 cluster suits at most 18 tenants (5000 / (56 + 208) = 18.93).
The documentation says "The performance of a cluster might degrade if it serves a large number of collections and indexes." Has anyone tried to exceed this limit? How big is the performance degradation?
Apart from any hard limit on the number of collections and indexes you can have, the performance impact of a large number of collections and indexes comes from having too many data handles open. Not only that, maintenance also becomes a nightmare.
On top of that, a large number of indexes will adversely impact write operations in those collections, and the indexes will either continuously occupy space in memory or, if memory is insufficient, lead to continuous eviction and reloading of indexes from and into memory. To learn more, see the official documentation on the WiredTiger internal cache.
In conclusion, having more than 3500 indexes (for 18 tenants) and ~1000 collections will have a serious adverse impact on the performance of your overall cluster. You can monitor this via the Cache Activity metric, among others. Since separate databases on the same cluster are only a logical separation anyway, you're advised to implement multi-tenancy via a tenant_id field within shared collections, instead of having different collections (and databases) for different tenants.
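A minimal sketch of that tenant_id approach (collection and field names here are illustrative, not prescriptive):

// one shared collection, every document carries its tenant
db.orders.createIndex({ tenantId: 1, createdAt: -1 })

// every query is scoped to a single tenant and can use the index prefix
db.orders.find({ tenantId: "tenant_42", createdAt: { $gte: ISODate("2024-01-01") } })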

What is the exact limit on the number of collections in a MongoDB database based on the WiredTiger (MongoDB 4.0) engine as of today (May 2019)?

I'm trying to put a forum-like structure in a MongoDB 4.0 database: multiple threads under the same "topic", each thread consisting of a bunch of posts. There are usually no limits on the numbers of threads and posts. I want to take full advantage of NoSQL features and grab the list of posts under any specified thread in one go, without having to scan and match "thread_id" and "post_id" in an RDBMS table the traditional way. So my idea is to make every thread its own collection, with the code-generated thread_id as the collection name, and to store all the posts of that thread as ordinary documents in that collection, so that accessing a post looks like:
forum_db【database name】.thread_id【collection name】.post_id【document ID】
But my concern remains, despite the rather vague statement at https://docs.mongodb.com/manual/reference/limits/#data:
Number of Collections in a Database
Changed in version 3.0.
For the MMAPv1 storage engine, the maximum number of collections in a database is a function of the size of the namespace file and the number of indexes of collections in the database.
The WiredTiger storage engine is not subject to this limitation.
Is it safe to do it this way in terms of performance and scalability? Can we safely assume that there is no limit on the number of collections in a WiredTiger database (MongoDB 4.0+) today, just as there is practically no limit on the number of documents in a collection? Many thanks in advance.
To calculate how many collections one can store in a MongoDB database, you need to figure out the number of indexes in each collection.
The WiredTiger engine keeps an open file handle for each used collection (and each of its indexes). A large number of open file handles can cause extremely long checkpoint operations.
Furthermore, each of those handles takes roughly 22 KB of memory outside the WiredTiger cache; this means that just for keeping the files open, the mongod process will need about NUM_OF_FILE_HANDLES * 22 KB of RAM.
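To put that in perspective with the ~22 KB-per-handle estimate above (an approximation, not an exact figure): 5,000 collections with 3 indexes each means roughly 5,000 * 4 = 20,000 open handles, i.e. about 20,000 * 22 KB ≈ 430 MB of RAM consumed before the WiredTiger cache itself is even counted.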
High memory swapping will lead to a decrease in performance.
As you probably understand from the above, different hardware (RAM size & Disk speed) will behave differently.
From my point of view, you first need to understand the behavior of your application and then calculate the required hardware for your MongoDB database server.

Is it safe to use capped collections as a way to manage space?

Basically, I want to create a chat-based system, and one of the features is to provide a longer chat history at higher membership levels. I don't envision allowing a collection larger than 1 GB; even that seems overkill. However, keeping them small should also mean I don't need to worry about sharding them.
Each 'chat' would be a capped collection. The expectation is that once a chat reaches its storage limit the older items would drop off, which is how capped collections work. So it seems to me that creating a capped collection for each chat would be an easy way to accomplish this goal; I would just store an id and use it as the collection name so I can access it.
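For reference, what I have in mind per chat is roughly this (the collection name and the 1 GB size are just placeholders):

db.createCollection("chat_5f1d3c2a", {
    capped: true,
    size: 1073741824   // 1 GB cap; once full, the oldest documents are overwritten
})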
Is there a reason I shouldn't consider this approach?
Sounds like your data is logically split by chatId. It's not clear to me whether the scope of a chatId is per user, per chat or per "membership level" so I'll just refer to chatId in this answer.
This data could be stored in a single collection with an index on chatId, allowing you to easily discriminate between distinct chats when finding, deleting, etc. As the size of that collection grows you might reach the point where it cannot support your desired non-functional requirements, at which point sharding would be suggested. Of course, you might never reach that point, and a simple single-collection approach with sensible indexing, hosted on hardware with sufficient CPU, RAM, etc., might meet your needs. Without knowing anything about your volumes (current and future), write throughput, desired elapsed times for typical reads, etc., it's hard to say what will happen.
However, from your question it seems like an eventual need for sharding would be likely and in a bid to preempt that you are considering capping your data footprint.
It is possible to implement a cap per chatId when using a single collection (whether sharded or not); this would require something which:
Can calculate the storage footprint per chatId
Can, for each chatId which exceeds the allowed cap, delete the oldest entries in a loop until the storage footprint is <= the allowed cap
This could be triggered on a schedule or by a 'collection write events' listener; a rough sketch follows below.
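A minimal sketch of such a clean-up job, assuming MongoDB 4.4+ (for $bsonSize), illustrative field names chatId/createdAt, and a 1 GB cap per chat:

const CAP_BYTES = 1024 * 1024 * 1024;  // 1 GB per chatId (illustrative)

// storage footprint of one chat, measured as the sum of its documents' BSON sizes
function chatBytes(chatId) {
    const res = db.messages.aggregate([
        { $match: { chatId: chatId } },
        { $group: { _id: null, bytes: { $sum: { $bsonSize: "$$ROOT" } } } }
    ]).toArray();
    return res.length ? res[0].bytes : 0;
}

db.messages.distinct("chatId").forEach(function (chatId) {
    // delete the oldest messages in batches until the chat is back under its cap
    while (chatBytes(chatId) > CAP_BYTES) {
        const oldestIds = db.messages.find({ chatId: chatId }, { _id: 1 })
                                     .sort({ createdAt: 1 })
                                     .limit(100)
                                     .toArray()
                                     .map(d => d._id);
        db.messages.deleteMany({ _id: { $in: oldestIds } });
    }
});

Run on a schedule (or triggered by a change stream listener), this gives the same "oldest items drop off" behaviour without one collection per chat.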
Of course, using capped collections to limit the footprint is essentially asking MongoDB to do this for you, so it's simpler, but there are some issues with that approach:
It might be easier to reason about and manage a system with a single collection than it is to manage a system with a large number (thousands?) of collections
Capped collections will ensure a maximum size per collection but if you cannot cap the number of discrete chatIds then you might still end up in a situation where sharding is required
Capped collections are not really a substitute for sharding. Sharding is not just about splitting data into logical pieces; that data is also split across multiple hosts, thereby scaling horizontally. Multiple capped collections would all exist on the same MongoDB node, so capping will limit your footprint but it will not scale out your processing power or spread your storage needs across multiple hosts
Unless you are using the WiredTiger storage engine (on MongoDB 3.x), the maximum number of collections per database is ~24000 (see the docs)
There are limitations to capped collections, e.g.:
If an update or a replacement operation changes the document size, the operation will fail.
You cannot delete documents from a capped collection
etc
So, in summary ...
If the number of discrete chatIds is in the low hundreds then the potential maximum size of your database is manageable and the total collection count is manageable. In this case, the use of capped collections would offer a nice trade off; it prevents the need for sharding with no loss of functionality.
However, if the number of discrete chatIds is in the thousands, or there is no possible cap on the number of discrete chatIds, or the number of discrete chatIds forces you to apply a miserly cap on each, then you'll eventually find yourself having to consider sharding. If this scenario is at all likely then I would suggest starting as simple as possible: use a single collection and only move away from that as and when the non-functional requirements demand it. By "move away from that" I mean start off with a manual deletion process and, if that becomes ineffective, then consider sharding.

Using nested document structure in mongodb

I am planning to use a nested document structure for my MongoDB schema design; I don't want to go for a flat schema design because, in my case, I need to fetch my result in one query only.
However, MongoDB has a size limit for a document.
MongoDB Limits and Thresholds
A MongoDB document has a size limit of 16 MB (an amount of data). If your subcollection can grow without limits, go flat.
I don't need to fetch my nested data; I will only need it for filtering and querying purposes.
I want to know whether I will still be bound by MongoDB's size limits even if I use my embedded data only for querying and filtering and never fetch the nested data, because as per my understanding, in this case MongoDB won't load the complete document into memory but only the selected fields.
Nested schema design example
{
    clinicName: "XYZ Hospital",
    clinicAddress: "ABC place.",
    "doctorsWorking": {
        "doctorId1": {
            "doctorJoined": ISODate("2017-03-15T10:47:47.647Z")
        },
        "doctorId2": {
            "doctorJoined": ISODate("2017-04-15T10:47:47.647Z")
        },
        "doctorId3": {
            "doctorJoined": ISODate("2017-05-15T10:47:47.647Z")
        },
        ...
        // up to 30000-40000 more entries, suppose
    }
}
I don't think your understanding is correct when you say "because as per my understanding, in this case, MongoDB won't load the complete document in memory but only the selected fields?".
If we look at the MongoDB docs, they read:
The maximum BSON document size is 16 megabytes. The maximum document size helps ensure that a single document cannot use excessive amount of RAM or, during transmission, excessive amount of bandwidth. To store documents larger than the maximum size, MongoDB provides the GridFS API.
So the clear limit is 16 MB on document size; MongoDB will stop you from saving any document greater than this size.
Suppose, for the sake of argument, that your understanding were correct: documents of any size could be saved, but more than 16 MB could never be loaded into RAM. At write time, however, MongoDB cannot know what queries will later be run on that data, so you could end up inserting huge documents that can never be used afterwards (we don't declare a query pattern at insert time, and someone may later try to fetch the full document in a single shot).
If the limit were only on transmission (hypothetically speaking), there are lots of ways developers could bring the data into RAM in chunks and never cross the 16 MB limit at any one time (that's how I/O on large files is done anyway). They would simply work around the limit and render it useless; I expect MongoDB's creators knew this and didn't want it to happen.
Also, if the limit were on transmission, there would be no need for separate collections at all. We could put everything in a single collection, write smart queries, and fetch the data; if the fetched data crossed 16 MB, we would just fetch it in parts and forget the limit. But it doesn't work this way.
So the limit must be on document size, otherwise it would create all of these issues.
In my opinion, if you only need the "doctorsWorking" data for filtering or querying (and you also think "doctorsWorking" will push the document past the 16 MB limit), then it's better to keep it in a separate collection.
Ultimately everything depends on your query and data patterns. If a doctor can serve at multiple hospitals in shifts, it would be better to keep doctors in a separate collection.
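For illustration only, a separate-collection layout could look like this (collection and field names are assumptions, not taken from the question):

db.doctorsWorking.createIndex({ clinicId: 1, doctorJoined: 1 })

db.doctorsWorking.insertOne({
    clinicId: "clinic_xyz",
    doctorId: "doctorId1",
    doctorJoined: ISODate("2017-03-15T10:47:47.647Z")
})

// filtering never has to touch one huge embedded clinic document
db.doctorsWorking.find({
    clinicId: "clinic_xyz",
    doctorJoined: { $gte: ISODate("2017-01-01T00:00:00Z") }
})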

MongoDB aggregation performance capability

I am trying to work through some performance considerations about using MongoDb for a considerable amount of documents to be used in a variety of aggregations.
I have read that a collection has a 32 TB capacity, depending on the sizes of the chunks and shard key values.
If I have 65,000 customers who each send us (on average) 350 sales transactions per day, that ends up being about 22,750,000 documents created daily. By a sales transaction I mean an object like an invoice, with a header and line items. Each document averages 2.60 KB.
I also have some other data being received from these same customers, like account balances and products from a catalogue. I estimate about 1,000 product records active at any one time.
Based upon the above, I approximate 8,392,475,000 (8.4 billion) documents in a single year, with a total of 20,145,450,000 KB (18.76 TB) of data stored in a collection.
Based upon the capacity of a MongoDB collection of 32 TB (34,359,738,368 KB), I believe it would be at 58.63% of capacity.
I want to understand how this will perform for different aggregation queries running on it. I want to create a set of staged pipeline aggregations which write to a different collection, which is then used as source data for business insights analysis.
Across 8.4 billion transactional documents, I aim to create this aggregated data in a different collection via a set of individual services that output using $out, to avoid any issues with the 16 MB document size limit for a single result set.
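For reference, one of the staged pipelines I have in mind would look roughly like this (collection and field names are placeholders for my real schema):

db.salesTransactions.aggregate([
    { $match: { invoiceDate: { $gte: ISODate("2024-01-01"), $lt: ISODate("2024-02-01") } } },
    { $unwind: "$lineItems" },
    { $group: {
        _id: { customerId: "$customerId", productId: "$lineItems.productId" },
        revenue: { $sum: "$lineItems.amount" },
        quantity: { $sum: "$lineItems.quantity" }
    } },
    // $out writes a whole collection, so the result set is not bound by the 16 MB single-document limit
    { $out: "monthlySalesByCustomerProduct" }
], { allowDiskUse: true })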
Am I being overly ambitious here expecting MongoDB to be able to:
Store that much data in a collection
Aggregate and output the results of refreshed data to drive business insights in a separate collection for consumption by services which provide discrete aspects of a customer's business
Any feedback welcome; I want to understand where the limit is for using MongoDB, as opposed to other technologies, for storing and using this quantity of data.
Thanks in advance
There is no limit on how big a collection in MongoDB can be (in a replica set or a sharded cluster). I think you are confusing this with the maximum collection size beyond which an existing collection can no longer be sharded.
MongoDB Docs: Sharding Operational Restrictions
For the amount of data you are planning to have, it would make sense to go with a sharded cluster from the beginning.
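As a rough sketch, starting sharded could look like this (database name, collection name and shard key are assumptions; the right shard key depends on your query patterns, e.g. a ranged { customerId: 1, invoiceDate: 1 } key may suit per-customer date-range aggregations better than a hashed key):

sh.enableSharding("salesdb")
sh.shardCollection("salesdb.salesTransactions", { customerId: "hashed" })   // hashed key spreads the heavy write load evenly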