MongoDB aggregation performance capability

I am trying to work through some performance considerations about using MongoDB for a considerable number of documents to be used in a variety of aggregations.
I have read that a collection has a 32 TB capacity, depending on chunk size and shard key values.
If I have 65,000 customers who each send us (on average) 350 sales transactions per day, that ends up being about 22,750,000 documents created daily. By a sales transaction I mean an object like an invoice, with a header and line items. Each document averages about 2.60 KB.
I also receive some other data from these same customers, such as account balances and products from a catalogue. I estimate about 1,000 product records are active at any one time.
Based upon the above, I estimate roughly 8,392,475,000 (8.4 billion) documents in a single year, with a total of 20,145,450,000 KB (18.76 TB) of data stored in a single collection.
Against a MongoDB collection capacity of 32 TB (34,359,738,368 KB), I believe it would be at 58.63% of capacity.
I want to understand how this will perform for different aggregation queries running on it. I want to create a set of staged pipeline aggregations which write to a different collection that is then used as source data for business insights analysis.
Across 8.4 billion transactional documents, I aim to create this aggregated data in a different collection via a set of individual services, each of which outputs using $out to avoid any issues with the 16 MB document size limit for a single result set.
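For concreteness, one of those staged pipelines might look roughly like the sketch below (the collection and field names are just placeholders I'm using for illustration):

// one staged pre-aggregation: daily revenue per customer, written to its own collection
db.transactions.aggregate([
  // restrict to a single day's documents first so the rest of the pipeline stays small
  { $match: { transactionDate: { $gte: ISODate("2023-01-01T00:00:00Z"), $lt: ISODate("2023-01-02T00:00:00Z") } } },
  { $unwind: "$lineItems" },
  { $group: { _id: "$customerId", revenue: { $sum: "$lineItems.amount" }, lineItemCount: { $sum: 1 } } },
  // $out writes each group as its own document in a separate collection,
  // so the 16 MB limit applies per output document rather than to the whole result set
  { $out: "dailyCustomerRevenue" }
], { allowDiskUse: true })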
Am I being overly ambitious here in expecting MongoDB to be able to:
Store that much data in a collection
Aggregate the data and output refreshed results into a separate collection to drive business insights, consumed by services which each provide a discrete aspect of a customer's business
Any feedback welcome; I want to understand where the limit lies in using MongoDB, as opposed to other technologies, for storing and using data at this scale.
Thanks in advance

There is no limit on how big a collection in MongoDB can be (in a replica set or a sharded cluster). I think you are confusing this with the maximum collection size beyond which it can no longer be sharded.
MongoDB Docs: Sharding Operational Restrictions
For the amount of data you are planning to have, it would make sense to go with a sharded cluster from the beginning.
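For example, a sharded deployment could be set up from day one with something along these lines (run against mongos; the database, collection, and shard key names here are just assumptions for illustration):

// enable sharding for the database and shard the transactions collection
sh.enableSharding("salesdb")
// a hashed shard key on customerId spreads the 65,000 customers' writes evenly across shards
sh.shardCollection("salesdb.transactions", { customerId: "hashed" })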

Related

Does Azure Cosmos DB for MongoDB RU consumption depend on collection size?

I am using Azure Cosmos DB for MongoDB. A collection in the DB has 5 million+ documents. Can this cause increased Request Unit (RU) consumption? Will the cost decrease if I remove unwanted documents from the collection?
I am doing read and write queries on this collection.
Please suggest.
The main immediate cost benefit of removing unwanted documents is that you pay less for elements of the bill directly related to storage size.
Ideally you want to be using a TTL index to age out unwanted documents, so the system can use "spare" RU to do this automatically.
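For example, with the RU-based API for MongoDB, a TTL index can be created on the system _ts field; something like the following (the 30-day expiry and collection name are just examples) lets Cosmos DB remove old documents in the background:

// age out documents roughly 30 days after their last update, using spare RU
db.myCollection.createIndex({ "_ts": 1 }, { expireAfterSeconds: 2592000 })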
I saw good drops in continuous backup and data storage costs after recently doing a bit of a purge of these myself.
The effects on RU consumption are more limited. If you are currently at the minimum level the system allows you to scale to, and the limiting factor for you is the "current storage in GB * 10 RU/s" minimum, then reducing storage size may allow you to scale down.
Apart from that, if these documents are just sitting in the collection and never returned by any queries, and you aren't performing operations that aggregate over all items in the collection or running queries without supporting indexes, then their existence won't really be impacting your RU consumption.

When writing a single document to GCP Firestore, are you billed the same amount regardless of document size?

I'm deciding on a NoSQL database. I've noticed a surprising difference between AWS billing and GCP billing for their flagship NoSQL products.
AWS DynamoDB charges $1.25/million "WRUs," or Write Request Units. 1 WRU is billed for storing a document up to 1 KB in size. If you write a document that is larger than 1 KB, DynamoDB bills additional WRUs.
GCP Firestore charges $1.8/million "Document Writes." No mention is made of document size limitations, outside the limits page, which says that each document can be up to 1 MiB in size.
So, if I'm thinking about this correctly, if I stored 1 million 4KiB documents in DynamoDB, it would cost me 4 million WRUs, which adds up to $5. If I did the same in Firestore, it would only cost me 1 million writes, which is $1.8.
If I write 1 million 400KiB documents to DynamoDB, it would cost 400 million WRUs, which adds up to $500. But this same operation in Firestore would still be 1 million writes, which is still only $1.8.
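As a sanity check on that arithmetic (prices as quoted above; this is just back-of-the-envelope JavaScript, not either vendor's pricing API):

// DynamoDB: 1 WRU per (started) KB written; Firestore: 1 write per document regardless of size
const docs = 1000000;
const dynamoPricePerWru = 1.25 / 1000000;      // $1.25 per million WRUs
const firestorePricePerWrite = 1.8 / 1000000;  // $1.80 per million document writes

function dynamoWriteCost(docSizeKb) {
  return docs * Math.ceil(docSizeKb) * dynamoPricePerWru;
}

dynamoWriteCost(4);              // 5    -> $5 for 1M 4 KiB documents
dynamoWriteCost(400);            // 500  -> $500 for 1M 400 KiB documents
docs * firestorePricePerWrite;   // 1.8  -> $1.80 either way (documents stay under the 1 MiB limit)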
I'm surprised by this. This large disparity in price is, in my experience, not common for cloud compute platforms.
When writing to GCP Firestore, are you billed the same amount regardless of document size?
Firestore charges for:
Storage data
Document write operations
Document read operations
Bandwidth consumed by read operations, aka Network egress
There is (as you noticed) no charge for the size of document writes, which indeed leads to favorable results in the comparison you make with writing relatively large documents.
In my experience most write operations performed from client-side application code result in documents that are quite small (a few KB at most) though, so you'll want to validate your expected document size first and then compare again.

MongoDB performance - how many databases, collections?

I am looking to use MongoDB to store time-series data. For the sake of discussion, imagine I have a finite number of sensors deployed (e.g. 10, 100, or 1,000 sensors). Each sensor has a dozen "metrics" (e.g. temperature, humidity, etc.) which are collected every minute and then stored.
There is a front end which then displays charts for each sensor, or aggregates over selected intervals.
What is the best approach, performance wise, to store this? Specifically:
performance-wise, does it matter if I use a single database or more? I could create one DB for each sensor or just use a single huge DB for everything.
performance-wise, does it matter if I partition the data by sensor or by metric?
performance-wise, should I make a collection just for the sensor info and then collections for data, or merge the two in the same collection?
Thanks a lot
Approach 1(A): Create a single database for everything (with a single collection).
Pros:
Less maintenance: backups, creating database users, restores, etc.
Cons:
You may see a database-level lock while creating indexes on a large database.
To query a specific sensor's data, you need additional indexes so you can fetch only that sensor's documents from the shared collection (see the index sketch after this list).
You are limited to at most 64 indexes on a single collection, although needing that many would be a questionable indexing strategy anyway.
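A minimal sketch of approach 1(A), with illustrative field names: one readings collection shared by every sensor, plus a compound index so sensor-specific queries don't scan everything.

// one document per sensor/metric/minute in a single shared collection
db.readings.insertOne({
  sensorId: "sensor-0042",
  metric: "temperature",
  value: 21.7,
  ts: ISODate("2020-01-01T00:00:00Z")
})
// compound index so queries for one sensor (and optionally one metric/time range) stay targeted
db.readings.createIndex({ sensorId: 1, metric: 1, ts: 1 })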
Approach 1(B): Create a single database for everything (with one collection per sensor).
Pros:
Less maintenance: backups, creating database users, restores, etc.
Minimizes the need for indexes to pick out sensor-specific data from one monolithic collection.
Every sensor-specific query targets a single collection, so it does not need to pull as large a working set into memory as a single large collection would.
Building an index on a relatively small collection is more feasible than on one large collection in a single DB.
Cons:
You may end up with too many indexes overall (the total across all collections).
More maintenance is required for a large number of indexes.
Internally, WiredTiger creates one file per collection and one per index. If your use case grows to a large number of sensors, you may run into the 64K open-file limit.
Performance-wise, does it matter if I partition the data by sensor or by metric?
This depends on the access patterns expected from your analytics app.
Performance-wise, should I make a collection just for the sensor info and then collections for data, or just merge the two in the same collection?
Creating separate collections for sensor metadata and sensor data may be worthwhile. It avoids duplicating the sensor metadata in every collected data point.
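A rough sketch of that split, with illustrative names: metadata lives once per sensor, and each reading only carries a reference to it.

// sensor metadata, stored once per sensor
db.sensors.insertOne({ _id: "sensor-0042", model: "TH-200", location: "warehouse roof" })
// readings reference the sensor by _id instead of embedding its metadata
db.readings.insertOne({ sensorId: "sensor-0042", metric: "humidity", value: 48.2, ts: ISODate("2020-01-01T00:01:00Z") })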
You may like to read Williams' blog post on designing this pattern.
As always, it's better to design a sample schema and test your queries within your test environment.

Is it worth splitting one collection into many in MongoDB to speed up querying records?

I have a query on a collection where I filter by one field. I thought I could speed up the query if, based on that field, I split the data into many separate collections whose names contain the field value I previously filtered on. In practice I could then drop the filter from the query, because I would only need to pick the right collection and return its documents as the response. But this way documents are stored redundantly: a document that was previously stored only once might now be stored in several collections. Is this approach worth following? I use Heroku as the cloud provider, so by increasing the number of dynos it is easy to serve more user requests. As I understand it, read operations in MongoDB are executed in parallel and locking occurs at the document level. Is it possible to gain any advantage by increasing redundancy? Of course, an index exists on that field.
If it's still within the same server, I believe there may be little parallelization gain (from the database side) in doing it this way, because for a single server, it matters little how your document is logically structured.
All the server cares about is how many collections and indexes you have, since it stores those collections and associated indexes in a number of files. It will need to load these files as the collections are accessed.
What could potentially be an issue is if you have a massive number of collections as a result, where you could hit the open file limit. Note that the open file limit is also shared with connections, so with a lot of collections, you're indirectly reducing the number of possible connections.
For illustration, let's say you have a big collection with e.g. 5 indexes on it. The WiredTiger storage engine stores the collection as:
1 file containing the collection data
1 file containing the _id index
5 files containing the 5 secondary indexes
Total = 7 files.
Now say you split this one collection across e.g. 100 collections. Assuming each of those collections also requires 5 secondary indexes, in total they will need 700 files in WiredTiger (vs. the original 7). This may or may not be desirable from your ops point of view.
If you require more parallelization because you're hitting some ops limit, then sharding is the recommended method. Sharding the busy collection across many different shards (servers) will immediately give you better parallelization vs. a single server/replica set, given a properly chosen shard key designed to maximize parallelism.
Having said that, sharding also requires more infrastructure and may complicate your backup/restore process. It will also require considerable planning and testing to ensure your design is optimal for your use case, and will scale well into the future.

Using nested document structure in mongodb

I am planning to use a nested document structure for my MongoDB schema design, as I don't want to go for a flat schema design; in my case I will need to fetch my result in one query only.
However, MongoDB has a size limit for a document.
MongoDB Limits and Thresholds
A MongoDB document has a size limit of 16 MB. If your subcollection can grow without limits, go flat.
I don't need to fetch my nested data; I only need it for filtering and querying purposes.
I want to know whether I will still be bound by MongoDB's size limit even if I use my embedded data only for querying and filtering and never fetch the nested data, because as per my understanding, in this case MongoDB won't load the complete document into memory but only the selected fields.
Nested schema design example
{
  clinicName: "XYZ Hospital",
  clinicAddress: "ABC place.",
  doctorsWorking: {
    doctorId1: {
      doctorJoined: ISODate("2017-03-15T10:47:47.647Z")
    },
    doctorId2: {
      doctorJoined: ISODate("2017-04-15T10:47:47.647Z")
    },
    doctorId3: {
      doctorJoined: ISODate("2017-05-15T10:47:47.647Z")
    }
    // ... up to 30,000-40,000 more entries, say
  }
}
I don't think your understanding is correct when you say that "MongoDB won't load the complete document into memory but only the selected fields".
If we look at the MongoDB docs, they read:
The maximum BSON document size is 16 megabytes. The maximum document size helps ensure that a single document cannot use excessive amount of RAM or, during transmission, excessive amount of bandwidth. To store documents larger than the maximum size, MongoDB provides the GridFS API.
So there is a clear 16 MB limit on document size, and MongoDB will stop you from saving any document greater than this size.
Suppose, for the sake of argument, that your understanding were correct and MongoDB allowed saving a document of any size but refused to load more than 16 MB into RAM. While storing the data, the server has no idea what queries will later be run against it, so you could end up inserting huge documents that could never be used (we don't declare a query pattern at insert time, and someone could later try to fetch the full document in a single read).
If the limit were only on transmission (hypothetically), developers could simply stream data into RAM in chunks and never cross the 16 MB limit (that is how I/O on large files is done), which would render the limit meaningless. I assume MongoDB's creators knew this and did not want that to happen.
Also, if the limit were on transmission, there would be no need for separate collections: we could put everything in a single collection, write smart queries, and if the fetched data exceeded 16 MB, simply fetch it in parts. But it doesn't work that way.
So the limit must be on document size itself; otherwise it would create all sorts of issues.
In my opinion, if you just need the "doctorsWorking" data for filtering or querying purposes (and if you also think that "doctorsWorking" will cause the document to cross the 16 MB limit), then it's better to keep it in a separate collection.
Ultimately, everything depends on your query and data patterns. If a doctor can work shifts at multiple hospitals, it is better to keep doctors in a separate collection.
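As a rough sketch of that separate-collection layout (field and collection names are only illustrative): the clinic document stays small, and each doctor-clinic assignment is its own document, so the list can grow without approaching the 16 MB limit.

// clinic document without the embedded doctors map
db.clinics.insertOne({ _id: "clinic1", clinicName: "XYZ Hospital", clinicAddress: "ABC place." })
// one small document per doctor-clinic assignment
db.doctorsWorking.insertOne({
  clinicId: "clinic1",
  doctorId: "doctorId1",
  doctorJoined: ISODate("2017-03-15T10:47:47.647Z")
})
// supports filtering/querying by clinic and doctor without loading the clinic document
db.doctorsWorking.createIndex({ clinicId: 1, doctorId: 1 })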