Mongo TTL vs Capped collections for efficiency - mongodb

I'm inserting data into a collection to store user history (about 100 items / second), and querying the last hour of data using the aggregation framework (once a minute).
In order to keep my collection optimal, I'm considering two possible options:
Make a standard collection with a TTL index on the creation date
Make a capped collection and query the last hour of data.
Which would be the more efficient solution, i.e. less demanding on the mongo boxes in terms of I/O, memory usage, CPU, etc.? (I currently have 1 primary and 1 secondary, with a few hidden nodes, in case that makes a difference.)
(I'm OK with adding a bit of a buffer to my capped collection so it stores 3-4 hours of data on average, and with not getting the full hour of data if users become very busy at certain times.)

Using a capped collection will be more efficient. Capped collections preserve insertion order by not allowing documents to be deleted or updated in ways that increase their size, so new documents can always be appended at the current end of the collection. This makes insertion simpler and more efficient than with a standard collection.
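For reference, a capped collection for this workload could be created along these lines (a sketch; the collection name, size, and max are placeholders you would tune to your actual document sizes):

// mongosh sketch: the collection name and numbers are assumptions, not taken from the question.
db.createCollection("userHistory", {
  capped: true,
  size: 1024 * 1024 * 1024,   // maximum size in bytes (here 1 GB)
  max: 1500000                // optionally also cap the document count (~4 hours at 100 docs/sec)
});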
A TTL index means maintaining an additional index on the TTL field, which must be updated on every insert; that is an additional slowdown on writes (this point is of course irrelevant if you would also add an index on the timestamp when using a capped collection). Also, the TTL is enforced by a background job which runs at regular intervals and consumes resources. The job is low-priority and MongoDB is allowed to delay it when there are higher-priority tasks to do, so you cannot rely on the TTL being enforced precisely. When exact accuracy of the time window matters, you will have to include the time interval in your query even when you have a TTL set.
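As a sketch of the TTL option (assuming a createdAt Date field, which is a guess at your schema), including that explicit time filter:

// mongosh sketch: collection and field names are assumptions.
db.userHistory.createIndex({ createdAt: 1 }, { expireAfterSeconds: 3600 });

// The TTL monitor only runs periodically (roughly every 60 seconds) and may lag behind,
// so the aggregation still bounds the window to the last hour explicitly:
db.userHistory.aggregate([
  { $match: { createdAt: { $gte: new Date(Date.now() - 60 * 60 * 1000) } } },
  { $group: { _id: "$userId", events: { $sum: 1 } } }   // hypothetical grouping stage
]);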
The big drawback of capped collections is that it is hard to anticipate how large they really need to be. If your application scales up and you receive many more, or much larger, documents than anticipated, you will start losing data earlier than planned. You should generally only use capped collections for cases where losing older documents prematurely is not a big deal.
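To make the sizing trade-off concrete (the average document size here is an assumption): at 100 inserts per second, a 4-hour buffer is 100 × 3600 × 4 = 1,440,000 documents; at roughly 500 bytes per document that is about 720 MB, so you would cap the collection with some headroom above that figure, and a burst of larger or more frequent writes then shrinks the retained time window rather than growing the collection.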

Related

MongoDB Index - Performance considerations on collections with high write-to-read ratio

I have a MongoDB collection with around 100k inserts every day. Each document consumes around 1 MB of space and has a lot of elements.
The data is mainly stored for analytics purposes and read a couple of times each day. I want to speed up the queries by adding indexes to a few fields which are usually used for filtering, but stumbled across this statement in the MongoDB documentation:
Adding an index has some negative performance impact for write operations. For collections with high write-to-read ratio, indexes are expensive since each insert must also update any indexes.
source
I was wondering whether this should bother me, with 100k inserts vs. a couple of big read operations, and whether it would add a lot of overhead to the insert operations.
If yes, should I separate reads from writes into separate collections and duplicate the data, or are there other solutions for this?

Is it safe to use capped collections as a way to manage space?

Basically, I want to create a chat-based system, and one of the features is to provide a longer chat history per membership level. I don't envision allowing a collection larger than 1 GB; even that seems like overkill. However, keeping them small should also mean I don't need to worry about sharding them.
Basically each 'chat' would be a capped collection. The expectation is that if they reach the storage limit the older items would drop, which is how capped collections work. So it seems to me that creating a capped collection for each chat would be an easy way to accomplish this goal. I would just use a stored id as the collection name so I can access it.
Is there a reason I shouldn't consider this approach?
Sounds like your data is logically split by chatId. It's not clear to me whether the scope of a chatId is per user, per chat or per "membership level" so I'll just refer to chatId in this answer.
This data could be stored in a single collection with an index on chatId, allowing you to easily discriminate between each distinct chat when finding, deleting, etc. As the size of that collection grows you might reach the point where it cannot support your desired non-functional requirements, at which point sharding would be suggested. Of course, you might never reach that point, and a simple single-collection approach with sensible indexing, hosted on hardware with sufficient CPU, RAM etc., might meet your needs. Without knowing anything about your volumes (current and future), write throughput, desired elapsed times for typical reads etc., it's hard to say what will happen.
However, from your question it seems like an eventual need for sharding would be likely and in a bid to preempt that you are considering capping your data footprint.
It is possible to implement a cap per chatId when using a single collection (whether sharded or not); this would require something which (see the sketch after this list):
Can calculate the storage footprint per chatId
For each chatId which exceeds the allowed cap, delete the oldest entries until the storage footprint is <= the allowed cap.
This could be triggered on a schedule or by a 'collection write events' listener.
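A minimal sketch of that clean-up pass (mongosh, assuming a single chatMessages collection with chatId and createdAt fields, and MongoDB 4.4+ for the $bsonSize operator; all names are illustrative):

// Enforce a per-chatId cap of maxBytes.
const maxBytes = 1024 * 1024 * 1024;  // 1 GB per chat, as in the question

db.chatMessages.aggregate([
  { $group: { _id: "$chatId", bytes: { $sum: { $bsonSize: "$$ROOT" } } } },
  { $match: { bytes: { $gt: maxBytes } } }
]).forEach(function (chat) {
  let excess = chat.bytes - maxBytes;
  const idsToDelete = [];
  // Walk this chat's messages oldest-first, collecting ids until enough bytes would be freed.
  db.chatMessages.aggregate([
    { $match: { chatId: chat._id } },
    { $sort: { createdAt: 1 } },
    { $project: { size: { $bsonSize: "$$ROOT" } } }
  ]).forEach(function (doc) {
    if (excess <= 0) return;
    idsToDelete.push(doc._id);
    excess -= doc.size;
  });
  db.chatMessages.deleteMany({ _id: { $in: idsToDelete } });
});

Run on a schedule (or triggered by a write-events listener), this keeps each chat under the cap without needing one collection per chat.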
Of course, using capped collections to limit the footprint is essentially asking MongoDB to do this for you, so it's simpler, but there are some issues with that approach:
It might be easier to reason about and manage a system with a single collection than it is to manage a system with a large number (thousands?) of collections
Capped collections will ensure a maximum size per collection but if you cannot cap the number of discrete chatIds then you might still end up in a situation where sharding is required
Capped collections are not really a substitute for sharding; sharding is not just about splitting data into logical pieces, that data is also split across multiple hosts, thereby scaling horizontally. Multiple capped collections would exist on the same Mongo node, so capping will limit your footprint but it will not scale out your processing power or spread your storage needs across multiple hosts
Unless you are using the WiredTiger storage engine (on MongoDB v 3.x) the maximum number of collections per database is ~24000 (see the docs)
There are limitations to capped collections e.g.
If an update or a replacement operation changes the document size, the operation will fail.
You cannot delete documents from a capped collection
etc
So, in summary ...
If the number of discrete chatIds is in the low hundreds then the potential maximum size of your database is manageable and the total collection count is manageable. In this case, the use of capped collections would offer a nice trade off; it prevents the need for sharding with no loss of functionality.
However, if the number of discrete chatIds is in the thousands, or if there is no practical cap on how many there can be, or if their number forces you to apply a miserly cap on each, then you'll eventually find yourself having to consider sharding. If this scenario is at all likely then I would suggest starting as simple as possible: use a single collection and only move from that as/when the non-functional requirements demand it. By "move from that" I mean something like starting with a manual deletion process and, if that becomes ineffective, then considering sharding.

Updating large number of records in a collection

I have a collection called TimeSheet with a few thousand records now. It will eventually grow to about 300 million records in a year. In this collection I embed a few fields from another collection called Department, which mostly won't get any updates; only rarely will some records be updated. By rarely I mean only once or twice a year, and not all records, only less than 1% of the records in the collection.
Mostly, once a department is created there won't be any updates; even if there is an update, it will be done early on (when there are not many related records in TimeSheet).
Now if someone updates a department after a year, in a worst-case scenario the TimeSheet collection will have about 300 million records in total and about 5 million matching records for the department being updated. The update query condition will be on an indexed field.
Since this update is time-consuming and creates locks, I'm wondering whether there is a better way to do it. One option I'm considering is running the update query in batches by adding an extra condition like UpdatedDateTime > somedate && UpdatedDateTime < somedate.
Other details:
A single document size could be about 3 or 4 KB
We have a replica set containing three replicas.
Is there any better way to do this? What do you think about this kind of design? And what if the numbers I've given were smaller, like below?
1) 100 million total records and 100,000 matching records for the update query
2) 10 million total records and 10,000 matching records for the update query
3) 1 million total records and 1000 matching records for the update query
Note: the collection names department and timesheet, and their purpose, are fictional (not the real collections), but the statistics I have given are real.
Let me give you a couple of hints based on my global knowledge and experience:
Use shorter field names
MongoDB stores the field names inside every document. This repetition increases disk usage, which can cause performance issues on a very large database like yours (see the comparison after the lists below).
Pros:
Smaller documents, and therefore less disk space
More documents fit in RAM (better caching)
Index sizes will be smaller in some scenarios
Cons:
Less readable names
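To make that concrete, a sketch with made-up field names for a timesheet entry; the keys are repeated in every one of the hundreds of millions of documents, so the difference adds up:

// Long keys:
db.timesheet.insertOne({ employeeIdentifier: 1042, departmentName: "Accounts", hoursWorked: 7.5 });
// Short keys, same data, tens of bytes smaller per document:
db.timesheet.insertOne({ eid: 1042, dep: "Accounts", hrs: 7.5 });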
Optimize on index size
The smaller the index, the more of it fits in RAM and the fewer index misses occur. Consider the SHA-1 hashes of git commits, for example: a commit is often identified by its first 5-6 characters, so you can simply store those 5-6 characters instead of the whole hash.
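A small sketch of that idea (collection and field names are illustrative):

// Index only a short prefix of the full SHA-1 so the index entries stay small.
const fullHash = "2fd4e1c67a2d28fced849ee1bb76e7391b93eb12";
db.commits.insertOne({ shortHash: fullHash.substring(0, 10), fullHash: fullHash });
db.commits.createIndex({ shortHash: 1 });

Since a prefix can collide, keep the full value in an unindexed field (as above) and verify it after the index lookup.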
Understand padding factor
Updates that grow a document can cause a costly document move: the old record is deleted, the document is rewritten at a new, larger location, and the indexes pointing to it must be updated.
We need to make sure documents don't move when updates happen. For each collection there is a padding factor which tells MongoDB, during document insert, how much extra space to allocate beyond the actual document size.
You can see the collection padding factor using:
db.collection.stats().paddingFactor
Add a padding manually
In your case you are pretty sure to start with small documents that will grow, and updating them after a while will cause multiple document moves. So it is better to add padding to the documents. Unfortunately, there is no easy way to add padding: we can do it by adding some throw-away bytes to some key at insert time and then deleting that key in the next update.
Finally, if you are sure that some keys will be added to the documents in the future, preallocate those keys with default values so that later updates don't grow the document and cause document moves.
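A sketch of both tricks, with made-up field names:

// Manual padding: insert with a throw-away field full of filler bytes...
db.timesheet.insertOne({ eid: 1042, hrs: 7.5, _padding: "x".repeat(512) });
// ...then drop it in the first real update, leaving free space inside the record for growth.
db.timesheet.updateOne({ eid: 1042 }, { $set: { hrs: 8 }, $unset: { _padding: "" } });

// Preallocation: create keys you know will arrive later with default values up front.
db.timesheet.insertOne({ eid: 1043, hrs: 0, approvedBy: null, approvedAt: null });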
You can get details about the queries that caused document moves (when profiling is enabled):
db.system.profile.find({ moved: { $exists : true } })
Large number of collections vs. a large number of documents in a few collections
Schema design depends on the application's requirements. If there is a huge collection in which we query only the latest N days of data, then we can optionally choose to have separate collections, and old data can be safely archived. This will make sure that caching in RAM is done properly.
Every collection you create incurs an ongoing cost beyond the act of creating it. Each collection has a minimum size of a few KB plus one index (8 KB), and every collection has a namespace associated with it; by default there are some 24K namespaces. For example, having a collection per user is a bad choice since it is not scalable: after some point Mongo won't allow us to create new collections or indexes.
Generally, having many collections has no significant performance penalty. For example, we can choose to have one collection per month if we know that we always query based on months.
Denormalization of data
It's always recommended to keep all the related data for a query, or a sequence of queries, in the same disk location. To achieve this you sometimes need to duplicate information across different documents. For example, in a blog you'll want to store a post's comments within the post document (see the sketch after the list below).
Pros:
index size will be much smaller, as the number of index entries is lower
queries will be very fast, since a single fetch returns all the necessary details
document size will be comparable to the page size, which means that when we bring this data into RAM we are mostly not dragging in unrelated data on the same page
a document move frees a whole page, not a tiny chunk of a page that may never be reused by later inserts
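For instance, the blog-post example could look like this (a sketch with made-up field names):

// One document holds the post and its comments, so a single read fetches everything.
db.posts.insertOne({
  title: "Denormalization in MongoDB",
  body: "post body...",
  comments: [
    { author: "alice", text: "Nice summary", at: new Date() },
    { author: "bob", text: "What about unbounded arrays?", at: new Date() }
  ]
});

// Adding a comment later is a single in-place update on the same document.
db.posts.updateOne(
  { title: "Denormalization in MongoDB" },
  { $push: { comments: { author: "carol", text: "Thanks!", at: new Date() } } }
);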
Capped Collections
Capped collections behave like circular buffers. They are a special type of fixed-size collection that can sustain very high-speed writes and sequential reads. Being fixed size, once the allocated space is filled, new documents are written by deleting the oldest ones. However, document updates are only allowed if the updated document fits within the original document size (play with padding for more flexibility).

MongoDB fast deletion best approach

My application currently use MySQL. In order to support very fast deletion, I organize my data in partitions, according to timestamp. Then when data becomes obsolete, I just drop the whole partition.
It works great, and cleaning up my DB doesn't harm my application performance.
I want to replace MySQL with MongoDB, and I'm wondering whether there's something similar in MongoDB, or whether I would just need to delete the records one by one (which, I'm afraid, will be really slow, keep my DB busy, and slow down query response times).
In MongoDB, if your requirement is to delete data to limit the collection size, you should use a capped collection.
On the other hand, if your requirement is to delete data based on a timestamp, then a TTL index might be exactly what you're looking for.
From official doc regarding capped collections:
Capped collections automatically remove the oldest documents in the collection without requiring scripts or explicit remove operations.
And regarding TTL indexes:
Implemented as a special index type, TTL collections make it possible to store data in MongoDB and have the mongod automatically remove data after a specified period of time.
I thought, even though I am late and an answer has already been accepted, I would add a little more.
The problem with capped collections is that they normally reside on a single shard in a cluster. Even though, in later versions of MongoDB, capped collections are shardable, they normally are not. Added to this, a capped collection MUST be allocated up front, so if you wish to keep a long history before clearing the data, you might find your collection uses up significantly more space than it should.
TTL is a good answer; however, it is not as fast as drop(). TTL is basically MongoDB doing, server-side, the same thing you would do in your application: judging when a row is historical and deleting it. If done excessively it will have a detrimental effect on performance. Not only that, but it isn't good at freeing up space to your $freelists, which is key to stopping fragmentation in MongoDB.
drop()ing a collection will literally just "drop" the collection on the spot, instantly and gracefully giving that space back to MongoDB (not the OS), with absolutely no fragmentation whatsoever. Not only that, but the operation is a lot faster, 90% of the time, than most other alternatives.
So I would stick by my comment:
You could factor the data into time series collections based on how long it takes for data to become historical, then just drop() the collection
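A minimal sketch of that pattern (hourly buckets; the naming scheme and field names are made up):

// Write into a collection named after the current hour, e.g. "events_2015061710".
function bucketName(date) {
  return "events_" + date.toISOString().slice(0, 13).replace(/[-T]/g, "");
}
db.getCollection(bucketName(new Date())).insertOne({ msg: "example event", at: new Date() });

// Cleanup: once a bucket falls outside the retention window, drop it outright
// (in practice you would loop over every expired bucket, not just one).
db.getCollection(bucketName(new Date(Date.now() - 24 * 60 * 60 * 1000))).drop();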
Edit
As @Zaid pointed out, even with the _id field, capped collections are not shardable.
One solution to this is using TokuMX which supports partitioning:
https://www.percona.com/blog/2014/05/29/introducing-partitioned-collections-for-mongodb-applications/
Advantages over capped collections: capped collections use a fixed amount of space (even when you don't have that much data) and they can't be resized on the fly. Partitioned collections' usage depends on the data; you can add and remove partitions (for newly inserted data) as you see fit.
Advantages over TTL: TTL is slow; it just takes care of removing old data automatically. Partitions are fast: removing data is basically just a file removal.
HOWEVER: after the acquisition by Percona, development of TokuMX appears to have stopped (I would love to be corrected on this point). Unfortunately MongoDB doesn't support this functionality, and with TokuMX on its way out it looks like we will be stranded without a proper solution.

Atomic counters Postgres vs MongoDB

I'm building a very large counter system. To be clear, the system is counting the number of times a domain occurs in a stream of data (that's about 50 - 100 million elements in size).
The system will individually process each element and make a database request to increment a counter for that domain and the date it is processed on. Here's the structure:
stats_table (or collection)
-----------
id
domain (string)
date (date, YYYY-MM-DD)
count (integer)
My initial inkling was to use MongoDB because of its atomic counter feature. However, as I thought about it more, I figured Postgres updates already occur atomically (at least that's what this question leads me to believe).
My question is this: is there any benefit of using one database over the other here? Assuming that I'll be processing around 5 million domains a day, what are the key things I need to be considering here?
All single operations in Postgres are automatically wrapped in transactions, and all operations on a single document in MongoDB are atomic. Atomicity isn't really a reason to prefer one database over the other in this case.
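For what it's worth, the MongoDB side of that counter is a single atomic upsert (the collection name here is an assumption; the fields follow your structure), and Postgres 9.5+ can do the equivalent with INSERT ... ON CONFLICT ... DO UPDATE:

// Atomically create-or-increment the counter for a (domain, date) pair.
db.stats.updateOne(
  { domain: "example.com", date: "2015-06-17" },
  { $inc: { count: 1 } },
  { upsert: true }
);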
While the individual counts may get quite high, if you're only storing aggregate counts and not each instance of a count, the total number of records should not be too significant. Even if you're tracking millions of domains, either Mongo or Postgres will work equally well.
MongoDB is a good solution for logging events, but I find Postgres to be preferable if you want to do a lot of interesting, relational analysis on the analytics data you're collecting. To do so efficiently in Mongo often requires a high degree of denormalization, so I'd think more about how you plan to use the data in the future.