MongoDB hanging count query on collection with large objects - mongodb

I have a collection with 10.000 objects. Each object's size is around 500kb since they include images in them. For statistics, I need to count objects with their creation time. Even though I have indexes, counting the whole collection takes more than 15 seconds. When I remove the image field (i.e the object becomes a simple JSON object), the query immediately returns. I do not understand why size of the objects affects performance this much. Here is a sample query I have been using:
const aggregation = [
{"$match": {"createTime": {"$gte": "2019-01-01T00:00:00.000Z"}}},
{"$match": {"createTime": {"$lte": "2020-01-01T23:59:59.999Z"}}},
{"$count": "value"}];
myCollection.aggregate(aggregation).then(foo);
Is there a way to make the query faster?
One solution I could think of is to store images in a separate collection. This will definitely make the query faster but I am wondering the reason behind this performance drop.

500KB * 10000 documents is 5.1GB to examine. That might take a few seconds, especially if your cache is smaller than that.
Try doing this with a count query instead.
Assuming there is an index on createTime, and no document in the collection contains an array for that field (i.e. the index is not multikey), this query should be able to be fully covered.
This means that they query executor should use a COUNTSCAN stage to find the number of matching documents by scanning the index, and never need to look at a single document, which means document size no longer matters, and it should cut down on your disk IO, cache churn, and CPU utilization as well.
db.myCollection.count({"createTime": {"$gte": "2019-01-01T00:00:00.000Z"},"createTime": {"$lte": "2020-01-01T23:59:59.999Z"}})`

Related

Mongo find() to return all documents is slow for certain documents

I'm experiencing an issue when using PyMongo to iterate over all documents in a particular collection. The loop needs to scan about 450k documents, and it is nearly instant on almost every document except for a handful where a single iteration takes 10-90 seconds.
for testscriptexec in testscriptexecs.find({}, {"tsExecId": 1,"involvedOrgs": 1, "qualifiedName": 1, "endTime": 1, "status": 1}):
I'm trying to figure out what is slowing down the Cursor on certain documents. I determined that the long delays always occur on the same documents.
I compared the JSON export for a slow document and compared it to a fast one and I do not see anything that should be slowing down the indexed search on _id. The documents are not particularly large and the fields that I'm actually pulling are exactly the same size.
The collection has an index on _id, as well as a few other indices that are not relevant to this code.
What are some things that could be causing this query to hang on certain iterations of a find by ID?
These questions are always a bit subjective, but one thought is MongoDB returns data in batches, so that could explain what you are seeing.
You could rule this in or out by tweaking the batch_size parameter on your find() https://pymongo.readthedocs.io/en/stable/api/pymongo/cursor.html#pymongo.cursor.Cursor.batch_size

How to improve terrible MongoDB query performance when aggregating with arrays

I have a data schema consisting of many updates (hundreds of thousands+ per entity) that are assigned to entities. I'm representing this with a single top-level document for each of the entities and an array of updates under each of them. The schema for those top-level documents looks like this:
{
"entity_id": "uuid",
"updates": [
{ "timestamp": Date(...), "value": 10 },
{ "timestamp": Date(...), "value": 11 }
]
}
I'm trying to create a query that returns the number of entities that have received an update within the past n hours. All updates in the updates array are guaranteed to be sorted by virtue of the manner in which they're updated by my application. I've created the following aggregation to do this:
db.getCollection('updates').aggregate([
{"$project": {last_update: {"$arrayElemAt": ["$updates", -1]}}},
{"$replaceRoot": {newRoot: "$last_update"}},
{"$match": {timestamp: {"$gte": new Date(...)}}},
{"$count": "count"}
])
For some reason that I don't understand, the query I just pasted takes an absurd amount of time to complete. It exhausts the 15-second timeout on the client I use, as a matter of fact.
From a time complexity point of view, this query looks incredibly cheap (which is part of the way I designed this schema that way I did). It looks to be linear with respect to the total number of top-level documents in the collection which are then filtered down, of which there are less than 10,000.
The confusing part is that it doesn't seem to be the $project step which is expensive. If I run that one alone, the query completes in under 2 seconds. However, just adding the $match step makes it time out and shows large amounts of CPU and IO usage on the server the database is running on. My best guess is that it's doing some operations on the full update array for some reason, which makes no sense since the first step explicitly limits it to only the last element.
Is there any way I can improve the performance of this aggregation? Does having all of the updates in a single array like this somehow cause Mongo to not be able to create optimal queries even if the array access patterns are efficient themselves?
Would it be better to do what I was doing previously and store each update as a top-level document tagged with the id of its parent entity? This is what I was doing previously, but performance was quite bad and I figured I'd try this schema instead in an effort to improve it. So far, the experience has been the opposite of what I was expecting/hoping for.
Use indexing, it will enhance the performance of your query.
https://docs.mongodb.com/manual/indexes/
For that use the mongo compass to check which index is used most then one by one index them to improve the performance of it.
After that fetch on the fields which you require in the end, with projection in aggregation.
I hope this might solve your issue. But i would suggest that go for indexing first. Its a huge PLUS in case of large data fetching.
You need to support your query with an index and simplify it as much as possible.
You're querying against the timestamp field of the first element of the updates field, so add an index for that:
db.updates.createIndex({'updates.0.timestamp': 1})
You're just looking for a count, so get that directly:
db.updates.count({'updates.0.timestamp': {$gte: new Date(...)}})

Updating large number of records in a collection

I have collection called TimeSheet having few thousands records now. This will eventually increase to 300 million records in a year. In this collection I embed few fields from another collection called Department which is mostly won't get any updates and only rarely some records will be updated. By rarely I mean only once or twice in a year and also not all records, only less than 1% of the records in the collection.
Mostly once a department is created there won't any update, even if there is an update, it will be done initially (when there are not many related records in TimeSheet)
Now if someone updates a department after a year, in a worst case scenario there are chances collection TimeSheet will have about 300 million records totally and about 5 million matching records for the department which gets updated. The update query condition will be on a index field.
Since this update is time consuming and creates locks, I'm wondering is there any better way to do it? One option that I'm thinking is run update query in batches by adding extra condition like UpdatedDateTime> somedate && UpdatedDateTime < somedate.
Other details:
A single document size could be about 3 or 4 KB
We have a replica set containing three replicas.
Is there any other better way to do this? What do you think about this kind of design? What do you think if there numbers I given are less like below?
1) 100 million total records and 100,000 matching records for the update query
2) 10 million total records and 10,000 matching records for the update query
3) 1 million total records and 1000 matching records for the update query
Note: The collection names department and timesheet, and their purpose are fictional, not the real collections but the statistics that I have given are true.
Let me give you a couple of hints based on my global knowledge and experience:
Use shorter field names
MongoDB stores the same key for each document. This repetition causes a increased disk space. This can have some performance issue on a very huge database like yours.
Pros:
Less size of the documents, so less disk space
More documennt to fit in RAM (more caching)
Size of the do indexes will be less in some scenario
Cons:
Less readable names
Optimize on index size
The lesser the index size is, the more it gets fit in RAM and less the index miss happens. Consider a SHA1 hash for git commits for example. A git commit is many times represented by first 5-6 characters. Then simply store the 5-6 characters instead of the all hash.
Understand padding factor
For updates happening in the document causing costly document move. This document move causing deleting the old document and updating it to a new empty location and updating the indexes which is costly.
We need to make sure the document don't move if some update happens. For each collection there is a padding factor involved which tells, during document insert, how much extra space to be allocated apart from the actual document size.
You can see the collection padding factor using:
db.collection.stats().paddingFactor
Add a padding manually
In your case you are pretty sure to start with a small document that will grow. Updating your document after while will cause multiple document moves. So better add a padding for the document. Unfortunately, there is no easy way to add a padding. We can do it by adding some random bytes to some key while doing insert and then delete that key in the next update query.
Finally, if you are sure that some keys will come to the documents in the future, then preallocate those keys with some default values so that further updates don't cause growth of document size causing document moves.
You can get details about the query causing document move:
db.system.profile.find({ moved: { $exists : true } })
Large number of collections VS large number of documents in few collection
Schema is something which depends on the application requirements. If there is a huge collection in which we query only latest N days of data, then we can optionally choose to have separate collection and old data can be safely archived. This will make sure that caching in RAM is done properly.
Every collection created incur a cost which is more than cost of creating collection. Each of the collection has a minimum size which is a few KBs + one index (8 KB). Every collection has a namespace associated, by default we have some 24K namespaces. For example, having a collection per User is a bad choice since it is not scalable. After some point Mongo won't allow us to create new collections of indexes.
Generally having many collections has no significant performance penalty. For example, we can choose to have one collection per month, if we know that we are always querying based on months.
Denormalization of data
Its always recommended to keep all the related data for a query or sequence of queries in the same disk location. You something need to duplicate the information across different documents. For example, in a blog post, you'll want to store post's comments within the post document.
Pros:
index size will be very less as number of index entries will be less
query will be very fast which includes fetching all necessary details
document size will be comparable to page size which means when we bring this data in RAM, most of the time we are not bringing other data along the page
document move will make sure that we are freeing a page, not a small tiny chunk in the page which may not be used in further inserts
Capped Collections
Capped collection behave like circular buffers. They are special type of fixed size collections. These collection can receive very high speed writes and sequential reads. Being fixed size, once the allocated space is filled, the new documents are written by deleting the older ones. However document updates are only allowed if the updated document fits the original document size (play with padding for more flexibility).

Does providing a projection argument to find() limit the data that is added to Mongo's working set?

In Mongo, suppose I have a collection mycollection that has fields a, b, and huge. I very frequently want to perform queries, mapreduce, updates, etc. on a, and b and very occassionally want to return huge in query results as well.
I know that db.mycollection.find() will scan the entire collection and result in Mongo attempting to add the whole collection to the working set, which may exceed the amount of RAM I have available.
If I instead call db.mycollection.find({}, { a : 1, b : 1 }), will this still result in the whole collection being added to the working set or only the terms of my projection?
MongoDB can use something called covered queries: http://docs.mongodb.org/manual/applications/indexes/#create-indexes-that-support-covered-queries these allow you to load all the values from the index rather than the disk, or memory, if those documents are in memory at the time.
Be warned that you cannot use covered queries on a full table scan, the condition, projection and sort must all be within the index; i.e.:
db.col.ensureIndex({a:1,b:1});
db.col.find({a:1}, {_id:0, a:1, b:1})(.sort({b:1}));
Would work (the sort is in brackets because it is not totally needed). You can add _id to your index if you intend to return that too.
Map Reduce does not support covered queries, there is no way to project only a certain amount of fields into the MR, as far as I know; maybe there is some hack I do not know of. Map Reduce only supports a $match like operator in terms of input query with a separate parameter for the sort of the incoming query ( http://docs.mongodb.org/manual/applications/map-reduce/ ).
Note that for updates I believe only atomic operations: http://docs.mongodb.org/manual/tutorial/isolate-sequence-of-operations/ (excluding findAndModify) do not load the document into your working set, however, believe is the keyword there.
Considering you need to do both MR and normal find and update on these records I would strongly recommend you look into checking why you are paging in so much data and whether you really do need to do it that often. It seems like you are trying to do too much processing in a short and frequent amount of time.
On the other hand, if this is a script which runs every night or something then I would not worry too much about its excessive working set (i.e. score board recalc script).

skip a mongo capped collection

I have a very large capped collection in mongodb. Given that the capped collection structure is predictable (i.e. sort is predefined, memory footprint is predefined, etc), is there a better way to get a cursor on the LATEST item inserted instead of iterating?
In other words, what I'm doing right now is to get the size of my collection (n), and then create a cursor that sets skip=n-1 to put me at the end of the collection. Then I iterate on the cursor and handle all new additions to the collection.
The problem with this approach is that my collection is huge. lets say 11m records. that takes 20 minutes to skip. Which means that when my cursor starts emitting data, its 20 minutes behind.
Try db.cappedCollection.find().limit(1).sort({$natural:-1}) .
Have you tried indexing the collection and using $gt - this should be faster although the index will have some impact on the speed of the writes to the collection.