Mongo find() to return all documents is slow for certain documents - find

I'm experiencing an issue when using PyMongo to iterate over all documents in a particular collection. The loop needs to scan about 450k documents, and it is nearly instant on almost every document except for a handful where a single iteration takes 10-90 seconds.
for testscriptexec in testscriptexecs.find({}, {"tsExecId": 1,"involvedOrgs": 1, "qualifiedName": 1, "endTime": 1, "status": 1}):
I'm trying to figure out what is slowing down the Cursor on certain documents. I determined that the long delays always occur on the same documents.
I compared the JSON export for a slow document and compared it to a fast one and I do not see anything that should be slowing down the indexed search on _id. The documents are not particularly large and the fields that I'm actually pulling are exactly the same size.
The collection has an index on _id, as well as a few other indices that are not relevant to this code.
What are some things that could be causing this query to hang on certain iterations of a find by ID?

These questions are always a bit subjective, but one thought is MongoDB returns data in batches, so that could explain what you are seeing.
You could rule this in or out by tweaking the batch_size parameter on your find() https://pymongo.readthedocs.io/en/stable/api/pymongo/cursor.html#pymongo.cursor.Cursor.batch_size

Related

MongoDB $all optimization of tag-based query

A non-distributed database has many posts, posts have zero or more user-defined tags, most posts have the most_posts_have_this tag, few posts have the few_posts_have_this tag.
When querying {'tags': {'$all': ['most_posts_have_this', 'few_posts_have_this']}} the query is slow, it seems to be iterating through posts with the most_posts_have_this tag.
Is there some way to hint to MongoDB that it should be iterating through posts with the few_posts_have_this tag instead?
Is there some way to hint to MongoDB that it should be iterating through posts with the few_posts_have_this tag instead?
Short answer is no, this is due to how Mongo builds an index on an array:
To index a field that holds an array value, MongoDB creates an index key for each element in the array
So when you when you query the tags field imagine mongo queries each tag separately then it does an intersection.
If you run "explain" you will be able to see that after the index scan phase Mongo executes a fetch document phase, this phase in theory should be redundant for an pure index scan which shows this is not the case. So basically Mongo fetches ALL documents that have either of the tags, only then it performs the "$all" logic in the filtering phase.
So what can you do?
if you have prior knowledge on which tag is sparser you could first query that and only then filter based on the larger tag, I'm assuming this is not really the case but worth considering if possible. If your tags are somewhat static maybe you can precalculate this even.
Otherwise you will have to reconsider a restructuring that will allow better index usage for this usecase, I will say for most access patterns your structure is better.
The new structure can be an object like so:
tags2: {
tagname1: 1,
tagname2: 2,
...
}
Now if you built an index on tags2 each key of the object will be indexed separately, this will make mongo skip the "fetch" phase as the index contains all the information needed to execute the following query:
{"tags2.most_posts_have_this" :{$exists: true}, "tags2.few_posts_have_this": {$exists: true}}
I understand both solutions are underwhelming to say the least, but sadly Mongo does not excel in this specific use case.. I can think of more "hacky" approaches but I would say these 2 are the more reasonable ones to actually consider implementing depending on performance requirments.
Is there some way to hint to MongoDB that it should be iterating through posts with the few_posts_have_this tag instead?
Not really. When Mongo runs an $all it is going to get all records with both tags first. You could try using two $in queries in an aggregation instead, selecting the less frequent tag first. I'm not sure if this would actually be faster (depends on how Mongo optimizes things) but could be worth a try.
The best you can do:
Make sure you have an an index on the tags field. I see in the comments you have done this.
Mongo may be using the wrong index for this query. You can see which it is using with cursor.explain(). You can force it to use your tags index with hint(). First use db.collection.getIndexes() to make sure your tags index shows up as expected in the list of indexes.
Using projections to return only the fields you need might speed things up. For example, depending on your use case, you might return just post IDs and then query full text for a smaller subset of the returned posts. This could speed things up because Mongo doesn't have to manage as much intermediate data.
You could also consider periodically sorting the tags array field by frequency. If the least frequent tags are first, Mongo may be able to skip further scanning for that document. It will still fetch all the matching documents, but if your tag lists are very large it could save time by skipping the later tags. See The ESR (Equality, Sort, Range) Rule for more details on optimizing your indexed fields.
If all that's still not fast enough and the performance of these queries is critical, you'll need to do something more drastic:
Upgrade your machine (ensure it has enough RAM to store your whole dataset, or at least your indexes, in memory)
Try sharding
Revisit your data model. The fastest possible result will be if you can turn this query into a covered query. This may or may not be possible on an array field.
See Mongo's optimizing query performance for more detail, but again, it is unlikely to help with this use case.

MongoDB hanging count query on collection with large objects

I have a collection with 10.000 objects. Each object's size is around 500kb since they include images in them. For statistics, I need to count objects with their creation time. Even though I have indexes, counting the whole collection takes more than 15 seconds. When I remove the image field (i.e the object becomes a simple JSON object), the query immediately returns. I do not understand why size of the objects affects performance this much. Here is a sample query I have been using:
const aggregation = [
{"$match": {"createTime": {"$gte": "2019-01-01T00:00:00.000Z"}}},
{"$match": {"createTime": {"$lte": "2020-01-01T23:59:59.999Z"}}},
{"$count": "value"}];
myCollection.aggregate(aggregation).then(foo);
Is there a way to make the query faster?
One solution I could think of is to store images in a separate collection. This will definitely make the query faster but I am wondering the reason behind this performance drop.
500KB * 10000 documents is 5.1GB to examine. That might take a few seconds, especially if your cache is smaller than that.
Try doing this with a count query instead.
Assuming there is an index on createTime, and no document in the collection contains an array for that field (i.e. the index is not multikey), this query should be able to be fully covered.
This means that they query executor should use a COUNTSCAN stage to find the number of matching documents by scanning the index, and never need to look at a single document, which means document size no longer matters, and it should cut down on your disk IO, cache churn, and CPU utilization as well.
db.myCollection.count({"createTime": {"$gte": "2019-01-01T00:00:00.000Z"},"createTime": {"$lte": "2020-01-01T23:59:59.999Z"}})`

How to correctly build indexes on MongoDB when every field will be searchable and sortable?

I am designing a MongoDB collection that will have 50 million documents and every field in the document will be searchable and sortable. The searching and sorting logics will be sent from the frontend so could be a lot of fields searchings and sorting combinations. I've made some tests and concluded that when there is searching and sorting only in indexed fields the query runs very fast, but when searching or sorting non-indexed fields the query runs very slow.
Considering that will have a lot of possible searching/sorting combinations, how can I build indexes in this collection in this case to get a better performance?
Indexing comes at a cost of extra memory space and possible increased execution time of database write(insert and update) operations. However, like you rightly pointed out, indexing makes database reads(and sorting) super fast.
Creating indexes is easy and straight forward, however, you need to consider the tradeoffs, most times, this is usually the read-write ration of the fields in your documents.
If you frequently read(or sort) documents from a very large collection(like the 50million examples you mentioned), it makes a lot of sense to add indexing to all the fields you use to identify(or sort) your documents, you just need to ensure you don't run out of memory space in the DB. Not indexing the fields would be very frustrating, just imagine if you need to get the last document by a field that is not indexed, you would have to search through 49,999,999 documents to find it.
I hope this helps.

How to improve terrible MongoDB query performance when aggregating with arrays

I have a data schema consisting of many updates (hundreds of thousands+ per entity) that are assigned to entities. I'm representing this with a single top-level document for each of the entities and an array of updates under each of them. The schema for those top-level documents looks like this:
{
"entity_id": "uuid",
"updates": [
{ "timestamp": Date(...), "value": 10 },
{ "timestamp": Date(...), "value": 11 }
]
}
I'm trying to create a query that returns the number of entities that have received an update within the past n hours. All updates in the updates array are guaranteed to be sorted by virtue of the manner in which they're updated by my application. I've created the following aggregation to do this:
db.getCollection('updates').aggregate([
{"$project": {last_update: {"$arrayElemAt": ["$updates", -1]}}},
{"$replaceRoot": {newRoot: "$last_update"}},
{"$match": {timestamp: {"$gte": new Date(...)}}},
{"$count": "count"}
])
For some reason that I don't understand, the query I just pasted takes an absurd amount of time to complete. It exhausts the 15-second timeout on the client I use, as a matter of fact.
From a time complexity point of view, this query looks incredibly cheap (which is part of the way I designed this schema that way I did). It looks to be linear with respect to the total number of top-level documents in the collection which are then filtered down, of which there are less than 10,000.
The confusing part is that it doesn't seem to be the $project step which is expensive. If I run that one alone, the query completes in under 2 seconds. However, just adding the $match step makes it time out and shows large amounts of CPU and IO usage on the server the database is running on. My best guess is that it's doing some operations on the full update array for some reason, which makes no sense since the first step explicitly limits it to only the last element.
Is there any way I can improve the performance of this aggregation? Does having all of the updates in a single array like this somehow cause Mongo to not be able to create optimal queries even if the array access patterns are efficient themselves?
Would it be better to do what I was doing previously and store each update as a top-level document tagged with the id of its parent entity? This is what I was doing previously, but performance was quite bad and I figured I'd try this schema instead in an effort to improve it. So far, the experience has been the opposite of what I was expecting/hoping for.
Use indexing, it will enhance the performance of your query.
https://docs.mongodb.com/manual/indexes/
For that use the mongo compass to check which index is used most then one by one index them to improve the performance of it.
After that fetch on the fields which you require in the end, with projection in aggregation.
I hope this might solve your issue. But i would suggest that go for indexing first. Its a huge PLUS in case of large data fetching.
You need to support your query with an index and simplify it as much as possible.
You're querying against the timestamp field of the first element of the updates field, so add an index for that:
db.updates.createIndex({'updates.0.timestamp': 1})
You're just looking for a count, so get that directly:
db.updates.count({'updates.0.timestamp': {$gte: new Date(...)}})

Mongodb Index or not to index

quick question on whether to index or not. There are frequent queries to a collection that looks for a specific 'user_id' within an array of a doc. See below -
_id:"bQddff44SF9SC99xRu",
participants:
[
{
type:"client",
user_id:"mi7x5Yphuiiyevf5",
screen_name:"Bob",
active:false
},
{
type:"agent",
user_id:"rgcy6hXT6hJSr8czX",
screen_name:"Harry",
active:false
}
]
}
Would it be a good idea to add an index to 'participants.user_id'? The array is added to frequently and occasionally items are removed.
Update
I've added the index after testing locally with the same set of data and this certainly seems to have decreased the high CPU usage on the mongo process. As there are only a small number of updates to these documents I think it was the right move. I'm looking at more possible indexes and optimisation now.
Why do you want to index? Do you have significant latency problems when querying? Or are you trying to optimise in advance?
Ultimately there are lots of variables here which make it hard to answer. Including but not limited to:
how often is the query made
how many documents in the collection
how many users are in each document
how often you add/remove users from the document after the document is inserted.
do you need to optimise inserts/updates to the collection
It may be that indexing isn't the answer, but rather how you have structured you data.