I have a MongoDB 4.4 cluster and a database with a collection of 200k documents and 55 indexes for different queries.
The following query:
db.getCollection('tasks').find({
    "customer": "gZuu5ZptDEtC6dq2Z",
    "finished": true,
    "$or": [
        {
            "scoreCalculated": {
                "$exists": true
            }
        },
        {
            "workflowProcessed": {
                "$exists": true
            }
        }
    ]
}).sort({
    "scoreCalculated": -1,
    "workflowProcessed": -1,
    "createdAt": -1
})
is executed in less than 1 second on average.
But if I change the sort direction to
.sort({
    "scoreCalculated": 1,
    "workflowProcessed": 1,
    "createdAt": 1
})
the execution time grows to several seconds (up to 10).
The first explain shows that the apiGetTasks index is used. That index has an ascending sort, so I don't understand why it is not used when I change the sort direction to ascending. Adding the same index with a descending sort doesn't change anything.
Could you please help me to understand why the second query is so slow?
Please share the indexes.
55 indexes are way too many; you should reduce that number because it can harm your performance (every time you want to use an index it has to be loaded into RAM, instead of letting Mongo use that RAM to optimize queries).
Moreover, the total number of examined docs is 210780, which is your entire collection. So you need to rethink how to build efficient indexes that help you optimize queries.
Read the MongoDB indexes docs.
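If it helps with pruning, one way to spot rarely used indexes is the $indexStats aggregation stage, which reports how often each index has been used since the last server restart. A minimal sketch, assuming the tasks collection from the question:
// List every index with its usage count since the last restart;
// indexes with a very low "accesses.ops" value are candidates for removal.
db.getCollection('tasks').aggregate([
    { $indexStats: {} },
    { $project: { name: 1, "accesses.ops": 1, "accesses.since": 1 } },
    { $sort: { "accesses.ops": 1 } }
])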
Related
Hello,
I have a MongoDB collection with around 20k documents. Deleted documents are stored in the collection as well, meaning they are "soft" deleted.
Now, I want to query the documents based on the "status" key. The status can be "open", "closed" or "deleted". I need the records with status "closed".
I see that only 25 documents fulfil my criteria. However, 18k documents are scanned (even after applying an index).
Hence, my query takes around 2 minutes to execute and many a time it times out.
My questions are:
1. Should a query executing over 20k documents take so much time? 20k isn't such a huge count, right?
2. Can someone guide me in optimizing the query further, if it can be?
Pushing the deleted records into a separate archive collection is the last thing I would like to do.
Here is my current query:
db.collectionname.find({$and: [{ $and: [{ status: {$ne: 'open'} },{ status: {$ne: 'deleted'} }] },
{ 'submittedDate': { $gte: new Date("2019-02-01T00:00:00.000Z"), $lte: new Date("2019-02-02T00:00:00.000Z") } }
] })
You must currently have a single-field index, {status: 1}. Replacing it with a compound index {status: 1, submittedDate: 1} will improve your performance. 20k documents is nothing if properly indexed. If you have only 3 statuses, rewrite your query as follows.
db.collectionname.find({status: 'closed',
'submittedDate': { $gte: new Date("2019-02-01T00:00:00.000Z"),
$lte: new Date("2019-02-02T00:00:00.000Z") }})
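For reference, a minimal sketch of creating that compound index and verifying the plan (collection name taken from the question):
// Equality field first, range field second
db.collectionname.createIndex({ status: 1, submittedDate: 1 })
// Check totalKeysExamined / totalDocsExamined with the rewritten query
db.collectionname.find({
    status: 'closed',
    submittedDate: { $gte: new Date("2019-02-01T00:00:00.000Z"),
                     $lte: new Date("2019-02-02T00:00:00.000Z") }
}).explain("executionStats")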
I have a query which is routinely taking around 30 seconds to run for a collection with 1 million documents. This query is to form part of a search engine, where the requirement is that every search completes in less than 5 seconds. Using a simplified example here (the actual docs have embedded documents and other attributes), let's say I have the following:
1 million docs in a Users collection, where each looks as follows:
{
    name: "Dan",
    age: 30,
    followers: 400
},
{
    name: "Sally",
    age: 42,
    followers: 250
}
... etc
Now, let's say I want to return the IDs of 10 users with a follower count between 200 and 300, sorted by age in descending order. This can be achieved with the following:
db.users.find(
    { 'followers': { $gt: 200, $lt: 300 } },
    { '_id': 1 }
).sort({ 'age': -1 }).limit(10)
I have the following compound Index created, which winningPlan tells me is being used:
db.users.createIndex({ 'followed_by': -1, 'age': -1 })
But this query is still taking ~30 seconds, as it has to examine thousands of docs, nearly equal to the number of docs that match the find query in this case. I have experimented with different indexes (with different field positions and sort orders) with no luck.
So my question is, what else can I do to either reduce the number of documents examined by the query, or speed up the process of examining them?
The query takes this long both in production and on my local dev environment, somewhat ruling out network and hardware factors. currentOp shows that the query is not waiting for locks while running, and that there are no other queries running at the same time.
To me, it looks like you have an incorrect index for your query: { 'followed_by': -1, 'age': -1 }. You should have an index on { 'followers': 1 } (but take the cardinality of that field into consideration). Even with that index you will need an in-memory sort. Still, it should be much faster as long as the field has high cardinality, because you will not need to scan the whole collection for the filtering step, as you do with the followed_by index prefix.
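A minimal sketch of that suggestion against the simplified example (the in-memory SORT stage on age will still show up in the plan, as noted above):
// Index the field the filter actually uses
db.users.createIndex({ followers: 1 })
// Inspect the winning plan and how many keys/docs are examined
db.users.find({ followers: { $gt: 200, $lt: 300 } }, { _id: 1 })
    .sort({ age: -1 })
    .limit(10)
    .explain("executionStats")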
I need a compound index for my collection, but I'm not sure about the key order.
My item:
{
_id,
location: {
type: "Point",
coordinates: [<lng>, <lat>]
},
isActive: true,
till: ISODate("2016-12-29T22:00:00.000Z"),
createdAt : ISODate("2016-10-31T12:02:51.072Z"),
...
}
My main query is:
db.collection.find({
$and: [
{
isActive: true
}, {
'till': {
$gte: new Date()
}
},
{
'location': { $geoWithin: { $box: [ [ SWLng,SWLat], [ NELng, NELat] ] } }
}
]
}).sort({'createdAt': -1 })
In plain English: I need all active items in the visible part of my map that are not expired, newest first.
Is it normal to create this index:
db.collection.createIndex( { "isActive": 1, "till": -1, "location": "2dsphere", "createdAt": -1 } )
And what order is best for performance and for disk usage? Or maybe I should create several indexes...
Thank you!
The order of fields in an index should be:
fields on which you will query for exact values.
fields on which you will sort.
fields on which you will query for a range of values.
In your case it would be:
db.collection.createIndex( { "isActive": 1, "createdAt": -1, "till": -1, "location": "2dsphere" } )
However, indexes on boolean fields are often not very useful, as on average MongoDB will still need to access half of your documents. So I would advise you to do the following:
duplicate the collection for test purposes
create the index you would like to test (e.g. {"isActive": 1, "createdAt": -1, "till": -1, "location": "2dsphere" })
in the mongo shell create a variable:
var exp = db.testCollection.explain('executionStats')
execute your query as exp.find(<your query>); it will return statistics describing the execution of the winning plan
analyze keys like "nReturned", "totalKeysExamined", "totalDocsExamined"
drop the index, create a new one (e.g. {"createdAt": -1, "till": -1, "location": "2dsphere"}), execute exp.find(<your query>) again and compare the result to the previous one
In Mongo, many things depend upon the data and its access patterns. There are a few things to consider while creating indexes on your collection:
How the data will be accessed from the application. (You already know the main query, so this part is almost done.)
The data size, cardinality and span.
Operations on the data (how often reads and writes will happen, and in what pattern).
A particular query can use only one index at a time.
Index usage is not static. Mongo keeps changing the index it uses based on heuristics, and it tries to do so in an optimized way. So if you see index1 being used at some point, it may happen that Mongo uses index2 after some time, once enough data of a different type/cardinality has been entered.
Indexes can be good as well as bad for your application's performance. It is best to test them via the shell/Compass before using them in production.
var ex = db.<collection>.explain("executionStats")
The above line, when entered in the mongo shell, gives you an explainable object which can be used further to investigate performance issues.
ex.find(<Your query>).sort(<sort predicate>)
Points to note in above output are
"executionTimeMillis"
"totalKeysExamined"
"totalDocsExamined"
"stage"
"nReturned"
We strive for a minimum on the first three items (executionTimeMillis, totalKeysExamined and totalDocsExamined), and "stage" is one important indicator of what is happening. If the stage is "COLLSCAN" it means every document is being examined to fulfil the query; if the stage is "SORT" it means an in-memory sort is being done. Neither is good.
Coming to your query, there are a few things to consider:
If "till" is going to have a fixed value, like the end-of-month date for all the items entered during a month, then it's not a good idea to have an index on it. The DB will have to scan many documents even with this index. Moreover, there will be only 12 distinct values per year for this key, given it is a month-end date.
If "till" is a fixed offset from "createdAt" then it is not good to have an index on both.
Indexing "isActive" is not good because there are only two values it can take.
So please try with the actual data, execute the indexes below, and determine which one fits best considering execution time, number of docs examined, etc.
1. {"location": "2dsphere" , "createdAt": -1}
2. {"till":1, "location": "2dsphere" , "createdAt": -1}
Apply both indexes to the collection and execute ex.find().sort(), where ex is the explainable object, as sketched below. Then analyze both outputs and compare them to decide which is best.
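Put concretely, the comparison might look like this in the mongo shell, using the placeholder collection and box coordinates from the question (a sketch only, not a drop-in script):
// Candidate index 1 from above (test each candidate separately: create, explain, drop, repeat)
db.collection.createIndex({ "location": "2dsphere", "createdAt": -1 })
// Explainable object, then run the main query and note the stats
var ex = db.collection.explain("executionStats")
ex.find({
    isActive: true,
    till: { $gte: new Date() },
    location: { $geoWithin: { $box: [ [ SWLng, SWLat ], [ NELng, NELat ] ] } }
}).sort({ createdAt: -1 })
// Compare "executionTimeMillis", "totalKeysExamined", "totalDocsExamined" and "stage",
// then repeat after dropping this index and creating candidate 2:
// db.collection.createIndex({ "till": 1, "location": "2dsphere", "createdAt": -1 })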
I have a collection
orders
{
    "_id": "abcd",
    "last_modified": ISODate("2016-01-01T00:00:00Z"),
    "suborders": [
        {
            "suborder_id": "1",
            "last_modified": ISODate("2016-01-02T00:00:00Z")
        }, {
            "suborder_id": "2",
            "last_modified": ISODate("2016-01-03T00:00:00Z")
        }
    ]
}
I have two indexes on this collection:
{"last_modified":1}
{"suborders.last_modified": 1}
When I use range queries on last_modified, the index is used properly and results are returned instantly, e.g.: db.orders.find({"last_modified": {$gt: ISODate("2016-09-15"), $lt: ISODate("2016-09-16")}});
However, when I query on suborders.last_modified, the query takes too long to execute, e.g.: db.orders.find({"suborders.last_modified": {$gt: ISODate("2016-09-15"), $lt: ISODate("2016-09-16")}});
Please help debug this.
The short answer is to use min and max to set the index bounds correctly. For how to approach debugging, read on.
A good place to start for query performance issues is to attach .explain() at the end of your queries. I made a script to generate documents like yours and execute the queries you provided.
I used mongo 3.2.9 and both queries do use the created indices with this setup. However, the second query was returning many more documents (approximately 6% of all the documents in the collection). I suspect that is not your intention.
To see what is happening lets consider a small example in the mongo shell:
> db.arrayFun.insert({
orders: [
{ last_modified: ISODate("2015-01-01T00:00:00Z") },
{ last_modified: ISODate("2016-01-01T00:00:00Z") }
]
})
WriteResult({ "nInserted" : 1 })
then query between May and July of 2015:
> db.arrayFun.find({"orders.last_modified": {
$gt: ISODate("2015-05-01T00:00:00Z"),
$lt: ISODate("2015-07-01T00:00:00Z")
}}, {_id: 0})
{ "orders" : [ { "last_modified" : ISODate("2015-01-01T00:00:00Z") }, { "last_modified" : ISODate("2016-01-01T00:00:00Z") } ] }
Although neither object in the array has last_modified between May and July, it found the document. This is because the query is looking for one object in the array with last_modified greater than May and one object with last_modified less than July, and those do not have to be the same object. MongoDB cannot intersect the multikey index bounds for such a query, which is what happens in your case. You can see this in the indexBounds field of the explain("allPlansExecution") output; in particular, either the lower bound or the upper bound Date will not be what you specified. This means that a large number of documents may need to be scanned to complete the query, depending on your data.
To find objects in the array that have last_modified between two bounds, I tried using $elemMatch.
db.orders.find({"suborders": {
$elemMatch:{
last_modified:{
"$gt":ISODate("2016-09-15T00:00:00Z"),
"$lt":ISODate("2016-09-16T00:00:00Z")
}
}
}})
In my test this returned about 0.5% of all documents. However, it was still running slow. The explain output showed it was still not setting the index bounds correctly (only using one bound).
What ended up working best was to manually set the index bounds with min and max.
db.subDocs.find()
.min({"suborders.last_modified":ISODate("2016-09-15T00:00:00Z")})
.max({"suborders.last_modified":ISODate("2016-09-16T00:00:00Z")})
Which returned the same documents as $elemMatch but used both bounds on the index. It ran in 0.021s versus 2-4s for elemMatch and the original find.
I've gone through many articles about indexes in MongoDB but still have no clear understanding of indexing strategy for my particular case.
I have the following collection with more than 10 million items:
{
active: BOOLEAN,
afp: ARRAY_of_integers
}
Previously I was using aggregation with this pipeline:
pipeline = [
{'$match': {'active': True}},
{'$project': {
'sub': {'$setIsSubset': [afp, '$afp']}
}},
{'$match': {'sub': True}}
]
Queries were pretty slow, so I started experimenting without aggregation:
db.collection.find({'active': True, 'afp': {'$all': afp}})
The latter query using $all runs much faster with the same indexes - no idea why...
I've tried these indexes (not many variations are possible though):
{afp: 1}
{active: 1}
{active: 1, afp: 1}
{afp: 1, active: 1}
I don't know why, but the last index gives the best performance - any ideas about the reason would be much appreciated.
Then I decided to add constraints in order to possibly improve speed.
Considering that the number of integers in the "afp" field can vary, there's no reason to scan documents having fewer integers than the query. So I created one more field on all documents, called "len_afp", which contains that number (the afp length).
Now documents look like this:
{
active: BOOLEAN,
afp: ARRAY_of_integers,
len_afp: INTEGER
}
Query is:
db.collection.find({
'active': True,
'afp': {'$all': afp},
'len_afp': {'$gte': len_afp}
})
Also I've added three new indexes:
{afp: 1, len_afp: 1, active: 1}
{afp: 1, active: 1, len_afp: 1}
{len_afp: 1, afp: 1, active: 1}
The last index gives the best performance - again for an unknown reason.
So my question is: what is the logic behind field order in compound indexes, and how should this logic be considered when creating indexes?
Also, it's interesting why $all works faster than $setIsSubset with all other conditions the same.
Can index intersection be used for my case instead of compound indexes?
Still the performance is pretty low - what am I doing wrong?
Can sharding help in my particular case (will it work using aggregation, or $all, or $gte)?
Sorry for the huge question, and thank you in advance!