I've gone through many articles about indexes in MongoDB but still have no clear understanding of indexing strategy for my particular case.
I have the following collection with more than 10 million items:
{
active: BOOLEAN,
afp: ARRAY_of_integers
}
Previously I was using aggregation with this pipeline:
pipeline = [
{'$match': {'active': True}},
{'$project': {
'sub': {'$setIsSubset': [afp, '$afp']}
}},
{'$match': {'sub': True}}
]
Queries were pretty slow, so I started experimenting without aggregation:
db.collection.find({'active': True, 'afp': {'$all': afp}})
The latter query using $all runs much faster with the same indexes - no idea why...
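For reference, the two predicates test the same condition; here is a plain-Python sketch (with made-up documents) of what each computes. The usual reason for the speed difference is where the work happens: `$all` can be checked against multikey index entries, while `$setIsSubset` is an aggregation expression evaluated on each document after it has been fetched.

```python
# Plain-Python sketch (hypothetical data) of what both queries compute.
docs = [
    {'active': True,  'afp': [1, 2, 3, 4]},
    {'active': True,  'afp': [2, 3]},
    {'active': False, 'afp': [1, 2, 3]},
]
afp = [2, 3]  # the query array

# $setIsSubset semantics: set containment, evaluated per fetched document
subset_matches = [d for d in docs
                  if d['active'] and set(afp) <= set(d['afp'])]

# $all semantics: every query element must appear in the document array
all_matches = [d for d in docs
               if d['active'] and all(x in d['afp'] for x in afp)]

assert subset_matches == all_matches  # same result set, different plans
```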
I've tried these indexes (not many variations are possible, though):
{afp: 1}
{active: 1}
{active: 1, afp: 1}
{afp: 1, active: 1}
I don't know why, but the last index gives the best performance - any ideas about the reason would be much appreciated.
Then I decided to add constraints in order to possibly improve speed.
Considering that the number of integers in the "afp" field varies, there's no reason to scan documents with fewer integers than the query has. So I added one more field, "len_afp", to all documents, containing that number (the length of "afp").
Now documents look like this:
{
active: BOOLEAN,
afp: ARRAY_of_integers,
len_afp: INTEGER
}
Query is:
db.collection.find({
'active': True,
'afp': {'$all': afp},
'len_afp': {'$gte': len_afp}
})
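The pruning idea can be sketched in plain Python (hypothetical data); note that the `len_afp` value in the query should be the number of distinct integers in the query array, which is the lower bound for any matching document:

```python
# Plain-Python sketch (hypothetical data) of the length-based pruning:
# a document can only contain every element of the query array if it has
# at least as many distinct elements.
docs = [
    {'active': True, 'afp': [1, 2, 3, 4], 'len_afp': 4},
    {'active': True, 'afp': [2],          'len_afp': 1},
]
afp = [2, 3]
len_afp = len(set(afp))  # lower bound for any matching document

candidates = [d for d in docs if d['active'] and d['len_afp'] >= len_afp]
matches = [d for d in candidates if all(x in d['afp'] for x in afp)]
# the second document is skipped without inspecting its array at all
```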
Also I've added three new indexes:
{afp: 1, len_afp: 1, active: 1}
{afp: 1, active: 1, len_afp: 1}
{len_afp: 1, afp: 1, active: 1}
The last index gives the best performance - again, for an unknown reason.
So my question is: what is the logic behind field order in compound indexes, and how should that logic be applied when creating indexes?
Also, why does $all work faster than $setIsSubset when all other conditions are the same?
Can index intersection be used for my case instead of compound indexes?
Still, the performance is pretty low - what am I doing wrong?
Can sharding help in my particular case (will it work using aggregation, or $all, or $gte)?
Sorry for the huge question, and thank you in advance!
I'm trying to migrate some MySQL code to MongoDB (intermediate-to-senior in MySQL, beginner-to-intermediate in MongoDB). I need to perform several updates an hour that touch all documents in the collection. The updates don't overlap document-to-document, but they use each document's own values (see code below).
Before diving into the actual migration of the code, I ran a benchmark.
Let's take this structure:
const MySchema = new Schema({
amount: {type: Number, required: true, default: 0},
myField: {type: Number, required: true, default: 0}
});
For the benchmark, I need to perform an update that does the following: myField = myField + (amount * 2). The real calculation will be much more complex, so it's critical that a "simple" operation like this performs well.
Below is the implementation (NB: mongoose is used):
await MyModel.updateMany({amount: {$gte: 0}}, [
{$set: {
myField: {
$add: [
'$myField',
{$multiply: ['$amount', 2]}
]
}
}}
]);
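For clarity, here is what that pipeline update computes per document, as a plain-Python sketch with made-up values (the real update runs server-side in a single pass):

```python
# Plain-Python sketch (made-up values) of what the pipeline computes
# per document; the real update runs server-side in a single pass.
docs = [
    {'amount': 10, 'myField': 1},
    {'amount': 0,  'myField': 5},
    {'amount': 3,  'myField': 2},
]

for d in docs:
    if d['amount'] >= 0:                 # the {amount: {$gte: 0}} filter
        d['myField'] += d['amount'] * 2  # the $add / $multiply expression

assert [d['myField'] for d in docs] == [21, 5, 8]
```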
Please note that a forEach() approach is not something I have in mind, because the number of rows will increase significantly; all I need is the fastest way to do this.
I seeded 50k entries with random amounts.
The execution of this query takes between 1200ms and 1400ms.
The equivalent in MySQL took about 75ms, so a ratio of 1:18 in the worst case.
The filter {amount: {$gte: 0}} can be omitted (that decreases execution time by 200ms).
Now the questions:
Is it "normal", considering this is a simple update, that MySQL performs faster than MongoDB on a basic 50k-row set?
Is something missing in my implementation to get down to a 75ms update, or should I conclude that MongoDB is not intended for this type of usage?
I have a MongoDB 4.4 cluster and a database with a collection of 200k documents and 55 indexes for different queries.
The following query:
db.getCollection('tasks').find({
"customer": "gZuu5ZptDEtC6dq2Z",
"finished": true,
"$or": [
{
"scoreCalculated": {
"$exists": true
},
},
{
"workflowProcessed": {
"$exists": true
},
}
]
}).sort({
"scoreCalculated": -1,
"workflowProcessed": -1,
"createdAt": -1
})
executes in less than 1 second on average.
But if I change sort direction to
.sort({
"scoreCalculated": 1,
"workflowProcessed": 1,
"createdAt": 1
})
the execution time grows to several seconds (up to 10).
The first explain shows that the apiGetTasks index is used. But that index has ascending keys, and I don't get why it is not used when I change the sort direction to ascending. Adding the same index with descending keys doesn't change anything.
Could you please help me to understand why the second query is so slow?
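One general property worth keeping in mind while debugging this (whether it applies here depends on the actual key directions of apiGetTasks, which aren't shown): an index can be traversed backwards, so a sort that flips every key direction can still use it, while flipping only some of the keys forces an in-memory sort. A small Python sketch with tuple keys:

```python
# Sketch: an index stores keys in one fixed order, but can be walked
# backwards. Tuple keys with hypothetical (score, name) values:
index_keys = sorted([(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')])

# Flipping EVERY sort key == walking the index backwards, no re-sort:
assert list(reversed(index_keys)) == sorted(index_keys, reverse=True)

# Flipping only SOME keys produces an order the index cannot provide,
# so a blocking in-memory sort would be required:
mixed = sorted(index_keys, key=lambda k: (-k[0], k[1]))
assert mixed != list(reversed(index_keys))
```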
Please share the indexes.
55 indexes are way too many; you should reduce the number, because it can harm your performance (every time you want to use an index, it has to be loaded into RAM, instead of letting Mongo use that RAM to optimize queries).
Moreover, the total number of examined docs is 210780, which is your entire collection. So you need to rethink how to build indexes that efficiently support your queries.
Read the Mongo indexes docs.
I have a MongoDB collection with close to 100k documents. On each document, there are the following 3 fields:
arrayX: [ObjectId]
someID: ObjectId
timestamp: Date
I have created a compound index for the 3 fields in that order.
When I try to then fire an aggregate query (written below in pseudocode), as
match(
and(
arrayX: (elematch: A),
someId: Y
)
)
sort (timestamp: 1)
it does not end up using the compound index.
The way I know this is that when I use .explain(), the winningPlan stage is FETCH, the inputStage is IXSCAN, and the index name is timestamp_1,
which means it's only using the other single-key index I created on the timestamp field.
What's interesting is that if I remove the sort stage, and keep everything the exact same, mongodb ends up using the compound index.
What gives?
Multi-key indexes are not useful for sorting. I would expect that a plan using the other index was listed in rejectedPlans.
If you run explain with the allPlansExecution option, the response will also show you the execution times for each plan, among other things.
Since the multi-key index can't be used for sorting the results, that plan would require a blocking sort stage. This means that all of the matching documents must be retrieved and then sorted in memory before sending a response.
On the other hand, using the timestamp_1 index means that documents will be encountered in a presorted order while traversing the index. The tradeoff here is that there is no blocking sort stage, but every document must be examined to see if it matches the query.
For data sets that are not huge, or when the query will match a significant fraction of the collection, the plan without a blocking sort will return results faster.
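The two plans can be sketched in plain Python (made-up documents): plan A filters and then block-sorts all matches, plan B walks a pre-sorted timestamp order and filters as it goes. Both return the same rows; they differ only in where the work happens:

```python
# Plain-Python sketch (made-up documents) of the two plans.
docs = [
    {'someID': 1, 'ts': 3}, {'someID': 2, 'ts': 1},
    {'someID': 1, 'ts': 2}, {'someID': 1, 'ts': 1},
]

# Plan A: filter first, then a blocking in-memory sort of all matches
plan_a = sorted((d for d in docs if d['someID'] == 1), key=lambda d: d['ts'])

# Plan B: walk the timestamp_1 index (already sorted), filter each doc
by_timestamp = sorted(docs, key=lambda d: d['ts'])
plan_b = [d for d in by_timestamp if d['someID'] == 1]

assert plan_a == plan_b  # same rows; only the work distribution differs
```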
You might test creating another index on { someID:1, timestamp:1 } as this might reduce the number of documents scanned while still avoiding the blocking sort.
The reason the compound index is selected when you remove the sort stage is that the sort stage probably accounts for the majority of the execution time.
The fields in the executionStats section of the explain output are explained in Explain Results. Comparing the estimated execution times for each stage may help you determine where you can tune the queries.
I am using documents like this (based on the question post) for discussion:
{
_id: 1,
fld: "One",
arrayX: [ ObjectId("5e44f9ed221e963909537848"), ObjectId("5e44f9ed221e963909537849") ],
someID: ObjectId("5e44f9e7221e963909537845"),
timestamp: ISODate("2020-02-12T01:00:00.0Z")
}
The Indexes:
I created two indexes, as mentioned in the question post:
{ timestamp: 1 } and { arrayX:1, someID:1, timestamp:1 }
The Query:
db.test.find(
{
someID: ObjectId("5e44f9e7221e963909537845"),
arrayX: ObjectId("5e44f9ed221e963909537848")
}
).sort( { timestamp: 1 } )
In the above query I am not using $elemMatch. A query filter using $elemMatch with a single-field equality condition can be written without the $elemMatch. From $elemMatch Single Query Condition:
If you specify a single query predicate in the $elemMatch expression,
$elemMatch is not necessary.
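That equivalence is easy to see in a plain-Python sketch (hypothetical documents): equality against an array field already means "some element equals the value":

```python
# Sketch (hypothetical documents): equality against an array field already
# means "some element equals the value", so single-predicate $elemMatch
# is redundant.
docs = [{'arrayX': [1, 2]}, {'arrayX': [3]}]

# {arrayX: {$elemMatch: {$eq: 2}}}
elem_match = [d for d in docs if any(x == 2 for x in d['arrayX'])]

# {arrayX: 2}
plain_eq = [d for d in docs if 2 in d['arrayX']]

assert elem_match == plain_eq == [docs[0]]
```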
The Query Plan:
I ran the query with explain and found that it uses the arrayX_1_someID_1_timestamp_1 index. The index is used for the filter as well as the sort operations of the query.
Sample plan details:
"winningPlan" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"arrayX" : 1,
"someID" : 1,
"timestamp" : 1
},
"indexName" : "arrayX_1_someID_1_timestamp_1",
...
The IXSCAN stage specifies that the query uses the index. The FETCH stage specifies that the document itself is retrieved, using the record id from the index, to get fields not stored in the index keys. This means that both the query's filter and the sort use the index. The way to know that the sort uses an index is that the plan has no SORT stage - as in this case.
Reference:
From Sort and Non-prefix Subset of an Index:
An index can support sort operations on a non-prefix subset of the
index key pattern. To do so, the query must include equality
conditions on all the prefix keys that precede the sort keys.
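That rule can be illustrated with a small Python sketch, modeling index entries as sorted (arrayX, someID, timestamp) tuples with hypothetical values:

```python
# Index entries modeled as sorted (arrayX, someID, timestamp) tuples,
# hypothetical values:
index_keys = sorted([
    ('a', 1, 30), ('a', 1, 10), ('a', 2, 20), ('b', 1, 5),
])

# With equality on both prefix keys, the matching range of the index is
# already ordered by the remaining key, timestamp - no SORT stage needed:
scan = [k for k in index_keys if k[0] == 'a' and k[1] == 1]
assert scan == sorted(scan, key=lambda k: k[2])
```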
I'm trying to understand how to best work with indices in MongoDB. Lets say that I have a collection of documents like this one:
{
_id: 1,
keywords: ["gap", "casual", "shorts", "oatmeal"],
age: 21,
brand: "Gap",
color: "Black",
gender: "female",
retailer: "Gap",
style: "Casual Shorts",
student: false,
location: "US",
}
and I regularly run a query to find all documents that match a set of criteria for each of those fields, something like:
db.items.find({ age: { $gt: 13, $lt: 40 },
brand: { $in: ['Gap', 'Target'] },
retailer: { $in: ['Gap', 'Target'] },
gender: { $in: ['male', 'female'] },
style: { $in: ['Casual Shorts', 'Jeans']},
location: { $in: ['US', 'International'] },
color: { $in: ['Black', 'Green'] },
keywords: { $all: ['gap', 'casual'] }
})
I'm trying to figure what sort of index I can create to improve the speed of a query such as this. Should I create a compound index like this:
db.items.ensureIndex({ age: 1, brand: 1, retailer: 1, gender: 1, style: 1, location: 1, color: 1, keywords: 1})
or is there a better set of indices I can create to optimize this query?
Should I create a compound index like this:
db.items.ensureIndex({age: 1, brand: 1, retailer: 1, gender: 1, style: 1, location: 1, color: 1, keywords: 1})
You can create an index like the one above, but you'd be indexing almost the entire collection. Indexes take space: the more fields in the index, the more space is used (usually RAM, although it can be swapped out). They also incur a write penalty.
Your index seems wasteful, since indexing just a few of those fields would probably already make MongoDB scan a set of documents close to the expected result of the find operation.
Is there a better set of indices I can create to optimize this query?
Like I said before, probably yes. But this question is very difficult to answer without knowing details of the collection: how many documents it has, which values each field can take, how those values are distributed in the collection (50% gender male, 50% gender female?), how they correlate to each other, etc.
There are a few indexing strategies, but normally you should strive to create indexes with high selectivity. Choose "small" field combinations that will help MongoDB locate the desired documents scanning a "reasonable" amount of them. Again, "small" and "reasonable" will depend on the characteristics of the collection and query you are performing.
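As a rough illustration of selectivity (hypothetical sample data, and no substitute for running explain() against real data): the share of distinct values per field hints at which field narrows a scan the most.

```python
# Rough selectivity estimate over a sample (hypothetical data): the share
# of distinct values per field hints at which field narrows a scan most.
sample = [
    {'gender': 'female', 'brand': 'Gap'},
    {'gender': 'female', 'brand': 'Target'},
    {'gender': 'male',   'brand': 'Gap'},
    {'gender': 'female', 'brand': 'Acme'},
]

def selectivity(field):
    """Distinct values divided by documents; 1.0 is perfectly selective."""
    return len({d[field] for d in sample}) / len(sample)

assert selectivity('brand') > selectivity('gender')
```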
Since this is a fairly complex subject, here are some references that should help you build more appropriate indexes.
http://emptysqua.re/blog/optimizing-mongodb-compound-indexes/
http://docs.mongodb.org/manual/faq/indexes/#how-do-you-determine-what-fields-to-index
http://docs.mongodb.org/manual/tutorial/create-queries-that-ensure-selectivity/
And use cursor.explain to evaluate your indexes.
http://docs.mongodb.org/manual/reference/method/cursor.explain/
A large index like this one will penalize you on writes. It is better to index just what you need and let Mongo's optimizer do most of the work for you. You can always give it a hint or, as a last resort, reindex if your application or data usage changes drastically.
Your query will use the index for the fields that have one (fast), and a collection scan (slow) over the remaining documents.
Depending on your application, a few standalone indexes could be better. Adding more indexes will not necessarily improve performance; with the write penalty, it could even make things worse (YMMV).
Here is a basic algorithm for selecting fields to put in an index:
What single field is in a query the most often?
If that single field is present in a query, will a table scan be expensive?
What other field could you index to further reduce the table scan?
This index seems very reasonable for your query. MongoDB calls this a covered query for this index, since there is no need to access the documents: all data can be fetched from the index.
from the docs:
"Because the index “covers” the query, MongoDB can both match the query conditions and return the results using only the index; MongoDB does not need to look at the documents, only the index, to fulfill the query. An index can also cover an aggregation pipeline operation on unsharded collections."
Some remarks:
This index will only be used by queries that include a filter on age. A query that only filters by brand or retailer will probably not use this index.
Adding an index on only one or two of the most selective fields of your query will already bring a very significant performance boost. The more fields you add the larger the index size will be on disk.
You may want to generate some random sample data and measure the performance of this with different indexes or sets of indexes. This is obviously the safest way to know.
I run a lot of find requests on a collection, like this:
{'$and': [{'time': {'$lt': 1375214400}},
{'time': {'$gte': 1375128000}},
{'$or': [{'uuid': 'test'},{'uuid': 'test2'}]}
]}
Which index must I create: a compound one, two single-field ones, or both?
uuid - the name of the data collector
time - a timestamp
I want to retrieve data collected by one or a few collectors in a specified time interval.
Your query would be better written without the $and and using $in instead of $or:
{
'time': {'$lt': 1375214400, '$gte': 1375128000},
'uuid': {'$in': ['test', 'test2']}
}
Then it's pretty clear you need a compound index that covers both time and uuid for the best query performance. But it's important to always confirm that your index is being used as you expect, with explain().
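As a sketch of why the equality field (uuid) should come before the range field (time) in that compound index, here is a plain-Python model of index entries with hypothetical values, counting how many keys each ordering touches:

```python
# Index entries modeled as sorted tuples, hypothetical values: three
# collectors, timestamps 0..9, query = uuid in {test, test2}, 2 <= time < 5.
uuids, times = ['other', 'test', 'test2'], range(10)

keys_uuid_time = sorted((u, t) for u in uuids for t in times)
keys_time_uuid = sorted((t, u) for u in uuids for t in times)

# {uuid: 1, time: 1}: one contiguous run per uuid, only matching keys read
scan_a = [k for k in keys_uuid_time
          if k[0] in ('test', 'test2') and 2 <= k[1] < 5]

# {time: 1, uuid: 1}: the whole time range is read, every uuid included
scan_b = [k for k in keys_time_uuid if 2 <= k[0] < 5]
matches_b = [(u, t) for t, u in scan_b if u in ('test', 'test2')]

assert len(scan_a) == 6   # exactly the matching keys
assert len(scan_b) == 9   # also reads 'other' keys that are then discarded
```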