MongoDB Compound Index

I need a compound index for my collection, but I'm not sure about the key order.
My item:
{
_id,
location: {
type: "Point",
coordinates: [<lng>, <lat>]
},
isActive: true,
till: ISODate("2016-12-29T22:00:00.000Z"),
createdAt : ISODate("2016-10-31T12:02:51.072Z"),
...
}
My main query is:
db.collection.find({
$and: [
{
isActive: true
}, {
'till': {
$gte: new Date()
}
},
{
'location': { $geoWithin: { $box: [ [ SWLng,SWLat], [ NELng, NELat] ] } }
}
]
}).sort({'createdAt': -1 })
In plain terms, I need all active, unexpired items in the visible part of my map, newest first.
Is it reasonable to create this index:
db.collection.createIndex( { "isActive": 1, "till": -1, "location": "2dsphere", "createdAt": -1 } )
And what is the best key order for performance and for disk usage? Or do I need to create several indexes?
Thank you!

The order of fields in an index should be:
1. fields on which you will query for exact values.
2. fields on which you will sort.
3. fields on which you will query for a range of values.
In your case it would be:
db.collection.createIndex( { "isActive": 1, "createdAt": -1, "till": -1, "location": "2dsphere" } )
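A minimal sketch of that ordering rule in plain JavaScript, in case it helps to see it spelled out. The helper name buildIndexSpec and the role labels are hypothetical (they are not part of MongoDB); the helper only arranges keys as equality, then sort, then range.

```javascript
// Hypothetical helper (not a MongoDB API): arranges index keys as
// equality fields first, then sort fields, then range fields.
function buildIndexSpec(fields) {
  const rank = { equality: 0, sort: 1, range: 2 };
  const spec = {};
  fields
    .map((f, i) => ({ ...f, i })) // remember input order for ties
    .sort((a, b) => rank[a.role] - rank[b.role] || a.i - b.i)
    .forEach((f) => { spec[f.name] = f.type; });
  return spec;
}

// Fields of the query in the question, tagged by how the query uses them.
const spec = buildIndexSpec([
  { name: "isActive", role: "equality", type: 1 },
  { name: "till", role: "range", type: -1 },
  { name: "location", role: "range", type: "2dsphere" },
  { name: "createdAt", role: "sort", type: -1 },
]);

console.log(JSON.stringify(spec));
```

The resulting object could then be passed to createIndex; whether that order is actually the fastest still has to be measured on real data.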
However, indexes on boolean fields are often not very useful, as on average MongoDB will still need to access half of your documents. So I would advise you to do the following:
1. Duplicate the collection for test purposes.
2. Create the index you would like to test (e.g. {"isActive": 1, "createdAt": -1, "till": -1, "location": "2dsphere"}).
3. In the mongo shell, create a variable:
var exp = db.testCollection.explain('executionStats')
4. Execute exp.find({'your query'}); it returns statistics describing the execution of the winning plan.
5. Analyze keys such as "nReturned", "totalKeysExamined" and "totalDocsExamined".
6. Drop the index, create a new one (e.g. {"createdAt": -1, "till": -1, "location": "2dsphere"}), execute exp.find({'your query'}) again and compare the result to the previous one.
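One way to make that comparison less eyeball-driven is to reduce each run to "work per returned document". This is only a sketch: the function name workPerResult is made up, and the sample numbers below are invented purely to show the arithmetic. It expects the executionStats sub-document of an explain result.

```javascript
// Reduces an executionStats object to "work per returned document".
// Field names follow the explain output; the sample numbers are invented.
function workPerResult(stats) {
  const n = Math.max(stats.nReturned, 1); // avoid dividing by zero
  return {
    keysPerResult: stats.totalKeysExamined / n,
    docsPerResult: stats.totalDocsExamined / n,
  };
}

// Two hypothetical runs of the same query with different indexes.
const planA = workPerResult({ nReturned: 100, totalKeysExamined: 5000, totalDocsExamined: 5000 });
const planB = workPerResult({ nReturned: 100, totalKeysExamined: 400, totalDocsExamined: 100 });

console.log(planA, planB);
```

Ratios close to 1 mean the index is doing most of the filtering; large ratios mean many keys or documents are examined and then thrown away.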

In MongoDB, many things depend on the data and its access patterns. There are a few things to consider when creating an index on your collection:
How the data will be accessed from the application. (You already know the main query, so this part is almost done.)
The data size, cardinality and data span.
Operations on the data (how often reads and writes will happen, and in what pattern).
A particular query generally uses only one index at a time.
Index usage is not static. MongoDB's query planner re-evaluates which index to use, so if you see index1 being used at some point, it may switch to index2 later, once enough data of a different type or cardinality has been entered.
Indexes can help or hurt your application's performance. It is best to test them via the shell or Compass before using them in production.
var ex = db.<collection>.explain("executionStats")
Entering the line above in the mongo shell gives you an explainable object, which can then be used to check performance issues.
ex.find(<Your query>).sort(<sort predicate>)
Points to note in the output are
"executionTimeMillis"
"totalKeysExamined"
"totalDocsExamined"
"stage"
"nReturned"
We strive to minimise the first three items (executionTimeMillis, totalKeysExamined and totalDocsExamined), and "stage" tells you what is happening. If the stage is "COLLSCAN", every document in the collection is being scanned to fulfil the query; if the stage is "SORT", the sort is being done in memory. Neither is good.
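Since the interesting stage can be nested several levels deep in the plan, a small walker helps. This is a sketch under assumptions: the function names are mine, and samplePlan is a hand-written object shaped like a winning plan, not real explain output.

```javascript
// Collects every stage name in a plan tree. Plans nest through
// `inputStage` (single child) or `inputStages` (several children).
function collectStages(plan, out = []) {
  if (!plan) return out;
  out.push(plan.stage);
  collectStages(plan.inputStage, out);
  (plan.inputStages || []).forEach((child) => collectStages(child, out));
  return out;
}

// Keeps only the two stages called out above as warning signs.
function warningStages(plan) {
  return collectStages(plan).filter((s) => s === "COLLSCAN" || s === "SORT");
}

// Hand-written plan shaped like explain output, for illustration only.
const samplePlan = {
  stage: "SORT",
  inputStage: { stage: "FETCH", inputStage: { stage: "IXSCAN", indexName: "till_1" } },
};

console.log(warningStages(samplePlan));
```

An empty result from warningStages means neither a collection scan nor an in-memory sort appears anywhere in the plan.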
Coming to your query, there are a few things to consider:
If "till" is going to have a fixed value, such as the end-of-month date for all items entered during a month, then it's not a good idea to index it. The DB will have to scan many documents even with this index. Moreover, there will be only 12 distinct entries in this index per year if it is always a month-end date.
If "till" is a fixed offset after "createdAt", then it is not useful to index both.
Indexing "isActive" is not good because it can take only two values.
So please try with the actual data, create the indexes below, and determine which one fits best considering time, number of documents examined, etc.
1. {"location": "2dsphere" , "createdAt": -1}
2. {"till":1, "location": "2dsphere" , "createdAt": -1}
Create both indexes on the collection and run ex.find().sort(), where ex is the explainable object. Then analyze and compare both outputs to decide which is best.

Related

Slow mongodb query on changing sort direction

I have a MongoDB 4.4 cluster and a database with a collection of 200k documents and 55 indexes for different queries.
The following query:
db.getCollection('tasks').find({
"customer": "gZuu5ZptDEtC6dq2Z",
"finished": true,
"$or": [
{
"scoreCalculated": {
"$exists": true
},
},
{
"workflowProcessed": {
"$exists": true
},
}
]
}).sort({
"scoreCalculated": -1,
"workflowProcessed": -1,
"createdAt": -1
})
executes in less than 1 second on average. Explain.
But if I change sort direction to
.sort({
"scoreCalculated": 1,
"workflowProcessed": 1,
"createdAt": 1
})
the execution time grows to several seconds (up to 10). Explain.
The first explain shows that the apiGetTasks index is used. But that index is defined with ascending order, and I don't get why it is not used when I change the sort direction to ascending. Adding the same index with descending order doesn't change anything.
Could you please help me to understand why the second query is so slow?

Please share the indexes.
55 indexes is far too many; you should reduce the number, because it can harm performance (every index you use has to be loaded into RAM, which leaves less RAM for MongoDB to use when optimizing queries).
Moreover, the total number of examined docs is 210780, which is your whole collection. So you need to rethink how to build indexes that actually help optimize your queries.
Read the MongoDB indexes docs.

How to optimize mongo query with two parallel array?

I have a query like this:
xml_db.find(
{
'high_performer': {
'$nin': [some_value]
},
'low_performer': {
'$nin': [some_value]
},
'expiration_date': {
'$gte': datetime.now().strftime('%Y-%m-%d')
},
'source': 'some_value'
}
)
I tried to create an index on those fields but got this error:
pymongo.errors.OperationFailure: cannot index parallel arrays [low_performer] [high_performer]
So, how to efficiently run this query?

Compound index ordering should follow the equality --> sort --> range rule. A good description of this can be found in this response.
This means that the first field in the index would be source, followed by the range filters (expiration_date, low_performer and high_performer).
As you noticed, one of the "performer" fields cannot be included in the index since only a single array can be indexed. You should use your knowledge of the data set to determine which filter (low_performer or high_performer) would be more selective and choose that filter to be included in the index.
Assuming that high_performer is more selective, the only remaining step would be to determine the ordering between expiration_date and high_performer. Again, you should use your knowledge of the data set to make this determination based on selectivity.
Assuming expiration_date is more selective, the index to create would then be:
{ "source" : 1, "expiration_date" : 1, "high_performer" : 1 }
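To check a candidate index against the parallel-array restriction before creating it, something like the following could be used. It is a rough sketch: arrayKeys is a made-up helper, sampleDoc uses invented values, and only top-level fields are handled (no dotted paths).

```javascript
// Returns the index keys whose value in a sample document is an array.
// MongoDB rejects a compound index entry when more than one indexed
// field is an array in the same document ("cannot index parallel arrays").
function arrayKeys(indexSpec, sampleDoc) {
  return Object.keys(indexSpec).filter((k) => Array.isArray(sampleDoc[k]));
}

// Invented sample document with the field names from the question.
const sampleDoc = {
  source: "feed_a",
  expiration_date: "2030-01-01",
  high_performer: ["x"],
  low_performer: ["y"],
};

const allFour = arrayKeys(
  { source: 1, expiration_date: 1, low_performer: 1, high_performer: 1 },
  sampleDoc
);
const suggested = arrayKeys(
  { source: 1, expiration_date: 1, high_performer: 1 },
  sampleDoc
);

console.log(allFour.length > 1 ? "rejected: parallel arrays" : "ok");
console.log(suggested.length > 1 ? "rejected: parallel arrays" : "ok");
```

The first spec indexes both array fields and would be rejected; the second keeps a single array field, which is what the suggested index does.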

Mongo range query slow on objects inside array

I have a collection
orders
{
"_id": "abcd",
"last_modified": ISODate("2016-01-01T00:00:00Z"),
"suborders": [
{
"suborder_id": "1",
"last_modified": ISODate("2016-01-02T00:00:00Z")
}, {
"suborder_id": "2",
"last_modified": ISODate("2016-01-03T00:00:00Z")
}
]
}
I have two indexes on this collection:
{"last_modified":1}
{"suborders.last_modified": 1}
When I use range queries on last_modified, the index is used properly and results are returned instantly, e.g.: db.orders.find({"last_modified":{$gt:ISODate("2016-09-15"), $lt:ISODate("2016-09-16")}});
However, when I query on suborders.last_modified, the query takes too long to execute, e.g.: db.orders.find({"suborders.last_modified":{$gt:ISODate("2016-09-15"), $lt:ISODate("2016-09-16")}});
Please help debug this.

The short answer is to use min and max to set the index bounds correctly. For how to approach debugging, read on.
A good place to start for query performance issues is to attach .explain() at the end of your queries. I made a script to generate documents like yours and execute the queries you provided.
I used mongo 3.2.9 and both queries do use the created indices with this setup. However, the second query was returning many more documents (approximately 6% of all the documents in the collection). I suspect that is not your intention.
To see what is happening, let's consider a small example in the mongo shell:
> db.arrayFun.insert({
orders: [
{ last_modified: ISODate("2015-01-01T00:00:00Z") },
{ last_modified: ISODate("2016-01-01T00:00:00Z") }
]
})
WriteResult({ "nInserted" : 1 })
then query between May and July of 2015:
> db.arrayFun.find({"orders.last_modified": {
$gt: ISODate("2015-05-01T00:00:00Z"),
$lt: ISODate("2015-07-01T00:00:00Z")
}}, {_id: 0})
{ "orders" : [ { "last_modified" : ISODate("2015-01-01T00:00:00Z") }, { "last_modified" : ISODate("2016-01-01T00:00:00Z") } ] }
Although neither object in the array has last_modified between May and July, it found the document. This is because it is looking for one object in the array with last_modified greater than May and one object with last_modified less than July. These queries cannot intersect multikey index bounds, which happens in your case. You can see this in the indexBounds field of explain("allPlansExecution") output, in particular one of the lower bound or upper bound Date will not be what you specified. This means that a large number of documents may need to be scanned to complete the query depending on your data.
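The matching rule described above can be modelled in plain JavaScript, without a server. This is only an illustration of the semantics (the function names are mine), using the same two dates and the same May to July range as the shell example.

```javascript
// Plain-JavaScript model of the two matching rules, for illustration only.

// Range conditions on an array field: each bound may be satisfied
// by a different element of the array.
function matchesAnyElementPerBound(dates, gt, lt) {
  return dates.some((d) => d > gt) && dates.some((d) => d < lt);
}

// Stricter rule: a single element must satisfy both bounds
// (the behaviour $elemMatch gives).
function matchesSameElement(dates, gt, lt) {
  return dates.some((d) => d > gt && d < lt);
}

const dates = [new Date("2015-01-01T00:00:00Z"), new Date("2016-01-01T00:00:00Z")];
const gt = new Date("2015-05-01T00:00:00Z");
const lt = new Date("2015-07-01T00:00:00Z");

console.log(matchesAnyElementPerBound(dates, gt, lt));
console.log(matchesSameElement(dates, gt, lt));
```

The first check passes because the 2016 date is after May and the 2015 date is before July, even though no single date lies inside the range; the second check fails for exactly that reason.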
To find objects in the array that have last_modified between two bounds, I tried using $elemMatch.
db.orders.find({"suborders": {
$elemMatch:{
last_modified:{
"$gt":ISODate("2016-09-15T00:00:00Z"),
"$lt":ISODate("2016-09-16T00:00:00Z")
}
}
}})
In my test this returned about 0.5% of all documents. However, it was still running slow. The explain output showed it was still not setting the index bounds correctly (only using one bound).
What ended up working best was to manually set the index bounds with min and max.
db.subDocs.find()
.min({"suborders.last_modified":ISODate("2016-09-15T00:00:00Z")})
.max({"suborders.last_modified":ISODate("2016-09-16T00:00:00Z")})
This returned the same documents as $elemMatch but used both bounds on the index. It ran in 0.021s versus 2-4s for $elemMatch and the original find. (On MongoDB 4.2 and later, min() and max() additionally require a hint() naming the index.)

MongoDB with 1B documents, what is most optimum filter to return recently updated documents

I have a production mongo database of over 1B documents in a single collection sharded on _id across multiple servers. I'm trying to replicate recently updated records from this collection into Redshift.
Shard keys:
db.sample_collection.ensureIndex({_id: "hashed"})
sh.shardCollection("sample_collection.sample_object", {_id: "hashed"})
Example 'sample_object' Document
{
"_id" : ObjectId("527a6c9226d6b7770ab05345"),
"p": ISODate("2013-10-27T14:30:18.000Z"),
"a" : {
"ln" : "Doe",
"id" : NumberLong(3),
"fn" : "John",
},
"co" : {
"ct" : 2,
"it" : [
{'t': 'loreum', 'u' : NumberLong(300), 'd': ISODate("2013-10-28T14:30:18.000Z")},
{'t': 'loreum', 'u' : NumberLong(400), 'd': ISODate("2013-10-29T14:30:18.000Z")},
..]
},
"li" : {
"ct" : 2,
"it" : [
{'u' : NumberLong(500), 'd': ISODate("2013-10-30T14:30:18.000Z")},
{'u' : NumberLong(501), 'd': ISODate("2013-10-29T14:30:18.000Z")},
..]
},
}
Option #1:
I'm in the process of analyzing this data and I need to query for documents that were "updated" between a period.
i.e., I want to return all objects that were published (p) or had an li.it or co.it item added between '2014-07-01' and '2014-07-03'.
What would be the most performant way of doing this?
Option #2:
Another option that I'm evaluating is whether I want to add an 'u' property with an updated date to account for when the document was updated
(ie., li or co item added)
If I make the change to the process to ensure new documents have this property, how would I iterate through existing documents and add this retroactively?
Would filtering on 'u' be more performant than Option 1? I'm looking at this option as using COPY FROM JSON from a mongoexport

Option #1 (multiple dates)
There isn't a good option to index this, as it looks like you would ideally want a compound index that includes p (date) plus two date arrays (li.it and co.it). A compound index can include at most one array field. Even if you could do this, the index would be very large given the suggested number of dates, and the query would involve checking multiple fields to infer the last updated date.
Option #2 (single updated date)
Adding an indexed u (latest updated date) is definitely a better approach to allow a simple and performant query.
If I make the change to the process to ensure new documents have this property, how would I iterate through existing documents and add this retroactively?
You can use the $exists operator to find documents that do not have this field set yet.
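For the backfill itself, the value of u has to be derived from the dates already in each document. A minimal sketch of that derivation, assuming u should be the latest of p and the d dates in li.it and co.it (the function name latestUpdate is made up, and the sample document is a trimmed copy of the one in the question):

```javascript
// Derives a single "last updated" date from the fields named in the
// question: p (published) plus the d dates of the items in li.it and co.it.
function latestUpdate(doc) {
  const dates = [doc.p]
    .concat(((doc.li || {}).it || []).map((item) => item.d))
    .concat(((doc.co || {}).it || []).map((item) => item.d))
    .filter((d) => d instanceof Date);
  if (dates.length === 0) return null; // nothing to derive from
  return new Date(Math.max(...dates.map((d) => d.getTime())));
}

// Trimmed copy of the sample document, with plain Date objects.
const sample = {
  p: new Date("2013-10-27T14:30:18.000Z"),
  co: { it: [{ d: new Date("2013-10-28T14:30:18.000Z") }, { d: new Date("2013-10-29T14:30:18.000Z") }] },
  li: { it: [{ d: new Date("2013-10-30T14:30:18.000Z") }, { d: new Date("2013-10-29T14:30:18.000Z") }] },
};

// In the mongo shell this could be applied to documents matched by
// { u: { $exists: false } } and written back with $set (not run here).
console.log(latestUpdate(sample).toISOString());
```

On a collection of this size the backfill would presumably have to run in batches, and the u field would only be worth indexing once it is populated.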
Caveat on hashed shard key
To elaborate on Neil's comment: a hashed shard key gives you good write distribution at the expense of being able to do range queries (all queries become scatter-gather). If your common queries are range-based on date (and you are concerned about performance) then you could possibly choose a more appropriate shard key to support those queries. However, since shard keys are immutable and you want to query on an "updated" date, it doesn't sound like a change of shard key will help your use case.

MongoDB complex indices

I'm trying to understand how to best work with indices in MongoDB. Let's say that I have a collection of documents like this one:
{
_id: 1,
keywords: ["gap", "casual", "shorts", "oatmeal"],
age: 21,
brand: "Gap",
color: "Black",
gender: "female",
retailer: "Gap",
style: "Casual Shorts",
student: false,
location: "US",
}
and I regularly run a query to find all documents that match a set of criteria for each of those fields, something like:
db.items.find({ age: { $gt: 13, $lt: 40 },
brand: { $in: ['Gap', 'Target'] },
retailer: { $in: ['Gap', 'Target'] },
gender: { $in: ['male', 'female'] },
style: { $in: ['Casual Shorts', 'Jeans']},
location: { $in: ['US', 'International'] },
color: { $in: ['Black', 'Green'] },
keywords: { $all: ['gap', 'casual'] }
})
I'm trying to figure what sort of index I can create to improve the speed of a query such as this. Should I create a compound index like this:
db.items.ensureIndex({ age: 1, brand: 1, retailer: 1, gender: 1, style: 1, location: 1, color: 1, keywords: 1})
or is there a better set of indices I can create to optimize this query?

Should I create a compound index like this:
db.items.ensureIndex({age: 1, brand: 1, retailer: 1, gender: 1, style: 1, location: 1, color: 1, keywords: 1})
You can create an index like the one above, but you're indexing almost the entire collection. Indexes take space; the more fields in the index, the more space is used, usually in RAM (although indexes can be swapped out). They also incur a write penalty.
Your index seems wasteful, since indexing just a few of those fields will probably let MongoDB scan a set of documents that is close to the expected result of the find operation.
Is there a better set of indices I can create to optimize this query?
Like I said before, probably yes. But this question is very difficult to answer without knowing details of the collection, like the number of documents it has, which values each field can have, how those values are distributed in the collection (50% gender male, 50% gender female?), how they correlate to each other, etc.
There are a few indexing strategies, but normally you should strive to create indexes with high selectivity. Choose "small" field combinations that will help MongoDB locate the desired documents scanning a "reasonable" amount of them. Again, "small" and "reasonable" will depend on the characteristics of the collection and query you are performing.
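A rough way to compare the selectivity of individual conditions is to measure what fraction of a sample each one matches. The sketch below uses four invented documents, so the numbers mean nothing for your collection; matchFraction is a made-up helper name.

```javascript
// Fraction of sample documents that satisfy a predicate.
// A lower fraction means a more selective condition.
function matchFraction(docs, predicate) {
  if (docs.length === 0) return 0;
  return docs.filter(predicate).length / docs.length;
}

// Invented sample; a real test would use documents from the collection.
const docs = [
  { brand: "Gap", gender: "female" },
  { brand: "Acme", gender: "male" },
  { brand: "Zed", gender: "female" },
  { brand: "Acme", gender: "male" },
];

const brandFraction = matchFraction(docs, (d) => ["Gap", "Target"].includes(d.brand));
const genderFraction = matchFraction(docs, (d) => ["male", "female"].includes(d.gender));

console.log({ brandFraction, genderFraction });
```

In this made-up sample the gender condition matches everything, so indexing it would not narrow the scan, while the brand condition does.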
Since this is a fairly complex subject, here are some references that should help you build more appropriate indexes.
http://emptysqua.re/blog/optimizing-mongodb-compound-indexes/
http://docs.mongodb.org/manual/faq/indexes/#how-do-you-determine-what-fields-to-index
http://docs.mongodb.org/manual/tutorial/create-queries-that-ensure-selectivity/
And use cursor.explain to evaluate your indexes.
http://docs.mongodb.org/manual/reference/method/cursor.explain/

A large index like this one will penalize you on writes. It is better to index just what you need and let MongoDB's optimiser do most of the work for you. You can always give it a hint or, as a last resort, reindex if your application or data usage changes drastically.
Your query will use the index for fields that have one (fast), and use a table scan (slow) on the remaining documents.
Depending on your application, a few standalone indexes could be better. Adding more indexes will not necessarily improve performance. With the write penalty, it could even make it worse (YMMV).
Here is a basic algorithm for selecting fields to put in an index:
What single field is in a query the most often?
If that single field is present in a query, will a table scan be expensive?
What other field could you index to further reduce the table scan?
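The first question in that list can be answered mechanically if you have a list of the filters your application runs. A small sketch, with invented filters and a made-up helper name (fieldFrequency); only top-level field names are counted.

```javascript
// Counts how often each top-level field appears across a list of
// query filters, most frequent first.
function fieldFrequency(queries) {
  const counts = {};
  for (const q of queries) {
    for (const field of Object.keys(q)) {
      if (field.startsWith("$")) continue; // skip operators such as $and / $or
      counts[field] = (counts[field] || 0) + 1;
    }
  }
  return Object.entries(counts).sort((a, b) => b[1] - a[1]);
}

// Invented filters standing in for an application's real queries.
const ranking = fieldFrequency([
  { brand: "Gap", age: { $gt: 13 } },
  { brand: "Target", color: "Black" },
  { brand: "Gap", age: { $lt: 40 }, gender: "female" },
]);

console.log(ranking);
```

The field at the top of the ranking is the first candidate for an index; the remaining two questions still need explain output on real data.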

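This index seems very reasonable for your query. If the query also projects only indexed fields, MongoDB can treat it as a covered query for this index, since there is no need to access the documents: all data can be fetched from the index. Note, however, that keywords is an array, and a multikey index cannot cover a query on its array field.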
from the docs:
"Because the index “covers” the query, MongoDB can both match the query conditions and return the results using only the index; MongoDB does not need to look at the documents, only the index, to fulfill the query. An index can also cover an aggregation pipeline operation on unsharded collections."
Some remarks:
This index will only be used by queries that include a filter on age. A query that only filters by brand or retailer will probably not use this index.
Adding an index on only one or two of the most selective fields of your query will already bring a very significant performance boost. The more fields you add the larger the index size will be on disk.
You may want to generate some random sample data and measure the performance of this with different indexes or sets of indexes. This is obviously the safest way to know.
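The first remark is the index prefix rule. As a simplified model of it (the real planner has more cases, and usablePrefixLength is a made-up helper), you can count how many leading index keys a filter constrains:

```javascript
// Counts how many leading index keys are constrained by the query filter.
// Zero means the filter does not touch the index prefix at all.
function usablePrefixLength(indexSpec, filter) {
  let n = 0;
  for (const key of Object.keys(indexSpec)) {
    if (!(key in filter)) break; // first missing key ends the usable prefix
    n += 1;
  }
  return n;
}

// The compound index from the question.
const index = { age: 1, brand: 1, retailer: 1, gender: 1, style: 1, location: 1, color: 1, keywords: 1 };

const withoutAge = usablePrefixLength(index, { brand: "Gap", retailer: "Gap" });
const withAge = usablePrefixLength(index, { age: { $gt: 13 }, brand: "Gap", color: "Black" });

console.log({ withoutAge, withAge });
```

A filter without age gets a prefix length of 0, which matches the remark that such queries will probably not use this index.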
