How can I create a compound index in MongoDB when one of the fields may be absent or null?
For example, given the documents below, I want a compound index on name + age. How can I still achieve this when age is missing or null in some documents?
{
name: "Anurag",
age: "21",
},
{
name: "Nitin",
},
You can create a partial index as follows:
db.contacts.createIndex(
{ name: 1, age: 1 },
{ partialFilterExpression: { age: { $exists: true } } }
)
Explained:
As per the documentation, partial indexes only index the documents in a collection that meet a specified filter expression. By indexing a subset of the documents in a collection, partial indexes have lower storage requirements and reduced performance costs for index creation and maintenance. For example, if your collection has 100k documents but only 5 of them contain the "age" field, the partial index will hold entries for just those 5 documents, which saves index storage space and keeps index maintenance cheap.
For the query optimizer to choose this partial index, the query predicate must include a condition on the name field as well as a non-null match on the age field.
The following example queries will be able to use the index:
db.contacts.find({name:"John",age:{$gt:20}})
db.contacts.find({name:"John",age:30})
The following example query is a "covered query" when the index contains both name and age:
db.contacts.find({name:"John",age:30},{_id:0,name:1,age:1})
(This query is highly efficient because it returns the data directly from the index.)
The following example queries will not be able to use the index. The first three may match documents without an age field, which the partial index does not contain, and the last one has no condition on the leading name field:
db.contacts.find({name:"John"})
db.contacts.find({name:"John",age:{$exists:false}})
db.contacts.find({name:"John",age:null})
db.contacts.find({age:20})
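The rule behind the age conditions in these lists can be sketched in plain JavaScript. This is a simplified model, not MongoDB's actual planner logic, and the function name impliesAgeExists is made up for illustration. It only asks whether a predicate on age guarantees that the field exists, which is what the partial filter requires:

```javascript
// Simplified model (NOT MongoDB's planner): does a predicate on `age`
// guarantee that the field exists, so that an index filtered on
// { age: { $exists: true } } cannot miss any matching document?
function impliesAgeExists(agePredicate) {
  if (agePredicate === undefined || agePredicate === null) {
    // No condition at all, or a null match (null also matches missing fields).
    return false;
  }
  if (typeof agePredicate !== "object") {
    // Equality on a concrete value, e.g. age: 30.
    return true;
  }
  if (agePredicate.$exists === false) return false;
  if (agePredicate.$exists === true) return true;
  // A range comparison against a non-null bound only matches present values.
  const rangeOps = ["$gt", "$gte", "$lt", "$lte"];
  return rangeOps.some((op) => op in agePredicate && agePredicate[op] !== null);
}

console.log(impliesAgeExists({ $gt: 20 }));        // true
console.log(impliesAgeExists(30));                 // true
console.log(impliesAgeExists(null));               // false
console.log(impliesAgeExists({ $exists: false })); // false
```

The model is deliberately conservative: operators it does not know about (for example $in) are treated as not implying existence.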
Please note that you should analyze whether you really need to search on age together with name. The name field already has very good selectivity, and this index will not be used when you search by age alone. If looking up contacts by age is a likely use case, consider creating an additional sparse or partial index on the age field only.
Collection: appointments
Schema:
{
_id: ObjectId();
userId: string;
calType: string;
status: string;
appointment_start_date_time: string; //UTC ISO string
appointment_end_date_time: string; //UTC ISO string
}
Example:
{
_id: ObjectId('6332b21960f8083d24f3140b')
userId: "6272ccb3-4050-429c-b427-eb104f340962"
calType: "MY Personal Cal"
status: "CONFIRMED"
appointment_start_date_time: "2022-07-08T03:30:00.000Z"
appointment_end_date_time: "2022-07-08T04:00:00.000Z"
}
I want to create a compound index on userId, calType, status, and appointment_start_date_time.
I would like to determine the arrangement of my keys based on MongoDB's ESR (Equality, Sort, Range) rule.
The documentation conveniently gives an example of three keys in a compound index, where the first key is for equality, the second for sort, and the third for range. But in my case I have more than three keys.
I would like to know how the index keys should be arranged for the most efficient compound index. In my case userId, calType, and status will be used for equality matches, whereas appointment_start_date_time will be used for sorting.
Potential queries which I will be making on this collection will be:
1. All appointments where userId = x, calType = y, status = z, sorted by appointment_start_date_time ASC
2. All appointments where userId = x, status = z
3. All appointments where calType = y, status = z
4. All appointments where userId = x, sorted by appointment_start_date_time ASC or DESC
What is the standard when we have multiple keys for equality and one for sorting/range?
None of your sample queries use a ranged filter. Assuming none of these fields contain arrays, applying the ESR rule:
Queries 1 and 2 could be optimally served by an index on
{userId:1, status:1, calType:1, appointment_start_date_time:1}
Query 3 would be best served by this index:
{calType:1, status:1}
Query 4 would be best served by:
{userId:1, appointment_start_date_time:1}
In these optimal cases, the MongoDB server could seek to the first matching index key, scan to the last key in a single pass, and encounter the documents in already sorted order.
It may also be possible to get acceptable performance for queries 1, 2, and 4 using the index:
{userId:1, appointment_start_date_time:1, status:1, calType:1}
Using this index, query 4 would still be optimal, but queries 1 and 2 would require an additional index seek for each status/calType pair. This would be somewhat less performant than the optimal case, but would still be better than an in-memory sort.
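As a rough sketch of the reasoning above, the following plain JavaScript (no MongoDB connection needed) arranges fields in Equality, Sort, Range order and checks whether an index can return results already sorted. Both helper names, buildEsrIndex and indexProvidesSort, are hypothetical, and the check is a simplification that ignores sort direction, arrays, and other planner details:

```javascript
// Build an index spec with equality fields first, then sort, then range (ESR).
function buildEsrIndex({ equality = [], sort = [], range = [] }) {
  const spec = {};
  for (const field of [...equality, ...sort, ...range]) {
    spec[field] = 1;
  }
  return spec;
}

// Simplified check: the index yields sorted output if every index field that
// precedes the sort field is pinned by an equality condition in the query.
function indexProvidesSort(indexFields, equalityFields, sortField) {
  for (const field of indexFields) {
    if (field === sortField) return true;
    if (!equalityFields.includes(field)) return false;
  }
  return false;
}

const sortKey = "appointment_start_date_time";
const wideIndex = Object.keys(
  buildEsrIndex({ equality: ["userId", "status", "calType"], sort: [sortKey] })
);

// Query 1: equality on all three fields, then sort.
console.log(indexProvidesSort(wideIndex, ["userId", "calType", "status"], sortKey)); // true
// Query 4: equality on userId only; status sits between userId and the sort key.
console.log(indexProvidesSort(wideIndex, ["userId"], sortKey)); // false
// Query 4 against {userId, appointment_start_date_time}.
console.log(indexProvidesSort(["userId", sortKey], ["userId"], sortKey)); // true
```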
This is executed immediately:
db.mycollection.find({ strField: 'AAA'}).count()
And this takes a long time to finish:
db.mycollection.find({ strField: 'AAA', dateTimeField: { $exists: true }}).count()
This is how I created my index:
db.mycollection.createIndex({strField: 1, dateTimeField: 1}, { sparse: true })
But it doesn't help, even when using hint(indexName).
Why does this happen, and how can I fix it?
The { $exists: true } query predicate is problematic, especially if there are documents in the collection for which that field does not exist.
When MongoDB creates an index entry for a document, it collects all of the field values according to the index spec, and concatenates them.
If a field is not present in the document, the index stores null in that field's position.
If the field is explicitly set to null, it also stores null in that field's position.
This means that these 2 documents will have identical entries in the index:
{ strField: 'AAA', dateTimeField: null}
{ strField: 'AAA'}
Note that even with the index being sparse, both documents will be indexed, since at least one of the indexed fields exists in each document.
When testing {dateTimeField:{$exists:true}}, the first document will match, while the second will not.
When processing a count query using an index, if the query can be satisfied by scanning a single range of the index, the query executor can use a COUNT_SCAN stage and get the correct result without loading a single document from disk.
Because the executor cannot definitively tell from the index whether or not the field exists, it cannot use a COUNT_SCAN for the $exists query. It must instead use an ordinary IXSCAN followed by a FETCH stage, loading all of the matching documents from disk in order to arrive at the correct count.
In the case of the first query, the executor would have been able to use a COUNT_SCAN, while the second would have had to examine all of the documents. You should be able to see this by running explain with the executionStats option on each query.
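The key collision described above can be illustrated with a small plain-JavaScript model. This is only a sketch of the idea, not how MongoDB actually encodes index keys, and the helper names indexKey and fieldExists are invented for the example:

```javascript
// Model of index key generation: a missing field is stored as null,
// exactly like a field that is explicitly set to null.
function indexKey(doc, fields) {
  return fields.map((field) => (doc[field] === undefined ? null : doc[field]));
}

// Model of { $exists: true }: true only if the document really has the field.
function fieldExists(doc, field) {
  return Object.prototype.hasOwnProperty.call(doc, field);
}

const withNull = { strField: "AAA", dateTimeField: null };
const withoutField = { strField: "AAA" };
const indexFields = ["strField", "dateTimeField"];

console.log(JSON.stringify(indexKey(withNull, indexFields)));     // ["AAA",null]
console.log(JSON.stringify(indexKey(withoutField, indexFields))); // ["AAA",null]
console.log(fieldExists(withNull, "dateTimeField"));     // true
console.log(fieldExists(withoutField, "dateTimeField")); // false
```

The two documents produce the same key, yet only one satisfies the $exists test, so the key alone cannot answer the question and the document has to be fetched.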
One way to avoid this pitfall is to take advantage of the fact that MongoDB comparison operators are type-sensitive: a range comparison against a date only matches values that are themselves dates. This means that the following query will match any document where dateTimeField is a date at or after the Unix epoch (1970-01-01):
db.mycollection.find({ strField: 'AAA', dateTimeField: { $gte: new ISODate("1970-01-01T00:00:00Z") }}).count()
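This will allow the query executor to count all of the documents that have the matching string and contain a date, but will exclude documents where dateTimeField is missing, null, or holds a numeric or string value. Note that it also excludes any dates before 1970, so pick an earlier lower bound if your data can contain them.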
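The type-sensitive matching can be mimicked in plain JavaScript. The function matchesGteDate below is a hypothetical stand-in for how a $gte comparison against a date behaves, written only to show which values pass; it is not MongoDB code:

```javascript
// Model of { $gte: <date> }: only values that are dates can match,
// so null, missing, numeric, and string values are all excluded.
function matchesGteDate(value, lowerBound) {
  return value instanceof Date && value.getTime() >= lowerBound.getTime();
}

const epoch = new Date("1970-01-01T00:00:00Z");

console.log(matchesGteDate(new Date("2022-07-08T03:30:00.000Z"), epoch)); // true
console.log(matchesGteDate(null, epoch));                       // false
console.log(matchesGteDate(undefined, epoch));                  // false
console.log(matchesGteDate(1657251000000, epoch));              // false
console.log(matchesGteDate("2022-07-08T03:30:00.000Z", epoch)); // false
```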
I am working on optimising my queries in mongodb.
In a normal SQL query there is an order in which the where clauses are applied. For example, in select * from employees where department="dept1" and floor=2 and sex="male", first department="dept1" is applied, then floor=2, and lastly sex="male".
I was wondering whether it happens in a similar way in MongoDB.
E.g.
DBObject search = new BasicDBObject("department", "dept1").append("floor", 2).append("sex", "male");
Here, which match clause will be applied first? Or does MongoDB in fact not work in this manner at all?
This question basically arises from my background with SQL databases.
Please help.
If there are no indexes, MongoDB has to scan the full collection (a collection scan) in order to find the required documents. In your case, if you want the conditions applied in the order [department, floor, sex], you should create this compound index:
db.employees.createIndex( { "department": 1, "floor": 1, "sex" : 1 } )
From the documentation: https://docs.mongodb.org/manual/core/index-compound/
db.products.createIndex( { "item": 1, "stock": 1 } )
The order of the fields in a compound index is very important. In the previous example, the index will contain references to documents sorted first by the values of the item field and, within each value of the item field, sorted by values of the stock field.
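The ordering described in the quoted passage can be sketched by sorting a few in-memory documents the way a compound index would order its entries. This is plain JavaScript with invented sample data, and the comparator compareByFields is a hypothetical helper rather than anything MongoDB provides:

```javascript
// Compare two documents field by field, in the order the index lists them.
function compareByFields(fields) {
  return (a, b) => {
    for (const field of fields) {
      if (a[field] < b[field]) return -1;
      if (a[field] > b[field]) return 1;
    }
    return 0;
  };
}

// Hypothetical sample data.
const products = [
  { item: "pen", stock: 40 },
  { item: "ink", stock: 7 },
  { item: "pen", stock: 5 },
  { item: "ink", stock: 90 },
];

// Like { item: 1, stock: 1 }: grouped by item, ordered by stock inside each group.
const byItemThenStock = [...products].sort(compareByFields(["item", "stock"]));
// Like { stock: 1, item: 1 }: ordered by stock, so items are interleaved.
const byStockThenItem = [...products].sort(compareByFields(["stock", "item"]));

console.log(byItemThenStock.map((p) => `${p.item}:${p.stock}`).join(" ")); // ink:7 ink:90 pen:5 pen:40
console.log(byStockThenItem.map((p) => `${p.item}:${p.stock}`).join(" ")); // pen:5 ink:7 pen:40 ink:90
```

With the first ordering, all entries for one item sit next to each other, which is why a condition on the leading field narrows the scan so effectively.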
I have this Document:
class Store(Document):
    store_id = IntField(required=True)
    items = ListField(ReferenceField(Item, required=True))
    meta = {
        'indexes': [
            {
                'fields': ['campaign_id'],
                'unique': True
            },
            {
                'fields': ['items']
            }
        ]
    }
I want to set up indexes on items and store_id. Is my configuration right?
Your second index declaration looks like it should do what you want. But to make sure that the index is really effective, you should use explain. Connect to your database with the mongo shell and perform a find-query which should use that index followed by .explain(). Example:
db.yourCollection.find({items:"someItem"}).explain();
The output will be a document with lots of fields. The documentation explains what exactly each field means. Pay special attention to these fields:
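millis: time in milliseconds the query required
indexOnly: true when the query was answered from the index alone
n: number of returned documents
nscannedObjects: the number of documents that had to be fetched and examined. For a query that is well served by an index, this should be equal to n (and 0 for an index-only query). When it is higher than n, some documents could not be excluded by the index and had to be examined individually.
(These names come from the legacy explain output. Since MongoDB 3.0 the corresponding values are reported under executionStats as executionTimeMillis, nReturned, totalDocsExamined, and totalKeysExamined.)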
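As a small sketch of how to read these numbers, the following plain JavaScript summarizes an explain result. It assumes the newer executionStats layout (MongoDB 3.0 and later) with the fields executionTimeMillis, nReturned, totalKeysExamined, and totalDocsExamined. The function summarizeExplain and the sample object are made up for illustration, and the "covered" flag is only a heuristic:

```javascript
// Summarize the executionStats section of an explain("executionStats") result.
function summarizeExplain(explainOutput) {
  const stats = explainOutput.executionStats;
  return {
    millis: stats.executionTimeMillis,
    returned: stats.nReturned,
    keysExamined: stats.totalKeysExamined,
    docsExamined: stats.totalDocsExamined,
    // Heuristic: results were returned without fetching any document.
    covered: stats.nReturned > 0 && stats.totalDocsExamined === 0,
    // Close to 1 means the index excluded almost everything irrelevant.
    docsPerResult:
      stats.nReturned === 0 ? null : stats.totalDocsExamined / stats.nReturned,
  };
}

// Hypothetical explain output, trimmed to the fields used above.
const sampleExplain = {
  executionStats: {
    nReturned: 10,
    executionTimeMillis: 3,
    totalKeysExamined: 10,
    totalDocsExamined: 10,
  },
};

console.log(summarizeExplain(sampleExplain).docsPerResult); // 1
console.log(summarizeExplain(sampleExplain).covered);       // false
```

In the shell, the input would come from something like db.yourCollection.find({items:"someItem"}).explain("executionStats").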
Let's say I have the following document schema:
{
_id: ObjectId(...),
name: "Kevin",
weight: 500,
hobby: "scala",
favoriteFood : "chicken",
pet: "parrot",
favoriteMovie : "Diehard"
}
If I create a compound index on name-weight, I will be able to specify a strict parameter (name == "Kevin"), and a range on weight (between 50 and 200). However, I would not be able to do the reverse: specify a weight and give a "range" of names.
Of course compound index order matters where a range query is involved.
If only exact queries will be used (example: name == "Kevin", weight == 100, hobby == "C++"), then does the order actually matter for compound indexes?
When you have an exact query, the order should not matter. But when you want to be sure, the .explain() method on database cursors is your friend. It can tell you which indexes are used and how they are used when you perform a query in the mongo shell.
Important fields of the document returned by explain are:
indexOnly: when it's true, the query was completely covered by the index
n and nscanned: the first is the number of documents found, the second is the number of index entries that had to be examined to find them. The latter shouldn't be notably higher than the former.
millis: number of milliseconds the query took to perform
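To see why the order is not expected to matter for exact matches on every indexed field, here is a plain-JavaScript model of a compound index as a sorted list of keys. The data and the helper names compareKeys and equalityMatch are invented for the sketch, and it says nothing about real planner costs:

```javascript
// Compare two index keys position by position.
function compareKeys(a, b) {
  for (let i = 0; i < a.length; i++) {
    if (a[i] < b[i]) return -1;
    if (a[i] > b[i]) return 1;
  }
  return 0;
}

// Build sorted index keys for the given field order, then locate the entries
// that match an exact query on all fields and check they form one run.
function equalityMatch(docs, fields, query) {
  const entries = docs.map((doc) => fields.map((f) => doc[f])).sort(compareKeys);
  const target = fields.map((f) => query[f]);
  const positions = [];
  entries.forEach((entry, i) => {
    if (compareKeys(entry, target) === 0) positions.push(i);
  });
  const contiguous =
    positions.length === 0 ||
    positions[positions.length - 1] - positions[0] + 1 === positions.length;
  return { count: positions.length, contiguous };
}

// Hypothetical sample data.
const people = [
  { name: "Kevin", weight: 100, hobby: "C++" },
  { name: "Kevin", weight: 100, hobby: "scala" },
  { name: "Alice", weight: 100, hobby: "C++" },
  { name: "Kevin", weight: 150, hobby: "C++" },
  { name: "Kevin", weight: 100, hobby: "C++" },
];
const exactQuery = { name: "Kevin", weight: 100, hobby: "C++" };

console.log(equalityMatch(people, ["name", "weight", "hobby"], exactQuery)); // { count: 2, contiguous: true }
console.log(equalityMatch(people, ["hobby", "weight", "name"], exactQuery)); // { count: 2, contiguous: true }
```

Either field order leaves the matching entries in a single contiguous run of the same size, so one seek plus a short scan is enough in both cases. The order starts to matter as soon as a query uses only some of the fields or adds a range or sort.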