I have a Document:
class Store(Document):
    store_id = IntField(required=True)
    items = ListField(ReferenceField(Item, required=True))
    meta = {
        'indexes': [
            {
                'fields': ['campaign_id'],
                'unique': True
            },
            {
                'fields': ['items']
            }
        ]
    }
I want to set up indexes on items and store_id. Is my configuration right?
Your second index declaration looks like it should do what you want. Note, though, that the first one indexes campaign_id, a field the document does not define; if it was meant to cover store_id, it needs to name that field. To make sure an index is really effective, use explain: connect to your database with the mongo shell and run a find query that should use that index, followed by .explain(). Example:
db.yourCollection.find({items:"someItem"}).explain();
The output will be a document with lots of fields. The documentation explains what exactly each field means. Pay special attention to these fields:
millis: time in milliseconds the query required
indexOnly: true when the query was answered from the index alone
n: number of returned documents
nscannedObjects: the number of documents that had to be examined. For a query that is well served by an index, this should be close to n. When it is much higher, some documents could not be excluded by an index and had to be scanned individually.
Related
How can you create an index in MongoDB that only indexes the presence of a field (whether it is null/undefined or actually has a value)? I understand null is considered a value, but in this case I would like to treat null the same as undefined, so that the index only cares whether the property is null/undefined or not.
Example:
I have a data structure with a very big internal property:
document = {
    _id: 1,
    info: { /* a fairly large object here */ }
}
index = { info: 1 }
If I were to make a query as follows:
collection.find({ info: null })
This would work fine in terms of search performance. The issue is that the object is large, so the index ends up very large (hundreds of MB with only 500k documents).
I have looked into conditional indexes (partial and sparse), but they only work if you want to index a subset of documents. I want to index all documents, but only on the presence of the field, not its entire value.
The other practical option is adding a boolean flag hasInfo: Boolean and indexing just that, but I would have to back-populate data and keep a redundant property on the collection.
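The flag-based workaround could be sketched like this. It is a minimal illustration in plain Python with no database involved; with_has_info is a hypothetical helper name, and hasInfo is the flag mentioned above.

```python
def with_has_info(doc):
    """Return a copy of doc with a boolean flag derived from `info`.

    A missing `info` and an explicit None are treated the same way.
    """
    flagged = dict(doc)
    flagged["hasInfo"] = doc.get("info") is not None
    return flagged


docs = [
    {"_id": 1, "info": {"a": 1}},  # has a value
    {"_id": 2, "info": None},      # explicit null
    {"_id": 3},                    # field missing
]
flagged_docs = [with_has_info(d) for d in docs]
```

An index on hasInfo alone would then only need to hold one boolean per document instead of the whole info object.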
This is executed immediately:
db.mycollection.find({ strField: 'AAA'}).count()
And this takes a long time to finish:
db.mycollection.find({ strField: 'AAA', dateTimeField: { $exists: true }}).count()
This is how I created my index:
db.mycollection.createIndex({strField: 1, dateTimeField: 1}, { sparse: true })
But it is still slow, even when using hint(indexName).
Why does this happen, and how can I fix it?
The { $exists: true } query predicate is problematic, especially if there are documents in the collection for which that field does not exist.
When MongoDB creates an index entry for a document, it collects all of the field values according to the index spec, and concatenates them.
If a field is not present in the document, the index stores null in that field's position.
If the field is explicitly set to null, it also stores null in that field's position.
This means that these 2 documents will have identical entries in the index:
{ strField: 'AAA', dateTimeField: null}
{ strField: 'AAA'}
Note that even with the index being sparse, both documents will be indexed, since at least one of the indexed fields exists in each document.
When testing {dateTimeField:{$exists:true}}, the first document will match, while the second will not.
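The point about identical index entries can be illustrated with a toy model in Python. This is only a sketch of the idea, not MongoDB's actual key format; index_key and is_indexed_sparse are hypothetical helpers written for this illustration.

```python
def index_key(doc, fields):
    """Toy index entry: one slot per indexed field, None (null) when the field is missing."""
    return tuple(doc.get(field) for field in fields)


def is_indexed_sparse(doc, fields):
    """Toy sparse rule for a compound index: keep the entry if any indexed field is present."""
    return any(field in doc for field in fields)


fields = ["strField", "dateTimeField"]
explicit_null = {"strField": "AAA", "dateTimeField": None}
missing_field = {"strField": "AAA"}

key_a = index_key(explicit_null, fields)
key_b = index_key(missing_field, fields)
same_entry = key_a == key_b  # the index cannot tell the two apart

# $exists: true can only be decided by looking at the documents themselves.
exists_a = "dateTimeField" in explicit_null
exists_b = "dateTimeField" in missing_field
```

In this model both documents produce the same key, yet only the first one has the field, which is why the index alone is not enough to answer the $exists test.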
When processing a count query using an index, if the query can be satisfied by scanning a single range of the index, the query executor can use a count_scan stage, and get the correct result without loading a single document from disk.
Because the executor cannot definitively tell from the index whether or not the field exists, it cannot use a count_scan, and must instead use an ordinary ixscan followed by a fetch stage, and load all of the matching documents from disk in order to arrive at the correct count.
In the case of the first query, the executor would have been able to use a count_scan, while the second would have had to examine all of the documents. You should be able to see this by running explain with the executionStats option on each query.
One way to avoid this pitfall is to take advantage of the fact that MongoDB comparison operators are type-sensitive. This means that the following query will match only documents where dateTimeField is a date at or after epoch 0:
db.mycollection.find({ strField: 'AAA', dateTimeField: { $gte: new ISODate("1970-01-01T00:00:00Z") }}).count()
This will allow the query executor to count all of the documents that have the matching string and contain a date, but will exclude documents that contain a dateTimeField with a numeric or string value.
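The same filter could be built from Python. This is a sketch that assumes a driver such as pymongo, which accepts plain dictionaries and datetime values; date_exists_filter is a hypothetical helper, and no database call is made here.

```python
from datetime import datetime, timezone

# Lower bound used in place of {$exists: true}: the Unix epoch in UTC.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def date_exists_filter(str_value):
    """Build a filter matching documents whose dateTimeField is a date at or after the epoch."""
    return {"strField": str_value, "dateTimeField": {"$gte": EPOCH}}


query = date_exists_filter("AAA")
# With pymongo, this dictionary could be passed to
# collection.count_documents(query).
```

Note that dates earlier than 1970 would not satisfy this bound, so the lower limit may need adjusting for such data.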
I have a query like this:
xml_db.find(
    {
        'high_performer': {
            '$nin': [some_value]
        },
        'low_performer': {
            '$nin': [some_value]
        },
        'expiration_date': {
            '$gte': datetime.now().strftime('%Y-%m-%d')
        },
        'source': 'some_value'
    }
)
I have tried to create an index with those fields, but I get this error:
pymongo.errors.OperationFailure: cannot index parallel arrays [low_performer] [high_performer]
So, how to efficiently run this query?
Compound index ordering should follow the equality --> sort --> range rule. A good description of this can be found in this response.
This means that the first field in the index would be source, followed by the range filters (expiration_date, low_performer and high_performer).
As you noticed, one of the "performer" fields cannot be included in the index, since a compound index can cover at most one array field per document. You should use your knowledge of the data set to determine which filter (low_performer or high_performer) would be more selective and choose that filter to be included in the index.
Assuming that high_performer is more selective, the only remaining step would be to determine the ordering between expiration_date and high_performer. Again, you should use your knowledge of the data set to make this determination based on selectivity.
Assuming expiration_date is more selective, the index to create would then be:
{ "source" : 1, "expiration_date" : 1, "high_performer" : 1 }
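As a sketch, the key order above could be assembled in Python like this. esr_index_keys is a hypothetical helper; the resulting list uses the (field, direction) pair format that pymongo accepts for index keys.

```python
def esr_index_keys(equality_fields, range_fields):
    """Order index keys with equality fields first and range fields last.

    There are no sort fields in this query, so that middle step is skipped.
    """
    return [(name, 1) for name in equality_fields + range_fields]


index_keys = esr_index_keys(
    ["source"],
    # low_performer is left out: only one array field can go in the index.
    ["expiration_date", "high_performer"],
)
# With pymongo this list could be passed to collection.create_index(index_keys).
```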
I am using the below query on my MongoDB collection which is taking more than an hour to complete.
db.collection.find({language:"hi"}).sort({_id:-1}).skip(5000).limit(1)
I am trying to get the results in batches of 5000 to process, in either ascending or descending order, for documents with "hi" as the value of the language field. So I am using this query, in which I skip the processed documents every time by incrementing the "skip" value.
The document count in this collection is just above 20 million.
An index on the field "language" is already created.
The MongoDB version I am using is 2.6.7.
Is there a more appropriate index for this query which can get the result faster?
When you want to sort descending, you should create a multi-field index which uses the field(s) you sort on as descending field(s). You do that by setting those field(s) to -1.
This index should greatly increase the performance of your sort:
db.collection.ensureIndex({ language: 1, _id: -1 });
When you also want to speed up the other case - retrieving sorted in ascending order - create a second index like this:
db.collection.ensureIndex({ language: 1, _id: 1 });
Keep in mind that when you do not sort your results, you receive them in natural order. Natural order is often insertion order, but there is no guarantee for that. There are various events which can cause the natural order to get messed up, so when you care about the order you should always sort explicitly. The only exception to this rule are capped collections which always maintain insertion order.
In order to efficiently "page" through results in the way that you want, it is better to use a "range query" and keep the last value you processed.
Your desired "sort key" here is _id, so that makes things simple.
First you want your index in the correct order. Create it with .createIndex(), which is the non-deprecated method:
db.collection.createIndex({ "language": 1, "_id": -1 })
Then you want to do some simple processing, from the start:
var lastId = null;
var cursor = db.collection.find({language:"hi"});
cursor.sort({_id:-1}).limit(5000).forEach(function(doc) {
    // do something with your document. But always set the next line
    lastId = doc._id;
})
That's the first batch. Now when you move on to the next one:
var cursor = db.collection.find({ "language":"hi", "_id": { "$lt": lastId } });
cursor.sort({_id:-1}).limit(5000).forEach(function(doc) {
    // do something with your document. But always set the next line
    lastId = doc._id;
})
So that the lastId value is always considered when making the selection. You store this between each batch, and continue on from the last one.
That is much more efficient than processing with .skip(), which regardless of the index will "still" need to "skip" through all data in the collection up to the skip point.
Using the $lt operator here "filters" all the results you already processed, so you can move along much more quickly.
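The whole loop can be sketched in plain Python, with an in-memory list standing in for the collection. next_batch is a hypothetical helper that plays the role of the find/sort/limit call; the sample data is made up for the illustration.

```python
def next_batch(docs, language, last_id=None, size=5000):
    """Return the next batch in descending _id order, starting below last_id."""
    matching = [
        d for d in docs
        if d["language"] == language and (last_id is None or d["_id"] < last_id)
    ]
    matching.sort(key=lambda d: d["_id"], reverse=True)
    return matching[:size]


# Toy data: odd _id values are "hi", even ones are "en".
docs = [{"_id": i, "language": "hi" if i % 2 else "en"} for i in range(1, 21)]

seen = []
last_id = None
while True:
    batch = next_batch(docs, "hi", last_id, size=4)
    if not batch:
        break
    seen.extend(d["_id"] for d in batch)
    last_id = batch[-1]["_id"]  # remember where this batch ended
```

Each pass only carries the last processed _id forward, so no batch has to walk over the documents handled by earlier batches.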
Let's say I have the following document schema:
{
    _id: ObjectId(...),
    name: "Kevin",
    weight: 500,
    hobby: "scala",
    favoriteFood : "chicken",
    pet: "parrot",
    favoriteMovie : "Diehard"
}
If I create a compound index on name-weight, I will be able to specify a strict parameter (name == "Kevin"), and a range on weight (between 50 and 200). However, I would not be able to do the reverse: specify a weight and give a "range" of names.
Of course compound index order matters where a range query is involved.
If only exact queries will be used (example: name == "Kevin", weight == 100, hobby == "C++"), then does the order actually matter for compound indexes?
When you have an exact query, the order should not matter. But when you want to be sure, the .explain() method on database cursors is your friend. It can tell you which indexes are used and how they are used when you perform a query in the mongo shell.
Important fields of the document returned by explain are:
indexOnly: when it's true, the query was completely covered by the index
n and nScanned: The first one tells you the number of found documents, the second how many documents had to be examined because the indexes couldn't sort them out. The latter shouldn't be notably higher than the first.
millis: number of milliseconds the query took to perform
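Why field order should not matter for exact matches can be seen in a toy model of a compound index as a sorted list of key tuples. This is only an illustration in plain Python, not how MongoDB stores its indexes; build_index and exact_lookup are hypothetical helpers and the sample documents are made up.

```python
import bisect


def build_index(docs, fields):
    """Toy compound index: a sorted list of (key tuple, _id) pairs."""
    return sorted((tuple(d[f] for f in fields), d["_id"]) for d in docs)


def exact_lookup(index, fields, query):
    """Find _ids whose key equals the query values via binary search on the sorted keys."""
    key = tuple(query[f] for f in fields)
    keys = [entry[0] for entry in index]
    lo = bisect.bisect_left(keys, key)
    hi = bisect.bisect_right(keys, key)
    return [entry[1] for entry in index[lo:hi]]


docs = [
    {"_id": 1, "name": "Kevin", "weight": 100, "hobby": "C++"},
    {"_id": 2, "name": "Kevin", "weight": 500, "hobby": "scala"},
    {"_id": 3, "name": "Anna", "weight": 100, "hobby": "C++"},
    {"_id": 4, "name": "Kevin", "weight": 100, "hobby": "C++"},
]
query = {"name": "Kevin", "weight": 100, "hobby": "C++"}

order_a = ["name", "weight", "hobby"]
order_b = ["hobby", "weight", "name"]
result_a = exact_lookup(build_index(docs, order_a), order_a, query)
result_b = exact_lookup(build_index(docs, order_b), order_b, query)
```

With equality on every field, either ordering narrows the search to one contiguous run of identical keys, so both lookups return the same documents.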