I have a MongoDB collection, "features", with three fields: name, active, and weight.
I want to sort features by weight in descending order:
db.features.find({active:true},{name:1, weight:1}).sort({weight:-1})
To optimize this, I created an index:
db.features.ensureIndex({'active': 1, 'weight': -1})
Running the query with explain() shows that the index is used.
However, when I query by weight ascending, I assume the index I just created will not be used and that I need to create another index on weight ascending.
Query:
db.features.find({active:true},{name:1, weight:1}).sort({weight:1}).explain()
When I use explain() to see how the index is used, it prints:
"cursor" : "BtreeCursor active_1_weight_-1 reverse",
Does "reverse" mean that the query is optimized by the index?
More generally, do I need to create two indexes, one ascending and one descending on weight, if I sort by weight ascending in some cases and descending in others?
I know I'm late, but I would like to add a little more detail. When you use explain() and it outputs cursor: BtreeCursor, that does not guarantee that only the index was used to satisfy your query. You also have to check the "indexOnly" field in the results of explain(). If indexOnly is true, your query was satisfied using the index only, and the documents in the collection were not read at all. This is called a covered query: http://docs.mongodb.org/manual/applications/indexes/
But if the results of explain are cursor: BtreeCursor and indexOnly: false, it means that in addition to using the index, the collection was also read. In your case, for the query:
db.features.find({active:true},{name:1, weight:1}).sort({weight:1}).explain()
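Mongo uses the index 'active': 1, 'weight': -1 to satisfy the initial part of the query, i.e. db.features.find({active:true}), and the "reverse" in the cursor name means it walks that same index backwards to return the results in ascending weight order, so no separate in-memory sort is needed (scanAndOrder would be true if one were). However, the projection asks for name, which is not in the index, so the documents themselves still have to be fetched. To know whether a query was answered from the index alone, you have to look at the indexOnly result within explain().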
As you can see from this document, when explain() outputs BtreeCursor, it means that an index was used. When an index is used, indexBounds will be set to indicate the key bounds for scanning in the index. However, if the output showed BasicCursor, it would indicate a table-scan-style operation.
So based on what you've said, from the explain() results, you can see that you're using a BTree Cursor on the index named active_1_weight_-1 and the reverse means that you're iterating over the index in reverse order.
So no, you don't need to create separate indexes.
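To see why one index serves both sort directions, here is a toy model in plain JavaScript. It is only a sketch, not MongoDB's B-tree code: the index is modelled as a sorted array, and `buildIndex`, `scan`, and the sample documents are invented for illustration.

```javascript
// Toy model of a compound index {active: 1, weight: -1}: a sorted array of
// [active, weight, docPosition] entries. All names here are illustrative.
const docs = [
  { name: "a", active: true, weight: 5 },
  { name: "b", active: false, weight: 9 },
  { name: "c", active: true, weight: 2 },
  { name: "d", active: true, weight: 7 },
];

function buildIndex(documents) {
  return documents
    .map((doc, pos) => [doc.active, doc.weight, pos])
    // active ascending (false before true), then weight descending
    .sort((x, y) => (x[0] - y[0]) || (y[1] - x[1]));
}

// Walk the entries for active === true, either forward or in reverse.
function scan(index, documents, reverse) {
  const entries = index.filter((e) => e[0] === true);
  const ordered = reverse ? entries.slice().reverse() : entries;
  return ordered.map((e) => documents[e[2]].name);
}

const index = buildIndex(docs);
const descending = scan(index, docs, false); // like sort({weight: -1})
const ascending = scan(index, docs, true);   // like sort({weight: 1})
console.log(descending, ascending);
```

The same sorted entries produce both orders; only the walking direction changes, which is what "reverse" in the cursor name reports.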
This is very confusing. The MongoDB class has an example, shown below.
Notice that BtreeCursor reverse is used ONLY for sorting ahead of the skip and limit,
NOT for locating the records.
The lesson is that if nscanned = 40k and n = 10, the B-tree index was not used to locate the records.
So seeing BtreeCursor ... reverse does not necessarily mean that the index was used to locate the records.
Suppose you have a collection called tweets whose documents contain information about the created_at time of the tweet and the user's followers_count at the time they issued the tweet. What can you infer from the following explain output?
db.tweets.find({"user.followers_count":{$gt:1000}}).sort({"created_at" : 1 }).limit(10).skip(5000).explain()
{
"cursor" : "BtreeCursor created_at_-1 reverse",
"isMultiKey" : false,
"n" : 10,
"nscannedObjects" : 46462,
"nscanned" : 46462,
"nscannedObjectsAllPlans" : 49763,
"nscannedAllPlans" : 49763,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 205,
"indexBounds" : {
"created_at" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
},
"server" : "localhost.localdomain:27017"
}
This query performs a collection scan: yes.
The query uses an index to determine the order in which to return result documents: yes.
The query uses an index to determine which documents match: no.
The query returns 46462 documents: no.
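Reading an explain() document like the one above can be made mechanical. The sketch below is plain JavaScript with no MongoDB connection; `summarizeExplain` and the names of the fields it returns are hypothetical, and it only inspects fields that appear in the legacy explain() output shown above.

```javascript
// Hypothetical helper: turns a legacy explain() document into a few facts.
function summarizeExplain(explain) {
  const usesIndex = explain.cursor.startsWith("BtreeCursor");
  return {
    usesIndex,
    // " reverse" at the end of the cursor name means a backwards index walk
    reverseTraversal: usesIndex && explain.cursor.endsWith(" reverse"),
    // scanAndOrder: true means the results had to be sorted in memory
    inMemorySort: explain.scanAndOrder === true,
    // indexOnly: true means a covered query (no documents fetched)
    covered: explain.indexOnly === true,
    // a large ratio suggests the index did little to narrow the match
    scannedPerReturned: explain.n > 0 ? explain.nscanned / explain.n : null,
  };
}

// Values copied from the tweets explain() output above.
const summary = summarizeExplain({
  cursor: "BtreeCursor created_at_-1 reverse",
  n: 10,
  nscanned: 46462,
  scanAndOrder: false,
  indexOnly: false,
});
console.log(summary);
```

Here the index supplies the order (no in-memory sort), but thousands of index entries are scanned per returned document, which matches the quiz answers: the index orders the results and does not select them.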
Assuming you are using MongoDB 2.0+, reverse traversal is no more costly than forward traversal, so in this case you don't need to create separate indexes for the forward and reverse sorts. You can confirm this by creating the second index and using hint() if you wish (the optimizer caches the current index choice for a while, so it will not automatically select the other index).
I have a collection with more than 100k documents.
A sample document looks like this:
{
"created_at" : 1545039649,
"priority" : 3,
"id" : 68,
"name" : "document68"
}
db.mycol.find().sort({created_at:1})
and
db.mycol.find().sort({priority:1})
both result in an error:
Error: error: {
"ok" : 0,
"errmsg" : "Executor error during find command: OperationFailed: Sort operation used more than the maximum 33554432 bytes of RAM. Add an index, or specify a smaller limit.",
"code" : 96,
"codeName" : "OperationFailed"
}
Then I indexed these fields.
db.mycol.createIndex({'priority':1})
db.mycol.createIndex({'created_at':1}, {sparse:true})
Added sparse index to created_at as it is a mandatory field.
Now
db.mycol.find().sort({priority:1})
gives the result. But
db.mycol.find().sort({created_at:1})
still results in the same error.
The sparse index can only be used when you filter by created_at: {$exists: true}.
The reason is that all the other records are not part of the index, but they are still supposed to appear in the result (documents without the field are treated as null, which sorts first in ascending order).
Maybe you don't have to make the index sparse (which only makes sense when most of the records do not have the field -- otherwise you don't save much space in index storage anyway)? created_at sounds like most records would have it.
You wrote: "Added sparse index to created_at as it is a mandatory field."
Actually, it is the other way around: you only want a sparse index when the field is optional (and quite rare).
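The difference can be sketched with a toy model in plain JavaScript. This is not MongoDB code; `buildIndex`, `compareKeys`, and the three sample documents are made up for illustration. A sparse index simply has no entry for a document that lacks the field, so walking it cannot produce a complete sorted result.

```javascript
// Three documents, one of which has no created_at field.
const docs = [
  { id: 1, created_at: 30 },
  { id: 2 },
  { id: 3, created_at: 10 },
];

// Missing values are stored as null and ordered before numbers.
function compareKeys(a, b) {
  if (a === b) return 0;
  if (a === null) return -1;
  if (b === null) return 1;
  return a - b;
}

// Build a toy single-field index: an array of {key, id} sorted by key.
function buildIndex(documents, field, sparse) {
  return documents
    .filter((d) => !sparse || field in d) // sparse: skip docs without the field
    .map((d) => ({ key: field in d ? d[field] : null, id: d.id }))
    .sort((x, y) => compareKeys(x.key, y.key));
}

const regular = buildIndex(docs, "created_at", false).map((e) => e.id);
const sparse = buildIndex(docs, "created_at", true).map((e) => e.id);
console.log(regular, sparse);
```

The regular index covers all three documents, while the sparse one silently lacks document 2, which is why an unfiltered sort cannot be answered from it.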
I'm trying to fully sort a collection with millions of rows by a single field.
As far as I know, an ObjectId contains a 4-byte timestamp, and my timestamp is a 4-byte integer indexed field. So I assumed that sorting by _id and by timestamp should be similar, but here are the results:
db.coll.find().sort("_id", pymongo.ASCENDING)
# takes 25 minutes to run
and
db.coll.find().sort("timestamp", pymongo.ASCENDING)
# takes 2 hours to run
Why is this happening, and is there a way to optimize it?
Thanks
UPDATE
The timestamp field I'm trying to sort by is already indexed, as I pointed out.
collection stats
"size" : 55881082188,
"count" : 126048972,
"avgObjSize" : 443,
"storageSize" : 16998031360,
"capped" : false,
"nindexes" : 2,
"totalIndexSize" : 2439606272,
and I dedicated 4 GB of RAM to the MongoDB process (I tried increasing it to 8 GB, but the speed didn't improve)
UPDATE 2
It turns out that the more closely the order of the sort field follows the insertion (natural) order, the faster the sort is.
I tried the following:
db.new_coll.create_index([("timestamp", pymongo.ASCENDING)])
for el in db.coll.find().sort("timestamp", pymongo.ASCENDING):
    del el['_id']  # let MongoDB assign a fresh _id in the new collection
    db.new_coll.insert_one(el)
# and now
db.new_coll.find().sort("timestamp", pymongo.ASCENDING)
# takes 25 minutes vs 2 hours as in previous example
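One way to picture this observation is to count how often an index-ordered read has to jump to a different storage page. The sketch below is a toy model in plain JavaScript, not a measurement of MongoDB: `pageSwitches`, the page size of 100 documents, and the fixed permutation used for the "scattered" case are all assumptions made for illustration.

```javascript
// Counts how often two consecutive reads land on different storage "pages".
// positions[i] is the storage slot of the i-th document in index order.
function pageSwitches(positions, docsPerPage) {
  let switches = 0;
  for (let i = 1; i < positions.length; i++) {
    const previousPage = Math.floor(positions[i - 1] / docsPerPage);
    const currentPage = Math.floor(positions[i] / docsPerPage);
    if (previousPage !== currentPage) switches++;
  }
  return switches;
}

const total = 1000;
const docsPerPage = 100;
// Index order equals insertion order: slots 0, 1, 2, ...
const aligned = Array.from({ length: total }, (_, i) => i);
// Index order unrelated to insertion order: a fixed permutation of the slots.
const scattered = Array.from({ length: total }, (_, i) => (i * 37) % total);

const alignedSwitches = pageSwitches(aligned, docsPerPage);
const scatteredSwitches = pageSwitches(scattered, docsPerPage);
console.log(alignedSwitches, scatteredSwitches);
```

When index order matches storage order, the scan moves to a new page only after exhausting the current one; when the two are unrelated, it changes page far more often, which on a data set larger than RAM plausibly means many more disk reads.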
Sorting by _id is faster because of the way the _id field value is generated.
From the documentation:
One of the main reasons ObjectId’s are generated in the fashion
mentioned above by the drivers is that is contains a useful behavior
due to the way sorting works. Given that it contains a 4 byte
timestamp (resolution of seconds) and an incrementing counter as well
as some more unique identifiers such as the machine id once can use
the _id field to sort documents in the order of creation just by
simply sorting on the _id field. This can be useful to save the space
needed by an additional timestamp if you wish to track the time of
creation of a document.
I also tried explaining the query and noticed that nscannedObjects and nscannedObjectsAllPlans are 0 when sorting is done using _id.
> db.coll.find({},{_id:1}).sort({_id:1}).explain();
{
"cursor" : "BtreeCursor _id_",
"isMultiKey" : false,
"n" : 353,
"nscannedObjects" : 0,
"nscanned" : 353,
"nscannedObjectsAllPlans" : 0,
"nscannedAllPlans" : 353,
"scanAndOrder" : false,
"indexOnly" : true,
"nYields" : 2,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
"_id" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
},
"server" : "server",
"filterSet" : false
}
The _id field is created automatically when a document is inserted into a collection. It stores a 12-byte ObjectId value that uniquely identifies the document within the collection.
According to the MongoDB documentation,
The 12-byte ObjectId value consists of:
a 4-byte value representing the seconds since the Unix epoch,
a 3-byte machine identifier,
a 2-byte process id, and
a 3-byte counter, starting with a random value.
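The layout listed above can be checked by slicing the hex form of an ObjectId. This is a minimal sketch in plain JavaScript that needs no MongoDB driver; `parseObjectId` is a hypothetical helper, the sample id is just an example value, and the split assumes the 4/3/2/3-byte layout quoted above.

```javascript
// Splits a 24-character hex ObjectId string into the four parts listed above.
function parseObjectId(hex) {
  if (!/^[0-9a-fA-F]{24}$/.test(hex)) {
    throw new Error("expected 24 hex characters");
  }
  return {
    seconds: parseInt(hex.slice(0, 8), 16),   // 4-byte timestamp
    machine: hex.slice(8, 14),                // 3-byte machine identifier
    pid: hex.slice(14, 18),                   // 2-byte process id
    counter: parseInt(hex.slice(18, 24), 16), // 3-byte counter
  };
}

const parts = parseObjectId("507f1f77bcf86cd799439011");
const createdAt = new Date(parts.seconds * 1000);
console.log(parts, createdAt.toISOString());
```

Because the timestamp occupies the leading bytes, comparing ObjectIds byte by byte orders them by creation second first, which is why sorting on _id approximates sorting by insertion time.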
Indexes on a collection's fields speed up retrieval because the values of an indexed field are kept in sorted order, so a scan can stop as soon as the matching values have been found, which minimizes the number of documents to examine.
A unique index is defined on the _id field when the collection is created, so sorting by _id retrieves data from the collection quickly.
Indexes.
When you use the MongoDB sort() method, you can specify the sort order—ascending (1) or descending (-1)—for the result set. If you do not index for the sort field, MongoDB will sort the results at query time. Sorting at query time uses CPU resources and delays the response to the application. However, when an index includes all fields used to select and sort the result set in proper order, MongoDB does not need to sort at query time. Instead, results are already sorted in the index, and can be returned immediately.
See these links for more details:
https://mobile.developer.com/db/indexing-tips-for-improving-your-mongodb-performance.html
https://docs.mongodb.com/manual/tutorial/sort-results-with-indexes/
We need to create a compound index in the same order as the parameters are being queried. Does this order matter performance-wise at all?
Imagine we have a collection of all humans on earth with an index on sex (99.9% of the time "male" or "female", but a string nonetheless (not binary)) and an index on name.
If we would want to be able to select all people of a certain sex with a certain name, e.g. all "male"s named "John", is it better to have a compound index with sex first or name first? Why (not)?
Redsandro,
You must consider Index Cardinality and Selectivity.
1. Index Cardinality
The index cardinality refers to how many possible values there are for a field. The field sex only has two possible values. It has a very low cardinality. Other fields such as names, usernames, phone numbers, emails, etc. will have a more unique value for every document in the collection, which is considered high cardinality.
Greater Cardinality
The greater the cardinality of a field the more helpful an index will be, because indexes narrow the search space, making it a much smaller set.
If you have an index on sex and you are looking for men named John, you would only narrow down the result space by approximately 50% if you indexed by sex first. Conversely, if you indexed by name, you would immediately narrow down the result set to the minute fraction of users named John, and then refer to those documents to check the gender.
Rule of Thumb
Try to create indexes on high-cardinality keys or put high-cardinality keys first in the compound index. You can read more about it in the section on compound indexes in the book:
MongoDB The Definitive Guide
2. Selectivity
Also, you want to use indexes selectively and write queries that limit the number of possible documents with the indexed field. To keep it simple, consider the following collection. If your index is {name:1} and you run the query { name: "John", sex: "male"}, MongoDB has to scan only 1 document, because the index lets it be selective.
{_id:ObjectId(),name:"John",sex:"male"}
{_id:ObjectId(),name:"Rich",sex:"male"}
{_id:ObjectId(),name:"Mose",sex:"male"}
{_id:ObjectId(),name:"Sami",sex:"male"}
{_id:ObjectId(),name:"Cari",sex:"female"}
{_id:ObjectId(),name:"Mary",sex:"female"}
Consider the same collection again. If your index is {sex:1} and you run the query {sex: "male", name: "John"}, MongoDB has to scan 4 documents.
{_id:ObjectId(),name:"John",sex:"male"}
{_id:ObjectId(),name:"Rich",sex:"male"}
{_id:ObjectId(),name:"Mose",sex:"male"}
{_id:ObjectId(),name:"Sami",sex:"male"}
{_id:ObjectId(),name:"Cari",sex:"female"}
{_id:ObjectId(),name:"Mary",sex:"female"}
Imagine the possible differences on a larger data set.
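The two counts above (1 document versus 4) can be reproduced with a small model in plain JavaScript. It is a sketch only: `runQuery` is an invented helper that treats a single-field index as "give me every document with this value", and it does not reflect how MongoDB's planner actually chooses an index.

```javascript
// The six sample documents from the example above (without _id).
const people = [
  { name: "John", sex: "male" },
  { name: "Rich", sex: "male" },
  { name: "Mose", sex: "male" },
  { name: "Sami", sex: "male" },
  { name: "Cari", sex: "female" },
  { name: "Mary", sex: "female" },
];

// Use a single-field "index" to pick candidates, then check the whole
// query against each candidate document.
function runQuery(documents, indexedField, query) {
  const candidates = documents.filter(
    (d) => d[indexedField] === query[indexedField]
  );
  const matches = candidates.filter((d) =>
    Object.keys(query).every((k) => d[k] === query[k])
  );
  return { examined: candidates.length, returned: matches.length };
}

const query = { name: "John", sex: "male" };
const byName = runQuery(people, "name", query);
const bySex = runQuery(people, "sex", query);
console.log(byName, bySex);
```

Both runs return the same single document; only the number of documents examined along the way differs. This comparison is about two single-field indexes, not about the field order inside one compound index.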
A little explanation of Compound Indexes
It's easy to make the wrong assumption about compound indexes. According to the MongoDB docs on compound indexes:
MongoDB supports compound indexes, where a single index structure
holds references to multiple fields within a collection’s documents.
When you create a compound index, a single index holds multiple fields. So if we index a collection by {"sex" : 1, "name" : 1}, the index will look roughly like:
["male","Rick"] -> 0x0c965148
["male","John"] -> 0x0c965149
["male","Sean"] -> 0x0cdf7859
["male","Bro"] -> 0x0cdf7859
...
["female","Kate"] -> 0x0c965134
["female","Katy"] -> 0x0c965126
["female","Naji"] -> 0x0c965183
["female","Joan"] -> 0x0c965191
["female","Sara"] -> 0x0c965103
If we index a collection by {"name" : 1, "sex" : 1}, the index will look roughly like:
["John","male"] -> 0x0c965148
["John","female"] -> 0x0c965149
["John","male"] -> 0x0cdf7859
["Rick","male"] -> 0x0cdf7859
...
["Kate","female"] -> 0x0c965134
["Katy","female"] -> 0x0c965126
["Naji","female"] -> 0x0c965183
["Joan","female"] -> 0x0c965191
["Sara","female"] -> 0x0c965103
Having {name:1} as the Prefix will serve you much better in using compound indexes. There is much more that can be read on the topic, I hope this can offer some clarity.
I ran an experiment on this myself and found that there seems to be no performance penalty for putting the poorly distinguishing index key first. (I'm using MongoDB 3.4 with WiredTiger, which may behave differently from MMAPv1.) I inserted 250 million documents into a new collection called items. Each doc looked like this:
{
field1:"bob",
field2:i + "",
field3:i + ""
}
"field1" was always equal to "bob". "field2" was equal to i, so it was completely unique. First I did a search on field2, and it took over a minute to scan 250 million documents. Then I created an index like so:
`db.items.createIndex({field1:1,field2:1})`
Of course field1 is "bob" on every single document, so I expected the index to have to search through a number of items before finding the desired document. However, this was not the result I got.
I did another search on the collection after the index finished building. This time I got the results listed below. You'll see that "totalKeysExamined" is 1 each time. So perhaps WiredTiger has figured out how to do this better. I have read that WiredTiger compresses index prefixes, so that may have something to do with it.
db.items.find({field1:"bob",field2:"250888000"}).explain("executionStats")
{
"executionSuccess" : true,
"nReturned" : 1,
"executionTimeMillis" : 4,
"totalKeysExamined" : 1,
"totalDocsExamined" : 1,
"executionStages" : {
"stage" : "FETCH",
"nReturned" : 1,
"executionTimeMillisEstimate" : 0,
"works" : 2,
"advanced" : 1,
...
"docsExamined" : 1,
"inputStage" : {
"stage" : "IXSCAN",
"nReturned" : 1,
"executionTimeMillisEstimate" : 0,
...
"indexName" : "field1_1_field2_1",
"isMultiKey" : false,
...
"indexBounds" : {
"field1" : [
"[\"bob\", \"bob\"]"
],
"field2" : [
"[\"250888000\", \"250888000\"]"
]
},
"keysExamined" : 1,
"seeks" : 1
}
}
}
Then I created an index on field3 (which has the same value as field 2). Then I searched:
db.items.find({field3:"250888000"});
It took the same 4ms as the one with the compound index. I repeated this a number of times with different values for field2 and field3 and got insignificant differences each time. This suggests that with WiredTiger, there is no performance penalty for having poor differentiation on the first field of an index.
Note that multiple equality predicates do not have to be ordered from most selective to least selective. This guidance has been provided in the past however it is erroneous due to the nature of B-Tree indexes and how in leaf pages, a B-Tree will store combinations of all field’s values. As such, there is exactly the same number of combinations regardless of key order.
https://www.alexbevi.com/blog/2020/05/16/optimizing-mongodb-compound-indexes-the-equality-sort-range-esr-rule/
This blog article disagrees with the accepted answer. The benchmark in the other answer also shows that it doesn't matter. The author of that article is a "Senior Technical Services Engineer at MongoDB", which sounds like a credible person to me on this topic, so I guess the order really doesn't affect performance on equality fields after all. I'll follow the ESR rule instead.
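The point about equality predicates can be illustrated with a toy compound index in plain JavaScript. This is a sketch under simplifying assumptions, not MongoDB internals: the index is a sorted array of key tuples, `buildCompoundIndex` and `keysExamined` are invented helpers, and only matching keys are counted.

```javascript
// Compare two key tuples field by field.
function compareTuples(a, b) {
  for (let i = 0; i < a.length; i++) {
    if (a[i] < b[i]) return -1;
    if (a[i] > b[i]) return 1;
  }
  return 0;
}

// A toy compound index: one tuple per document, kept in sorted order.
function buildCompoundIndex(documents, fields) {
  return documents.map((d) => fields.map((f) => d[f])).sort(compareTuples);
}

// Binary-search to the first matching tuple, then count the matching run.
function keysExamined(index, wanted) {
  let lo = 0;
  let hi = index.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (compareTuples(index[mid], wanted) < 0) lo = mid + 1;
    else hi = mid;
  }
  let count = 0;
  while (
    lo + count < index.length &&
    compareTuples(index[lo + count], wanted) === 0
  ) {
    count++;
  }
  return count;
}

const people = [
  { name: "John", sex: "male" },
  { name: "John", sex: "female" },
  { name: "John", sex: "male" },
  { name: "Rick", sex: "male" },
  { name: "Kate", sex: "female" },
  { name: "Sara", sex: "female" },
];

const sexFirst = keysExamined(
  buildCompoundIndex(people, ["sex", "name"]), ["male", "John"]);
const nameFirst = keysExamined(
  buildCompoundIndex(people, ["name", "sex"]), ["John", "male"]);
console.log(sexFirst, nameFirst);
```

With an equality match on both fields, the search seeks straight to the same run of identical tuples in either ordering, so the number of keys touched is the same. Field order starts to matter once sorts or range predicates are involved, which is what the ESR rule addresses.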
Also consider prefixes. Filtering on { a: 1234 } won't use an index on { b: 1, a: 1 }, because a is not a prefix of it: https://docs.mongodb.com/manual/core/index-compound/#prefixes
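The prefix rule can be expressed as a short check. The helper below is a hypothetical illustration in plain JavaScript, not part of any MongoDB API, and it deliberately ignores sorts, range predicates, and other planner details: it only counts how many leading index fields the query constrains.

```javascript
// Returns how many leading fields of the index are constrained by the query.
// 0 means the query cannot use any prefix of this index.
function usablePrefixLength(indexFields, queryFields) {
  let n = 0;
  while (n < indexFields.length && queryFields.includes(indexFields[n])) {
    n++;
  }
  return n;
}

const filterOnA = usablePrefixLength(["b", "a"], ["a"]);       // index {b:1, a:1}
const filterOnAOther = usablePrefixLength(["a", "b"], ["a"]);  // index {a:1, b:1}
const filterOnBoth = usablePrefixLength(["b", "a"], ["a", "b"]);
console.log(filterOnA, filterOnAOther, filterOnBoth);
```

A filter on a alone matches no prefix of { b: 1, a: 1 }, matches a one-field prefix of { a: 1, b: 1 }, and a filter on both fields can use the whole index regardless of the order in which the query lists them.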
Imagine a collection with about 5,000,000 documents. I need to do a basicCursor query to select ~100 documents based on too many fields to index. Let's call this the basicCursorMatch. This will be immensely slow.
I can, however, do a bTreeCursor query on a few indexes that will limit my search to ~500 documents. Let's call this query the bTreeCursorMatch.
Is there a way I can do this basicCursorMatch directly on the cursor or collection resulting from the bTreeCursorMatch?
Intuitively I tried
var cursor = collection.find(bTreeCursorMatch);
var results = cursor.find(basicCursorMatch);
similar to collection.find(bTreeCursorMatch).find(basicCursorMatch), which doesn't seem to work.
Alternatively, I was hoping I could do something like this:
collection.aggregate([
{$match: bTreeCursorMatch}, // Uses index 5,000,000 -> 500 fast
{$match: basicCursorMatch}, // No index, 500 -> 100 'slow'
{$sort}
]);
.. but it seems that I cannot do this either. Is there an alternative to do what I want?
The reason I am asking is because this second query will differ a lot and there is no way I can index all the fields. But I do want to make that first query using a bTreeCursor, otherwise querying the whole collection will take forever using a basicCursor.
update
Also, through user input the subselection of 500 documents will be queried in different ways during a session with an unpredictable basicCursor query, using multiple $in $eq $gt $lt. But during this, the bTreeCursor subselection remains the same. Should I just keep doing both queries for every user query, or is there a more efficient way to keep a reference to this collection?
In practice, you rarely need to run a second query on a cursor. In particular, you don't need to break MongoDB's work into separate indexable / non-indexable chunks.
If you pass a query to MongoDB's find method that can be partially fulfilled by a look-up in an index, MongoDB will do that look-up first, and then do a full scan on the remaining documents.
For instance, I have a collection users with documents like:
{ _id : 4, gender : "M", ... }
There is an index on _id, but not on gender. There are ~200M documents in users.
To get an idea of what MongoDB is doing under the hood, add explain() to your cursor (in the Mongo shell):
> db.users.find( { _id : { $gte : 1, $lt : 10 } } ).explain()
{
"cursor" : "BtreeCursor oldId_1_state_1",
"n" : 9,
"nscannedObjects" : 9
}
I have cut out some of the fields returned by explain. Basically, cursor tells you if it's using an index, n tells you the number of documents returned by the query and nscannedObjects is the number of objects scanned during the query. In this case, mongodb was able to scan exactly the right number of objects.
What happens if we now query on gender as well?
> db.users.find( { _id : { $gte : 1, $lt : 10 }, gender : "F" } ).explain()
{
"cursor" : "BtreeCursor oldId_1_state_1",
"n" : 5,
"nscannedObjects" : 9
}
find returns 5 objects, but had to scan 9 documents. It was therefore able to isolate the correct 9 documents using the _id field. It then went through all 9 documents and filtered them by gender.
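The two-step behaviour described above can be mimicked in a few lines of plain JavaScript. This is a toy model rather than MongoDB code: `findWithIdIndex` is an invented helper, and the gender values assigned to the sample users are made up so that the counts match the explain() output above.

```javascript
// Twenty sample users; odd ids are "F" purely for illustration.
const users = [];
for (let id = 1; id <= 20; id++) {
  users.push({ _id: id, gender: id % 2 === 1 ? "F" : "M" });
}

function findWithIdIndex(documents, minId, maxId, residual) {
  // "Index" step: only documents inside the _id range are fetched.
  const scanned = documents.filter((d) => d._id >= minId && d._id < maxId);
  // Residual step: the non-indexed predicate runs on those documents only.
  const returned = scanned.filter(residual);
  return { n: returned.length, nscannedObjects: scanned.length };
}

const onlyId = findWithIdIndex(users, 1, 10, () => true);
const idAndGender = findWithIdIndex(users, 1, 10, (d) => d.gender === "F");
console.log(onlyId, idAndGender);
```

The non-indexed predicate never touches the other documents in the collection, so a single find() with both conditions already behaves like the "indexed match, then unindexed match" pipeline the question asks for.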
I have a reasonably large dataset of over 3 million documents, each of which has tags, similar to the way StackOverflow uses tags for each question. The schema that I use for storing the tags is as follows:
{"id": 12345, "tags":["tag1", "tag2", "tag3"]}, {"id": 12346, "tags":["tag2", "tag3"]}
I have a multi-key index created on tags field. When I am performing queries using $in or $nin operators to find the intersection, union of the tags, the performance is around 7 seconds on a server class machine. Is there anything that I can do to improve the speed of query search?
EDIT 1:
Here is the explain plan as requested. What I observed is that the queries returned much faster (< 50ms) after I restarted my server and ran only the MongoDB server. I suspect the indexes were not cached in memory, although I had ample unused RAM available and my index (800MB) could easily fit in memory.
db.tagsCollection.find( { "tags" : { $in : ['tag1', 'tag2'], $nin : ['tag4', 'tag5', 'tag6', 'tag7'] } } ).explain();
{
"cursor" : "BtreeCursor tags_1 multi",
"nscanned" : 6145193,
"nscannedObjects" : 6145192,
"n" : 969386,
"millis" : 19640,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : true,
"indexOnly" : false,
"indexBounds" : {
"tags" : [
[
"tag1",
"tag1"
],
[
"tag2",
"tag2"
]
]
}
}
Note
This is what I thought of as an optimization (though you would need to test it).
Instead of storing the tags themselves, store a small key that identifies each tag a particular document has.
Say the tags for post #125 are: PHP, MongoDb, database.
a) Clean the tags (for example, convert them all to lower case)
and then sort them alphabetically.
The tags will now be: database, mongodb, php
b) Have a separate collection which stores the integer-to-tag mapping:
{ "_id" : 1 , "t" : "mongodb" }
{ "_id" : 2 , "t" : "php" }
and so on; store all the possible tags for your website.
c) To store a document, create the tag key using the tag-to-number map from the previous collection.
So the current database, mongodb, php will become something like 12-1-2 (assuming database maps to 12).
d) Store your document like:
{ "id" : 12345 , "tags" : [12, 1, 2] }
QUERYING :
Using integers instead of strings on an indexed field would reduce the index size considerably, and should also make querying faster compared to a string index.
I am not sure about the amount of performance gain, but it is still worth a try to compare with your current implementation.
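Steps a) to d) can be sketched as follows. This is plain JavaScript with an in-memory `Map` standing in for the separate mapping collection; `normalizeTags`, `encodeTags`, and the concrete ids (including database mapping to 12) are assumptions made for illustration.

```javascript
// Step a): clean the tags and sort them alphabetically.
function normalizeTags(tags) {
  return tags.map((t) => t.trim().toLowerCase()).sort();
}

// Steps b) and c): translate tags to integers, assigning a new id on first use.
function encodeTags(tags, tagIds) {
  return normalizeTags(tags).map((tag) => {
    if (!tagIds.has(tag)) {
      const nextId = Math.max(0, ...tagIds.values()) + 1;
      tagIds.set(tag, nextId);
    }
    return tagIds.get(tag);
  });
}

// In-memory stand-in for the tag mapping collection.
const tagIds = new Map([
  ["mongodb", 1],
  ["php", 2],
  ["database", 12],
]);

// Step d): the document stores integers instead of strings.
const post = { id: 12345, tags: encodeTags(["PHP", "MongoDb ", "database"], tagIds) };
const another = { id: 12346, tags: encodeTags(["Redis"], tagIds) };
console.log(post, another);
```

In a real deployment the mapping would have to be stored durably and updated atomically so that two writers never hand out the same id; the `Map` here sidesteps that on purpose.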
Check the size of your multi-key tags index using db.col.stats(). If it doesn't fit in RAM then you might be disk-bound and incurring some disk IO cost. If the index fits entirely in memory then I'm not sure what else you can do, apart from throw more hardware at it, unless you can optimise the queries themselves.
Do you need to search through all the data, or can you query a subset that's filtered by another indexed field? Or can you eliminate the $nin queries, which will tend to be slower because they have to iterate over every tag, whereas $in only has to iterate until it finds a match?
If you want performance to be super fast and don't have space constraints, I would suggest having a separate collection of tags, each with a video id array, and an index on the tag name.
Here is another suggestion, but I have not had a chance to test it.
{
tags:{
items:[ 'a', 'b', 'c' ],
mixed:{
a:1, // hash value for a tag
b:2, // hash value for b tag
c:3 // hash value for c tag
}
}
}
and the search query is:
db.demo.find({ 'tags.mixed.a':1, 'tags.mixed.b':2 })
If possible, create a compound index on the tags.mixed fields.