Sparse index does not improve sort in MongoDB? - mongodb

I have a collection with more than 100k documents.
A sample document looks like this:
{
"created_at" : 1545039649,
"priority" : 3,
"id" : 68,
"name" : "document68"
}
db.mycol.find().sort({created_at:1})
and
db.mycol.find().sort({priority:1})
both result in an error.
Error: error: {
"ok" : 0,
"errmsg" : "Executor error during find command: OperationFailed: Sort operation used more than the maximum 33554432 bytes of RAM. Add an index, or specify a smaller limit.",
"code" : 96,
"codeName" : "OperationFailed"
}
Then I indexed these fields.
db.mycol.createIndex({'priority':1})
db.mycol.createIndex({'created_at':1}, {sparse:true})
Added sparse index to created_at as it is a mandatory field.
Now
db.mycol.find().sort({priority:1})
gives the result. But
db.mycol.find().sort({created_at:1})
still results in the same error.

The sparse index can only be used when you filter by created_at: {$exists: true}.
The reason is that records missing the field are not part of the index at all (but they are still supposed to appear in the result, probably at the end).
Maybe you don't have to make the index sparse? A sparse index only makes sense when most of the records do not have the field; otherwise you don't save much index storage anyway. created_at sounds like a field most records would have.
Added sparse index to created_at as it is a mandatory field.
Actually, it is the other way around: you only want a sparse index when the field is optional (and quite rare).
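To see why, here is a minimal Python sketch (toy data, not the original collection) of how a sparse index differs from a regular one for a whole-collection sort:

```python
# Toy model (made-up data) of sparse vs. non-sparse index behaviour.

docs = [
    {"id": 1, "created_at": 1545039649},
    {"id": 2},                            # created_at missing
    {"id": 3, "created_at": 1545039000},
]

# Sparse index: only documents that HAVE the field are indexed.
sparse_index = sorted(
    (d for d in docs if "created_at" in d),
    key=lambda d: d["created_at"],
)

# Non-sparse index: a missing field is indexed as null, which sorts
# before numbers in BSON; we model null with -infinity here.
NULL = float("-inf")
full_index = sorted(docs, key=lambda d: d.get("created_at", NULL))

# find().sort({created_at: 1}) must return ALL documents, so the
# sparse index cannot satisfy it on its own:
assert len(sparse_index) < len(docs)
assert len(full_index) == len(docs)
```

Because the sparse index is missing some documents, MongoDB cannot use it for an unfiltered sort and falls back to the in-memory sort that hits the 32 MB limit.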

Related

Why mongodb sort by _id is much faster than sort by any other indexed field?

I'm trying to fully sort a collection with millions of rows by a single field.
As far as I know, an ObjectId contains a 4-byte timestamp, and my timestamp field is a 4-byte integer, indexed. So I supposed sorting by _id and by timestamp should be similar, but here are the results:
db.coll.find().sort("_id", pymongo.ASCENDING)
# takes 25 minutes to run
and
db.coll.find().sort("timestamp", pymongo.ASCENDING)
# takes 2 hours to run
Why is this happening, and is there a way to optimize it?
Thanks
UPDATE
The timestamp field I'm trying to sort on is already indexed, as I pointed out.
collection stats
"size" : 55881082188,
"count" : 126048972,
"avgObjSize" : 443,
"storageSize" : 16998031360,
"capped" : false,
"nindexes" : 2,
"totalIndexSize" : 2439606272,
and I dedicated 4 GB of RAM to the mongodb process (I tried increasing it to 8 GB but the speed did not improve)
UPDATE 2
It turned out that the more closely the sort field's order follows the insertion (natural) order, the faster the sort runs.
I tried to
db.new_coll.create_index([("timestamp", pymongo.ASCENDING)])
for el in db.coll.find().sort("timestamp", pymongo.ASCENDING):
del el['_id']
db.new_coll.insert(el)
# and now
db.new_coll.find().sort("timestamp", pymongo.ASCENDING)
# takes 25 minutes vs 2 hours as in previous example
Sorting by _id is faster because of the way _id field value is generated.
Words from Documentation
One of the main reasons ObjectIds are generated in the fashion mentioned above by the drivers is that they contain a useful behavior due to the way sorting works. Given that an ObjectId contains a 4-byte timestamp (resolution of seconds) and an incrementing counter, as well as some more unique identifiers such as the machine id, one can use the _id field to sort documents in the order of creation simply by sorting on the _id field. This can be useful to save the space needed by an additional timestamp if you wish to track the time of creation of a document.
I have also tried explaining the query and noticed that nscannedObjects and nscannedObjectsAllPlans is 0 when sorting is done using _id.
> db.coll.find({},{_id:1}).sort({_id:1}).explain();
{
"cursor" : "BtreeCursor _id_",
"isMultiKey" : false,
"n" : 353,
"nscannedObjects" : 0,
"nscanned" : 353,
"nscannedObjectsAllPlans" : 0,
"nscannedAllPlans" : 353,
"scanAndOrder" : false,
"indexOnly" : true,
"nYields" : 2,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
"_id" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
},
"server" : "server",
"filterSet" : false
}
The _id field is created automatically upon insertion of a document and stores a 12-byte ObjectId value that uniquely identifies the BSON document within the collection.
According to documentation of MongoDB
The 12-byte ObjectId value consists of:
a 4-byte value representing the seconds since the Unix epoch,
a 3-byte machine identifier,
a 2-byte process id, and
a 3-byte counter, starting with a random value.
Indexes defined on collection fields speed up data retrieval because the indexed values are stored in a specific sort order and scanning stops once a matching value is found, minimizing the number of documents scanned.
A unique index is defined on the _id field when a collection is created, so sorting data by _id facilitates fast retrieval from the collection.
Indexes.
When you use the MongoDB sort() method, you can specify the sort order, ascending (1) or descending (-1), for the result set. If there is no index on the sort field, MongoDB sorts the results at query time. Sorting at query time uses CPU resources and delays the response to the application. However, when an index includes all fields used to select and sort the result set, in the proper order, MongoDB does not need to sort at query time: the results are already sorted in the index and can be returned immediately.
Please check here for more details.
https://mobile.developer.com/db/indexing-tips-for-improving-your-mongodb-performance.html
https://docs.mongodb.com/manual/tutorial/sort-results-with-indexes/
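Since the timestamp lives in the first 4 bytes of an ObjectId, you can recover it by hand. A short Python sketch (the ObjectId below is made up for illustration):

```python
# Hand-decoding the 4-byte timestamp prefix of an ObjectId hex string.
import datetime

def objectid_timestamp(oid_hex: str) -> datetime.datetime:
    """The first 4 bytes (8 hex chars) of an ObjectId are seconds since the Unix epoch."""
    seconds = int(oid_hex[:8], 16)
    return datetime.datetime.fromtimestamp(seconds, tz=datetime.timezone.utc)

oid = "5c17a4a10000000000000000"  # hypothetical ObjectId
print(objectid_timestamp(oid))    # the creation time encoded in the _id
```

This is why sorting on _id gives you (approximately) creation order for free, without storing a separate timestamp field.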

Created indexes on a mongodb collection, still fails while sorting a large data set

My Query below:
db.chats.find({ bid: 'someID' }).sort({start_time: 1}).limit(10).skip(82560).pretty()
I have indexes on chats collection on the fields in this order
{
"cid" : 1,
"bid" : 1,
"start_time" : 1
}
I am trying to perform sort, but when I write a query and check the result of explain(), I still get the winningPlan as
{
"stage":"SKIP",
"skipAmount":82560,
"inputStage":{
"stage":"SORT",
"sortPattern":{
"start_time":1
},
"limitAmount":82570,
"inputStage":{
"stage":"SORT_KEY_GENERATOR",
"inputStage":{
"stage":"COLLSCAN",
"filter":{
"ID":{
"$eq":"someID"
}
},
"direction":"forward"
}
}
}
}
I was expecting not to have a sort stage in the winning plan as I have indexes created for that collection.
Having no indexes results in the following error:
MongoError: OperationFailed: Sort operation used more than the maximum 33554432 bytes of RAM [duplicate]
However, I managed to make the sort work by increasing the in-memory sort limit from 32 MB to 64 MB. I'm looking for help with adding indexes properly.
The order of fields in an index matters. To sort query results on a field that is not at the start of the index key pattern, the query must include equality conditions on all the prefix keys that precede the sort keys. The cid field is neither in the query filter nor used for sorting, so leave it out. Put the bid field first in the index definition, since it is used in an equality condition, and put start_time after it so it can be used for sorting. The index should look like this:
{"bid" : 1, "start_time" : 1}
See the documentation for further reference.
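A small Python sketch of why this works, using made-up data: index entries are stored sorted by the compound key, so an equality match on the bid prefix selects a contiguous range in which start_time is already in sorted order:

```python
# Toy model of why a {bid: 1, start_time: 1} index serves
# find({bid: X}).sort({start_time: 1}) without a SORT stage.

chats = [
    {"bid": "b1", "start_time": 30},
    {"bid": "b2", "start_time": 10},
    {"bid": "b1", "start_time": 10},
    {"bid": "b2", "start_time": 20},
    {"bid": "b1", "start_time": 20},
]

# Index entries are stored sorted by the compound key.
index = sorted(chats, key=lambda d: (d["bid"], d["start_time"]))

# An equality match on the prefix (bid) selects a contiguous range,
# and within that range start_time is already sorted:
matched = [d["start_time"] for d in index if d["bid"] == "b1"]
assert matched == sorted(matched)   # no in-memory sort needed
```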

Mongodb sparse index and general index

I have created a collection with 100 documents (fields x and y), then created a normal index on field x and a sparse index on field y, as shown below:
for(i=1;i<=100;i++)db.coll.insert({x:i,y:i})
db.coll.createIndex({x:1})
db.coll.createIndex({y:1},{sparse:true})
Then, I added a few docs without fields x & y as shown below:
for(i=1;i<100;i++)db.coll.insert({z:"stringggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggg"})
Looking at db.coll.stats(), I found the sizes of the indexes:
storageSize:36864
_id:32768
x_1:32768
y_1:16384
As per the definition of sparse index, only documents containing the indexed field y are considered, hence y_1 occupies less space. But _id & x_1 indexes seem to contain all the documents in them.
If I perform the query db.coll.find({z:99}).explain('executionStats'),
it does a COLLSCAN to fetch the record. Given that, I am not clear on why MongoDB stores all the documents in the _id and x_1 indexes, as it seems like a waste of storage space. Please help me understand; pardon my ignorance if I missed something.
Thank you for your help.
In a "normal" index, missing fields are indexed with a null value. For example, if you have an index on {a:1} and you insert {b:10} into the collection, the document will be indexed as a: null.
You can see this behaviour using a unique index:
> db.test.createIndex({a:1}, {unique:true})
{
"createdCollectionAutomatically" : true,
"numIndexesBefore" : 1,
"numIndexesAfter" : 2,
"ok" : 1
}
> db.test.insert({b:1})
WriteResult({ "nInserted" : 1 })
> db.test.insert({c:1})
WriteResult({
"nInserted" : 0,
"writeError" : {
"code" : 11000,
"errmsg" : "E11000 duplicate key error collection: test.test index: a_1 dup key: { : null }"
}
})
Both {b:1} and {c:1} are indexed as a: null, hence the duplicate key error message.
In your collection, you have 200 documents:
100 documents with {x:..., y:...}
100 documents with {z:...}
And your indexes are:
{x:1} (normal index)
{y:1} (sparse index)
The documents will be indexed as follows:
200 documents will be in the _id index, which is always created by MongoDB
200 documents will be in the {x:1} index, from {x:.., y:..} and {z:..} documents
100 documents will be in the {y:1} index
Note that the index sizes you posted shows the same ratio as the numbers above.
Regarding your questions:
The _id index is for MongoDB internal use, see Default _id index. You cannot drop this index, and attempts to remove it could render your database inaccessible.
The x_1 index is there because you told MongoDB to build it. It contains all the documents in your collection because it's a normal index. In the case of your collection, half of the values in the index are null.
The sparse y_1 index is half the size of the x_1 index because only 100 out of 200 documents contain the y field.
The query db.coll.find({z:99}) does not use any index because you don't have an index on the z field, hence it's doing a collection scan.
For more information about indexing, please see Create Indexes to Support Your Queries
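The entry counts above can be reproduced with a small Python model of the collection (treating a normal index's missing field as null, and skipping missing fields for the sparse index):

```python
# Counting index entries the way MongoDB would for this collection:
# a normal index stores null for a missing field; a sparse index skips it.

docs = [{"x": i, "y": i} for i in range(1, 101)] + \
       [{"z": "string"} for _ in range(100)]

id_index = docs                                # every document is in _id
x_index  = [d.get("x") for d in docs]          # missing x -> None (null)
y_sparse = [d["y"] for d in docs if "y" in d]  # only docs that have y

assert len(id_index) == 200
assert len(x_index) == 200      # 100 of these entries are null
assert len(y_sparse) == 100     # roughly half the size, matching the stats
```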

How does the order of compound indexes matter in MongoDB performance-wise?

We need to create a compound index in the same order as the parameters are being queried. Does this order matter performance-wise at all?
Imagine we have a collection of all humans on earth, with an index on sex (99.9% of the time "male" or "female", but a string nonetheless (not binary)) and an index on name.
If we want to be able to select all people of a certain sex with a certain name, e.g. all "male"s named "John", is it better to have a compound index with sex first or with name first? Why (not)?
Redsandro,
You must consider Index Cardinality and Selectivity.
1. Index Cardinality
The index cardinality refers to how many possible values there are for a field. The field sex has only two possible values; it has very low cardinality. Other fields such as names, usernames, phone numbers, and emails will have a nearly unique value for every document in the collection, which is considered high cardinality.
Greater Cardinality
The greater the cardinality of a field the more helpful an index will be, because indexes narrow the search space, making it a much smaller set.
If you have an index on sex and you are looking for men named John, you would only narrow down the result space by approximately 50% by indexing on sex first. Conversely, if you indexed by name, you would immediately narrow the result set down to the minute fraction of users named John, then check the sex of those documents.
Rule of Thumb
Try to create indexes on high-cardinality keys or put high-cardinality keys first in the compound index. You can read more about it in the section on compound indexes in the book:
MongoDB The Definitive Guide
2. Selectivity
Also, you want to use indexes selectively and write queries that limit the number of candidate documents via the indexed field. To keep it simple, consider the following collection. If your index is {name:1} and you run the query { name: "John", sex: "male"}, you only have to scan 1 document, because you allowed MongoDB to be selective.
{_id:ObjectId(),name:"John",sex:"male"}
{_id:ObjectId(),name:"Rich",sex:"male"}
{_id:ObjectId(),name:"Mose",sex:"male"}
{_id:ObjectId(),name:"Sami",sex:"male"}
{_id:ObjectId(),name:"Cari",sex:"female"}
{_id:ObjectId(),name:"Mary",sex:"female"}
Now consider the same collection with an index of {sex:1}. If you run the query {sex: "male", name: "John"}, you have to scan 4 documents.
{_id:ObjectId(),name:"John",sex:"male"}
{_id:ObjectId(),name:"Rich",sex:"male"}
{_id:ObjectId(),name:"Mose",sex:"male"}
{_id:ObjectId(),name:"Sami",sex:"male"}
{_id:ObjectId(),name:"Cari",sex:"female"}
{_id:ObjectId(),name:"Mary",sex:"female"}
Imagine the possible differences on a larger data set.
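The scan counts above can be sketched in Python (toy data matching the sample collection):

```python
# Counting documents scanned under each single-field index.

people = [
    {"name": "John", "sex": "male"},
    {"name": "Rich", "sex": "male"},
    {"name": "Mose", "sex": "male"},
    {"name": "Sami", "sex": "male"},
    {"name": "Cari", "sex": "female"},
    {"name": "Mary", "sex": "female"},
]

# Index {name: 1}: seek straight to "John", then check sex on each match.
scanned_by_name = [p for p in people if p["name"] == "John"]

# Index {sex: 1}: seek to "male", then check name on each of those docs.
scanned_by_sex = [p for p in people if p["sex"] == "male"]

assert len(scanned_by_name) == 1   # 1 document scanned
assert len(scanned_by_sex) == 4    # 4 documents scanned
```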
A little explanation of Compound Indexes
It's easy to make the wrong assumption about Compound Indexes. According to MongoDB docs on Compound Indexes.
MongoDB supports compound indexes, where a single index structure
holds references to multiple fields within a collection’s documents.
The following diagram illustrates an example of a compound index on
two fields:
When you create a compound index, a single index holds multiple fields. So if we index a collection by {"sex" : 1, "name" : 1}, the index will look roughly like:
["female","Joan"] -> 0x0c965191
["female","Kate"] -> 0x0c965134
["female","Katy"] -> 0x0c965126
["female","Naji"] -> 0x0c965183
["female","Sara"] -> 0x0c965103
...
["male","Bro"] -> 0x0cdf7859
["male","John"] -> 0x0c965149
["male","Rick"] -> 0x0c965148
["male","Sean"] -> 0x0cdf7859
If we index a collection by {"name" : 1, "sex" : 1}, the index will look roughly like:
["Joan","female"] -> 0x0c965191
["John","female"] -> 0x0c965149
["John","male"] -> 0x0c965148
["John","male"] -> 0x0cdf7859
["Kate","female"] -> 0x0c965134
["Katy","female"] -> 0x0c965126
["Naji","female"] -> 0x0c965183
["Rick","male"] -> 0x0cdf7859
["Sara","female"] -> 0x0c965103
...
Having {name:1} as the prefix will serve you much better when using compound indexes. There is much more that can be read on the topic; I hope this offers some clarity.
I'm going to say I did an experiment on this myself, and found that there seems to be no performance penalty for using the poorly differentiated index key first. (I'm using MongoDB 3.4 with WiredTiger, which may behave differently than MMAPv1.) I inserted 250 million documents into a new collection called items. Each doc looked like this:
{
field1: "bob",
field2: i + "",
field3: i + ""
}
"field1" was always equal to "bob". "field2" was equal to i, so it was completely unique. First I did a search on field2, and it took over a minute to scan 250 million documents. Then I created an index like so:
`db.items.createIndex({field1:1,field2:1})`
Of course field1 is "bob" on every single document, so the index should have to search a number of items before finding the desired document. However, this was not the result I got.
I did another search on the collection after the index finished building. This time I got the results listed below. You'll see that "totalKeysExamined" is 1 each time. So perhaps WiredTiger has figured out how to do this better. I have read that WiredTiger compresses index prefixes, so that may have something to do with it.
db.items.find({field1:"bob",field2:"250888000"}).explain("executionStats")
{
"executionSuccess" : true,
"nReturned" : 1,
"executionTimeMillis" : 4,
"totalKeysExamined" : 1,
"totalDocsExamined" : 1,
"executionStages" : {
"stage" : "FETCH",
"nReturned" : 1,
"executionTimeMillisEstimate" : 0,
"works" : 2,
"advanced" : 1,
...
"docsExamined" : 1,
"inputStage" : {
"stage" : "IXSCAN",
"nReturned" : 1,
"executionTimeMillisEstimate" : 0,
...
"indexName" : "field1_1_field2_1",
"isMultiKey" : false,
...
"indexBounds" : {
"field1" : [
"[\"bob\", \"bob\"]"
],
"field2" : [
"[\"250888000\", \"250888000\"]"
]
},
"keysExamined" : 1,
"seeks" : 1
}
}
Then I created an index on field3 (which has the same value as field2), and searched:
db.items.find({field3:"250888000"});
It took the same 4 ms as the query using the compound index. I repeated this a number of times with different values for field2 and field3 and got insignificant differences each time. This suggests that with WiredTiger, there is no performance penalty for having a poorly differentiating first field in an index.
Note that multiple equality predicates do not have to be ordered from most selective to least selective. This guidance has been provided in the past, but it is erroneous due to the nature of B-tree indexes: in leaf pages, a B-tree stores combinations of all fields' values, so there are exactly the same number of combinations regardless of key order.
https://www.alexbevi.com/blog/2020/05/16/optimizing-mongodb-compound-indexes-the-equality-sort-range-esr-rule/
This blog article disagrees with the accepted answer, and the benchmark in the other answer also shows that the order doesn't matter. The author of the article is a "Senior Technical Services Engineer at MongoDB", which sounds credible to me on this topic, so I guess the order really doesn't affect performance on equality fields after all. I'll follow the ESR (Equality, Sort, Range) rule instead.
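The "same number of combinations" point can be illustrated with a short Python sketch (made-up names):

```python
# For pure equality lookups, both key orders store exactly the same
# set of composite keys, just permuted, so an equality seek touches
# the same number of index entries either way.
from itertools import product

names = ["John", "Kate", "Rick"]
sexes = ["male", "female"]

sex_name_keys = sorted(product(sexes, names))   # index {sex: 1, name: 1}
name_sex_keys = sorted(product(names, sexes))   # index {name: 1, sex: 1}

# Same number of composite keys regardless of key order:
assert len(sex_name_keys) == len(name_sex_keys)

# An equality match on BOTH fields selects exactly one key in each index:
assert sex_name_keys.count(("male", "John")) == 1
assert name_sex_keys.count(("John", "male")) == 1
```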
Also consider prefixes: filtering for { a: 1234 } won't be able to use an index of { b: 1, a: 1 }: https://docs.mongodb.com/manual/core/index-compound/#prefixes

mongodb index (reverse) optimization

I have a mongodb collection, "features", having 3 fields: name, active, weight.
I will sort features by weight descending:
db.features.find({active:true},{name:1, weight:1}).sort({weight:-1})
For optimization, I created an index for it:
db.features.ensureIndex({'active': 1, 'weight': -1})
I can see it works well when using explain() on the query.
However, when I query by weight ascending, I suppose the index I just created will not be used, and I will need to create another index on weight ascending.
Query:
db.features.find({active:true},{name:1, weight:1}).sort({weight:1}).explain()
When I use explain() to see how the index is used, it prints:
"cursor" : "BtreeCursor active_1_weight_-1 reverse",
Does the "reverse" mean the query is optimized by the index?
Generally, do I need to create two indexes, one ascending on weight and one descending on weight, if I will sort by weight ascending in some cases and descending in others?
I know I'm late, but I would like to add a little more detail. When explain() outputs cursor: BtreeCursor, it doesn't always guarantee that only the index was used to satisfy the query. You also have to check the "indexOnly" field in the explain() results. If indexOnly is true, your query was satisfied using the index alone and the documents in the collection were not consulted at all. This is called a 'covered index query': http://docs.mongodb.org/manual/applications/indexes/
But if explain() shows cursor: BtreeCursor and indexOnly: false, then in addition to using the index, the collection was also consulted. In your case, for the query:
db.features.find({active:true},{name:1, weight:1}).sort({weight:1}).explain()
Mongo would have used the index {'active': 1, 'weight': -1} to satisfy the filter part of the query, i.e. db.features.find({active:true}), and traversing that index in reverse also satisfies the ascending sort. To know whether the documents themselves were touched, you have to look at the indexOnly result within explain().
As you can see from the documentation, when explain() outputs BtreeCursor, an index was used; indexBounds then indicates the key bounds scanned in the index. If the output had shown BasicCursor instead, it would indicate a table-scan-style operation.
So based on what you've said, the explain() results show that you are using a B-tree cursor on the index named active_1_weight_-1, and the reverse means that you are iterating over the index in reverse order.
So no, you don't need to create separate indexes.
This is confusing. In the MongoDB class there is an example (see below).
Notice that the BtreeCursor ... reverse is used only for sorting (to satisfy the skip and limit),
not for locating the matching records.
The lesson: if nscanned is about 46k and n is 10, the B-tree index was not used to locate the records.
So seeing a BtreeCursor index reverse does not necessarily mean the index was used to locate the records.
Suppose you have a collection called tweets whose documents contain information about the created_at time of the tweet and the user's followers_count at the time they issued the tweet. What can you infer from the following explain output?
db.tweets.find({"user.followers_count":{$gt:1000}}).sort({"created_at" : 1 }).limit(10).skip(5000).explain()
{
"cursor" : "BtreeCursor created_at_-1 reverse",
"isMultiKey" : false,
"n" : 10,
"nscannedObjects" : 46462,
"nscanned" : 46462,
"nscannedObjectsAllPlans" : 49763,
"nscannedAllPlans" : 49763,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 205,
"indexBounds" : {
"created_at" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
},
"server" : "localhost.localdomain:27017"
}
This query performs a collection scan: yes.
The query uses an index to determine the order in which to return result documents: yes.
The query uses an index to determine which documents match: no.
The query returns 46462 documents: no.
Assuming you are using 2.0+, reverse traversal is not more costly to MongoDB, so in this case you don't need to create separate indexes for the forward and reverse sorts. You can confirm by creating the second index and using hint() if you wish (the optimizer will cache the current index for a while, so it will not automatically select the other index).
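As a rough Python analogy (toy data), walking one sorted structure backwards yields the opposite order at no extra cost, which is what the reverse traversal does:

```python
# One {active: 1, weight: -1} index can serve both sort({weight: -1})
# and sort({weight: 1}): the latter just walks the index in reverse.

weights = [5, 1, 4, 2, 3]
index_desc = sorted(weights, reverse=True)   # index order: weight descending

ascending = list(reversed(index_desc))       # reverse traversal of the index
assert ascending == sorted(weights)          # same as a forward ascending index
```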