Say I have a MongoDB collection with documents like this:
{ "_id": ObjectId("the_object_id"),
"type": "BLOG_POST",
"state": "IN_PROGRESS",
"createDate":ISODate("2017-02-15T01:01:01.000Z"),
"users": {
"posted": ["user1", "user2", "user3"],
"favorited": ["user1", "user4", "user5", "user6"],
"other_fields": "other data",
},
"many_more_fields": "a bunch of other data"
}
I have a query like this:
db.collection.find({"$and":[
{"type": "BLOG_POST"},
{"$or": [ {"users.posted":"userX"}, {"users.favorited":"userX"} ] },
{"state": {"$ne":"COMPLETED"}}
]}).sort({"createDate":1})
The collection currently only has indexes on the _id field and some fields not included in this query or example.
As far as cardinality goes:
- type = "BLOG_POST" matches approximately 75% of the collection,
- state $ne "COMPLETED" matches approximately 50% of the collection, and
- any single user appears in users.posted or users.favorited in at most 2% of the collection.
What would the best index or set of indexes be for this use case?
It is my understanding that we cannot index both users.posted and users.favorited in the same index because they are both arrays. In the future we may be able to add a new array, users.userswhocare, that is the union of the two fields, but assume we can't make that change in the short term.
I also thought that the $ne on state means that an index on state will generally not be used. Is the query planner able to use the state field at the end of an index to handle the $ne condition?
I had the idea of an index {"type":1, "createDate":1, "state":1}, so that the query would hit on the type, use createDate for the sort, and handle the $ne with the last bit of the index. It would still have to pick up 35%-40% of the documents to test for the users. Not good, but an improvement over the current collection scan.
Alternatively, I could create two indexes: {"users.posted":1, "type":1, "createDate":1, "state":1} and {"users.favorited":1, "type":1, "createDate":1, "state":1}.
Will the query planner use the intersection of these two indexes to more quickly find the documents of interest?
We are currently using MongoDB 3.2.5. If there are differences in the answer between MongoDB 3.2 and 3.4, I would love to know them.
After some analysis, I found that creating two indexes with users.posted and users.favorited as the first field of the respective indexes both performed better and was chosen by the MongoDB query planner.
I created indexes like:
db.collection.createIndex({"users.posted":1, "type":1, "createDate":1, "state":1})
db.collection.createIndex({"users.favorited":1, "type":1, "createDate":1, "state":1})
Because users.posted and users.favorited are highly selective (either one matches no more than 2% of the collection, and most of the time less than 0.5%), the MongoDB query planner used both indexes, one for each branch of the $or.
I tested this against an index like:
db.collection.createIndex({"type":1, "createDate":1, "state":1}).
Reviewing the explain plans of against both queries using both explain() and explain("executionStats"), the query planner used index scans for the {"$or": [ {"users.posted":"userX"}, {"users.favorited":"userX"} ] } part of the query as the first stage. This led to the fewest totalKeysExamined and totalDocsExamined.
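For anyone wanting to reproduce that comparison, here is a rough sketch of how it can be done in the shell (collection and user names are just the placeholders from the question); hint() forces a particular candidate index so the totalKeysExamined and totalDocsExamined counters can be compared side by side:

// Let the planner choose (with both users.* indexes in place)
db.collection.find({"$and": [
    {"type": "BLOG_POST"},
    {"$or": [ {"users.posted": "userX"}, {"users.favorited": "userX"} ]},
    {"state": {"$ne": "COMPLETED"}}
]}).sort({"createDate": 1}).explain("executionStats")

// Force the single compound index for comparison
db.collection.find({"$and": [
    {"type": "BLOG_POST"},
    {"$or": [ {"users.posted": "userX"}, {"users.favorited": "userX"} ]},
    {"state": {"$ne": "COMPLETED"}}
]}).sort({"createDate": 1}).hint({"type": 1, "createDate": 1, "state": 1}).explain("executionStats")

In both outputs, executionStats.totalKeysExamined and executionStats.totalDocsExamined show how much work each plan did.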
Related
Mongo's documentation on the $or operator says:
"When evaluating the clauses in the $or expression, MongoDB either performs a collection scan or, if all the clauses are supported by indexes, MongoDB performs index scans. That is, for MongoDB to use indexes to evaluate an $or expression, all the clauses in the $or expression must be supported by indexes. Otherwise, MongoDB will perform a collection scan."
So effectively, if you want the query to be efficient, both of the attributes used in the $or condition should be indexed.
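As a quick illustration of what "all the clauses supported by indexes" means in practice (names here are only illustrative), the sketch below creates one index per field referenced in the $or so each clause can be answered by its own index scan:

// One index per field used in the $or
db.users.createIndex({ email: 1 })
db.users.createIndex({ username: 1 })

// Both clauses are now supported by indexes, so no collection scan is needed
db.users.find({ $or: [ { email: "someUser@gmail.com" }, { username: "some-user" } ] })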
However, I'm not sure if this applies to "findOne" operations as well since I can't use Mongo's explain functionality for a findOne operation. It seems logical to me that if you only care about returning one document, you would check an indexed condition first, since you could bail after just finding one without needing to care about the non-indexed fields.
Example
Let's pretend in the below that "email" is indexed, but username is not. It's a contrived example, but bear with me.
db.users.findOne(
    {
        $or: [
            { email: 'someUser@gmail.com' },
            { username: 'some-user' }
        ]
    }
)
Would the above use the email index, or would it do a full collection scan until it finds a document that matches the criteria?
I could not find anything documenting what would be expected here. Does someone know the answer?
Well I feel a little silly - it turns out that the docs I found saying you can only run explain on "find" may have been just referring to when you're in MongoDB Compass.
I ran the below (userEmail not indexed)
this.userModel
.findOne()
.where({ $or: [{ _id: id }, { userEmail: email }] })
.explain();
this.userModel
.getModelInstance()
.findOne()
.where({ _id: id })
.explain();
...and found that the first does do a COLLSCAN and the second's plan was IDHACK (basically it does a normal ID lookup).
When running performance tests, I can see that the first is about 4x slower in a collection with 20K+ documents (depending on where the document you're finding sits in the natural order).
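For anyone who wants to check the same thing in the plain mongo shell rather than through a model layer: for planning purposes findOne behaves like find(...).limit(1), so the plan can be inspected like this (a sketch using the users example above, where only email is indexed):

db.users.find({
    $or: [
        { email: "someUser@gmail.com" },
        { username: "some-user" }
    ]
}).limit(1).explain("executionStats")

// With only "email" indexed, expect a COLLSCAN; once "username" is also indexed,
// the same query should show one IXSCAN per $or branch instead.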
I have a simple query like this: {"field": {$nin: ["value1","value2","valueN"]}}.
The problem is the large number of unique values to exclude with the $nin operator: about 50000 unique values, which makes the query itself about 1Kb long.
Question: Is there an elegant and performant way to do such operations?
Example.
The daily_stat collection has 56M docs and grows by about 100K docs each day. Example document:
{
"day": "2020-04-15",
"username": "uniq_name",
"total": 12345
}
I run the following query:
{
"date": "2020-04-15",
"username": {
$nin: [
"name1",
"name2",
"...",
"name50000"
]
}
}
MongoDB version: 3.6.12
I would say the big $nin array is the elegant solution. If there is an index on the field then it will also be performant -- but only in terms of quickly excluding those docs not to be returned in the cursor. If you have, say, 10 million docs in a collection and you do a find() that excludes 50000, you are still dragging 9,950,000 records out of the DB and across the wire; that is non-trivial.
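A minimal sketch of that setup for the daily_stat example above (assuming the per-day equality filter shown in the question): a compound index lets the day narrow the scan first, and the big $nin list then only has to be applied to that day's keys.

// Compound index: day equality bounds the scan, username exclusions apply within it
db.daily_stat.createIndex({ "day": 1, "username": 1 })

// The large $nin stays as-is
db.daily_stat.find({
    "day": "2020-04-15",
    "username": { $nin: [ "name1", "name2", "...", "name50000" ] }
})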
If you can find a pattern in the values you pass in, you can try a regex instead. Example given below:
db.persons.find({'field':{$nin:[/san/i]}},{_id:0,"field":1})
More details on regex: https://docs.mongodb.com/manual/reference/operator/query/regex/
Imagine you have a schema like:
[{
    name: "Bob",
    naps: [{
        time: ISODate("2019-05-01T15:35:00"),
        location: "sofa"
    }, ...]
}, ...
]
So lots of people, each with a few dozen naps. You want to find out 'what days do people take the most naps?', so you index naps.time, and then query with:
aggregate([
    { $unwind: "$naps" },
    { $group: { _id: { $dayOfMonth: "$naps.time" }, napsOnDay: { $sum: 1 } } }
])
But when doing explain(), mongo tells me no index was used in this query, when clearly the index on the time Date field could have been. Why is this? How can I get mongo to use the index for the more optimal query?
Indexes store pointers to actual documents, and can only be used when working with a material document (i.e. the document that is actually stored on disk).
$match and $sort do not mutate the actual documents, so indexes can be used in these stages.
In contrast, $unwind, $group, or any other stage that changes the actual document representation basically loses the connection between the index and the material documents.
Additionally, when those stages are processed without $match, you're basically saying that you want to process the whole collection. There is no point in using the index if you want to process the whole collection.
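As a sketch of what that means for the naps example (the collection name and date range here are assumptions, not from the question): placing a $match on the indexed field before the $unwind lets the planner use the naps.time index, and repeating the filter after the $unwind keeps only the matching array elements.

db.people.aggregate([
    // Runs against the stored documents, so the index on naps.time can be used
    { $match: { "naps.time": { $gte: ISODate("2019-05-01"), $lt: ISODate("2019-06-01") } } },
    { $unwind: "$naps" },
    // Re-apply per unwound element (the first $match only checks whether any element qualifies)
    { $match: { "naps.time": { $gte: ISODate("2019-05-01"), $lt: ISODate("2019-06-01") } } },
    { $group: { _id: { $dayOfMonth: "$naps.time" }, napsOnDay: { $sum: 1 } } }
])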
I have a MongoDB collection that was indexed on userId - the index is "userId_1". At some point, we decided to shard the collection, and the shard key was chosen as the userId. As a result, another index was created on the collection - the index is "userId_hashed".
As a result, I've now got two indexes on this fairly large collection (about 100GB at the moment), and I'm concerned it may be wasteful to be maintaining both of these indexes on the collection. They seem like they could be redundant with one another.
I know I need to keep the "userId_hashed" index for sharding, and I know that a hashed index has the constraint that you can only filter with equality checks, and not ranges. Given that "userId" identifies a unique user in the system, querying for userId in a range doesn't make sense, or at least, isn't an operation we need to support.
So, can I just delete the "userId_1" index? Will queries on that field effectively use the "userId_hashed" index correctly and efficiently? Here's an example of the type of query we most often run against this collection:
db.history.find({ "userId": "abc123" }).sort({ "_id": -1 }).limit(10)
Or for paging:
db.history.find({ "userId": "abc123", "_id": { $lt: ObjectId("blah1") }}).sort({ "_id": -1 }).limit(10)
I have a MongoDB collection whose documents have the following fields:
{"word":"ipad", "date":20140113, "docid": 324, "score": 98}
It is a reverse index for a log of docs (about 120 million).
There are two kinds of queries in my system. One of them is:
db.index.find({"word":"ipad", "date":20140113}).sort({"score":-1})
This query fetches the word "ipad" on date 20140113 and sorts all matching docs by score.
The other query is:
db.index.find({"word":"ipad", "date":20140113, "docid":324})
To speed up these two kinds of queries, what indexes should I build?
Should I build two indexes like this?
db.index.ensureIndex({"word":1, "date":1, "docid":1}, {"unique":true})
db.index.ensureIndex({"word":1, "date":1, "score":1}
But I think building the two indexes uses too much hard disk space. So do you have any good ideas?
You are sorting by score descending (.sort({"score":-1})), which means that your index should also be descending on the score-field so it can support the sorting:
db.index.ensureIndex({"word":1, "date":1, "score":-1});
The other index looks good to speed up that query, but you still might want to confirm that by running the query in the mongo shell followed by .explain().
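For example, something like this (a sketch using the two queries from the question):

db.index.find({ "word": "ipad", "date": 20140113 }).sort({ "score": -1 }).explain("executionStats")
db.index.find({ "word": "ipad", "date": 20140113, "docid": 324 }).explain("executionStats")
// Look for an IXSCAN on the expected index and the absence of a separate in-memory SORT stage.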
Indexes are always a tradeoff of space and write-performance for read-performance. When you can't afford the space, you can't have the index and have to deal with it. But usually the write-performance is the larger concern, because drive space is usually cheap.
But maybe you could save one of the three indexes you have. "Wait, three indexes?" Yes, keep in mind that every collection must have a unique index on the _id field, which is created implicitly when the collection is initialized.
But the _id field doesn't have to be an auto-generated ObjectId. It can be anything you want. When you have another index with a uniqueness constraint and you have no use for the _id field, you can move that unique constraint to the _id field to save an index. Your documents would then look like this:
{
    _id: {
        "word": "ipad",
        "date": 20140113,
        "docid": 324
    },
    "score": 98
}
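A sketch of how that would look in practice (note that equality matches against an embedded _id document are sensitive to field order, and the word/date query would then have to address the fields as "_id.word" and "_id.date"):

// Insert with the compound _id instead of an auto-generated ObjectId
db.index.insert({ _id: { "word": "ipad", "date": 20140113, "docid": 324 }, "score": 98 })

// The docid lookup becomes an exact match on _id and is served by the built-in _id index
db.index.find({ _id: { "word": "ipad", "date": 20140113, "docid": 324 } })

// The score-sorted query would still need its own index on the dotted fields, e.g.
// db.index.ensureIndex({ "_id.word": 1, "_id.date": 1, "score": -1 })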