MongoDB indexing on variable query

I have a collection of user generated posts. They contain the following fields
_id: String
groupId: String // id of the group this was posted in
authorId: String
tagIds: [String]
latestActivity: Date // updated whenever someone comments on this post
createdAt: Date
numberOfVotes: Number
...some more...
My queries always look something like this...
Posts.find({
groupId: {$in: [...]},
authorId: 'xyz', // only SOMETIMES included
tagIds: {$in: [...]}, // only SOMETIMES included
}, {
sort: {latestActivity/createdAt/numberOfVotes: +1/-1, _id: -1}
});
So I'm always querying on groupId, but only sometimes adding tagIds or authorId. I'm also switching out the field on which this is sorted. What would my best indexing strategy look like?
From what I've read so far here on SO, I would probably create multiple compound indices and have them always start with {groupId: 1, _id: -1} - because they are included in every query, they are good prefix candidates.
Now, I'm guessing that creating a new index for every possible combination wouldn't be a good idea memory-wise. Should I therefore just keep it like that and only index groupId and _id?
Thanks.

You are going in the right direction. With compound indexes, you want the most selective fields on the left and the range conditions on the right. {groupId: 1, _id: -1} satisfies this.
It's also important to remember that a compound index is used when the query keys form a prefix of the index, reading the index keys from left to right. So one compound index can cover many common scenarios. If, for example, your index were {groupId: 1, authorId: 1, tagIds: 1} and your query were Posts.find({groupId: {$in: [...]}, authorId: 'xyz'}), that index would be used (even though tagIds is absent). Posts.find({groupId: {$in: [...]}, tagIds: {$in: [...]}}) would also use this index, but only through its groupId prefix: since authorId is skipped, the tagIds condition is filtered against the scanned index entries rather than narrowing the scan. However, Posts.find({authorId: 'xyz', tagIds: {$in: [...]}}) would not use the index at all, because the first field of the index is missing from the query.
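To make the prefix rule concrete, here is a minimal shell sketch; the posts collection name and the placeholder values are assumptions, while the field names come from the question:
// Hypothetical compound index, groupId first
db.posts.createIndex({ groupId: 1, authorId: 1, tagIds: 1 })
// Uses the index: {groupId, authorId} is a prefix of the index keys
db.posts.find({ groupId: { $in: ['g1', 'g2'] }, authorId: 'xyz' })
// Still uses the index, but only the groupId prefix bounds the scan; tagIds is filtered afterwards
db.posts.find({ groupId: { $in: ['g1', 'g2'] }, tagIds: { $in: ['t1'] } })
// Cannot use the index: the leftmost key, groupId, is missing from the query
db.posts.find({ authorId: 'xyz', tagIds: { $in: ['t1'] } })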
Given all of that, I would suggest starting with {groupId: 1, authorId: 1, tagIds: 1, _id: -1}. groupId is the only non-optional field in your queries, so it goes on the left, before the optional ones. authorId looks more selective than tagIds, so it goes immediately after groupId. You're sorting by _id, so that goes on the right. Be sure to analyze query performance with explain() for the different ways you query the data, and make sure they all choose this index (otherwise you'll need to make more tweaks, or possibly a second compound index). You could then create other indexes and force a query to use them with hint() to do some A/B testing on performance.
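A hedged sketch of that verification step in the shell; the posts collection name and the placeholder values are assumptions, while the index and query shapes come from the answer above:
db.posts.createIndex({ groupId: 1, authorId: 1, tagIds: 1, _id: -1 })
// Check which plan wins and how many index keys / documents are examined
db.posts.find({ groupId: { $in: ['g1'] }, authorId: 'xyz' }).sort({ _id: -1 }).explain('executionStats')
// For A/B testing, create the narrower index from the question and force it with hint()
db.posts.createIndex({ groupId: 1, _id: -1 })
db.posts.find({ groupId: { $in: ['g1'] } }).sort({ _id: -1 }).hint({ groupId: 1, _id: -1 }).explain('executionStats')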

Related

MongoDB Single or Compound index for sorted query?

I'm using mongoose and I have a query like this:
const stories = await Story.find({ genre: 'romance' }).sort({ createdAt: -1 })
I want to set an index on Story so that this kind of query becomes faster.
Which one of these is the best approach and why:
1. Create one Compound index with both fields:
Story.createIndex({genre: 1, createdAt: -1})
2. Create two separate indexes on each field:
Story.createIndex({genre: 1})
Story.createIndex({createdAt: -1})
If "genre" is always going to be part of the search field, using a compound index will always result in better performance.
1.) A compound index consisting of the field being filtered on and the field being sorted on can satisfy both the filter and the sort.
2.) Creating two separate indexes assumes that both will be used to fulfil the query, which is not true. Index intersection only applies in a few circumstances; in this particular case, since one field is in the search criteria and the other is in the sort, index intersection will not be employed by Mongo. (Link).
So in this situation, I would go with the compound index.
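A minimal shell sketch of that choice, assuming the mongoose Story model maps to a stories collection (the collection name is an assumption; the fields come from the question):
// Compound index: the equality field first, then the sort field
db.stories.createIndex({ genre: 1, createdAt: -1 })
// The index satisfies both the filter and the sort, so explain() should show an
// IXSCAN with no separate in-memory SORT stage
db.stories.find({ genre: 'romance' }).sort({ createdAt: -1 }).explain('executionStats')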
As long as all your queries that use the createdAt field also use the genre field, you should use the compound index.
Let's compare the two options:
Queries: As long as what I stated above holds, both options will behave the same; there is no difference between the two when it comes to query execution speed.
Memory: A compound index will use less memory, which is crucial if you have limited RAM. Let's see the difference with an example.
Say we have 3 documents:
{
name: "john",
last_name: "mayer"
}
{
name: "john",
last_name: "cake"
}
{
name: "banana",
last_name: "pie"
}
Now if we run db.collection.stats() with option 1 (the compound index), we get:
totalIndexSize: 53248.0
Whereas with option 2 (the two separate indexes):
totalIndexSize: 69632.0
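If you want to reproduce that comparison yourself, here is a rough sketch (the people collection name is a placeholder; the exact sizes will differ on your data):
// Option 1: one compound index
db.people.createIndex({ name: 1, last_name: 1 })
db.people.stats().totalIndexSize
// Option 2: drop it and create two single-field indexes instead
db.people.dropIndex({ name: 1, last_name: 1 })
db.people.createIndex({ name: 1 })
db.people.createIndex({ last_name: 1 })
db.people.stats().totalIndexSize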
Inserting: full disclosure, I have no idea how each option is affected. From small tests it seems that a compound index is slightly quicker, but I could not really find documentation on this, nor did I investigate deeper.

In MongoDB, which index would be more efficient? One that queries an array with two values, or one that uses an $or statement?

Let's say I have a document that looks like this:
{
_id: ObjectId("5260ca3a1606ed3e76bf3835"),
event_id: "20131020_NFL_SF_TEN",
team: {
away: "SF",
home: "TEN"
}
}
I want to query for any game with "SF" as the away team or home team. So I put an index on team.away and team.home and run an $or query to find all San Francisco games.
Another option:
{
_id: ObjectId("5260ca3a1606ed3e76bf3835"),
event_id: "20131020_NFL_SF_TEN",
team: [
{
name: "SF",
loc: "AWAY"
},
{
name: "TEN",
loc: "HOME"
}
]
}
In the array above, I could put an index on team.name instead of two indexes as before. Then I would query team.name for any game with "SF" inside.
Which query would be more efficient? Thanks!
I believe that you would want to use the second example you gave with the single index on team.name.
There are some special considerations that you need to know when working with the $or operator. Quoting from the documentation (with some additional formatting):
When using indexes with $or queries, remember that each clause of an $or query will execute in parallel. These clauses can each use their own index.
db.inventory.find( { $or: [ { price: 1.99 }, { sale: true } ] } )
For this query, you would create one index on price:
db.inventory.ensureIndex({ price: 1 })
and another index on sale:
db.inventory.ensureIndex({ sale: 1 })
rather than a compound index.
Taking your first example into consideration, it doesn't make much sense to index a field that you are not going to query specifically. When you say that you don't mind whether SF is playing an away or a home game, you would always have to include both the away and home fields in your query, so you're maintaining two indexes when all you really need to query is one value: SF.
It seems appropriate to mention at this stage that you should always consider the majority of your queries when thinking about the format of your documents. Think about the queries that you are planning to make most often and build your documents accordingly. It's always better to handle 80% of the cases as best you can rather than trying to solve all the possibilities (which might lead to worse performance overall).
Looking at your second example, with the nested documents, as you said, you would only need to use one index (saving valuable space on your server).
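A small sketch of the two options in the shell (the games collection name is a placeholder; the field names come from the question):
// Second schema: one multikey index over the embedded team array
db.games.createIndex({ 'team.name': 1 })
db.games.find({ 'team.name': 'SF' })
// First schema: two indexes plus an $or to cover home and away
db.games.createIndex({ 'team.away': 1 })
db.games.createIndex({ 'team.home': 1 })
db.games.find({ $or: [{ 'team.away': 'SF' }, { 'team.home': 'SF' }] })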
Some more relevant quotes from the $or docs (again with added formatting):
Also, when using the $or operator with the sort() method in a query, the query will not use the indexes on the $or fields. Consider the following query which adds a sort() method to the above query:
db.inventory.find({ $or: [{ price: 1.99 }, { sale: true }] }).sort({ item: 1 })
This modified query will not use the index on price nor the index on sale.
So the question now is - are you planning to use the sort() function? If the answer is yes then you should be aware that your indexes might turn out to be useless! :(
The take-away from this is pretty much "it depends!". Consider the queries you plan to make, and consider what document structure and indexes will be most beneficial to you according to your usage projections.

How to index an $or query with sort

Suppose I have a query that looks something like this:
db.things.find({
deleted: false,
type: 'thing',
$or: [{
'creator._id': someid
}, {
'parent._id': someid
}, {
'somerelation._id': someid
}]
}).sort({
'date.created': -1
})
That is, I want to find documents that meet one of those three conditions and sort them by newest. However, $or queries do not use indexes in parallel when used with a sort. So how would I index this query?
http://docs.mongodb.org/manual/core/indexes/#index-behaviors-and-limitations
You can assume the following selectivity:
deleted - 99%
type - 25%
creator._id, parent._id, somerelation._id - < 1%
Now you are going to need more than one index for this query; there is no doubt about that.
The question is what indexes?
Now you have to take into consideration that none of your $or clauses will be able to sort their data optimally using the index, due to a bug in MongoDB's query optimizer: https://jira.mongodb.org/browse/SERVER-1205 .
So you know that the $or will have some performance problems with a sort, and that putting the sort field into the $or clause indexes is useless at the moment.
So, considering this, the first index you want is one that covers the base query you are making. As @Leonid said, you could make this into a compound index; however, I would not order it the way he did. Instead, I would do:
db.col.ensureIndex({ type: -1, deleted: -1, 'date.created': -1 })
I am very unsure about the deleted field being in the index at all, due to its extremely low selectivity; keeping it in the index could in fact make the operation less performant than leaving it out (this is true for most databases, including SQL ones). This part will need testing by you; maybe the field should go last(?).
As to the order of the index keys, again I have just guessed. I have made all fields descending because your sort is descending, but you will need to check this with explain() yourself.
So that should be able to handle the master clause of your query. Now to deal with those $ors.
Each $or clause will use an index separately, and the MongoDB query optimizer will look for indexes for them separately too, as though they were separate queries altogether. A little snag worth noting about compound indexes ( http://docs.mongodb.org/manual/core/indexes/#compound-indexes ) is that they work on prefixes (see the example here: http://docs.mongodb.org/manual/core/indexes/#id5 ), so you can't make one single compound index cover all three clauses. A more optimal way of declaring indexes for the $or (considering the bug above) is:
db.col.ensureIndex({ 'creator._id': 1 });
db.col.ensureIndex({ 'parent._id': 1 });
db.col.ensureIndex({ 'somerelation._id': 1 });
It should be able to get you started on making optimal indexes for your query.
I should stress however that you need to test this yourself.
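As a rough way to check that each $or branch picks up its own index (the output format of explain varies between MongoDB versions, so treat this as a sketch):
// someid is the id you are matching, as in the question
db.things.find({
    deleted: false,
    type: 'thing',
    $or: [
        { 'creator._id': someid },
        { 'parent._id': someid },
        { 'somerelation._id': someid }
    ]
}).sort({ 'date.created': -1 }).explain('executionStats')
// Look for one index scan per $or branch in the winning plan, and expect a separate
// SORT stage, since the sort field is not covered by those single-field indexes.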
MongoDB can use only one index per query, so I can't see a way to use indexes to query someid in your model.
So the best approach is to add a special field for this task:
ids = [creator._id, parent._id, somerelation._id]
In this case you'll be able to query without using the $or operator:
db.things.find({
deleted: false,
type: 'thing',
ids: someid
}).sort({
'date.created': -1
})
In this case your index will look something like this:
{deleted:1, type:1, ids:1, 'date.created': -1}
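A brief sketch of how that denormalized ids field could be kept in sync at write time; this is my assumption about how you would maintain it, not part of the answer above, and creatorId/parentId/relationId are placeholder variables:
// Copy the three ids into the ids array when inserting
db.things.insert({
    deleted: false,
    type: 'thing',
    creator: { _id: creatorId },
    parent: { _id: parentId },
    somerelation: { _id: relationId },
    ids: [creatorId, parentId, relationId],
    date: { created: new Date() }
})
// Index supporting the simplified query and its sort
db.things.ensureIndex({ deleted: 1, type: 1, ids: 1, 'date.created': -1 })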
If you have the flexibility to adjust the schema, I would suggest adding a new field, associatedIds: [ ], which would hold creator._id, parent._id and somerelation._id. You can update that field atomically whenever you update the main corresponding field, but now you can have a compound index on this field, type and created date, which eliminates the need for $or in your query entirely.
Considering your requirement for indexing, I would suggest using the $orderby query modifier alongside your $or query. By that I mean you should be able to index on the criteria in your $or expressions and then use $orderby to sort the result.
For example:
db.things.find({
$query: {
deleted: false,
type: 'thing',
$or: [{
'creator._id': someid
}, {
'parent._id': someid
}, {
'somerelation._id': someid
}]
},
$orderby: { 'date.created': -1 }
})
The above query would require a compound index on each of the fields in the $or expressions, combined with the sort field specified in the $orderby criteria.
For example:
db.things.ensureIndex({ 'parent._id': 1, 'date.created': -1 })
and so on for the other fields.
It is good practice to specify a limit on the result to prevent MongoDB from performing a huge in-memory sort.
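A short sketch of those per-branch indexes together with a limit, using the standard cursor sort() for brevity (the limit value of 50 is illustrative, and someid is the id being matched, as in the question):
// One compound index per $or branch, each ending with the sort field
db.things.ensureIndex({ 'creator._id': 1, 'date.created': -1 })
db.things.ensureIndex({ 'parent._id': 1, 'date.created': -1 })
db.things.ensureIndex({ 'somerelation._id': 1, 'date.created': -1 })
// Cap the result set so any in-memory sorting stays small
db.things.find({
    deleted: false,
    type: 'thing',
    $or: [
        { 'creator._id': someid },
        { 'parent._id': someid },
        { 'somerelation._id': someid }
    ]
}).sort({ 'date.created': -1 }).limit(50)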
Read more on the $orderby query modifier here

Sorting on multiple fields in MongoDB

I have a query in Mongo where I want to give preference to the first sort field and then the second.
Say I query like this:
db.col.find({category: 'A'}).sort({updated: -1, rating: -1}).limit(10).explain()
So I created the following index
db.col.ensureIndex({category: 1, rating: -1, updated: -1})
It worked just fine, scanning only as many objects as needed, i.e. 10.
But now I need to query
db.col.find({category: { $ne: 'A' }}).sort({updated: -1, rating: -1}).limit(10)
So I created the following index:
db.col.ensureIndex({rating: -1, updated: -1})
but this leads to a scan of the whole collection, and when I create
db.col.ensureIndex({ updated: -1 ,rating: -1})
It scans fewer documents.
I just want to be clear about sorting on multiple fields and what order needs to be preserved when doing so. From reading the MongoDB docs, it seems clear that the field on which we need to sort should be the last field in the index, so that is what I assumed in my $ne query above. Am I doing anything wrong?
The MongoDB query optimizer works by trying different plans to determine which approach works best for a given query. The winning plan for that query pattern is then cached for the next ~1,000 queries or until you do an explain().
To understand which query plans were considered, you should use explain(1), e.g.:
db.col.find({category:'A'}).sort({updated: -1}).explain(1)
The allPlans detail will show all plans that were compared.
If you run a query which is not very selective (for example, if many records match your criteria of {category: { $ne:'A'}}), it may be faster for MongoDB to find results using a BasicCursor (table scan) rather than matching against an index.
The order of fields in the query generally does not make a difference for the index selection (there are a few exceptions with range queries). The order of fields in a sort does affect the index selection. If your sort() criteria does not match the index order, the result data has to be re-sorted after the index is used (you should see scanAndOrder:true in the explain output if this happens).
It's also worth noting that MongoDB will only use one index per query (with the exception of $ors).
So if you are trying to optimize the query:
db.col.find({category:'A'}).sort({updated: -1, rating: -1})
You will want to include all three fields in the index:
db.col.ensureIndex({category: 1, updated: -1, rating: -1})
FYI, if you want to force a particular query to use an index (generally not needed or recommended), there is a hint() option you can try.
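A short sketch of that workflow in the shell, reusing the collection and values from the question (explain output fields vary by MongoDB version):
db.col.ensureIndex({ category: 1, updated: -1, rating: -1 })
// Verify the index is chosen and that no scanAndOrder / in-memory sort is needed
db.col.find({ category: 'A' }).sort({ updated: -1, rating: -1 }).limit(10).explain(1)
// Force a specific index for comparison (generally not needed)
db.col.find({ category: 'A' }).sort({ updated: -1, rating: -1 }).hint({ category: 1, updated: -1, rating: -1 }).limit(10).explain(1)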
That is true, but there are two layers of ordering here, since you are sorting on a compound index.
As you noticed, when the first field of the index matches the first field of the sort, it worked and the index was used. However, when working the other way around, it does not.
As such, by your own observations, the order that needs to be preserved is the query's order of fields, from first to last. The Mongo query analyser can sometimes move fields around to match an index, but normally it will just try to match the first field; if it cannot, it will skip the index.
Try this code: it will sort the data first on name and then, keeping name fixed, sort on filter:
var cursor = db.collection('vc').find({ "name": { $in: [ /cpu/, /memo/ ] } }, { _id: 0 }).sort({ "name": 1, "filter": 1 });
Sort and Index Use
MongoDB can obtain the results of a sort operation from an index which includes the sort fields. MongoDB may use multiple indexes to support a sort operation if the sort uses the same indexes as the query predicate. ... Sort operations that use an index often have better performance than blocking sorts.
db.restaurants.find().sort( { "borough": 1, "_id": 1 } )
More information:
https://docs.mongodb.com/manual/reference/method/cursor.sort/

Retrieving sequential documents based on _id

I've got a scenario where documents are indexed in Elasticsearch, and I need to retrieve the matched document in Mongo along with the preceding and following documents, sorted by a timestamp. The idea is to retrieve the context of the document along with the original document.
I am able to do this successfully now if I use a sequential _id. As an example, using the following data:
[
{_id: 1, value: 'Example One' },
{_id: 2, value: 'Example Two' },
{_id: 3, value: 'Example Three' },
{_id: 4, value: 'Example Four' },
{_id: 5, value: 'Example Five' },
{_id: 6, value: 'Example Six' },
...
]
If I search for 'Four' in ES, I get back the document _id of 4. Since the ids are sequential, I can create a Mongo query to pull the range between id - 2 and id + 2, in this case 2 to 6. This works well as long as I never delete documents; when I delete a document, I have to re-index the entire series to eliminate the gap. I'm looking for a way of achieving the same results while being able to delete documents without having to update all of the documents.
I'm open to using other technologies to achieve this, I am not necessarily tied to mongodb.
I can get the desired results using something like the following:
collection.find( {_id: { $gte: matchedId } } ).limit(3);
collection.find( {_id: { $lt: matchedId } } ).sort({$natural: -1}).limit(2);
Not quite as nice as using an explicit range, but no need to recalculate anything on document deletion.
Yes, I am aware of the limitations of natural order, and it is not a problem for my particular use case.
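If it helps, here is a small shell sketch wrapping those two queries into a helper; the getContext name, the examples collection and the explicit _id sorts are my additions, not part of the answer above:
// Returns up to 2 documents before and 2 after (and including) the matched _id
function getContext(collection, matchedId) {
    var after = collection.find({ _id: { $gte: matchedId } }).sort({ _id: 1 }).limit(3).toArray();
    var before = collection.find({ _id: { $lt: matchedId } }).sort({ _id: -1 }).limit(2).toArray().reverse();
    return before.concat(after);
}
var context = getContext(db.examples, 4);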
This problem has nothing to do with MongoDB in particular and would be no different with another database (e.g. an RDBMS). You will have to look for document ids smaller/larger than the current id and take the first two matching in each direction; yes, this means you need to perform multiple queries. The only other option is to implement a linked list on top of MongoDB, where each document stores pointers to its right and left neighbour nodes. And yes, in case of a deletion you need to adjust the pointers (basic data structure algorithms...). The downside is that you will need multiple operations to perform the changes, and since MongoDB is not transactional you may end up with inconsistent previous/next pointers... that's why MongoDB completely sucks here.
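For what it's worth, a rough sketch of the linked-list idea described above; the prev/next field names and the examples collection are placeholders, and the deletion step is exactly the non-atomic multi-update the answer warns about:
// Each document keeps pointers to its neighbours in timestamp order, e.g.
// { _id: 4, value: 'Example Four', prev: 3, next: 5 }
var doc = db.examples.findOne({ _id: 4 });
// Fetch the document plus its neighbours
var context = db.examples.find({ _id: { $in: [doc.prev, doc._id, doc.next] } }).toArray();
// Deleting a document means re-linking its neighbours (two separate updates)
db.examples.update({ _id: doc.prev }, { $set: { next: doc.next } });
db.examples.update({ _id: doc.next }, { $set: { prev: doc.prev } });
db.examples.remove({ _id: doc._id });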