MongoDB complex indices

I'm trying to understand how to best work with indices in MongoDB. Let's say that I have a collection of documents like this one:
{
    _id: 1,
    keywords: ["gap", "casual", "shorts", "oatmeal"],
    age: 21,
    brand: "Gap",
    color: "Black",
    gender: "female",
    retailer: "Gap",
    style: "Casual Shorts",
    student: false,
    location: "US"
}
and I regularly run a query to find all documents that match a set of criteria for each of those fields, something like:
db.items.find({
    age: { $gt: 13, $lt: 40 },
    brand: { $in: ['Gap', 'Target'] },
    retailer: { $in: ['Gap', 'Target'] },
    gender: { $in: ['male', 'female'] },
    style: { $in: ['Casual Shorts', 'Jeans'] },
    location: { $in: ['US', 'International'] },
    color: { $in: ['Black', 'Green'] },
    keywords: { $all: ['gap', 'casual'] }
})
I'm trying to figure out what sort of index I can create to improve the speed of a query such as this. Should I create a compound index like this:
db.items.ensureIndex({ age: 1, brand: 1, retailer: 1, gender: 1, style: 1, location: 1, color: 1, keywords: 1})
or is there a better set of indices I can create to optimize this query?

Should I create a compound index like this:
db.items.ensureIndex({age: 1, brand: 1, retailer: 1, gender: 1, style: 1, location: 1, color: 1, keywords: 1})
You can create an index like the one above, but you would be indexing almost the entire collection. Indexes take space; the more fields in the index, the more space is used, usually in RAM, although index data can be swapped out to disk. Indexes also incur a write penalty.
Your index seems wasteful, since indexing just a few of those fields would probably already make MongoDB scan a set of documents close in size to the expected result of the find operation.
Is there a better set of indices I can create to optimize this query?
Like I said before, probably yes. But this question is very difficult to answer without knowing details of the collection, like the number of documents it has, which values each field can have, how those values are distributed in the collection (50% gender male, 50% gender female?), how they correlate to each other, etc.
There are a few indexing strategies, but normally you should strive to create indexes with high selectivity. Choose "small" field combinations that will help MongoDB locate the desired documents while scanning a "reasonable" amount of them. Again, "small" and "reasonable" will depend on the characteristics of the collection and the query you are performing.
Since this is a fairly complex subject, here are some references that should help you build more appropriate indexes.
http://emptysqua.re/blog/optimizing-mongodb-compound-indexes/
http://docs.mongodb.org/manual/faq/indexes/#how-do-you-determine-what-fields-to-index
http://docs.mongodb.org/manual/tutorial/create-queries-that-ensure-selectivity/
And use cursor.explain to evaluate your indexes.
http://docs.mongodb.org/manual/reference/method/cursor.explain/
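For example, a quick way to compare candidate indexes is to run the actual query through explain and check how many documents were scanned versus returned. A minimal sketch, reusing part of the query from the question (the "executionStats" mode is only available on newer servers):
db.items.find({
    age: { $gt: 13, $lt: 40 },
    brand: { $in: ['Gap', 'Target'] },
    keywords: { $all: ['gap', 'casual'] }
}).explain()
// On MongoDB 3.0+, .explain("executionStats") reports nReturned and
// totalDocsExamined; ideally the two numbers are close to each other.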

A large index like this one will penalize you on writes. It is better to index just what you need, and let Mongo's optimiser do most of the work for you. You can always give it a hint or, as a last resort, reindex if your application or data usage changes drastically.
Your query will use the index for the fields that have one (fast), and fall back to scanning the remaining documents (slow) for the rest of the filters.
Depending on your application, a few standalone indexes could be better. Adding more indexes will not improve performance; with the write penalty, it could even make things worse (YMMV).
Here is a basic algorithm for selecting fields to put in an index (a small sketch follows the list):
Which single field appears in your queries most often?
If that single field is present in a query, will a table scan still be expensive?
What other field could you index to further reduce the table scan?
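For example, if brand turned out to be the field that appears most often in your queries and age narrows the scan down further (both assumptions made purely for illustration), a much smaller index would already help:
// A minimal sketch; which fields to pick depends on your data and query mix.
db.items.ensureIndex({ brand: 1, age: 1 })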

This index is reasonable for your query. If you also project only the indexed fields, MongoDB can treat it as a covered query for this index, since there is no need to access the documents: all the data can be fetched from the index.
From the docs:
"Because the index “covers” the query, MongoDB can both match the query conditions and return the results using only the index; MongoDB does not need to look at the documents, only the index, to fulfill the query. An index can also cover an aggregation pipeline operation on unsharded collections."
Some remarks:
This index will only be used by queries that include a filter on age. A query that only filters by brand or retailer will probably not use this index.
Adding an index on only one or two of the most selective fields of your query will already bring a very significant performance boost. The more fields you add the larger the index size will be on disk.
You may want to generate some random sample data and measure the performance of this with different indexes or sets of indexes. This is obviously the safest way to know.

Related

In mongo how to efficiently sort and get the last value matching the query?

Let's say I have some records as below:
{
    id: 1,
    age: 22,
    name: 'A',
    class: 'Y'
},
{
    id: 2,
    age: 25,
    name: 'B',
    class: 'D'
},
{
    id: 3,
    age: 30,
    name: 'C',
    class: 'Y'
},
{
    id: 4,
    age: 40,
    name: 'D',
    class: 'B'
}
Now I need to get the last (closest) record which has an age less than 28. For this, I can use the following code:
const firstUsersYoungerThan28 = await Users.find({
  class: 'C',
  age: {
    $lt: 28
  }
})
  .sort({
    age: -1
  })
  .limit(1)
  .lean();
const firstUserYoungerThan28 = firstUsersYoungerThan28[0];
Let's say the collection has millions of records. My question is, is this the most efficient way? Is there a better way to do this?
My biggest concern is, does my app load the records to the memory in order to sort them in the first place?
See the documentation about cursor.sort():
Limit Results
You can use sort() in conjunction with limit() to return the first (in terms of the sort order) k documents, where k is the specified limit. If MongoDB cannot obtain the sort order via an index scan, then MongoDB uses a top-k sort algorithm. This algorithm buffers the first k results (or last, depending on the sort order) seen so far by the underlying index or collection access. If at any point the memory footprint of these k results exceeds 32 megabytes, the query will fail.
Make sure that you have an index on age. Then when MongoDB does the sorting, it will only keep the first k (in your case, 1) results in memory.
MongoDB can handle millions of documents for such basic queries, don't worry. Just make sure that you do have the proper indexes specified.
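A minimal sketch of that setup in the shell (assuming the underlying collection is called users, as Mongoose would name it):
// With an index on age the sort is satisfied by walking the index,
// and limit(1) means only a single result is ever kept around.
db.users.createIndex({ age: -1 })
db.users.find({ class: 'C', age: { $lt: 28 } })
    .sort({ age: -1 })
    .limit(1)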
You should create an index with the following properties:
The index only contains documents with an age below 28
The index sorts by age, matching the query's sort order.
This type of index is a partial index. Refer to the documentation about partial indexes and using indexes for sorting for more information.
The following index should do the trick:
db.users.createIndex(
    { age: -1 },
    { partialFilterExpression: { age: { $lt: 28 } } }
)
Without the index, a full collection scan would probably be needed, which is very bad for performance.
With the index, only the fields you specified are stored in the index (and kept in memory when possible), and searching them is much faster.
Note that MongoDB does not necessarily keep the whole index in memory; index pages may be loaded lazily as they are needed. This is an implementation detail.
As far as I know, there's no way to create an index with only the last record in it.
If you want to understand how databases work, you have to understand how indexes work, specifically B-Trees. B-Trees are a very common construct across databases.
Indexes do have their disadvantages, so don't create one for each query. Always measure before creating an index, since it might not be necessary.

MongoDB range query with a sort - how to speed up?

I have a query which is routinely taking around 30 seconds to run for a collection with 1 million documents. This query is to form part of a search engine, where the requirement is that every search completes in less than 5 seconds. Using a simplified example here (the actual docs have embedded documents and other attributes), let's say I have the following:
1 million docs in a Users collection, where each looks as follows:
{
    name: "Dan",
    age: 30,
    followers: 400
},
{
    name: "Sally",
    age: 42,
    followers: 250
}
... etc
Now, let's say I want to return the IDs of 10 users with a follower count between 200 and 300, sorted by age in descending order. This can be achieved with the following:
db.users.find({
    'followers': { $gt: 200, $lt: 300 }
})
    .projection({ '_id': 1 })
    .sort({ 'age': -1 })
    .limit(10)
I have the following compound Index created, which winningPlan tells me is being used:
db.users.createIndex({ 'followed_by': -1, 'age': -1 })
But this query is still taking ~30 seconds, as it has to examine thousands of docs, nearly equal to the number of docs that match the find query in this case. I have experimented with different indexes (with different field positions and sort orders) with no luck.
So my question is, what else can I do to either reduce the number of documents examined by the query, or speed up the process of examining them?
The query is taking long both in production and in my local dev environment, somewhat ruling out network and hardware factors. currentOp shows that the query is not waiting for locks while running, and that there are no other queries running at the same time.
It looks like you have the wrong index for your query: { 'followed_by': -1, 'age': -1 }. You should have an index on { followers: 1 } (but take the cardinality of that field into consideration). Even with that index you will still need to do an in-memory sort, but it should be much faster if the field has high cardinality, because you will not need to scan the whole collection for the filtering step, as you do now with the index prefix followed_by.
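A sketch of that suggestion, plus a way to verify it (assuming the field is indeed called followers, as in the query):
// Minimal sketch: index the range filter, then check how many documents
// are examined versus returned.
db.users.createIndex({ followers: 1 })
db.users.find({ followers: { $gt: 200, $lt: 300 } }, { _id: 1 })
    .sort({ age: -1 })
    .limit(10)
    .explain("executionStats")
// The in-memory sort still happens, but only over the documents that
// pass the followers filter, not the whole collection.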

MongoDB compound index, aggregation performance

I've gone through many articles about indexes in MongoDB but still have no clear understanding of indexing strategy for my particular case.
I have the following collection with more than 10 million items:
{
    active: BOOLEAN,
    afp: ARRAY_of_integers
}
Previously I was using aggregation with this pipeline:
pipeline = [
    {'$match': {'active': True}},
    {'$project': {
        'sub': {'$setIsSubset': [afp, '$afp']}
    }},
    {'$match': {'sub': True}}
]
Queries were pretty slow, so I started experimenting without aggregation:
db.collection.find({'active': True, 'afp': {'$all': afp}})
The latter query using $all runs much faster with the same indexes - no idea why...
I've tried these indexes (not much variations possible though):
{afp: 1}
{active: 1}
{active: 1, afp: 1}
{afp: 1, active: 1}
I don't know why, but the last index listed gives the best performance - any ideas about the reason would be much appreciated.
Then I decided to add constraints in order to possibly improve speed.
Considering that the number of integers in the "afp" field can vary, there's no reason to scan documents that have fewer integers than the query. So I created one more field, called "len_afp", for all documents, which contains that number (the length of afp).
Now documents look like this:
{
    active: BOOLEAN,
    afp: ARRAY_of_integers,
    len_afp: INTEGER
}
Query is:
db.collection.find({
    'active': True,
    'afp': {'$all': afp},
    'len_afp': {'$gte': len_afp}
})
Also I've added three new indexes:
{afp: 1, len_afp: 1, active: 1}
{afp: 1, active: 1, len_afp: 1}
{len_afp: 1, afp: 1, active: 1}
Again the last index listed gives the best performance, for an unknown reason.
So my question is: what is the logic behind field order in compound indexes, and how should that logic be applied when creating indexes?
Also, why does $all work faster than $setIsSubset when all other conditions are the same?
Can index intersection be used for my case instead of compound indexes?
Still the performance is pretty low - what am I doing wrong?
Can sharding help in my particular case (will it work using aggregation, or $all, or $gte)?
Sorry for the huge question, and thank you in advance!

Compound Indexes Order in Mongo

Let's say I have the following document schema:
{
    _id: ObjectId(...),
    name: "Kevin",
    weight: 500,
    hobby: "scala",
    favoriteFood: "chicken",
    pet: "parrot",
    favoriteMovie: "Diehard"
}
If I create a compound index on name-weight, I will be able to specify a strict parameter (name == "Kevin"), and a range on weight (between 50 and 200). However, I would not be able to do the reverse: specify a weight and give a "range" of names.
Of course compound index order matters where a range query is involved.
If only exact queries will be used (example: name == "Kevin", weight == 100, hobby == "C++"), then does the order actually matter for compound indexes?
When you have an exact query, the order should not matter. But when you want to be sure, the .explain() method on database cursors is your friend. It can tell you which indexes are used and how they are used when you perform a query in the mongo shell.
Important fields of the document returned by explain are:
indexOnly: when it's true, the query was completely covered by the index
n and nScanned: The first one tells you the number of found documents, the second how many documents had to be examined because the indexes couldn't sort them out. The latter shouldn't be notably higher than the first.
millis: number of milliseconds the query took to perform
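For example (a minimal sketch; the collection name is made up for illustration):
// Check how an exact-match query uses a name-weight compound index.
db.people.ensureIndex({ name: 1, weight: 1 })
db.people.find({ name: "Kevin", weight: 500 }).explain()
// In the legacy explain output, look at indexOnly, n, nScanned and millis;
// newer servers report the same information under queryPlanner and executionStats.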

Best way to sort on multiple fields in geonear query

In MongoDB I'm doing a geonear query on a collection containing ~3.5 million objects to return results near a certain lat/long. This query runs great if I have a basic 2d index on the object:
db.Listing.ensureIndex( { Coordinates: "2d" } );
However, now I also want to filter by other fields (Price, Property Type, Year Built, Beds, Baths, etc...) within the geonear query. When I add things like Price <= 10000000 to the query, it starts to slow down. I don't have any indexes on these other fields, so I'm wondering what the best approach is performance-wise.
I tried adding separate indexes for each of the other fields (11 total indexes on the collection); however, this made the query time out every time. I guess a collection can only handle having so many indexes?
db.Listing.ensureIndex( { Coordinates: "2d" } );
db.Listing.ensureIndex( { Price: 1 } );
db.Listing.ensureIndex( { Beds: 1 } );
db.Listing.ensureIndex( { Baths: 1 } );
etc...
Now I'm thinking of having just 1 compound index on the collection like so:
db.Listing.ensureIndex( { Coordinates: "2d", Price: 1, PropertyType: 1, YearBuilt: 1, Beds: 1, Baths: 1, HouseSize: 1, LotSize: 1, Stories: 1 } );
Is this the correct approach or is there a better way?
Yes, a compound index is probably the way to go. See http://www.mongodb.org/display/DOCS/Geospatial+Indexing#GeospatialIndexing-CompoundIndexes for details.
The only issue I see here is that you have a lot of fields in that index, which will make it rather big, so you may want to index only the fields with high cardinality. Use explain() to optimize this.
Also, given your dataset, it might be hard to keep the index right-balanced (and thus it will start hitting disk when it runs out of physical memory), which will slow things down considerably.
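A sketch of a slimmer compound geospatial index (which non-geo fields to keep is an assumption here; prefer the ones with high cardinality and verify with explain()):
// Minimal sketch: the 2d field first, then only the most selective filter.
db.Listing.ensureIndex( { Coordinates: "2d", Price: 1 } );
db.Listing.find( {
    Coordinates: { $near: [ -73.99, 40.73 ] },   // example coordinates
    Price: { $lte: 10000000 }
} ).limit(10).explain();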