Compound Index along with Single Index - mongodb

I have two fields in a document I want to index. One of them is Receive Time, and the other one is Serial Number. I want users to be able to query on Serial Number alone or on both Serial Number and Receive Time.
The way I see it, I have two options.
A.
db.collection.ensureIndex({SerialNumber: 1, ReceiveTime: 1})
db.collection.ensureIndex({ReceiveTime: 1})
B.
db.collection.ensureIndex({ReceiveTime: 1, SerialNumber: 1})
db.collection.ensureIndex({SerialNumber: 1})
Apparently, option A is the better choice (you want fields with low uniqueness to come later in an index) rather than option B. Why is that the case?
However, the MongoDB documentation also suggests that if an index only grows on its right-hand side (i.e. it is built on an ever-increasing value such as a timestamp), then the whole index need not fit in RAM. If this is a very write-heavy application, would B then be the better option? (Compound indexes are larger than single-field indexes, and the compound index in B is prefixed by the ever-increasing ReceiveTime, whereas the one in A is not.)

The decision between {SerialNumber: 1, ReceiveTime: 1} and {ReceiveTime: 1, SerialNumber: 1} should be based on the type of queries that you plan to perform. If you generally query for a specific SerialNumber but a large range of possible ReceiveTimes, then you want to use {SerialNumber: 1, ReceiveTime: 1}. Conversely, if your queries are specific for ReceiveTime but more general for SerialNumber then go for {ReceiveTime: 1, SerialNumber: 1}. This way each query is likely to require fewer pages of the index, and will minimize the amount of swapping that the OS has to do.
Similarly, if you are always querying by, say, the most recent ReceiveTimes, then you can keep the working set small by using {ReceiveTime: 1, SerialNumber: 1}. You will only need to keep the pages corresponding to the most recent ReceiveTimes in memory. This is what the documentation you linked to is suggesting.
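To make the trade-off concrete, here is a minimal sketch of option A and the query shapes each index serves (the literal serial number and date are placeholders; ensureIndex is the legacy spelling of today's createIndex):
// Option A: compound index prefixed by SerialNumber, plus a single-field index on ReceiveTime
db.collection.ensureIndex({ SerialNumber: 1, ReceiveTime: 1 })
db.collection.ensureIndex({ ReceiveTime: 1 })

// Served by the compound index (equality on the prefix, optional range on the suffix):
db.collection.find({ SerialNumber: "SN-12345" })
db.collection.find({ SerialNumber: "SN-12345", ReceiveTime: { $gte: ISODate("2014-01-01") } })

// Served by the single-field index:
db.collection.find({ ReceiveTime: { $gte: ISODate("2014-01-01") } })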

Related

What is the correct way to Index in MongoDB when big combination of fields exist

Consider a search panel that includes multiple filter options:
I'm working with MongoDB and created a compound index on 3-4 properties in a specific order.
But when I run different combinations of searches, I see a different execution plan each time (explain()). Sometimes it falls back to a collection scan (bad), and sometimes it hits the index correctly (IXSCAN).
The selective fields that should be handled by the indexes are: brand, type, status, warehouse, carries, and search (by id only).
My question is:
Do I have to create every combination of the fields in every order, which could mean 10-20 compound indexes? Or 1-3 big compound indexes, which again would not solve the ordering problem?
What is the best strategy for dealing with a large variety of field combinations?
I use the same query structure with different combinations of filters:
// Example Query.
// The fields can differ every time, depending on what the user selects (and in what order)!
db.getCollection("orders").find({
    '$and': [
        {
            'status': {
                '$in': ['XXX', 'YYY']
            }
        },
        {
            'searchId': {
                '$in': ['3859447']
            }
        },
        {
            'origin.brand': {
                '$in': ['aaaa', 'bbbb', 'cccc', 'ddd', 'eee', 'bundle']
            }
        },
        {
            '$or': [
                { 'origin.carries': 'YYY' },
                { 'origin.carries': 'ZZZ' },
                { 'origin.carries': 'WWWW' }
            ]
        }
    ]
}).sort({ "timestamp": 1 })
// My compound index is:
{ status: 1, searchId: -1, "origin.brand": 1, "origin.carries": 1, timestamp: 1 }
But that is only one combination; there could be plenty, like:
a. { status: 1 }
b. { status: 1, searchId: -1 }
c. { status: 1, searchId: -1, "origin.brand": 1 }
d. { status: 1, searchId: -1, "origin.brand": 1, "origin.carries": 1 }
...
Additionally, what will happen to read/write performance? I assume write performance will degrade relative to reads...
The query patterns are:
1. find(...) with $and/$or + sort
2. Aggregation with $match/$sort
Thanks
Generally, indexes are only useful if they are over a selective field. This means the number of documents that have a particular value is small relative to the overall number of documents.
What "small" means varies on the data set and the query. A 1% selectivity is pretty safe when deciding whether an index makes sense. If an particular value exists in, say, 10% of documents, performing a table scan may be more efficient than using an index over the respective field.
With that in mind, some of your fields will be selective and some will not be. For example, I suspect filtering by "OK" will not be very selective. You can eliminate non-selective fields from indexing considerations - if someone wants all orders which are "OK" with no other conditions they'll end up doing a table scan. If someone wants orders which are "OK" and have other conditions, whatever index is applicable to other conditions will be used.
Now that you are left with selective (or at least somewhat selective) fields, consider what queries are both popular and selective. For example, perhaps brand+type would be such a combination. You could add compound indexes that match popular queries which you expect to be selective.
Now, what happens if someone filters by brand only? This could be selective or not depending on the data. If you already have a compound index on brand+type, you'd leave it up to the database to determine whether a brand only query is more efficient to fulfill via the brand+type index or via a collection scan.
Continue in this manner with other popular queries and fields.
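As a hedged sketch of that approach (the orders collection and searchId come from the question; treating brand+type as a popular, selective combination is only an assumption to illustrate the idea):
// Index only the combinations you expect to be both popular and selective
db.orders.createIndex({ "origin.brand": 1, type: 1 })   // assumed popular brand+type filter
db.orders.createIndex({ searchId: 1 })                  // searchId alone looks highly selective

// A brand-only query can still use the first index through its prefix;
// the planner will fall back to a collection scan when that is cheaper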
So you have subdocuments, ranged queries, and sorting by one field only.
That eliminates most of the possible permutations, assuming there are no other surprises.
D. SM already covered selectivity - you should really listen to what he says, and at least upvote.
The other thing to consider is the order of the fields in the compound index:
fields with an exact match, like $eq
fields you sort on
fields with range queries: $in, $lt, $or, etc.
These are common rules for all B-trees. Now, things that are specific to MongoDB:
A compound index can include at most one multikey field - a field indexed through an array of subdocuments, like "origin.brand". Again, I assume origin is an array of embedded documents, so the document's shape is like this:
{
  _id: ...,
  status: ...,
  timestamp: ...,
  origin: [
    { brand: ..., carries: ... },
    { brand: ..., carries: ... },
    { brand: ..., carries: ... }
  ]
}
For your query the best index would be
{
  searchId: 1,
  timestamp: 1,
  status: 1,              /** only if it is selective enough **/
  "origin.carries": 1     /** or "origin.brand", depending on the data **/
}
Regarding the number of indexes - it depends on the data size. Ensure all indexes fit into RAM, otherwise things will be really slow.
Last but not least - indexing is not a one-off job but a lifestyle. Data changes over time, and so do queries. If you care about performance and have finite resources, keep an eye on the database: check slow queries to add new indexes, and collect statistics on users' queries to remove unused indexes and free up some room. Basically, apply common sense; a minimal sketch of that kind of housekeeping follows below.
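For example (a minimal sketch; the orders collection name is taken from the example query, and $indexStats requires MongoDB 3.2+):
// Total size of all indexes on the collection, in bytes
db.orders.totalIndexSize()

// Per-index usage counters since the last server restart;
// indexes whose accesses.ops stays at 0 are candidates for removal
db.orders.aggregate([{ $indexStats: {} }])

// Cross-check the index definitions against the stats above
db.orders.getIndexes()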
I noticed this one-year-old topic because I am struggling with more or less the same issue: users can request queries with an unpredictable set of fields, which makes it nearly impossible to decide (or change) how the indexes should be defined.
Even worse: the user should indicate some value (or range) for the fields that make up the shard key, otherwise we cannot help MongoDB limit its search to only a few shards (or chunks, for that matter).
When the user needs the liberty to search on other fields that are not necessarily the ones making up the shard key, we're stuck with a full-database search. Our database is some tens of TB in size...
Indexes should fit in RAM? That can only be achieved with small databases, meaning a few hundred GB at most. What about my 37 TB database? Its indexes won't fit in RAM.
So I am trying out a POC inspired by UNIX filesystem structures, where inodes point to data blocks:
we have a cluster with 108 shards, each containing 100 chunks
at insert time, we take some fields that we know yield a good cardinality and compute the shard key from those fields; the document goes into the main collection (call it "Main_col") on that computed shard, i.e. with a certain chunk number (equal to our computed shard-key value)
from the original document, we take a few 'crucial' fields (the list of such fields can evolve as your needs change) and store a small extra document in another collection (call these "Crucial_col_A", "Crucial_col_B", etc., one per such field): that document contains the value of the crucial field, plus an array with the chunk numbers where the original full document has been stored in the 'big' collection "Main_col"; consider this a 'pointer' to the chunks in "Main_col" where the full document exists. These "Crucial_col_X" collections are sharded on the value of the 'crucial' field.
when we insert another document with the same value for some 'crucial' field "A", the array of chunk numbers in "Crucial_col_A" will be updated (with a 'merge') to include the chunk number - new or already present - of this next full document in "Main_col"
a user can now define queries with criteria on at least one of those 'crucial' fields, plus (optionally) any other criteria on other fields in the documents; the first criterion, on the crucial field (say field "B"), runs very quickly (because "Crucial_col_B" is sharded on the value of "B") and returns the small document from "Crucial_col_B", which holds the array of chunk numbers in "Main_col" where any document with field "B" equal to the given criterion exists. Then we run a second set of parallel queries, one per shard-key value = chunk number (or one per shard, to be decided) found in that array. We combine the results of those parallel sub-queries and then apply further filtering if the user gave additional criteria.
Thus this involves two query steps: first in the "Crucial_col_X" collection to obtain the array of chunk numbers where the full documents exist, and then a second query on those specific chunks in "Main_col".
The first query is done with a precise value for the 'crucial' field, so the exact shard/chunk is known and this query is very fast.
The second (set of) queries are done with precise values for the shard keys (= the chunk numbers), so these are also expected to be very fast.
This way of working would eliminate the burden of defining many index combinations; a rough sketch of the two query steps is shown below.
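As a rough, hypothetical sketch of those two steps (the collection names follow the description above; the value, chunkNumbers, and shardKey field names, and the literal lookup value, are assumed for illustration):
// Step 1: fetch the 'pointer' document for a given value of crucial field "B"
// (Crucial_col_B is sharded on this value, so the lookup targets a single shard)
var pointer = db.Crucial_col_B.findOne({ value: "B-value-from-user" })

// Step 2: one sub-query per chunk number found, each targeting Main_col's shard key,
// with the user's additional criteria applied on top; shown sequentially here,
// but these could run in parallel
var results = []
pointer.chunkNumbers.forEach(function (chunkNo) {
    db.Main_col.find({ shardKey: chunkNo /*, ...additional user criteria... */ })
        .forEach(function (doc) { results.push(doc) })
})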

Which MongoDB indexes should be created for different sorting and filtering conditions to improve performance?

I have a MongoDB collection with ~100,000,000 records.
On the website, users search for these records with "Refinement search" functionality, where they can filter by multiple criteria:
by country, state, region;
by price range;
by industry;
Also, they can review search results sorted:
by title (asc/desc),
by price (asc/desc),
by bestMatch field.
I need to create indexes to avoid a full scan for any of the combinations above (because users use most of the combinations). Following the Equality-Sort-Range rule for creating indexes, I would have to create a lot of indexes:
all filter combinations × all sortings × all range filters, like the following:
country_title
state_title
region_title
title_price
industry_title
country_title_price
country_industry_title
state_industry_title
...
country_price
state_price
region_price
...
country_bestMatch
state_bestMatch
region_bestMatch
...
In reality, I have more criteria (including equality & range) and more sortings. For example, I have multiple price fields and users can sort by any of those prices, so I would have to create all the filtering indexes for each price field in case the user sorts by that price.
We use MongoDB 4.0.9, only one server yet.
Before I had sorting it was easier: I could at least have one compound index like country_state_region and always include country & state in the query when searching by region. But with the sorting field at the end, I cannot do that anymore - I have to create different indexes even for location (country/state/region), combined with every sorting field.
Also, not all products have a price, so I cannot simply sort by the price field. Instead, I have to create two indexes: {hasPrice: -1, price: 1} and {hasPrice: -1, price: -1} (here hasPrice is -1 so that records with hasPrice=true always come first, regardless of the price sort direction).
Currently, I use NodeJS code similar to the following to generate the indexes (a simplified example):
for (const filterFields of getAllCombinationsOf(['country', 'state', 'region', 'industry', 'price'])) {
  for (const sortingField of ['name', 'price', 'bestMatch']) {
    const index = {
      ...(_.fromPairs(filterFields.map(x => [x, 1]))),
      [sortingField]: 1
    };
    await collection.ensureIndex(index);
  }
}
So the code above generates more than 90 indexes, and in my real task the number is even higher.
Is it possible somehow to decrease the number of indexes without reducing the query performance?
Thanks!
Firstly, in MongoDB (see https://docs.mongodb.com/manual/reference/limits/) a single collection can have no more than 64 indexes. And you should never get anywhere near 64 indexes unless the collection sees no writes, or only very few.
Is it possible somehow to decrease the number of indexes without reducing the query performance?
Without sacrificing either functionality or query performance, you can't.
A few things you can do (assuming you are using pagination to show results):
Create a separate (non-compound) index on each field and let the MongoDB query planner choose an index based on the meta-information it has (cardinality, document counts, etc.). Of course, there will be a performance hit.
Based on your judgment and some analytics, create compound indexes only for the combinations that will be used most frequently.
Most important - when creating compound indexes you can leave off the sort column. Say you are filtering by industry and sorting by price. If you have a compound index on (industry, price) then everything will work fine. But if you only have an index on industry (assuming paginated results), the query will be quite fast for the first few pages and will keep degrading as you move to later pages. Generally, users don't navigate past 5-6 pages. Also keep in mind that for larger skip values the query will start to fail because of the 32 MB memory limit for in-memory sorts; this can be overcome with an aggregation (instead of a find) with allowDiskUse enabled.
Check whether keyset pagination (also called the seek method) can be used in your case - a rough sketch follows below.
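As a rough sketch of keyset pagination under this question's assumptions (sorting by price with an industry filter; the products collection name and the "retail" value are only illustrative), each page resumes from the last value seen instead of using skip:
// Page 1: first 20 results, sorted by price with _id as a tie-breaker
// (backed by a compound index such as { industry: 1, price: 1, _id: 1 })
var page1 = db.products.find({ industry: "retail" })
    .sort({ price: 1, _id: 1 })
    .limit(20)
    .toArray()
var last = page1[page1.length - 1]   // last document of the previous page

// Page 2: resume strictly after (last.price, last._id) instead of using skip(),
// so there is no growing in-memory sort as users page deeper
db.products.find({
    industry: "retail",
    $or: [
        { price: { $gt: last.price } },
        { price: last.price, _id: { $gt: last._id } }
    ]
}).sort({ price: 1, _id: 1 }).limit(20)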

Slow creation of four-field index in MongoDB

I have a ProductRequest collection in MongoDB. It is a somewhat large collection, but without that many documents: a bit over 300,000, with an average document size close to 1 MB, so the data footprint is large.
To speed up certain queries I am setting up an index on this collection:
db.ProductRequest.ensureIndex({ processed: 1, parsed: 1, error: 1, processDate: 1 })
The first three fields are Boolean; the last one is a datetime.
The command has now been running for nearly 24 hours and has not come back.
I already have a compound index on the 'processed' and 'parsed' fields (together) and a separate one on 'error'. Why does creation of this four-field index take forever? My understanding is that the size of an individual record should not matter here - am I wrong?
Additional Info:
MongoDB version 2.6.1 64-bit
Host OS Centos 6.5
Sharding: yes, the shard key is _id. Number of shards: 2; each shard is a replica set of 3 members.
I believe it's because you are putting indexes on Boolean fields.
Since there are only two values (true or false), with 300,000 documents an index on such a field still has to scan roughly 150,000 documents to find all matches, and in your case there are three Boolean fields, which makes it even slower.
You won't see a huge benefit from an index on those three fields and processDate compared to an index just on processDate. Indexes on boolean fields aren't very useful in the presence of other index-able fields because they aren't very selective. If you give a process date, there are only 8 possibilities for the combination of the other fields to further narrow down the results via the index.
Also, you should switch the order. Put processDate first as it is much more selective than a boolean field. That should greatly simplify the index and speed up the index build.
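Applied to the index from the question, that reordering would look something like this (a sketch only; whether you also keep or drop the existing indexes depends on your other queries):
// processDate first (most selective), the Boolean flags after it
db.ProductRequest.ensureIndex({ processDate: 1, processed: 1, parsed: 1, error: 1 })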
Finally, index creation in MongoDB is sometimes unavoidably slow and expensive because it involves creating large B-trees. The payoff, which is absolutely worth it, of course, is faster queries. It's possible that more than 24 hours are needed for an index build. Have you checked what the saturated resource is? It's likely the CPU for an index build. Your best option for this case is to create the index in the background. Background index builds
don't block read and write operation for the duration like foreground index builds
take longer
produce initially larger indexes that will converge to the size of an equivalent foreground index over time
You set an index build to occur in the background with an extra option to the ensureIndex call:
db.myCollection.ensureIndex({ "myField" : 1 }, { "background" : 1 })

Query optimizer index selection on compound index when querying only by the second field

Suppose I have a compound index { a: 1, b: 1 }.
The query db.Collection.find( { b: 1 } ) doesn't use this index. The query optimizer does not appear to select this index as a candidate run.
However, if you specifically hint the index, the query runs much faster and nscanned is much lower:
db.Collection.find( { b: 1 } ).hint( { a: 1, b: 1 } )
My question is, if using the index results in a faster query, why would the query optimizer ignore the index in my query on b alone?
From the page you link to on "compound index": "Compound indexes support queries on any prefix of the fields in the index." The case where an index helps on a query that is not a prefix is fairly specific, and has something to do with the distribution of values of a (I believe it does a better job as the number of possible values of a decreases). The optimal thing to do in that case is to not try using an index, because that could make things slower.
In the comments, you suggest that it shouldn't be very much slower in the worst case, but could give large improvements. Well, let's try a little testing. I built a collection with 10^6 documents, where each document i is {a: i, b: i+1}. This is, in my hypothesis, the worst case for a query on only b when using the index {a: 1, b: 1}.
For the query
db.testing.find({b: 0}).explain()
we find that it scanned 1,000,000 documents (not surprising) in about 350ms. Not bad for an unindexed query. Now, let's hint that index:
db.testing.find({b: 0}).hint("a_1_b_1").explain()
This time it only scanned 954,546 documents. I don't know enough about MongoDB indexes to explain this. However, this slightly smaller scan took about 2300ms, or 6.5x as long as the unindexed query.
So yes, a poorly-indexed query can be much worse than an unindexed one. But this doesn't completely answer your question - why doesn't the query optimizer figure this out?
The query optimizer runs different plans in parallel the first time it sees a query, and remembers the best for future queries (this is occasionally re-evaluated). But, it will only try candidate indexes - that is, those where some non-empty prefix of the index matches some portion of the query. By this standard, of course, {a: 1, b: 1} is not a candidate index for a query on just b.
I would suggest either creating a second index on {b: 1} (or at least with that prefix), or reversing the order of the one you already have (create {b: 1, a: 1} and then drop the old one).
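A minimal sketch of the second suggestion, assuming no other query depends on the a-prefix of the existing index:
// Create the reversed index so that queries on b alone have a usable prefix
db.Collection.ensureIndex({ b: 1, a: 1 })

// Once it is built, drop the original index
db.Collection.dropIndex({ a: 1, b: 1 })

// Queries on b alone should now use an index scan without a hint
db.Collection.find({ b: 1 }).explain()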
Compound indexes are generally used for prefix-matched queries, or fully matched ones.
Clearly your first query doesn't qualify. You don't need a hack for this; instead, you can just hint the optimiser to use the { a: 1, b: 1 } index:
db.Collection.find({ b: 1 }).hint({ a:1, b:1 })
If you have a phone book that is organized by "Last name, First name" but you only had a first name, do you think the phone book would help you find the person you were searching for?
That's what you are trying to force the optimizer to do when you have an index on a, b and you are selecting on b. It means for every value of a it needs to look and see if b matches.
There are many possible reasons why using this index may be faster than a collection scan in some circumstances. In general, it's not a candidate index and you should not use this as a solution to speeding up queries on b.
The way the current version's MongoDB query optimizer works is that it tries the query with multiple query plans (all candidate indexes plus a collection scan). Whichever is fastest "wins", the others are terminated, and the winning plan is cached for some period of time. If you run `db.collection.find(...).explain(true)` you will actually see all the "plans" it has tried. If the index is not considered a candidate then it won't be in the mix for this phase - the only way to get the query to use it is to explicitly "hint" it.
The query optimizer will be changing in the next major release so the above applies to the state of the world in 2.4 and earlier versions.

Compound index order based on field selectivity

I have two fields a and b, where b has substantially higher selectivity than a.
Now, if I am only querying on both a and b (never on either field by itself), which of the following two indexes is better and why:
{a: 1, b : 1}
{b: 1, a : 1}
Explain seems to return almost identical results, but I read somewhere that you should put higher selectivity fields first. I don't know why that would make sense though.
After some extensive work to improve queries on a 150,000,000-record database, I have found the following:
it is not necessarily the higher-selectivity fields but the fields that are "faster" to match which, when moved to the first position, can increase performance drastically
I had an index composed of the following fields:
zip, address, city, first name, last name
Address is matched via an array, not a simple string = string comparison, so it takes the most time to execute and is the slowest to match. The first index I created was address_zip_city_last_name_first_name, and matching 1000 records against the whole DB would run for hours.
The address field probably has the highest selectivity of these, but since it is not matched by simple string equality, it takes the most time. The match looks something like this:
{ address: { $all: ["1233", "main", "avenue"] } }
By changing the index to put the "faster" fields at the beginning, for example zip_city_first_name_last_name_address, performance was much better: the same 1000 records would match in just one second instead of running for hours.
Hope this helps someone
cheers
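A hypothetical sketch of that reordering (the people collection name and exact field spellings are only illustrative):
// Original: the slow-to-match array field comes first
db.people.ensureIndex({ address: 1, zip: 1, city: 1, last_name: 1, first_name: 1 })

// Reordered: cheap equality fields first, the $all-matched address array last
db.people.ensureIndex({ zip: 1, city: 1, first_name: 1, last_name: 1, address: 1 })
db.people.dropIndex({ address: 1, zip: 1, city: 1, last_name: 1, first_name: 1 })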
After doing some further analysis, the two indexes are in fact pretty much identical from a performance point of view.
Really, if you are in a similar situation, the main consideration should be whether in the future you are more likely to query on a alone or on b alone, and put that field first in the index.
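If you want to verify this on your own data, a quick comparison along these lines (the executionStats verbosity is available on MongoDB 3.0+; the values 5 and 7 are placeholders) should show near-identical keys examined for either ordering:
db.collection.ensureIndex({ a: 1, b: 1 })
db.collection.ensureIndex({ b: 1, a: 1 })

// Compare totalKeysExamined and executionTimeMillis between the two orderings
db.collection.find({ a: 5, b: 7 }).hint({ a: 1, b: 1 }).explain("executionStats")
db.collection.find({ a: 5, b: 7 }).hint({ b: 1, a: 1 }).explain("executionStats")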
I believe the optimiser will choose the best index to use, although you can provide hints,
e.g.
db.collection.find({user:u, foo:d}).hint({user:1});
see http://www.mongodb.org/display/DOCS/Optimization