Slow creation of four-field index in MongoDB

I have a ProductRequest collection in MongoDB. It is a somewhat large collection, but with not that many documents: the count is a bit over 300,000, but the average document size is close to 1 MB, so the data footprint is large.
To speed up certain queries I am setting up index on this collection:
db.ProductRequest.ensureIndex({ processed: 1, parsed: 1, error: 1, processDate: 1 })
The first three fields are Boolean, the last one is a datetime.
The command has now been running for almost 24 hours and has not come back.
I already have an index on the 'processed' and 'parsed' fields (together) and a separate one on 'error'. Why does creation of this four-field index take forever? My understanding is that the size of an individual record should not matter in this case; am I wrong?
Additional Info:
MongoDB version 2.6.1 64-bit
Host OS Centos 6.5
Sharding: yes, the shard key is _id. Number of shards: 2; each shard is backed by a 3-member replica set.

I believe it's because you are putting an index on Boolean fields.
Since there are only two values (true or false), with 300,000 documents an index on one of those fields still has to scan roughly 150,000 entries to find all matching documents, and in your case you have 3 Boolean fields, which makes it even slower.

You won't see a huge benefit from an index on those three fields and processDate compared to an index just on processDate. Indexes on boolean fields aren't very useful in the presence of other index-able fields because they aren't very selective. If you give a process date, there are only 8 possibilities for the combination of the other fields to further narrow down the results via the index.
Also, you should switch the order. Put processDate first as it is much more selective than a boolean field. That should greatly simplify the index and speed up the index build.
Finally, index creation in MongoDB is sometimes unavoidably slow and expensive because it involves creating large B-trees. The payoff, which is absolutely worth it, of course, is faster queries. It's possible that more than 24 hours are needed for an index build. Have you checked what the saturated resource is? It's likely the CPU for an index build. Your best option for this case is to create the index in the background. Background index builds
don't block read and write operations for the duration of the build, the way foreground index builds do
take longer
produce initially larger indexes that will converge to the size of an equivalent foreground index over time
You set an index build to occur in the background with an extra option to the ensureIndex call:
db.myCollection.ensureIndex({ "myField" : 1 }, { "background" : 1 })
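Applied to the collection from the question, the sketch would be something like the following (this assumes the reordering suggested above, with processDate first):
db.ProductRequest.ensureIndex(
    { processDate: 1, processed: 1, parsed: 1, error: 1 },
    { background: true }
)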

Related

What is the correct way to Index in MongoDB when big combination of fields exist

Consider that I have a search panel that includes multiple options, like in the picture below:
I'm working with Mongo and created a compound index on 3-4 properties with a specific order.
But when I run different combinations of searches I see a different order in the execution plan (explain()) every time. Sometimes I see a collection scan (bad), and sometimes it fits the index correctly (IXSCAN).
The selective fields that should be handled by Mongo indexes are: brand, Types, Status, Warehouse, Carries, Search (only by id).
My question is:
Do I have to create all combinations of all fields with different orders? That could be 10-20 compound indexes. Or 1-3 big compound indexes? But again, that will not solve the ordering problem.
What is the best strategy to deal with a big variety of field combinations?
I use the same query structure with different combinations of pairs.
// Example Query.
// Fields can differ every time according to the user's selection (and order)!
db.getCollection("orders").find({
'$and': [
{
'status': {
'$in': [
'XXX',
'YYY'
]
}
},
{
'searchId': {
'$in': [
'3859447'
]
}
},
{
'origin.brand': {
'$in': [
'aaaa',
'bbbb',
'cccc',
'ddd',
'eee',
'bundle'
]
}
},
{
'$or': [
{
'origin.carries': 'YYY'
},
{
'origin.carries': 'ZZZ'
},
{
'origin.carries': 'WWWW'
}
]
}
]
}).sort({"timestamp":1})
// My compound index is:
{ status: 1, searchId: -1, "origin.brand": 1, "origin.carries": 1, timestamp: 1 }
but that is only 1 combination... there could be plenty, like:
a. { status: 1 }
b. { status: 1, searchId: -1 }
c. { status: 1, searchId: -1, "origin.brand": 1 }
d. { status: 1, searchId: -1, "origin.brand": 1, "origin.carries": 1 }
........
Additionally, what will happen to write/read performance? I think writes will degrade more than reads...
The query patterns are:
1. find(...) with '$and'/'$or' + sort
2. Aggregation with $match/$sort
Thanks
Generally, indexes are only useful if they are over a selective field. This means the number of documents that have a particular value is small relative to the overall number of documents.
What "small" means varies on the data set and the query. A 1% selectivity is pretty safe when deciding whether an index makes sense. If an particular value exists in, say, 10% of documents, performing a table scan may be more efficient than using an index over the respective field.
With that in mind, some of your fields will be selective and some will not be. For example, I suspect filtering by "OK" will not be very selective. You can eliminate non-selective fields from indexing considerations - if someone wants all orders which are "OK" with no other conditions they'll end up doing a table scan. If someone wants orders which are "OK" and have other conditions, whatever index is applicable to other conditions will be used.
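As a quick sanity check, you can eyeball selectivity in the shell by comparing counts (a sketch; "orders" and the "OK" status value are just the examples from this question, so adjust to your own schema):
// how many documents match the value vs. the whole collection
db.getCollection("orders").count({ status: "OK" })
db.getCollection("orders").count()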
Now that you are left with selective (or at least somewhat selective) fields, consider what queries are both popular and selective. For example, perhaps brand+type would be such a combination. You could add compound indexes that match popular queries which you expect to be selective.
Now, what happens if someone filters by brand only? This could be selective or not depending on the data. If you already have a compound index on brand+type, you'd leave it up to the database to determine whether a brand only query is more efficient to fulfill via the brand+type index or via a collection scan.
Continue in this manner with other popular queries and fields.
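For example, if brand+type turns out to be both popular and selective, a matching compound index might look like the following (a sketch only; whether brand lives under "origin" and what the type field is actually called depend on your schema):
db.getCollection("orders").createIndex({ "origin.brand": 1, "type": 1 })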
So you have subdocuments, ranged queries, and sorting by 1 field only.
That eliminates most of the possible permutations, assuming there are no other surprises.
D. SM already covered selectivity - you should really listen to what the man says and at least upvote.
The other thing to consider is the order of the fields in the compound index:
fields that have direct match like $eq
fields you sort on
fields with ranged queries: $in, $lt, $or etc
These are common rules for all b-trees. Now things that are specific to mongo:
A compound index can include no more than one multikey field - a field indexed through an array, like "origin.brand" when origin is an array of subdocuments. Again I assume origins are embedded docs, so the document's shape is like this:
{
    _id: ...,
    status: ...,
    timestamp: ...,
    origin: [
        { brand: ..., carries: ... },
        { brand: ..., carries: ... },
        { brand: ..., carries: ... }
    ]
}
For your query the best index would be
{
    searchId: 1,
    timestamp: 1,
    status: 1,            /** only if it is selective enough **/
    "origin.carries": 1   /** or brand, depending on data **/
}
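In shell form that could be created roughly like so (a sketch; keep or drop status, and swap carries for brand, based on what your data shows is selective):
db.getCollection("orders").createIndex(
    { searchId: 1, timestamp: 1, status: 1, "origin.carries": 1 },
    { background: true }
)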
Regarding the number of indexes - it depends on data size. Ensure all indexes fit into RAM, otherwise everything will be really slow.
Last but not least - indexing is not a one-off job but a lifestyle. Data changes over time, and so do queries. If you care about performance and have finite resources you should keep an eye on the database. Check slow queries to add new indexes, and collect stats from users' queries to remove unused indexes and free up some room. Basically, apply common sense.
I noticed this one-year-old topic because I am more or less struggling with a similar issue: users can request queries with an unpredictable set of fields, which makes it nearly impossible to decide (or change) how indexes should be defined.
Even worse: the user should indicate some value (or range) for the fields that make up the sharding-key, otherwise we cannot help MongoDB limit its search to only a few shards (or chunks, for that matter).
When the user needs the liberty to search on other fields that are not necessarily the ones which make up the sharding-key, then we're stuck with a full-database search. Our database is some 10's of TB in size...
Indexes should fit in RAM? That can only be achieved with small databases, meaning some 100's of GB max. How about my 37 TB database? Those indexes won't fit in RAM.
So I am trying out a POC inspired by the UNIX filesystem structures where we have inodes pointing to data blocks:
we have a cluster with 108 shards, each contains 100 chunks
at insert time, we take some fields of which we know they yield a good cardinality of the data, and we compute the sharding-key from those fields; the document goes into the main collection (call it "Main_col") on that computed shard, so with a certain chunk-number (equal to our computed sharding-key value)
from the original document, we take a few 'crucial' fields (the list of such fields can evolve as your needs change) and store a small extra document in another collection (call these "Crucial_col_A", "Crucial_col_B", etc., one for each such field): that document contains the value of this crucial field, plus an array with the chunk-number where the original full document has been stored in the 'big' collection "Main_col"; consider this a 'pointer' to the chunk in collection "Main_col" where this full document exists. These "Crucial_col_X" collections are sharded based on the value of the 'crucial' field.
when we insert another document that has the same value for some 'crucial' field "A", then the array of chunk-numbers in "Crucial_col_A" will be updated (with a 'merge') to contain the (different or same) chunk number of this next full document from "Main_col"
a user can now define queries with criteria for at least one of those 'crucial' fields, plus (optionally) any other criteria on other fields in the documents; the first criterion, for the crucial field (say field "B"), will run very quickly (because sharded on the value of "B") and return the small document from "Crucial_col_B", in which we have the array of chunk-numbers in "Main_col" where any document exists that has field "B" equal to the given criterion. Then we run a second set of parallel queries, one for each shardkey-value=chunk-number (or one per shard, to be decided) that we find in the array from before. We combine the results of those parallel subqueries, and then apply further filtering if the user gave additional criteria.
Thus this involves 2 query-steps: first in the "Crucial_col_X" collection to obtain the array with chunk-numbers where the full documents exist, and then the second query on those specific chunks in "Main_col".
The first query is done with a precise value for the 'crucial' field, so the exact shard/chunk is known, thus this query goes very fast.
The second (set of) queries is done with precise values for the sharding-keys (= the chunk numbers), so these are also expected to go very fast.
This way of working would eliminate the burden of defining many index combinations.
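For illustration, the two query steps might look roughly like this in the shell (purely a sketch; the field names value, chunks and chunkNo, as well as the criterion values, are hypothetical placeholders for the structures described above):
// Step 1: fetch the small 'pointer' document for crucial field "B" (fast, sharded on that value)
var pointer = db.Crucial_col_B.findOne({ value: "someBValue" })
// Step 2: query the main collection, restricted to the chunk-numbers found above,
// then apply any additional user criteria
db.Main_col.find({
    chunkNo: { $in: pointer.chunks },   // hypothetical computed sharding-key field
    B: "someBValue",
    otherField: { $gt: 42 }             // hypothetical extra criterion from the user
})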

Unique multi key hashed index in MongoDB

I have a collection with several billion documents and need to create a unique multi-key index for every attribute of my documents.
The problem is, I get an error if I try to do that because the generated keys would be too large.
pymongo.errors.OperationFailure: WiredTigerIndex::insert: key too large to index, failing
I found out MongoDB lets you create hashed indexes, which would resolve this problem; however, they cannot be used for multi-key indexes.
How can I resolve this?
My first idea was to create another attribute on each of my documents holding a hash of every value of its attributes, then create an index on that new field.
However, this would mean recalculating the hash every time I wish to add a new attribute, plus the excessive amount of time necessary to create both the hashes and the indexes.
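To make the idea concrete, something along these lines is what I have in mind (a rough sketch in the legacy mongo shell; attrHash and the collection name are hypothetical, it assumes flat scalar attributes, and running it over several billion documents would of course be very slow):
db.myCollection.find().forEach(function (doc) {
    // build a deterministic string from every attribute except _id and the hash field itself
    var concatenated = Object.keys(doc)
        .filter(function (k) { return k !== "_id" && k !== "attrHash"; })
        .sort()
        .map(function (k) { return k + "=" + doc[k]; })
        .join("|");
    db.myCollection.update({ _id: doc._id }, { $set: { attrHash: hex_md5(concatenated) } });
});
db.myCollection.createIndex({ attrHash: 1 }, { unique: true });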
This is behavior enforced since MongoDB 2.6 to prevent the total size of an index entry from exceeding 1024 bytes (the Index Key Length Limit).
In MongoDB 2.6, if you attempt to insert or update a document so that the value of an indexed field is longer than the Index Key Length Limit, the operation will fail and return an error to the client. In previous versions of MongoDB, these operations would successfully insert or modify a document but the index or indexes would not include references to the document.
For migration purposes and other temporary scenarios you can fall back to the 2.4 handling of this case, where the exception is not triggered, by setting this MongoDB server flag:
db.getSiblingDB('admin').runCommand( { setParameter: 1, failIndexKeyTooLong: false } )
This however is not recommended.
Also consider that creating indexes for every attribute of your documents may not be the optimal solution at all.
Have you examined how you query your documents and which fields you key on? Have you used explain() to view the query plan? It would be an exception to the rule if you tell us that you query on all fields all the time.
Here are the recommended MongoDB indexing strategies.
Excessive indexing has a price as well and should be avoided.

Huge query times when sort applied

I have a collection in MongoDB of about 1.1 million records. The average object size is 7.4 KB, so the database is around 8 GB. I have an application which parses through the collection, but the parsing must be done synchronously, ordered by the endedAt date in each record. It is also important that these are not live games (isLive: false), because otherwise the endedAt date won't exist. Once a record has been parsed, to ensure it isn't pulled in again, I set isComplete: true on the record.
Now because the data must be returned to me the earliest first according to the endedAt date, I run the sort() function on the set. This seems to be a huge bottleneck for me right now.
My query for getting the next X rows to parse (remember, these need to be synchronous) is as follows:
db.matches.find({ isComplete: { $exists: false }, isLive: false }).limit(n)
When n is simply 5, the speed of the query is:
0.22s
However, when I add the necessary sort to the same query, because I absolutely must only return the next n rows by the earliest endedAt date (if they haven't already been parsed), the query time increases substantially to:
46.5s
The strange thing is, I've managed to parse a few hundred thousand games without problem, and the queries have gotten slower and slower until now, where they effectively time out. To most people this would immediately sound like an index problem; however, I have indexes on the following fields:
idx_startedAt (1)
idx_endedAt (1)
idx_isComplete (1)
idx_isLive (1)
I'm not sure what else I should be indexing to increase the speed of this query, but I'm becoming pretty lost as to how best to approach this problem. Any help, as always, is much appreciated.
You need to index all of the filter criteria using a compound index, including the sort.
Filtering only a single field will still require scanning a large number of documents from disk and then sorting the results in memory. Indexing all of the fields, including the sort, will minimize the number of documents read from disk and prevent the need to sort the results in memory.
The ideal index for this query would be the following:
db.matches.createIndex({ "isLive" : 1, "isComplete" : 1, "endedAt" : 1 }, { "background" : true } )
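With that index in place, the paging query with its sort should be able to walk the index in endedAt order instead of sorting in memory (a sketch, using n = 5 as in the question):
db.matches.find({ isLive: false, isComplete: { $exists: false } })
          .sort({ endedAt: 1 })
          .limit(5)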

compound Index or single index in mongodb

I got a query like this that gets called 90% of the times:
db.xyz.find({ "ws.wz.eId" : 665 , "ws.ce1.id" : 665)
and another one like this that gets called 10% of the times:
db.xyz.find({ "ws.wz.eId" : 111 , "ws.ce2.id" : 111)
You can see that the id values for the two fields in each query are the same.
Now I'm wondering if I should just create a single index on "ws.wz.eId", or if I should create two compound indexes: one for {"ws.wz.eId", "ws.ce1.id"} and another one for {"ws.wz.eId", "ws.ce2.id"}.
It seems to me that the single index is the best choice; however, I might be wrong, so I would like to know if there is value in creating the compound indexes, or any other type.
As muratgu already pointed out, the best way to reason about performance is to stop reasoning and start measuring instead.
However, since measurements can be quite tricky, here's some theory:
You might want to consider one compound index {"ws.wz.eId", "ws.ce1.id"} because that can be used for the 90% case and, for the ten percent case, is equivalent to just having an index on ws.wz.eId.
When you do this, the first query can be matched through the index, the second query will have to find all candidates with matching ws.wz.eId first (fast, index present) and then scan-and-match all candidates to filter out those documents that don't match the ws.ce2.id criterion. Whether that is expensive or not depends on the number of documents with same ws.wz.eId that must be scanned, so this depends very much on your data.
An important factor is the selectivity of the key. For example, if there's a million documents with same ws.wz.eId and only one of those has the ws.ce2.id you're looking for, you might need the index, or want to reverse the query.
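If you do go the compound-index route, the corresponding sketch is simply the following (field order matters: ws.wz.eId comes first so the 10% query can still use the index prefix):
db.xyz.createIndex({ "ws.wz.eId": 1, "ws.ce1.id": 1 })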

mongodb index strategy for range query with different fields

Almost all my documents include 2 fields: a start timestamp and a final timestamp. In each query I need to retrieve elements which fall within a selected period of time, so start should be after the selected value and final should be before the selected timestamp.
The query looks like:
db.collection.find({start:{$gt:DateTime(...)}, final:{$lt:DateTime(...)}})
So what is the best indexing strategy for that scenario?
By the way, which is better for performance - storing dates as datetimes or as Unix timestamps (which are just long values)?
To add a little more to baloo's answer.
On the timestamp vs. long issue: generally the MongoDB server will not see a difference. The BSON encoding length is the same (64 bits). You may see a performance difference on the client side depending on the driver's encoding. As an example, on the Java side, using the 10gen driver a timestamp is rendered as a Date, which is a lot heavier than a Long. There are drivers that try to avoid that overhead.
The other issue is that you will see a performance improvement if you close the range for the first field of the index. So if you use the index suggested by baloo:
db.collection.ensureIndex({start: 1, final: 1})
The query will perform (potentially much) better if it is:
db.collection.find({ start: { $gt: DateTime(...), $lt: DateTime(...) },
                     final: { $lt: DateTime(...) } })
Conceptually, if you think of the index as a tree, the closed range limits both sides of the tree instead of just one side. Without the closed range, the server has to "check" all of the entries with a start greater than the timestamp provided, since it does not know the relation between start and final.
You may even find that the query performance is no better than with a single-field index like:
db.collection.ensureIndex({start: 1})
Most of the savings comes from pruning on the first field. The exception is when the query is covered by the index, or when the ordering/sort for the results can be derived from the index.
You can use a Compound index in order to create an index for multiple fields.
db.collection.ensureIndex({start: 1, final: 1})
Compare different queries and indexes by using explain() to get the most out of your database.
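For instance, a rough way to compare the open-ended and closed-range forms is to look at both plans (a sketch; the DateTime(...) placeholders stand in for real date values):
db.collection.find({ start: { $gt: DateTime(...) },
                     final: { $lt: DateTime(...) } }).explain()
db.collection.find({ start: { $gt: DateTime(...), $lt: DateTime(...) },
                     final: { $lt: DateTime(...) } }).explain()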