I am using DocumentDb database(total size ~13TB)
This is the schema of the database, all the keys are single indexed
[
{
{key : {"customerId" : 1}},
{key : {"typeOfProduct" : 1}},
{key : {"date": 1}},
}
]
1st query: db.collectionName.find({"customerId" : <SAMPLE-CUSTOMER>}).count()
returns 500 documents
2nd query: db.collectionName.find({"customerId" : <SAMPLE-CUSTOMER>, "typeOfProduct" : <TYPE>}).count()
returns 200 documents
3nd query db.collectionName.find({"customerId" : <SAMPLE-CUSTOMER>, "date" : { "$gte" : NumberLong(1584055385383), "$lte" : NumberLong(1584141785383)}}).count()
query just keeps on running infinitely
In the 2nd query, My understanding is that DocumentDb first gets all the documents for matching customerId which is 500, and then iteratively searches for matching typeOfProduct through each document.
In the 3rd query, it is supposed to be working in a similar manner, but the query keeps on running infinitely.
Can someone explain why this is happening? Why is it so slow with the Date? Is it because of the size of the database or am I writing the query wrong?
I'm not sure if you have one three-field index (1) or three single-field indexes (2).
Both first and second query uses that index because they're prefix-matching it. In general, the third query could use the index easily, skipping the second field. You can check it using explain().
It's not obvious how Mongo will pick the index, but knowing your data structure, you can suggest one, using hint(index) ({customerId: 1} in your case). To check which one is being used for this query by default, use explain().
Related
I have a field "productLowerCase" in my mongo documents. I created 2 indices
1. simple
{"productLowerCase" : 1}
2. compound
{"productLowerCase" : 1, "timestamp.milliseconds" : -1}
So If I run a query which has only productLowerCase, say:
db.coll.find({"productLowerCase" : {$regex : /^cap/}})
Which index will get used?
In this case mongo will use {"productLowerCase" : 1} this index, but you can remove this index, because if you have compound index you can search with first field without performance loss.
Beside this you can use explain() to explain your query.
I have some problems with very slow distinct commands that use a query.
From what I have observed the distinct command only makes use of an index if you do not specify a query:
I have created a test database on my MongoDB 3.0.10 server with 1Mio objects. Each object looks as follows:
{
"_id" : ObjectId("56e7fb5303858265f53c0ea1"),
"field1" : "field1_6",
"field2" : "field2_10",
"field3" : "field3_29",
"field4" : "field4_64"
}
The numbers at the end of the field values are random 0-99.
On the collections two simple indexes and one compound-index has been created:
{ "field1" : 1 } # simple index on "field1"
{ "field2" : 1 } # simple index on "field2"
{ # compound index on all fields
"field2" : 1,
"field1" : 1,
"field3" : 1,
"field4" : 1
}
Now I execute distinct queries on that database:
db.runCommand({ distinct: 'dbtest',key:'field1'})
The result contains 100 values, nscanned=100 and it has used index on "field1".
Now the same distinct query is limited by a query:
db.runCommand({ distinct: 'dbtest',key:'field1',query:{field2:"field2_10"}})
It contains again 100 values, however nscanned=9991 and the used index is the third one on all fields.
Now the third index that was used in the last query is dropped. Again the last query is executed:
db.runCommand({ distinct: 'dbtest',key:'field1',query:{field2:"field2_10"}})
It contains again 100 values, nscanned=9991 and the used index is the "field2" one.
Conclusion: If I execute a distinct command without query the result is taken directly from an index. However when I combine a distinct command with a query only the query uses an index, the distinct command itself does not use an index in such a case.
My problem is that I need to perform a distinct command with query on a very large database. The result set is very large but only contains ~100 distinct values. Therefore the complete distinct command takes ages (> 5 minutes) as it has to cycle through all values.
What needs to be done to perform my distinct command presented above that can be answered by the database directly from an index?
The index is automatically used for distinct queries if your Mongo database version supports it.
The possibility to use an index in a distinct query requires Mongo version 3.4 or higher - it works for both storage engines MMAPv1/WiredTiger.
See also the bug ticket https://jira.mongodb.org/browse/SERVER-19507
Imagine a collection with about 5,000,000 documents. I need to do a basicCursor query to select ~100 documents based on too many fields to index. Let's call this the basicCursorMatch. This will be immensely slow.
I can however to a bTreeCursor query on a few indexes that will limit my search to ~500 documents. Let's call this query the bTreeCursorMatch.
Is there a way I can do this basicCursorMatch directly on the cursor or collection resulting from the bTreeCursorMatch?
Intuitively I tried
var cursor = collection.find(bTreeCursorMatch);
var results = cursor.find(basicCursorMatch);
similar to collection.find(bTreeCursorMatch).find(basicCursorMatch), which doesn't seem to work.
Alternatively, I was hoping I could do something like this:
collection.aggregate([
{$match: bTreeCursorMatch}, // Uses index 5,000,000 -> 500 fast
{$match: basicCursorMatch}, // No index, 500 -> 100 'slow'
{$sort}
]);
.. but it seems that I cannot do this either. Is there an alternative to do what I want?
The reason I am asking is because this second query will differ a lot and there is no way I can index all the fields. But I do want to make that first query using a bTreeCursor, otherwise querying the whole collection will take forever using a basicCursor.
update
Also, through user input the subselection of 500 documents will be queried in different ways during a session with an unpredictable basicCursor query, using multiple $in $eq $gt $lt. But during this, the bTreeCursor subselection remains the same. Should I just keep doing both queries for every user query, or is there a more efficient way to keep a reference to this collection?
In practice, you rarely need to run second queries on a cursor. You specially don't need to break MongoDB's work into separate indexable / non-indexable chunks.
If you pass a query to MongoDB's find method that can be partially fulfilled by a look-up in an index, MongoDB will do that look-up first, and then do a full scan on the remaining documents.
For instance, I have a collection users with documents like:
{ _id : 4, gender : "M", ... }
There is an index on _id, but not on gender. There are ~200M documents in users.
To get an idea of what MongoDB is doing under the hood, add explain() to your cursor (in the Mongo shell):
> db.users.find( { _id : { $gte : 1, $lt : 10 } } ).explain()
{
"cursor" : "BtreeCursor oldId_1_state_1",
"n" : 9,
"nscannedObjects" : 9
}
I have cut out some of the fields returned by explain. Basically, cursor tells you if it's using an index, n tells you the number of documents returned by the query and nscannedObjects is the number of objects scanned during the query. In this case, mongodb was able to scan exactly the right number of objects.
What happens if we now query on gender as well?
> db.users.find( { _id : { $gte : 1, $lt : 10 }, gender : "F" } ).explain()
{
"cursor" : "BtreeCursor oldId_1_state_1",
"n" : 5,
"nscannedObjects" : 9
}
find returns 5 objects, but had to scan 9 documents. It was therefore able to isolate the correct 9 documents using the _id field. It then went through all 9 documents and filtered them by gender.
I have a production mongo database of over 1B documents in a single collection sharded on _id across multiple servers. I'm trying to replicate recently updated records from this collection into Red Shift.
Shard keys:
db.sample_collection.ensureIndex({_id: "hashed"})
sh.shardCollection("sample_collection.sample_object", {_id: "hashed"})
Example 'sample_object' Document
{
"_id" : ObjectId("527a6c9226d6b7770ab05345"),
"p": ISODate("2013-10-27T14:30:18.000Z"),
"a" : {
"ln" : "Doe",
"id" : NumberLong(3),
"fn" : "John",
},
"co" : {
"ct" : 2,
"it" : [
{'t': 'loreum', 'u' : NumberLong(300), 'd': ISODate("2013-10-28T14:30:18.000Z")},
{'t': 'loreum', 'u' : NumberLong(400), 'd': ISODate("2013-10-29T14:30:18.000Z")},
..]
},
"li" : {
"ct" : 2,
"it" : [
{'u' : NumberLong(500), 'd': ISODate("2013-10-30T14:30:18.000Z")},
{'u' : NumberLong(501), 'd': ISODate("2013-10-29T14:30:18.000Z")},
..]
},
}
Option #1:
I'm in the process of analyzing this data and I need to query for documents that were "updated" between a period.
i.e., I want to return all the objects that have been p (published) or an li.it (item) or co.it (item) added between '2014-07-01' and '2014-07-03'.
What would be the most performant way of doing this?
Option #2:
Another option that I'm evaluating is whether I want to add an 'u' property with an updated date to account for when the document was updated
(ie., li or co item added)
If I make the change to the process to ensure new documents have this property, how would I iterate through existing documents and add this retroactively?
Would filtering on 'u' be more performant that Option 1? I'm looking at this option as using COPY FROM JSON from a mongoexport
Option #1 (multiple dates)
There isn't a good option to index this, as it looks like you would ideally want a compound index that includes p (date) plus two date arrays (lt.it and co.it). A compound index can only include at most one array field. Even if you could do this, the index would be very large given the suggested number of dates and the query would involve checking multiple fields to infer the last updated date.
Option #2 (single updated date)
Adding an indexed u (latest updated date) is definitely a better approach to allow a simple and performant query.
If I make the change to the process to ensure new documents have this property, how would I iterate through existing documents and add this retroactively?
You can use the $exists operator to find documents that do not have this field set yet.
Caveat on hashed shard key
To elaborate on Neil's comment: a hashed shard key gives you good write distribution at the expense of being able to do range queries (all queries become scatter-gather). If your common queries are range-based on date (and you are concerned about performance) then you could possibly chose a more appropriate shard key to support those queries. However, since shard keys are immutable and you want to query on an "updated" date, it doesn't sound like a change of shard key will help your use case.
I have a (hopefully quick) question about MongoDB queries on compound indexes.
Say I have a data set (for example, comments) which I want to sort descending by score, and then date:
{ "score" : 10, "date" : ISODate("2014-02-24T00:00:00.000Z"), ...}
{ "score" : 10, "date" : ISODate("2014-02-18T00:00:00.000Z"), ...}
{ "score" : 10, "date" : ISODate("2014-02-12T00:00:00.000Z"), ...}
{ "score" : 9, "date" : ISODate("2014-02-22T00:00:00.000Z"), ...}
{ "score" : 9, "date" : ISODate("2014-02-16T00:00:00.000Z"), ...}
...
My understanding thus far is that I can make a compound index to support this query, which looks like {"score":-1,"date":-1}. (For clarity's sake, I am not using a date in the index, but an ObjectID for unique, roughly time-based order)
Now, say I want to support paging through the comments. The first page is easy enough, I can just stick a .limit(n) option on the end of the cursor. What I'm struggling with is continuing the search.
I have been referring to MongoDB: The Definitive Guide by Kristina Chodorow. In this book, Kristina mentions that using skip() on large datasets is not very performant, and recommends using range queries on parameters from the last seen result (eg. the last seen date).
What I would like to do is perform a range query that acts on two fields, but treats the second field as secondary to the first (just like the index is sorted.) Since my compound index is already sorted in exactly the order I want, it seems like there should be some way to jump into the search by pointing at a specific element in the index and traversing it in the sort order. However, from my (admittedly rudimentary) understanding of queries in MongoDB this doesn't seem possible.
As far as I can see, I have three options:
Using skip() anyway
Using either an $or query or two distinct queries: {$or : [{"score" : lastScore, "date" : { $lt : lastDate}}, {'score' : {$lt : lastScore}]}
Using the $max special query option
Number 3 seems like the closest to ideal for me, but the reference text notes that 'you should generally use "$lt" instead of "$max"'.
To summarize, I have a few questions:
Is there some way to perform the operation I described, that I may have missed? (Jumping into an index and traversing it in the sort order)
If not, of the three options I described (or any I have overlooked), which would (very generally speaking) give the most consistent performance under the compound index?
Why is $lt preferred over $max in most cases?
Thanks in advance for your help!
Another option is to store score and date in a sub-document and then index the sub-document. For example:
{
"a" : { "score" : 9,
"date" : ISODate("2014-02-22T00:00:00Z") },
...
}
db.foo.ensureIndex( { a : 1 } )
db.foo.find( { a : { $lt : { score : lastScore,
date: lastDate } } } ).sort( { a : -1 } )
With this approach you need to ensure that the fields in the BSON sub-document are always stored in the same order, otherwise the query won't match what you expect since index key comparison is binary comparison of the entire BSON sub-document.
I would go with using $max to specify the upper bound, in conjunction with $hint to make sure that the database uses the index you want. The reason that $lt is in general preferred over $max is because $max selects the index using the specified index bounds. This means:
the index chosen may not necessarily be the best choice.
if multiple indexes exist on same fields with different sort orders, the selection of the index may be ambiguous.
The above points are covered in further detail here.
One last point: max is equivalent to $lte, not $lt, so using this approach for pagination you'll need to skip over the first returned document to avoid outputting the same document twice.