Please let me know why the aggregation query in my example (Query 1) performs so poorly. My audio_details collection holds around 3M+ records, and sample records look like this:
{
_id : xxx,
status : 'BUY', /* This is indexed field */
active : 't', /* This is indexed field */
created_date2 : "2022-09-23T09:00:00.000Z", /* This is indexed field */
audio_details : [
{id : 123 /* This is indexed field */ , created_date : "2022-XXX", /* Other fields goes here */},
{id : 124 /* This is indexed field */ , created_date : "2022-XXX", /* Other fields goes here */},
...
],
/* Other 60 fields goes here */
}
Query 1: This is very slow (300 s)
db.audio_details.aggregate([{$match : { status : 'BUY',active: 't','audio_history.id' : {$in: [123]}}}, {$sort : {created_date2 : -1}}]);
Query 2: This is very fast (0.5 s)
db.audio_details.find({ status : 'BUY',active: 't','audio_history.id' : {$in: [123]}
}).sort({created_date2 : -1})
Please explain why Query 1 is slow.
Regards
Kris
This appears to be a duplicate of "why would identical mongo query take much longer via aggregation than via find?". We can therefore make the following observations:
The issue linked from that answer seems to be fixed now, so upgrading to version 4.4+ may resolve the problem.
The sample operation that you've shown can be handled using just find() (with sort()). But in the comments you mention that you "want to use $sort in our application". Is there some specific requirement to use the aggregation framework for these particular operations? You seem to have demonstrated that there is no issue when using the equivalent .find().
In either case, you mention "indexed fields" in your question but don't describe what the index definitions are. If these are single-field indexes, you may want to think about how to restructure them as compound indexes.
Keep in mind that databases, MongoDB included, are usually most effective at using a single index per data source (a collection in this situation) per operation. The only compelling reasons to have a single-field index on {created_date2: 1} would be if it is a TTL index or if you issue queries where created_date2 is the only or the most selective predicate. If neither condition applies in your situation, consider dropping such an index and incorporating that field into a compound index, per the previous point.
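As a rough illustration of the compound-index suggestion, here is a minimal sketch that builds an index key list for the query in the question. It assumes the common guideline of placing equality-matched fields before the sort field; the helper name `compound_index_keys` is hypothetical, and the field names are taken from the question's query.

```python
# Same numeric values as pymongo.ASCENDING / pymongo.DESCENDING.
ASCENDING, DESCENDING = 1, -1


def compound_index_keys(equality_fields, sort_spec):
    """Return index keys with equality fields first, then the sort fields.

    Hypothetical helper: it only assembles the key list, it does not
    talk to a database.
    """
    keys = [(field, ASCENDING) for field in equality_fields]
    keys += [(field, direction) for field, direction in sort_spec]
    return keys


keys = compound_index_keys(
    ["status", "active", "audio_history.id"],  # fields matched in the query
    [("created_date2", DESCENDING)],           # field used by the sort
)
print(keys)
```

The resulting list is in the `(field, direction)` form that pymongo's `create_index` accepts. Whether this particular ordering is the best one for the real data is something to confirm with explain rather than assume.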
I have a MongoDB collection with 8k+ documents, around 40GB. Inside it, the data follows this format:
{
_id: ...,
_session: {
_id: ...
},
data: {...}
}
I need to get all the _session._id for my application. The following approach (python) takes too long to get them:
cursor = collection.find({}, projection={'_session._id': 1})
I have created an Index in MongoDB Compass, but I'm not sure if my query is making use of it at all.
Is there a way to speed this query such that I get all the _session._id very fast?
In the mongo shell you can use hint() to tell the query optimizer to use the available index, as follows:
db.collection.find({},{_id:0,"_session._id":1}).hint({"_session._id":1})
The following test is confirmed to work from Python:
import pymongo
# Connect and select the "test" database
db=pymongo.MongoClient("mongodb://user:pass@localhost:12345")
mydb=db["test"]
# Force the planner to use the {"x.y": 1} index
docs= mydb.test2.find( {} ).hint([ ("x.y", pymongo.ASCENDING) ])
for i in docs:
    print(i)
db.test2.createIndex({"x.y":1})
{
"v" : 2,
"key" : {
"x.y" : 1
},
"name" : "x.y_1"
}
python 3.7 ,
pymongo 3.11.2 ,
mongod 5.0.5
In your case it seems to be a text index. By the way, it seems a bit strange that the session field has a text index. For a text index, something like this should work:
db.test2.find({}).hint("x.y_text").explain()
And here is a working example with a text index:
import pymongo
db=pymongo.MongoClient("mongodb://user:pass@localhost:12345")
print('Get first 10 docs from test.test:')
mydb=db["test"]
# Hint the text index by its name
docs= mydb.test2.find( {"x.y":"3"} ).hint( "x.y_text" )
print("===start:====")
for i in docs:
    print(i)
db.test2.createIndex({"x.y":"text"}):
{
"v" : 2,
"key" : {
"_fts" : "text",
"_ftsx" : 1
},
"name" : "x.y_text",
"weights" : {
"x.y" : 1
},
"default_language" : "english",
"language_override" : "language",
"textIndexVersion" : 3
}
There are a few points of confusion in this question and the ensuing discussion which generally come down to:
What indexes are present in the environment (and why the attempts to hint it failed)
When using indexing is most appropriate
Current Indexes
I think there are at least 5 indexes that were mentioned so far:
A standard index of {"_session._id":1} mentioned originally in @R2D2's answer.
A text index on the _session._id field (mentioned in this comment)
A text index on the _ts_meta.session field (mentioned in this comment)
A standard index of {"x.y":1} mentioned second in @R2D2's answer.
A text index of {"x.y":"text"} mentioned at the end of @R2D2's answer.
Only the first of these is likely to be relevant to the original question. Note that a text index is a specialized index meant for performing more advanced text searching. Such indexes are not required for simple string matching or value retrieval. Standard indexes such as { '_session._id': 1 } also store string values and are the relevant kind here.
What Indexing is For
Indexes are typically useful for retrieving a small subset of results from the database. The larger that set of results becomes relative to the overall size of the collection, the less helpful using an index will become. In your situation you are looking to retrieve data from all of the documents in the collection which is why the database doesn't consider using any index at all.
Now it is still possible that an index could help in this situation. That would be the case if we used it to perform a covered query, which means that the data can be retrieved from the index alone without looking at the documents themselves. In this case the database would have to scan the full index, so it is not clear whether it would be faster. But you could certainly try. To do so you would need to follow @R2D2's instructions, specifically by creating the index and then hinting it in the query (while also projecting out the _id field):
db.collection.createIndex({"_session._id":1})
db.collection.find({},{_id:0,"_session._id":1}).hint({"_session._id":1})
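To make the "covered query" condition concrete, here is a simplified sketch of the check described above: every projected field and every filtered field must be in the index, and _id must be projected out unless it is part of the index. The function name `is_covered` is hypothetical, and the check deliberately ignores further restrictions (for example on array fields) that the real server applies.

```python
def is_covered(index_fields, filter_fields, projection):
    """Rough check: can the query be answered from the index alone?"""
    indexed = set(index_fields)
    # _id is returned by default unless the projection sets it to 0.
    returns_id = projection.get("_id", 1) != 0
    projected = {
        field
        for field, include in projection.items()
        if include and field != "_id"
    }
    if returns_id:
        projected.add("_id")
    return (
        bool(projected)
        and projected <= indexed
        and set(filter_fields) <= indexed
    )


index = ["_session._id"]
# Projection from the answer: _id excluded, only the indexed field returned.
print(is_covered(index, [], {"_id": 0, "_session._id": 1}))  # True
# Projection from the question: _id is still returned, so not covered.
print(is_covered(index, [], {"_session._id": 1}))  # False
```

This is why the projection `{_id:0, "_session._id":1}` matters: with the question's original projection the server still has to fetch each document to return _id.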
Additional Questions
There were two other things mentioned in the question that are important to address.
I have created an Index in MongoDB Compass, but I'm not sure if my query is making use of it at all.
We talked about why this was the case above. To find out whether the database is using the index, you can navigate to the Explain tab in Compass. The explain plan visualization should indicate whether the index was used. Remember that you will need to hint the index for your query.
Is there a way to speed this query such that I get all the _session._id very fast?
What is your definition of "very fast" here?
The general answer is that your operation requires scanning either all documents in the collection or a full index. There is no way to do this more efficiently based on the current schema. Therefore how fast it happens is largely going to come down to the hardware that the database is running on and it will slow down as the collection grows.
If this operation is something that you will be running frequently or have strict performance requirements around, then it may be important to think through your intended goals to see if there are other ways of achieving them. What will you or the application be doing with this list of session IDs?
Say you're querying documents based on 2 data points. One is a simple bool parameter, and the other is a complicated $geoWithin calculation.
db.collection.find( {"geoField": { "$geoWithin" : ...}, "boolField" : true} )
Will mongo reorder these parameters, so that it checks the boolField 1st, before running the complicated check?
MongoDB uses indexes like any other database. So what matters to MongoDB is whether any of the queried fields has an index, not the order of the fields in the query. At least, there is nothing in the documentation saying that MongoDB tries to check primitive query fields first. So in your example, if boolField has an index, MongoDB first checks this field and eliminates the documents whose boolField is false. But if geoField has an index, then MongoDB first executes the query on that field.
So what happens if neither of them has an index, or both of them do? It should be the order in which the fields are given in the query, because the query optimization page of the MongoDB documentation offers no guidance beyond indexes. Additionally, you can always test the performance of your queries by adding .explain("executionStats").
So check the performance of db.collection.find( {"geoField": { "$geoWithin" : ...}, "boolField" : true} ) and db.collection.find( { "boolField" : true, "geoField": { "$geoWithin" : ...} } ). And let us know :)
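One way to compare the two orderings is to put the key numbers from both explain outputs side by side. The sketch below assumes each input is the document produced by .explain("executionStats"), already loaded as a Python dict; the helper names are hypothetical, while nReturned, totalKeysExamined, totalDocsExamined and executionTimeMillis are the field names used in the executionStats section.

```python
def summarize_execution_stats(explain_doc):
    """Pull the headline counters out of an executionStats explain document."""
    stats = explain_doc["executionStats"]
    return {
        "returned": stats["nReturned"],
        "keys_examined": stats["totalKeysExamined"],
        "docs_examined": stats["totalDocsExamined"],
        "millis": stats["executionTimeMillis"],
    }


def compare_explains(explain_a, explain_b):
    """Return {counter: (value_for_a, value_for_b)} for two explain documents."""
    a = summarize_execution_stats(explain_a)
    b = summarize_execution_stats(explain_b)
    return {name: (a[name], b[name]) for name in a}
```

If both orderings report the same keys and documents examined, the field order in the query made no difference to the plan that was chosen.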
To add to the above response: if you want MongoDB to use a specific index, you can use cursor.hint. https://docs.mongodb.com/manual/core/query-plans/ explains how the default index selection is done.
I'm sure this is an easy one, but I just wanted to make sure. Is find() with some search and projection criterion same as applying a sort({$natural:1}) on it?
Also, what is the default natural sort order? How is it different from a sort({_id:1}), say?
db.collection.find() returns the same result as db.collection.find().sort({$natural:1}).
{"$natural" : 1} forces the find query to do a table scan (the default sort); when used in a sort, it specifies the on-disk order.
When you update a document, MongoDB may move it to another place on disk.
For example, insert the documents below:
{
_id : 0,
},
{
_id : 1,
}
then update:
db.collection.update({ _id : 0} , { $set : { blob : BIG DATA}})
And when you perform the find query you will get
{
"_id" : 1
},
{
"_id" : 0,
"blob" : BIG DATA
}
As you can see, the order of the documents has changed, so the default order is not by _id.
If you don't specify a sort, MongoDB's find() returns documents in the order in which they are stored on disk. On-disk order may coincide with insertion order, but that is not always true. It is also worth noting that the location of a document on disk may change. For instance, on an update MongoDB may move a document from one place to another if needed.
If the query uses an index, the default order is the order in which the index entries are found.
$natural is the order in which documents are found on disk.
It is recommended that you specify the sort explicitly to be sure of the ordering.
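The behaviour described in these answers can be pictured with a toy model in which a Python list stands in for on-disk storage and a document that grows is rewritten at the end. This is only an illustration of the idea, not how any MongoDB storage engine is implemented; `grow_document` is a hypothetical name.

```python
def grow_document(storage, doc_id, new_fields):
    """Toy model: a document that outgrows its slot is rewritten at the end."""
    for position, doc in enumerate(storage):
        if doc["_id"] == doc_id:
            moved = {**storage.pop(position), **new_fields}
            storage.append(moved)
            return
    raise KeyError(doc_id)


storage = [{"_id": 0}, {"_id": 1}]
grow_document(storage, 0, {"blob": "BIG DATA"})

# "Natural" order is whatever order the storage happens to have.
natural_order = [doc["_id"] for doc in storage]
# An explicit sort does not depend on where documents ended up.
sorted_order = [doc["_id"] for doc in sorted(storage, key=lambda d: d["_id"])]

print(natural_order)  # [1, 0]
print(sorted_order)   # [0, 1]
```

The explicit sort is the only one of the two results that stays stable when documents are moved.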
I am writing a method that updates a single document in a very large MongoCollection,
and I have an index that I want the MongoCollection.Update() call to use to drastically reduce lookup time, but I can't seem to find anything like MongoCursor.SetHint(string indexName).
Is using an index on an update operation possible? If so, how?
You can create an index that matches the query section of your update command.
For example if you have this collection, named data:
> db.data.find()
{ "_id" : ObjectId("5334908bd7f87918dae92eaf"), "name" : "omid" }
{ "_id" : ObjectId("5334943fd7f87918dae92eb0"), "name" : "ali" }
{ "_id" : ObjectId("53349478d7f87918dae92eb1"), "name" : "reza" }
and if you do this update query:
> db.data.update(query={name:'ali'}, update={name: 'Ali'})
without any index defined, the number of scanned documents is 2:
"nscanned" : 2,
But if you define an index, according to your query, here for name field:
db.data.ensureIndex({name:1})
Now if you update it again:
> db.data.update(query={name:'Ali'}, update={name: 'ALI'})
MongoDB uses your index for the update, and the number of scanned documents is 1:
"nscanned" : 1,
But if you want to hint for an update, you can hint the query:
// Assume that the index and its field exist.
> var cursor = db.data.find({name:'ALI'}).hint({family:1})
Then use it in your update query:
> db.data.update(query=cursor, update={name: 'ALI'})
If you have already indexed your collection, update will use the CORRECT index right away. There is no point in providing a hint (in fact, you can't hint with update).
Hint is only for debugging and testing purposes. In most cases Mongo is smart enough to decide automatically which index (if you have many of them) should be used for a particular query, and it reviews its strategy from time to time.
So short answer - do nothing. If you have an index and it is useful, it will be automatically used on find, update, delete, findOne.
If you want to see if it is used - take the part of the query which searches for something and run it through find with explain.
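Reading the explain output by eye works, but the check can also be scripted. The sketch below walks the winning plan of an explain document (loaded as a Python dict) and collects the names of the indexes it scans. It assumes the classic layout where stages nest through inputStage or inputStages and index scans are IXSCAN stages carrying an indexName; other server versions may shape the plan differently. Both function names are hypothetical.

```python
def indexes_in_plan(stage):
    """Collect indexName values from IXSCAN stages in a plan tree."""
    names = []
    if stage.get("stage") == "IXSCAN" and "indexName" in stage:
        names.append(stage["indexName"])
    # A stage may have one child (inputStage) or several (inputStages).
    children = list(stage.get("inputStages", []))
    if "inputStage" in stage:
        children.append(stage["inputStage"])
    for child in children:
        names.extend(indexes_in_plan(child))
    return names


def indexes_used(explain_doc):
    """Return the index names scanned by the winning plan, if any."""
    return indexes_in_plan(explain_doc["queryPlanner"]["winningPlan"])
```

An empty list means the winning plan contained no index scan in the stages this sketch knows how to walk, which for a simple find usually points to a collection scan.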
Example for hellboy. This is just an example and in real life it can be more complex.
So you have a collection with docs like this: {a : int, b : timestamp}. You have 2 indexes: one on a, another on b. Right now you need to run a query like a > 5 and b is after 2014. For some reason it uses index a, which does not give you the fastest time (maybe because you have 1000 elements, most of them are bigger than 5, and only 10 are after 2014). So you decide to hint it to use index b. Cool, it works much faster now. But your collection changes; now it is 2020 and most of your documents have b after 2014. So index b is no longer doing much work. But mongo still uses it, because you told it to.
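The point of this example is that the more selective predicate changes as the data changes. A small sketch with made-up in-memory documents shows the flip; `most_selective` is a hypothetical helper, b is simplified to a plain year instead of a timestamp, and the sample documents are invented purely for illustration.

```python
def most_selective(docs, predicates):
    """Return (field, counts) where field's predicate matches the fewest docs."""
    counts = {
        field: sum(1 for doc in docs if test(doc[field]))
        for field, test in predicates.items()
    }
    return min(counts, key=counts.get), counts


predicates = {"a": lambda value: value > 5, "b": lambda value: value > 2014}

# Early data: every document has a > 5, only one has b after 2014.
old_docs = [{"a": 10, "b": year} for year in (2010, 2011, 2012, 2015)]
# Later data: every document has b after 2014, only one has a > 5.
new_docs = [{"a": a, "b": 2020} for a in (1, 2, 3, 10)]

print(most_selective(old_docs, predicates))  # ('b', {'a': 4, 'b': 1})
print(most_selective(new_docs, predicates))  # ('a', {'a': 1, 'b': 4})
```

A hard-coded hint on b keeps forcing the first choice even after the data looks like the second case.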
This question concerns how indexes are managed internally when searching BSON documents.
When you create multiple indexes like "index1", "index2", "index3"..., the indexes are stored to be used during queries, but what about the order of the query conditions and the resulting performance?
sample
index1,index2,index3----> query in the same order index1,index2,index3 (best case)
index1,index2,index3----> query in another order index2,index1,index3 (the order altered)
Often you use nested queries that include these 3 indexes along with other items or more indexes. Does the order of the query conditions cost time? Must the conditions be passed in the order in which the indexes were defined, or does the internal architecture take care of the search order? I want to know whether I need to take care of this or can write my queries freely.
Thanks.
The order of the conditions in your query does not affect whether it can use an index or not.
e.g.
typical document structure:
{
"FieldA" : "A",
"FieldB" : "B"
}
If you have a compound index on FieldA and FieldB:
db.MyCollection.ensureIndex({FieldA : 1, FieldB : 1})
Then both of the following queries will be able to use that index:
db.MyCollection.find({FieldA : "A", FieldB : "B"})
db.MyCollection.find({FieldB : "B", FieldA : "A"})
So the ordering of the conditions in the query does not prevent the index from being used, which I think is the question you are asking.
You can easily test this out by trying the 2 queries in the shell and adding .explain() after the find. I just did this to confirm, and they both showed that the compound index was used.
However, if you run the following query, it will NOT use the index, as FieldA is not being queried on:
db.MyCollection.find({FieldB : "B"})
So it's the ordering of the fields in the index that defines whether it can be used by a query and not the ordering of the fields in the query itself (this was what Lucas was referring to).
From http://www.mongodb.org/display/DOCS/Indexes:
If you have a compound index on multiple fields, you can use it to query on the beginning subset of fields. So if you have an index on
a,b,c
you can use it query on
a
a,b
a,b,c
So yes, order matters. You should clarify your question a bit if you need a more precise answer.
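The prefix rule from the answers above can be written down as a small check. This is a simplified sketch: it only looks at which fields a query constrains and ignores everything else the real planner considers. The name `usable_prefix` is hypothetical.

```python
def usable_prefix(index_fields, query_fields):
    """Return the leading index fields that the query constrains.

    The order of query_fields is irrelevant; only the order of
    index_fields matters.
    """
    queried = set(query_fields)
    prefix = []
    for field in index_fields:
        if field not in queried:
            break
        prefix.append(field)
    return prefix


# Order of the conditions in the query does not matter.
print(usable_prefix(["FieldA", "FieldB"], ["FieldB", "FieldA"]))  # ['FieldA', 'FieldB']
# Without the leading field there is no usable prefix.
print(usable_prefix(["FieldA", "FieldB"], ["FieldB"]))  # []
# With an index on a,b,c a query on a and c can only use the a part.
print(usable_prefix(["a", "b", "c"], ["a", "c"]))  # ['a']
```

An empty result corresponds to the find({FieldB : "B"}) case above, where the compound index cannot be used.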