I'm using MongoDB 4.4.3 to query a random record from a collection:
db.MyCollection.aggregate([{ $sample: { size: 1 } }])
This query takes 20s (whereas a find query takes 0.2s).
The MongoDB documentation states:
If all the following conditions are met, $sample uses a pseudo-random cursor to select documents:
$sample is the first stage of the pipeline
N is less than 5% of the total documents in the collection
The collection contains more than 100 documents
In my case:
$sample is the only stage of the pipeline
N = 1
MyCollection contains 46 million documents
This problem is similar to MongoDB Aggregation with $sample very slow, which does not provide an answer for MongoDB 4.4.3.
So why is this query so slow?
Details
Query Planner
db.MyCollection.aggregate([{$sample: {size: 1}}]).explain()
{
"stages" : [
{
"$cursor" : {
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "DATABASE.MyCollection",
"indexFilterSet" : false,
"winningPlan" : {
"stage" : "MULTI_ITERATOR"
},
"rejectedPlans" : [ ]
}
}
},
{
"$sampleFromRandomCursor" : {
"size" : 1
}
}
],
"serverInfo" : {
"host" : "mongodb4-3",
"port" : 27017,
"version" : "4.4.3",
"gitVersion" : "913d6b62acfbb344dde1b116f4161360acd8fd13"
},
"ok" : 1,
"$clusterTime" : {
"clusterTime" : Timestamp(1611128334, 1),
"signature" : {
"hash" : BinData(0,"ZDxiOTnmG/zLKNtDIAWhxmjHHLM="),
"keyId" : 6915708270745223171
}
},
"operationTime" : Timestamp(1611128334, 1)
}
Execution stats
Related
I have a collection with millions of documents; each document represents an event: {_id, product, timestamp}
In my query, I need to group by product and take, for example, the top 10.
"aggregate" : "product_events",
"pipeline" : [
{
"$match" : {
"timeEvent" : {
"$gt" : ISODate("2017-07-17T00:00:00Z")
}
}
},
{
"$group" : {
"_id" : "$product",
"count" : {
"$sum" : 1
}
}
},
{
"$sort" : {
"count" : -1
}
},
{
"$limit" : 10
}
]
My query is very slow now (10 seconds); I am wondering if there is a way to store the data differently to optimise this query.
db.product_events.explain("executionStats").aggregate([
    {"$match" : {"timeEvent" : {"$gt" : ISODate("2017-07-17T00:00:00Z")}}},
    {"$group" : {"_id" : "$product", "count" : {"$sum" : 1}}},
    {"$project": {"_id": 1, "count": 1}},
    {"$sort" : {"count" : -1}},
    {"$limit" : 500}
], {"allowDiskUse": true})
{
"stages" : [
{
"$cursor" : {
"query" : {
"timeEvent" : {
"$gt" : ISODate("2017-07-17T00:00:00Z")
}
},
"fields" : {
"product" : 1,
"_id" : 0
},
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "mydb.product_events",
"indexFilterSet" : false,
"parsedQuery" : {
"timeEvent" : {
"$gt" : ISODate("2017-07-17T00:00:00Z")
}
},
"winningPlan" : {
"stage" : "COLLSCAN",
"filter" : {
"timeEvent" : {
"$gt" : ISODate("2017-07-17T00:00:00Z")
}
},
"direction" : "forward"
},
"rejectedPlans" : [ ]
},
"executionStats" : {
"executionSuccess" : true,
"nReturned" : 2127315,
"executionTimeMillis" : 940,
"totalKeysExamined" : 0,
"totalDocsExamined" : 2127315,
"executionStages" : {
"stage" : "COLLSCAN",
"filter" : {
"timeEvent" : {
"$gt" : ISODate("2017-07-17T00:00:00Z")
}
},
"nReturned" : 2127315,
"executionTimeMillisEstimate" : 810,
"works" : 2127317,
"advanced" : 2127315,
"needTime" : 1,
"needYield" : 0,
"saveState" : 16620,
"restoreState" : 16620,
"isEOF" : 1,
"invalidates" : 0,
"direction" : "forward",
"docsExamined" : 2127315
}
}
}
},
{
"$group" : {
"_id" : "$product",
"count" : {
"$sum" : {
"$const" : 1
}
}
}
},
{
"$project" : {
"_id" : true,
"count" : true
}
},
{
"$sort" : {
"sortKey" : {
"count" : -1
},
"limit" : NumberLong(500)
}
}
],
"ok" : 1
}
Below are my indexes:
db.product_events.getIndexes()
[
{
"v" : 2,
"key" : {
"_id" : 1
},
"name" : "_id_",
"ns" : "mydb.product_events"
},
{
"v" : 2,
"key" : {
"product" : 1,
"timeEvent" : -1
},
"name" : "product_1_timeEvent_-1",
"ns" : "mydb.product_events"
}
]
Creating indexes on the fields of a collection helps optimise the process of retrieving data from the collection.
Indexes are generally created on the fields by which data is filtered according to specific criteria.
Data in indexed fields is kept in sorted order, so once the matching documents are found the remaining documents do not need to be scanned, which makes fetching data faster.
Based on the description in the question above, to optimise the performance of the aggregate query, try creating an index on the timeEvent field, since timeEvent is used as the filter expression in the $match stage of the aggregation pipeline.
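For example, a minimal sketch (the ascending key direction is an assumption, and building an index on a large collection should be planned carefully):
// single-field index to support the $match on timeEvent (direction assumed ascending)
db.product_events.createIndex({ timeEvent: 1 })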
The documentation on compound indexes states the following.
db.products.createIndex( { "item": 1, "stock": 1 } )
The order of the fields listed in a compound index is important. The
index will contain references to documents sorted first by the values
of the item field and, within each value of the item field, sorted by
values of the stock field.
In addition to supporting queries that match on all the index fields,
compound indexes can support queries that match on the prefix of the
index fields. That is, the index supports queries on the item field as
well as both item and stock fields.
Your product_1_timeEvent_-1 index looks like this:
{
"product" : 1,
"timeEvent" : -1
}
which is why it cannot be used to support a query that only filters on timeEvent.
Options you have to get that sorted:
Flip the order of the fields in your index
Remove the product field from your index
Create an additional index with only the timeEvent field in it.
(Include some additional filter on the product field so the existing index gets used)
And keep in mind that any creation/deletion/modification of an index may impact other queries, too. So make sure you test your changes properly.
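For example, a minimal sketch of the first and third options (the key directions here are assumptions; align them with your other queries):
// Option 1: flip the field order so timeEvent becomes the index prefix
db.product_events.createIndex({ timeEvent: -1, product: 1 })
// Option 3: an additional single-field index on timeEvent only
db.product_events.createIndex({ timeEvent: -1 })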
This is what I have tried so far with the aggregate query:
db.getCollection('storage').aggregate([
{
"$match": {
"user_id": 2
}
},
{
"$project": {
"formattedDate": {
"$dateToString": { "format": "%Y-%m", "date": "$created_on" }
},
"size": "$size"
}
},
{ "$group": {
"_id" : "$formattedDate",
"size" : { "$sum": "$size" }
} }
])
This is the result:
/* 1 */
{
"_id" : "2018-02",
"size" : NumberLong(10860595386)
}
/* 2 */
{
"_id" : "2017-12",
"size" : NumberLong(524288)
}
/* 3 */
{
"_id" : "2018-01",
"size" : NumberLong(21587971)
}
And this is the document structure:
{
"_id" : ObjectId("5a59efedd006b9036159e708"),
"user_id" : NumberLong(2),
"is_transferred" : false,
"is_active" : false,
"process_id" : NumberLong(0),
"ratio" : 0.000125759169459343,
"type_id" : 201,
"size" : NumberLong(1687911),
"is_processed" : false,
"created_on" : ISODate("2018-01-13T11:39:25.000Z"),
"processed_on" : ISODate("1970-01-01T00:00:00.000Z")
}
And last, the explain result:
/* 1 */
{
"stages" : [
{
"$cursor" : {
"query" : {
"user_id" : 2.0
},
"fields" : {
"created_on" : 1,
"size" : 1,
"_id" : 1
},
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "data.storage",
"indexFilterSet" : false,
"parsedQuery" : {
"user_id" : {
"$eq" : 2.0
}
},
"winningPlan" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"user_id" : 1
},
"indexName" : "user_id",
"isMultiKey" : false,
"multiKeyPaths" : {
"user_id" : []
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"user_id" : [
"[2.0, 2.0]"
]
}
}
},
"rejectedPlans" : []
}
}
},
{
"$project" : {
"_id" : true,
"formattedDate" : {
"$dateToString" : {
"format" : "%Y-%m",
"date" : "$created_on"
}
},
"size" : "$size"
}
},
{
"$group" : {
"_id" : "$formattedDate",
"size" : {
"$sum" : "$size"
}
}
}
],
"ok" : 1.0
}
The problem:
I can navigate and get all results almost instantly, in about 0.002s. However, when I specify user_id and sum the sizes grouped by month, the result takes between 0.300s and 0.560s. When I run several similar tasks in one request, it takes more than a second to finish.
What I tried so far:
I've added an index for user_id
I've added an index for created_on
I used more $match conditions. However, this made it even worse.
This collection currently has almost 200,000 documents, and approximately 150,000 of them belong to user_id = 2.
How can I minimize the response time for this query?
Note: MongoDB 3.4.10 used.
Pratha,
try adding a sort on the "created_on" and "size" fields as the first stage of the aggregation pipeline.
db.getCollection('storage').aggregate([
{
"$sort": {
"created_on": 1, "size": 1
}
}, ....
Before that, add a compound index:
db.getCollection('storage').createIndex({created_on:1,size:1})
Sorting the data before the $group stage improves the efficiency of accumulating the totals.
A note about the $sort aggregation stage:
The $sort stage has a limit of 100 megabytes of RAM. By default, if the stage exceeds this limit, $sort will produce an error. To allow for the handling of large datasets, set the allowDiskUse option to true to enable $sort operations to write to temporary files.
P.S.
To test performance, get rid of the $match stage on user_id, or also add user_id to the compound index.
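For example, a minimal sketch combining these suggestions (the index field order and the use of allowDiskUse are assumptions, not tested recommendations):
// hypothetical compound index that also covers the user_id filter
db.getCollection('storage').createIndex({ user_id: 1, created_on: 1, size: 1 })
// pipeline with the $sort placed before $group, allowing disk use for large sorts
db.getCollection('storage').aggregate([
    { "$match": { "user_id": 2 } },
    { "$sort": { "created_on": 1, "size": 1 } },
    { "$project": {
        "formattedDate": { "$dateToString": { "format": "%Y-%m", "date": "$created_on" } },
        "size": "$size"
    } },
    { "$group": { "_id": "$formattedDate", "size": { "$sum": "$size" } } }
], { "allowDiskUse": true })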
I am using a (small, 256 MB) MongoDB 3.2.9 service instance through Swisscom CloudFoundry. As long as our entire DB fits into the available RAM, we see somewhat acceptable query performance.
However, we are experiencing very long query times on aggregation operations when our DB does not fit into RAM. We have created indexes for the accessed fields, but as far as I can tell it doesn't help.
Example document entry:
_id: 5a31...
description: Object
location: "XYZ"
name: "ABC"
status: "A"
m_nr: null
k_nr: null
city: "QWE"
high_value: 17
right_value: 71
more_data: Object
number: 101
interval: 1
next_date: "2016-01-16T00:00:00Z"
last_date: null
status: null
classification: Object
priority_value: "?"
redundancy_value: "?"
active_value: "0"
Example Query:
db.getCollection('a').aggregate(
[{ $sort:
{"description.location": 1}
},
{ $group:
{_id: "$description.location"}
}],
{ explain: true }
)
This query takes 25 seconds on a DB that has only 20k entries and produces about 1k output documents.
The explain info for this query:
db.getCollection('a').aggregate([{ $group: {_id: "$description.location"} }], { explain: true }):
{
"waitedMS" : NumberLong(0),
"stages" : [
{
"$cursor" : {
"query" : {},
"fields" : {
"description.location" : 1,
"_id" : 0
},
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "Z.a",
"indexFilterSet" : false,
"parsedQuery" : {
"$and" : []
},
"winningPlan" : {
"stage" : "COLLSCAN",
"filter" : {
"$and" : []
},
"direction" : "forward"
},
"rejectedPlans" : []
}
}
},
{
"$group" : {
"_id" : "$description.location"
}
}
],
"ok" : 1.0
}
[UPDATE] Output of db.a.getIndexes():
/* 1 */
[
{
"v" : 1,
"key" : {
"_id" : 1
},
"name" : "_id_",
"ns" : "db.a"
},
{
"v" : 1,
"key" : {
"description.location" : 1.0
},
"name" : "description.location_1",
"ns" : "db.a"
}
]
Looks like it's doing a collection scan, have you tried adding an index on description.location?
db.a.createIndex({"description.location" : 1});
I am using Spring Data MongoDB and want to implement an aggregation query; for that I am using MongoTemplate with its aggregation method. When I trace the log, it shows the query as follows:
find: track.$cmd { "aggregate" : "stayRecord" , "pipeline" : [ { "$match" : { "vehicleId" : { "$all" : [ 10]}}} , { "$match" : { "stayTime" : { "$gte" : { "$date" : "2016-06-20T18:30:00.000Z"}}}} , { "$match" : { "stayTime" : { "$lt" : { "$date" : "2016-06-21T18:30:00.000Z"}}}} , { "$group" : { "_id" : "$stayTime" , "count" : { "$sum" : 1}}}}
I want to know execution plan for this query.
How can I find out if my indexes are used during that query?
Please note that in order to follow the steps below, you need a working aggregate query that the mongo shell can understand.
Follow the steps below:
1) Go to mongo shell
2) Execute the use command to switch to your database
use <database name>
3) Execute the query below. I hope the aggregate query mentioned in the thread is syntactically correct. Also, change the collection name accordingly in the syntax below.
db.yourCollectionName.explain().aggregate([
    { "$match" : { "vehicleId" : { "$all" : [ 10 ] } } },
    { "$match" : { "stayTime" : { "$gte" : ISODate("2016-06-20T18:30:00.000Z") } } },
    { "$match" : { "stayTime" : { "$lt" : ISODate("2016-06-21T18:30:00.000Z") } } },
    { "$group" : { "_id" : "$stayTime", "count" : { "$sum" : 1 } } }
]);
4) In the output, find the "winningPlan" element. In its input stage ("inputStage") attribute, the stage will show "IXSCAN" together with the index name if the query used an index. Otherwise, it will show "COLLSCAN", which means the query performed a collection scan (i.e. no index was used).
"winningPlan" : {
"stage" : "LIMIT",
"limitAmount" : 0,
"inputStage" : {
"stage" : "SKIP",
"skipAmount" : 0,
"inputStage" : {
"stage" : "FETCH",
"filter" : {
"user.followers_count" : {
"$gt" : 1000
}
},
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"created_at" : -1
},
"indexName" : "created_at_-1",
"isMultiKey" : false,
"direction" : "backward",
"indexBounds" : {
"created_at" : [
"[MinKey, MaxKey]"
]
}
}
}
}
}
I would like to know whether MongoDB re-orders data after inserts according to the previously configured indexes. For instance:
After inserting data in the sequence below:
db.test.insert({_id: 1})
db.test.insert({_id: 5})
db.test.insert({_id: 2})
And executing the following search:
db.test.find();
We can see the result:
{ "_id" : 1 }
{ "_id" : 5 }
{ "_id" : 3 }
As we know, the _id field has an index by default. The question here is: why, after executing the search, are the results not returned in the order presented below?
{ "_id" : 1 }
{ "_id" : 2 }
{ "_id" : 5 }
MongoDB indexes are separate data structures that contain a copy of the indexed portion of the collection's data. The index does not dictate the order in which documents are stored at insert time.
Indexes are only used for sorting when a sort order is specified in the search:
db.test.find().sort({ _id: 1 })
{ "_id" : 1 }
{ "_id" : 2 }
{ "_id" : 5 }
You can verify the index usage using explain:
db.test.explain().find().sort({_id:1})
{
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "so.test",
"indexFilterSet" : false,
"parsedQuery" : {
"$and" : [ ]
},
"winningPlan" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"_id" : 1
},
"indexName" : "_id_",
"isMultiKey" : false,
"direction" : "forward",
"indexBounds" : {
"_id" : [
"[MinKey, MaxKey]"
]
}
}
},
"rejectedPlans" : [ ]
},
"serverInfo" : {
"host" : "KSA303096TX2",
"port" : 27017,
"version" : "3.0.11",
"gitVersion" : "48f8b49dc30cc2485c6c1f3db31b723258fcbf39 modules: enterprise"
},
"ok" : 1
}