I have the following piece of Java Spring MongoDB code:
startTime = System.currentTimeMillis();
AggregationResults<MyClass> list = mongoTemplate.aggregate(Aggregation.newAggregation(operations),
"Post", MyClass.class);
System.out.println("Time taken for query execution -> "
+ (System.currentTimeMillis() - startTime));
When I test this code using JMeter, the first execution shows:
Time taken for query execution -> 3275 ('list' has 16 records)
On the second and subsequent requests it looks like this:
Time taken for query execution -> 355 ('list' has 16 records)
The time difference is huge. How can I improve the first call?
When I call Aggregation.newAggregation(operations).toString(), I get the following query output. Running this aggregation query in the mongo shell always takes around 0.350 seconds.
{
"aggregate": "__collection__",
"pipeline": [
{
"$match": {
"$and": [
{
"postType": "AUTOMATIC"
}
]
}
},
{
"$project": {
"orders.id": 1,
"postedTotals": 1
}
},
{
"$unwind": "$orders"
},
{
"$group": {
"_id": "$orders.userId",
"ae": {
"$addToSet": "$orders.userId"
}
}
},
{
"$sort": {
"ae": 1
}
}
]
}
Running .explain().aggregate(...) shows the following:
/* 1 */
{
"stages" : [
{
"$cursor" : {
"query" : {
"$and" : [
{
"postType" : "AUTOMATIC"
}
]
},
"fields" : {
"headerPostedTotals" : 1,
"orders.UserId" : 1,
"_id" : 1
},
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "post",
"indexFilterSet" : false,
"parsedQuery" : {
"postType" : {
"$eq" : "AUTOMATIC"
}
},
"winningPlan" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"postType" : 1,
"orders.UserId" : 1,
"orders.flightStartDateForQuery" : 1,
"orders.flightEndDateForQuery" : 1,
"postRunDate" : -1
},
"indexName" : "default_filter_index",
"isMultiKey" : true,
"multiKeyPaths" : {
"postType" : [],
"orders.UserId" : [
"orders"
],
"orders.flightStartDateForQuery" : [
"orders"
],
"orders.flightEndDateForQuery" : [
"orders"
],
"postRunDate" : []
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"postType" : [
"[\"AUTOMATIC\", \"AUTOMATIC\"]"
],
"orders.UserId" : [
"[MinKey, MaxKey]"
],
"orders.flightStartDateForQuery" : [
"[MinKey, MaxKey]"
],
"orders.flightEndDateForQuery" : [
"[MinKey, MaxKey]"
],
"postRunDate" : [
"[MaxKey, MinKey]"
]
}
}
},
"rejectedPlans" : []
}
}
},
{
"$project" : {
"_id" : true,
"headerPostedTotals" : true,
"orders" : {
"UserId" : true
}
}
},
{
"$unwind" : {
"path" : "$orders"
}
},
{
"$group" : {
"_id" : "$orders.UserId",
"aes" : {
"$addToSet" : "$orders.UserId"
}
}
},
{
"$sort" : {
"sortKey" : {
"aes" : 1
}
}
}
],
"ok" : 1.0
}
{"messageId": "123124", "writtenAt":"2017-04-26T15:16:36.200Z", "updatedAt":"2999-12-31T23:59:59.999Z"}
{"messageId": "123124", "writtenAt":"2017-04-26T15:21:30.230Z", "updatedAt":"2999-12-31T23:59:59.999Z"}
The structure of the collection is shown above. Aside from the Mongo _id, each document has an ID called 'messageId'; the collection can contain multiple entries with the same 'messageId' but different 'writtenAt' values. There is a compound index: messageId (desc), writtenAt (desc).
Now I want to group by messageId so that I only get the latest entry (max writtenAt value). I have the following query, but it takes very long; I haven't gotten a result yet (I stopped it after more than 10 minutes; the collection has over 1.3 million records):
db.messages.aggregate(
[{ "$match": { "updatedAt": { "$gte": { "$date": "2021-02-26T06:59:51.738Z" } } } },
{ "$sort": { "messageId": -1, "writtenAt": -1 } },
{ "$group": { "_id": "$messageId", "doc": { "$first": "$$ROOT" } } },
{ "$replaceRoot" : { "newRoot" : "$doc"}}
], {allowDiskUse: true});
If I add an explain with executionStats, I can see it's picking up the index:
[
{
"$cursor" : {
"query" : {
"updatedAt" : {
"$gte" : ISODate("2021-02-26T06:59:51.738+0000")
}
},
"sort" : {
"messageId" : -1.0,
"writtenAt" : -1.0
},
"queryPlanner" : {
"plannerVersion" : 1.0,
"namespace" : "db.messages",
"indexFilterSet" : false,
"parsedQuery" : {
"updatedAt" : {
"$gte" : ISODate("2021-02-26T06:59:51.738+0000")
}
},
"queryHash" : "3141BBC5",
"planCacheKey" : "6858F892",
"winningPlan" : {
"stage" : "FETCH",
"filter" : {
"updatedAt" : {
"$gte" : ISODate("2021-02-26T06:59:51.738+0000")
}
},
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"messageId" : -1.0,
"writtenAt" : -1.0
},
"indexName" : "idx_messageId",
"isMultiKey" : false,
"multiKeyPaths" : {
"messageId" : [
],
"writtenAt" : [
]
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2.0,
"direction" : "forward",
"indexBounds" : {
"messageId" : [
"[MaxKey, MinKey]"
],
"writtenAt" : [
"[MaxKey, MinKey]"
]
}
}
},
"rejectedPlans" : [
]
}
}
},
{
"$group" : {
"_id" : "$messageId",
"doc" : {
"$first" : "$$ROOT"
}
}
},
{
"$replaceRoot" : {
"newRoot" : "$doc"
}
}
]
Any idea how I can improve this? After retrieving the latest messages by messageId, I plan to do some pagination slicing.
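For reference, the result that the $sort + $group with $first pipeline is meant to produce can be sketched in plain JavaScript, outside MongoDB. This only illustrates the intended output on the two sample documents from the question; latestPerMessageId is a hypothetical helper name, and the sketch says nothing about server-side performance.

```javascript
// Plain-JavaScript sketch (no MongoDB involved) of "keep the newest
// document per messageId". latestPerMessageId is a hypothetical name.
function latestPerMessageId(docs) {
  const newest = new Map();
  for (const doc of docs) {
    const seen = newest.get(doc.messageId);
    // ISO-8601 UTC strings in the same format sort chronologically as strings.
    if (seen === undefined || doc.writtenAt > seen.writtenAt) {
      newest.set(doc.messageId, doc);
    }
  }
  return Array.from(newest.values());
}

// The two sample documents from the question.
const sampleMessages = [
  { messageId: "123124", writtenAt: "2017-04-26T15:16:36.200Z", updatedAt: "2999-12-31T23:59:59.999Z" },
  { messageId: "123124", writtenAt: "2017-04-26T15:21:30.230Z", updatedAt: "2999-12-31T23:59:59.999Z" },
];

console.log(latestPerMessageId(sampleMessages));
```

For the sample data this keeps only the entry written at 15:21:30.230Z.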
I am new to Mongo, and the query below performs really slowly on a collection of over 2 million records.
Query
db.testCollection.aggregate({
$match: {
active: {
$ne: false
}
}
}, {
$group: {
_id: {
productName: "$productName",
model: "$model",
version: "$version",
uid: "$uid"
},
total: {
$sum: 1
}
}
}, {
$project: {
total: 1,
model: "$_id.model",
version: "$_id.version",
uid: "$_id.uid",
productName: "$_id.productName"
}
}, {
$sort: {
model: 1
}
})
explain()
{
"stages" : [
{
"$cursor" : {
"query" : {
"active" : {
"$ne" : false
}
},
"fields" : {
"version" : 1,
"productName" : 1,
"model" : 1,
"uid" : 1,
"_id" : 0
},
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "fms2.device",
"indexFilterSet" : false,
"parsedQuery" : {
"$nor" : [
{
"active" : {
"$eq" : false
}
}
]
},
"winningPlan" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"active" : 1
},
"indexName" : "active",
"isMultiKey" : false,
"multiKeyPaths" : {
"active" : [ ]
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"active" : [
"[MinKey, false)",
"(false, MaxKey]"
]
}
}
},
"rejectedPlans" : [ ]
}
}
},
{
"$group" : {
"_id" : {
"productName" : "$productName",
"model" : "$model",
"version" : "$version",
"uid" : "$uid"
},
"total" : {
"$sum" : {
"$const" : 1
}
}
}
},
{
"$project" : {
"_id" : true,
"total" : true,
"model" : "$_id.model",
"version" : "$_id.version",
"uid" : "$_id.uid",
"productName" : "$_id.productName"
}
},
{
"$sort" : {
"sortKey" : {
"model" : 1
}
}
}
],
"ok" : 1
}
Is there a way to optimize this query further? I had a look at https://docs.mongodb.com/manual/core/aggregation-pipeline-optimization/ as well, but most of the suggestions there are not applicable to this query.
Not sure if it matters, but the result of this aggregation ends up with only 20-30 records.
I have a collection with one four-key compound index:
> db.event.getIndexes()
[
{
"v" : 2,
"key" : {
"_id" : 1
},
"name" : "_id_",
},
{
"v" : 2,
"key" : {
"epochWID" : 1,
"category" : 1,
"mos.types" : 1,
"mos.name" : 1
},
"name" : "epochWID_category_motype_movalue",
}
]
Query is as follows:
> db.event.explain().find({ "epochWID": 1510456188087, "category": 6, "mos.types": 9, "mos.name": "ctx_1" })
{
"queryPlanner" : {
"plannerVersion" : 1,
"indexFilterSet" : false,
"parsedQuery" : {
"$and" : [
{
"category" : {
"$eq" : 6
}
},
{
"epochWID" : {
"$eq" : 1510456188087
}
},
{
"mos.name" : {
"$eq" : "ctx_1"
}
},
{
"mos.types" : {
"$eq" : 9
}
}
]
},
"winningPlan" : {
"stage" : "FETCH",
"filter" : {
"mos.name" : {
"$eq" : "ctx_1"
}
},
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"epochWID" : 1,
"category" : 1,
"mos.types" : 1,
"mos.name" : 1
},
"indexName" : "epochWID_category_motype_movalue",
"isMultiKey" : true,
"multiKeyPaths" : {
"epochWID" : [ ],
"category" : [ ],
"mos.types" : [
"mos",
"mos.types"
],
"mos.name" : [
"mos"
]
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"epochWID" : [
"[1510456188087.0, 1510456188087.0]"
],
"category" : [
"[6.0, 6.0]"
],
"mos.types" : [
"[9.0, 9.0]"
],
"mos.name" : [
"[MinKey, MaxKey]"
]
}
}
},
"rejectedPlans" : [ ]
},
"serverInfo" : {
"version" : "3.4.9",
},
"ok" : 1
}
Now look at the plan's indexBounds: it uses the first three keys but not the fourth, mos.name. Why?
"indexBounds" : {
"epochWID" : [
"[1510456188087.0, 1510456188087.0]"
],
"category" : [
"[6.0, 6.0]"
],
"mos.types" : [
"[9.0, 9.0]"
],
"mos.name" : [
"[MinKey, MaxKey]"
]
}
Based on https://docs.mongodb.com/manual/core/index-multikey/#compound-multikey-indexes we need to use $elemMatch, so the following query uses the full index:
> db.event.explain().find({ "epochWID": 1510456188087, "category": 6, "mos": { $elemMatch: {"types": 9, "name": "ctx_1"} } })
{
"queryPlanner" : {
"plannerVersion" : 1,
"indexFilterSet" : false,
"parsedQuery" : {
"$and" : [
{
"mos" : {
"$elemMatch" : {
"$and" : [
{
"name" : {
"$eq" : "ctx_1"
}
},
{
"types" : {
"$eq" : 9
}
}
]
}
}
},
{
"category" : {
"$eq" : 6
}
},
{
"epochWID" : {
"$eq" : 1510456188087
}
}
]
},
"winningPlan" : {
"stage" : "FETCH",
"filter" : {
"mos" : {
"$elemMatch" : {
"$and" : [
{
"types" : {
"$eq" : 9
}
},
{
"name" : {
"$eq" : "ctx_1"
}
}
]
}
}
},
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"epochWID" : 1,
"category" : 1,
"mos.types" : 1,
"mos.name" : 1
},
"indexName" : "epochWID_category_motype_movalue",
"isMultiKey" : true,
"multiKeyPaths" : {
"epochWID" : [ ],
"category" : [ ],
"mos.types" : [
"mos",
"mos.types"
],
"mos.name" : [
"mos"
]
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"epochWID" : [
"[1510456188087.0, 1510456188087.0]"
],
"category" : [
"[6.0, 6.0]"
],
"mos.types" : [
"[9.0, 9.0]"
],
"mos.name" : [
"[\"ctx_1\", \"ctx_1\"]"
]
}
}
},
"rejectedPlans" : [ ]
},
"serverInfo" : {
"version" : "3.4.9",
},
"ok" : 1
}
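One way to see why the two queries are planned differently: with dotted paths, each predicate may be satisfied by a different element of mos, while $elemMatch requires a single element to satisfy both. The plain-JavaScript sketch below mimics the two matching rules for an array of subdocuments. matchesDotted, matchesElemMatch and fieldMatches are hypothetical helpers, and the sample document is made up for illustration.

```javascript
// Equality on a field that may hold a scalar or an array of scalars.
function fieldMatches(value, target) {
  return Array.isArray(value) ? value.includes(target) : value === target;
}

// { "mos.types": t, "mos.name": n }: each predicate may be met by a different element.
function matchesDotted(doc, types, name) {
  return doc.mos.some((m) => fieldMatches(m.types, types)) &&
         doc.mos.some((m) => fieldMatches(m.name, name));
}

// { mos: { $elemMatch: { types: t, name: n } } }: one element must meet both.
function matchesElemMatch(doc, types, name) {
  return doc.mos.some((m) => fieldMatches(m.types, types) && fieldMatches(m.name, name));
}

// Made-up document: 9 and "ctx_1" live in different elements of mos.
const splitDoc = { mos: [{ types: [9], name: "ctx_2" }, { types: [3], name: "ctx_1" }] };

console.log(matchesDotted(splitDoc, 9, "ctx_1"));    // true
console.log(matchesElemMatch(splitDoc, 9, "ctx_1")); // false
```

Because the dotted form can be satisfied across elements, the bounds for mos.types and mos.name cannot safely be combined on the same index key, which is consistent with the [MinKey, MaxKey] bounds on mos.name in the first plan.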
EDIT: I contacted MongoDB support regarding multikey indexes and array fields. The tl;dr: an index is fine as long as only one of the indexed fields ever contains an array value (which is true in my case). Nesting level doesn't matter. The problem is indeed parallel arrays, because indexing them would require a cartesian product.
A compound multikey index cannot span multiple array fields within a document. See the limitations and reasoning in the documentation: https://docs.mongodb.com/manual/core/index-multikey/#compound-multikey-indexes
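To make the cartesian-product point concrete, the sketch below counts how many compound index keys one document would need. It is a simplified model for independent top-level fields holding scalars or non-empty arrays, not MongoDB's actual key generation, and countCompoundKeys is a hypothetical helper.

```javascript
// Simplified model: each indexed field contributes one key per array element
// (or one key for a scalar), and a compound key is needed per combination.
// Assumes non-empty arrays. Not MongoDB's real key-generation code.
function countCompoundKeys(doc, fields) {
  return fields.reduce((count, field) => {
    const value = doc[field];
    return count * (Array.isArray(value) ? value.length : 1);
  }, 1);
}

// Made-up documents for illustration.
const oneArray = { a: 1, b: [1, 2, 3] };
const parallelArrays = { a: [1, 2, 3, 4], b: [1, 2, 3] };

console.log(countCompoundKeys(oneArray, ["a", "b"]));       // 3
console.log(countCompoundKeys(parallelArrays, ["a", "b"])); // 12
```

With a single array field the key count grows linearly with the array length; with two parallel arrays it grows with the product of their lengths, which is the case MongoDB refuses to index.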
I am new to Mongo and am trying to get a distinct count of users. The fields Id and Status are not individually indexed, but there is a composite index on both fields. My current query looks like this; the match conditions change depending on the requirements.
DBQuery.shellBatchSize = 1000000;
db.getCollection('username').aggregate([
{$match:
{ Status: "A"
} },
{"$group" : {_id:"$Id", count:{$sum:1}}}
]);
Is there any way to optimize this query further, or to run it in parallel on the collection, so that we get results faster?
You can tune your aggregation pipelines by passing the explain: true option to the aggregate method.
db.getCollection('username').aggregate([
{$match: { Status: "A" } },
{"$group" : {_id:"$Id", count:{$sum:1}}}],
{ explain: true });
This outputs the following to work with (run here against a test collection that does not exist yet, hence the EOF winning plan):
{
"stages" : [
{
"$cursor" : {
"query" : {
"Status" : "A"
},
"fields" : {
"Id" : 1,
"_id" : 0
},
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "test.usernames",
"indexFilterSet" : false,
"parsedQuery" : {
"Status" : {
"$eq" : "A"
}
},
"winningPlan" : {
"stage" : "EOF"
},
"rejectedPlans" : [ ]
}
}
},
{
"$group" : {
"_id" : "$Id",
"count" : {
"$sum" : {
"$const" : 1
}
}
}
}
],
"ok" : 1
}
So to speed up the query we need an index to support the $match stage of the pipeline. Let's create an index on Status:
> db.usernames.createIndex({Status:1})
{
"createdCollectionAutomatically" : true,
"numIndexesBefore" : 1,
"numIndexesAfter" : 2,
"ok" : 1
}
If we now run the explain again we'll get the following results
{
"stages" : [
{
"$cursor" : {
"query" : {
"Status" : "A"
},
"fields" : {
"Id" : 1,
"_id" : 0
},
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "test.usernames",
"indexFilterSet" : false,
"parsedQuery" : {
"Status" : {
"$eq" : "A"
}
},
"winningPlan" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"Status" : 1
},
"indexName" : "Status_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"Status" : [ ]
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"Status" : [
"[\"A\", \"A\"]"
]
}
}
},
"rejectedPlans" : [ ]
}
}
},
{
"$group" : {
"_id" : "$Id",
"count" : {
"$sum" : {
"$const" : 1
}
}
}
}
],
"ok" : 1
}
We can now see straight away that the query is using an index.
https://docs.mongodb.com/manual/reference/explain-results/
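When comparing explain outputs like the two above, it can help to reduce a winning plan to its list of stage names. The following is a minimal sketch in plain JavaScript; planStages is a hypothetical helper, not part of the MongoDB shell, and it assumes the plan nests its children under inputStage (or inputStages), as in the outputs shown here.

```javascript
// Collect stage names from a queryPlanner.winningPlan object, outermost first.
// planStages is a hypothetical helper, not a MongoDB shell function.
function planStages(plan) {
  if (!plan) return [];
  const children = plan.inputStages || (plan.inputStage ? [plan.inputStage] : []);
  return [plan.stage].concat(...children.map((child) => planStages(child)));
}

// Winning plans copied (abridged) from the two explain outputs above.
const beforeIndex = { stage: "EOF" };
const afterIndex = {
  stage: "FETCH",
  inputStage: { stage: "IXSCAN", keyPattern: { Status: 1 }, indexName: "Status_1" },
};

console.log(planStages(beforeIndex).join(" -> ")); // EOF
console.log(planStages(afterIndex).join(" -> "));  // FETCH -> IXSCAN
```

In the shell, the same function could be applied to the winningPlan found under stages[0].$cursor.queryPlanner of an aggregate explain result with the shape shown above.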
My aggregation is pretty slow. I've already made it faster (from 3000 ms to 200 ms) by putting the $match stage before the $unwind stage. Is there any other way to improve it? In the end there will be just one result (the latest one by timestamp). If I'm right, the $unwind stage is the slowest operation, yet I really do need it.
db.CpuInfo.aggregate([
{"$match":
{
"timestamp": {"$gte":1464764400},
'hostname': 'baklap4'
}
},
{ "$unwind": "$cpuList" },
{ "$group":
{ "_id":
{ "interval":
{ "$subtract": [
"$timestamp",
{ "$mod": [ "$timestamp", 60 * 5 ] }
]}
},
"avgCPULoad": { "$avg": "$cpuList.load" },
"timestamp": { "$max": "$timestamp" }
}
},
{ "$project": { "_id": 0, "avgCPULoad": 1, "timestamp": 1 } },
{$sort: {'timestamp': -1}},
{$limit: 1}
])
The items in my collection are all similar to this:
{
"_id": ObjectId("574d6175da461e77030041b7"),
"hostname": "VPS",
"timestamp": NumberLong(1460040691),
"cpuCores": NumberLong(2),
"cpuList": [
{
"name": "cpu1",
"load": 3.4
},
{
"name": "cpu2",
"load": 0.7
}
]
}
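The $group key in the pipeline above rounds each timestamp down to the start of its 5-minute window. A small plain-JavaScript sketch of that arithmetic, applied to the sample document (bucketStart and averageLoad are hypothetical helper names, and the timestamp is assumed to be in seconds):

```javascript
// Round a Unix timestamp (seconds) down to the start of its 5-minute window,
// mirroring { $subtract: [ "$timestamp", { $mod: [ "$timestamp", 60 * 5 ] } ] }.
function bucketStart(timestamp, windowSeconds = 60 * 5) {
  return timestamp - (timestamp % windowSeconds);
}

// Mean load over the cpuList entries of a single document.
function averageLoad(cpuList) {
  return cpuList.reduce((sum, cpu) => sum + cpu.load, 0) / cpuList.length;
}

// Sample document from the question (only the fields used here).
const sampleCpuDoc = {
  hostname: "VPS",
  timestamp: 1460040691,
  cpuList: [{ name: "cpu1", load: 3.4 }, { name: "cpu2", load: 0.7 }],
};

console.log(bucketStart(sampleCpuDoc.timestamp)); // 1460040600
console.log(averageLoad(sampleCpuDoc.cpuList));   // approximately 2.05
```

Note that the pipeline averages over all unwound cpuList entries in a window, which equals the mean of these per-document averages only when every document has the same number of cores.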
I've added the explain option to my aggregation and this is the result:
{
"waitedMS" : NumberLong(0),
"stages" : [
{
"$cursor" : {
"query" : {
"timestamp" : {
"$gte" : 1464732000
},
"hostname" : "baklap4"
},
"fields" : {
"cpuList" : 1,
"timestamp" : 1,
"_id" : 0
},
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "prototyping.CpuInfo",
"indexFilterSet" : false,
"parsedQuery" : {
"$and" : [
{
"hostname" : {
"$eq" : "baklap4"
}
},
{
"timestamp" : {
"$gte" : 1464732000
}
}
]
},
"winningPlan" : {
"stage" : "FETCH",
"filter" : {
"hostname" : {
"$eq" : "baklap4"
}
},
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"timestamp" : NumberLong(1)
},
"indexName" : "timestamp_1",
"isMultiKey" : false,
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 1,
"direction" : "forward",
"indexBounds" : {
"timestamp" : [
"[1464732000.0, inf.0]"
]
}
}
},
"rejectedPlans" : [ ]
}
}
},
{
"$unwind" : {
"path" : "$cpuList"
}
},
{
"$group" : {
"_id" : {
"interval" : {
"$subtract" : [
"$timestamp",
{
"$mod" : [
"$timestamp",
{
"$const" : 300
}
]
}
]
}
},
"avgCPULoad" : {
"$avg" : "$cpuList.load"
},
"timestamp" : {
"$max" : "$timestamp"
}
}
},
{
"$project" : {
"_id" : false,
"timestamp" : true,
"avgCPULoad" : true
}
},
{
"$sort" : {
"sortKey" : {
"timestamp" : -1
},
"limit" : NumberLong(1)
}
}
],
"ok" : 1
}
When I look up the indexes on my collection, I see that timestamp and _id are indexed:
db.CpuInfo.getIndexes()
[
{
"v" : 1,
"key" : {
"_id" : 1
},
"name" : "_id_",
"ns" : "prototyping.CpuInfo"
},
{
"v" : 1,
"key" : {
"timestamp" : NumberLong(1)
},
"name" : "timestamp_1",
"ns" : "prototyping.CpuInfo",
"sparse" : false
}
]