I have a collection with millions of documents, where each document represents an event: {_id, product, timeEvent}
In my query, I need to group by product and take, for example, the top 10.
"aggregate" : "product_events",
"pipeline" : [
{
"$match" : {
"timeEvent" : {
"$gt" : ISODate("2017-07-17T00:00:00Z")
}
}
},
{
"$group" : {
"_id" : "$product",
"count" : {
"$sum" : 1
}
}
},
{
"$sort" : {
"count" : -1
}
},
{
"$limit" : 10
}
]
My query is now very slow (about 10 seconds). I am wondering if there is a way to store the data differently to optimise this query.
db.product_events.explain("executionStats").aggregate([
    { "$match" : { "timeEvent" : { "$gt" : ISODate("2017-07-17T00:00:00Z") } } },
    { "$group" : { "_id" : "$product", "count" : { "$sum" : 1 } } },
    { "$project" : { "_id" : 1, "count" : 1 } },
    { "$sort" : { "count" : -1 } },
    { "$limit" : 500 }
], { "allowDiskUse" : true })
{
"stages" : [
{
"$cursor" : {
"query" : {
"timeEvent" : {
"$gt" : ISODate("2017-07-17T00:00:00Z")
}
},
"fields" : {
"product" : 1,
"_id" : 0
},
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "mydb.product_events",
"indexFilterSet" : false,
"parsedQuery" : {
"timeEvent" : {
"$gt" : ISODate("2017-07-17T00:00:00Z")
}
},
"winningPlan" : {
"stage" : "COLLSCAN",
"filter" : {
"timeEvent" : {
"$gt" : ISODate("2017-07-17T00:00:00Z")
}
},
"direction" : "forward"
},
"rejectedPlans" : [ ]
},
"executionStats" : {
"executionSuccess" : true,
"nReturned" : 2127315,
"executionTimeMillis" : 940,
"totalKeysExamined" : 0,
"totalDocsExamined" : 2127315,
"executionStages" : {
"stage" : "COLLSCAN",
"filter" : {
"timeEvent" : {
"$gt" : ISODate("2017-07-17T00:00:00Z")
}
},
"nReturned" : 2127315,
"executionTimeMillisEstimate" : 810,
"works" : 2127317,
"advanced" : 2127315,
"needTime" : 1,
"needYield" : 0,
"saveState" : 16620,
"restoreState" : 16620,
"isEOF" : 1,
"invalidates" : 0,
"direction" : "forward",
"docsExamined" : 2127315
}
}
}
},
{
"$group" : {
"_id" : "$product",
"count" : {
"$sum" : {
"$const" : 1
}
}
}
},
{
"$project" : {
"_id" : true,
"count" : true
}
},
{
"$sort" : {
"sortKey" : {
"count" : -1
},
"limit" : NumberLong(500)
}
}
],
"ok" : 1
}
Below are my indexes:
db.product_events.getIndexes()
[
{
"v" : 2,
"key" : {
"_id" : 1
},
"name" : "_id_",
"ns" : "mydb.product_events"
},
{
"v" : 2,
"key" : {
"product" : 1,
"timeEvent" : -1
},
"name" : "product_1_timeEvent_-1",
"ns" : "mydb.product_events"
}
]
Creating indexes on the fields of a collection helps optimise the retrieval of data from the collection.
Indexes are generally created on the fields by which data is filtered according to specific criteria.
The data in an indexed field is kept in sorted order, and while fetching data, once the matching range is found, the scan of the remaining documents can stop, which makes fetching faster.
Based on the question above, to optimise the performance of the aggregate query, try creating an index on the timeEvent field, since timeEvent is used as the filter expression in the $match stage of the aggregation pipeline.
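For example, a minimal sketch (for a $gt range scan either key direction works; ascending is shown here):

// Hypothetical single-field index so the $match stage can use an IXSCAN
// instead of the COLLSCAN shown in the explain output above.
db.product_events.createIndex({ timeEvent: 1 })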
The documentation on compound indexes states the following.
db.products.createIndex( { "item": 1, "stock": 1 } )
The order of the fields listed in a compound index is important. The
index will contain references to documents sorted first by the values
of the item field and, within each value of the item field, sorted by
values of the stock field.
In addition to supporting queries that match on all the index fields,
compound indexes can support queries that match on the prefix of the
index fields. That is, the index supports queries on the item field as
well as both item and stock fields.
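To illustrate that prefix support, here are a few hypothetical queries against that products collection:

// Can use the index: "item" is a prefix of { item: 1, stock: 1 }
db.products.find({ "item" : "banana" })
// Can use the index: matches on both indexed fields
db.products.find({ "item" : "banana", "stock" : { "$gt" : 5 } })
// Generally cannot use the index: "stock" alone is not a prefix
db.products.find({ "stock" : { "$gt" : 5 } })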
Your product_1_timeEvent_-1 index looks like this:
{
"product" : 1,
"timeEvent" : -1
}
which is why it cannot be used to support a query that only filters on timeEvent.
Options you have to get that sorted:
Flip the order of the fields in your index (see the sketch below)
Remove the product field from your index
Create an additional index with only the timeEvent field in it.
(Include some additional filter on the product field so the existing index gets used)
And keep in mind that any creation/deletion/modification of an index may impact other queries, too. So make sure you test your changes properly.
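For example, a sketch of the first option (untested against your data; the flipped compound index still supports queries on both fields):

// Hypothetical: timeEvent leads, so a filter on timeEvent alone can use
// the index prefix; product second keeps product+timeEvent queries covered.
db.product_events.createIndex({ timeEvent: -1, product: 1 })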
This is what I have tried so far with the aggregation query:
db.getCollection('storage').aggregate([
{
"$match": {
"user_id": 2
}
},
{
"$project": {
"formattedDate": {
"$dateToString": { "format": "%Y-%m", "date": "$created_on" }
},
"size": "$size"
}
},
{ "$group": {
"_id" : "$formattedDate",
"size" : { "$sum": "$size" }
} }
])
This is the result:
/* 1 */
{
"_id" : "2018-02",
"size" : NumberLong(10860595386)
}
/* 2 */
{
"_id" : "2017-12",
"size" : NumberLong(524288)
}
/* 3 */
{
"_id" : "2018-01",
"size" : NumberLong(21587971)
}
And this is the document structure:
{
"_id" : ObjectId("5a59efedd006b9036159e708"),
"user_id" : NumberLong(2),
"is_transferred" : false,
"is_active" : false,
"process_id" : NumberLong(0),
"ratio" : 0.000125759169459343,
"type_id" : 201,
"size" : NumberLong(1687911),
"is_processed" : false,
"created_on" : ISODate("2018-01-13T11:39:25.000Z"),
"processed_on" : ISODate("1970-01-01T00:00:00.000Z")
}
And last, the explain result:
/* 1 */
{
"stages" : [
{
"$cursor" : {
"query" : {
"user_id" : 2.0
},
"fields" : {
"created_on" : 1,
"size" : 1,
"_id" : 1
},
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "data.storage",
"indexFilterSet" : false,
"parsedQuery" : {
"user_id" : {
"$eq" : 2.0
}
},
"winningPlan" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"user_id" : 1
},
"indexName" : "user_id",
"isMultiKey" : false,
"multiKeyPaths" : {
"user_id" : []
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"user_id" : [
"[2.0, 2.0]"
]
}
}
},
"rejectedPlans" : []
}
}
},
{
"$project" : {
"_id" : true,
"formattedDate" : {
"$dateToString" : {
"format" : "%Y-%m",
"date" : "$created_on"
}
},
"size" : "$size"
}
},
{
"$group" : {
"_id" : "$formattedDate",
"size" : {
"$sum" : "$size"
}
}
}
],
"ok" : 1.0
}
The problem:
I can navigate and get all results almost instantly, in about 0.002s. However, when I specify a user_id and sum the sizes grouped by month, the result comes back in 0.300s to 0.560s. I run several similar tasks in one request, and it takes more than a second to finish.
What I tried so far:
I've added an index for user_id
I've added an index for created_on
I used more $match conditions; however, this made it even worse.
This collection currently has almost 200,000 documents, and approximately 150,000 of them belong to user_id = 2.
How can I minimize the response time for this query?
Note: MongoDB 3.4.10 used.
Pratha,
try adding a sort on the "created_on" and "size" fields as the first stage in the aggregation pipeline.
db.getCollection('storage').aggregate([
{
"$sort": {
"created_on": 1, "size": 1
}
}, ....
Before that, add a compound index:
db.getCollection('storage').createIndex({created_on:1,size:1})
Sorting the data before the $group stage improves the efficiency of accumulating the totals.
A note about the $sort aggregation stage:
The $sort stage has a limit of 100 megabytes of RAM. By default, if the stage exceeds this limit, $sort will produce an error. To allow for the handling of large datasets, set the allowDiskUse option to true to enable $sort operations to write to temporary files.
P.S. To test performance, try getting rid of the $match stage on user_id, or add user_id to the compound index as well.
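A sketch of that second suggestion (the field order is an assumption based on the query shape; verify with explain on your data):

// Hypothetical compound index: user_id for the $match, then the fields
// the pipeline reads and sorts on.
db.getCollection('storage').createIndex({ user_id: 1, created_on: 1, size: 1 })

// And let $sort spill to disk if it exceeds the 100 MB limit:
db.getCollection('storage').aggregate([
    { "$sort": { "created_on": 1, "size": 1 } }
    /* ...rest of the pipeline... */
], { "allowDiskUse": true })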
I have the following collection:
> db.foo.find()
{ "_id" : 1, "k" : [ { "a" : 50, "b" : 10 } ] }
{ "_id" : 2, "k" : [ { "a" : 90, "b" : 80 } ] }
With a compound index on the k field:
"key" : {
"k.a" : 1,
"k.b" : 1
},
"name" : "k.a_1_k.b_1"
If I run the following query:
db.foo.aggregate([
{ $match: { "k.a" : 50 } },
{ $project: { _id : 0, "dummy": {$literal:""} }}
])
The index is used (which makes sense) and there is no need for a FETCH stage:
"winningPlan" : {
"stage" : "COUNT_SCAN",
"keyPattern" : {
"k.a" : 1,
"k.b" : 1
},
"indexName" : "k.a_1_k.b_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"k.a" : [ ],
"k.b" : [ ]
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"indexBounds" : {
"startKey" : {
"k.a" : 50,
"k.b" : { "$minKey" : 1 }
},
"startKeyInclusive" : true,
"endKey" : {
"k.a" : 50,
"k.b" : { "$maxKey" : 1 }
},
"endKeyInclusive" : true
}
}
However, if I run the following query that uses $elemMatch:
db.foo.aggregate([
{ $match: { k: {$elemMatch: {a : 50, b : { $in : [5, 6, 10]}}}} },
{ $project: { _id : 0, "dummy" : {$literal : ""}} }
])
There is a FETCH stage (which, AFAIK, is not necessary):
"winningPlan" : {
"stage" : "FETCH",
"filter" : {
"k" : {
"$elemMatch" : {
"$and" : [
{ "a" : { "$eq" : 50 } },
{ "b" : { "$in" : [ 5, 6, 10 ] } }
]
}
}
},
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"k.a" : 1,
"k.b" : 1
},
"indexName" : "k.a_1_k.b_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"k.a" : [ ],
"k.b" : [ ]
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"k.a" : [
"[50.0, 50.0]"
],
"k.b" : [
"[5.0, 5.0]",
"[6.0, 6.0]",
"[10.0, 10.0]"
]
}
}
},
I am using MongoDB 3.4.
I ask this because I have a DB with lots of documents, and there is a query that uses aggregate() and $elemMatch (it performs more useful things than projecting nothing as in this question, of course, but in theory it does not require a FETCH stage). I found that the main reason the query is slow is the FETCH stage, which AFAIK is not needed.
Is there some logic that forces MongoDB to use FETCH when $elemMatch is used, or is it just a missing optimization?
It looks like even a single $elemMatch forces Mongo to do FETCH -> COUNT instead of COUNT_SCAN. I opened a ticket in their Jira with simple steps to reproduce: https://jira.mongodb.org/browse/SERVER-35223
TLDR: this is the expected behaviour of a multikey index combined with an $elemMatch.
From the covered queries section of the multikey index documentation:
Multikey indexes cannot cover queries over array field(s).
Meaning the multikey index does not contain all the information about a sub-document.
Let's imagine the following scenario:
//doc1
{
"k" : [ { "a" : 50, "b" : 10 } ]
}
//doc2
{
"k" : { "a" : 50, "b" : 10 }
}
Because Mongo "flattens" the array it indexes, once the index is built Mongo cannot differentiate between these 2 documents, $elemMatch specifically requires an array object to match (i.e doc2 will never match an $elemMatch query).
Meaning Mongo is forced to FETCH the documents to determine which docs will match, this is the premise causing the behaviour you see.
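A quick way to see this in the shell (a sketch using the two example documents above; the collection name bar is hypothetical):

// Both documents produce identical index keys once the array is
// flattened, so only a FETCH can tell them apart.
db.bar.insertMany([
    { _id: 1, k: [ { a: 50, b: 10 } ] },  // doc1: k is an array
    { _id: 2, k: { a: 50, b: 10 } }       // doc2: k is a plain sub-document
])
db.bar.createIndex({ "k.a": 1, "k.b": 1 })
db.bar.find({ k: { $elemMatch: { a: 50, b: 10 } } })  // returns only _id: 1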
I would like to know whether MongoDB should re-order data after inserts according to the previously configured indexes. For instance, after inserting data in the sequence below:
db.test.insert({ _id: 1 })
db.test.insert({ _id: 5 })
db.test.insert({ _id: 2 })
And executing the following search:
db.test.find();
We can see the result:
{ "_id" : 1 }
{ "_id" : 5 }
{ "_id" : 3 }
As we know, the _id field has an index by default. The question here is: why, after executing the search, are the results not returned in sequence, as presented below?
{ "_id" : 1 }
{ "_id" : 3 }
{ "_id" : 5 }
MongoDB indexes are separate data structures that contain only the indexed portion of the collection's data. An index does not dictate the storage order of documents at insert time.
Indexes are only used for sorting when a sort order is specified in the search:
db.test.find().sort({ _id: 1 })
{ "_id" : 1 }
{ "_id" : 2 }
{ "_id" : 5 }
You can verify the index usage using explain:
db.test.explain().find().sort({_id:1})
{
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "so.test",
"indexFilterSet" : false,
"parsedQuery" : {
"$and" : [ ]
},
"winningPlan" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"_id" : 1
},
"indexName" : "_id_",
"isMultiKey" : false,
"direction" : "forward",
"indexBounds" : {
"_id" : [
"[MinKey, MaxKey]"
]
}
}
},
"rejectedPlans" : [ ]
},
"serverInfo" : {
"host" : "KSA303096TX2",
"port" : 27017,
"version" : "3.0.11",
"gitVersion" : "48f8b49dc30cc2485c6c1f3db31b723258fcbf39 modules: enterprise"
},
"ok" : 1
}
Suppose we have the following document:
{
embedded:[
{
email:"abc#abc.com",
active:true
},
{
email:"def#abc.com",
active:false
}]
}
What index should be used to support an $elemMatch query on the email and active fields of the embedded documents?
Update on the question:
db.foo.aggregate([{"$match":{"embedded":{"$elemMatch":{"email":"abc#abc.com","active":true}}}},{"$group":{_id:null,"total":{"$sum":1}}}],{explain:true});
On running this, I get the following output from explain on the aggregate:
{
"stages" : [
{
"$cursor" : {
"query" : {
"embedded" : {
"$elemMatch" : {
"email" : "abc#abc.com",
"active" : true
}
}
},
"fields" : {
"_id" : 0,
"$noFieldsNeeded" : 1
},
"planError" : "InternalError No plan available to provide stats"
}
},
{
"$group" : {
"_id" : {
"$const" : null
},
"total" : {
"$sum" : {
"$const" : 1
}
}
}
}
],
"ok" : 1
}
I think MongoDB is internally not using the index for this query.
Update with the output of db.foo.stats():
db.foo.stats()
{
"ns" : "test.foo",
"count" : 2,
"size" : 480,
"avgObjSize" : 240,
"storageSize" : 8192,
"numExtents" : 1,
"nindexes" : 3,
"lastExtentSize" : 8192,
"paddingFactor" : 1,
"systemFlags" : 0,
"userFlags" : 1,
"totalIndexSize" : 24528,
"indexSizes" : {
"_id_" : 8176,
"embedded.email_1_embedded.active_1" : 8176,
"name_1" : 8176
},
"ok" : 1
}
db.foo.getIndexes();
[
{
"v" : 1,
"key" : {
"_id" : 1
},
"name" : "_id_",
"ns" : "test.foo"
},
{
"v" : 1,
"key" : {
"embedded.email" : 1,
"embedded.active" : 1
},
"name" : "embedded.email_1_embedded.active_1",
"ns" : "test.foo"
},
{
"v" : 1,
"key" : {
"name" : 1
},
"name" : "name_1",
"ns" : "test.foo"
}
]
Should you decide to stick to that data model and your queries, here's how to create indexes that match the query:
You can simply index "embedded.email", or use a compound index on the embedded fields, i.e. something like
> db.foo.ensureIndex({"embedded.email" : 1 });
- or -
> db.foo.ensureIndex({"embedded.email" : 1, "embedded.active" : 1});
Indexing boolean fields is often not too useful, since their selectivity is low.
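To check that the compound index is actually picked up for the $elemMatch query, you can inspect the plan (a sketch; the winning plan should show an IXSCAN on embedded.email_1_embedded.active_1):

// Verify index usage for the $elemMatch filter.
db.foo.find({ "embedded" : { "$elemMatch" : { "email" : "abc@abc.com", "active" : true } } }).explain()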
I'm running the standard Homebrew installation of MongoDB, version 2.4.6, and I've got a database with a collection called 'items', which has 600k documents in it.
I've written the following query to find the top five brands for the collection of items:
db.items.aggregate([
{ $group: { _id: '$brand', size: { $sum: 1}}},
{ $sort: {"size": -1}},
{ $limit: 5}
])
which returns the result I expected, but to be frank, takes much longer to complete than I ever would have imagined. Here is the profile data:
{
"op" : "command",
"ns" : "insights-development.$cmd",
"command" : {
"aggregate" : "items",
"pipeline" : [
{
"$group" : {
"_id" : "$brand",
"size" : {
"$sum" : 1
}
}
},
{
"$sort" : {
"size" : -1
}
},
{
"$limit" : 5
}
]
},
"ntoreturn" : 1,
"keyUpdates" : 0,
"numYield" : 3,
"lockStats" : {
"timeLockedMicros" : {
"r" : NumberLong(3581974),
"w" : NumberLong(0)
},
"timeAcquiringMicros" : {
"r" : NumberLong(1314151),
"w" : NumberLong(10)
}
},
"responseLength" : 267,
"millis" : 2275,
"ts" : ISODate("2013-11-23T18:16:33.886Z"),
"client" : "127.0.0.1",
"allUsers" : [ ],
"user" : ""
}
Here is the output of db.items.stats():
{
"sharded" : false,
"primary" : "a59aff30810b066bbe31d1fae79596af",
"ns" : "insights-development.items",
"count" : 640590,
"size" : 454491840,
"avgObjSize" : 709.4894394230319,
"storageSize" : 576061440,
"numExtents" : 14,
"nindexes" : 10,
"lastExtentSize" : 156225536,
"paddingFactor" : 1,
"systemFlags" : 1,
"userFlags" : 0,
"totalIndexSize" : 165923744,
"indexSizes" : {
"_id_" : 17889088,
"demographic_1" : 14741328,
"brand_1" : 17946320,
"retailer_1" : 18690336,
"color_1" : 15738800,
"style_1" : 18951968,
"classification_1" : 15019312,
"placement_1" : 19107312,
"state_1" : 12394816,
"gender_1" : 15444464
},
"ok" : 1
}
I'm fairly new to MongoDB, so I'm hoping someone can point out why this aggregation takes so long to run, and whether there is anything I can do to speed it up; 600k doesn't seem like a huge number of documents for Mongo to run calculations on.
If you have an index on the "brand" field, then adding a {$sort:{brand:1}} at the beginning of the pipeline may help performance. The reason you're not seeing good performance right now is likely the need to scan every document to group by brand. If there were an index, it could be used to scan the index only rather than all the documents. And sorting (which uses an index) can speed up grouping in cases where having the result ordered by the grouped field is beneficial.
If you created an index on brand and didn't see any improvement, try adding a $sort before you get rid of the index. If it happens that you already have an index where brand is the first field, you then don't need to add another index on brand - the compound index will automatically be used.
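A sketch of that combination (per the stats output above, a brand_1 index already exists). The leading $sort is this answer's suggestion, not a guaranteed win, so measure it:

// Sort by the grouping key first so $group receives documents in index
// order, then sort the groups by size and take the top five.
db.items.aggregate([
    { $sort: { brand: 1 } },
    { $group: { _id: '$brand', size: { $sum: 1 } } },
    { $sort: { size: -1 } },
    { $limit: 5 }
])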