I have the following Mongo test cluster in place:
Number of shards: 2
Number of config servers: 1
Number of mongos instances: 2
Replication: not enabled
I have around 41 million records split across the shards. I have defined a compound index {field1: 1, field2: 1, field3: 1}, and my queries are of the form (field1 = <value> AND field2 BETWEEN x AND y), so I expected the compound index to be useful for them. However, the response time is around 8 seconds for the query I described. I am specifying only the fields of interest when I execute find().
Mongos is installed on the machine from which I execute the query, and I am using Java to do the querying.
Can someone throw some light on the possible reasons why this query takes such a long time? I would be happy to provide more information if required.
The following is the output of the explain command:
{
"indexBounds": {
"LOGIN_ID": [
[
{
"$minElement": 1
},
{
"$maxElement": 1
}
]
],
"LOGIN_TIME": [
[
1262332800000,
1293782400000
]
]
},
"nYields": 7,
"millisShardTotal": 7410,
"millisShardAvg": 7410,
"numQueries": 1,
"nChunkSkips": 0,
"shards": {
"server1:27017": [
{
"nYields": 7,
"nscannedAllPlans": 1769804,
"allPlans": [
{
"cursor": "BtreeCursor LOGIN_TIME_1_LOGIN_ID_1",
"indexBounds": {
"LOGIN_ID": [
[
{
"$minElement": 1
},
{
"$maxElement": 1
}
]
],
"LOGIN_TIME": [
[
1262332800000,
1293782400000
]
]
},
"nscannedObjects": 1763903,
"nscanned": 1763903,
"n": 14081
},
{
"cursor": "BasicCursor",
"indexBounds": {},
"nscannedObjects": 5901,
"nscanned": 5901,
"n": 0
}
],
"millis": 7410,
"nChunkSkips": 0,
"server": "server2:27017",
"n": 14081,
"cursor": "BtreeCursor LOGIN_TIME_1_LOGIN_ID_1",
"oldPlan": {
"cursor": "BtreeCursor LOGIN_TIME_1_LOGIN_ID_1",
"indexBounds": {
"LOGIN_ID": [
[
{
"$minElement": 1
},
{
"$maxElement": 1
}
]
],
"LOGIN_TIME": [
[
1262332800000,
1293782400000
]
]
}
},
"scanAndOrder": false,
"indexBounds": {
"LOGIN_ID": [
[
{
"$minElement": 1
},
{
"$maxElement": 1
}
]
],
"LOGIN_TIME": [
[
1262332800000,
1293782400000
]
]
},
"nscannedObjectsAllPlans": 1769804,
"isMultiKey": false,
"indexOnly": false,
"nscanned": 1763903,
"nscannedObjects": 1763903
}
]
},
"n": 14081,
"cursor": "BtreeCursor LOGIN_TIME_1_LOGIN_ID_1",
"oldPlan": {
"cursor": "BtreeCursor LOGIN_TIME_1_LOGIN_ID_1",
"indexBounds": {
"LOGIN_ID": [
[
{
"$minElement": 1
},
{
"$maxElement": 1
}
]
],
"LOGIN_TIME": [
[
1262332800000,
1293782400000
]
]
}
},
"numShards": 1,
"clusteredType": "ParallelSort",
"nscannedAllPlans": 1769804,
"nscannedObjectsAllPlans": 1769804,
"millis": 7438,
"nscanned": 1763903,
"nscannedObjects": 1763903
}
A sample document in my DB is as follows:
{
"_id" : ObjectId("52d5192c1a45f84e48c24e2f"),
"LOGIN_ID" : <loginId>,
"LOGIN_TIME" : NumberLong("1372343932000"),
"BUSINESS_ID" : <businessId>,
"USER_ID" : <userid>,
"EMAIL" : "a#b.com",
"SITE_POD_NAME" : "x",
"USER_AGENT" : "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.31 (KHTML. like Gecko) Chrome/26.0.1410.43 Safari/537.31"
}
There are some other fields in the above document which I cannot expose, but they are simple string-to-string key-value pairs.
This is how I query the DB:
// Equality on BUSINESS_ID and LOGIN_TYPE, range on LOGIN_TIME
DBObject dbObject = new BasicDBObject("BUSINESS_ID", businessId)
        .append("LOGIN_TIME",
                new BasicDBObject("$gte", start).append("$lt", end))
        .append("LOGIN_TYPE", loginType);
long startTime = System.currentTimeMillis();
// Project only the LOGIN_TIME field of each matching document
DBObject keys = new BasicDBObject("LOGIN_TIME", 1);
DBCursor find = collection.find(dbObject, keys);
int count = 0;
while (find.hasNext()) {
    find.next();
    count++;
}
long endTime = System.currentTimeMillis();
The MongoDB version is 2.4.9. I appreciate any help.
I see the following spots that could help in finding out more about the exact issue:
What is LOGIN_TIME, and what do the numbers in the query range actually mean? The range looks quite wide by numeric difference. Maybe your filter criteria are very wide-ranged? The "nscanned" value from the explain plan is also indicative of this.
I see the index is on LOGIN_TIME and LOGIN_ID, whereas your query is on LOGIN_TIME and LOGIN_TYPE (plus BUSINESS_ID). Just highlighting that although you are using an index, your query criteria are wide enough to cover a much larger index range, and since the other criteria are not part of the index, the query has to fetch all "nscanned" documents to determine whether each one is a valid record for this query.
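One way to test this theory is an index that matches the actual query shape: the equality-matched fields first and the range field last, so the b-tree scan stays narrow. A minimal sketch, assuming a hypothetical collection name logins (the real collection name is not shown) and using the 2.4-era ensureIndex helper:

// Hypothetical index matching the query's shape: equality fields
// (BUSINESS_ID, LOGIN_TYPE) before the range field (LOGIN_TIME).
db.logins.ensureIndex({ BUSINESS_ID: 1, LOGIN_TYPE: 1, LOGIN_TIME: 1 })

// The same query should then show tight index bounds on all three fields;
// businessId, loginType, start and end are placeholders from the Java code.
db.logins.find({
    BUSINESS_ID: businessId,
    LOGIN_TYPE: loginType,
    LOGIN_TIME: { $gte: start, $lt: end }
}, { LOGIN_TIME: 1 }).explain()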
In the collection below, I am trying to calculate total / sold_total / sold_percent by aggregating the copies sub-document.
Similarly, I want to calculate grand_total / sold_grand_total / sold_grand_percent by aggregating the inventory sub-document.
I would prefer to do this during writes/updates, or with a MongoDB function/job, rather than during reads, for efficiency.
I have tried a couple of aggregation pipelines, but unwinding the copies sub-array clears everything above it. Any help appreciated, thanks.
{
"_id" : "xyz",
"store" : "StoreB",
"grand_total" : 7,
"sold_grand_total" : 5,
"sold_grand_percent" : 72,
"inventory" : [
{"title" : "BookA", "total" : 4, "sold_total" : 3, "sold_percent" : 75,
"copies" : [
{"_id": 1, "condition": "new", "sold": 1 },
{"_id": 2,"condition": "new", "sold": 1 },
{"_id": 3,"condition": "new", "sold": 0 },
{"_id": 4,"condition": "new", "sold": 1 }
]
},
{"title" : "BookB", "total" : 1, "sold_total" : 1, "sold_percent" : 100,
"copies" : [
{"_id": 1, "condition": "new", "sold": 1 }
]
},
{"title" : "BookC", "total" : 2, "sold_total" : 1, "sold_percent" : 50,
"copies" : [
{"_id": 1, "condition": "new", "sold": 1 },
{"_id": 2,"condition": "new", "sold": 0 }
]
}
]
}
There are multiple ways of going about this. I am not sure what your architecture is.
These are the two different aggregates:
This one gives total and sold_total per title:
[
    {
        "$unwind" : "$inventory"
    },
    {
        "$unwind" : "$inventory.copies"
    },
    {
        "$group": {
            "_id": "$inventory.title",
            "total": {
                "$sum": 1
            },
            "sold_total": {
                "$sum": "$inventory.copies.sold"
            }
        }
    }
]
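Against the sample document above, that group stage yields counts matching the stored per-title fields (output order not guaranteed):

{ "_id" : "BookA", "total" : 4, "sold_total" : 3 }
{ "_id" : "BookB", "total" : 1, "sold_total" : 1 }
{ "_id" : "BookC", "total" : 2, "sold_total" : 1 }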
The other gives grand_total / sold_grand_total:
[
    {
        "$unwind" : "$inventory"
    },
    {
        "$unwind" : "$inventory.copies"
    },
    {
        "$group": {
            "_id": null,
            "grand_total": {
                "$sum": 1
            },
            "sold_grand_total": {
                "$sum": "$inventory.copies.sold"
            }
        }
    }
]
You can do both together by taking the entire object from one group operation and performing the other group on it; basically, project and pipeline it. And since you would rather maintain these fields at write time, a sketch of that approach follows below.
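Here is a minimal sketch of computing both levels in a single write, assuming MongoDB 4.2+ (pipeline-style updates), a hypothetical collection name "stores", and non-empty copies arrays; the field names come from the sample document:

db.stores.updateMany({}, [
    // Recompute per-title totals from each title's copies array.
    { $set: { inventory: { $map: {
        input: "$inventory",
        as: "inv",
        in: { $mergeObjects: [ "$$inv", {
            total: { $size: "$$inv.copies" },
            sold_total: { $sum: "$$inv.copies.sold" }
        } ] }
    } } } },
    // Derive each title's percentage from the fields just computed.
    { $set: { inventory: { $map: {
        input: "$inventory",
        as: "inv",
        in: { $mergeObjects: [ "$$inv", {
            sold_percent: { $round: [ { $multiply: [
                { $divide: [ "$$inv.sold_total", "$$inv.total" ] }, 100 ] } ] }
        } ] }
    } } } },
    // Roll the per-title numbers up to the document level.
    { $set: {
        grand_total: { $sum: "$inventory.total" },
        sold_grand_total: { $sum: "$inventory.sold_total" }
    } },
    { $set: { sold_grand_percent: { $round: [ { $multiply: [
        { $divide: [ "$sold_grand_total", "$grand_total" ] }, 100 ] } ] } } }
])

Because the stages run in order, the roll-up stages can read the per-title fields written by the earlier stages, so no read-side aggregation is needed.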
For what would be this query in SQL (to find duplicates):
SELECT userId, name FROM col GROUP BY userId, name HAVING COUNT(*)>1
I performed this simple query in MongoDB:
res = db.col.group({key:{userId:true,name:true},
reduce: function(obj,prev) {prev.count++;},
initial: {count:0}})
I've added a simple JavaScript loop to go over the result set and applied a filter to find all the fields with a count > 1, like so:
for (i in res) {if (res[i].count>1) printjson(res[i])};
Is there a better way to do this other than using javascript code in the client?
If this is the best/simplest way, say that it is, and this question will help someone :)
New answer using the Mongo aggregation framework
After this question was asked and answered, 10gen released MongoDB version 2.2 with an aggregation framework. The new best way to do this query is:
db.col.aggregate( [
{ $group: { _id: { userId: "$userId", name: "$name" },
count: { $sum: 1 } } },
{ $match: { count: { $gt: 1 } } },
{ $project: { _id: 0,
userId: "$_id.userId",
name: "$_id.name",
count: 1}}
] )
10gen has a handy SQL to Mongo Aggregation conversion chart worth bookmarking.
The answer already given is apt, to be honest, and the use of projection makes it even better due to implicit optimisation working under the hood. I have made a small change, and I explain its benefit below.
The original command:
db.getCollection('so').explain(1).aggregate( [
{ $group: { _id: { userId: "$userId", name: "$name" },
count: { $sum: 1 } } },
{ $match: { count: { $gt: 1 } } },
{ $project: { _id: 0,
userId: "$_id.userId",
name: "$_id.name",
count: 1}}
] )
Parts of the explain plan:
{
"stages" : [
{
"$cursor" : {
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "5fa42c8b8778717d277f67c4_test.so",
"indexFilterSet" : false,
"parsedQuery" : {},
"queryHash" : "F301762B",
"planCacheKey" : "F301762B",
"winningPlan" : {
"stage" : "PROJECTION_SIMPLE",
"transformBy" : {
"name" : 1,
"userId" : 1,
"_id" : 0
},
"inputStage" : {
"stage" : "COLLSCAN",
"direction" : "forward"
}
},
"rejectedPlans" : []
},
"executionStats" : {
"executionSuccess" : true,
"nReturned" : 6000,
"executionTimeMillis" : 8,
"totalKeysExamined" : 0,
"totalDocsExamined" : 6000,
The sample set is pretty small: just 6000 documents.
This query works on data in the WiredTiger internal cache, so if the collection is huge, all of it will have to be kept in the internal cache for the execution to take place. The WT cache is pretty important, and if this command takes up that much cache space, the cache size will have to be bigger to accommodate the other operations.
Now for a small hack: the addition of an index.
db.getCollection('so').createIndex({userId : 1, name : 1})
The new command:
db.getCollection('so').explain(1).aggregate( [
{$match : {name :{ "$ne" : null }, userId : { "$ne" : null } }},
{ $group: { _id: { userId: "$userId", name: "$name" },
count: { $sum: 1 } } },
{ $match: { count: { $gt: 1 } } },
{ $project: { _id: 0,
userId: "$_id.userId",
name: "$_id.name",
count: 1}}
] )
Explain plan:
{
"stages": [{
"$cursor": {
"queryPlanner": {
"plannerVersion": 1,
"namespace": "5fa42c8b8778717d277f67c4_test.so",
"indexFilterSet": false,
"parsedQuery": {
"$and": [{
"name": {
"$not": {
"$eq": null
}
}
},
{
"userId": {
"$not": {
"$eq": null
}
}
}
]
},
"queryHash": "4EF9C4D5",
"planCacheKey": "3898FC0A",
"winningPlan": {
"stage": "PROJECTION_COVERED",
"transformBy": {
"name": 1,
"userId": 1,
"_id": 0
},
"inputStage": {
"stage": "IXSCAN",
"keyPattern": {
"userId": 1.0,
"name": 1.0
},
"indexName": "userId_1_name_1",
"isMultiKey": false,
"multiKeyPaths": {
"userId": [],
"name": []
},
"isUnique": false,
"isSparse": false,
"isPartial": false,
"indexVersion": 2,
"direction": "forward",
"indexBounds": {
"userId": [
"[MinKey, undefined)",
"(null, MaxKey]"
],
"name": [
"[MinKey, undefined)",
"(null, MaxKey]"
]
}
}
},
"rejectedPlans": [{
"stage": "PROJECTION_SIMPLE",
"transformBy": {
"name": 1,
"userId": 1,
"_id": 0
},
"inputStage": {
"stage": "FETCH",
"filter": {
"userId": {
"$not": {
"$eq": null
}
}
},
"inputStage": {
"stage": "IXSCAN",
"keyPattern": {
"name": 1.0
},
"indexName": "name_1",
"isMultiKey": false,
"multiKeyPaths": {
"name": []
},
"isUnique": false,
"isSparse": false,
"isPartial": false,
"indexVersion": 2,
"direction": "forward",
"indexBounds": {
"name": [
"[MinKey, undefined)",
"(null, MaxKey]"
]
}
}
}
}]
},
"executionStats": {
"executionSuccess": true,
"nReturned": 6000,
"executionTimeMillis": 9,
"totalKeysExamined": 6000,
"totalDocsExamined": 0,
"executionStages": {
"stage": "PROJECTION_COVERED",
"nReturned": 6000,
Check the PROJECTION_COVERED part: this command is a covered query, which basically relies only on data in the index.
This command won't need to keep the documents in the WT internal cache because it never touches them at all. Check the docs examined: it is 0. Given that the data is in the index, the index alone is used for execution. This is a big positive for a system where the WT cache is already under pressure from other operations.
If by any chance the requirement is to search for specific names rather than the whole collection, then this becomes even more useful. :D
The disadvantage here is the addition of an index. If this index is utilised by other operations as well, then there is honestly no disadvantage; but if it is an extra addition, it will take more space for the index in the cache, and writes are marginally impacted by the additional index.
*On the performance front, for 6000 records the time shown is 1 ms more, but for a larger dataset this may vary. It must be noted that the sample documents I inserted have just three fields: the two used here plus the default _id. If this collection has a bigger document size, the execution time for the original command will increase, and the volume it occupies in the cache will also increase.
{ "_id" : 1, "quizzes" : [ 10, 6, 7 ], "labs" : [ 5, 8 ], "final" : 80, "midterm" : 75 ,"extraMarks":10}
{ "_id" : 2, "quizzes" : [ 9, 10 ], "labs" : [ 8, 8 ], "final" : 95, "midterm" : 80 }
{ "_id" : 3, "quizzes" : [ 4, 5, 5 ], "labs" : [ 6, 5 ], "final" : 78, "midterm" : 70 }
These are the documents in my collection.
Using the pipeline query as suggested in
$add with some fields as Null returning sum value as Null
I am able to project the sum of the fields using this query:
db.students.aggregate([
{
"$project": {
"final": 1,
"midterm": 1,
"examTotal": {
"$add": [
"$final",
"$midterm",
{
"$ifNull": [
"$extraMarks",
0
]
}
]
}
}
}
])
Now, how do we update the students collection with a new field called total, similar to examTotal in the above projection?
Starting from MongoDB 4.2, you can update with an aggregation pipeline:
db.students.update({},
[
{
$set: {
"total": {
"$add": [
"$final",
"$midterm",
{
"$ifNull": [
"$extraMarks",
0
]
}
]
}
}
}
])
Here is the Mongo playground for your reference.
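For the three sample documents above, the totals work out to 80 + 75 + 10, 95 + 80, and 78 + 70 respectively, so after the update the collection should look like:

{ "_id" : 1, "quizzes" : [ 10, 6, 7 ], "labs" : [ 5, 8 ], "final" : 80, "midterm" : 75, "extraMarks" : 10, "total" : 165 }
{ "_id" : 2, "quizzes" : [ 9, 10 ], "labs" : [ 8, 8 ], "final" : 95, "midterm" : 80, "total" : 175 }
{ "_id" : 3, "quizzes" : [ 4, 5, 5 ], "labs" : [ 6, 5 ], "final" : 78, "midterm" : 70, "total" : 148 }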
With a MongoDB collection test containing the following documents:
{ "_id" : 1, "color" : "blue", "items" : [ 1, 2, 0 ] }
{ "_id" : 2, "color" : "red", "items" : [ 0, 3, 4 ] }
if I sort them in reverse order based on the second element in the items array, using
db.test.find().sort({"items.1": -1})
they will be correctly sorted as:
{ "_id" : 2, "color" : "red", "items" : [ 0, 3, 4 ] }
{ "_id" : 1, "color" : "blue", "items" : [ 1, 2, 0 ] }
However, when I attempt to sort them using the aggregate function:
db.test.aggregate([{$sort: {"items.1": -1} }])
They will not sort correctly, even though the query is accepted as valid:
{
"result" : [
{
"_id" : 1,
"color" : "blue",
"items" : [
1,
2,
0
]
},
{
"_id" : 2,
"color" : "red",
"items" : [
0,
3,
4
]
}
],
"ok" : 1
}
Why is this?
The aggregation framework just does not "deal with" arrays in the same way as applies to .find() queries in general. This is not only true of operations like .sort(), but also of other operators, namely $slice, though that example is about to get a fix (more on this later).
So it is pretty much impossible to deal with anything using the "dot notation" form with an index of an array position, as you have here. But there is a way around this.
What you "can" do is basically work out what the "nth" array element actually is as a value, and then return that as a field that can be sorted:
db.test.aggregate([
{ "$unwind": "$items" },
{ "$group": {
"_id": "$_id",
"items": { "$push": "$items" },
"itemsCopy": { "$push": "$items" },
"first": { "$first": "$items" }
}},
{ "$unwind": "$itemsCopy" },
{ "$project": {
"items": 1,
"itemsCopy": 1,
"first": 1,
"seen": { "$eq": [ "$itemsCopy", "$first" ] }
}},
{ "$match": { "seen": false } },
{ "$group": {
"_id": "$_id",
"items": { "$first": "$items" },
"itemsCopy": { "$push": "$itemsCopy" },
"first": { "$first": "$first" },
"second": { "$first": "$itemsCopy" }
}},
{ "$sort": { "second": -1 } }
])
It's a horrible and "iterable" approach where you essentially "step through" each array element by getting the $first match per document from the array after processing with $unwind. Then, after another $unwind, you test to see whether the array elements are the same as the one(s) already "seen" from the identified array positions.
It's terrible, and gets worse the more positions you want to move along, but it does get the result:
{ "_id" : 2, "items" : [ 0, 3, 4 ], "itemsCopy" : [ 3, 4 ], "first" : 0, "second" : 3 }
{ "_id" : 1, "items" : [ 1, 2, 0 ], "itemsCopy" : [ 2, 0 ], "first" : 1, "second" : 2 }
{ "_id" : 3, "items" : [ 2, 1, 5 ], "itemsCopy" : [ 1, 5 ], "first" : 2, "second" : 1 }
Fortunately, upcoming releases of MongoDB (as currently available in development releases) get a "fix" for this. It may not be the "perfect" fix that you desire, but it does solve the basic problem.
There is a new $slice operator available for the aggregation framework there, and it will return the required element(s) of the array from the indexed positions:
db.test.aggregate([
{ "$project": {
"items": 1,
"slice": { "$slice": [ "$items",1,1 ] }
}},
{ "$sort": { "slice": -1 } }
])
Which produces:
{ "_id" : 2, "items" : [ 0, 3, 4 ], "slice" : [ 3 ] }
{ "_id" : 1, "items" : [ 1, 2, 0 ], "slice" : [ 2 ] }
{ "_id" : 3, "items" : [ 2, 1, 5 ], "slice" : [ 1 ] }
So you can note that as a "slice", the result is still an "array"; however, the $sort in the aggregation framework has always used the "first position" of the array to sort the contents. That means that with a singular value extracted from the indexed position (just as in the long procedure above), the result will be sorted as you expect.
The end case here is that this is just how it works. Either live with the sort of operations above to work with an indexed position of the array, or "wait" until a brand new shiny version comes to your rescue with better operators.
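For reference once those releases land: the same release also adds $arrayElemAt, which returns the element itself as a scalar rather than a one-element array, so the sort key needs no unwrapping at all. A sketch against the same test collection:

db.test.aggregate([
    { "$project": {
        "items": 1,
        // $arrayElemAt returns the element at index 1 as a scalar value
        "second": { "$arrayElemAt": [ "$items", 1 ] }
    }},
    { "$sort": { "second": -1 } }
])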