I have a MongoDB collection of about 168,200,000 documents. I am trying to get the average of a certain field with $group, and I am using $match before the $group in the pipeline so that the index on client.city is used. But the query takes about 5 minutes to run, which is very slow.
Here are the things I tried:
db.ar12.aggregate(
{$match:{'client.city':'New York'}},
{'$group':{'_id':'$client.city', 'avg':{'$avg':'$length'}}}
)
db.ar12.aggregate(
{$match:{'client.city':'New York'}},
{'$group':{'_id':null, 'avg':{'$avg':'$length'}}}
)
db.ar12.aggregate(
{$match:{'client.city':'New York'}},
{$project: {'length':1}},
{'$group':{'_id':null, 'avg':{'$avg':'$length'}}}
)
All 3 queries take about the same time. The number of documents with client.city equal to New York is 1,231,672, and find({'client.city':'New York'}).count() takes a second to run.
> db.version()
3.2.0
EDIT
Here's the explain result. As for the comment about adding a compound index with length: would that help, even though I am not searching by length? I want all lengths.
{
"waitedMS" : NumberLong(0),
"stages" : [
{
"$cursor" : {
"query" : {
"client.city" : "New York"
},
"fields" : {
"length" : 1,
"_id" : 1
},
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "clients.ar12",
"indexFilterSet" : false,
"parsedQuery" : {
"client.city" : {
"$eq" : "New York"
}
},
"winningPlan" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"client.city" : 1
},
"indexName" : "client.city_1",
"isMultiKey" : false,
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 1,
"direction" : "forward",
"indexBounds" : {
"client.city" : [
"[\"New York\", \"New York\"]"
]
}
}
},
"rejectedPlans" : [ ]
}
}
},
{
"$project" : {
"length" : true
}
},
{
"$group" : {
"_id" : {
"$const" : null
},
"total" : {
"$avg" : "$length"
}
}
}
],
"ok" : 1
}
EDIT 2
I have added a compound index on client.city and length, but to no avail; the query is still too slow. I tried these 2 queries:
db.ar12.aggregate(
{$match: {'client.city':'New York'}},
{$project: {'client.city':1, 'length':1}},
{'$group':{'_id':'$client.city', 'avg':{'$avg':'$length'}}}
)
The above query wasn't using the compound index, so I tried this to force it, and still nothing changed:
db.ar12.aggregate(
{$match: { $and : [{'client.city':'New York'}, {'length':{'$gt':0}}]}},
{$project: {'client.city':1, 'length':1}},
{'$group':{'_id':'$client.city', 'avg':{'$avg':'$length'}}}
)
Below is the explain of the last query:
{
"waitedMS" : NumberLong(0),
"stages" : [
{
"$cursor" : {
"query" : {
"$and" : [
{
"client.city" : "New York"
},
{
"length" : {
"$gt" : 0
}
}
]
},
"fields" : {
"client.city" : 1,
"length" : 1,
"_id" : 1
},
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "clients.ar12",
"indexFilterSet" : false,
"parsedQuery" : {
"$and" : [
{
"client.city" : {
"$eq" : "New York"
}
},
{
"length" : {
"$gt" : 0
}
}
]
},
"winningPlan" : {
"stage" : "CACHED_PLAN",
"inputStage" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"client.city" : 1,
"length" : 1
},
"indexName" : "client.city_1_length_1",
"isMultiKey" : false,
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 1,
"direction" : "forward",
"indexBounds" : {
"client.city" : [
"[\"New York\", \"New York\"]"
],
"length" : [
"(0.0, inf.0]"
]
}
}
}
},
"rejectedPlans" : [ ]
}
}
},
{
"$project" : {
"client" : {
"city" : true
},
"length" : true
}
},
{
"$group" : {
"_id" : "$client.city",
"avg" : {
"$avg" : "$length"
}
}
}
],
"ok" : 1
}
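One thing that can be read off both explain outputs is that the $cursor stage asks for _id in its "fields" document, while neither index contains _id, so a FETCH stage after the IXSCAN is expected. A minimal sketch of that comparison follows; `fieldsNotInIndex` is a hypothetical helper written for this post, not a MongoDB API, and it only compares field names. Excluding _id in the $project stage is something to try, but whether the server then picks an index-only plan depends on the server version and on the embedded field client.city, which is not verified here.

```javascript
// Sketch: list the fields a $cursor stage requests that the chosen index lacks.
// fieldsNotInIndex is a hypothetical helper, not part of MongoDB.
function fieldsNotInIndex(requestedFields, keyPattern) {
  return Object.keys(requestedFields)
    .filter(function (name) { return requestedFields[name] === 1; })   // requested fields only
    .filter(function (name) { return !(name in keyPattern); });        // absent from the index
}

// Values copied from the second explain output above.
var requested = { "client.city": 1, "length": 1, "_id": 1 };
var keyPattern = { "client.city": 1, "length": 1 };

var missing = fieldsNotInIndex(requested, keyPattern);
// missing is ["_id"]: the index alone cannot supply every requested field.
```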
I have found a workaround. length goes from 1 to 70, so in Python I iterated from 1 to 70 and found the count of each length for each city,
db.ar12.find({'client.city':'New York', 'length':i}).count()
which is very fast, then calculated the average in Python. It takes about 2 seconds to run.
This is not the best solution, since I have other queries to run, and I don't know if I can find a workaround for all of them.
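The averaging step of the workaround above is a weighted mean over the per-length counts. A minimal sketch, written in shell-style JavaScript rather than the Python actually used; `averageFromCounts` and the sample counts are made up for illustration. In the real workaround each count would come from db.ar12.find({'client.city':'New York', 'length':i}).count() for i from 1 to 70.

```javascript
// Sketch: weighted average of `length` from per-length document counts.
// averageFromCounts is a hypothetical helper; countsByLength maps length -> count.
function averageFromCounts(countsByLength) {
  var total = 0;
  var n = 0;
  Object.keys(countsByLength).forEach(function (len) {
    var count = countsByLength[len];
    total += Number(len) * count;   // each length contributes length * count
    n += count;
  });
  return n === 0 ? null : total / n; // null when there are no documents
}

// Made-up counts: two docs of length 1, one of length 2, one of length 5.
var exampleCounts = { 1: 2, 2: 1, 5: 1 };
var exampleAvg = averageFromCounts(exampleCounts); // (2 + 2 + 5) / 4 = 2.25
```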
I have the following MongoDB aggregation query:
db.data.aggregate([{ "$match" : { "$text" : { "$search" : "STORAGE TYPE" } } },
{ "$group" :
{ "_id" :{"doc_type": "$doc_type" ,"title" : "$title", "player_name" : "$player_name", "player_type" : "INSTITUTION", "country_code" :"$country_code" },
"number_records" : { "$sum" : 1}
}
},
{"$match" : {"doc_type": "PATENT"} },
{"$sort":{"number_records" : -1}},
{"$limit" : 10}],
{"allowDiskUse" : true}
)
When I try to execute the above code, it keeps running for a long time and I get no output. Can anyone help me?
When I run it with explain(), it shows the following:
{
"stages" : [
{
"$cursor" : {
"query" : {
"$and" : [
{
"$text" : {
"$search" : "STORAGE TYPE"
}
},
{
"doc_type" : "PATENT"
}
]
},
"fields" : {
"country_code" : 1,
"doc_type" : 1,
"player_name" : 1,
"title" : 1,
"_id" : 0
},
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "datadocuments.data",
"indexFilterSet" : false,
"parsedQuery" : {
"$and" : [
{
"doc_type" : {
"$eq" : "PATENT"
}
},
{
"$text" : {
"$search" : "STORAGE TYPE",
"$language" : "english",
"$caseSensitive" : false,
"$diacriticSensitive" : false
}
}
]
},
"winningPlan" : {
"stage" : "FETCH",
"filter" : {
"doc_type" : {
"$eq" : "PATENT"
}
},
"inputStage" : {
"stage" : "TEXT",
"indexPrefix" : {
},
"indexName" : "title",
"parsedTextQuery" : {
"terms" : [
"storag",
"type"
],
"negatedTerms" : [ ],
"phrases" : [ ],
"negatedPhrases" : [ ]
},
"textIndexVersion" : 3,
"inputStage" : {
"stage" : "TEXT_MATCH",
"inputStage" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "OR",
"inputStages" : [
{
"stage" : "IXSCAN",
"keyPattern" : {
"_fts" : "text",
"_ftsx" : 1
},
"indexName" : "title",
"isMultiKey" : true,
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "backward",
"indexBounds" : {
}
},
{
"stage" : "IXSCAN",
"keyPattern" : {
"_fts" : "text",
"_ftsx" : 1
},
"indexName" : "title",
"isMultiKey" : true,
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "backward",
"indexBounds" : {
}
}
]
}
}
}
}
},
"rejectedPlans" : [ ]
}
}
},
{
"$group" : {
"_id" : {
"doc_type" : "$doc_type",
"title" : "$title",
"player_name" : "$player_name",
"player_type" : {
"$const" : "INSTITUTION"
},
"country_code" : "$country_code"
},
"number_records" : {
"$sum" : {
"$const" : 1
}
}
}
},
{
"$sort" : {
"sortKey" : {
"number_records" : -1
},
"limit" : NumberLong("10")
}
}
],
"ok" : 1
}
I couldn't figure out the mistake. Is there a problem in the aggregation? If not, how can I improve the performance?
Your error comes from your second $match stage: at that point doc_type no longer exists as a top-level field; after $group it is _id.doc_type instead. You are better off merging this stage into the first $match, which improves performance by reducing the number of documents passed to the $group stage.
Your improved query will be:
db.data.aggregate([
{"$match" : { "$text" : { "$search" : "STORAGE TYPE" },"doc_type": "PATENT" } },
{ "$group" :
{ "_id" :{"doc_type": "$doc_type" ,"title" : "$title", "player_name" : "$player_name", "player_type" : "INSTITUTION", "country_code" :"$country_code" },
"number_records" : { "$sum" : 1}
}
},
{"$sort":{"number_records" : -1}},
{"$limit" : 10}],
{"allowDiskUse" : true}
)
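To see why a filter on doc_type finds nothing once $group has run, here is a small in-memory sketch. `groupCount` is a hypothetical stand-in for the $group stage and the three documents are invented; it only mimics the shape of the output, where the grouping fields end up under _id.

```javascript
// Sketch: in-memory stand-in for {$group: {_id: {...}, number_records: {$sum: 1}}}.
// groupCount is a hypothetical helper, not a MongoDB API.
function groupCount(docs, keyFields) {
  var groups = {};
  docs.forEach(function (doc) {
    var id = {};
    keyFields.forEach(function (f) { id[f] = doc[f]; });
    var k = JSON.stringify(id);                       // one bucket per distinct key
    if (!groups[k]) groups[k] = { _id: id, number_records: 0 };
    groups[k].number_records += 1;
  });
  return Object.keys(groups).map(function (k) { return groups[k]; });
}

// Invented sample documents.
var sampleDocs = [
  { doc_type: "PATENT", title: "A" },
  { doc_type: "PATENT", title: "A" },
  { doc_type: "PAPER", title: "B" }
];
var grouped = groupCount(sampleDocs, ["doc_type", "title"]);

// Filtering on the top-level field matches nothing; the value now lives under _id.
var byTopLevel = grouped.filter(function (g) { return g.doc_type === "PATENT"; });
var byNested = grouped.filter(function (g) { return g._id.doc_type === "PATENT"; });
// byTopLevel.length is 0, byNested.length is 1
```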
I have a cluster with 3 config dbs and 3 shards. I am querying against a db with 106M records, each with 410 fields. I imported this data using a shard key of:
{state: 1, zipCode: 1}.
When I run the following queries individually each one completes in less than 5sec. (SC = 1.6M records, NC = 5.2M records)
db.records.find( { "state" : "NC" } ).count()
db.records.find( { "state" : "SC" } ).count()
db.records.find( { "state" : { $in : ["NC"] } } ).count()
db.records.find( { "state" : { $in : ["SC"] } } ).count()
However, when I query both states using an $in or an $or, the query takes over an hour to complete.
db.records.find( { "state" : { $in : [ "NC" , "SC" ] } } ).count()
db.records.find( { $or : [ { "state" : "NC" }, { "state" : "SC" } ] } ).count()
The entirety of both states exists on 1 shard. Below are the results of .explain() using the $in query:
db.records.find({"state":{$in:["NC","SC"]}}).explain()
{
"queryPlanner" : {
"mongosPlannerVersion" : 1,
"winningPlan" : {
"stage" : "SINGLE_SHARD",
"shards" : [
{
"shardName" : "s2",
"connectionString" : "s2/192.168.2.17:27000,192.168.2.17:27001",
"serverInfo" : {
"host" : "MonDbShard2",
"port" : 27000,
"version" : "3.4.7",
"gitVersion" : "cf38c1b8a0a8dca4a11737581beafef4fe120bcd"
},
"plannerVersion" : 1,
"namespace" : "DBNAME.records",
"indexFilterSet" : false,
"parsedQuery" : {
"state" : {
"$in" : [
"NC",
"SC"
]
}
},
"winningPlan" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "SHARDING_FILTER",
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"state" : 1,
"zipCode" : 1
},
"indexName" : "state_1_zipCode_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"state" : [ ],
"zipCode" : [ ]
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"state" : [
"[\"NC\", \"NC\"]",
"[\"SC\", \"SC\"]"
],
"zipCode" : [
"[MinKey, MaxKey]"
]
}
}
}
},
"rejectedPlans" : [ ]
}
]
}
},
"ok" : 1
}
Why would querying two states at one time cause such a drastic difference in completion time? Also, querying a single zipCode without including the corresponding state results in the same drastic difference in completion time. I feel like I am misunderstanding how the shard key actually operates. Any thoughts?
I am trying to optimise my query and have found that using $in on a non-indexed column appears to be faster than using it on an indexed column.
For example:
I have added an index on myCollection: {"entryVals.col1" : 1}.
To confirm:
db.myCollection.getIndexes()
returns:
[
{
"v" : 2,
"key" : {
"_id" : 1
},
"name" : "_id_",
"ns" : "myDb.myCollection"
},
{
"v" : 2,
"key" : {
"entryVals.col1" : 1
},
"name" : "entryVals.col1_1",
"ns" : "myDb.myCollection"
} ]
I then run a count with a query (printing the time taken) on both the indexed and non-indexed columns.
Count on indexed column
var a = new Date().getTime();
db.myCollection.count({"entryVals.col1": {$in:["a","b","c","d"]}});
new Date().getTime() - a;
returns
96 (time in ms)
Count on non-indexed column
var a = new Date().getTime();
db.myCollection.count({"entryVals.col2": {$in:["a","b","c","d"]}});
new Date().getTime() - a;
returns
60 (time in ms)
Please bear in mind that I ran the queries several times and took an average (there were few to no anomalies).
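The "run several times and average" measurement described above can be sketched as a small helper. `averageMillis` is a hypothetical name, and the workload below is a plain loop so the snippet runs without a database; in the shell one would pass something like function () { return db.myCollection.count({...}); } instead.

```javascript
// Sketch: average wall-clock time of a function over several runs.
// averageMillis is a hypothetical helper, not a MongoDB API.
function averageMillis(fn, runs) {
  var total = 0;
  for (var i = 0; i < runs; i++) {
    var start = Date.now();
    fn();                          // the operation being timed
    total += Date.now() - start;
  }
  return total / runs;
}

// Stand-in workload so the sketch runs without a server.
var avgMs = averageMillis(function () {
  var sum = 0;
  for (var j = 0; j < 100000; j++) sum += j;
  return sum;
}, 5);
```

Millisecond timers are coarse for operations in the 60 to 100 ms range, so more runs give a steadier average.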
Is anyone able to enlighten me as to why the query on the indexed column is slower?
Thanks in advance.
Explains
Count on indexed column
db.myCollection.explain().count({"entryVals.col1": {$in:["a","b","c","d"]}})
{
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "myDb.myCollection",
"indexFilterSet" : false,
"parsedQuery" : {
"entryVals.col1" : {
"$in" : [
"a",
"b",
"c",
"d"
]
}
},
"winningPlan" : {
"stage" : "COUNT",
"inputStage" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"entryVals.col1" : 1
},
"indexName" : "entryVals.col1_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"entryVals.col1" : [ ]
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"entryVals.col1" : [
"[\"a\", \"a\"]",
"[\"b\", \"b\"]",
"[\"c\", \"c\"]",
"[\"d\", \"d\"]"
]
}
}
}
},
"rejectedPlans" : [ ]
},
"serverInfo" : {
"host" : "obfuscated",
"port" : obfuscated,
"version" : "3.4.6-1.7",
"gitVersion" : "obfuscated"
},
"ok" : 1
}
Count on non-indexed column
db.myCollection.explain().count({"entryVals.col2": {$in:["a","b","c","d"]}})
{
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "myDb.myCollection",
"indexFilterSet" : false,
"parsedQuery" : {
"entryVals.col2" : {
"$in" : [
"a",
"b",
"c",
"d"
]
}
},
"winningPlan" : {
"stage" : "COUNT",
"inputStage" : {
"stage" : "COLLSCAN",
"filter" : {
"entryVals.col2" : {
"$in" : [
"a",
"b",
"c",
"d"
]
}
},
"direction" : "forward"
}
},
"rejectedPlans" : [ ]
},
"serverInfo" : {
"host" : "obfuscated",
"port" : obfuscated,
"version" : "3.4.6-1.7",
"gitVersion" : "obfuscated"
},
"ok" : 1
}
I am new to Mongo and am trying to get a distinct count of users. The fields Id and Status are not individually indexed, but there is a composite index on both fields. My current query looks like this, where the match conditions change depending on the requirements.
DBQuery.shellBatchSize = 1000000;
db.getCollection('username').aggregate([
{$match:
{ Status: "A"
} },
{"$group" : {_id:"$Id", count:{$sum:1}}}
]);
Is there any way to optimize this query further, or to run it in parallel on the collection, so that we get results faster?
Regards
You can tune your aggregation pipelines by passing the option explain: true to the aggregate method.
db.getCollection('username').aggregate([
{$match: { Status: "A" } },
{"$group" : {_id:"$Id", count:{$sum:1}}}],
{ explain: true });
This will then output the following to work with
{
"stages" : [
{
"$cursor" : {
"query" : {
"Status" : "A"
},
"fields" : {
"Id" : 1,
"_id" : 0
},
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "test.usernames",
"indexFilterSet" : false,
"parsedQuery" : {
"Status" : {
"$eq" : "A"
}
},
"winningPlan" : {
"stage" : "EOF"
},
"rejectedPlans" : [ ]
}
}
},
{
"$group" : {
"_id" : "$Id",
"count" : {
"$sum" : {
"$const" : 1
}
}
}
}
],
"ok" : 1
}
So to speed up our query we need an index to help the $match part of the pipeline, so let's create an index on Status:
> db.usernames.createIndex({Status:1})
{
"createdCollectionAutomatically" : true,
"numIndexesBefore" : 1,
"numIndexesAfter" : 2,
"ok" : 1
}
If we now run the explain again we'll get the following results
{
"stages" : [
{
"$cursor" : {
"query" : {
"Status" : "A"
},
"fields" : {
"Id" : 1,
"_id" : 0
},
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "test.usernames",
"indexFilterSet" : false,
"parsedQuery" : {
"Status" : {
"$eq" : "A"
}
},
"winningPlan" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"Status" : 1
},
"indexName" : "Status_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"Status" : [ ]
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"Status" : [
"[\"A\", \"A\"]"
]
}
}
},
"rejectedPlans" : [ ]
}
}
},
{
"$group" : {
"_id" : "$Id",
"count" : {
"$sum" : {
"$const" : 1
}
}
}
}
],
"ok" : 1
}
We can now see straight away that this is using an index.
https://docs.mongodb.com/manual/reference/explain-results/
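When comparing explain outputs like the two above, it can help to flatten the winning plan into a list of stage names. A minimal sketch, assuming the plan has the inputStage / inputStages nesting seen in these outputs; `planStages` is a hypothetical helper, not a MongoDB API, and the two plan objects are trimmed copies of the ones shown.

```javascript
// Sketch: collect stage names from a winningPlan, outermost first.
// planStages is a hypothetical helper, not part of MongoDB.
function planStages(plan) {
  var stages = [];
  function walk(node) {
    if (!node) return;
    stages.push(node.stage);
    if (node.inputStage) walk(node.inputStage);                       // single child
    (node.inputStages || []).forEach(function (n) { walk(n); });     // several children (e.g. OR)
  }
  walk(plan);
  return stages;
}

// Trimmed copies of the winning plans before and after creating the index.
var planBefore = { stage: "EOF" };
var planAfter = { stage: "FETCH", inputStage: { stage: "IXSCAN", indexName: "Status_1" } };

var stagesBefore = planStages(planBefore); // ["EOF"]
var stagesAfter = planStages(planAfter);   // ["FETCH", "IXSCAN"]
```

An IXSCAN in the list means an index is being read; a COLLSCAN would mean the whole collection is scanned.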
I have an aggregate on a collection with about 1.6M records. This query is a simplified example of other, more complex ones, but in my opinion it illustrates poor index selection.
db.getCollection('cbAlters').runCommand("aggregate", {pipeline: [
{
$match: { cre_carteraId: "31" }
},
{
$group: { _id: { ca_tramomora: "$cre_tramoMora" },
count: { $sum: 1 } }
}
]})
That query took about 5 seconds. The collection has 25 indexes configured for different queries. The one used, according to the query explain, is:
{
"v" : 1,
"key" : {
"cre_carteraId" : 1,
"cre_periodo" : 1,
"cre_tramoMora" : 1,
"cre_inactivo" : 1
},
"name" : "cartPerTramInact",
"ns" : "basedatos.cbAlters"
},
I created an index tailored to this particular query:
{
"v" : 1,
"key" : {
"cre_carteraId" : 1,
"cre_tramoMora" : 1
},
"name" : "cartPerTramTest",
"ns" : "basedatos.cbAlters"
}
The query optimizer rejects this index and uses the initial one instead. The output of my query explain looks like this:
{
"waitedMS" : NumberLong(0),
"stages" : [
{
"$cursor" : {
"query" : {
"cre_carteraId" : "31"
},
"fields" : {
"cre_tramoMora" : 1,
"_id" : 0
},
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "basedatos.cbAlters",
"indexFilterSet" : false,
"parsedQuery" : {
"cre_carteraId" : {
"$eq" : "31"
}
},
"winningPlan" : {
"stage" : "PROJECTION",
"transformBy" : {
"cre_tramoMora" : 1,
"_id" : 0
},
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"cre_carteraId" : 1,
"cre_periodo" : 1,
"cre_tramoMora" : 1,
"cre_inactivo" : 1
},
"indexName" : "cartPerTramInact",
"isMultiKey" : false,
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 1,
"direction" : "forward",
"indexBounds" : {
"cre_carteraId" : [
"[\"31\", \"31\"]"
],
"cre_periodo" : [
"[MinKey, MaxKey]"
],
"cre_tramoMora" : [
"[MinKey, MaxKey]"
],
"cre_inactivo" : [
"[MinKey, MaxKey]"
]
}
}
},
"rejectedPlans" : [
{
"stage" : "PROJECTION",
"transformBy" : {
"cre_tramoMora" : 1,
"_id" : 0
},
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"cre_carteraId" : 1,
"cre_tramoMora" : 1
},
"indexName" : "cartPerTramTest",
"isMultiKey" : false,
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 1,
"direction" : "forward",
"indexBounds" : {
"cre_carteraId" : [
"[\"31\", \"31\"]"
],
"cre_tramoMora" : [
"[MinKey, MaxKey]"
]
}
}
}
]
}
}
},
{
"$group" : {
"_id" : {
"ca_tramomora" : "$cre_tramoMora"
},
"count" : {
"$sum" : {
"$const" : 1.0
}
}
}
}
],
"ok" : 1.0
}
So why does the optimizer prefer a less tailored index? Should indexFilterSet (results filtered by index) be true for this aggregate?
How can I improve this index, or is something wrong with the query?
I do not have much experience with MongoDB; I appreciate any help.
As long as you have the index cartPerTramInact, the optimizer won't use your cartPerTramTest index, because the leading fields are the same and in the same order.
The same goes for other indexes. When there are indexes that have the same keys in the same order (like a.b.c.d, a.b.d, a.b) and your query uses fields a.b, it will favour a.b.c.d. In any case you don't need the index a.b, because you already have two indexes that cover a.b (a.b.c.d and a.b.d).
Index a.b.d is used only when your query uses the fields a.b.d, BUT if a.b is already very selective, it's probably faster to select with index a.b.c.d using only the a.b part and do a "full table scan" to find that d.
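The "same leading fields" point can be checked mechanically: both indexes from the question start with the field the query filters on, so both are candidates, which matches the explain output listing one as the winning plan and the other under rejectedPlans. A minimal sketch; `indexesWithPrefix` is a hypothetical helper, and the third index (byPeriodo) is invented for contrast. Which candidate the optimizer finally picks is decided by the server and is not modelled here.

```javascript
// Sketch: find indexes whose leading keys are exactly the queried equality fields.
// indexesWithPrefix is a hypothetical helper, not a MongoDB API.
function indexesWithPrefix(indexes, equalityFields) {
  return indexes
    .filter(function (idx) {
      var leading = Object.keys(idx.key).slice(0, equalityFields.length);
      return equalityFields.every(function (f) { return leading.indexOf(f) !== -1; });
    })
    .map(function (idx) { return idx.name; });
}

var indexDefs = [
  { name: "cartPerTramInact", key: { cre_carteraId: 1, cre_periodo: 1, cre_tramoMora: 1, cre_inactivo: 1 } },
  { name: "cartPerTramTest", key: { cre_carteraId: 1, cre_tramoMora: 1 } },
  { name: "byPeriodo", key: { cre_periodo: 1, cre_carteraId: 1 } } // invented for contrast
];

var candidates = indexesWithPrefix(indexDefs, ["cre_carteraId"]);
// candidates is ["cartPerTramInact", "cartPerTramTest"]
```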
There is a hint option for aggregations that can help with index selection.
See https://www.mongodb.com/docs/upcoming/reference/method/db.collection.aggregate/#mongodb-method-db.collection.aggregate