I am in the process of moving from mysql to mongodb. Started learning mongodb yesterday.
I have a big mysql table (over 4 million rows, with over 300 fields each) which I am moving to mongodb.
Let's assume the products table has the following fields:
_id, category, and 300+ other fields.
To find the top 5 categories in products along with their counts, I have the following mysql query:
Select category, count(_id) as N from products group by category order by N DESC limit 5;
I have an index on category field and this query takes around 4.4 sec in mysql.
Now, I have successfully moved this table to mongodb and this is my corresponding query for finding top 5 categories with their counts.
db.products.aggregate([{$group : {_id:"$category", N:{$sum:1}}},{$sort:{N: -1}},{$limit:5}]);
I again have an index on category but the query doesn't seem to be using it (explain : true says so) and it is also taking around 13.5 sec for this query.
Having read more about mongodb aggregation pipeline optimization, I found out that we need to use sort prior to aggregation for index to work but I am sorting on the derived field from aggregation so can't bring it before the aggregate function.
How do I optimize queries like these in mongodb?
=========================================================================
Output of explain
db.products.aggregate([{$group : {_id:"$category",N:{$sum:1}}},{$sort:{N: -1}},{$limit:5}], { explain: true });
{
"waitedMS" : NumberLong(0),
"stages" : [
{
"$cursor" : {
"query" : {
},
"fields" : {
"category" : 1,
"_id" : 0
},
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "mydb.products",
"indexFilterSet" : false,
"parsedQuery" : {
"$and" : [ ]
},
"winningPlan" : {
"stage" : "COLLSCAN",
"filter" : {
"$and" : [ ]
},
"direction" : "forward"
},
"rejectedPlans" : [ ]
}
}
},
{
"$group" : {
"_id" : "$category",
"N" : {
"$sum" : {
"$const" : 1
}
}
}
},
{
"$sort" : {
"sortKey" : {
"N" : -1
},
"limit" : NumberLong(5)
}
}
],
"ok" : 1
}
There are currently some limitations in what the aggregation framework can do to improve performance in your use case. However, you should be able to speed up the query by sorting on category first. This should make the query use the index you have added and speed up the $group stage that follows in your pipeline:
db.products.aggregate([
{ "$sort" : { "category" : 1 } },
{ $group : { _id : "$category", N : { $sum : 1 } } },
{ $sort : { N : -1 } },
{ $limit : 5 }
]);
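The same pipeline can be built from a small helper so that the field name and the limit are not hard-coded. This is only a sketch: `topValuesPipeline` is a hypothetical name, not part of MongoDB, and whether the leading $sort actually uses the index should still be confirmed with explain.

```javascript
// Hypothetical helper (not from the original answer): builds the pipeline
// for "top n values of a field by document count".
function topValuesPipeline(field, n) {
  return [
    // Sort on the indexed field first, as the answer suggests.
    { $sort: { [field]: 1 } },
    // Count the documents for each distinct value of the field.
    { $group: { _id: "$" + field, N: { $sum: 1 } } },
    // Order the groups by count, largest first, and keep the top n.
    { $sort: { N: -1 } },
    { $limit: n }
  ];
}

const topCategoryPipeline = topValuesPipeline("category", 5);
console.log(JSON.stringify(topCategoryPipeline, null, 2));
```

In the mongo shell the result would be passed straight to db.products.aggregate(topCategoryPipeline).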
I'm using MongoDB 4.4.3 to query a random record from a collection :
db.MyCollection.aggregate([{ $sample: { size: 1 } }])
This query takes 20s (when a find query takes 0.2s)
Mongo doc states :
If all the following conditions are met, $sample uses a pseudo-random cursor to select documents:
$sample is the first stage of the pipeline
N is less than 5% of the total documents in the collection
The collection contains more than 100 documents
Here
$sample is the only stage of the pipeline
N = 1
MyCollection contains 46 million documents
This problem is similar to MongoDB Aggregation with $sample very slow, which does not provide an answer for MongoDB 4.4.3
So why is this query so slow?
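The three documented conditions quoted above can be written as a quick check. This is a plain JavaScript sketch of the rule as stated in the quoted documentation, not MongoDB's own code; the function name and its parameters are made up for illustration.

```javascript
// Hypothetical helper: returns true when the three quoted conditions for
// the pseudo-random cursor hold.
//   stageIndex - position of $sample in the pipeline (0 = first stage)
//   sampleSize - the N passed as { $sample: { size: N } }
//   totalDocs  - number of documents in the collection
function meetsRandomCursorConditions(stageIndex, sampleSize, totalDocs) {
  const isFirstStage = stageIndex === 0;
  const underFivePercent = sampleSize < 0.05 * totalDocs;
  const overHundredDocs = totalDocs > 100;
  return isFirstStage && underFivePercent && overHundredDocs;
}

// The numbers from this question: first stage, N = 1, 46 million documents.
console.log(meetsRandomCursorConditions(0, 1, 46000000)); // true
```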
Details
Query Planner
db.MyCollection.aggregate([{$sample: {size: 1}}]).explain()
{
"stages" : [
{
"$cursor" : {
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "DATABASE.MyCollection",
"indexFilterSet" : false,
"winningPlan" : {
"stage" : "MULTI_ITERATOR"
},
"rejectedPlans" : [ ]
}
}
},
{
"$sampleFromRandomCursor" : {
"size" : 1
}
}
],
"serverInfo" : {
"host" : "mongodb4-3",
"port" : 27017,
"version" : "4.4.3",
"gitVersion" : "913d6b62acfbb344dde1b116f4161360acd8fd13"
},
"ok" : 1,
"$clusterTime" : {
"clusterTime" : Timestamp(1611128334, 1),
"signature" : {
"hash" : BinData(0,"ZDxiOTnmG/zLKNtDIAWhxmjHHLM="),
"keyId" : 6915708270745223171
}
},
"operationTime" : Timestamp(1611128334, 1)
}
Execution stats
I have searched here but could not find a clear answer to the following question. In the sample collection mycollection below, how would one select distinct vin numbers only from documents where the status field exists and equals UNLOCKED?
I have tried
db.getCollection('mycollection').distinct("vin", {$and: [{"decoded_payload.status": {$exists: true}}, {"decoded_payload.status":"UNLOCKED"}]})
but this query hangs indefinitely.
Due to the large size of the database and the long delay of such a query, I would like to limit the output to check whether it runs at all, but it seems limit() is not an option with .distinct().
In MongoDB, how would one select the distinct vin in the data below, set limit = 1, and select only on the status condition (status exists and is equal to "UNLOCKED")?
Would aggregate() be the right choice? How does one use the above conditions with aggregate() and limit()?
The output in this case would be 34567.
{
"_id" : ObjectId("1"),
"vin" : "12345",
"class_name" : "foo",
"decoded_payload" : {
"timestamp" : 1547329250,
"status" : "LOCKED"
}
}
{
"_id" : ObjectId("2"),
"vin" : "23456",
"class_name" : "foo",
"decoded_payload" : {
"timestamp" : 1547329260,
"status" : "LOCKED"
}
}
{
"_id" : ObjectId("3"),
"vin" : "34567",
"class_name" : "bar",
"decoded_payload" : {
"timestamp" : 1547329270,
"status" : "UNLOCKED",
"reservation_id" : "71"
}
}
{
"_id" : ObjectId("4"),
"vin" : "45678",
"class_name" : "baz",
"decoded_payload" : {
"timestamp" : 1547329280,
"reservation_id" : "71"
}
}
You can use this aggregation query to filter the data and return the distinct "vin" values:
db.mycollection.aggregate([
{
$match: {
$and: [{
"decoded_payload.status": { $exists: true }
}, {
"decoded_payload.status": "UNLOCKED"
}]
}
},
{ $limit : 5 }, // You can use this stage after group too
{
$group: { _id: "$vin" }
}
])
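Place the $limit stage before or after the $group stage as required: before $group it caps the number of matched documents that get grouped, after $group it caps the number of distinct vin values returned.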
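For the "limit = 1 distinct vin" case from the question, a variant with $limit after $group might look like the sketch below. An equality match on "UNLOCKED" only matches documents where the field exists, so the separate $exists test is dropped here. The in-memory function is a hypothetical stand-in that mimics the pipeline on the sample documents so the result can be checked without a server.

```javascript
// Variant of the pipeline above: match, group by vin, then limit the groups.
const distinctVinPipeline = [
  { $match: { "decoded_payload.status": "UNLOCKED" } },
  { $group: { _id: "$vin" } },
  { $limit: 1 }
];

// The four sample documents from the question (only the relevant fields).
const sampleVehicles = [
  { vin: "12345", decoded_payload: { timestamp: 1547329250, status: "LOCKED" } },
  { vin: "23456", decoded_payload: { timestamp: 1547329260, status: "LOCKED" } },
  { vin: "34567", decoded_payload: { timestamp: 1547329270, status: "UNLOCKED", reservation_id: "71" } },
  { vin: "45678", decoded_payload: { timestamp: 1547329280, reservation_id: "71" } }
];

// Hypothetical in-memory equivalent of the pipeline, for illustration only.
function distinctUnlockedVins(docs, limit) {
  const vins = new Set();
  for (const doc of docs) {
    const payload = doc.decoded_payload;
    // A missing status is never equal to "UNLOCKED", so no separate
    // existence check is needed.
    if (payload && payload.status === "UNLOCKED") {
      vins.add(doc.vin);
    }
  }
  return Array.from(vins).slice(0, limit);
}

console.log(distinctUnlockedVins(sampleVehicles, 1)); // [ '34567' ]
```

In the mongo shell the pipeline itself would be run as db.mycollection.aggregate(distinctVinPipeline).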
I have a query like this
db.UserPosts.aggregate([
{ "$match" : { "Posts.DateTime" : { "$gte" : ISODate("2018-09-04T11:50:58Z"), "$lte" : ISODate("2018-09-05T11:50:58Z") } } },
{ "$match" : { "UserId" : { "$in" : [NUUID("aaaaaaaa-cccc-dddd-dddd-5369b183cccc"), NUUID("vvvvvvvv-bbbb-ffff-cccc-e0af0c8acccc")] } } },
{ "$project" : { "_id" : 0, "UserId" : 1, "Posts" : 1 } },
{ "$unwind" : "$Posts" },
{ "$unwind" : "$Posts.Comments" },
{ "$sort" : {"Posts.DateTime" : -1} },
{ "$skip" : 0 }, { "$limit" : 20 },
{ "$project" : { "_id" : 0, "UserId" : 1, "DateTime" : "$Posts.DateTime", "Title" : "$Posts.Title", "Type" : "$Posts.Comments.Type", "Comment" : "$Posts.Comments.Description" } },
],{allowDiskUse:true})
I have a compound index
{
"Posts.DateTime" : -1,
"UserId" : 1
}
Posts and Comments are arrays of objects.
I've tried different types of indexes, but the problem is that the $sort stage does not use my index. I moved the $sort stage to different positions without success. The index seems to be used for $match but not for $sort. I also tried two single-field indexes on those fields, and a combination of two single-field indexes and one compound index, but none of them works.
I also read the related documents on the MongoDB website:
Compound Indexes
Use Indexes to Sort Query Results
Index Intersection
Aggregation Pipeline Optimization
Could somebody please help me to find the solution?
I solved this problem by changing my data model and moving DateTime to a higher level of the data.
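The answer does not say what the new model looked like. One possible reading, sketched below purely as an assumption, is one document per post with DateTime as a top-level field, so that an index on DateTime no longer points into an array. The function name, the output shape and the example values are hypothetical; only the field names UserId, Posts, DateTime, Title and Comments come from the query above.

```javascript
// Hypothetical reshaping: turn one user document with an embedded Posts
// array into one document per post, with DateTime at the top level.
function flattenUserPosts(userDoc) {
  return (userDoc.Posts || []).map(post => ({
    UserId: userDoc.UserId,
    DateTime: post.DateTime,
    Title: post.Title,
    Comments: post.Comments || []
  }));
}

// Made-up example document in the original (nested) shape.
const nestedUser = {
  UserId: "user-1",
  Posts: [
    {
      DateTime: new Date("2018-09-04T12:00:00Z"),
      Title: "First post",
      Comments: [{ Type: "text", Description: "hello" }]
    },
    {
      DateTime: new Date("2018-09-05T08:30:00Z"),
      Title: "Second post",
      Comments: []
    }
  ]
};

const flatPosts = flattenUserPosts(nestedUser);
console.log(flatPosts.length); // 2
console.log(flatPosts[0].DateTime.toISOString()); // 2018-09-04T12:00:00.000Z
```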
Hi everyone, I have a large data set that contains documents like these:
{ "_id" : "01011", "city" : "CHESTER", "loc" : [ -72.988761, 42.279421 ], "pop" : 1688, "state" : "MA" }
{ "_id" : "01012", "city" : "CHESTERFIELD", "loc" : [ -72.833309, 42.38167 ], "pop" : 177, "state" : "MA" }
{ "_id" : "01013", "city" : "CHICOPEE", "loc" : [ -72.607962, 42.162046 ], "pop" : 23396, "state" : "MA" }
{ "_id" : "01020", "city" : "CHICOPEE", "loc" : [ -72.576142, 42.176443 ], "pop" : 31495, "state" : "MA" }
I want to find the number of cities in this database using a MongoDB command. But the database may have more than one record with the same city, as in the example above.
I tried:
>db.zipcodes.distinct("city").count();
2015-04-25T15:57:45.446-0400 E QUERY warning: log line attempted (159k) over max size (10k), printing beginning and end ... TypeError: Object AGAWAM,BELCHERTOWN ***data*** has no method 'count'
but it didn't work for me. I also tried something like this:
>db.zipcodes.find({city:.*}).count();
2015-04-25T16:00:01.043-0400 E QUERY SyntaxError: Unexpected token .
But that didn't work either, and even if it did work it would count the duplicate cities. Any ideas?
Instead of doing
db.zipcodes.distinct("city").count();
do this:
db.zipcodes.distinct("city").length;
There is also the aggregate function, which may help you.
I have also found an example on aggregate related to your query.
If you want to add a condition, you could refer to $gte (aggregation) and/or $lte (aggregation).
See if that helps.
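The TypeError in the question arises because distinct() returns a plain JavaScript array rather than a cursor, and arrays have a length property but no count() method. The snippet below shows this with a hand-written array standing in for what distinct("city") would return on the four sample documents.

```javascript
// Stand-in for the array returned by db.zipcodes.distinct("city")
// on the four sample documents (CHICOPEE appears twice in the data).
const distinctCities = ["CHESTER", "CHESTERFIELD", "CHICOPEE"];

console.log(Array.isArray(distinctCities)); // true
console.log(typeof distinctCities.count);   // undefined (so .count() throws a TypeError)
console.log(distinctCities.length);         // 3
```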
You can also use the aggregation framework for this. The aggregation pipeline has two $group stages: the first groups the documents by city, and the second counts the distinct groups produced by the first:
db.collection.aggregate([
{
"$group": {
"_id": "$city"
}
},
{
"$group": {
"_id": 0,
"count": { "$sum": 1 }
}
}
]);
Output:
/* 1 */
{
"result" : [
{
"_id" : 0,
"count" : 3
}
],
"ok" : 1
}
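To see what each $group stage hands to the next, the two stages can be imitated on an in-memory array. This is a plain JavaScript illustration with hypothetical function names, not how the server executes the pipeline; the documents are the four samples from the question.

```javascript
const zipDocs = [
  { _id: "01011", city: "CHESTER", state: "MA" },
  { _id: "01012", city: "CHESTERFIELD", state: "MA" },
  { _id: "01013", city: "CHICOPEE", state: "MA" },
  { _id: "01020", city: "CHICOPEE", state: "MA" }
];

// Imitates { $group: { _id: "$city" } }: one output document per distinct value.
function groupByField(docs, field) {
  const seen = new Set();
  const groups = [];
  for (const doc of docs) {
    if (!seen.has(doc[field])) {
      seen.add(doc[field]);
      groups.push({ _id: doc[field] });
    }
  }
  return groups;
}

// Imitates { $group: { _id: 0, count: { $sum: 1 } } }: every input document
// falls into the single group with _id 0. No input means no output document.
function countAll(docs) {
  return docs.length === 0 ? [] : [{ _id: 0, count: docs.length }];
}

const cityGroups = groupByField(zipDocs, "city");
console.log(cityGroups); // one { _id: <city> } document per distinct city
console.log(countAll(cityGroups)); // [ { _id: 0, count: 3 } ]
```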
I just read the linked question "Mongodb Explain for Aggregation framework", but it does not solve my problem.
I want to retrieve information about an aggregation, the way db.coll.find({bla:foo}).explain() does for a query.
I tried
db.coll.aggregate([
my-op
],
{
explain: true
})
The result is not explain output, but the result of the query on the database.
I have tried also
db.runCommand({
aggregate: "mycoll",
pipeline: [
my-op
],
explain: true
})
I got some information back with this command, but it does not include millis, nscannedObjects, etc.
I use MongoDB 2.6.2.
Aggregations don't run like traditional queries and you can't run the explain on them. They are actually classified as commands and though they make use of indexes you can't readily find out how they are being executed in real-time.
Your best bet is to take the $match portion of your aggregation and run it as a query with explain to figure out how the indexes are performing and get an idea on nscanned.
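A small helper can pull the leading $match out of a pipeline so it can be run as an ordinary query. This is a sketch with a hypothetical function name; the field bla and value "foo" are borrowed from the question, and the $group stage is a made-up placeholder.

```javascript
// Hypothetical helper: returns the filter of the first stage if that stage
// is a $match, otherwise null.
function leadingMatchFilter(pipelineStages) {
  const first = pipelineStages[0];
  return first && first.$match ? first.$match : null;
}

const examplePipeline = [
  { $match: { bla: "foo" } },
  { $group: { _id: "$bla", n: { $sum: 1 } } }
];

const matchFilter = leadingMatchFilter(examplePipeline);
console.log(JSON.stringify(matchFilter)); // {"bla":"foo"}
```

In the shell, the extracted filter would then be explained as a query with db.coll.find(matchFilter).explain().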
I am not sure why you could not get the explain information. In 2.6.x this information is available, and you can explain your aggregation:
db.orders.aggregate([
// put your whole aggregation pipeline here
], {
explain: true
})
which gives me something like:
{
"stages" : [
{
"$cursor" : {
"query" : {
"a" : 1
},
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "test.a",
"indexFilterSet" : false,
"parsedQuery" : {
"a" : {
"$eq" : 1
}
},
"winningPlan" : {
"stage" : "COLLSCAN",
"filter" : {
"a" : {
"$eq" : 1
}
},
"direction" : "forward"
},
"rejectedPlans" : [ ]
}
}
}
],
"ok" : 1
}
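When the explain document is long, the part that usually matters is the winning plan's stage name (COLLSCAN above). The helper below is a hypothetical sketch that walks an explain document of the shape shown and collects those stage names. It assumes nested plans, where present, hang off an inputStage field, and it is tested here only against a trimmed copy of the output above.

```javascript
// Hypothetical helper: collects the winning plan stage names from an
// aggregation explain document shaped like the output above.
function winningPlanStages(explainOutput) {
  const names = [];
  for (const stage of explainOutput.stages || []) {
    const cursor = stage.$cursor;
    let plan = cursor && cursor.queryPlanner && cursor.queryPlanner.winningPlan;
    // Assumption: child plans are nested under inputStage.
    while (plan) {
      names.push(plan.stage);
      plan = plan.inputStage;
    }
  }
  return names;
}

// Trimmed copy of the explain output shown above.
const explainSample = {
  stages: [
    {
      $cursor: {
        query: { a: 1 },
        queryPlanner: {
          plannerVersion: 1,
          namespace: "test.a",
          winningPlan: { stage: "COLLSCAN", direction: "forward" },
          rejectedPlans: []
        }
      }
    }
  ],
  ok: 1
};

console.log(winningPlanStages(explainSample)); // [ 'COLLSCAN' ]
```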