Objective
Find out possible differences in the following MongoDB queries and understand why one of them works and the other doesn't.
Background
A while ago I posted a question asking for help with a MongoDB query:
Using $push with $group with pymongo
In that question my query didn't work, and I was looking for a way to fix it. I got a lot of help in the comments and eventually found the solution, but no one seems able to explain to me why my first, incorrect query doesn't work while the second one does.
Code
1st (incorrect) query:
pipeline = [
{"$group": {"_id": "$user.screen_name", "tweet_texts": {"$push": "$text"}, "count": {"$sum": 1}}},
{"$project": {"_id": "$user.screen_name", "count": 1, "tweet_texts": 1}},
{"$sort" : {"count" : -1}},
{"$limit": 5}
]
2nd query:
pipeline = [
{"$group": {"_id": "$user.screen_name", "tweet_texts": {"$push": "$text"}, "count": {"$sum": 1}}},
{"$sort" : {"count" : -1}},
{"$limit": 5}
]
Now, the mindful eye will see that the difference between the two queries is the project stage {"$project": {"_id": "$user.screen_name", "count": 1, "tweet_texts": 1}}.
At the time I thought this stage was necessary, but since I am already selecting the fields I need in the $group stage, I don't really need it. In fact, this additional and unnecessary stage was causing the tests to fail.
Question
If the $project stage in the first example is useless and does the same thing as the $group stage, why was my code failing? Shouldn't it make no difference at all (since the change is idempotent?)
In the first query, after the $group stage, the user's screen name is stored under the _id key, not under user.screen_name. The documents leaving $group no longer have a user.screen_name field, so "$user.screen_name" resolves to nothing and the $project stage has no value to assign to _id.
If you modify your projection, using
{"$project": {"_id": "$_id", "count": 1, "tweet_texts": 1}},
or
{"$project": {"_id": 1, "count": 1, "tweet_texts": 1}},
or
{"$project": {"count": 1, "tweet_texts": 1}},
the first pipeline will behave like the second one.
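To make the lookup failure concrete, here is a minimal sketch in plain Python (no MongoDB needed). The resolve_path helper, the MISSING marker, and the sample document are hypothetical; they only imitate how a field path such as "$user.screen_name" is looked up in a document emitted by $group.

```python
# Hypothetical stand-in for a MongoDB field-path lookup; not a pymongo API.
MISSING = object()

def resolve_path(doc, path):
    """Follow a dotted path through nested dicts; return MISSING if any key is absent."""
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return MISSING
        current = current[key]
    return current

# Shape of one document after the $group stage: only _id, tweet_texts and count exist.
grouped = {"_id": "alice", "tweet_texts": ["hi", "bye"], "count": 2}

print(resolve_path(grouped, "user.screen_name") is MISSING)  # the old path no longer exists
print(resolve_path(grouped, "_id"))                          # the name now lives under _id
```

The same lookup against an original tweet document, which still has a nested user.screen_name, would succeed; it is only after $group that the path stops existing.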
Related
On production server I use mongodb 4.4
I have a query that works well
db.step_tournaments_results.aggregate([
{ "$match": { "tournament_id": "6377f2f96174982ef89c48d2" } },
{ "$sort": { "total_points": -1, "time_spent": 1 } },
{
$group: {
_id: "$club_name",
'total_points': { $sum: "$total_points"},
'time_spent': { $sum: "$time_spent"}
},
},
])
The problem is in the $group stage: it sums the points of every document in each group into total_points, but I only need the best 5 results of each group. How can I achieve that?
Query
Like your query, match and sort first.
In the group, instead of summing, gather all members of the group into one array.
(I collected $$ROOT, but if the documents have many fields you can collect only the 2 fields you need inside a {}.)
Take the first 5 of them.
Compute the 2 sums you need from those first 5.
Remove the temporary fields.
*With MongoDB 6 you can do this in the group itself, without collecting the members in an array. In MongoDB 5 you can also do it with window fields, without a group. For MongoDB 4.4, I think this is a way to do it.
aggregate(
[{"$match": {"tournament_id": {"$eq": "6377f2f96174982ef89c48d2"}}},
{"$sort": {"total_points": -1, "time_spent": 1}},
{"$group": {"_id": "$club_name", "group-members": {"$push": "$$ROOT"}}},
{"$set":
{"first-five": {"$slice": ["$group-members", 5]},
"group-members": "$$REMOVE"}},
{"$set":
{"total_points": {"$sum": "$first-five.total_points"},
"time_spent": {"$sum": "$first-five.time_spent"},
"first-five": "$$REMOVE"}}])
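To check the pipeline's output against a known answer, the same steps can be mirrored in plain Python. This is only a sketch under assumptions: best_five_totals is a hypothetical helper and the sample rows are made up. It is not a replacement for running the aggregation on the server.

```python
from collections import defaultdict

def best_five_totals(rows, limit=5):
    """Sum total_points and time_spent over each club's best `limit` rows."""
    # Same ordering as the $sort stage: most points first, then least time.
    ordered = sorted(rows, key=lambda r: (-r["total_points"], r["time_spent"]))
    groups = defaultdict(list)
    for row in ordered:
        groups[row["club_name"]].append(row)  # like {"$push": "$$ROOT"}
    totals = {}
    for club, members in groups.items():
        best = members[:limit]                # like {"$slice": ["$group-members", 5]}
        totals[club] = {
            "total_points": sum(m["total_points"] for m in best),
            "time_spent": sum(m["time_spent"] for m in best),
        }
    return totals

# Made-up sample data: club "A" has six results, so its weakest one is dropped.
rows = [{"club_name": "A", "total_points": p, "time_spent": 10} for p in (1, 2, 3, 4, 5, 6)]
rows.append({"club_name": "B", "total_points": 7, "time_spent": 3})
print(best_five_totals(rows))
```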
I have a MongoDB collection that I have managed to process using an aggregation pipeline to produce the following result:
[
{
_id: 'Complex Numbers',
count: 2
},
{ _id: 'Calculus',
count: 1
}
]
But the result that I am aiming for is something like the following:
{
'Complex Numbers': 2,
'Calculus': 1
}
Is there a way to achieve that?
Query
To convert to an object we need something like [[k1, v1], ...] or [{"k": "...", "v": "..."}, ...].
The first stage converts each document to [{"k": "...", "v": "..."}], applies $arrayToObject, and replaces the root with the result,
so each document becomes something like {"Complex Numbers": 2}.
The $group stage combines all those documents into one document,
and the last stage replaces the root with that one document.
Test code here
aggregate(
[{"$replaceRoot":
{"newRoot": {"$arrayToObject": [[{"k": "$_id", "v": "$count"}]]}}},
{"$group": {"_id": null, "data": {"$mergeObjects": "$$ROOT"}}},
{"$replaceRoot": {"newRoot": "$data"}}])
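If the reshaping does not have to happen inside the pipeline, the same object can be built in application code. A minimal sketch, assuming the aggregation output has already been read into a Python list; counts_by_id is a hypothetical helper name.

```python
def counts_by_id(results):
    """Turn [{'_id': key, 'count': n}, ...] into {key: n, ...}."""
    return {doc["_id"]: doc["count"] for doc in results}

# The aggregation output shown in the question.
results = [
    {"_id": "Complex Numbers", "count": 2},
    {"_id": "Calculus", "count": 1},
]
print(counts_by_id(results))
```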
I'm trying to get a small table from a MongoDB aggregation, for example the number of fatal accidents by year.
I want to get all the years, even if the count is zero.
MongoDB query:
[
{"$match": {"city": "myCity"}},
{"$group": {
"_id": "$accident_year",
"count": {"$sum": 1}}
},
{"$sort": {"_id": 1}}
]
actual result:
[
{"_id": "2015", "count": 2},
{"_id": "2017", "count": 4},
{"_id": "2018", "count": 6},
{"_id": "2019", "count": 2}
]
desired result:
[
{"_id": "2015", "count": 2},
{"_id": "2016", "count": 0},
{"_id": "2017", "count": 4},
{"_id": "2018", "count": 6},
{"_id": "2019", "count": 2}
]
Thank you
I don't think it's possible to get an output of 0 for years that are not listed in the collection the way you have the aggregation pipeline created now. I don't see a way for the $group stage to know what values don't exist.
Since you are sorting the results, your application code that receives the output could check if years are listed and manipulate the aggregation pipeline results to include missing years. This is probably the best option.
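A minimal sketch of that application-side approach, assuming the years come back as numeric strings (as in the question) and that only gaps between the first and last year should be filled; fill_missing_years is a hypothetical helper name.

```python
def fill_missing_years(results):
    """Add a zero-count entry for every year missing between the first and last year."""
    if not results:
        return []
    counts = {int(doc["_id"]): doc["count"] for doc in results}
    first, last = min(counts), max(counts)
    # Years are stored as strings in the collection, so convert back when emitting.
    return [
        {"_id": str(year), "count": counts.get(year, 0)}
        for year in range(first, last + 1)
    ]

# The aggregation output shown in the question.
results = [
    {"_id": "2015", "count": 2},
    {"_id": "2017", "count": 4},
    {"_id": "2018", "count": 6},
    {"_id": "2019", "count": 2},
]
filled = fill_missing_years(results)
print(filled)
```

Years before the first or after the last recorded accident are not added here; to cover a fixed reporting window, pass the range in explicitly instead of deriving it from the data.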
Here is a hack to get around this if you really want the results in your aggregation pipeline:
Create a document for every year in your collection. Inside of those documents add a field called something like isAccident and set it to 0. For example:
{
"_id":{"$oid":"5e662fca1c9d440000c1aa71"},
"accident_year":"2018",
"isAccident":0
}
Then you can update your pipeline to have a $project stage before the $group stage that adds the isAccident field to all documents that don't have it and assigns them a value of 1. In your $group stage you can then sum the $isAccident field.
[{$project: {
_id: 1,
accident_year: 1,
isAccident: { $ifNull: ["$isAccident", 1] }
}}, {$group: {
_id: "$accident_year",
count: {
$sum: "$isAccident"
}
}}]
This will give you the results you're expecting. Beware that if others later group and sum the accidents in this collection without realizing you've created these extra documents for the years, their calculations will be off by one.
I have more than 100k records in my collection, and a new record is added every 5 seconds. I have an aggregate query that gets approximately 720 records from the last year of data.
The aggregate query:
db.collectionName.aggregate([
{"$match": {
"Id": "****-id-****",
"receivedDate": {
"$gte": ISODate("2016-06-26T18:30:00.463Z"),
"$lt": ISODate("2017-06-26T18:30:00.463Z")
}
}
},
{"$group": {
"_id": {
"$add": [
{"$subtract": [
{"$subtract": ["$receivedDate", ISODate("1970-01-01T00:00:00.000Z")]},
{"$mod": [
{"$subtract": ["$receivedDate", ISODate("1970-01-01T00:00:00.000Z")]},
43200000
]}
]},
ISODate("1970-01-01T00:00:00.000Z")
]
},
"_rid": {"$first": "$_id"},
"_data": {"$first": "$receivedData.data"},
"count": {"$sum": 1}
}
},
{"$sort": {"_id": -1}},
{"$project": {
"_id": "$_rid",
"receivedDate": "$_id",
"receivedData": {"data": "$_data"}
}
}
])
I am not sure why it's taking more than 15 seconds. When I try to get data for 1 month, it works fine.
It's too late to answer this question, but this may be helpful for others.
A compound index might help in this situation. Compound indexes can support queries that match on multiple fields.
You can create a compound index on the Id and receivedDate fields:
db.collectionName.createIndex({ Id: -1, receivedDate: -1 });
The order of the fields listed in a compound index is important. The index will contain references to documents sorted first by the values of the Id field and, within each value of the Id field, sorted by values of the receivedDate field.
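As a rough illustration of that ordering (plain Python, not how MongoDB actually stores an index), sorting made-up (Id, receivedDate) pairs as tuples places all entries for one Id next to each other, ordered by date. That is why an equality match on Id combined with a range on receivedDate can be answered from one contiguous run of entries. The device names and dates below are invented.

```python
from datetime import datetime

# Made-up (Id, receivedDate) pairs standing in for index entries.
entries = [
    ("dev-2", datetime(2017, 1, 5)),
    ("dev-1", datetime(2016, 12, 1)),
    ("dev-2", datetime(2016, 7, 1)),
    ("dev-1", datetime(2017, 3, 1)),
]

# Both fields descending, mirroring { Id: -1, receivedDate: -1 }.
index_order = sorted(entries, reverse=True)

# Positions of one Id's entries: they sit in a single contiguous block.
positions = [i for i, (dev, _) in enumerate(index_order) if dev == "dev-1"]
print(index_order)
print(positions)
```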
I am trying to pull a list of all programs that air between 6pm and 11pm from my schedules collection. The problem is that in the match query, I am not sure how to extract the hour value from the StartUTC which is a DateTime value so that I can do the 23>x>18 comparison on all the times. Any ideas?
#"start": {"$gt": {"$hour": "$StartUtc"} }
print db.schedules.aggregate([
{"$match": { "$StartUtc" : { "$gt" : 18, "$lte" : 23 } } },
{"$group": {"_id": "$OriginalProgramId", "count": {"$sum": 1}}},
{"$sort": SON([("count", -1), ("_id", -1)])},
])
What you need is the date aggregation operators and more specifically the $hour operator which returns the hour portion of a date. You can get the hour in a project phase of the pipeline, so it's available in the subsequent phases for match, etc...
from bson.son import SON

db.schedules.aggregate([
# Compute the start hour; project the other fields you need too so they are available in the next phases
{"$project": {"starthour": {"$hour": "$StartUtc"}}},
# Match on the computed field by name (no "$" prefix on the key)
{"$match": {"starthour": {"$gt": 18, "$lte": 23}}},
{"$group": {"_id": "$OriginalProgramId", "count": {"$sum": 1}}},
{"$sort": SON([("count", -1), ("_id", -1)])},
])
Note: For simplicity's sake, I've only included the starthour field in the project phase. You'll have to include other fields as well for the subsequent phases of the aggregation pipeline to work.
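A sketch of what carrying the extra field through might look like, written as a small Python function that only builds the pipeline. programs_by_start_hour is a hypothetical helper, and treating "6pm to 11pm" as start hours 18 through 22 inclusive (in UTC, which is what $hour uses by default) is an assumption about the intended range.

```python
def programs_by_start_hour(first_hour, last_hour):
    """Build a pipeline counting programs whose StartUtc hour lies in [first_hour, last_hour]."""
    return [
        # Keep OriginalProgramId so the $group stage below still has it.
        {"$project": {"OriginalProgramId": 1, "starthour": {"$hour": "$StartUtc"}}},
        {"$match": {"starthour": {"$gte": first_hour, "$lte": last_hour}}},
        {"$group": {"_id": "$OriginalProgramId", "count": {"$sum": 1}}},
        # A plain dict keeps insertion order on Python 3.7+, so count sorts before _id.
        {"$sort": {"count": -1, "_id": -1}},
    ]

pipeline = programs_by_start_hour(18, 22)
print(pipeline)
```

The resulting list can be passed to db.schedules.aggregate(pipeline) in place of the inline pipeline.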