MongoDB aggregate multilevels

I'm trying to count documents containing
{ date, direction, procedure } e.g.
{ 'Dec 12', 'West', 'Up' }
and I want as output: for each date, for each direction, a count of each procedure type
Dec 12
North Up 2 Down 3
South Up 4 Down 17
etc
It's fairly easy using JavaScript, but I'd like to do it in MongoDB if possible. I can't get an aggregate $group to go more than one level deep, and I'm not sure whether mapReduce would help; I don't properly understand either one.
I would appreciate a little guidance. Thanks
Some detail:
It's a schema-less collection but the interesting bits look like this:
{ "_id" : ObjectId(), "direction" : String, "procedure" : String, "date" : String, .... , "format" : "procedure" }
direction: "North" | "East" | "South" | "West"
procedure: "Arrive" | "Depart"
date: "Mmm dd"
.... lots of other stuff
The output is not critical - it could be:
[ { date: "Mmm dd",
direction: { procedure: count, procedure: count },
direction: { procedure: count, ... },
....
}
{ ... }
...
]
e.g:
[ { date: "Dec 12",
"West": { "Arrive": 5, "Depart": 5 },
"East": { "Arrive": 1, "Depart": 7 },
...
},
{ date: ...},
...
]
The more I play with it, the more I think it's a bit of a stretch. That could be good advice :-)

Here is a solution using the aggregation pipeline:
[{
'$group': {
'_id': {
'date': '$date',
'direction': '$direction',
'procedure': '$procedure'
},
'count': {'$sum': 1}
}
},
{
'$group': {
'_id': '$_id.date',
'directions': {
'$push': {
'direction': '$_id.direction',
'procedure': '$_id.procedure',
'count': '$count'
}
}
}
}]
Giving the following result:
{
_id: "Dec 12",
directions: [
{ "direction": "North", "procedure": "Arrive", "count": 5},
{ "direction": "North", "procedure": "Depar", "count": 3},
{ "direction": "South", "procedure": "Arrive", "count": 1},
...
]
},
...
Explanation
Basically, what you are asking for is a count for each (date, direction, procedure) tuple. You just want it reorganized a little: grouped by date, with every (direction, procedure) pair that occurs on that date and its corresponding count.
So that is exactly what we do:
The first $group stage in the pipeline groups by each unique (date, direction, procedure) combination, putting it in the _id field and counting occurrences; at this stage the output is:
[{
_id: {
date: "Dec 12",
direction: "North",
procedure: "Depar"
},
count: 4
},
...
]
The second $group stage then re-groups those results by date, pushing the remaining fields (which are embedded in the _id document produced by the previous stage) into an array in a new directions field, as (direction, procedure, count) tuples sharing the same date.
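If you want something closer to the nested shape sketched in the question, the same idea can be extended with further $group stages plus $arrayToObject. This is only a sketch, assuming MongoDB 3.4.4 or later where $arrayToObject is available, and db.collection stands in for your collection; the nesting ends up under a directions key rather than at the top level:
db.collection.aggregate([
  // count each (date, direction, procedure) combination, as above
  { $group: {
      _id: { date: '$date', direction: '$direction', procedure: '$procedure' },
      count: { $sum: 1 }
  } },
  // collect the procedure counts per (date, direction) as {k, v} pairs
  { $group: {
      _id: { date: '$_id.date', direction: '$_id.direction' },
      procedures: { $push: { k: '$_id.procedure', v: '$count' } }
  } },
  // collect the directions per date, turning each procedure list into an object
  { $group: {
      _id: '$_id.date',
      directions: { $push: { k: '$_id.direction', v: { $arrayToObject: '$procedures' } } }
  } },
  // final shape: { date: 'Dec 12', directions: { West: { Arrive: 5, Depart: 5 }, ... } }
  { $project: { _id: 0, date: '$_id', directions: { $arrayToObject: '$directions' } } }
])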

Related

How to retrieve each single array element from mongo pipeline?

Let's assume this is what a sample document looks like in MongoDB:
[
{
"_id": "1",
"attrib_1": "value_1",
"attrib_2": "value_2",
"months": {
"2": {
"month": "2",
"year": "2008",
"transactions": [
{
"field_1": "val_1",
"field_2": "val_2",
},
{
"field_1": "val_4",
"field_2": "val_5",
"field_3": "val_6"
},
]
},
"3": {
"month": "3",
"year": "2018",
"transactions": [
{
"field_1": "val_7",
"field_3": "val_9"
},
{
"field_1": "val_10",
"field_2": "val_11",
},
]
},
}
}
]
The desired output is something like this (I am just showing it for months 2 and 3):
id | months | year | field_1 | field_2 | field_3
---|--------|------|---------|---------|--------
1  | 2      | 2008 | val_1   | val_2   |
1  | 2      | 2008 | val_4   | val_5   | val_6
1  | 3      | 2018 | val_7   |         | val_9
1  | 3      | 2018 | val_10  | val_11  |
My attempt:
I tried something like this in PyMongo:
pipeline = [
{
# some filter logic here to filter data basically first
},
{
"$addFields": {
"latest": {
"$map": {
"input": {
"$objectToArray": "$months",
},
"as": "obj",
"in": {
"all_field_1" : {"$ifNull" : ["$$obj.v.transactions.field_1", [""]]},
"all_field_2": {"$ifNull" : ["$$obj.v.transactions.field_2", [""]]},
"all_field_3": {"$ifNull" : ["$$obj.v.transactions.field_3", [""]]},
"all_months" : {"$ifNull" : ["$$obj.v.month", ""]},
"all_years" : {"$ifNull" : ["$$obj.v.year", ""]},
}
}
}
}
},
{
"$project": {
"_id": 1,
"months": "$latest.all_months",
"year": "$latest.all_years",
"field_1": "$latest.all_field_1",
"field_2": "$latest.all_field_2",
"field_3": "$latest.all_field_3",
}
}
]
# and I executed it as
my_db.collection.aggregate(pipeline, allowDiskUse=True)
The above does actually bring back the data, but it brings everything back in lists. Is there a way to easily get one row per transaction in Mongo itself?
The above brings the data back in this shape:
id | months     | year             | field_1                                   | field_2                              | field_3
---|------------|------------------|-------------------------------------------|--------------------------------------|-------------------------------
1  | ["2", "3"] | ["2008", "2018"] | [["val_1", "val_4"], ["val_7", "val_10"]] | [["val_2", "val_5"], ["", "val_11"]] | [["", "val_6"], ["val_9", ""]]
I would highly appreciate your input on this, and a better way to do the same as well!
Thanks for your time.
My Mongo version is 3.4.6 and I am using PyMongo as my driver. You can see the query in action at mongo-db-playground
It might be a bad idea to do all of this processing in one aggregation query; you could do it on the client side instead. I have created a query, but it is lengthy and may cause performance issues on huge data sets:
$objectToArray converts the months object to an array
$unwind deconstructs the months array
$unwind deconstructs the transactions array and provides an index field index
$group by _id, year, month and index, taking the $first object from transactions into fields
$project lets you shape the response however you want; it is optional, and I have added it in the playground link
pipeline = [
    # some filter logic here to filter data basically first
    {"$project": {"months": {"$objectToArray": "$months"}}},
    {"$unwind": "$months"},
    {
        "$unwind": {
            "path": "$months.v.transactions",
            "includeArrayIndex": "index"
        }
    },
    {
        "$group": {
            "_id": {
                "_id": "$_id",
                "year": "$months.v.year",
                "month": "$months.v.month",
                "index": "$index"
            },
            "fields": {"$first": "$months.v.transactions"}
        }
    }
]
my_db.collection.aggregate(pipeline, allowDiskUse=True)
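On the sample document from the question, this pipeline should produce roughly the following documents (one per transaction; exact key order and the index type depend on the driver):
{ "_id": { "_id": "1", "year": "2008", "month": "2", "index": 0 },
  "fields": { "field_1": "val_1", "field_2": "val_2" } }
{ "_id": { "_id": "1", "year": "2008", "month": "2", "index": 1 },
  "fields": { "field_1": "val_4", "field_2": "val_5", "field_3": "val_6" } }
{ "_id": { "_id": "1", "year": "2018", "month": "3", "index": 0 },
  "fields": { "field_1": "val_7", "field_3": "val_9" } }
{ "_id": { "_id": "1", "year": "2018", "month": "3", "index": 1 },
  "fields": { "field_1": "val_10", "field_2": "val_11" } }
Flattening fields into top-level columns (as in the table from the question) can then be done in the optional $project or on the client side.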
Playground

How do I get a sum of the occurrence of each item in an array across all documents?

I want to get an aggregation/count of the occurrence of all items in an array across all documents. I've tried looking up examples, but none of them seem to cover this scenario exactly, or they go about it in a very obtuse way.
Here's a simple idea of the document model I'm working with. The itemIds array within each object is always unique (no repeated values):
[{
_id:1,
itemIds:[3, 4, 6, 12]
},
{
_id:2,
itemIds:[4, 12]
},
{
_id:3,
itemIds:[3, 4, 8, 9, 12]
}]
I need the counts of each of these summed up (doesn't have to be this exact format but just giving a general idea of what I need):
{
itemsCount:[
{
itemId:3,
count:2
},
{
itemId:4,
count:3
},
{
itemId:6,
count:1
},
{
itemId:8,
count:1
},
{
itemId:9,
count:1
},
{
itemId:12,
count:3
}
]
}
Please try this:
db.yourCollection.aggregate([
{$project : {'itemIds' : 1, _id :0}},
{$unwind : '$itemIds'},
{$group : {'_id': '$itemIds', count :{$sum :1}}}
])
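This already gives one document per itemId with its count. If you want those collected into a single document shaped like the itemsCount example in the question, two more stages can be appended; a sketch (yourCollection is a placeholder, as above):
db.yourCollection.aggregate([
  { $project : { itemIds : 1, _id : 0 } },                 // keep only the array
  { $unwind : '$itemIds' },                                // one document per array element
  { $group : { _id : '$itemIds', count : { $sum : 1 } } }, // count occurrences of each itemId
  { $sort : { _id : 1 } },                                 // optional: order by itemId
  { $group : { _id : null, itemsCount : { $push : { itemId : '$_id', count : '$count' } } } },
  { $project : { _id : 0, itemsCount : 1 } }               // drop the null _id
])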

MongoDB query - aggregates and embedded documents

Need some help writing a MongoDB query.
Background: I'm building an app that keeps track of donations.
I'm creating an API in ExpressJS, and I am using Mongoose to hook up to MongoDB.
I have a MongoDB collection called Donations that looks like this:
[
{
donor: 123,
currency: 'CAD',
donationAmount: 50
},
{
donor: 123,
currency: 'USD',
donationAmount: 50
},
{
donor: 789,
currency: 'CAD',
donationAmount: 50
},
{
donor: 123,
currency: 'CAD',
donationAmount: 50
}
]
For each donor, I need to sum up the total amount of donations per currency.
Ideally I want a single MongoDB query that would produce the following dataset. (I'm flexible on the structure ... my only requirements are that in the results, 1) each donor has one and only one document, and 2) this document contains the summed total for each currency type.)
[
{
donor: 123,
donations: [
{
CAD : 100,
},
{
USD : 50
}
]
},
{
donor: 789,
donations: [
{
CAD: 50
}
]
},
]
Any ideas on the best way to do this?
My solution right now is pretty ugly - I haven't been able to achieve it without doing multiple queries.
You can run $group twice and use $arrayToObject to build your keys dynamically:
Model.aggregate([
{ $group: { _id: { donor: "$donor", currency: "$currency" }, sum: { $sum: "$donationAmount" } } },
{ $group: { _id: "$_id.donor", donations: { $push: { $arrayToObject: [[{ k: "$_id.currency", v: "$sum" }]] } } } },
{ $project: { _id: 0, donor: "$_id", donations: 1 } }
])
Mongo Playground
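If you would rather have donations as a single object per donor, e.g. { CAD: 100, USD: 50 }, instead of an array of one-key objects, the $mergeObjects accumulator (available in $group since MongoDB 3.6) can replace $push in the second stage. A sketch of that variant:
Model.aggregate([
  { $group: { _id: { donor: "$donor", currency: "$currency" }, sum: { $sum: "$donationAmount" } } },
  // merge the per-currency single-key objects into one object per donor
  { $group: { _id: "$_id.donor", donations: { $mergeObjects: { $arrayToObject: [[{ k: "$_id.currency", v: "$sum" }]] } } } },
  { $project: { _id: 0, donor: "$_id", donations: 1 } }
])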

How to handle partial week data grouping in mongodb

I have some docs (daily open price for a stock) like the followings:
/* 0 */
{
"_id" : ObjectId("54d65597daf0910dfa8169b0"),
"D" : ISODate("2014-12-29T00:00:00.000Z"),
"O" : 104.98
}
/* 1 */
{
"_id" : ObjectId("54d65597daf0910dfa8169af"),
"D" : ISODate("2014-12-30T00:00:00.000Z"),
"O" : 104.73
}
/* 2 */
{
"_id" : ObjectId("54d65597daf0910dfa8169ae"),
"D" : ISODate("2014-12-31T00:00:00.000Z"),
"O" : 104.51
}
/* 3 */
{
"_id" : ObjectId("54d65597daf0910dfa8169ad"),
"D" : ISODate("2015-01-02T00:00:00.000Z"),
"O" : 103.75
}
/* 4 */
{
"_id" : ObjectId("54d65597daf0910dfa8169ac"),
"D" : ISODate("2015-01-05T00:00:00.000Z"),
"O" : 102.5
}
and I want to aggregate the records by week so I can get the weekly average open price. My first attempt is to use:
db.ohlc.aggregate({
$match: {
D: {
$gte: new ISODate('2014-12-28')
}
}
}, {
$project: {
year: {
$year: '$D'
},
week: {
$week: '$D'
},
O: 1
}
}, {
$group: {
_id: {
year: '$year',
week: '$week'
},
O: {
$avg: '$O'
}
}
}, {
$sort: {
_id: 1
}
})
But I soon realized the result is incorrect, as both the last week of 2014 (week number 52) and the first week of 2015 (week number 0) are partial weeks. With this aggregation I would have an average price for 12/29-12/31/2014 and another one for 01/02/2015 (which is the only trading date in the first week of 2015), but in my application I need to group the data from 12/29/2014 through 01/02/2015. Any advice?
To answer my own question, the trick is to calculate the number of weeks based on a reference date (1970-01-04) and group by that number. You can check out my new post at http://midnightcodr.github.io/2015/02/07/OHLC-data-grouping-with-mongodb/ for details.
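In case the post moves, here is a minimal sketch of that trick (it may differ in detail from the post): subtracting two dates yields milliseconds, and a $mod by the number of milliseconds in a week (604800000) truncates each date to the start of its week, counted from the reference Sunday 1970-01-04, so 12/29/2014 through 01/02/2015 fall into the same group:
db.ohlc.aggregate([
  { $match: { D: { $gte: ISODate('2014-12-28') } } },
  // milliseconds elapsed since the reference Sunday 1970-01-04
  { $project: { O: 1, sinceRef: { $subtract: ['$D', ISODate('1970-01-04T00:00:00Z')] } } },
  // truncate to the start of the week by removing the remainder of a whole week
  { $project: { O: 1, weekStart: { $subtract: ['$sinceRef', { $mod: ['$sinceRef', 604800000] }] } } },
  // weekStart is each week's offset in ms from the reference Sunday
  { $group: { _id: '$weekStart', avgOpen: { $avg: '$O' } } },
  { $sort: { _id: 1 } }
])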
I use this to build 15-minute candles; with allowDiskUse, $out and some date filters it works great. Maybe you can adapt the grouping?
db.getCollection('market').aggregate(
[
{ $match: { date: { $exists: true } } },
{ $sort: { date: 1 } },
{ $project: { _id: 0, date: 1, rate: 1, amount: 1, tm15: { $mod: [ "$date", 900 ] } } },
{ $project: { _id: 0, date: 1, rate: 1, amount: 1, candleDate: { $subtract: [ "$date", "$tm15" ] } } },
{ $group: { _id: "$candleDate", open: { $first: '$rate' }, low: { $min: '$rate' }, high: { $max: '$rate' }, close: { $last: '$rate' }, volume: { $sum: '$amount' }, trades: { $sum: 1 } } }
])
In my experience, this is not a good approach to the problem. Why? It will definitely not scale: the amount of computation needed is quite heavy, especially for the grouping.
What I would do in your situation is to move part of the application logic to the documents in the DB.
My first approach would be to add a "week" field that stores the previous (or next) Sunday of the date the sample belongs to. This is quite easy to do at insertion time. Then you can simply run the aggregation grouping by that field. If you want more performance, add an index on { symbol : 1, week : 1 } and add a sort to the aggregate.
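A sketch of that first approach, assuming the document shape from the question; the previousSunday helper and the sample variable are only illustrative names:
// compute the Sunday (UTC) that starts the week a given date falls in
function previousSunday(d) {
  var sunday = new Date(d);
  sunday.setUTCHours(0, 0, 0, 0);
  sunday.setUTCDate(sunday.getUTCDate() - sunday.getUTCDay());
  return sunday;
}
// store the precomputed week alongside each daily sample at insertion time
db.ohlc.insert({ D: sample.D, O: sample.O, week: previousSunday(sample.D) });
// the weekly average then becomes a plain group on that field
db.ohlc.aggregate([
  { $group: { _id: '$week', avgOpen: { $avg: '$O' } } },
  { $sort: { _id: 1 } }
])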
My second approach, for the case where you plan on doing this type of aggregation a lot, is basically to have documents that group the samples by week, like this:
{
week : <Day Representing Week>,
prices: [
{ Day Sample }, ...
]
}
Then you can simply work on those documents directly. This will help you reduce your indexes in a significant manner, thus speeding things up.
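A sketch of how such a weekly bucket could be maintained at insertion time, reusing the illustrative previousSunday helper from the sketch above and an assumed ohlcWeekly collection:
db.ohlcWeekly.update(
  { week: previousSunday(sample.D) },                   // one bucket document per week
  { $push: { prices: { D: sample.D, O: sample.O } } },  // append the daily sample
  { upsert: true }                                      // create the bucket on the week's first sample
)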

MongoDB - Query all documents createdAt within last hours, and group by minute?

From reading various articles out there, I believe this should be possible, but I'm not sure where exactly to start.
This is what I'm trying to do:
I want to run a query that finds all documents whose createdAt is within the last hour, groups them by minute, and, since each document has a tweets value, like 5, 6, or 19, adds those up for each minute and provides a sum.
Here's a sample of the collection:
{
"createdAt": { "$date": 1385064947832 },
"updatedAt": null,
"tweets": 47,
"id": "06E72EBD-D6F4-42B6-B79B-DB700CCD4E3F",
"_id": "06E72EBD-D6F4-42B6-B79B-DB700CCD4E3F"
}
Is this possible to do in mongodb?
#zero323 - I first tried just grouping the last hour like so:
db.tweetdatas.group( {
key: { tweets: 1, 'createdAt': 1 },
cond: { createdAt: { $gt: new Date("2013-11-20T19:44:58.435Z"), $lt: new Date("2013-11-20T20:44:58.435Z") } },
reduce: function ( curr, result ) { },
initial: { }
} )
But that just returns all the tweets within the timeframe, which technically is what I want, but now I want to group them by minute and add up the sum of tweets for each minute.
#almypal
Here is the query that I'm using, based off your suggestion:
db.tweetdatas.aggregate(
{$match:{ "createdAt":{$gt: "2013-11-22T14:59:18.748Z"}, }},
{$project: { "createdAt":1, "createdAt_Minutes": { $minute : "$createdAt" }, "tweets":1, }},
{$group:{ "_id":"$createdAt_Minutes", "sum_tweets":{$sum:"$tweets"} }}
)
However, it's displaying this response:
{ "result" : [ ], "ok" : 1 }
Update: The response from #almypal is working. Apparently, passing the date as a string like in my example above does not work. Although I run this query from Node, when trying it in the shell I thought it would be easier to convert the date variable to a string and use that.
Use aggregation as below:
var lastHour = new Date();
lastHour.setHours(lastHour.getHours()-1);
db.tweetdatas.aggregate(
{$match:{ "createdAt":{$gt: lastHour}, }},
{$project: { "createdAt":1, "createdAt_Minutes": { $minute : "$createdAt" }, "tweets":1, }},
{$group:{ "_id":"$createdAt_Minutes", "sum_tweets":{$sum:"$tweets"} }}
)
and the result would look like this:
{
"result" : [
{
"_id" : 1,
"sum_tweets" : 117
},
{
"_id" : 2,
"sum_tweets" : 40
},
{
"_id" : 3,
"sum_tweets" : 73
}
],
"ok" : 1
}
where _id corresponds to the specific minute and sum_tweets is the total number of tweets in that minute.
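One caveat with grouping on $minute alone is that the same minute number from two different hours would be merged if the $match window ever spans more than one hour. A sketch of a variant that keys on the full minute instead, assuming MongoDB 3.0+ where $dateToString is available:
var lastHour = new Date();
lastHour.setHours(lastHour.getHours() - 1);
db.tweetdatas.aggregate([
  { $match: { createdAt: { $gt: lastHour } } },
  // key each group by the full year-month-day hour:minute string
  { $group: {
      _id: { $dateToString: { format: '%Y-%m-%d %H:%M', date: '$createdAt' } },
      sum_tweets: { $sum: '$tweets' }
  } },
  { $sort: { _id: 1 } }
])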