I am new to MongoDB and have not been able to find a solution to my problem.
I am collecting hourly crypto data. Each document contains an array of objects, and each of those objects holds another nested array of objects. It looks as follows:
{
  timestamp: "2022-05-11T12:38:01.537Z",
  positions: [
    {
      detail: 1,
      name: "name",
      importantNumber: 0,
      arrayOfTokens: [
        {
          tokenName: "name",
          tokenSymbol: "symbol",
          tokenPrice: 1,
          tokensEarned: 10,
          baseAssetValueOfTokensEarned: 10,
        },
        {
          tokenName: "name2",
          tokenSymbol: "symbol2",
          tokenPrice: 2,
          tokensEarned: 10,
          baseAssetValueOfTokensEarned: 20,
        },
      ],
    },
  ],
};
My goal is to aggregate the hourly data into daily groups, where the timestamp becomes the day's date, the positions array still holds the primary details of each position, and the importantNumber values are summed (this much I believe I have achieved). I also want to aggregate each hour's token details into a single object per token, calculating the average token price, the total tokens earned, and so on.
What I have so far is:
const res = await Name.aggregate([
  {
    $unwind: { path: "$positions" },
  },
  {
    $project: {
      _id: 0,
      timestamp: "$timestamp",
      detail: "$positions.detail",
      name: "$positions.name",
      importantNumber: "$positions.importantNumber",
      arrayOfTokens: "$positions.arrayOfTokens",
    },
  },
  {
    $group: {
      _id: {
        date: { $dateToString: { format: "%Y-%m-%d", date: "$timestamp" } },
        name: "$name",
      },
      importantNumber: { $sum: "$importantNumber" },
      arrayOfTokens: { $push: "$arrayOfTokens" }, // it is here that I am stuck
    },
  },
]);
return res;
};
With two hours recorded, the above query returns the following result, with the arrayOfTokens housing multiple arrays:
{
_id: {
date: '2022-05-11',
name: 'name',
},
importantNumber: //sum of important number,
arrayOfTokens: [
[ [Object], [Object] ], // hour 1: token1, token2.
[ [Object], [Object] ] // hour 2: token1, token2
]
}
I would like the arrayOfTokens to house only one instance of each token object. Something similar to the following:
...
arrayOfTokens: [
{allToken1}, {allToken2} // with the name and symbol preserved, and average of the token price, sum of tokens earned and sum of base asset value.
]
Any help would be greatly appreciated, thank you.
This should do it:
db.collection.aggregate([
{ $unwind: { path: "$positions" } },
{
$group: {
_id: {
date: { $dateTrunc: { date: "$timestamp", unit: "day" } },
name: "$positions.name"
},
importantNumber: { $sum: "$positions.importantNumber" },
arrayOfTokens: { $push: "$positions.arrayOfTokens" }
}
}
])
I prefer $dateTrunc (available since MongoDB 5.0) over grouping by a date string.
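The $push in that pipeline still leaves arrayOfTokens as an array of per-hour arrays. One possible continuation (a sketch, not the original answer's code; stage and field names follow the question) is to $unwind down to the token level and group once per token before regrouping per position:

```javascript
// Sketch: unwind down to individual tokens, group per (day, position, token)
// to average/sum the token fields, then regroup per (day, position) to
// rebuild arrayOfTokens with one merged object per token.
const dailyPipeline = [
  { $unwind: "$positions" },
  { $unwind: "$positions.arrayOfTokens" },
  {
    $group: {
      _id: {
        date: { $dateTrunc: { date: "$timestamp", unit: "day" } },
        name: "$positions.name",
        tokenSymbol: "$positions.arrayOfTokens.tokenSymbol",
      },
      tokenName: { $first: "$positions.arrayOfTokens.tokenName" },
      avgTokenPrice: { $avg: "$positions.arrayOfTokens.tokenPrice" },
      tokensEarned: { $sum: "$positions.arrayOfTokens.tokensEarned" },
      baseAssetValueOfTokensEarned: {
        $sum: "$positions.arrayOfTokens.baseAssetValueOfTokensEarned",
      },
      // Note: importantNumber lives per position, not per token, so summing
      // it here would overcount by the number of tokens; it needs separate
      // handling (e.g. $first per hour) if exact sums matter.
    },
  },
  {
    $group: {
      _id: { date: "$_id.date", name: "$_id.name" },
      arrayOfTokens: {
        $push: {
          tokenName: "$tokenName",
          tokenSymbol: "$_id.tokenSymbol",
          avgTokenPrice: "$avgTokenPrice",
          tokensEarned: "$tokensEarned",
          baseAssetValueOfTokensEarned: "$baseAssetValueOfTokensEarned",
        },
      },
    },
  },
];
```

The pipeline array can be passed straight to `Name.aggregate(dailyPipeline)`.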
I want to serve data from multiple collections, let's say product1 and product2.
Schemas of both can be referred to as:
{ amount: Number } // other fields might be there but not useful in this case.
Now, after multiple stages of the aggregation pipeline, I'm able to get the data in the following format:
items: [
{
amount: 10,
type: "product1",
date: "2022-10-05"
},
{
amount: 15,
type: "product2",
date: "2022-10-07"
},
{
amount: 100,
type: "product1",
date: "2022-10-10"
}
]
However, I want one more field added to each element of items: the running total of all amounts up to and including that element.
Desired result:
items: [
{
amount: 10,
type: "product1",
date: "2022-10-05",
totalAmount: 10
},
{
amount: 15,
type: "product2",
date: "2022-10-07",
totalAmount: 25
},
{
amount: 100,
type: "product1",
date: "2022-10-10",
totalAmount: 125
}
]
I tried adding another $project stage, which goes as follows:
{
items: {
$map: {
input: "$items",
in: {
$mergeObjects: [
"$$this",
{ totalAmount: {$add : ["$$this.amount", 0] } },
]
}
}
}
}
This just appends another field, totalAmount, as the sum of 0 and that item's own amount.
I couldn't find a way to make the second argument (currently 0) in {$add: ["$$this.amount", 0]} act as an accumulator variable (initial value 0).
What's the way to perform such an action in the MongoDB aggregation pipeline?
PS: I could easily perform this by mapping over the results in application code, but I need to add a $limit stage (for pagination) later in the pipeline.
You can use $reduce instead of $map for this:
db.collection.aggregate([
{$project: {
items: {
$reduce: {
input: "$items",
initialValue: [],
in: {
$concatArrays: [
"$$value",
[{$mergeObjects: [
"$$this",
{totalAmount: {$add: ["$$this.amount", {$sum: "$$value.amount"}]}}
]}]
]
}
}
}
}}
])
See how it works on the playground example
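For intuition, the $reduce above mirrors a plain JavaScript reduce that threads the running sum through the accumulator. A sketch using the sample items from the question:

```javascript
// Plain-JS equivalent of the $reduce stage: each element is merged with a
// totalAmount equal to its own amount plus the sum of all previous amounts.
const items = [
  { amount: 10, type: "product1", date: "2022-10-05" },
  { amount: 15, type: "product2", date: "2022-10-07" },
  { amount: 100, type: "product1", date: "2022-10-10" },
];

const withTotals = items.reduce((acc, item) => {
  const previousSum = acc.reduce((s, it) => s + it.amount, 0); // $sum: "$$value.amount"
  // $concatArrays of $$value with the $mergeObjects result:
  return [...acc, { ...item, totalAmount: item.amount + previousSum }];
}, []);

console.log(withTotals.map((it) => it.totalAmount)); // [ 10, 25, 125 ]
```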
I want to know:
sum of rain (mm)
sum of sunshine (hours)
Probability (%) of a rain day with more than 0.5 mm of rain on weekends
on the weekends (Sat + Sun) between week 20 and week 40, for the last 17 years.
I have 820k documents in 10-minute periods.
The query sometimes takes 38 seconds, sometimes more than a minute.
Do you have an idea how to improve performance?
Data model (field names are German: datum = date, regen = rain, tempAussen = outside temperature, sonnenSchein = sunshine):
[
'datum',
'regen',
'tempAussen',
'sonnenSchein',
and more...
]
schema:
[
{
$project: {
jahr: {
$year: {
date: '$datum',
timezone: 'Europe/Berlin',
},
},
woche: {
$week: {
date: '$datum',
timezone: 'Europe/Berlin',
},
},
day: {
$isoDayOfWeek: {
date: '$datum',
timezone: 'Europe/Berlin',
},
},
stunde: {
$hour: {
date: '$datum',
timezone: 'Europe/Berlin',
},
},
tagjahr: {
$dayOfYear: {
date: '$datum',
timezone: 'Europe/Berlin',
},
},
tempAussen: 1,
regen: 1,
sonnenSchein: 1,
},
},
{
$match: {
$and: [
{
woche: {
$gte: 20,
},
},
{
woche: {
$lte: 40,
},
},
{
day: {
$gte: 6,
},
},
],
},
},
{
$group: {
_id: ['$tagjahr', '$jahr'],
woche: {
$first: '$woche',
},
regen_sum: {
$sum: '$regen',
},
sonnenSchein_sum: {
$sum: '$sonnenSchein',
},
},
},
{
$project: {
_id: '$_id',
regenTage: {
$sum: {
$cond: {
if: {
$gte: ['$regen_sum', 0.5],
},
then: 1,
else: 0,
},
},
},
woche: 1,
regen_sum: 1,
sonnenSchein_sum: 1,
},
},
{
$group: {
_id: '$woche',
regen_sum: {
$sum: '$regen_sum',
},
sonnenSchein_sum: {
$sum: '$sonnenSchein_sum',
},
regenTage: {
$sum: '$regenTage',
},
},
},
{
$project: {
regenTage: 1,
regen_sum: {
$divide: ['$regen_sum', 34],
},
sonnenSchein_sum: {
$divide: ['$sonnenSchein_sum', 2040],
},
probability: {
$divide: ['$regenTage', 0.34],
},
},
},
{
$project: {
regen_sum: {
$round: ['$regen_sum', 1],
},
sonnenSchein_sum: {
$round: ['$sonnenSchein_sum', 1],
},
wahrscheinlich: {
$round: ['$probability', 0],
},
},
},
{
$sort: {
_id: 1,
},
},
]
This result is an example for week 20: on the weekend of calendar week 20 I have on average 2.3 mm of rain, 11.9 h of sunshine, and a 35% probability that it will rain on at least one day of the weekend.
_id: 20
regen_sum: 2.3
sonnenSchein_sum: 11.9
probability: 35
Without having the verbose explain output (.explain("allPlansExecution")), it is hard to say anything for sure. Here are some observations from just taking a look at the aggregation pipeline that was provided (underneath "schema:").
Before going into observations, I must ask what your specific goals are. Are operations like these something you will be running frequently? Is anything faster than 38 seconds acceptable, or is there a specific runtime that you are looking for? As outlined below, there probably isn't much opportunity for direct improvement. Therefore it might be beneficial to look into other approaches to the problem, and I'll outline one at the end.
The first observation is that this aggregation is going to perform a full collection scan. Even if an index existed on the datum field, it could not be used since the filtering in the $match is done on new fields that are calculated from datum. We could make some changes to allow an index to be used, but it probably wouldn't help. You are processing ~38% of your data (20 of the 52 weeks per year) so the overhead of doing the index scan and randomly fetching a significant portion of the data is probably more than just scanning the entire collection directly.
Secondly, you are currently $grouping twice. The only reason for this seems to be so that you can determine if a day is considered 'rainy' first (more than 0.5mm of rain). But the 'rainy day' indicator then effectively gets combined to become a 'rainy weekend' indicator in the second grouping. Although it could technically change the results a little due to the rounding done on the 24 hour basis, perhaps that small change would be worthwhile to eliminate one of the $group stages entirely?
If this were my system, I would consider pre-aggregating some of this data. Specifically having daily summaries as opposed to the 10 minute intervals of raw data would really go a long way here in reducing the amount of processing that is required to generate summaries like this. Details for each day (which won't change) would then be contained in a single document rather than in 144 individual ones. That would certainly allow an aggregation logically equivalent to the one above to process much faster than what you are currently observing.
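To illustrate that pre-aggregation idea, a daily rollup stored via $merge might look like this (a sketch; the weatherDaily collection name and the choice of summary fields are assumptions):

```javascript
// Sketch: roll the 10-minute documents up into one summary document per day
// and store the result with $merge. Run once over history, then periodically
// for new data. "weatherDaily" is a placeholder collection name.
const dailyRollup = [
  {
    $group: {
      _id: {
        $dateTrunc: { date: "$datum", unit: "day", timezone: "Europe/Berlin" },
      },
      regen_sum: { $sum: "$regen" },           // total rain per day
      sonnenSchein_sum: { $sum: "$sonnenSchein" }, // total sunshine per day
      tempAussen_avg: { $avg: "$tempAussen" }, // average outside temperature
    },
  },
  {
    $merge: {
      into: "weatherDaily",
      on: "_id",
      whenMatched: "replace",
      whenNotMatched: "insert",
    },
  },
];
// db.weather.aggregate(dailyRollup)
```

The weekly summaries above would then read 2 weekend documents per week from weatherDaily instead of 288.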
I have a payload in Excel sheets consisting of 4 columns: Date, status, amount, orderId. I need to structure the data / categorize the columns by month, and within each month orders are categorized by status.
Umbrella Status:
INTRANSIT - ‘intransit’, ‘at hub’, ‘out for delivery’
RTO - ‘RTO Intransit’, ‘RTO Delivered’
PROCESSING - ‘processing’
For example:
The response should look like:
May:
1. INTRANSIT
2. RTO
3. PROCESSING
June:
1. INTRANSIT
2. RTO
3. PROCESSING
You can use the different aggregation operators provided by MongoDB, for example: $group, $facet, $match, $unwind, $bucket, $project, $lookup, etc.
I tried it with this:
const pipeline = [{
  $facet: {
    "INTRANSIT": [
      { $match: { Status: { $in: ['INTRANSIT', 'AT HUB', 'OUT FOR DELIVERY'] } } },
      { $group: { _id: "$Date", numberofbookings: { $sum: 1 } } }
    ],
    "RTO": [
      { $match: { Status: { $in: ['RTO INTRANSIT', 'RTO DELIVERED'] } } },
      { $group: { _id: "$Date", numberofbookings: { $sum: 1 } } }
    ],
    "PROCESSING": [
      { $match: { Status: { $in: ['PROCESSING'] } } },
      { $group: { _id: { $month: "$Date" }, numberofbookings: { $sum: 1 } } }
    ]
  }
}];
const aggCursor = coll.aggregate(pipeline);
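One possible alternative (a sketch, not tested against the real sheet data): instead of one $facet branch per umbrella status, normalize each raw status with $switch and group by month and umbrella status in a single pass. The Status and Date field names follow the attempt above; everything else is an assumption:

```javascript
// Sketch: map each raw status to its umbrella status, then group by
// (month, umbrella status), then collect the per-status counts under each month.
const umbrellaPipeline = [
  {
    $addFields: {
      umbrellaStatus: {
        $switch: {
          branches: [
            { case: { $in: ["$Status", ["INTRANSIT", "AT HUB", "OUT FOR DELIVERY"]] }, then: "INTRANSIT" },
            { case: { $in: ["$Status", ["RTO INTRANSIT", "RTO DELIVERED"]] }, then: "RTO" },
          ],
          default: "PROCESSING", // 'processing' is the only remaining umbrella
        },
      },
    },
  },
  {
    $group: {
      _id: { month: { $month: "$Date" }, status: "$umbrellaStatus" },
      numberofbookings: { $sum: 1 },
    },
  },
  {
    $group: {
      _id: "$_id.month",
      statuses: {
        $push: { status: "$_id.status", numberofbookings: "$numberofbookings" },
      },
    },
  },
];
// coll.aggregate(umbrellaPipeline)
```

This assumes Date is stored as a BSON date; if it comes out of the sheet as a string, it would need `$dateFromString` first.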
I have been writing an aggregation pipeline to show a summarized version of data from a collection.
Sample Structure of Document:
{
_id: 'abcxyz',
eventCode: 'EVENTCODE01',
eventName: 'SOMEEVENT',
units: 1,
rate: 2,
cost: 2,
distribution: [
  {
    startDate: ISODate("2021-05-31T04:00:00.000+00:00"),
    units: 1
  }
]
}
I have grouped it and merged the distributions into a single list, with an $unwind stage before $group:
[
  {
    $unwind: {
      path: '$distribution',
      preserveNullAndEmptyArrays: false
    }
  },
  {
    $group: {
      _id: {
        eventName: '$eventName',
        eventCode: '$eventCode'
      },
      totalUnits: { $sum: '$units' },
      distributionList: { $push: '$distribution' },
      perUnitRate: { $avg: '$rate' },
      perUnitCost: { $avg: '$cost' }
    }
  }
]
Sample Output:
{
  _id: {
    eventName: 'EVENTNAME101',
    eventCode: 'QQQ'
  },
  totalUnits: 7,
  perUnitRate: 2,
  perUnitCost: 2,
  distributionList: [
    { startDate: ISODate("2021-05-31T04:00:00.000+00:00"), units: 1 },
    { startDate: ISODate("2021-05-31T04:00:00.000+00:00"), units: 1 },
    { startDate: ISODate("2021-06-07T04:00:00.000+00:00"), units: 1 }
  ]
}
I'm getting stuck at the next step; I want to consolidate the distributionList into a new List with no repeating startDate.
Example: Since first 2 objects of distributionList have the same startDate, it should be a single object in output with sum of units:
Expected:
{
  _id: {
    eventName: 'EVENTNAME101',
    eventCode: 'QQQ'
  },
  totalUnits: 7,
  perUnitRate: 2,
  perUnitCost: 2,
  newDistributionList: [
    { startDate: ISODate("2021-05-31T04:00:00.000+00:00"), units: 2 }, // units summed for the first 2 objects
    { startDate: ISODate("2021-06-07T04:00:00.000+00:00"), units: 1 }
  ]
}
I couldn't use $unwind or $bucket, as I intend to keep the grouping I did in the previous $group step.
Can I get suggestions, or a different approach if this one doesn't seem right?
You may want to do the first $group at the eventName, eventCode, distribution.startDate level. Then you can $group again at the eventName, eventCode level, using $first to keep your original $group fields.
Here is the Mongo Playground to show the idea for your reference.
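A sketch of that suggestion with the fields from the question (note: totalUnits here sums the distribution-level units, and the averages weight per distinct start date rather than per raw document, so results may differ slightly from the original single $group):

```javascript
// Sketch: first group per (eventName, eventCode, startDate) to merge
// same-date distribution entries, then regroup per (eventName, eventCode)
// and rebuild the list with one entry per startDate.
const pipeline = [
  { $unwind: { path: "$distribution", preserveNullAndEmptyArrays: false } },
  {
    $group: {
      _id: {
        eventName: "$eventName",
        eventCode: "$eventCode",
        startDate: "$distribution.startDate",
      },
      units: { $sum: "$distribution.units" }, // merge entries sharing a startDate
      perUnitRate: { $avg: "$rate" },
      perUnitCost: { $avg: "$cost" },
    },
  },
  {
    $group: {
      _id: { eventName: "$_id.eventName", eventCode: "$_id.eventCode" },
      totalUnits: { $sum: "$units" },
      perUnitRate: { $first: "$perUnitRate" },
      perUnitCost: { $first: "$perUnitCost" },
      newDistributionList: {
        $push: { startDate: "$_id.startDate", units: "$units" },
      },
    },
  },
];
```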
I have this model for purchases:
{
purchase_date: ISODate("2018-03-11T00:00:00.000Z"),
total_cost: 400,
items: [
{
title: 'Pringles',
price: 200,
quantity: 2,
category: 'Snacks'
}
]
}
What I'm trying to do is, first of all, group the purchases by date, like so:
{$group: {
  _id: {
    date: '$purchase_date',
    items: '$items'
  }
}}
However, now what I want to do is group each day's purchases by items[].category and calculate how much was spent on each category that day. I was able to do that for a single day, but once I grouped each purchase by date I was no longer able to $unwind the items.
I tried passing the path $items and it isn't found at all. If I try to use $_id.$items or _id.$items, in both cases I get an error stating that it is not a valid path for $unwind.
You can use purchase_date and items.category as a grouping _id, but you need to $unwind items first; then you can add another $group to get all category totals per day:
db.col.aggregate([
{ $unwind: "$items" },
{
$group: {
_id: {
purchase_date: "$purchase_date",
category: "$items.category",
},
total: { $sum: { $multiply: [ "$items.price", "$items.quantity" ] } }
}
},
{
$group: {
_id: "$_id.purchase_date",
categories: { $push: { name: "$_id.category", total: "$total" } }
}
}
])