MongoDB Aggregation question using summations / matches - mongodb

I have a collection with the following type of documents:
{
device: integer,
date: string,
time: string,
voltage: double,
amperage: double
}
Data is inserted as time series data, and a separate process aggregates and averages results so that this collection has a single document per device every 5 minutes, i.e. time is 00:05:00, 00:10:00, etc.
I need to search for a specific group of devices (usually 5-10 at a time). I need the voltage to be >= 27.0, and I need to search for a single date.
That part is easy, but I need to only find data when all 5-10 systems at a time interval meet the 27.0 requirement. I'm not sure how to handle that requirement.
Once I know that, I then need to find the specific grouping of devices that have the lowest summation of the amperage field, and I need to return the time that this occurred.
So, let's assume I am going to search for 5 devices. I need to find the time when all 5 devices have a voltage >= 27.0 and the summation of the amperage field is the lowest.
I'm not sure how to require that all the devices meet the voltage requirement, and then for that group of devices, to then find the time when the amperage summation is the lowest.
Any questions would be great.
Thanks.

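You need to use the $all operator: group the matching documents by time, collect the devices that passed the voltage filter, and keep only the groups that contain every requested device.
Note: Please provide more information about what "the summation of the amperage field is the lowest" means.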
db.collection.aggregate([
{
$match: {
device: { $in: [1, 2, 3] },
date: "2022/10/01",
voltage: { $gte: 27.0 }
}
},
{
$group: {
_id: "$time",
device: {
"$addToSet": "$device"
},
amperage: {
$min: "$amperage"
},
root: {
$push: "$$ROOT"
}
}
},
{
$match: {
device: { $all: [ 1, 2, 3 ] }
}
}
])
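If "lowest summation" means the smallest total amperage across the requested devices within one time slot, the pipeline above could be extended roughly as follows. This is a sketch under that assumption: buildLowestAmperagePipeline, devices and totalAmperage are made-up names, the other field names come from the question, and $sum relies on there being a single document per device per slot.

```javascript
// Hypothetical helper: builds the stages for "the time at which all requested
// devices are at or above minVoltage and their summed amperage is lowest".
function buildLowestAmperagePipeline(devices, date, minVoltage) {
  return [
    // Keep only readings for the requested devices and date that pass the voltage check.
    { $match: { device: { $in: devices }, date: date, voltage: { $gte: minVoltage } } },
    // One group per 5-minute slot: which devices passed, and their total amperage.
    {
      $group: {
        _id: "$time",
        devices: { $addToSet: "$device" },
        totalAmperage: { $sum: "$amperage" }
      }
    },
    // Drop slots where a requested device is missing (it failed the voltage check).
    { $match: { devices: { $all: devices } } },
    // Lowest total first, then keep only that slot.
    { $sort: { totalAmperage: 1 } },
    { $limit: 1 }
  ];
}

// Usage (the collection name is a placeholder):
// db.collection.aggregate(buildLowestAmperagePipeline([1, 2, 3], "2022/10/01", 27.0))
```

The result, if any, is a single document whose _id is the time of the slot.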

Related

MongoDB Aggregation to get events in timespan, plus the previous event

I have timeseries data as events coming in at random times. They are not ongoing metrics, but rather events. "This device went online." "This device went offline."
I need to report on the number of actual transitions within a time range. Because there are occasionally same-state events, for example two "went online" events in a row, I need to "seed" the data with the state previous to the time range. If I have events in my time range, I need to compare them to the state before the time range in order to determine if something actually changed.
I already have aggregation stages that remove same-state events.
Is there a way to add "the latest, previous event" to the data in the pipeline without writing two queries? A $facet stage totally ruins performance.
For "previous", I'm currently trying something like this in a separate query, but it's very slow on the millions of records:
// Get the latest event before a given date
db.devicemetrics.aggregate([
{
$match: {
'device.someMetadata': '70b28808-da2b-4623-ad83-6cba3b20b774',
time: {
$lt: ISODate('2023-01-18T07:00:00.000Z'),
},
someValue: { $ne: null },
},
},
{
$group: {
_id: '$device._id',
lastEvent: { $last: '$$ROOT' },
},
},
{
$replaceRoot: { newRoot: '$lastEvent' },
}
]);
You are looking for something akin to the LAG window function in SQL. MongoDB has the $setWindowFields stage for this (available from version 5.0), combined with the $shift window operator.
Not sure about fields in your collection, but this should give you an idea.
{
$setWindowFields: {
partitionBy: "$device._id", //1. partition the data based on $device._id
sortBy: { time: 1 }, //2. within each partition, sort based on $time
output: {
"shiftedEvent": { //3. add a new field shiftedEvent to each document
$shift: {
output: "$event", //4. whose value is previous $event
by: -1
}
}
}
}
}
Then, you can compare the event and shiftedEvent fields.
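A fuller pipeline built around that stage might look like the sketch below. It is an illustration only: buildTransitionPipeline and transitions are made-up names, "event" is the assumed state field from the stage above, and "time" and "device._id" are taken from the question.

```javascript
// Hypothetical helper: counts real state changes per device between start and end.
function buildTransitionPipeline(start, end) {
  return [
    // Upper bound only: events before `start` must stay so $shift can see them.
    { $match: { time: { $lt: end } } },
    {
      $setWindowFields: {
        partitionBy: "$device._id",
        sortBy: { time: 1 },
        output: { shiftedEvent: { $shift: { output: "$event", by: -1 } } }
      }
    },
    // Keep in-range events whose state differs from the previous event's state.
    { $match: { time: { $gte: start }, $expr: { $ne: ["$event", "$shiftedEvent"] } } },
    // Count the remaining events, i.e. the transitions, per device.
    { $group: { _id: "$device._id", transitions: { $sum: 1 } } }
  ];
}

// Usage:
// db.devicemetrics.aggregate(buildTransitionPipeline(
//   ISODate('2023-01-18T07:00:00.000Z'), ISODate('2023-01-19T07:00:00.000Z')))
```

Two caveats: the very first event of a device has no predecessor, so shiftedEvent is null and it counts as a transition; and without a lower bound in the first $match the window runs over the device's whole history, so on millions of records a lower bound with some safety margin before start may be needed.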

Calculate amount of minutes between multiple date ranges, but don't calculate the overlapping dates in MongoDB

I am creating a way to generate reports of the amount of time equipment was down for, during a given time frame. I will potentially have 100s to thousands of documents to work with. Every document will have a start date and end date, both in BSON format and will generally be within minutes of each other. For simplicity sake I am also zeroing out the seconds.
The actual aggregation I need to do, is I need to calculate the amount of minutes between each given date, but there may be other documents with overlapping dates. Any overlapping time should not be calculated if it's been calculated already. There are various other aggregations I'll need to do, but this is the only one that I'm unsure of, if it's even possible at all.
{
"StartTime": "2020-07-07T18:10:00.000Z",
"StopTime": "2020-07-07T18:13:00.000Z",
"TotalMinutesDown": 3,
"CreatedAt": "2020-07-07T18:13:57.675Z"
}
{
"StartTime": "2020-07-07T18:12:00.000Z",
"StopTime": "2020-07-07T18:14:00.000Z",
"TotalMinutesDown": 2,
"CreatedAt": "2020-07-07T18:13:57.675Z"
}
The two documents above are examples of what I'm working with. Every document gets the total amount of minutes between the two dates stored in the document (This field serves another purpose, unrelated). If I were to aggregate this to get total minutes down, the output of total minutes should be 4, as I'm not wanting to calculate the overlapping minutes.
Finding overlap of time ranges sounds to me a bit abstract. Let's try to convert it to a concept that databases are usually used for: discrete values.
If we convert the times to discrete value, we will be able to find the duplicate values, i.e. the "overlapping values" and eliminate them.
I'll illustrate the steps using your sample data. Since you have zeroed out the seconds, for simplicity sake, we can start from there.
Since we care about minute increments we are going to convert times to "minutes" elapsed since the Unix epoch.
{
"StartMinutes": 26569090,
"StopMinutes": 26569093
}
{
"StartMinutes": 26569092,
"StopMinutes": 26569094
}
We convert them to discrete values, one per minute of downtime (the stop minute itself is excluded)
{
"minutes": [26569090, 26569091, 26569092]
}
{
"minutes": [26569092, 26569093]
}
Then we can do a set union on all the arrays
{
"allMinutes": [26569090, 26569091, 26569092, 26569093]
}
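The conversion steps above can be checked outside the database with a few lines of plain JavaScript. This is only a sketch to verify the arithmetic on the sample data; minutesCovered and totalMinutesDown are made-up names, and the stop minute is treated as exclusive.

```javascript
const MS_PER_MINUTE = 60000;

// Lists every whole minute (since the Unix epoch) in [start, stop).
function minutesCovered(start, stop) {
  const first = Math.floor(Date.parse(start) / MS_PER_MINUTE);
  const last = Math.floor(Date.parse(stop) / MS_PER_MINUTE);
  const minutes = [];
  for (let m = first; m < last; m++) {
    minutes.push(m);
  }
  return minutes;
}

// Set union over all ranges: an overlapping minute is only counted once.
function totalMinutesDown(ranges) {
  const allMinutes = new Set();
  for (const { StartTime, StopTime } of ranges) {
    for (const m of minutesCovered(StartTime, StopTime)) {
      allMinutes.add(m);
    }
  }
  return allMinutes.size;
}

const sample = [
  { StartTime: "2020-07-07T18:10:00.000Z", StopTime: "2020-07-07T18:13:00.000Z" },
  { StartTime: "2020-07-07T18:12:00.000Z", StopTime: "2020-07-07T18:14:00.000Z" }
];
console.log(totalMinutesDown(sample)); // 4
```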
This is how we can get to the solution using aggregation. I have simplified the queries and grouped some operations together
db.collection.aggregate([
  {
    $project: {
      minutes: {
        $range: [
          {
            $divide: [{ $toLong: "$StartTime" }, 60000] // convert to minutes timestamp
          },
          {
            $divide: [{ $toLong: "$StopTime" }, 60000]
          }
        ]
      }
    }
  },
  {
    $group: { // combine to one document
      _id: null,
      _temp: { $push: "$minutes" }
    }
  },
  {
    $project: {
      totalMinutes: {
        $size: { // get the size of the union set
          $reduce: {
            input: "$_temp",
            initialValue: [],
            in: {
              $setUnion: ["$$value", "$$this"] // combine the values using set union
            }
          }
        }
      }
    }
  }
])

Speed up aggregation on large collection

I currently have a database with about 270 000 000 documents. They look like this:
[{
'location': 'Berlin',
'product': 4531,
'createdAt': ISODate(...),
'value': 3523,
'minOffer': 3215,
'quantity': 7812
},{
'location': 'London',
'product': 1231,
'createdAt': ISODate(...),
'value': 53523,
'minOffer': 44215,
'quantity': 2812
}]
The database currently holds a bit over one month of data and has ~170 locations (in EU and US) with ~8000 products. These documents represent timesteps, so there are about ~12-16 entries per day, per product per location (at most 1 per hour though).
My goal is to retrieve all timesteps of a product in a given location for the last 7 days. For a single location this query works reasonable fast (150ms) with the index { product: 1, location: 1, createdAt: -1 }.
However, I also need these timesteps not just for a single location, but an entire region (so about 85 locations). I'm currently doing that with this aggregation, which groups all the entries per hour and averages the desired values:
this.db.collection('...').aggregate([
{ $match: { location: { $in: [array of ~85 locations] }, product: productId, createdAt: { $gte: new Date(Date.now() - sevenDaysAgo) } } }, {
$group: {
_id: {
$toDate: {
$concat: [
{ $toString: { $year: '$createdAt' } },
'-',
{ $toString: { $month: '$createdAt' } },
'-',
{ $toString: { $dayOfMonth: '$createdAt' } },
' ',
{ $toString: { $hour: '$createdAt' } },
':00'
]
}
},
value: { $avg: '$value' },
minOffer: { $avg: '$minOffer' },
quantity: { $avg: '$quantity' }
}
}
]).sort({ _id: 1 }).toArray()
However, this is really really slow, even with the index { product: 1, createdAt: -1, location: 1 } (~40 secs). Is there any way to speed up this aggregation so it goes down to a few seconds at most? Is this even possible, or should I think about using something else?
I've thought about saving these aggregations in another database and just retrieving that and aggregating the rest, this is however really awkward for the first users on the site who have to sit 40 secs through waiting.
These are some ideas that can benefit querying and performance. Whether they all work together is a matter of trial and testing. Also, note that changing the way data is stored and adding new indexes means there will be changes to the application (i.e., to how data is captured), and the other queries on the same data need to be carefully verified so that they are not adversely affected.
(A) Storing a Day's Details in a Document:
Store (embed) a day's data within the same document as an array of sub-documents. Each sub-document represents an hour's entry.
From:
{
'location': 'London',
'product': 1231,
'createdAt': ISODate(...),
'value': 53523,
'minOffer': 44215,
'quantity': 2812
}
to:
{
location: 'London',
product: 1231,
createdAt: ISODate(...),
details: [ { value: 53523, minOffer: 44215, quantity: 2812 }, ... ]
}
This means about ten entries per document. Adding data for an entry will be pushing data into the details array, instead of adding a document as in present application. In case the hour's info (time) is required it can also be stored as part of the details sub-document; it will entirely depend upon your application needs.
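Writing into such a per-day document could be done with an upsert that pushes into the details array. The helper below is a sketch, not part of the original design: buildDailyBucketUpsert and the time field inside each entry are made up, and it assumes createdAt on the parent document holds the day (midnight UTC).

```javascript
// Hypothetical helper: returns the arguments for an updateOne call that appends
// one reading to the per-day document, creating the document if it is missing.
function buildDailyBucketUpsert(reading) {
  // Truncate the timestamp to midnight UTC so all readings of a day share one document.
  const day = new Date(Date.UTC(
    reading.createdAt.getUTCFullYear(),
    reading.createdAt.getUTCMonth(),
    reading.createdAt.getUTCDate()
  ));
  return {
    filter: { location: reading.location, product: reading.product, createdAt: day },
    update: {
      $push: {
        details: {
          time: reading.createdAt, // optional: keep the entry's exact time
          value: reading.value,
          minOffer: reading.minOffer,
          quantity: reading.quantity
        }
      }
    },
    options: { upsert: true }
  };
}

// Usage with the Node.js driver:
// const { filter, update, options } = buildDailyBucketUpsert(reading);
// await db.collection('...').updateOne(filter, update, options);
```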
The benefits of this design:
The number of documents to maintain and query will reduce (per product per day about ten documents).
In the query, the group stage will go away. This will be just a project stage. Note that the $project supports accumulators $avg and $sum.
The following stage will create the averages for the day (that is, for one document) from the details array.
{
$project: { value: { $avg: '$details.value' }, minOffer: { $avg: '$details.minOffer' }, quantity: { $avg: '$details.quantity' } }
}
Note the increase in size of the document is not much, with the amount of details being stored per day.
(B) Querying by Region:
At present, multiple locations (a region) are matched with this query filter: { location: { $in: [array of ~85 locations] } }. This filter means: location is location-1, or location-2, ..., or location-N. Adding a new field, region, lets the query filter on a single matching value instead.
The query by region will change to:
{
$match: {
region: regionId,
product: productId,
createdAt: { $gte: new Date(Date.now() - sevenDaysAgo) }
}
}
The regionId variable is to be supplied to match with the region field.
Note that, both the queries, "by location" and "by region", will benefit with the above two considerations, A and B.
(C) Indexing Considerations:
The present index: { product: 1, location: 1, createdAt: -1 }.
Taking into consideration, the new field region, newer indexing will be needed. The query with region cannot benefit without an index on the region field. A second index will be needed; a compound index to suit the query. Creating an index with the region field means additional overhead on write operations. Also, there will be memory and storage considerations.
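One possible shape for that second index is sketched below. The key order is an assumption, not something measured on this data set: it puts the two equality fields first and the range field (createdAt) last, and regionIndexSpec is a made-up name.

```javascript
// Assumed compound index for the "by region" query:
// equality matches on region and product, then a range on createdAt.
const regionIndexSpec = { region: 1, product: 1, createdAt: -1 };

// Usage (the collection name is a placeholder):
// db.collection('...').createIndex(regionIndexSpec)
```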
NOTES:
After adding the index, both the queries ("by location" and "by region") need to be verified using explain if they are using their respective indexes. This will require some testing; a trial-and-error process.
Again, adding new data, storing data in a different format, and adding new indexes all require the following:
Careful testing and verifying that the other existing queries perform as usual.
The change in data capture needs.
Testing the new queries and verifying if the new design performs as expected.
Honestly your aggregation is pretty much as optimized as it can get, especially if you have { product: 1, createdAt: -1, location: 1 } as an index like you stated.
I'm not exactly sure how your entire product is built, however the best solution in my opinion is to have another collection containing just the "relevant" documents from the past week.
Then you could query that collection with ease. This is quite easy to do in Mongo as well, using a TTL index.
If this is not an option, you could add a temporary field to the "relevant" documents and query on that, making it somewhat faster to retrieve them. However, maintaining this field will require a process running every X amount of time, which could make your results not 100% accurate, depending on when you decide to run it.
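For the TTL approach, the index on the "last week" collection could be declared roughly like this. The collection name recentTimesteps is made up, and the sketch assumes createdAt is stored as a BSON date and that the application writes each new document to both collections.

```javascript
// Documents expire once createdAt is more than seven days in the past.
const SEVEN_DAYS_IN_SECONDS = 7 * 24 * 60 * 60; // 604800

const ttlIndexKeys = { createdAt: 1 }; // a TTL index must be a single-field index
const ttlIndexOptions = { expireAfterSeconds: SEVEN_DAYS_IN_SECONDS };

// Usage (hypothetical collection name):
// db.collection('recentTimesteps').createIndex(ttlIndexKeys, ttlIndexOptions)
```

Expired documents are removed by a background task that runs periodically, so a few documents slightly older than seven days may still be present; keeping the createdAt filter in the query covers that.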

MongoDB: aggregate and group by splitting the id

My schema implementation is influenced from this tutorial on official mongo site
{
_id: String,
data:[
{
point_1: Number,
ts: Date
}
]
}
This is basically a schema designed for time series data: I store the data for each hour per device in an array in a single document. I create the _id field by combining the id of the device that is sending the data with the time. For example, if a device with id xyz1234 sends data at 2018-09-11 12:30:00, then my _id field becomes xyz1234:2018091112.
I create new doc if the document for that hour for that device doesn't exist otherwise I just push my data to the data array.
client.db('iot')
.collection('iotdata')
.update({_id:id},{$push:{data:{point_1,ts:date}}},{upsert:true});
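For reference, the _id described above could be built with a small helper like this one. It is a sketch: bucketId is a made-up name, and it assumes the hour is taken in UTC, which the question does not state.

```javascript
// Hypothetical helper: builds "<deviceId>:<YYYYMMDDHH>" for the hour a reading falls in.
function bucketId(deviceId, date) {
  const pad = (n) => String(n).padStart(2, "0");
  const hourKey =
    String(date.getUTCFullYear()) +
    pad(date.getUTCMonth() + 1) + // getUTCMonth() is zero-based
    pad(date.getUTCDate()) +
    pad(date.getUTCHours());
  return `${deviceId}:${hourKey}`;
}

console.log(bucketId("xyz1234", new Date("2018-09-11T12:30:00Z"))); // xyz1234:2018091112
```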
Now I am facing problem while doing aggregation. I am trying to get these types of values
Min point_1 value for many devices in last 24 hours by grouping on device id
Max point_1 value for many devices in last 24 hours by grouping on device id
Average point_1 for many devices in last 24 hours by grouping on device id
I thought this is very simple aggregation then I realized device id is not direct but mixed with time data so it's not so direct to group data by device id. How can I split the _id and group based on device id? I tried my level best to write the question as clearly as possible so please ask questions in comments if any part of the question is not clear.
You can start with $unwind on data to get single document per entry. Then you can get deviceId using $substr and $indexOfBytes operators. Then you can apply your filtering condition (last 24 hours) and use $group to get min, max and avg
db.col.aggregate([
{
$unwind: "$data"
},
{
$project: {
point_1: "$data.point_1",
deviceId: { $substr: [ "$_id", 0, { $indexOfBytes: [ "$_id", ":" ] } ] },
dateTime: "$data.ts"
}
},
{
$match: {
dateTime: { $gte: ISODate("2018-09-10T12:00:00Z") }
}
},
{
$group: {
_id: "$deviceId",
min: { $min: "$point_1" },
max: { $max: "$point_1" },
avg: { $avg: "$point_1" }
}
}
])
You can use below query in 3.6.
db.colname.aggregate([
{"$project":{
"deviceandtime":{"$split":["$_id", ":"]},
"minpoint":{"$min":"$data.point_1"},
"maxpoint":{"$max":"$data.point_1"},
"sumpoint":{"$sum":"$data.point_1"},
"count":{"$size":"$data.point_1"}
}},
// the time part of _id has the form YYYYMMDDHH, so the cutoff must use the same format
{"$match":{"$expr":{"$gte":[{"$arrayElemAt":["$deviceandtime",1]},"2018091000"]}}},
{"$group":{
"_id":{"$arrayElemAt":["$deviceandtime",0]},
"minpoint":{"$min":"$minpoint"},
"maxpoint":{"$max":"$maxpoint"},
"sumpoint":{"$sum":"$sumpoint"},
"countpoint":{"$sum":"$count"}
}},
{"$project":{
"minpoint":1,
"maxpoint":1,
"avgpoint":{"$divide":["$sumpoint","$countpoint"]}
}}
])

Filter large dataset base on aggregation result

I need to build a sort of "Advanced Search" functionality with MongoDB. It's a sports system, where player statistics are collected for each season like this:
{
player: {
id: int,
name: string
},
goals: int,
season: int
}
Users can search data across seasons, for example: I want to search for players who scored > 30 goals from season 2012 - 2016.
I could use mongodb aggregation:
db.stats.aggregate( [
{ $match: { season: { $gte: 2014, $lte: 2016 } } },
{ $group: { _id: "$player", totalGoals: { $sum: "$goals" } } },
{ $match: { totalGoals: { $gte: 30 } } },
{ $limit: 10 },
{ $skip: 0 }
] )
That's working fine, the speed is acceptable for the collections with more than 3 millions records.
However, if the user wants to search a larger range of seasons, say a player's lifetime statistics, the aggregation turns out to be very slow. I understand that MongoDB has to go through all the docs and calculate totalGoals.
I just wonder if there is better approach that could solve this performance problem?
You can keep pre-calculated data for past seasons and make a two-step query:
a) get past data
b) get current data
You could try to optimise the indexes for that query.
Hardware: use SSDs.
Hardware: add more memory.
Introduce sharding to split the load.
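The two-step idea could be combined in application code along these lines. This is a sketch with assumed shapes: both inputs are arrays of { _id, totalGoals } where _id is the player's id (i.e. grouping on "$player.id" rather than the whole player object), and mergeGoalTotals is a made-up name.

```javascript
// Hypothetical helper: adds pre-calculated totals for past seasons to the totals
// aggregated for the current season, then keeps players at or above minGoals.
function mergeGoalTotals(pastTotals, currentTotals, minGoals) {
  const totals = new Map();
  for (const { _id, totalGoals } of [...pastTotals, ...currentTotals]) {
    totals.set(_id, (totals.get(_id) || 0) + totalGoals);
  }
  return [...totals]
    .filter(([, goals]) => goals >= minGoals)
    .map(([_id, totalGoals]) => ({ _id, totalGoals }));
}

// Usage:
// const past = await db.collection('pastTotals').find(...).toArray();   // step a, hypothetical collection
// const current = await db.collection('stats').aggregate([...]).toArray(); // step b, current season only
// const result = mergeGoalTotals(past, current, 30);
```

The threshold has to be applied after merging, since a player may only pass it once past and current goals are added together.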