I have time-series data arriving as events at random times. They are not ongoing metrics, but discrete events: "This device went online." "This device went offline."
I need to report on the number of actual state transitions within a time range. Because there are occasionally same-state events, for example two "went online" events in a row, I need to "seed" the data with the state immediately preceding the time range. If I have events in my time range, I need to compare them to the state before the time range in order to determine whether something actually changed.
I already have aggregation stages that remove same-state events.
Is there a way to add "the latest, previous event" to the data in the pipeline without writing two queries? A $facet stage totally ruins performance.
For "previous", I'm currently trying something like this in a separate query, but it's very slow on the millions of records:
// Get the latest event before a given date
db.devicemetrics.aggregate([
{
$match: {
'device.someMetadata': '70b28808-da2b-4623-ad83-6cba3b20b774',
time: {
$lt: ISODate('2023-01-18T07:00:00.000Z'),
},
someValue: { $ne: null },
},
},
{
$group: {
_id: '$device._id',
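      // note: $last follows the incoming document order; without a $sort stage
      // before this $group, "last" is not guaranteed to be the latest by time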
lastEvent: { $last: '$$ROOT' },
},
},
{
$replaceRoot: { newRoot: '$lastEvent' },
}
]);
You are looking for something akin to the LAG window function in SQL. MongoDB has $setWindowFields for this, combined with the $shift window operator.
I'm not sure about the exact fields in your collection, but this should give you an idea.
{
$setWindowFields: {
partitionBy: "$device._id", //1. partition the data based on $device._id
sortBy: { time: 1 }, //2. within each partition, sort based on $time
output: {
"shiftedEvent": { //3. add a new field shiftedEvent to each document
$shift: {
output: "$event", //4. whose value is previous $event
by: -1
}
}
}
}
}
Then, you can compare the event and shiftedEvent fields.
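For example, a follow-up stage that keeps only the real transitions could look like this (assuming the state lives in an event field, as in the stage above):
{
  $match: {
    // keep documents whose state differs from the previous event's state
    $expr: { $ne: ["$event", "$shiftedEvent"] }
  }
}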
I want to be able to retrieve every nth item of a given collection, which is quite large (millions of records).
Here is a sample document from my collection:
{
_id: ObjectId("614965487d5d1c55794ad324"),
hour: ISODate("2021-09-21T17:21:03.259Z"),
searches: [
ObjectId("614965487d5d1c55794ce670")
]
}
The start of my aggregation looks like this:
[
{
$match: {
searches: {
$in: [ObjectId('614965487d5d1c55794ce670')],
},
},
},
{ $sort: { hour: -1 } },
{ $project: { hour: 1 } },
...
]
I have tried many things, including:
$sample, which does not pick the documents in the right order
Using $skip, which becomes very slow as the number given to skip grows
Using _id instead of $skip, but my ids are unfortunately not created in an ordered manner
My goal is thus to retrieve the hour of a record every 20,000 records, so that I can then make calls to retrieve the data in chunks of approximately 20,000 records.
I imagine it would be possible to:
sort and number every record, then keep only the 1st, the 20,000th, the 40,000th, ..., and the last
Thanks for your help, and let me know if you need more information.
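For what it's worth, here is a minimal sketch of that numbering idea, assuming MongoDB 5.0+ (for $setWindowFields and $documentNumber); the collection name is a placeholder and field names are taken from the question:
db.collection.aggregate([
  {
    $match: {
      searches: {
        $in: [ObjectId('614965487d5d1c55794ce670')],
      },
    },
  },
  {
    // number every matching document by descending hour
    $setWindowFields: {
      sortBy: { hour: -1 },
      output: { n: { $documentNumber: {} } },
    },
  },
  // keep documents 1, 20001, 40001, ... as chunk boundaries
  { $match: { $expr: { $eq: [{ $mod: ["$n", 20000] }, 1] } } },
  { $project: { hour: 1 } },
])
Picking up the very last document as well would still need an extra step (or a second, cheap query).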
I am creating a way to generate reports of the amount of time equipment was down during a given time frame. I will potentially have hundreds to thousands of documents to work with. Every document has a start date and an end date, both in BSON date format, and they will generally be within minutes of each other. For simplicity's sake I am also zeroing out the seconds.
The actual aggregation I need to do is to calculate the number of minutes between each pair of dates, but there may be other documents with overlapping dates. Any overlapping time should not be counted twice. There are various other aggregations I'll need to do, but this is the only one I'm unsure of, if it's even possible at all.
{
"StartTime": "2020-07-07T18:10:00.000Z",
"StopTime": "2020-07-07T18:13:00.000Z",
"TotalMinutesDown": 3,
"CreatedAt": "2020-07-07T18:13:57.675Z"
}
{
"StartTime": "2020-07-07T18:12:00.000Z",
"StopTime": "2020-07-07T18:14:00.000Z",
"TotalMinutesDown": 2,
"CreatedAt": "2020-07-07T18:13:57.675Z"
}
The two documents above are examples of what I'm working with. Every document stores the total number of minutes between its two dates (this field serves another, unrelated purpose). If I were to aggregate these to get the total minutes down, the output should be 4, since I don't want to count the overlapping minutes twice.
Finding overlaps of time ranges sounds a bit abstract. Let's convert it to a concept databases are routinely used for: discrete values.
If we convert the times to discrete values, we can find the duplicate values, i.e. the "overlapping" minutes, and eliminate them.
I'll illustrate the steps using your sample data. Since you have zeroed out the seconds, for simplicity's sake we can start from there.
Since we care about minute increments, we convert the times to minutes elapsed since the Unix epoch.
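For example, 2020-07-07T18:10:00.000Z is 1594145400000 milliseconds since the epoch, and 1594145400000 / 60000 = 26569090.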
{
  "StartMinutes": 26569090,
  "StopMinutes": 26569093
}
{
  "StartMinutes": 26569092,
  "StopMinutes": 26569094
}
We convert them to discrete values, where each minute of downtime is identified by its starting minute (so the stop minute itself is excluded):
{
"minutes": [26569090, 26569091, 26569092]
}
{
"minutes": [26569092, 26569093]
}
Then we can do a set union on all the arrays
{
"allMinutes": [26569090, 26569091, 26569092, 26569093]
}
Here is how we can get to the solution using aggregation. I have simplified the queries and grouped some operations together:
db.collection.aggregate([
{
$project: {
minutes: {
$range: [
{
$divide: [{ $toLong: "$StartTime" }, 60000] // convert to minutes timestamp
},
{
$divide: [{ $toLong: "$StopTime" }, 60000]
}
]
},
}
},
{
$group: { // combine to one document
_id: null,
_temp: { $push: "$minutes" }
}
},
{
$project: {
totalMinutes: {
$size: { // get the size of the union set
$reduce: {
input: "$_temp",
initialValue: [],
in: {
$setUnion: ["$$value", "$$this"] // combine the values using set union
}
}
}
}
}
}
])
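On the two sample documents this produces totalMinutes: 4, matching the expected output.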
I have a problem:
I have a set of documents which represent "completions of a task".
Each such completion has a user assigned to it, and a time the completion took.
I need to group my documents by user and then sort them by the accumulated time, and this works fine:
const chartsAggregation = [
{
$group: {
_id: '$user',
totalTime: { $sum: '$totalTime' },
},
},
{
$sort: {
totalTime: -1,
},
},
{
$addFields: {
placement: { $inc: 1 }, // This does not work
},
},
];
However, I need to "burn in" the placement after sorting, the "rank" so to speak.
The reason is that I want to display a "charts page" with the people who took the most time on top. This page needs to be searchable and paginated, so people can find themselves and their placement.
As I need to run search queries and limits (for the pagination) later, the actual positions of my users in the resulting array are of no use to me.
I want to add a field (I tried this in the $addFields portion) that associates the placement in the list with the data set, so that even if I later filter and limit the results, the original placement stays intact.
All I need is an incrementing counter within the $addFields statement, but I can't find a way to do this. There doesn't seem to be anything like that in the documentation.
Can you help me?
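For reference, on MongoDB 5.0+ the $setWindowFields stage with the $documentNumber window operator can burn such a rank into each document; a minimal sketch based on the pipeline above:
const chartsAggregation = [
  {
    $group: {
      _id: '$user',
      totalTime: { $sum: '$totalTime' },
    },
  },
  {
    // assign a 1-based placement by descending totalTime
    $setWindowFields: {
      sortBy: { totalTime: -1 },
      output: { placement: { $documentNumber: {} } },
    },
  },
  {
    $sort: {
      totalTime: -1,
    },
  },
];
Later $match / $skip / $limit stages for search and pagination then leave placement intact; $rank can be swapped in for $documentNumber if ties should share a placement.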
Below I have a structure for supporting custom picklist fields (in this example) within my sails.js application. The general idea is we support a collection of custom picklist values on any model within the app and the customer can have total control of the configuration of the custom field.
I went with this relationship model, as using a simple JSON field lacks robustness when it comes to updating each individual custom picklist value. If I allow a customer to change "Internal" to "External", I need to update all records that have the value "Internal" recorded against that custom picklist with the new value.
This way, when I update the "value" field of a CustomPicklistValue, wherever that record is referenced via its ID it will use the new value.
Now the problem comes when I need to integrate this model into my existing report engine...
rawCollection
.aggregate(
[
{
$match: {
createdAt: {
$gte: rangeEndDate,
$lte: rangeStartDate
},
...$match
}
},
{
$project: {
...$project,
total: $projectAggregation
}
},
{
$group: {
_id: {
...$groupKey
},
total: {
[`$${aggrAttrFunc}`]: "$total"
}
}
}
],
{
cursor: {
batchSize: 100
}
}
)
Here is the main part of a method for retrieving and aggregating any models stored in my MongoDB instance. A user can specify a whole range of things, including but not limited to the model, field-specific date ranges, and filters such as "where certificate status equals expired".
So I'm now presented with this data structure:
{
id: '5e5fb732a9422a001146509f',
customPicklistValues: [
{
id: '5e4e904f16ab94bff1a324a0',
value: 'Internal',
fieldName: 'Business Group',
customPicklist: '109c7a1a9d00b664f2ee7827'
},
{
id: '5e4e904f16ab94bff1a324a4',
value: 'Slack',
fieldName: 'Application',
customPicklist: '109c5a1a9d00b664f2ee7827'
}
],
}
And for the life of me I can't work out whether there's any way I can essentially pull out fieldName and value for each of the populated records as key-value pairs and add them to the parent record before running my match clause...
I think I need to use $lookup to populate the customPicklistValues first and then merge them somehow?
Any help appreciated.
EDIT:
#whoami has suggested I use $addFields. There was a fair amount I needed to do before $addFields to populate the linked records (due to how Waterline via sails.js handles saving Mongo ObjectIDs in related collections as strings); I worked through those steps in Compass.
The final step would be to edit this, or add a stage to it, to actually support a key:value pair like Business Group: "Finance" in this example.
You can try these stages after your $lookup stage:
db.collection.aggregate([
{
$addFields: {
customPicklistValues:
{
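        // build e.g. { 'Business Group': 'Internal', 'Application': 'Slack' } from the array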
$arrayToObject: {
$map: {
input: '$customPicklistValues',
in: { k: '$$this.fieldName', v: '$$this.value' }
}
}
}
}
},
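  // merge those pairs into the root document ($$ROOT's existing fields win any name clash), then drop the original array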
{ $replaceRoot: { newRoot: { $mergeObjects: ['$customPicklistValues', '$$ROOT'] } } },
{ $project: { customPicklistValues: 0 } }
])
There is a collection in Mongo with 40 million records.
db.getCollection('feedposts').aggregate([
{
"$match": {
"$or": [
{
"isOfficial": true
},
{
"creator": ObjectId("537f267c984539401ff448d2"),
type: { $nin: ['challenge_answer', 'challenge_win']}
}
],
}
},
{
$sort: {timeline: -1}
}
])
This request never finishes.
But if you add a limit before sorting, with the limit deliberately higher than the total number of records, for example 1,000,000,000,000,000, the request completes instantly:
db.getCollection('feedposts').aggregate([
{
"$match": {
"$or": [
{
"isOfficial": true
},
{
"creator": ObjectId("537f267c984539401ff448d2"),
type: { $nin: ['challenge_answer', 'challenge_win']}
}
],
}
},
{
$limit: 10000000000000000
},
{
$sort: {timeline: -1}
}
])
Please tell me why this is happening.
What problems can I expect in the future if I leave it this way?
TLDR: Mongo is using the wrong index for the query
Why is this happening?
Basically, for every query, Mongo runs a quick "competition" between the relevant indexes in order to choose which one to use; the first index to retrieve 1001 documents "wins".
Usually this wrong-index situation occurs with ascending or descending fields and a matching index, which under certain conditions lets that index win the fetching competition. This is very risky, as stable code can suddenly become a huge bottleneck.
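You can watch this competition yourself by running the pipeline through explain, e.g.:
// shows the winning plan and the rejected candidate plans
db.getCollection('feedposts').explain('allPlansExecution').aggregate([
  { $match: { /* same $or match as above */ } },
  { $sort: { timeline: -1 } }
])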
What can we do?
You have a few options:
Use the hint option to make Mongo use the compound index you have ready for this pipeline (see the sketch after this list).
Drop the rogue index to ensure this will never happen again elsewhere (which is my recommended option).
Keep doing what you're doing. Basically, by adding this huge $limit stage you're throwing Mongo's competition off and ensuring the right index will be picked.
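A sketch of the hint option (the index name here is hypothetical; substitute the real name of your compound index):
db.getCollection('feedposts').aggregate(
  [
    { $match: { /* same $or match as above */ } },
    { $sort: { timeline: -1 } }
  ],
  { hint: 'creator_1_timeline_-1' } // hypothetical compound index name
)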