MongoDB: selecting every nth item of a given sorted aggregation

I want to be able to retrieve every nth item of a given collection, which is quite large (millions of records).
Here is a sample document from my collection:
{
  _id: ObjectId("614965487d5d1c55794ad324"),
  hour: ISODate("2021-09-21T17:21:03.259Z"),
  searches: [
    ObjectId("614965487d5d1c55794ce670")
  ]
}
The start of my aggregation pipeline looks like this:
[
  {
    $match: {
      searches: {
        $in: [ObjectId('614965487d5d1c55794ce670')],
      },
    },
  },
  { $sort: { hour: -1 } },
  { $project: { hour: 1 } },
  ...
]
I have tried many things, including:
$sample, which does not pick the items in the right order
Using $skip, which becomes very slow as the number given to skip grows
Using _id ranges instead of $skip, but my ids are unfortunately not created in an ordered manner
My goal is thus to retrieve the hour of every 20,000th record, so that I can then make calls to retrieve the data in chunks of approximately 20,000 records.
I imagine it would be possible to sort, number every record, and then keep only the first, the 20000th, the 40000th, ..., and the last.
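Something like this sketch is what I have in mind (assuming MongoDB 5.0+, since it relies on $setWindowFields and $documentNumber; untested):
[
  {
    $match: {
      searches: { $in: [ObjectId('614965487d5d1c55794ce670')] },
    },
  },
  {
    // number every document in sorted order
    $setWindowFields: {
      sortBy: { hour: -1 },
      output: { n: { $documentNumber: {} } },
    },
  },
  {
    // keep documents 1, 20001, 40001, ...
    $match: { $expr: { $eq: [{ $mod: ['$n', 20000] }, 1] } },
  },
  { $project: { hour: 1 } },
]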
Thanks for your help and let me know if you need more information

Related

Calculate amount of minutes between multiple date ranges, but don't calculate the overlapping dates in MongoDB

I am creating a way to generate reports of the amount of time equipment was down during a given time frame. I will potentially have hundreds to thousands of documents to work with. Every document has a start date and an end date, both in BSON format, and they will generally be within minutes of each other. For simplicity's sake I am also zeroing out the seconds.
The actual aggregation I need to do is to calculate the number of minutes between each pair of dates, but there may be other documents with overlapping dates. Any overlapping time should not be counted if it has already been counted. There are various other aggregations I'll need to do, but this is the only one I'm unsure about, if it's even possible at all.
{
  "StartTime": "2020-07-07T18:10:00.000Z",
  "StopTime": "2020-07-07T18:13:00.000Z",
  "TotalMinutesDown": 3,
  "CreatedAt": "2020-07-07T18:13:57.675Z"
}
{
  "StartTime": "2020-07-07T18:12:00.000Z",
  "StopTime": "2020-07-07T18:14:00.000Z",
  "TotalMinutesDown": 2,
  "CreatedAt": "2020-07-07T18:13:57.675Z"
}
The two documents above are examples of what I'm working with. Every document stores the total number of minutes between its two dates (this field serves another, unrelated purpose). If I were to aggregate these to get total minutes down, the output should be 4, as I don't want to count the overlapping minutes.
Finding the overlap of time ranges sounds a bit abstract to me. Let's convert it to a concept that databases are commonly used for: discrete values.
If we convert the times to discrete values, we will be able to find the duplicate values, i.e. the "overlapping values", and eliminate them.
I'll illustrate the steps using your sample data. Since you have zeroed out the seconds, for simplicity's sake, we can start from there.
Since we care about minute increments, we are going to convert the times to minutes elapsed since the Unix epoch. For example, 2020-07-07T18:10:00Z is 1594145400000 ms since the epoch, and 1594145400000 / 60000 = 26569090.
{
  "StartMinutes": 26569090,
  "StopMinutes": 26569093
}
{
  "StartMinutes": 26569092,
  "StopMinutes": 26569094
}
We convert them to discrete values:
{
  "minutes": [26569090, 26569091, 26569092]
}
{
  "minutes": [26569092, 26569093]
}
Then we can do a set union on all the arrays:
{
  "allMinutes": [26569090, 26569091, 26569092, 26569093]
}
This is how we can get to the solution using aggregation. I have simplified the queries and grouped some operations together:
db.collection.aggregate([
  {
    $project: {
      minutes: {
        $range: [
          { $divide: [{ $toLong: "$StartTime" }, 60000] }, // convert to a minutes timestamp
          { $divide: [{ $toLong: "$StopTime" }, 60000] }
        ]
      }
    }
  },
  {
    $group: { // combine into one document
      _id: null,
      _temp: { $push: "$minutes" }
    }
  },
  {
    $project: {
      totalMinutes: {
        $size: { // get the size of the union set
          $reduce: {
            input: "$_temp",
            initialValue: [],
            in: {
              $setUnion: ["$$value", "$$this"] // combine the values using set union
            }
          }
        }
      }
    }
  }
])
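On the two sample documents above, this pipeline should output totalMinutes: 4, matching the expected result.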
Mongo Playground

How to add a counter in a MongoDB aggregate stage?

I have a problem:
I have a set of documents which represent "completions of a task".
Each such completion has a user assigned to it, and the time the completion took.
I need to group my documents by user and then sort by the accumulated time, and this works fine:
const chartsAggregation = [
  {
    $group: {
      _id: '$user',
      totalTime: { $sum: '$totalTime' },
    },
  },
  {
    $sort: {
      totalTime: -1,
    },
  },
  {
    $addFields: {
      placement: { $inc: 1 }, // This does not work
    },
  },
];
However, I need to "burn in" the placement after sorting, the "rank" so to speak.
The reason is that I want to display a "charts page" with the people who took the most time on top. This page needs to be searchable and paginated, so people can find themselves and their placement.
As I need to apply search queries and limits (for the pagination) later, the actual positions of my users in the resulting array are of no use to me.
I want to add a field (I tried this in the $addFields portion) that associates the placement in the list with the data set, so that even if I later filter and limit the results, the original placement stays intact.
All I need for this is an incrementing counter within the $addFields statement, but I can't find a way to do this. There doesn't seem to be anything like that in the documentation.
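The only workaround I can imagine (a rough sketch, untested) is to push the sorted results into a single array and use $unwind with includeArrayIndex to burn in the position:
const chartsAggregation = [
  { $group: { _id: '$user', totalTime: { $sum: '$totalTime' } } },
  { $sort: { totalTime: -1 } },
  // collect the sorted rows into one array, so each row's index is its rank
  { $group: { _id: null, rows: { $push: '$$ROOT' } } },
  // unwind and expose the array index as "placement" (0-based)
  { $unwind: { path: '$rows', includeArrayIndex: 'placement' } },
  {
    $replaceRoot: {
      newRoot: { $mergeObjects: ['$rows', { placement: { $add: ['$placement', 1] } }] },
    },
  },
];
But this feels clumsy for large result sets, and I'm not sure it's the intended way.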
Can you help me?

MongoDb aggregate with limit and without limit

There is a collection in Mongo with 40 million records.
db.getCollection('feedposts').aggregate([
  {
    "$match": {
      "$or": [
        { "isOfficial": true },
        {
          "creator": ObjectId("537f267c984539401ff448d2"),
          type: { $nin: ['challenge_answer', 'challenge_win'] }
        }
      ],
    }
  },
  {
    $sort: { timeline: -1 }
  }
])
This query never finishes.
But if you add a $limit before the sort, with a limit known in advance to be higher than the total number of records (for example, 1,000,000,000,000,000), the query completes almost instantly:
db.getCollection('feedposts').aggregate([
  {
    "$match": {
      "$or": [
        { "isOfficial": true },
        {
          "creator": ObjectId("537f267c984539401ff448d2"),
          type: { $nin: ['challenge_answer', 'challenge_win'] }
        }
      ],
    }
  },
  {
    $limit: 10000000000000000
  },
  {
    $sort: { timeline: -1 }
  }
])
Please tell me why this is happening.
What problems can I expect in the future if I leave it this way?
TLDR: Mongo is using the wrong index for the query
Why is this happening?
Well, basically for every query you run, Mongo holds a quick "competition" between the relevant indexes in order to choose which one to use: the first candidate plan to retrieve 101 documents wins.
Usually this wrong-index situation occurs when an ascending or descending field has its own matching index, which under certain conditions wins that fetching competition. This is very risky, as stable code can suddenly become a huge bottleneck.
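One way to confirm which plan is being chosen is the shell's explain helper (a sketch; run it on the same pipeline as above):
db.getCollection('feedposts').explain('executionStats').aggregate([
  // ... the same $match / $sort stages as above ...
])
Look at winningPlan in the output to see which index was selected.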
What can we do?
You have a few options:
Use the hint option to make Mongo use the compound index you have ready for this pipeline (see the sketch after this list).
Drop the rogue index to ensure this will never happen again elsewhere (this is my recommended option).
Keep doing what you're doing: by adding this arbitrary $limit stage you're throwing Mongo's competition off and ensuring the right index gets picked.
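A minimal sketch of the hint option (the index name here is an assumption; substitute the actual name of your compound index):
db.getCollection('feedposts').aggregate(
  [
    // ... the same $match / $sort stages as above ...
  ],
  // force the planner to use the intended compound index (name assumed)
  { hint: 'creator_1_type_1_timeline_-1' }
)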

Speed up aggregation on large collection

I currently have a database with about 270 000 000 documents. They look like this:
[{
  'location': 'Berlin',
  'product': 4531,
  'createdAt': ISODate(...),
  'value': 3523,
  'minOffer': 3215,
  'quantity': 7812
},{
  'location': 'London',
  'product': 1231,
  'createdAt': ISODate(...),
  'value': 53523,
  'minOffer': 44215,
  'quantity': 2812
}]
The database currently holds a bit over one month of data and has ~170 locations (in the EU and US) with ~8000 products. These documents represent timesteps, so there are about 12-16 entries per day per product per location (at most one per hour, though).
My goal is to retrieve all timesteps of a product in a given location for the last 7 days. For a single location this query works reasonably fast (150 ms) with the index { product: 1, location: 1, createdAt: -1 }.
However, I also need these timesteps not just for a single location, but for an entire region (about 85 locations). I'm currently doing that with this aggregation, which groups all the entries per hour and averages the desired values:
this.db.collection('...').aggregate([
  {
    $match: {
      location: { $in: [array of ~85 locations] },
      product: productId,
      createdAt: { $gte: new Date(Date.now() - sevenDaysAgo) }
    }
  },
  {
    $group: {
      _id: {
        $toDate: {
          $concat: [
            { $toString: { $year: '$createdAt' } },
            '-',
            { $toString: { $month: '$createdAt' } },
            '-',
            { $toString: { $dayOfMonth: '$createdAt' } },
            ' ',
            { $toString: { $hour: '$createdAt' } },
            ':00'
          ]
        }
      },
      value: { $avg: '$value' },
      minOffer: { $avg: '$minOffer' },
      quantity: { $avg: '$quantity' }
    }
  }
]).sort({ _id: 1 }).toArray()
However, this is really, really slow, even with the index { product: 1, createdAt: -1, location: 1 } (~40 s). Is there any way to speed up this aggregation so it takes a few seconds at most? Is this even possible, or should I think about using something else?
I've thought about saving these aggregations in another database and just retrieving them and aggregating the rest; however, this is really awkward for the first users on the site, who would have to sit through the 40-second wait.
These are some ideas which can benefit the querying and performance. Whether all of these will work together is a matter of some trials and testing. Also, note that changing the way data is stored and adding new indexes means there will be changes to the application, i.e., to capturing data, and the other queries on the same data need to be carefully verified (that they are not affected in a wrong way).
(A) Storing a Day's Details in a Document:
Store (embed) a day's data within the same document as an array of sub-documents. Each sub-document represents an hour's entry.
From:
{
  'location': 'London',
  'product': 1231,
  'createdAt': ISODate(...),
  'value': 53523,
  'minOffer': 44215,
  'quantity': 2812
}
to:
{
  location: 'London',
  product: 1231,
  createdAt: ISODate(...),
  details: [ { value: 53523, minOffer: 44215, quantity: 2812 }, ... ]
}
This means about a dozen entries per document (one per hour at most). Adding data for an entry now means pushing into the details array, instead of inserting a new document as in the present application. If the hour's info (time) is required, it can also be stored as part of the details sub-document; that will entirely depend upon your application's needs.
The benefits of this design:
The number of documents to maintain and query is greatly reduced: one document per product per day instead of a dozen or so.
In the query, the group stage goes away; it becomes just a project stage. Note that $project supports the accumulators $avg and $sum.
The following stage will create the averages for the day (i.e., for one document):
{
  $project: {
    value: { $avg: '$details.value' },
    minOffer: { $avg: '$details.minOffer' },
    quantity: { $avg: '$details.quantity' }
  }
}
Note that the increase in the size of the document is not much, given the amount of detail being stored per day.
(B) Querying by Region:
The present query matches multiple locations (i.e., a region) with this filter: { location: { $in: [array of ~85 locations] } }. This filter effectively says: location: location-1, or location: location-3, or ..., location: location-50. Adding a new field, region, reduces the filter to a single-value match.
The query by region will change to:
{
  $match: {
    region: regionId,
    product: productId,
    createdAt: { $gte: new Date(Date.now() - sevenDaysAgo) }
  }
}
The regionId variable is to be supplied to match with the region field.
Note that both queries, "by location" and "by region", will benefit from the above two considerations, A and B.
(C) Indexing Considerations:
The present index: { product: 1, location: 1, createdAt: -1 }.
Taking the new field region into consideration, new indexing will be needed. The query by region cannot benefit without an index on the region field, so a second index will be needed: a compound index to suit the query. Creating an index with the region field means additional overhead on write operations; there will also be memory and storage considerations.
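A sketch of what that second index might look like (the field order here is an assumption and needs verifying with explain):
db.collection.createIndex({ product: 1, region: 1, createdAt: -1 })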
NOTES:
After adding the index, both queries ("by location" and "by region") need to be verified using explain to confirm they are using their respective indexes. This will require some testing; a trial-and-error process.
Again, adding new data, storing data in a different format, and adding new indexes requires you to consider these:
Careful testing, to verify that the other existing queries perform as usual.
The changes in data-capture needs.
Testing the new queries and verifying that the new design performs as expected.
Honestly, your aggregation is pretty much as optimized as it can get, especially if you have { product: 1, createdAt: -1, location: 1 } as an index as you stated.
I'm not exactly sure how your entire product is built; however, the best solution in my opinion is to have another collection containing just the "relevant" documents from the past week.
Then you could query that collection with ease. This is quite easy to keep up to date in Mongo using a TTL index.
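A minimal sketch of such a TTL index (the collection name is hypothetical, and it assumes documents should expire seven days after their createdAt):
// documents are removed automatically ~7 days (604800 seconds) after createdAt
db.recentTimesteps.createIndex({ createdAt: 1 }, { expireAfterSeconds: 604800 })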
If this is not an option, you could add a temporary field to the "relevant" documents and query on that, making it somewhat faster to retrieve them. But maintaining this field will require you to have a process running every X amount of time, which could make your results not 100% accurate, depending on when you decide to run it.

MongoDB: aggregate and group by splitting the id

My schema implementation is influenced by this tutorial on the official Mongo site:
{
  _id: String,
  data: [
    {
      point_1: Number,
      ts: Date
    }
  ]
}
This is basically a schema designed for time-series data, where I store the data for each hour, per device, in an array in a single document. I create the _id field by combining the id of the device sending the data with the time. For example, if a device with id xyz1234 sends data at 2018-09-11 12:30:00, then my _id field becomes xyz1234:2018091112.
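For illustration, the hour key could be derived like this (a sketch; the variable names are made up):
// e.g. deviceId = 'xyz1234', date = new Date('2018-09-11T12:30:00Z')
// produces the _id 'xyz1234:2018091112'
const id = deviceId + ':' + date.toISOString().slice(0, 13).replace(/[-T]/g, '');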
I create a new doc if the document for that hour for that device doesn't exist; otherwise I just push my data onto the data array.
client.db('iot')
  .collection('iotdata')
  .update({ _id: id }, { $push: { data: { point_1, ts: date } } }, { upsert: true });
Now I am facing a problem while doing aggregation. I am trying to get these types of values:
Min point_1 value for many devices in last 24 hours by grouping on device id
Max point_1 value for many devices in last 24 hours by grouping on device id
Average point_1 for many devices in last 24 hours by grouping on device id
I thought this would be a very simple aggregation, but then I realized the device id is not stored directly; it is mixed with the time data, so it's not straightforward to group by device id. How can I split the _id and group based on the device id? I tried my level best to write the question as clearly as possible, so please ask in the comments if any part of the question is not clear.
You can start with $unwind on data to get a single document per entry. Then you can extract the deviceId using the $substr and $indexOfBytes operators. After that you can apply your filtering condition (last 24 hours) and use $group to get the min, max, and avg:
db.col.aggregate([
  {
    $unwind: "$data"
  },
  {
    $project: {
      point_1: "$data.point_1",
      deviceId: { $substr: [ "$_id", 0, { $indexOfBytes: [ "$_id", ":" ] } ] },
      dateTime: "$data.ts"
    }
  },
  {
    $match: {
      dateTime: { $gte: ISODate("2018-09-10T12:00:00Z") }
    }
  },
  {
    $group: {
      _id: "$deviceId",
      min: { $min: "$point_1" },
      max: { $max: "$point_1" },
      avg: { $avg: "$point_1" }
    }
  }
])
You can use the query below in 3.6:
db.colname.aggregate([
  {"$project":{
    "deviceandtime":{"$split":["$_id", ":"]},
    "minpoint":{"$min":"$data.point_1"},
    "maxpoint":{"$max":"$data.point_1"},
    "sumpoint":{"$sum":"$data.point_1"},
    "count":{"$size":"$data.point_1"}
  }},
  // compare the time part in the same yyyymmddhh format used in the _id
  {"$match":{"$expr":{"$gte":[{"$arrayElemAt":["$deviceandtime",1]},"2018091012"]}}},
  {"$group":{
    "_id":{"$arrayElemAt":["$deviceandtime",0]},
    "minpoint":{"$min":"$minpoint"},
    "maxpoint":{"$max":"$maxpoint"},
    "sumpoint":{"$sum":"$sumpoint"},
    "countpoint":{"$sum":"$count"}
  }},
  {"$project":{
    "minpoint":1,
    "maxpoint":1,
    "avgpoint":{"$divide":["$sumpoint","$countpoint"]}
  }}
])