MongoDB aggregation: Get samples at specific intervals

I have a MongoDB collection containing timestamped documents. The important part of their shape is:
{
  receivedOn: {
    date: ISODate("2018-10-01T07:50:06.836Z")
  }
}
They are indexed on the date.
These documents relate to and contain data from UDP packets constantly arriving at a server. The packet rate varies, but there are usually around 20 per second.
I'm trying to take samples from this collection. I have a list of timestamps, and I want to get the documents closest to these timestamps in the past.
For example, if I have the following documents
{_id: 1, "receivedOn.date": ISODate("2018-10-01T00:00:00.000Z")}
{_id: 2, "receivedOn.date": ISODate("2018-10-01T00:00:02.000Z")}
{_id: 3, "receivedOn.date": ISODate("2018-10-01T00:00:04.673Z")}
{_id: 4, "receivedOn.date": ISODate("2018-10-01T00:00:05.001Z")}
{_id: 5, "receivedOn.date": ISODate("2018-10-01T00:00:09.012Z")}
{_id: 6, "receivedOn.date": ISODate("2018-10-01T00:00:10.065Z")}
and the timestamps
new Date("2018-10-01T00:00:05.000Z")
new Date("2018-10-01T00:00:10.000Z")
I want the result to be
[
  {_id: 3, "receivedOn.date": ISODate("2018-10-01T00:00:04.673Z")},
  {_id: 5, "receivedOn.date": ISODate("2018-10-01T00:00:09.012Z")}
]
Using aggregation, I made this work. The following code gives the correct result, but it is slow, and appears to have complexity O(n*m), where n is the number of matched documents and m is the number of timestamps.
const timestamps = [
  new Date("2018-10-01T00:00:00.000Z"),
  new Date("2018-10-01T00:00:05.000Z"),
  new Date("2018-10-01T00:00:10.000Z")
];
collection.aggregate([
  {$match: {
    $and: [
      {"receivedOn.date": {$lte: new Date("2018-10-01T00:00:10.000Z")}},
      {"receivedOn.date": {$gte: new Date("2018-10-01T00:00:00.000Z")}}
    ]
  }},
  {$project: ...},
  {$sort: {"receivedOn.date": -1}},
  {$bucket: {
    groupBy: "$receivedOn.date",
    boundaries: timestamps,
    output: {
      docs: {$push: "$$CURRENT"}
    }
  }},
  // The buckets contain sorted arrays. The first element is the newest
  {$project: {
    doc: {
      $arrayElemAt: ["$docs", 0]
    }
  }},
  // Lift the document out of its bucket wrapper
  {$replaceRoot: {newRoot: "$doc"}}
]);
Is there a way to make this faster? Like somehow telling $bucket that the data is sorted? I assume most of the time here is spent by $bucket figuring out which bucket to put each document in. Or is there another, better way to do this?
I've also tried running one findOne query per timestamp in parallel. That also gives the correct result and is much faster, but having a few thousand timestamps is not uncommon, and I don't want to issue thousands of queries each time I need to do this.
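For reference, a minimal sketch of that per-timestamp variant (assuming the Node.js driver and an async context; fast per query thanks to the index, but it issues one query per timestamp):

// For each timestamp, fetch the newest document at or before it
const samples = await Promise.all(timestamps.map(ts =>
  collection.findOne(
    { "receivedOn.date": { $lte: ts } },   // everything in the past of ts
    { sort: { "receivedOn.date": -1 } }    // newest first, so findOne returns the closest
  )
));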

Related

MongoDB - Safely sort inner array after group

I'm trying to look up all records that match a certain condition, in this case fk being certain values, and then, for each group, return only the top 2 results, sorted by the name field.
This is what I have
db.getCollection('col1').aggregate([
  {$match: {fk: {$in: [1, 2]}}},
  {$sort: {fk: 1, name: -1}},
  {$group: {_id: "$fk", items: {$push: "$$ROOT"}}},
  {$project: {items: {$slice: ["$items", 2]}}}
])
and it works, BUT it's not guaranteed: according to this Mongo thread, $group does not guarantee document order.
This would also mean that all of the suggested solutions here and elsewhere, which recommend using $unwind, followed by $sort, and then $group, would also not work, for the same reason.
What is the best way to accomplish this with Mongo (any version)? I've seen suggestions that this could be accomplished in the $project phase, but I'm not quite sure how.
You are correct in saying that the result of $group is never sorted.
$group does not order its output documents.
Hence doing a
{$sort: {fk: 1}}
and then grouping with
{$group: {_id: "$fk", ... }}
will be a wasted effort.
But there is a silver lining with sorting on name: -1 before the $group stage. Since you are using $push (not $addToSet), the pushed objects retain their sorted order in the newly created items array in the $group result. You can see this behaviour here (copy of your pipeline).
The items array will always have
"items": [
  {
    ..
    "name": "Michael"
  },
  {
    ..
    "name": "George"
  }
]
in the same order; therefore your nested array sort is a non-issue! Though I am unable to find an exact quote in the documentation to confirm this behaviour, you can check
this,
or this where it is confirmed.
Also, in the accumulator operator list for $group, $addToSet has "Order of the array elements is undefined." in its description, whereas the similar operator $push does not, which might be indirect evidence. :)
Just a simple modification of your pipeline, moving the fk: 1 sort from the pre-$group stage to a post-$group stage on _id:
db.getCollection('col1').aggregate([
  {$match: {fk: {$in: [1, 2]}}},
  {$sort: {name: -1}},
  {$group: {_id: "$fk", items: {$push: "$$ROOT"}}},
  {$sort: {_id: 1}},
  {$project: {items: {$slice: ["$items", 2]}}}
])
should be sufficient to have the main result array order fixed as well. Check it on mongoplayground
$group doesn't guarantee document order, but it does keep the grouped documents in their sorted order within each bucket. So in your case, even though the documents after the $group stage are not sorted by fk, each group's items array will be sorted by name descending. If you would like the documents sorted by fk, you can just add {$sort: {fk: 1}} after the $group stage.
You could also sort by the order of the values passed in your match query, should you need to, by adding an extra field to each document. Something like:
db.getCollection('col1').aggregate([
  {$match: {fk: {$in: [1, 2]}}},
  {$addFields: {ifk: {$indexOfArray: [[1, 2], "$fk"]}}},
  {$sort: {ifk: 1, name: -1}},
  {$group: {_id: "$ifk", items: {$push: "$$ROOT"}}},
  {$sort: {_id: 1}},
  {$project: {items: {$slice: ["$items", 2]}}}
])
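Here $indexOfArray: [[1, 2], "$fk"] maps fk 1 to index 0 and fk 2 to index 1, so sorting on ifk reproduces the order of the values passed to $in.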
Update, on sorting an array without a $group operator: I've found the JIRA issue that tracks adding an array sort capability.
You could try the $project stage below to sort the array. There may be various ways to do it; this one sorts names descending. It works, but it is a slower solution.
{"$project":{"items":{"$reduce":{
"input":"$items",
"initialValue":[],
"in":{"$let":{
"vars":{"othis":"$$this","ovalue":"$$value"},
"in":{"$let":{
"vars":{
//return index as 0 when comparing the first value with initial value (empty) or else return the index of value from the accumlator array which is closest and less than the current value.
"index":{"$cond":{
"if":{"$eq":["$$ovalue",[]]},
"then":0,
"else":{"$reduce":{
"input":"$$ovalue",
"initialValue":0,
"in":{"$cond":{
"if":{"$lt":["$$othis.name","$$this.name"]},
"then":{"$add":["$$value",1]},
"else":"$$value"}}}}
}}
},
//insert the current value at the found index
"in":{"$concatArrays":[
{"$slice":["$$ovalue","$$index"]},
["$$othis"],
{"$slice":["$$ovalue",{"$subtract":["$$index",{"$size":"$$ovalue"}]}]}]}
}}}}
}}}}
A simple example demonstrating how each iteration works:
db.b.insert({"items":[2,5,4,7,6,3]});
othis  ovalue       index  concat arrays (parts with counts)  return value
2      []           0      [],0   [2]  [],0                   [2]
5      [2]          0      [],0   [5]  [2],-1                 [5,2]
4      [5,2]        1      [5],1  [4]  [2],-1                 [5,4,2]
7      [5,4,2]      0      [],0   [7]  [5,4,2],-3             [7,5,4,2]
6      [7,5,4,2]    1      [7],1  [6]  [5,4,2],-3             [7,6,5,4,2]
3      [7,6,5,4,2]  4      [7,6,5,4],4  [3]  [2],-1           [7,6,5,4,3,2]
Reference - Sorting Array with JavaScript reduce function
There is a bit of a red herring in the question: $group does guarantee that it will process incoming documents in order (which is why you have to sort them before $group to get ordered arrays). But there is an issue with the way you propose doing it, since pushing all the documents into a single grouping is (a) inefficient and (b) could potentially exceed the maximum document size.
Since you only want the top two for each of the unique fk values, the most efficient way to accomplish it is via a "subquery" using $lookup, like this:
db.coll.aggregate([
  {$match: {fk: {$in: [1, 2]}}},
  {$group: {_id: "$fk"}},
  {$sort: {_id: 1}},
  {$lookup: {
    from: "coll",
    as: "items",
    let: {fk: "$_id"},
    pipeline: [
      {$match: {$expr: {$eq: ["$fk", "$$fk"]}}},
      {$sort: {name: -1}},
      {$limit: 2},
      {$project: {_id: 0, fk: 1, name: 1}}
    ]
  }}
])
Assuming you have an index on {fk: 1, name: -1} (as you must, to get an efficient sort in your proposed code), the first two stages here will use that index via a DISTINCT_SCAN plan, which is very efficient; then, for each fk, $lookup will use that same index to filter by a single value of fk and return results already sorted and limited to the first two. This will be the most efficient way to do this, at least until https://jira.mongodb.org/browse/SERVER-9377 is implemented by the server.
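As a hedged sketch (collection name coll as in the pipeline above), creating that index and checking the plan might look like:

// Create the compound index the pipeline relies on
db.coll.createIndex({ fk: 1, name: -1 })
// Verify that the $match + $group prefix is served by a DISTINCT_SCAN
db.coll.explain().aggregate([
  { $match: { fk: { $in: [1, 2] } } },
  { $group: { _id: "$fk" } }
])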

Meteor + Mongo (2.6.7) Pushing Document to Array in Sorted Order

I have a document with an array (which should be denormalised, but can't be because the reactive events will fire "add" too many times at client startup).
I need to be able to push a document to that array, and keep it in sorted (or roughly sorted) order. I've tried this query:
{ $push: {
  'events': {
    $each: [{'id': new Mongo.ObjectID, 'start': startDate, ...}],
    $sort: {'start': 1},
    $slice: -1
  }
}}
But it requires the $slice operator to be present... I don't want to delete all my old data; I just want to be able to insert data into an array and have that array kept sorted, so that I can query the array later and say "slice greater than or equal to time X".
Is this possible?
Edit:
This mongo aggregate query nearly works, except for one level of document in the result array, but aggregating is not reactive (probably because the computations are expensive). Here is the aggregate query, if anyone can see how to translate it to a find, or why it can't be translated:
Coll.aggregate(
  {$unwind: '$events'},
  {$sort: {'events.start': 1}},
  {$match: {'events.start': {$gte: new Date()}}},
  {$group: {_id: '$_id', 'events': {$push: '$events'}}}
)
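One possible workaround for the $slice requirement (a sketch on my part, not from the original post, assuming the array stays well below the chosen bound): pass a negative $slice far larger than the array will ever grow, so $sort is still applied but nothing is actually discarded. Here docId is a hypothetical placeholder:

Coll.update({ _id: docId }, {
  $push: {
    'events': {
      $each: [{'id': new Mongo.ObjectID, 'start': startDate}],
      $sort: {'start': 1},
      $slice: -1000000   // keeps the last 1,000,000 elements, i.e. effectively everything
    }
  }
})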

MongoDB: calculate average value for the document & then do the same thing across entire collection

I have collection of documents (Offers) with subdocuments (Salary) like this:
{
  _id: ObjectId("zzz"),
  sphere: ObjectId("xxx"),
  region: ObjectId("yyy"),
  salary: {
    start: 10000,
    end: 50000
  }
}
And I want to calculate the average salary across some region & sphere for the entire collection. I created a query for this, and it works, but it only takes the salary start value into account.
db.offer.aggregate(
  [
    {$match:
      {$and: [
        {"salary.start": {$gt: 0}},
        {region: ObjectId("xxx")},
        {sphere: ObjectId("yyy")}
      ]}
    },
    {$group: {_id: null, avg: {$avg: "$salary.start"}}}
  ]
)
But first I want to calculate the average salary (start & end) of each offer. How can I do this?
Update.
If the value of "salary.end" may be missing in your data, you need to add one additional $project stage to replace a missing "salary.end" with the existing "salary.start". Otherwise the average will be calculated wrongly, because documents lacking "salary.end" values are ignored.
db.offer.aggregate([
  {$match:
    {$and: [
      {"salary.start": {$gt: 0}},
      {"region": ObjectId("xxx")},
      {"sphere": ObjectId("yyy")}
    ]}
  },
  {$project: {"_id": 1,
    "sphere": 1,
    "region": 1,
    "salary.start": 1,
    "salary.end": {$ifNull: ["$salary.end", "$salary.start"]}
  }},
  {$project: {"_id": 1,
    "sphere": 1,
    "region": 1,
    "avg_salary": {$divide: [
      {$add: ["$salary.start", "$salary.end"]},
      2
    ]}
  }},
  {$group: {"_id": {"sphere": "$sphere", "region": "$region"},
    "avg": {$avg: "$avg_salary"}}}
])
The way you aggregate has to be modified:
1. Match the required region and sphere, and where salary > 0.
2. Project an extra field for each offer, which holds the average of start and end.
3. Group together the records with the same region and sphere, and apply the $avg aggregation operator on the avg_salary of each offer in that group, to get the average salary.
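For the sample offer above, step 2 gives avg_salary = (10000 + 50000) / 2 = 30000, and step 3 then averages those per-offer values within each (sphere, region) group.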
The Code:
db.offer.aggregate([
  {$match:
    {$and: [
      {"salary.start": {$gt: 0}},
      {"region": ObjectId("xxx")},
      {"sphere": ObjectId("yyy")}
    ]}
  },
  {$project: {"_id": 1,
    "sphere": 1,
    "region": 1,
    "avg_salary": {$divide: [
      {$add: ["$salary.start", "$salary.end"]},
      2
    ]}
  }},
  {$group: {"_id": {"sphere": "$sphere", "region": "$region"},
    "avg": {$avg: "$avg_salary"}}}
])

mongodb index. how to index a single object on a document, nested in an array

I have the following document:
{
  'date': date,
  '_id': ObjectId,
  'Log': [
    {
      'lat': float,
      'lng': float,
      'date': float,
      'speed': float,
      'heading': float,
      'fix': float
    }
  ]
}
For one document, the Log array can have some hundred entries.
I need to query the first and last date element of Log on each document. I know how to query it, but I need to do it fast, so I would like to build an index for that. I don't want to index Log.date, since it is too big... How can I index them?
In fact it's hard to advise without knowing how you work with the documents. One of the solutions could be to use a sparse index. You just need to add a new field to every first and last array element; let's call it shouldIndex. Then just create a sparse index which includes the shouldIndex and date fields. Here's a short example:
Assume we have this document
{"Log":
[{'lat': 1, 'lng': 2, 'date': new Date(), shouldIndex : true},
{'lat': 3, 'lng': 4, 'date': new Date()},
{'lat': 5, 'lng': 6, 'date': new Date()},
{'lat': 7, 'lng': 8, 'date': new Date(), shouldIndex : true}]}
Please note that the first and last elements contain the shouldIndex field.
db.testSparseIndex.ensureIndex(
  { "Log.shouldIndex": 1, "Log.date": 1 },
  { sparse: true }
)
This index should contain entries only for your first and last elements.
Alternatively, you may store the first and last elements' date field in a separate array.
For more info on sparse indexes please refer to this article.
Hope it helps!
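A hedged usage sketch (names from the example above; it is worth verifying with explain() that the sparse index is actually chosen):

// $elemMatch keeps shouldIndex and date constrained to the same array element,
// matching documents whose flagged first/last entry falls on or after a date
db.testSparseIndex.find({
  Log: { $elemMatch: { shouldIndex: true, date: { $gte: new Date("2015-01-01") } } }
})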
So there was an answer about indexing that is fundamentally correct. As of writing, though, it seems a little unclear whether you are talking about indexing at all. It almost seems like what you want to do is get the first and last date from the elements in your array.
With that in mind there are a few approaches:
1. The elements in your array have been naturally inserted in increasing date values
So if all writes made to this field are done only with the $push operator over a period of time, and you never update these items (at least in so much as changing a date), then your items are already in order.
What this means is you just get the first and last element from the array
db.collection.find({ _id: id },{ Log: {$slice: 1 }}); // gets the first element
db.collection.find({ _id: id },{ Log: {$slice: -1 }}); // gets the last element
Now of course that is two queries but it's a relatively simple operation and not costly.
2. For some reason your elements are not naturally ordered by date
If this is the case, or indeed if you just can't live with the two-query form, then you can get the first and last values in aggregation, using the $min and $max operators
db.collection.aggregate([
  // You might want to match first. Just doing one _id here. (commented)
  //{"$match": { "_id": id }},
  // Unwind the array
  {"$unwind": "$Log" },
  // Take the smallest and largest date per document
  {"$group": {
    "_id": "$_id",
    "firstDate": {"$min": "$Log.date" },
    "lastDate": {"$max": "$Log.date" }
  }}
])
So finally, if your use case here is getting the details of the documents that have the first and last date, we can do that as well, somewhat mirroring the initial two-query form, using $first and $last:
db.collection.aggregate([
  // You might want to match first. Just doing one _id here. (commented)
  //{"$match": { "_id": id }},
  // Unwind the array
  {"$unwind": "$Log" },
  // Sort the results on the date
  {"$sort": { "_id": 1, "Log.date": 1 }},
  // Group using $first and $last
  {"$group": {
    "_id": "$_id",
    "firstLog": {"$first": "$Log" },
    "lastLog": {"$last": "$Log" }
  }}
])
Your mileage may vary, but those approaches may obviate the need to index, if this would indeed be the only usage for that index.

MongoDB - sort by subdocument match

Say I have a users collection in MongoDB. A typical user document contains a name field, and an array of subdocuments, representing the user's characteristics. Say something like this:
{
  "name": "Joey",
  "characteristics": [
    {
      "name": "shy",
      "score": 0.8
    },
    {
      "name": "funny",
      "score": 0.6
    },
    {
      "name": "loving",
      "score": 0.01
    }
  ]
}
How can I find the top X funniest users, sorted by how funny they are?
The only way I've found so far, was to use the aggregation framework, in a query similar to this:
db.users.aggregate([
  {$project: {"_id": 1, "name": 1, "characteristics": 1, "_characteristics": '$characteristics'}},
  {$unwind: "$_characteristics"},
  {$match: {"_characteristics.name": "funny"}},
  {$sort: {"_characteristics.score": -1}},
  {$limit: 10}
]);
This seems to be exactly what I want, except that, according to MongoDB's documentation on using indexes in pipelines, once I call $project or $unwind in an aggregation pipeline I can no longer utilize indexes to match or sort the collection, which renders this solution somewhat unfeasible for a very large collection.
I think you are halfway there. I would do
db.users.aggregate([
  {$match: {'characteristics.name': 'funny'}},
  {$unwind: '$characteristics'},
  {$match: {'characteristics.name': 'funny'}},
  {$project: {_id: 0, name: 1, 'characteristics.score': 1}},
  {$sort: {'characteristics.score': -1}},
  {$limit: 10}
])
I add a match stage to get rid of users without the funny attribute (which can be easily indexed),
unwind and match again to keep only the relevant part of the data,
keep only the necessary data with $project,
sort by the score, descending (funniest first),
and limit the results.
That way you can use an index for the first match.
The way I see it, if the characteristics you are interested in are not too many, IMO it would be better to have your structure as
{
  "name": "Joey",
  "shy": 0.8,
  "funny": 0.6,
  "loving": 0.01
}
That way you can use an index (sparse or not) to make your life easier!
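A minimal sketch of that flattened approach (assuming, as in the example, that scores always lie between 0 and 1):

// With the trait as a top-level field, an ordinary index supports both
// the filter and the sort, and a plain find() replaces the aggregation
db.users.createIndex({ funny: -1 }, { sparse: true })
db.users.find({ funny: { $gte: 0 } }).sort({ funny: -1 }).limit(10)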