MongoDB timestamp field sampling and aggregation - mongodb

I'm kind of new to MongoDB, so bear with me.
Consider a collection which is built from documents in the form of the following:
{
"_id" : ObjectId("538d87a36da0bab7ff1a827d"),
"resource_id" : "some_id",
"server_ts" : 1401784227674.05214213,
"location" : [
34.8383953,
32.1098175
],
"__v" : 0
}
Documents are being added per resource at a relatively fast rate, so I end up with a high resolution of timestamped locations (approx. half a second resolution) based on server_ts.
I'd like to be able to query the collection based on a resource id, but return documents at a lower resolution (e.g. 5 seconds resolution, rather than the original 0.5).
In other words, I'd like to divide the time into ranges of 5 seconds and, for each range, fetch one document which falls in that range (if one actually exists).
Is there a convenient way in MongoDB, either in the aggregation framework or in the standard query interface, to 'sample' data based on these criteria?
Obviously this can be done in server side code (Node.js in my case), but I still wonder if there's a better alternative.
Thanks!

If you store the timestamp as an integer, you can use the $mod operator.
db.coll.find( { ts: { $mod: [ 5, 0 ] } } )
This will return all documents where the value of ts is divisible by 5, e.g. 1401784227670, 1401784227675, 1401784227680...
Of course, this only works if you have at most one document in the same second.
To filter out "duplicates" you can use aggregation like this:
db.x.aggregate([
{ $match : { ts : { $mod : [ 5, 0] } } },
{ $sort : { ts : 1 } }, /* without it $first is unpredictable */
{ $group : { _id : "$ts", location : { $first : "$location" } /* etc. */ } }
]);
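
The $mod filter above only keeps documents whose timestamp happens to be an exact multiple of the divisor. A minimal sketch of the bucketing the question describes, assuming server_ts is in milliseconds: each timestamp is floored to the start of its 5-second window and the earliest document per window is kept. The helper name sampleByWindow is hypothetical, and the plain JavaScript below only emulates the logic in application code. The same arithmetic could probably be expressed in a $group stage keyed on { $subtract: [ "$server_ts", { $mod: [ "$server_ts", 5000 ] } ] }, but that is untested here.

```javascript
// Keep one document (the earliest) per fixed-size time window.
// Assumes server_ts is a millisecond timestamp; sampleByWindow is a hypothetical helper.
function sampleByWindow(docs, windowMs) {
  const firstPerWindow = new Map();
  // Sort a copy by timestamp so that "first" is well defined.
  const sorted = docs.slice().sort((a, b) => a.server_ts - b.server_ts);
  for (const doc of sorted) {
    const windowStart = doc.server_ts - (doc.server_ts % windowMs);
    if (!firstPerWindow.has(windowStart)) {
      firstPerWindow.set(windowStart, doc);
    }
  }
  return Array.from(firstPerWindow.values());
}

// Made-up sample data: the first two documents share a 5-second window.
const docs = [
  { server_ts: 1401784227674, location: [34.83, 32.10] },
  { server_ts: 1401784228174, location: [34.84, 32.11] },
  { server_ts: 1401784231000, location: [34.85, 32.12] },
  { server_ts: 1401784236500, location: [34.86, 32.13] }
];
const sampled = sampleByWindow(docs, 5000);
console.log(sampled.map(d => d.server_ts));
```

Windows with no documents simply produce no entry, which matches the "if it actually exists" requirement in the question.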

Related

MongoDB Aggregation - Buckets Boundaries to Referenced Array

To whom this may concern:
I would like to know if there is some workaround in MongoDB to set the "boundaries" field of a "$bucket" aggregation pipeline stage to an array that's already in the previous aggregation stage. (Or some other aggregation pipeline that will get me the same result). I am using this data to create a histogram of a bunch of values. Rather than retrieve 1 million-or-so values, I can receive 20 buckets with their respective counts.
The previous stages yield the following result:
{
"_id" : ObjectId("5cfa6fad883d3a9b8c6ad50a"),
"boundaries" : [ 73.0, 87.25, 101.5, 115.75, 130.0 ],
"value" : 83.58970621935025
},
{
"_id" : ObjectId("5cfa6fe0883d3a9b8c6ad5a8"),
"boundaries" : [ 73.0, 87.25, 101.5, 115.75, 130.0 ],
"value" : 97.3261380262403
},
...
The "boundaries" field for every document is the result of facet/unwind/addFields stages with some statistical mathematics involving the "value" fields in the pipeline. Therefore, every "boundaries" field value is an array of evenly spaced values in ascending order, all with the same length and values.
The following stage of the aggregation I am trying to perform is:
$bucket: {
groupBy: "$value",
boundaries : "$boundaries" ,
default: "no_group",
output: { count: { $sum: 1 } }
}
I get the following error from the explain when I try to run this aggregation:
{
"ok" : 0.0,
"errmsg" : "The $bucket 'boundaries' field must be an array, but found type: string.",
"code" : NumberInt(40200),
"codeName" : "Location40200"
}
The result I would like to get is something like this, which is the result of a basic "$bucket" pipeline operator:
{
"_id" : 73.0, // range of [73.0,87.25)
"count" : 2 // number of documents with "value" in this range.
}, {
"_id" : 87.25, // range of [87.25,101.5)
"count" : 7 // number of documents with "value" in this range.
}, {
"_id" : 101.5,
"count" : 3
}, ...
What I know:
The JIRA documentation says
'boundaries' must be constant values (can't use "$x", but can use {$add: [4, 5]}), and must be sorted.
What I've tried:
$bucketAuto does not have a linear "granularity" setting. By default, it tries to evenly distribute the values amongst the buckets, and the bucket ranges are therefore spaced differently.
Building the constant array by retrieving the pipeline results, and then adding the constant array into the pipeline again. This is effective but inefficient and not atomic, as it requires two passes over the data (O(2N) time). I can live with this solution if need be.
There HAS to be a solution to this. Any workaround or alternative solutions are greatly appreciated.
Thank you for your time!
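
Because the boundaries are evenly spaced, the bucket a value falls into can be computed arithmetically rather than looked up in an array. A minimal sketch of that idea in plain JavaScript, assuming the minimum, maximum and bucket count are known (countEvenBuckets is a hypothetical helper, not a MongoDB API). Whether the same floor/divide arithmetic is practical inside a $group stage for this pipeline is an assumption that would need testing.

```javascript
// Count values into evenly spaced half-open buckets [lo, lo + step).
// countEvenBuckets is a hypothetical helper for illustration only.
function countEvenBuckets(values, min, max, bucketCount) {
  const step = (max - min) / bucketCount;
  const counts = new Map();
  for (const v of values) {
    if (v < min || v >= max) {
      // Mirrors the "default" bucket of $bucket for out-of-range values.
      counts.set("no_group", (counts.get("no_group") || 0) + 1);
      continue;
    }
    const lowerBound = min + Math.floor((v - min) / step) * step;
    counts.set(lowerBound, (counts.get(lowerBound) || 0) + 1);
  }
  return counts;
}

// Boundaries 73, 87.25, 101.5, 115.75, 130 correspond to min=73, max=130, 4 buckets.
// The value 140 is made up to exercise the out-of-range case.
const histogram = countEvenBuckets(
  [83.58970621935025, 97.3261380262403, 140],
  73, 130, 4
);
console.log(histogram);
```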

MongoDB Calculate Values from Two Arrays, Sort and Limit

I have a MongoDB database storing float arrays. Assume a collection of documents in the following format:
{
"id" : 0,
"vals" : [ 0.8, 0.2, 0.5 ]
}
Having a query array, e.g., with values [ 0.1, 0.3, 0.4 ], I would like to compute for all elements in the collection a distance (e.g., sum of differences; for the given document and query it would be computed by abs(0.8 - 0.1) + abs(0.2 - 0.3) + abs(0.5 - 0.4) = 0.9).
I tried to use the aggregation function of MongoDB to achieve this, but I can't work out how to iterate over the array. (I am not using the built-in geo operations of MongoDB, as the arrays can be rather long)
I also need to sort the results and limit to the top 100, so calculation after reading the data is not desired.
Current Processing is mapReduce
If you need to execute this on the server and sort the top results and just keep the top 100, then you could use mapReduce for this like so:
db.test.mapReduce(
function() {
var input = [0.1,0.3,0.4];
var value = Array.sum(this.vals.map(function(el,idx) {
return Math.abs( el - input[idx] )
}));
emit(null,{ "output": [{ "_id": this._id, "value": value }]});
},
function(key,values) {
var output = [];
values.forEach(function(value) {
value.output.forEach(function(item) {
output.push(item);
});
});
output.sort(function(a,b) {
// A comparator must return a number, not a boolean; this sorts descending by value
return b.value - a.value;
});
return { "output": output.slice(0,100) };
},
{ "out": { "inline": 1 } }
)
So the mapper function does the calculation and outputs everything under the same key so all results are sent to the reducer. The end output is going to be contained in an array in a single output document, so it is important both that all results are emitted with the same key value and that the output of each emit is itself an array so mapReduce can work properly.
The sorting and reduction are done in the reducer itself: as each emitted document is inspected, the elements are put into a single temporary array, sorted, and the top results are returned.
That is important, and it is the reason why the emitter produces this as an array even if it holds a single element at first. MapReduce works by processing results in "chunks", so even if all emitted documents have the same key, they are not all processed at once. Rather, the reducer puts its results back into the queue of emitted results to be reduced until there is only a single document left for that particular key.
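
The re-reduce behaviour described above can be illustrated outside MongoDB. The sketch below is plain JavaScript, not the mapReduce API: it feeds a reducer of the same shape its own output in chunks and shows that the top results survive. The chunk size of 3, the limit of 2 and the helper name reduceTop are assumptions made for illustration.

```javascript
// A reducer with the same shape as the one above: merge "output" arrays,
// sort descending by value, keep the top `limit` entries.
function reduceTop(values, limit) {
  const output = [];
  values.forEach(function (value) {
    value.output.forEach(function (item) { output.push(item); });
  });
  output.sort(function (a, b) { return b.value - a.value; });
  return { output: output.slice(0, limit) };
}

// Emitted values, one single-element array per document (made-up numbers).
const emitted = [5, 1, 9, 3, 7, 2, 8].map(function (v, i) {
  return { output: [{ _id: i, value: v }] };
});

// Reduce in chunks of 3 and push each partial result back into the queue,
// the way mapReduce re-reduces until a single value remains for the key.
const queue = emitted.slice();
while (queue.length > 1) {
  const chunk = queue.splice(0, 3);
  queue.push(reduceTop(chunk, 2));
}
const top = queue[0].output.map(function (item) { return item.value; });
console.log(top);
```

This only works because the reducer's output has the same shape as its input, which is why every emit wraps its result in an array.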
I'm restricting the "slice" output here to 10 for brevity of listing, and including the stats to make a point, as the 100 reduce cycles called on this 10000-document sample can be seen:
{
"results" : [
{
"_id" : null,
"value" : {
"output" : [
{
"_id" : ObjectId("56558d93138303848b496cd4"),
"value" : 2.2
},
{
"_id" : ObjectId("56558d96138303848b49906e"),
"value" : 2.2
},
{
"_id" : ObjectId("56558d93138303848b496d9a"),
"value" : 2.1
},
{
"_id" : ObjectId("56558d93138303848b496ef2"),
"value" : 2.1
},
{
"_id" : ObjectId("56558d94138303848b497861"),
"value" : 2.1
},
{
"_id" : ObjectId("56558d94138303848b497b58"),
"value" : 2.1
},
{
"_id" : ObjectId("56558d94138303848b497ba5"),
"value" : 2.1
},
{
"_id" : ObjectId("56558d94138303848b497c43"),
"value" : 2.1
},
{
"_id" : ObjectId("56558d95138303848b49842b"),
"value" : 2.1
},
{
"_id" : ObjectId("56558d96138303848b498db4"),
"value" : 2.1
}
]
}
}
],
"timeMillis" : 1758,
"counts" : {
"input" : 10000,
"emit" : 10000,
"reduce" : 100,
"output" : 1
},
"ok" : 1
}
So this is a single document output, in the specific mapReduce format, where the "value" contains an element which is an array of the sorted and limited result.
Future Processing is Aggregate
As of writing, the current latest stable release of MongoDB is 3.0, and this lacks the functionality to make your operation possible. But the upcoming 3.2 release introduces new operators that make this possible:
db.test.aggregate([
{ "$unwind": { "path": "$vals", "includeArrayIndex": "index" }},
{ "$group": {
"_id": "$_id",
"result": {
"$sum": {
"$abs": {
"$subtract": [
"$vals",
{ "$arrayElemAt": [ { "$literal": [0.1,0.3,0.4] }, "$index" ] }
]
}
}
}
}},
{ "$sort": { "result": -1 } },
{ "$limit": 100 }
])
Also limiting to the same 10 results for brevity, you get output like this:
{ "_id" : ObjectId("56558d96138303848b49906e"), "result" : 2.2 }
{ "_id" : ObjectId("56558d93138303848b496cd4"), "result" : 2.2 }
{ "_id" : ObjectId("56558d96138303848b498e31"), "result" : 2.1 }
{ "_id" : ObjectId("56558d94138303848b497c43"), "result" : 2.1 }
{ "_id" : ObjectId("56558d94138303848b497861"), "result" : 2.1 }
{ "_id" : ObjectId("56558d96138303848b499037"), "result" : 2.1 }
{ "_id" : ObjectId("56558d96138303848b498db4"), "result" : 2.1 }
{ "_id" : ObjectId("56558d93138303848b496ef2"), "result" : 2.1 }
{ "_id" : ObjectId("56558d93138303848b496d9a"), "result" : 2.1 }
{ "_id" : ObjectId("56558d96138303848b499182"), "result" : 2.1 }
This is made possible largely due to $unwind being modified to project a field in results that contains the array index, and also due to $arrayElemAt which is a new operator that can extract an array element as a singular value from a provided index.
This allows the "look-up" of values by index position from your input array in order to apply the math to each element. The input array is wrapped in the existing $literal operator so that $arrayElemAt does not complain and recognizes it as an array (this seems to be a small bug at present, as other array functions don't have the problem with direct input), and the appropriate matching value is fetched by using the "index" field produced by $unwind.
The math is done by $subtract and of course another new operator in $abs to meet your functionality. Also since it was necessary to unwind the array in the first place, all of this is done inside a $group stage accumulating all array members per document and applying the addition of entries via the $sum accumulator.
Finally all result documents are processed with $sort and then the $limit is applied to just return the top results.
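
To make the pipeline's arithmetic concrete, here is a minimal plain-JavaScript emulation of what the $unwind with includeArrayIndex and the $group/$sum stages compute for one document. It is a sketch of the logic only and does not call MongoDB; the function name distance is hypothetical.

```javascript
// Sum of absolute differences between a document's array and the query array,
// pairing elements by index as $arrayElemAt does with the unwound "index" field.
function distance(vals, query) {
  return vals.reduce(function (sum, val, index) {
    return sum + Math.abs(val - query[index]);
  }, 0);
}

// The sample document and query array from the question.
const doc = { id: 0, vals: [0.8, 0.2, 0.5] };
const result = distance(doc.vals, [0.1, 0.3, 0.4]);
// Floating-point arithmetic makes this approximately, not exactly, 0.9.
console.log(result.toFixed(1));
```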
Summary
Even with the new functionality about to be available to the aggregation framework for MongoDB, it is debatable which approach is actually more efficient for results. This is largely due to there still being a need to $unwind the array content, which effectively produces a copy of each document per array member in the pipeline to be processed, and that generally causes an overhead.
So whilst mapReduce is the only present way to do this until a new release, it may actually outperform the aggregation statement depending on the amount of data to be processed, and despite the fact that the aggregation framework works on native coded operators rather than translated JavaScript operations.
As with all things, testing is always recommended to see which case suits your purposes better and which gives the best performance for your expected processing.
Sample
Of course the expected result for the sample document provided in the question is 0.9 by the math applied. But just for my testing purposes, here is a short listing used to generate some sample data that I wanted to at least verify the mapReduce code was working as it should:
var bulk = db.test.initializeUnorderedBulkOp();
var x = 10000;
while ( x-- ) {
var vals = [0,0,0];
vals = vals.map(function(val) {
return Math.round(Math.random()*10)/10; // Math.round takes a single argument
});
bulk.insert({ "vals": vals });
if ( x % 1000 == 0) {
bulk.execute();
bulk = db.test.initializeUnorderedBulkOp();
}
}
The arrays are totally random single decimal point values, so there is not a lot of distribution in the listed results I gave as sample output.

MongoDB - Get highest value of child

I'm trying to get the highest value of a child value. If I have two documents like this
{
"_id" : ObjectId("5585b8359557d21f44e1d857"),
"test" : {
"number" : 1,
"number2" : 1
}
}
{
"_id" : ObjectId("5585b8569557d21f44e1d858"),
"test" : {
"number" : 2,
"number2" : 1
}
}
How would I get the highest value of key "number"?
Using dot notation:
db.testSOF.find().sort({'test.number': -1}).limit(1)
To get the highest value of the key "number" you could use two approaches here. You could use the aggregation framework where the pipeline would look like this
db.collection.aggregate([
{
"$group": {
"_id": 0,
"max_number": {
"$max": "$test.number"
}
}
}
])
Result:
/* 0 */
{
"result" : [
{
"_id" : 0,
"max_number" : 2
}
],
"ok" : 1
}
or you could use the find() cursor as follows
db.collection.find().sort({"test.number": -1}).limit(1)
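
Both approaches amount to the same computation. A small plain-JavaScript sketch over the two sample documents (it emulates the logic in memory and is not a MongoDB call; the variable names are illustrative):

```javascript
// The two sample documents from the question, without their _id fields.
const docs = [
  { test: { number: 1, number2: 1 } },
  { test: { number: 2, number2: 1 } }
];

// Equivalent of { $group: { _id: 0, max_number: { $max: "$test.number" } } }
const maxNumber = docs.reduce(function (max, doc) {
  return Math.max(max, doc.test.number);
}, -Infinity);

// Equivalent of find().sort({ "test.number": -1 }).limit(1)
const topDoc = docs.slice().sort(function (a, b) {
  return b.test.number - a.test.number;
})[0];

console.log(maxNumber, topDoc.test.number);
```

The difference is in what comes back: the $group form yields only the value, while the sort-and-limit form yields the whole document that holds it.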
max() does not work the way you would expect it to in SQL for Mongo.
This is perhaps going to change in future versions but as of now,
max,min are to be used with indexed keys primarily internally for
sharding.
see http://www.mongodb.org/display/DOCS/min+and+max+Query+Specifiers
Unfortunately for now the only way to get the max value is to sort the
collection desc on that value and take the first.
db.collection.find("_id" => x).sort({"test.number" => -1}).limit(1).first()
quoted from: Getting the highest value of a column in MongoDB

Only retrieve back select sub properties and limit how many items they contain

I have a very simple document:
{
"_id" : ObjectId("5347ff73e4b0e4fcbbb7886b"),
"userName" : "ztolley",
"firstName" : "Zac",
"lastName" : "Tolley",
"data" : {
"temperature" : [
{
"celsius" : 22,
"timestamp" : 1212140000
}
]
}
}
I want to find a way to write a query that searches for userName = 'ztolley' and only returns back the last 10 temperature readings. I've been able to say just return the data field but I couldn't find a way to say just return data.temperature (there are many different data properties).
When I tried
db.user.find({userName:'ztolley'},{data: {temperature: {$slice: -10}}})
I got unsupported projection.
I'd try using the Aggregation framework http://docs.mongodb.org/manual/aggregation/ .
Using your schema this should work:
db.user.aggregate([
{ $match : { "userName" : "ztolley" } },
{ $unwind : "$data.temperature" },
{ $sort : { "data.temperature.timestamp" : -1 } },
{ $limit : 10 },
{ $project : { "data.temperature" : 1, _id : 0 } }
])
It returns the temperature readings for that user, in reverse sorted order by timestamp, limited to 10.
It looks like you wrote the projection incorrectly. Try it with dot '.' notation:
db.user.find( { userName:'ztolley' },{ 'data.temperature': { $slice: -10 } } );
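
For intuition, a negative $slice count keeps the last N elements of the array, much like Array.prototype.slice with a negative start. A plain-JavaScript sketch with made-up readings (the data and variable names are assumptions for illustration):

```javascript
// Twelve made-up readings with increasing timestamps.
const temperature = [];
for (let i = 0; i < 12; i++) {
  temperature.push({ celsius: 20 + i, timestamp: 1212140000 + i * 60 });
}

// Analogous to the projection { "data.temperature": { $slice: -10 } }.
const lastTen = temperature.slice(-10);
console.log(lastTen.length, lastTen[0].timestamp);
```

Note that the selection is by array position, not by timestamp, so it returns the latest readings only if they were appended in chronological order.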

Mongo does not have a max() function, how do I work around this?

I have a MongoDB collection and need to find the max() value of a certain field across all docs. This value is the timestamp and I need to find the latest doc by finding the largest timestamp. Sorting it and getting the first one gets inefficient really fast. Shall I just maintain a 'maxval' separately and update it whenever a doc arrives with a larger value for that field? Any better suggestions?
Thanks much.
If you have an index on the timestamp field, finding the highest value is efficient, with something like
db.things.find().sort({ts:-1}).limit(1)
But if maintaining an index is too much overhead, storing the max in a separate collection might be a good option.
If the collection is going to be large and you always need to display the max timestamp, you may want to create a separate collection and store the statistics there, instead of sorting the large collection each time.
statistic
{
_id: 1,
id_from_time_stamp_collection: 'xxx',
max_timestamp: value
}
Whenever a new doc arrives, just update the statistic document with _id = 1 (with a $gt condition in the query, so that max_timestamp is updated only if the new timestamp is greater than the stored one).
You can probably also store and update other statistics in the statistic collection.
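
The conditional update described above can be sketched as follows. This is plain JavaScript emulating the logic in memory, with the field names taken from the statistic document; the function name recordTimestamp and the sample documents are hypothetical. In the shell, the same condition would presumably go in the query part of the update, e.g. matching { _id: 1, max_timestamp: { $lt: newTs } }.

```javascript
// In-memory stand-in for the single document in the statistic collection.
const statistic = {
  _id: 1,
  id_from_time_stamp_collection: null,
  max_timestamp: -Infinity
};

// Update the stored maximum only when the new document's timestamp is larger.
function recordTimestamp(stat, doc) {
  if (doc.ts > stat.max_timestamp) {
    stat.max_timestamp = doc.ts;
    stat.id_from_time_stamp_collection = doc._id;
  }
  return stat;
}

// Made-up incoming documents, arriving out of timestamp order.
const incoming = [
  { _id: "a", ts: 100 },
  { _id: "b", ts: 300 },
  { _id: "c", ts: 200 }
];
incoming.forEach(function (doc) { recordTimestamp(statistic, doc); });
console.log(statistic.max_timestamp, statistic.id_from_time_stamp_collection);
```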
Try with db.collection.group
For example, with this collection:
> db.foo.find()
{ "_id" : ObjectId("..."), "a" : 1 }
{ "_id" : ObjectId("..."), "a" : 200 }
{ "_id" : ObjectId("..."), "a" : 230 }
{ "_id" : ObjectId("..."), "a" : -2230 }
{ "_id" : ObjectId("..."), "a" : 5230 }
{ "_id" : ObjectId("..."), "a" : 530 }
{ "_id" : ObjectId("..."), "a" : 1530 }
You can use group like this:
> db.foo.group({
initial: { },
reduce: function(doc, acc) {
if(acc.hasOwnProperty('max')) {
if(acc.max < doc.a)
acc.max = doc.a;
} else {
acc.max = doc.a
}
}
})
[ { "max" : 5230 } ]
Since there is no key in the group call, all the documents are grouped into a single result.
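
To see what the reduce function does, it can be run by hand over the sample values in plain JavaScript. This only emulates how group folds every document into one accumulator; it is not the group command itself.

```javascript
// The "a" values from the sample collection above.
const docs = [1, 200, 230, -2230, 5230, 530, 1530].map(function (a) {
  return { a: a };
});

// The same reduce function as above: it mutates the accumulator in place.
function reduce(doc, acc) {
  if (acc.hasOwnProperty("max")) {
    if (acc.max < doc.a) acc.max = doc.a;
  } else {
    acc.max = doc.a;
  }
}

const acc = {}; // plays the role of "initial"
docs.forEach(function (doc) { reduce(doc, acc); });
console.log(acc);
```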