Aggregating by field value on MongoDB - mongodb

I have a collection composed of documents similar to the one below:
{
"_id" : ObjectId("5dc916a72440b14b3f0ec096"),
"date" : ISODate("2019-11-11T11:07:03.968+03:00"),
"actions" : [
{
"type" : "Type1",
"action" : true
},
{
"type" : "Type2",
"action" : true
},
{
"type" : "Type3",
"action" : false
}
]
}
I am trying to count all the action types based on the boolean value of the actions.action property.
This is how far I have got:
db.Actions.aggregate(
{
$group: {
_id: {
year: { $year: "$date" },
month: { $month: "$date" },
day: { $dayOfMonth: "$date" },
},
count: { $sum: 1 }
}
}
);
As you can see this only gives me the count of the documents in the collection grouped by the action date.
What I need is something like this:
{
"_id" : {
"year" : 2019,
"month" : 10,
"day" : 13
},
"Type1": 300,
"Type2": 200,
"Type3": 120,
"count" : 305
}
Is this possible with a query or should I go in the direction of creating a cursor and aggregating the values with it?

db.Actions.aggregate([
// Unwind to de-normalize the array
{ "$unwind": "$actions" },
// Group on both day and "type"
{ "$group": {
"_id": {
"date": {
"$toDate": {
"$subtract": [
{ "$toLong": "$date" },
{ "$mod": [{ "$toLong": { "$toDate": "$date" } }, 1000 * 60 * 60 * 24 ] }
]
}
},
"type": "$actions.type"
},
"total": { "$sum": { "$toLong": "$actions.action" } }
}},
// Roll-up the grouping to just by "day"
{ "$group": {
"_id": "$_id.date",
"data": { "$push": { "k": "$_id.type", "v": "$total" } }
}},
// Convert to key/value output
{ "$replaceRoot": {
"newRoot": {
"$mergeObjects": [
{ "_id": "$_id", "count": { "$sum": "$data.v" } },
{ "$arrayToObject": "$data" }
]
}
}}
])
To summarize:
The $unwind is needed simply because you want to "group" on a value which is inside an array of a document. Using this "de-normalizes", or essentially makes each array element into a new document for the same property and all other "parent" properties of the document in which that array resides. In simple speak, you get a "copy" of the containing document for every array member as a new document.
The next $group basically uses a "Date math" approach to round each date down to a single day. This is a bit prettier than methods like $year and $month etc, and actually returns a Date object, which your client language of choice will understand.
Of course this is a compound grouping key, meaning that the other part is the type field from the array of actions. And since you only want true results to count, we apply $toLong again in order to translate the Boolean into a numeric value to $sum ( which basically means "count" when it's 0 or 1 ). In older releases you could also do this using $cond, but the simple type conversion is a lot simpler to read for intent.
The rest of this is basically about translating to the expected "key/value" output of the question. Really, you got the desired result in the very first $group operation but to be "key/value" you need to put all those results into an array ( by "date" of course ) using $push, and then convert that array into the root document using the $arrayToObject operator.
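To see what the stages compute without a server at hand, the same logic can be sketched in plain JavaScript. This is only an illustration: the countByDay helper and the sample input are made up for the sketch, and only the rounding arithmetic and the Boolean-to-number conversion mirror the pipeline above.

```javascript
// Plain-JavaScript sketch of the pipeline's logic (hypothetical helper, no MongoDB needed).
const DAY_MS = 1000 * 60 * 60 * 24;

function countByDay(docs) {
  const days = new Map();
  for (const doc of docs) {
    // Same date math as the $group key: strip the remainder of a day (UTC).
    const ms = doc.date.getTime();
    const dayKey = ms - (ms % DAY_MS);
    if (!days.has(dayKey)) days.set(dayKey, {});
    const totals = days.get(dayKey);
    // "$unwind": visit each array element on its own.
    for (const { type, action } of doc.actions) {
      // Number(true) === 1 and Number(false) === 0, like $toLong on a Boolean.
      totals[type] = (totals[type] || 0) + Number(action);
    }
  }
  // Shape of the final $replaceRoot: _id, overall count, then one key per type.
  return [...days].map(([dayKey, totals]) => ({
    _id: new Date(dayKey),
    count: Object.values(totals).reduce((a, b) => a + b, 0),
    ...totals
  }));
}

// Made-up sample input: two documents on the same UTC day.
const sample = [
  { date: new Date("2019-11-11T08:07:03.968Z"),
    actions: [ { type: "Type1", action: true },
               { type: "Type2", action: true },
               { type: "Type3", action: false } ] },
  { date: new Date("2019-11-11T20:00:00.000Z"),
    actions: [ { type: "Type1", action: true } ] }
];
console.log(countByDay(sample));
```

Both sample documents land in the same bucket because the remainder of the division by one day is dropped, which is exactly what the $subtract / $mod pair does.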

Related

Group by day with Multiple Date Fields

I have documents stored into MongoDB like this :
{
"_id" : "XBpNKbdGSgGfnC2MJ",
"po" : 72134185,
"machine" : 40940,
"location" : "02A01",
"inDate" : ISODate("2017-07-19T06:10:13.059Z"),
"requestDate" : ISODate("2017-07-19T06:17:04.901Z"),
"outDate" : ISODate("2017-07-19T06:30:34Z")
}
And I want to get the count, by day, of both inDate and outDate.
I can retrieve the count of documents by inDate day on one side and the count of documents by outDate day on the other, but I would like both counts together.
Currently, I use this pipeline:
$group: {
_id: {
year: { $year: '$inDate' },
month: { $month: '$inDate' },
day: { $dayOfMonth: '$inDate' },
},
count: { $sum: 1 },
},
and I get:
{ "_id" : { "year" : 2017, "month" : 7, "day" : 24 }, "count" : 1 }
{ "_id" : { "year" : 2017, "month" : 7, "day" : 21 }, "count" : 11 }
{ "_id" : { "year" : 2017, "month" : 7, "day" : 19 }, "count" : 20 }
But I would like, if it's possible :
{ "_id" : { "year" : 2017, "month" : 7, "day" : 24 }, "countIn" : 1, "countOut" : 4 }
{ "_id" : { "year" : 2017, "month" : 7, "day" : 21 }, "countIn" : 11, "countOut" : 23 }
{ "_id" : { "year" : 2017, "month" : 7, "day" : 19 }, "countIn" : 20, "countOut" : 18 }
Any idea ?
Many thanks :-)
You can also split the documents at the source, by essentially combining each value into an array of entries by "type" for "in" and "out". You can do this simply using $map and $cond to select the fields, then $unwind the array and then determine which field to "count" again by inspecting with $cond:
collection.aggregate([
{ "$project": {
"dates": {
"$filter": {
"input": {
"$map": {
"input": [ "in", "out" ],
"as": "type",
"in": {
"type": "$$type",
"date": {
"$cond": {
"if": { "$eq": [ "$$type", "in" ] },
"then": "$inDate",
"else": "$outDate"
}
}
}
}
},
"as": "dates",
"cond": { "$ne": [ "$$dates.date", null ] }
}
}
}},
{ "$unwind": "$dates" },
{ "$group": {
"_id": {
"year": { "$year": "$dates.date" },
"month": { "$month": "$dates.date" },
"day": { "$dayOfMonth": "$dates.date" }
},
"countIn": {
"$sum": {
"$cond": {
"if": { "$eq": [ "$dates.type", "in" ] },
"then": 1,
"else": 0
}
}
},
"countOut": {
"$sum": {
"$cond": {
"if": { "$eq": [ "$dates.type", "out" ] },
"then": 1,
"else": 0
}
}
}
}}
])
That's a safe way to do this that does not risk breaking the BSON limit, no matter what size of data you send at it.
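As a rough mental model of what that pipeline does per document, here is the same idea in plain JavaScript. The countInOut helper and the sample documents are invented for this sketch and it keys days by UTC date string instead of a year/month/day object, so treat it as an illustration rather than an equivalent of the server-side code.

```javascript
// Sketch of the $map / $filter / $unwind / $group idea in plain JavaScript.
// `countInOut` is a hypothetical helper, not a MongoDB API.
function countInOut(docs) {
  const byDay = new Map();
  for (const doc of docs) {
    // "$map" over [ "in", "out" ], then "$filter" away missing dates.
    const entries = ["in", "out"]
      .map(type => ({ type, date: type === "in" ? doc.inDate : doc.outDate }))
      .filter(entry => entry.date != null);
    // "$unwind" + "$group": one counter update per surviving entry.
    for (const { type, date } of entries) {
      const key = date.toISOString().slice(0, 10); // UTC year-month-day
      const row = byDay.get(key) || { countIn: 0, countOut: 0 };
      row[type === "in" ? "countIn" : "countOut"] += 1;
      byDay.set(key, row);
    }
  }
  return Object.fromEntries(byDay);
}

// Made-up sample documents; the last one has no outDate yet.
const movements = [
  { inDate: new Date("2017-07-19T06:10:13.059Z"), outDate: new Date("2017-07-19T06:30:34Z") },
  { inDate: new Date("2017-07-19T09:00:00Z"), outDate: new Date("2017-07-21T10:00:00Z") },
  { inDate: new Date("2017-07-21T11:00:00Z") }
];
console.log(countInOut(movements));
```

Each source document contributes at most two small entries, so nothing ever has to be collected into one large array.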
Personally I would rather run as separate processes and "combine" the aggregated results separately, but that would depend on the environment you are running in, which is not mentioned in the question.
For an example of "parallel" execution, you can structure in Meteor somewhere along these lines:
import { Meteor } from 'meteor/meteor';
import { Source } from '../imports/source';
import { Target } from '../imports/target';
Meteor.startup(async () => {
// code to run on server at startup
await Source.remove({});
await Target.remove({});
console.log('Removed');
Source.insert({
"_id" : "XBpNKbdGSgGfnC2MJ",
"po" : 72134185,
"machine" : 40940,
"location" : "02A01",
"inDate" : new Date("2017-07-19T06:10:13.059Z"),
"requestDate" : new Date("2017-07-19T06:17:04.901Z"),
"outDate" : new Date("2017-07-19T06:30:34Z")
});
console.log('Inserted');
await Promise.all(
["In","Out"].map( f => new Promise((resolve,reject) => {
let cursor = Source.rawCollection().aggregate([
{ "$match": { [`${f.toLowerCase()}Date`]: { "$exists": true } } },
{ "$group": {
"_id": {
"year": { "$year": `$${f.toLowerCase()}Date` },
"month": { "$month": `$${f.toLowerCase()}Date` },
"day": { "$dayOfYear": `$${f.toLowerCase()}Date` }
},
[`count${f}`]: { "$sum": 1 }
}}
]);
cursor.on('data', async (data) => {
cursor.pause();
data.date = data._id;
delete data._id;
await Target.upsert(
{ date: data.date },
{ "$set": data }
);
cursor.resume();
});
cursor.on('end', () => resolve('done'));
cursor.on('error', (err) => reject(err));
}))
);
console.log('Mapped');
let targets = await Target.find().fetch();
console.log(targets);
});
Which is essentially going to output to the target collection as was mentioned in comments like:
{
"_id" : "XdPGMkY24AcvTnKq7",
"date" : {
"year" : 2017,
"month" : 7,
"day" : 200
},
"countIn" : 1,
"countOut" : 1
}
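Note that this example groups with $dayOfYear, which is why "day" comes out as 200 rather than 19. A small sketch of the same calculation in plain JavaScript, with a dayOfYear helper that is made up for the illustration:

```javascript
// Hypothetical helper mirroring what $dayOfYear returns for a UTC date (1 to 366).
function dayOfYear(date) {
  const startOfYear = Date.UTC(date.getUTCFullYear(), 0, 1);
  const startOfDay = Date.UTC(date.getUTCFullYear(), date.getUTCMonth(), date.getUTCDate());
  return (startOfDay - startOfYear) / (1000 * 60 * 60 * 24) + 1;
}

console.log(dayOfYear(new Date("2017-07-19T06:10:13.059Z"))); // 200
console.log(dayOfYear(new Date("2017-01-01T00:00:00Z")));     // 1
```

If the day of the month is what you want in the output, $dayOfMonth is the operator used by the other pipelines on this page.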
Riiiight. I came up with the following query. Admittedly, I have seen simpler and nicer ones in my life but it certainly gets the job done:
db.getCollection('test').aggregate
(
{
$facet: // split aggregation into two pipelines
{
"in": [
{ "$match": { "inDate": { "$ne": null } } }, // get rid of null values
{ $group: { "_id": { "y": { "$year": "$inDate" }, "m": { "$month": "$inDate" }, "d": { "$dayOfMonth": "$inDate" } }, "cIn": { $sum : 1 } } }, // compute sum per inDate
],
"out": [
{ "$match": { "outDate": { "$ne": null } } }, // get rid of null values
{ $group: { "_id": { "y": { "$year": "$outDate" }, "m": { "$month": "$outDate" }, "d": { "$dayOfMonth": "$outDate" } }, "cOut": { $sum : 1 } } }, // compute sum per outDate
]
}
},
{ $project: { "result": { $setUnion: [ "$in", "$out" ] } } }, // merge results into new array
{ $unwind: "$result" }, // unwind array into individual documents
{ $replaceRoot: { newRoot: "$result" } }, // get rid of the additional field level
{ $group: { _id: { year: "$_id.y", "month": "$_id.m", "day": "$_id.d" }, "countIn": { $sum: "$cIn" }, "countOut": { $sum: "$cOut" } } } // group into final result
)
As always with MongoDB aggregations you can get an idea of what's going on by simply reducing the projection stages step by step starting from the end of the query.
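For readers who prefer code to peeling stages apart, the last part of the query (merge both facet arrays, then regroup) can be sketched in plain JavaScript. The two input arrays are made-up stand-ins for the "in" and "out" facet outputs, and regroup is a hypothetical helper.

```javascript
// Sketch of the last stages: merge both facet arrays, then regroup by date.
// The arrays below stand in for the "in" and "out" facet outputs (made-up values).
const facetIn  = [ { _id: { y: 2017, m: 7, d: 19 }, cIn: 20 },
                   { _id: { y: 2017, m: 7, d: 21 }, cIn: 11 } ];
const facetOut = [ { _id: { y: 2017, m: 7, d: 19 }, cOut: 18 } ];

function regroup(inRows, outRows) {
  const merged = new Map();
  // Concatenating plays the role of $setUnion + $unwind + $replaceRoot.
  for (const row of [...inRows, ...outRows]) {
    const key = `${row._id.y}-${row._id.m}-${row._id.d}`;
    const acc = merged.get(key) ||
      { _id: { year: row._id.y, month: row._id.m, day: row._id.d }, countIn: 0, countOut: 0 };
    // $sum skips a missing field, which amounts to adding 0.
    acc.countIn += row.cIn || 0;
    acc.countOut += row.cOut || 0;
    merged.set(key, acc);
  }
  return [...merged.values()];
}

console.log(regroup(facetIn, facetOut));
```

A day that only appears in one of the two facets still shows up, with the other counter left at 0.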
EDIT:
As you can see in the comments below there was a bit of a discussion around document size limits and the general applicability of this solution.
So let's look at those aspects in greater detail and let's also compare the performance of the $facet based solution to the one based on $map (suggested by @NeilLunn to avoid potential document size issues).
I created 2 million test records that have random dates assigned to both the "inDate" and the "outDate" field:
{
"_id" : ObjectId("597857e0fa37b3f66959571a"),
"inDate" : ISODate("2016-07-29T22:00:00.000Z"),
"outDate" : ISODate("1988-07-14T22:00:00.000Z")
}
The data range covered was from 01.01.1970 all the way to 01.01.2050, that's a total of 29220 distinct days. Given the random distribution of the 2 million test records across this time range both queries can be expected to return the full 29220 possible results (which both did).
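The 29220 figure can be double-checked with a couple of lines of JavaScript. This is only a sanity check of the arithmetic, not part of the benchmark:

```javascript
// Number of whole days between 1970-01-01 and 2050-01-01 (UTC).
const DAY_IN_MS = 24 * 60 * 60 * 1000;
const distinctDays = (Date.UTC(2050, 0, 1) - Date.UTC(1970, 0, 1)) / DAY_IN_MS;
console.log(distinctDays); // 29220

// Same figure by hand: 80 years of 365 days plus 20 leap days.
console.log(80 * 365 + 20); // 29220
```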
Then I ran both queries five times after restarting my single MongoDB instance freshly and the results in milliseconds I got looked like this:
$facet: 5663, 5400, 5380, 5460, 5520
$map: 9648, 9134, 9058, 9085, 9132
I also measured the size of the single document returned by the facet stage which was 3.19MB so reasonably far away from the MongoDB document size limit (16MB at the time of writing) which, however, only applies to the result document anyway and wouldn't be a problem during pipeline processing.
Bottom line: If you want performance, use the solution suggested here. Be careful about the document size limit, though, in particular if your use case is not the exact one described in the question above (e.g. when you need to collect even more/bigger data). Also, I am not sure if in a sharded scenario both solutions still expose the same performance characteristics...

MongoDB aggregate - average on specific values in array of documents

I'm currently working on a database with the following structure:
{"_id" : ObjectId("1abc2"),
"startdatetime" : ISODate("2016-09-11T18:00:37Z"),
"diveValues" : [
{
"temp" : 15.269,
"depth" : 0.0,
},
{
"temp" : 14.779257384,
"depth" : 1.0,
},
{
"temp" : 14.3940253165,
"depth" : 2.0,
},
{
"temp" : 13.9225795455,
"depth" : 3.0,
},
{
"temp" : 13.8214431818,
"depth" : 4.0,
},
{
"temp" : 13.6899553571,
"depth" : 5.0,
}
]}
The database has information about depth in metres in water, and the temperature at the given depth. This is stored in the "diveValues" array. I have been successful in averaging over all depths between two dates, both as a monthly average and as a daily average. What I'm having a serious issue with is getting the average between two depths, say between 1 and 4 metres, for every month of the last 6 months.
Here is an example of average temperature for each month from January to June, for all depths:
db.collection.aggregate(
[
{$unwind:"$diveValues"},
{$match:
{'startdatetime':
{$gt:new ISODate("2016-01-10T06:00:29Z"),
$lt:new ISODate("2016-06-10T06:00:29Z")}
}
},
{$group:
{_id:
{ year: { $year: "$startdatetime" },
month: { $month: "$startdatetime" }},
avgTemp: { $avg: "$diveValues.temp" }}
},
{$sort:{_id:1}}
]
)
Resulting in:
{ "_id" : { "year" : 2016, "month" : 1 }, "avgTemp" : 7.575706502958313 }
{ "_id" : { "year" : 2016, "month" : 3 }, "avgTemp" : 6.85037457740135 }
{ "_id" : { "year" : 2016, "month" : 4 }, "avgTemp" : 7.215702831902588 }
{ "_id" : { "year" : 2016, "month" : 5 }, "avgTemp" : 9.153453683614638 }
{ "_id" : { "year" : 2016, "month" : 6 }, "avgTemp" : 11.497953009390237 }
Now, I can not seem to figure out how to get average temperature between 1 and 4 metres for the same period.
I have been trying to group the values by wanted depths, but have not managed it - more often than not ending up with bad syntax. Also, if I'm not wrong, the $match pipeline would return all depths as long as the dive has values for 1 and 4 metres, so that will not work.
With the find() tool I am using $slice to return the values I intend from the array - but have not been successful along with the aggregate() function.
Is there a way to solve this? Thanks in advance, much appreciated!
You'd need to place your $match pipeline before $unwind to optimize your aggregation operation. Doing an $unwind on the whole collection could cause performance issues, since it produces a copy of each document per array entry. That uses more memory (aggregation pipelines have a memory cap: 10% of total memory in old releases, 100 MB per stage without allowDiskUse in later ones) and it takes "time" both to produce the flattened documents and to process them. Hence it's better to limit the number of documents getting into the pipeline to be flattened.
db.collection.aggregate([
{
"$match": {
"startdatetime": {
"$gt": new ISODate("2016-01-10T06:00:29Z"),
"$lt": new ISODate("2016-06-10T06:00:29Z")
},
"diveValues.depth": { "$gte": 1, "$lte": 4 }
}
},
{ "$unwind": "$diveValues" },
{ "$match": { "diveValues.depth": { "$gte": 1, "$lte": 4 } } },
{
"$group": {
"_id": {
"year": { "$year": "$startdatetime" },
"month": { "$month": "$startdatetime" }
},
"avgTemp": { "$avg": "$diveValues.temp" }
}
}
])
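To make the cost argument concrete, here is a small sketch that counts how many documents each ordering pushes through the pipeline. The data and the unwoundCount helper are invented for the illustration; only the date bounds come from the pipeline above.

```javascript
// Rough sketch of why filtering before $unwind matters: count the documents
// each ordering pushes through the pipeline. The data and helper are made up.
function unwoundCount(dives) {
  // $unwind emits one document per array element.
  return dives.reduce((n, dive) => n + dive.diveValues.length, 0);
}

const dives = [
  { startdatetime: new Date("2016-03-01T00:00:00Z"),
    diveValues: [ { depth: 0 }, { depth: 1 }, { depth: 2 } ] },
  { startdatetime: new Date("2015-03-01T00:00:00Z"),
    diveValues: [ { depth: 0 }, { depth: 1 }, { depth: 2 }, { depth: 3 } ] },
  { startdatetime: new Date("2014-03-01T00:00:00Z"),
    diveValues: [ { depth: 0 }, { depth: 1 } ] }
];
const from = new Date("2016-01-10T06:00:29Z");
const to = new Date("2016-06-10T06:00:29Z");
const inDateRange = dive => dive.startdatetime > from && dive.startdatetime < to;

console.log(unwoundCount(dives));                     // unwind first: 9 documents to filter
console.log(unwoundCount(dives.filter(inDateRange))); // match first: 3 documents
```

With real collections the gap grows with both the number of dives outside the date range and the length of their arrays.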
If you want results to contain the average temps for all depths and for the 1-4 depth range, then you would need to run this pipeline, which uses the $cond ternary operator to feed the $avg operator only the temperatures within the depth range, and null otherwise ($avg ignores non-numeric values):
db.collection.aggregate([
{
"$match": {
"startdatetime": {
"$gt": new ISODate("2016-01-10T06:00:29Z"),
"$lt": new ISODate("2016-06-10T06:00:29Z")
}
}
},
{ "$unwind": "$diveValues" },
{
"$group": {
"_id": {
"year": { "$year": "$startdatetime" },
"month": { "$month": "$startdatetime" }
},
"avgTemp": { "$avg": "$diveValues.temp" },
"avgTempDepth1-4": {
"$avg": {
"$cond": [
{
"$and": [
{ "$gte": [ "$diveValues.depth", 1 ] },
{ "$lte": [ "$diveValues.depth", 4 ] }
]
},
"$diveValues.temp",
null
]
}
}
}
}
])
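The trick of handing null to $avg for values outside the range can be illustrated in plain JavaScript. The avgIgnoringNull helper is hypothetical and written only for this sketch; the readings are the ones from the sample document in the question.

```javascript
// Sketch of $avg fed by $cond: null entries are skipped, as $avg ignores non-numeric values.
// `avgIgnoringNull` is a hypothetical helper written for this illustration.
function avgIgnoringNull(values) {
  const nums = values.filter(v => typeof v === "number");
  return nums.length ? nums.reduce((a, b) => a + b, 0) / nums.length : null;
}

const diveValues = [
  { temp: 15.269, depth: 0 },
  { temp: 14.779257384, depth: 1 },
  { temp: 14.3940253165, depth: 2 },
  { temp: 13.9225795455, depth: 3 },
  { temp: 13.8214431818, depth: 4 },
  { temp: 13.6899553571, depth: 5 }
];

// "$cond": keep the temperature inside the 1-4 range, otherwise null.
const tempsInRange = diveValues.map(v => (v.depth >= 1 && v.depth <= 4 ? v.temp : null));

console.log(avgIgnoringNull(diveValues.map(v => v.temp))); // all depths
console.log(avgIgnoringNull(tempsInRange));                // depths 1 to 4 only
```

Because the out-of-range entries become null instead of 0, they do not drag the second average down.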
First of all, the date $match operator should be used at the beginning of the pipeline so that indexes can be used.
Now, to the question, you just need to filter the depth interval like you did with the dates:
db.col.aggregate([
{"$match": {
'startdatetime': {
"$gt": new ISODate("2016-01-10T06:00:29Z"),
"$lt": new ISODate("2016-11-10T06:00:29Z")
}
}},
{"$unwind": "$diveValues"},
{"$match": {
"diveValues.depth": {
"$gte": 1.0,
"$lt": 4.0
}
}},
{"$group": {
"_id": {
"year": {"$year": "$startdatetime" },
"month": {"$month": "$startdatetime" }
},
"avgTemp": { "$avg": "$diveValues.temp" }}
}
])
This will give you the average only for the chosen depth interval.

How to group by different fields

I want to find all users named 'Hans' and aggregate their 'age' and number of 'childs' by grouping them.
Assuming I have following in my database 'users'.
{
"_id" : "01",
"user" : "Hans",
"age" : "50",
"childs" : "2"
}
{
"_id" : "02",
"user" : "Hans",
"age" : "40",
"childs" : "2"
}
{
"_id" : "03",
"user" : "Fritz",
"age" : "40",
"childs" : "2"
}
{
"_id" : "04",
"user" : "Hans",
"age" : "40",
"childs" : "1"
}
The result should be something like this:
"result" :
[
{
"age" :
[
{
"value" : "50",
"count" : "1"
},
{
"value" : "40",
"count" : "2"
}
]
},
{
"childs" :
[
{
"value" : "2",
"count" : "2"
},
{
"value" : "1",
"count" : "1"
}
]
}
]
How can I achieve this?
This should almost be a MongoDB FAQ, mostly because it is a real example concept of how you should be altering your thinking from SQL processing and embracing what engines like MongoDB do.
The basic principle here is "MongoDB does not do joins". Any way of "envisioning" how you would construct SQL to do this essentially requires a "join" operation. The typical form is "UNION" which is in fact a "join".
So how to do it under a different paradigm? Well first, let's look at how not to do it and understand the reasons why, even though of course it will work for your very small sample:
The Hard Way
db.docs.aggregate([
{ "$group": {
"_id": null,
"age": { "$push": "$age" },
"childs": { "$push": "$childs" }
}},
{ "$unwind": "$age" },
{ "$group": {
"_id": "$age",
"count": { "$sum": 1 },
"childs": { "$first": "$childs" }
}},
{ "$sort": { "_id": -1 } },
{ "$group": {
"_id": null,
"age": { "$push": {
"value": "$_id",
"count": "$count"
}},
"childs": { "$first": "$childs" }
}},
{ "$unwind": "$childs" },
{ "$group": {
"_id": "$childs",
"count": { "$sum": 1 },
"age": { "$first": "$age" }
}},
{ "$sort": { "_id": -1 } },
{ "$group": {
"_id": null,
"age": { "$first": "$age" },
"childs": { "$push": {
"value": "$_id",
"count": "$count"
}}
}}
])
That will give you a result like this:
{
"_id" : null,
"age" : [
{
"value" : "50",
"count" : 1
},
{
"value" : "40",
"count" : 3
}
],
"childs" : [
{
"value" : "2",
"count" : 3
},
{
"value" : "1",
"count" : 1
}
]
}
So why is this bad? The main problem should be apparent in the very first pipeline stage:
{ "$group": {
"_id": null,
"age": { "$push": "$age" },
"childs": { "$push": "$childs" }
}},
What we asked to do here is group up everything in the collection for the values we want and $push those results into an array. When things are small then this works, but real world collections would result in this "single document" in the pipeline that exceeds the 16MB BSON limit that is allowed. That is what is bad.
The rest of the logic follows the natural course by working with each array. But of course real world scenarios would almost always make this untenable.
You could avoid this somewhat, by doing things like "duplicating" the documents to be of "type" "age" or "childs" and grouping documents individually by type. But it's all a bit too "over complex" and not a solid way of doing things.
The natural response is "what about a UNION?", but since MongoDB does not do the "join" then how to approach that?
A Better Way ( aka A New Hope )
Your best approach here both architecturally and performance wise is to simply submit "both" queries ( yes two ) in "parallel" to the server via your client API. As the results are received you then "combine" them into a single response you can then send back as a source of data to your eventual "client" application.
Different languages have different approaches to this, but the general case is to look for an "asynchronous processing" API that allows you to do this in tandem.
My example here uses node.js, as the "asynchronous" side is basically "built in" and reasonably intuitive to follow. The "combination" side of things can be any type of "hash/map/dict" table implementation; this just does it the simple way, for example only:
var async = require('async'),
MongoClient = require('mongodb').MongoClient;
MongoClient.connect('mongodb://localhost/test',function(err,db) {
if (err) throw err;
var collection = db.collection('docs');
async.parallel(
[
function(callback) {
collection.aggregate(
[
{ "$group": {
"_id": "$age",
"type": { "$first": { "$literal": "age" } },
"count": { "$sum": 1 }
}},
{ "$sort": { "_id": -1 } }
],
callback
);
},
function(callback) {
collection.aggregate(
[
{ "$group": {
"_id": "$childs",
"type": { "$first": { "$literal": "childs" } },
"count": { "$sum": 1 }
}},
{ "$sort": { "_id": -1 } }
],
callback
);
}
],
function(err,results) {
if (err) throw err;
var response = {};
results.forEach(function(res) {
res.forEach(function(doc) {
if ( !response.hasOwnProperty(doc.type) )
response[doc.type] = [];
response[doc.type].push({
"value": doc._id,
"count": doc.count
});
});
});
console.log( JSON.stringify( response, null, 2 ) );
}
);
});
Which gives the cute result:
{
"age": [
{
"value": "50",
"count": 1
},
{
"value": "40",
"count": 3
}
],
"childs": [
{
"value": "2",
"count": 3
},
{
"value": "1",
"count": 1
}
]
}
So the key thing to note here is that the "separate" aggregation statements themselves are actually quite simple. The only thing you face is combining those in your final result. There are many approaches to "combining", particularly to deal with large results from each of the queries, but this is the basic example of the execution model.
Key points here.
Shuffling data in the aggregation pipeline is possible but not performant for large data sets.
Use a language implementation and API that support "parallel" and "asynchronous" execution so you can "load up" all or "most" of your operations at once.
The API should support some method of "combination" or otherwise allow a separate "stream" write to process each result set received into one.
Forget about the SQL way. The NoSQL way delegates the processing of such things as "joins" to your "data logic layer", which is what contains the code as shown here. It does it this way because it is scalable to very large datasets. It is rather the job of your "data logic" handling nodes in large applications to deliver this to the end API.
This is fast compared to any other form of "wrangling" I could possibly describe. Part of "NoSQL" thinking is to "Unlearn what you have learned" and look at things a different way. And if that way doesn't perform better, then stick with the SQL approach for storage and query.
That's why alternatives exist.
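The same execution model can also be sketched with plain promises. Note that this swaps the async library for Promise.all, and that runAge and runChilds are stand-ins that simply resolve with the aggregation results shown above; a real version would call the driver there instead.

```javascript
// Sketch of the "run in parallel, combine in the client" model using Promise.all
// instead of the async library. runAge and runChilds are hypothetical stand-ins
// that resolve with fixed results; a real version would call the driver.
function combine(results) {
  const response = {};
  for (const docs of results) {
    for (const doc of docs) {
      if (!Object.prototype.hasOwnProperty.call(response, doc.type)) response[doc.type] = [];
      response[doc.type].push({ value: doc._id, count: doc.count });
    }
  }
  return response;
}

const runAge = () => Promise.resolve([
  { _id: "50", type: "age", count: 1 },
  { _id: "40", type: "age", count: 3 }
]);
const runChilds = () => Promise.resolve([
  { _id: "2", type: "childs", count: 3 },
  { _id: "1", type: "childs", count: 1 }
]);

// Both "queries" are started before either result is awaited.
Promise.all([runAge(), runChilds()])
  .then(results => console.log(JSON.stringify(combine(results), null, 2)))
  .catch(err => console.error(err));
```

The combine step is deliberately kept separate from the I/O, so it can be swapped for a streaming merge when the result sets get large.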
That was a tough one!
First, the bare solution:
db.test.aggregate([
{ "$match": { "user": "Hans" } },
// duplicate each document: one for "age", the other for "childs"
{ $project: { age: "$age", childs: "$childs",
data: {$literal: ["age", "childs"]}}},
{ $unwind: "$data" },
// pivot data to something like { data: "age", value: "40" }
{ $project: { data: "$data",
value: {$cond: [{$eq: ["$data", "age"]},
"$age",
"$childs"]} }},
// Group by data type, and count
{ $group: { _id: {data: "$data", value: "$value" },
count: { $sum: 1 },
value: {$first: "$value"} }},
// aggregate values in an array for each independent (type,value) pair
{ $group: { _id: "$_id.data", values: { $push: { count: "$count", value: "$value" }} }} ,
// project value to the correctly named field
{ $project: { result: {$cond: [{$eq: ["$_id", "age"]},
{age: "$values" },
{childs: "$values"}]} }},
// group all data in the result array, and remove unneeded `_id` field
{ $group: { _id: null, result: { $push: "$result" }}},
{ $project: { _id: 0, result: 1}}
])
Producing:
{
"result" : [
{
"age" : [
{
"count" : 2,
"value" : "40"
},
{
"count" : 1,
"value" : "50"
}
]
},
{
"childs" : [
{
"count" : 1,
"value" : "1"
},
{
"count" : 2,
"value" : "2"
}
]
}
]
}
And now, for some explanations:
One of the major issues here is that each incoming document has to be part of two different sums. I solved that by adding a literal array ["age", "childs"] to your documents, and then unwinding them by that array. That way, each document will be presented twice in the later stage.
Once that is done, to ease processing, I change the data representation to something much more manageable like { data: "age", value: "40" }.
The following steps perform the data aggregation per se, up to the third $project step, which maps the values array to the corresponding age or childs field.
The final two steps will simply wrap the two documents in one, removing the unneeded _id field.
Pfff!
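If the stages are hard to follow, the same duplicate / pivot / group idea can be sketched in plain JavaScript. The pivotCounts helper is made up for this illustration and runs on the four sample documents from the question.

```javascript
// Plain-JavaScript sketch of the duplicate / pivot / group steps (hypothetical helper).
function pivotCounts(users, name) {
  const counts = {};
  for (const user of users.filter(u => u.user === name)) {
    // "Unwinding" the literal array yields one { data, value } pair per field.
    for (const data of ["age", "childs"]) {
      const value = user[data];
      counts[data] = counts[data] || {};
      counts[data][value] = (counts[data][value] || 0) + 1;
    }
  }
  // Wrap everything into the { result: [ { age: [...] }, { childs: [...] } ] } shape.
  return {
    result: Object.keys(counts).map(data => ({
      [data]: Object.entries(counts[data]).map(([value, count]) => ({ value, count }))
    }))
  };
}

const users = [
  { _id: "01", user: "Hans", age: "50", childs: "2" },
  { _id: "02", user: "Hans", age: "40", childs: "2" },
  { _id: "03", user: "Fritz", age: "40", childs: "2" },
  { _id: "04", user: "Hans", age: "40", childs: "1" }
];
console.log(JSON.stringify(pivotCounts(users, "Hans"), null, 2));
```

Every document is visited once per field of interest, which is what unwinding the literal ["age", "childs"] array achieves on the server.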

Mongodb aggregation, finding within an array of values

I have a schema that creates documents using the following structure:
{
"_id" : "2014-07-16:52TEST",
"date" : ISODate("2014-07-16T23:52:59.811Z"),
"name" : "TEST",
"values" : [
[
1405471921000,
0.737121
],
[
1405471922000,
0.737142
],
[
1405471923000,
0.737142
],
[
1405471924000,
0.737142
]
]
}
In the values, the first index is a timestamp. What I'm trying to do is query a specific timestamp to find the closest value ($gte).
I've tried the following aggregate query:
[
{ "$match": {
"values": {
"$elemMatch": { "0": {"$gte": 1405471923000} }
},
"name" : 'TEST'
}},
{ "$project" : {
"name" : 1,
"values" : 1
}},
{ "$unwind": "$values" },
{ "$match": { "values.0": { "$gte": 1405471923000 } } },
{ "$limit" : 1 },
{ "$sort": { "values.0": -1 } },
{ "$group": {
"_id": "$name",
"values": { "$push": "$values" },
}}
]
This seems to work, but it doesn't pull the closest value. It seems to pull anything greater or equal to and the sort doesn't seem to get applied, so it will pull a timestamp that is far in the future.
Any suggestions would be great!
Thank you
There are a couple of things wrong with the approach here even though it is a fair effort. You are right that you need to $sort here, but the problem is that you cannot "sort" on an inner element within an array. In order to get a value that can be sorted you must $unwind the array first, as it otherwise will not sort on an array position.
You also certainly do not want $limit in the pipeline. You might be testing this against a single document, but "limit" will actually act on the entire set of documents in the pipeline. So if more than one document was matching your condition then they would be thrown away.
The key thing you want to do here is use $first in your $group stage, which is applied once you have sorted to get the "closest" element that you want.
db.collection.aggregate([
// Documents that have an array element matching the condition
{ "$match": {
"values": { "$elemMatch": { "0": {"$gte": 1405471923000 } } }
}},
// Unwind the top level array
{ "$unwind": "$values" },
// Filter just the elements that match the condition
{ "$match": { "values.0": { "$gte": 1405471923000 } } },
// Take a copy of the inner array
{ "$project": {
"date": 1,
"name": 1,
"values": 1,
"valCopy": "$values"
}},
// Unwind the inner array copy
{ "$unwind": "$valCopy" },
// Filter the inner elements
{ "$match": { "valCopy": { "$gte": 1405471923000 } }},
// Sort on the now "timestamp" values ascending for nearest
{ "$sort": { "valCopy": 1 } },
// Take the "first" values
{ "$group": {
"_id": "$_id",
"date": { "$first": "$date" },
"name": { "$first": "$name" },
"values": { "$first": "$values" },
}},
// Optionally push back to array to match the original structure
{ "$group": {
"_id": "$_id",
"date": { "$first": "$date" },
"name": { "$first": "$name" },
"values": { "$push": "$values" },
}}
])
And this produces your document with just the "nearest" timestamp value matching the original document form:
{
"_id" : "2014-07-16:52TEST",
"date" : ISODate("2014-07-16T23:52:59.811Z"),
"name" : "TEST",
"values" : [
[
1405471923000,
0.737142
]
]
}
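Stripped of the pipeline syntax, the selection logic for a single document is just "filter, sort ascending, take the first". A minimal sketch in plain JavaScript, where nearestAtOrAfter is a made-up helper name:

```javascript
// Sketch of "filter, sort ascending, take the first" on one document's values array.
// `nearestAtOrAfter` is a made-up helper name.
function nearestAtOrAfter(values, target) {
  const candidates = values
    .filter(([timestamp]) => timestamp >= target)   // the $match after $unwind
    .sort((a, b) => a[0] - b[0]);                    // the ascending $sort
  return candidates.length ? candidates[0] : null;   // the $first in $group
}

const values = [
  [1405471921000, 0.737121],
  [1405471922000, 0.737142],
  [1405471923000, 0.737142],
  [1405471924000, 0.737142]
];
console.log(nearestAtOrAfter(values, 1405471923000)); // [ 1405471923000, 0.737142 ]
console.log(nearestAtOrAfter(values, 1405471925000)); // null
```

Taking the first element only after sorting is the step the original attempt got wrong by placing $limit before $sort.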

Server Side Looping

I've solved this problem but I'm looking for a better way to do it on the MongoDB server rather than on the client.
I have one collection of Orders with a placement datetime (iso date) and a product.
{ _id:1, datetime:"T1", product:"Apple"}
{ _id:2, datetime:"T2", product:"Orange"}
{ _id:3, datetime:"T3", product:"Pear"}
{ _id:4, datetime:"T4", product:"Pear"}
{ _id:5, datetime:"T5", product:"Apple"}
Goal: For a given time (or set of times) show the last order for EACH product in the set of my products before that time. Products are finite and known.
eg. query for time T6 will return:
{ _id:2, datetime:"T2", product:"Orange"}
{ _id:4, datetime:"T4", product:"Pear"}
{ _id:5, datetime:"T5", product:"Apple"}
T4 will return:
{ _id:1, datetime:"T1", product:"Apple"}
{ _id:2, datetime:"T2", product:"Orange"}
{ _id:4, datetime:"T4", product:"Pear"}
I've implemented this by creating a composite index on orders [datetime:descending, product:ascending].
Then on the Java client:
findLastOrdersForTimes(times) {
for (time : times) {
for (product : products) {
// newest order for this product placed before the given time
db.orders.find({ product: product, datetime: { $lt: time } }).sort({ datetime: -1 }).limit(1)
}
}
}
Now that is pretty fast since it hits the index and only fetches the data I need. However, I need to query for many time points (100000+), which will be a lot of calls over the network. Also, my orders table will be very large. So how can I do this on the server in one hit, i.e. return a collection of time->array of products? If it was Oracle, I'd create a stored proc with a cursor that loops back in time and collects the results for every time point and breaks when it gets to the last product after the last time point. I've looked at the aggregation framework and mapreduce but can't see how to achieve this kind of loop. Any pointers?
If you truly want the last order for each product, then the aggregation framework comes in:
db.times.aggregate([
{ "$match": {
"product": { "$in": products },
}},
{ "$group": {
"_id": "$product",
"datetime": { "$max": "$datetime" }
}}
])
Example with an array of products:
var products = ['Apple', 'Orange', 'Pear'];
{ "_id" : "Pear", "datetime" : "T4" }
{ "_id" : "Orange", "datetime" : "T2" }
{ "_id" : "Apple", "datetime" : "T5" }
Or if the _id from the original document is important to you, use the $sort with $last instead:
db.times.aggregate([
{ "$match": {
"product": { "$in": products },
}},
{ "$sort": { "datetime": 1 } },
{ "$group": {
"_id": "$product",
"id": { "$last": "$_id" },
"datetime": { "$last": "$datetime" }
}}
])
And that is what you most likely really want to do in either of those last cases. But the index you really want there is on "product":
db.times.createIndex({ "product": 1 })
So even if you need to iterate that with an additional $match condition using $lt for a certain timepoint, that is still better; otherwise you can modify the "grouping" to include the "datetime" as well as keeping a set in the $match.
It seems better at any rate, so perhaps this helps at least to modify your thinking.
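As a rough picture of what "$match, $sort, then $last per product" amounts to for one timepoint, here is a plain JavaScript sketch. Numeric datetimes stand in for T1 to T5, and lastOrders is a hypothetical helper, not something the server or driver provides.

```javascript
// Sketch of "$match, $sort, then $last per product" in plain JavaScript.
// Numeric datetimes stand in for T1..T5; lastOrders is a hypothetical helper.
function lastOrders(orders, products, before) {
  const last = new Map();
  orders
    .filter(o => products.includes(o.product) && o.datetime < before)
    .sort((a, b) => a.datetime - b.datetime)
    // Later orders overwrite earlier ones, so the newest one per product remains.
    .forEach(o => last.set(o.product, o));
  return [...last.values()];
}

const orders = [
  { _id: 1, datetime: 1, product: "Apple" },
  { _id: 2, datetime: 2, product: "Orange" },
  { _id: 3, datetime: 3, product: "Pear" },
  { _id: 4, datetime: 4, product: "Pear" },
  { _id: 5, datetime: 5, product: "Apple" }
];
console.log(lastOrders(orders, ["Apple", "Orange", "Pear"], 6).map(o => o._id));
```

On the server the filter and the grouping both happen in one round trip, which is the point of moving away from one findOne() per product and timepoint.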
If I'm reading your notes correctly, you seem to simply be looking to turn this on its head and find the last product for each point in time. So the statement is not much different:
db.times.aggregate([
{ "$match": {
"datetime": { "$in": ["T4","T5"] },
}},
{ "$sort": { "product": 1, "datetime": 1 } },
{ "$group": {
"_id": "$datetime",
"id": { "$last": "$_id" },
"product": { "$last": "$product" }
}}
])
In theory it is like that, based on how you present the question. I have the feeling though that you are abstracting this, and that "datetime" is possibly actual timestamps stored as date object types.
So you might not be aware of the date aggregation operators you can apply, for example to get the boundary of each hour:
db.times.aggregate([
{ "$group": {
"_id": {
"year": { "$year": "$datetime" },
"dayOfYear": { "$dayOfYear": "$datetime" },
"hour": { "$hour": "$datetime" }
},
"id": { "$last": "$_id" },
"datetime": { "$last": "$datetime" },
"product": { "$last": "$product" }
}}
])
Or even using date math instead of the operators, if you want an epoch based timestamp:
db.times.aggregate([
{ "$group": {
"_id": {
"$subtract": [
{ "$subtract": [ "$datetime", new Date("1970-01-01") ] },
{ "$mod": [
{ "$subtract": [ "$datetime", new Date("1970-01-01") ] },
1000*60*60
]}
]
},
"id": { "$last": "$_id" },
"datetime": { "$last": "$datetime" },
"product": { "$last": "$product" }
}}
])
Of course you can add a range query for dates in the $match with $gt and $lt operators to keep the data within the range you are particularly looking at.
Your overall solution is probably a combination of ideas, but as I said, your question seems to be about matching the last entries on certain time boundaries, so the last examples, possibly in combination with filtering certain products, are what you need rather than looping .findOne() requests.
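For the date math variant, the grouping key is simply the number of milliseconds since the epoch with the remainder of an hour removed. A small sketch in plain JavaScript, where hourBucket is a hypothetical helper mirroring the $subtract / $mod expression:

```javascript
// Sketch of the hourly bucket key: milliseconds since the epoch minus the remainder of an hour.
// `hourBucket` is a hypothetical helper mirroring the $subtract / $mod expression.
const HOUR_MS = 1000 * 60 * 60;

function hourBucket(date) {
  const sinceEpoch = date - new Date("1970-01-01"); // Date minus Date gives milliseconds
  return sinceEpoch - (sinceEpoch % HOUR_MS);
}

const a = hourBucket(new Date("2014-07-16T23:05:00Z"));
const b = hourBucket(new Date("2014-07-16T23:52:59.811Z"));
console.log(a === b);                   // true, both fall in the 23:00 hour
console.log(new Date(a).toISOString()); // 2014-07-16T23:00:00.000Z
```

Changing the divisor to another interval (a day, fifteen minutes) gives other boundaries with the same expression.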