I have the following query that computes the moving average on a MySQL table:
SELECT m1.x AS x, m1.y AS y, AVG(m2.y) AS average
FROM measured_signal AS m1
JOIN measured_signal AS m2 ON (m2.x BETWEEN m1.x - 5000 AND m1.x + 5000)
WHERE m1.x BETWEEN 5000 AND 15000 AND m2.x BETWEEN 0 AND 20000
GROUP BY m1.x
It works just fine, but now I am migrating to MongoDB and I need to perform the same operation.
I have read this question that is quite similar but doesn't cover my specific case.
So far I have written down the following pipeline:
db.getCollection("measured_signal").aggregate([
  { $match: { x: { $gt: 0, $lte: 20000 } } },
  { $sort: { x: 1 } },
  {
    $group: {
      _id: null,
      rows: {
        $push: { x: "$x", y: "$y" }
      }
    }
  },
  {
    $addFields: {
      rows: {
        $map: {
          input: {
            $filter: {
              input: "$rows",
              cond: {
                $gte: ["$$this.x", { $subtract: ["$$this.x", 5000] }],
                $lte: ["$$this.x", { $add: ["$$this.x", 5000] }]
              }
            }
          },
          in: {
            x: "$$this.x",
            y: "$$this.y",
            average: { $avg: "$$this.x" }
          }
        }
      }
    }
  },
  { $unwind: "$rows" },
  { $match: { x: { $gt: 5000, $lte: 15000 } } }
], { allowDiskUse: true });
but it doesn't work.
Should I try something completely different, or is there something I can change in this pipeline? Thanks for your help.
EDIT
To better understand the problem, I'm adding an example of input data
{x:3628, y: 0.1452},
{x:7256, y: 0.1358},
{x:10884, y: 0.1327},
{x:14512, y: 0.1285},
{x:18140, y: 0.1256},
{x:21768, y: 0.1268},
{x:25396, y: 0.1272},
{x:29024, y: 0.1301},
...
and the desired output, considering a window size of 5000:
{x:7256, y: 0.1358, average: 0.1379}, // average computed on rows between 2256 and 12256
{x:10884, y: 0.1327, average: 0.1323}, // average computed on rows between 5884 and 15884
{x:14512, y: 0.1285, average: 0.1289}, // average computed on rows between 9512 and 19512
{x:18140, y: 0.1256, average: 0.1270}, // average computed on rows between 13140 and 23140
{x:21768, y: 0.1268, average: 0.1265}, // average computed on rows between 16768 and 26768
{x:25396, y: 0.1272, average: 0.1280}, // average computed on rows between 20396 and 30396
...
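For reference, the desired output above can be checked with a plain JavaScript computation over the sample rows. This is only a sanity check of the expected values (using the inclusive bounds of SQL's BETWEEN), not a MongoDB solution:

```javascript
// Sample rows from the question
const rows = [
  { x: 3628, y: 0.1452 },
  { x: 7256, y: 0.1358 },
  { x: 10884, y: 0.1327 },
  { x: 14512, y: 0.1285 },
  { x: 18140, y: 0.1256 },
  { x: 21768, y: 0.1268 },
  { x: 25396, y: 0.1272 },
  { x: 29024, y: 0.1301 }
];

const WINDOW = 5000;

// For every row, average y over all rows whose x lies within +/- WINDOW
const movingAverage = rows.map(({ x, y }) => {
  const inWindow = rows.filter(r => r.x >= x - WINDOW && r.x <= x + WINDOW);
  const average = inWindow.reduce((sum, r) => sum + r.y, 0) / inWindow.length;
  return { x, y, average };
});

console.log(movingAverage.find(r => r.x === 7256).average.toFixed(4)); // → "0.1379"
```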
From your SQL, what I see as the "literal interpretation" as a MongoDB statement actually returns only three results from the eight documents posted in the question.
The statement I see as identical is actually:
db.measured_signal.aggregate([
{ "$match": { "x": { "$gt": 5000, "$lt": 15000 } } },
{ "$lookup": {
"from": "measured_signal",
"let": { "x": "$x", "y": "$y" },
"pipeline": [
{ "$match": {
"x": { "$gt": 0, "$lt": 20000 },
"$expr": {
"$and": [
{ "$gt": [ "$x", { "$subtract": [ "$$x", 5000 ] }] },
{ "$lt": [ "$x", { "$add": [ "$$x", 5000 ] }] }
]
}
}},
],
"as": "results"
}},
{ "$unwind": "$results" },
{ "$group": {
"_id": "$x",
"y": { "$first": "$y" },
"average": { "$avg": "$results.y" }
}},
{ "$addFields": {
"_id": "$$REMOVE",
"x": "$_id"
}},
{ "$sort": { "x": 1 } }
]).map(({ x, y, average }) => ({ x, y, average }))
And the result:
{
"x" : 7256,
"y" : 0.1358,
"average" : 0.1379
},
{
"x" : 10884,
"y" : 0.1327,
"average" : 0.13233333333333333
},
{
"x" : 14512,
"y" : 0.1285,
"average" : 0.12893333333333334
}
If you work through it, it's pretty logical.
Aggregation pipelines in MongoDB typically start with a $match condition. This is basically the WHERE clause in a declarative SQL statement, but in an aggregation pipeline this "filter" condition is applied first. Notably the JOIN has not yet happened, so the initial $match only looks at the initial (or m1) view of the collection/table.
The next thing would be the JOIN. This is done via $lookup, and here we can actually build an expression to "join" on that is equal to the conditions presented in the SQL. The second part of the WHERE is included in the $match within the pipeline argument of $lookup. This effectively applies another "filter" on the foreign documents (in this case a "self join").
The other thing to note is the let argument in the $lookup, along with the $expr in the $match of the inner pipeline. This allows values from the initial collection (or m1) to be compared with the foreign collection (or m2). As you can see, the expressions inside the $expr are written a little differently, since these are the "aggregation expression" forms of the comparison operators $gt and $lt, which return a Boolean from the compared values. In short, we bind variables to values from the initial document and compare them against values in the foreign collection to form part of the "join" condition. (Note that SQL's BETWEEN is inclusive while $gt/$lt are exclusive; none of the sample values fall exactly on a window boundary, so the results here are identical, but use $gte/$lte if you need the inclusive behaviour.)
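Stripped of the aggregation syntax, the $expr in the inner pipeline is just a per-document predicate. A plain-JavaScript mirror of the same "join" condition (a sketch, using the question's field names):

```javascript
const WINDOW = 5000;

// Plain-JavaScript equivalent of the $expr join condition: document m2
// matches m1 when m2.x lies strictly inside the +/- WINDOW band around m1.x
const joinCondition = (m1, m2) =>
  m2.x > m1.x - WINDOW && m2.x < m1.x + WINDOW;

console.log(joinCondition({ x: 7256 }, { x: 3628 }));  // → true  (3628 is inside 2256..12256)
console.log(joinCondition({ x: 7256 }, { x: 14512 })); // → false (14512 is outside the band)
```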
The output of $lookup is always an "array" added into the initial document containing the matched foreign results. This is always an array, even if there is only one result. The new field in the initial document holding this array is named by the as argument. To be literal to the SQL, a JOIN would produce denormalized output with one copy of the parent document per foreign child. The literal translation of that is $unwind, but alternatively you could skip that step and just change the later $avg line to:
"average": { "$avg": { "$avg": "$results.y" } }
On to that "average": the next stage is of course $group, where just like in the SQL you want to GROUP BY the x value from the initial collection document (still called x by MongoDB). MongoDB is a bit more literal than SQL here, in that you must use an accumulator for anything not in the GROUP BY key (the _id of the $group statement). This means using the $first operator as the appropriate "accumulator" for the y value.
The "average" is of course obtained with $avg, either directly on the singular denormalized values produced by $unwind, or first on the "array" content and then per "grouped document". Hence the second demonstrated example, where $avg is specified twice for those two purposes.
Since $group requires its GROUP BY key to be named _id by convention, if you want it renamed you need the $addFields stage. That's how you make MongoDB return the names you want from an aggregation pipeline, but personally I would probably keep the _id in the returned result and just rename it via a .map() or similar action. This is also demonstrated in the listing above, since $addFields and other $project operations keep the field order from the $group output, which means x would otherwise end up as the last field in the output documents rather than the first.
So the last parts really are cosmetic and you need not do them just to see the desired result. And of course the output from $group does not have a default order like a GROUP BY, so you want a $sort at the end of the actual pipeline execution, or optionally sort the resulting documents after translation to an array if the result is small enough and you prefer that.
NOTE Since the pipeline argument of $lookup is in fact a full pipeline in itself, you could (and in fact probably should) perform the $avg operation before returning the result array in as. This does not change the fact that an array must still be returned, but the returned result is significantly smaller and far safer in the case of a "large join", since you only return the one number needed.
Since this is "still" an array, it does not remove the need for either the $unwind or the double-$avg statement as demonstrated. It's just nicer not to return a large array of things you don't really need for the end result.
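A sketch of what that NOTE describes, with the $avg pushed inside the $lookup's inner pipeline so only one number comes back per document. Field and collection names follow the answer above; this is a shape sketch, not tested against a live server:

```javascript
// Moving-average pipeline with the $avg computed inside the $lookup's
// inner pipeline, so `results` comes back as [ { _id: null, average: <n> } ]
// instead of carrying every matched row.
const WINDOW = 5000;

const pipeline = [
  { $match: { x: { $gt: 5000, $lt: 15000 } } },
  { $lookup: {
    from: 'measured_signal',
    let: { x: '$x' },
    pipeline: [
      { $match: {
        x: { $gt: 0, $lt: 20000 },
        $expr: {
          $and: [
            { $gt: ['$x', { $subtract: ['$$x', WINDOW] }] },
            { $lt: ['$x', { $add: ['$$x', WINDOW] }] }
          ]
        }
      }},
      // Collapse the matched rows to a single average before returning
      { $group: { _id: null, average: { $avg: '$y' } } }
    ],
    as: 'results'
  }},
  // Unpack the single-element array into a plain field
  { $addFields: { average: { $arrayElemAt: ['$results.average', 0] } } },
  { $project: { _id: 0, x: 1, y: 1, average: 1 } },
  { $sort: { x: 1 } }
];
```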
Just to show these are in fact the same thing, here is your SQL running in a self-contained listing, and another listing running the statement on MongoDB. As you can see, both produce the same results.
The NodeJS code is just for the author's convenience in running against the two engines.
SQL Listing
const { Op, DOUBLE, SMALLINT } = Sequelize = require('sequelize');
const log = data => console.log(JSON.stringify(data, undefined, 2));
const logging = log;
const sequelize = new Sequelize('sqlite:dbname.db', { logging });
const MeasuredSignal = sequelize.define('measured_signal', {
id: { type: SMALLINT, primaryKey: true },
x: DOUBLE,
y: DOUBLE
}, { freezeTableName: true });
(async function() {
try {
await sequelize.authenticate();
await MeasuredSignal.sync({ force: true });
let result = await sequelize.transaction(transaction =>
Promise.all(
[
{x:3628, y: 0.1452},
{x:7256, y: 0.1358},
{x:10884, y: 0.1327},
{x:14512, y: 0.1285},
{x:18140, y: 0.1256},
{x:21768, y: 0.1268},
{x:25396, y: 0.1272},
{x:29024, y: 0.1301}
].map(d => MeasuredSignal.create(d, { transaction }))
)
);
let output = await sequelize.query(
`
SELECT m1.x AS x, m1.y AS y, AVG(m2.y) as average
FROM measured_signal as m1
JOIN measured_signal as m2
ON ( m2.x BETWEEN m1.x - 5000 AND m1.x + 5000)
WHERE m1.x BETWEEN 5000 AND 15000 AND m2.x BETWEEN 0 AND 20000
GROUP BY m1.x
`, { type: sequelize.QueryTypes.SELECT });
log(output);
} catch (e) {
console.error(e)
} finally {
process.exit()
}
})()
Output:
"Executing (default): SELECT 1+1 AS result"
"Executing (default): DROP TABLE IF EXISTS `measured_signal`;"
"Executing (default): CREATE TABLE IF NOT EXISTS `measured_signal` (`id` INTEGER PRIMARY KEY, `x` DOUBLE PRECISION, `y` DOUBLE PRECISION, `createdAt` DATETIME NOT NULL, `updatedAt` DATETIME NOT NULL);"
"Executing (default): PRAGMA INDEX_LIST(`measured_signal`)"
"Executing (7c7d0f4d-719a-4b4c-ad6a-5d5c209b8fa1): BEGIN DEFERRED TRANSACTION;"
"Executing (7c7d0f4d-719a-4b4c-ad6a-5d5c209b8fa1): INSERT INTO `measured_signal` (`id`,`x`,`y`,`createdAt`,`updatedAt`) VALUES ($1,$2,$3,$4,$5);"
"Executing (7c7d0f4d-719a-4b4c-ad6a-5d5c209b8fa1): INSERT INTO `measured_signal` (`id`,`x`,`y`,`createdAt`,`updatedAt`) VALUES ($1,$2,$3,$4,$5);"
"Executing (7c7d0f4d-719a-4b4c-ad6a-5d5c209b8fa1): INSERT INTO `measured_signal` (`id`,`x`,`y`,`createdAt`,`updatedAt`) VALUES ($1,$2,$3,$4,$5);"
"Executing (7c7d0f4d-719a-4b4c-ad6a-5d5c209b8fa1): INSERT INTO `measured_signal` (`id`,`x`,`y`,`createdAt`,`updatedAt`) VALUES ($1,$2,$3,$4,$5);"
"Executing (7c7d0f4d-719a-4b4c-ad6a-5d5c209b8fa1): INSERT INTO `measured_signal` (`id`,`x`,`y`,`createdAt`,`updatedAt`) VALUES ($1,$2,$3,$4,$5);"
"Executing (7c7d0f4d-719a-4b4c-ad6a-5d5c209b8fa1): INSERT INTO `measured_signal` (`id`,`x`,`y`,`createdAt`,`updatedAt`) VALUES ($1,$2,$3,$4,$5);"
"Executing (7c7d0f4d-719a-4b4c-ad6a-5d5c209b8fa1): INSERT INTO `measured_signal` (`id`,`x`,`y`,`createdAt`,`updatedAt`) VALUES ($1,$2,$3,$4,$5);"
"Executing (7c7d0f4d-719a-4b4c-ad6a-5d5c209b8fa1): INSERT INTO `measured_signal` (`id`,`x`,`y`,`createdAt`,`updatedAt`) VALUES ($1,$2,$3,$4,$5);"
"Executing (7c7d0f4d-719a-4b4c-ad6a-5d5c209b8fa1): COMMIT;"
"Executing (default): SELECT m1.x AS x, m1.y AS y, AVG(m2.y) as average\n FROM measured_signal as m1\n JOIN measured_signal as m2\n ON ( m2.x BETWEEN m1.x - 5000 AND m1.x + 5000)\n WHERE m1.x BETWEEN 5000 AND 15000 AND m2.x BETWEEN 0 AND 20000\n GROUP BY m1.x"
[
{
"x": 7256,
"y": 0.1358,
"average": 0.13790000000000002
},
{
"x": 10884,
"y": 0.1327,
"average": 0.13233333333333333
},
{
"x": 14512,
"y": 0.1285,
"average": 0.12893333333333332
}
]
MongoDB listing
const mongoose = require('mongoose');
const { Schema } = mongoose;
const uri = 'mongodb://localhost:27017/test';
const opts = { useNewUrlParser: true };
mongoose.set('useFindAndModify', false);
mongoose.set('useCreateIndex', true);
mongoose.set('debug', true);
const signalSchema = new Schema({
x: Number,
y: Number
});
const MeasuredSignal = mongoose.model('MeasuredSignal', signalSchema, 'measured_signal');
const log = data => console.log(JSON.stringify(data, undefined, 2));
(async function() {
try {
const conn = await mongoose.connect(uri, opts);
await Promise.all(
Object.entries(conn.models).map(([k,m]) => m.deleteMany())
);
await MeasuredSignal.insertMany([
{x:3628, y: 0.1452},
{x:7256, y: 0.1358},
{x:10884, y: 0.1327},
{x:14512, y: 0.1285},
{x:18140, y: 0.1256},
{x:21768, y: 0.1268},
{x:25396, y: 0.1272},
{x:29024, y: 0.1301}
]);
let result = await MeasuredSignal.aggregate([
{ "$match": { "x": { "$gt": 5000, "$lt": 15000 } } },
{ "$lookup": {
"from": MeasuredSignal.collection.name,
"let": { "x": "$x", "y": "$y" },
"pipeline": [
{ "$match": {
"x": { "$gt": 0, "$lt": 20000 },
"$expr": {
"$and": [
{ "$gt": [ "$x", { "$subtract": [ "$$x", 5000 ] } ] },
{ "$lt": [ "$x", { "$add": [ "$$x", 5000 ] } ] }
]
}
}}
],
"as": "results"
}},
{ "$group": {
"_id": "$x",
"y": { "$first": "$y" },
"average": { "$avg": { "$avg": "$results.y" } }
}},
{ "$sort": { "_id": 1 } }
]);
result = result.map(({ _id: x, y, average }) => ({ x, y, average }));
log(result);
} catch(e) {
console.error(e)
} finally {
mongoose.disconnect()
}
})()
Output:
Mongoose: measured_signal.deleteMany({}, {})
Mongoose: measured_signal.insertMany([ { _id: 5cb7158c50641f1837a7b272, x: 3628, y: 0.1452, __v: 0 }, { _id: 5cb7158c50641f1837a7b273, x: 7256, y: 0.1358, __v: 0 }, { _id: 5cb7158c50641f1837a7b274, x: 10884, y: 0.1327, __v: 0 }, { _id: 5cb7158c50641f1837a7b275, x: 14512, y: 0.1285, __v: 0 }, { _id: 5cb7158c50641f1837a7b276, x: 18140, y: 0.1256, __v: 0 }, { _id: 5cb7158c50641f1837a7b277, x: 21768, y: 0.1268, __v: 0 }, { _id: 5cb7158c50641f1837a7b278, x: 25396, y: 0.1272, __v: 0 }, { _id: 5cb7158c50641f1837a7b279, x: 29024, y: 0.1301, __v: 0 } ], {})
Mongoose: measured_signal.aggregate([ { '$match': { x: { '$gt': 5000, '$lt': 15000 } } }, { '$lookup': { from: 'measured_signal', let: { x: '$x', y: '$y' }, pipeline: [ { '$match': { x: { '$gt': 0, '$lt': 20000 }, '$expr': { '$and': [ { '$gt': [ '$x', { '$subtract': [Array] } ] }, { '$lt': [ '$x', { '$add': [Array] } ] } ] } } } ], as: 'results' } }, { '$group': { _id: '$x', y: { '$first': '$y' }, average: { '$avg': { '$avg': '$results.y' } } } }, { '$sort': { _id: 1 } } ], {})
[
{
"x": 7256,
"y": 0.1358,
"average": 0.1379
},
{
"x": 10884,
"y": 0.1327,
"average": 0.13233333333333333
},
{
"x": 14512,
"y": 0.1285,
"average": 0.12893333333333334
}
]
My MongoDB dataset is this:
{
"_id" : ObjectId("5a267533754884223467604a"),
"user_id" : "5a20ee1acdacc7086ce7742c",
"tv_count" : 1,
"ac_count" : 0,
"fridge_count" : 0,
"blower_count" : 0,
"chair_count" : 0,
"sofa_count" : 0,
"D2H_count" : 2,
"lastmodified" : ISODate("2017-12-05T10:30:30.559Z"),
"__v" : 0
}
So I want to do some modification at the time of the sum.
My sum code is this:
Accessories.aggregate([
{$match: { "lastmodified":{$gt: newTime}}},
{
$project: {
total: {
$add: [ "$tv_count", "$ac_count", "$fridge_count", "$blower_count", "$chair_count", "$sofa_count", "$D2H_count"]
}
}
}
]);
It will return this result:
[ { _id: 5a267533754884223467604a, total: 3 } ]
Now I want to do some extra calculation. For example:
The earlier result is 1+0+0+0+0+0+2 = 3.
My desired result would be (1*2)+0+0+0+0+0+(2*4) = 10.
Any help will be appreciated.
You can simply include additional nested calculations inside your existing projection like so:
$project: {
total: {
$add: [ { $multiply: [ "$tv_count", 2 ] }, "$ac_count", "$fridge_count", "$blower_count", "$chair_count", "$sofa_count", { $multiply: [ "$D2H_count", 4 ] }]
}
}
You should not use $addFields or any other additional stage for this type of thing: adding stages slows down the aggregation pipeline, and adding fields in particular inflates the interim result documents between the stages, which slows down your query further.
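If more counters gain multipliers later, the $add operand list can also be generated from a weights map rather than written by hand. A minimal sketch (the weights object and helper names are my own, not from the question; fields absent from the map keep weight 1):

```javascript
// Per-field multipliers; any field not in the map keeps weight 1
const weights = { tv_count: 2, D2H_count: 4 };

const fields = [
  'tv_count', 'ac_count', 'fridge_count', 'blower_count',
  'chair_count', 'sofa_count', 'D2H_count'
];

// Wrap weighted fields in $multiply, leave the rest as plain field paths
const addOperands = fields.map(f =>
  f in weights ? { $multiply: ['$' + f, weights[f]] } : '$' + f
);

const projectStage = { $project: { total: { $add: addOperands } } };
```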
I would add an $addFields stage to the aggregation pipeline between the $match and the $project stages.
In the $addFields stage you can define new "helper fields", e.g.:
{ $addFields: {
"tv_temp": { $multiply: [ "$tv_count", 2 ] },
"D2H_temp": { $multiply: [ "$D2H_count", 4 ] },
}
}
Using these fields, you can now calculate the total:
{ $project: {
total: {
$add: [ "$tv_temp", "$ac_count", "$fridge_count", "$blower_count", "$chair_count", "$sofa_count", "$D2H_temp"]
}
}}