How to do LEFT JOIN in MongoDB aggregate function? [duplicate] - mongodb

I have a collection of users where each document has following structure:
{
"_id": "<id>",
"login": "xxx",
"solved": [
{
"problem": "<problemID>",
"points": 10
},
...
]
}
The field solved may be empty or contain arbitrarily many subdocuments. My goal is to get a list of users together with their total score (the sum of points), where users who haven't solved any problem yet are assigned a total score of 0. Is it possible to do this with a single query (ideally using the aggregation framework)?
I was trying to use the following query in the aggregation framework:
{ "$group": {
"_id": "$_id",
"login": { "$first": "$login" },
"solved": { "$addToSet": { "points": 0 } }
} }
{ "$unwind": "$solved" }
{ "$group": {
"_id": "$_id",
"login": { "$first": "$login" },
"solved": { "$sum": "$solved.points" }
} }
However, I am getting the following error:
exception: The top-level _id field is the only field currently supported for exclusion
Thank you in advance

In MongoDB 3.2 and newer, the $unwind operator accepts options; in particular, the preserveNullAndEmptyArrays option solves this.
If this option is set to true and if the path is null, missing, or an empty array, $unwind outputs the document. If false, $unwind does not output a document if the path is null, missing, or an empty array. In your case, set it to true:
db.collection.aggregate([
{ "$unwind": {
"path": "$solved",
"preserveNullAndEmptyArrays": true
} },
{ "$group": {
"_id": "$_id",
"login": { "$first": "$login" },
"solved": { "$sum": "$solved.points" }
} }
])
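To see what the option changes, here is a plain JavaScript sketch that imitates the two stages on an in-memory array. The helper names `unwind` and `totalScores` are hypothetical and are not part of any MongoDB driver, and the sample users are made up.

```javascript
// Imitates { $unwind: { path, preserveNullAndEmptyArrays } } for a top-level field.
// Simplification: when a document is preserved, the field is dropped in every case.
function unwind(docs, field, preserveNullAndEmptyArrays) {
  return docs.flatMap((doc) => {
    const value = doc[field];
    const isEmpty = value == null || (Array.isArray(value) && value.length === 0);
    if (isEmpty) {
      if (!preserveNullAndEmptyArrays) return []; // document disappears
      const { [field]: _dropped, ...rest } = doc;
      return [rest]; // document survives without the array field
    }
    const items = Array.isArray(value) ? value : [value];
    return items.map((item) => ({ ...doc, [field]: item }));
  });
}

// Imitates the $group stage: one total per _id, missing points count as 0.
function totalScores(docs) {
  const totals = new Map();
  for (const doc of unwind(docs, "solved", true)) {
    const entry = totals.get(doc._id) || { _id: doc._id, login: doc.login, solved: 0 };
    const hasPoints = doc.solved && typeof doc.solved.points === "number";
    entry.solved += hasPoints ? doc.solved.points : 0;
    totals.set(doc._id, entry);
  }
  return [...totals.values()];
}

// Made-up sample data covering the three cases: filled, empty, missing.
const users = [
  { _id: 1, login: "alice", solved: [{ problem: "p1", points: 10 }, { problem: "p2", points: 5 }] },
  { _id: 2, login: "bob", solved: [] },
  { _id: 3, login: "carol" },
];
const scores = totalScores(users);
console.log(scores);
```

With `preserveNullAndEmptyArrays` set to false in this sketch, the second and third users would not reach the grouping step at all, which is the behaviour the question ran into.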

Here is a solution. It assumes that the field "solved" is either absent, equal to null, or an array of solved problems with their scores. The one case it does not handle is "solved" being an empty array, although that would be a simple additional adjustment.
project = {$project : {
"s" : {
"$ifNull" : [
"$solved",
[
{
"points" : 0
}
]
]
},
"login" : 1
}
};
unwind={$unwind:"$s"};
group= { "$group" : {
"_id" : "$_id",
"login" : {
"$first" : "$login"
},
"score" : {
"$sum" : "$s.points"
}
}
}
db.students.aggregate( [ project, unwind, group ] );
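One possible way to cover the empty-array case is to wrap the `$ifNull` in a `$cond` that checks `$size`. This is only a sketch and has not been run against a server: it assumes `solved` is never a non-array value, the name `projectWithEmpty` is made up, and `defaultSolved` is a hypothetical plain JavaScript mirror of the same condition, included just to show the intended behaviour.

```javascript
// Hypothetical replacement for the "project" stage (untested against a server).
const projectWithEmpty = {
  $project: {
    login: 1,
    s: {
      $cond: [
        // true when "solved" is missing, null, or an empty array
        { $eq: [{ $size: { $ifNull: ["$solved", []] } }, 0] },
        [{ points: 0 }],
        "$solved",
      ],
    },
  },
};

// Plain JavaScript mirror of the condition, for illustration only.
function defaultSolved(doc) {
  const solved = doc.solved == null ? [] : doc.solved;
  return solved.length === 0 ? [{ points: 0 }] : solved;
}

console.log(defaultSolved({ login: "xxx", solved: [] }));
```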

Use $lookup, then $unwind the looked-up array, and then $unwind the nested array that could be empty with preserveNullAndEmptyArrays set to true:
let posts = await Post.aggregate<ActivityDoc>([
{
$match: {
_id: new mongoose.Types.ObjectId(req.params.id),
},
},
{
$lookup: {
from: 'users',
localField: 'user',
foreignField: '_id',
as: 'user',
},
},
{
$unwind: '$user',
},
{
$unwind: {
path: '$user.follower',
preserveNullAndEmptyArrays: true,
},
},
{
$match: {
$or: [
{
$and: [
{
'privacy.mode': {
$eq: PrivacyMode.EveryOne,
},
},
],
},
{
$and: [
{
'privacy.mode': {
$eq: PrivacyMode.MyCircle,
},
},
{
'user.follower.id': {
$eq: req.currentUser?.id,
},
},
],
},
],
},
},
]);

Related

Aggregation error: $arrayElemAt's first argument must be an array, but is object

I'm trying to aggregate a collection in mongo using the following pipeline:
const results = await Price.aggregate([
{ $match: { date: today } },
{ $unwind: '$points' },
{ $match: { 'points.time': { $gte: start, $lte: now } } },
{ $sort: { 'points.time': 1 } },
{ $project: {
'high': { $max: '$points.price' },
'low': { $min: '$points.price' },
'open': { $arrayElemAt: ['$points', 0] },
'close': { $arrayElemAt: ['$points', -1] }
} }
])
However, the $arrayElemAt operator isn't working, presumably because one of the preceding stages ($unwind, I believe) converts the array of points in my documents to an object. How can I fix this?
Example document:
{
"_id" : ObjectId("5c93ac3ab89045027259a23f"),
"date" : ISODate("2019-03-21T00:00:00Z"),
"symbol" : "CC6P",
"points" : [
{
"_id" : ObjectId("5c93ac3ab89045027259a244"),
"volume" : 553,
"time" : ISODate("2019-03-21T09:35:34.239Z"),
"price" : 71
},
{
"_id" : ObjectId("5c93ac3ab89045027259a243"),
"volume" : 1736,
"time" : ISODate("2019-03-21T09:57:34.239Z"),
"price" : 49
},
....
],
My expected result is an array of objects, where the points passed to the $project stage are those within the time range specified in the second $match. I tried combining the two $match stages and removing the $unwind stage; the error went away, but the time range is no longer applied.
I believe you are missing a $group stage to roll your points array back up after the $unwind:
const results = await Price.aggregate([
{ "$match": { "date": today } },
{ "$unwind": "$points" },
{ "$match": { "points.time": { "$gte": start, "$lte": now } } },
{ "$sort": { "points.time": 1 } },
{ "$group": {
"_id": "$_id",
"points": { "$push": "$points" },
"date": { "$first": "$date" },
"symbol": { "$first": "$symbol" }
}},
{ "$project": {
"high": { "$max": "$points.price" },
"low": { "$min": "$points.price" },
"open": { "$arrayElemAt": ["$points", 0] },
"close": { "$arrayElemAt": ["$points", -1] }
}}
])
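What the pipeline computes for a single document can be sketched in plain JavaScript. The function name `summarize` is hypothetical, and the third sample point is made up to show the range filter; the first two come from the example document.

```javascript
// Imitates $unwind + $match + $sort + $group($push) + $project for one document.
function summarize(doc, start, end) {
  const points = doc.points
    .filter((p) => p.time >= start && p.time <= end) // the second $match
    .sort((a, b) => a.time - b.time);                // the $sort on points.time
  // After $unwind and $match, a document with no matching points yields no output.
  if (points.length === 0) return null;
  const prices = points.map((p) => p.price);
  return {
    _id: doc._id,
    high: Math.max(...prices),
    low: Math.min(...prices),
    open: points[0],                  // like { $arrayElemAt: ["$points", 0] }
    close: points[points.length - 1], // like { $arrayElemAt: ["$points", -1] }
  };
}

const doc = {
  _id: "5c93ac3ab89045027259a23f",
  points: [
    { time: new Date("2019-03-21T09:57:34.239Z"), price: 49 },
    { time: new Date("2019-03-21T09:35:34.239Z"), price: 71 },
    { time: new Date("2019-03-21T11:00:00.000Z"), price: 60 }, // made-up point outside the range
  ],
};
const summary = summarize(doc, new Date("2019-03-21T09:00:00Z"), new Date("2019-03-21T10:00:00Z"));
console.log(summary);
```

Note that `open` and `close` are whole point objects here, as they are in the pipeline; take `.price` from them if only the price is wanted.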

Lookup and group from two fields in one aggregation

I have an aggregation that looks like this:
userSchema.statics.getCounts = function (req, type) {
return this.aggregate([
{ $match: { organization: req.user.organization._id } },
{
$lookup: {
from: 'tickets', localField: `${type}Tickets`, foreignField: '_id', as: `${type}_tickets`,
},
},
{ $unwind: `$${type}_tickets` },
{ $match: { [`${type}_tickets.createdAt`]: { $gte: new Date(moment().subtract(4, 'd').startOf('day').utc()), $lt: new Date(moment().endOf('day').utc()) } } },
{
$group: {
_id: {
groupDate: {
$dateFromParts: {
year: { $year: `$${type}_tickets.createdAt` },
month: { $month: `$${type}_tickets.createdAt` },
day: { $dayOfMonth: `$${type}_tickets.createdAt` },
},
},
userId: `$${type}_tickets.assignee_id`,
},
ticketCount: {
$sum: 1,
},
},
},
{
$sort: { '_id.groupDate': -1 },
},
{ $group: { _id: '$_id.userId', data: { $push: { groupDate: '$_id.groupDate', ticketCount: '$ticketCount' } } } },
]);
};
Which outputs data like this:
[
{
_id: 5aeb6b71709f43359e0888bb,
data: [
{ "groupDate": "2018-05-07T00:00:000Z", ticketCount: 4 }
]
}
]
Ideally though, I would have data like this:
[
{
_id: 5aeb6b71709f43359e0888bb,
data: [
{ "groupDate": "2018-05-07T00:00:000Z", assignedCount: 4, resolvedCount: 8 }
]
}
]
The difference being that the object for the user would output both the total number of assigned tickets and the total number of resolved tickets for each date.
My userSchema is like this:
const userSchema = new Schema({
firstName: String,
lastName: String,
assignedTickets: [
{
type: mongoose.Schema.ObjectId,
ref: 'Ticket',
index: true,
},
],
resolvedTickets: [
{
type: mongoose.Schema.ObjectId,
ref: 'Ticket',
index: true,
},
],
}, {
timestamps: true,
});
An example user doc is like this:
{
"_id": "5aeb6b71709f43359e0888bb",
"assignedTickets": ["5aeb6ba7709f43359e0888bd", "5aeb6bf3709f43359e0888c2", "5aec7e0adcdd76b57af9e889"],
"resolvedTickets": ["5aeb6bc2709f43359e0888be", "5aeb6bc2709f43359e0888bf"],
"firstName": "Name",
"lastName": "Surname",
}
An example ticket doc is like this:
{
"_id": "5aeb6ba7709f43359e0888bd",
"ticket_id": 120292,
"type": "assigned",
"status": "Pending",
"assignee_email": "email#gmail.com",
"assignee_id": "5aeb6b71709f43359e0888bb",
"createdAt": "2018-05-02T20:05:59.147Z",
"updatedAt": "2018-05-03T20:05:59.147Z",
}
I've tried adding multiple lookups and group stages, but I keep getting an empty array. If I only do one lookup and one group, I get the correct counts for the searched on field, but I'd like to have both fields in one query. Is it possible to have the query group on two lookups?
In short you seem to be coming to terms with setting up your models in mongoose and have gone overboard with references. In reality you really should not keep the arrays within the "User" documents. This is actually an "anti-pattern" which was just something mongoose used initially as a convention for keeping "references" for population where it did not understand how to translate the references from being kept in the "child" to the "parent" instead.
You actually have that data in each "Ticket" and the natural form of $lookup is to use that "foreignField" in reference to the detail from the local collection. In this case the "assignee_id" on the tickets will suffice for looking at matching back to the "_id" of the "User". Though you don't state it, your "status" should be an indicator of whether the data is actually either "assigned" as when in "Pending" state or "resolved" when it is not.
For the sake of simplicity we are going to consider the state "resolved" if it is anything other than "Pending" in value, but extending on the logic from the example for actual needs is not the problem here.
Basically then we resolve to a single $lookup operation by actually using the natural "foreign key" as opposed to keeping separate arrays.
MongoDB 3.6 and greater
Ideally you would use features from MongoDB 3.6 with sub-pipeline processing here:
// Better date calculations
const oneDay = (1000 * 60 * 60 * 24);
var now = Date.now(),
end = new Date((now - (now % oneDay)) + oneDay),
start = new Date(end.valueOf() - (4 * oneDay));
User.aggregate([
{ "$match": { "organization": req.user.organization._id } },
{ "$lookup": {
"from": Ticket.collection.name,
"let": { "id": "$_id" },
"pipeline": [
{ "$match": {
"createdAt": { "$gte": start, "$lt": end },
"$expr": {
"$eq": [ "$$id", "$assignee_id" ]
}
}},
{ "$group": {
"_id": {
"status": "$status",
"date": {
"$dateFromParts": {
"year": { "$year": "$createdAt" },
"month": { "$month": "$createdAt" },
"day": { "$dayOfMonth": "$createdAt" }
}
}
},
"count": { "$sum": 1 }
}},
{ "$group": {
"_id": "$_id.date",
"data": {
"$push": {
"k": {
"$cond": [
{ "$eq": ["$_id.status", "Pending"] },
"assignedCount",
"resolvedCount"
]
},
"v": "$count"
}
}
}},
{ "$sort": { "_id": -1 } },
{ "$replaceRoot": {
"newRoot": {
"$mergeObjects": [
{ "groupDate": "$_id", "assignedCount": 0, "resolvedCount": 0 },
{ "$arrayToObject": "$data" }
]
}
}}
],
"as": "data"
}},
{ "$project": { "data": 1 } }
])
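The final `$replaceRoot` step, which lays the `k`/`v` pairs over a document of zeroed defaults, can be sketched in plain JavaScript. The function name `toCounts` is hypothetical.

```javascript
// Imitates $mergeObjects: [ defaults, { $arrayToObject: "$data" } ].
function toCounts(groupDate, pairs) {
  return {
    groupDate,
    assignedCount: 0, // defaults, overwritten when a matching pair exists
    resolvedCount: 0,
    ...Object.fromEntries(pairs.map(({ k, v }) => [k, v])),
  };
}

const merged = toCounts("2018-05-02T00:00:00.000Z", [{ k: "assignedCount", v: 1 }]);
console.log(merged);
```

In this sketch a repeated key keeps only the last value, so if several statuses other than "Pending" can occur on the same day, their counts would need to be added together before this step.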
From MongoDB 3.2 and upwards
Or where you lack those features we use a different pipeline process and a little data transformation after the results are returned from the server:
User.aggregate([
{ "$match": { "organization": req.user.organization._id } },
{ "$lookup": {
"from": Ticket.collection.name,
"localField": "_id",
"foreignField": "assignee_id",
"as": "data"
}},
{ "$unwind": "$data" },
{ "$match": {
"data.createdAt": { "$gte": start, "$lt": end }
}},
{ "$group": {
"_id": {
"userId": "$_id",
"date": {
"$add": [
{ "$subtract": [
{ "$subtract": [ "$data.createdAt", new Date(0) ] },
{ "$mod": [
{ "$subtract": [ "$data.createdAt", new Date(0) ] },
oneDay
]}
]},
new Date(0)
]
},
"status": "$data.status"
},
"count": { "$sum": 1 }
}},
{ "$group": {
"_id": {
"userId": "$_id.userId",
"date": "$_id.date"
},
"data": {
"$push": {
"k": {
"$cond": [
{ "$eq": [ "$_id.status", "Pending" ] },
"assignedCount",
"resolvedCount"
]
},
"v": "$count"
}
}
}},
{ "$sort": { "_id.userId": 1, "_id.date": -1 } },
{ "$group": {
"_id": "$_id.userId",
"data": {
"$push": {
"groupDate": "$_id.date",
"data": "$data"
}
}
}}
])
.then( results =>
results.map( ({ data, ...d }) =>
({
...d,
data: data.map(di =>
({
groupDate: di.groupDate,
assignedCount: 0,
resolvedCount: 0,
...di.data.reduce((acc,curr) => ({ ...acc, [curr.k]: curr.v }),{})
})
)
})
)
)
This really goes to show that even with the fancy features in modern releases, you don't strictly need them, because there have pretty much always been ways to work around this. Even the JavaScript parts just had slightly more long-winded versions before the current "object spread" syntax was available.
So that is really the direction you need to go in. What you certainly don't want is using "multiple" $lookup stages or even applying $filter conditions on what could potentially be large arrays. Also both forms here do their best to "filter down" the number of items "joined" from the foreign collection so as not to cause a breach of the BSON limit.
Particularly the "pre 3.6" version actually has a trick where $lookup + $unwind + $match occur in succession which you can see in the explain output. All stages actually combine into "one" stage there which solely returns only the items which match the conditions in the $match from the foreign collection. Keeping things "unwound" until we reduce further avoids BSON limit problems, as does the new form with MongoDB 3.6 where the "sub-pipeline" does all the document reduction and grouping before any results are returned.
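The `$subtract` and `$mod` arithmetic in the second pipeline rounds each `createdAt` down to the start of its UTC day, which is the same trick the date setup code uses for `end`. A minimal JavaScript sketch of that rounding follows; the names `ONE_DAY_MS` and `startOfUtcDay` are made up for illustration, and the sketch assumes dates on or after 1970-01-01.

```javascript
const ONE_DAY_MS = 1000 * 60 * 60 * 24;

// Round a Date down to 00:00:00.000 UTC of the same day
// by removing the remainder of milliseconds since the epoch.
function startOfUtcDay(date) {
  const ms = date.getTime();
  return new Date(ms - (ms % ONE_DAY_MS));
}

const rounded = startOfUtcDay(new Date("2018-05-02T20:05:59.147Z"));
console.log(rounded.toISOString());
```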
Your one document sample would return like this:
{
"_id" : ObjectId("5aeb6b71709f43359e0888bb"),
"data" : [
{
"groupDate" : ISODate("2018-05-02T00:00:00Z"),
"assignedCount" : 1,
"resolvedCount" : 0
}
]
}
That is once I expand the date selection to include that date. The date selection can of course also be improved and corrected from your original form.
So it seems to make sense that your relationships are actually defined that way but it's just that you recorded them "twice". You don't need to and even if that's not the definition then you should actually instead record on the "child" rather than an array in the parent. We can juggle and merge the parent arrays, but that's counterproductive to actually establishing the data relations correctly and using them correctly as well.
How about something like this?
db.users.aggregate([
{
$lookup:{ // lookup assigned tickets
from:'tickets',
localField:'assignedTickets',
foreignField:'_id',
as:'assigned',
}
},
{
$lookup:{ // lookup resolved tickets
from:'tickets',
localField:'resolvedTickets',
foreignField:'_id',
as:'resolved',
}
},
{
$project:{
"tickets":{ // merge all tickets into one single array
$concatArrays:[
"$assigned",
"$resolved"
]
}
}
},
{
$unwind:'$tickets' // flatten the 'tickets' array into separate documents
},
{
$group:{ // group by 'createdAt' and 'assignee_id'
_id:{
groupDate:{
$dateFromParts:{
year:{ $year:'$tickets.createdAt' },
month:{ $month:'$tickets.createdAt' },
day:{ $dayOfMonth:'$tickets.createdAt' },
},
},
userId:'$tickets.assignee_id',
},
assignedCount:{ // get the count of assigned tickets
$sum:{
$cond:[
{ // by checking the 'type' field for a value of 'assigned'
$eq:[
'$tickets.type',
'assigned'
]
},
1, // if matching count 1
0 // else 0
]
}
},
resolvedCount:{
$sum:{
$cond:[
{ // by checking the 'type' field for a value of 'resolved'
$eq:[
'$tickets.type',
'resolved'
]
},
1, // if matching count 1
0 // else 0
]
}
},
},
},
{
$sort:{ // sort by 'groupDate' descending
'_id.groupDate':-1
},
},
{
$group:{
_id:'$_id.userId', // group again but only by userId
data:{
$push:{ // create an array
groupDate:'$_id.groupDate',
assignedCount:{
$sum:'$assignedCount'
},
resolvedCount:{
$sum:'$resolvedCount'
}
}
}
}
}
])
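The conditional counting in the first `$group` (a `$sum` over a `$cond` that yields 1 or 0) can be sketched in plain JavaScript. The function name `countByType` is hypothetical, the sketch assumes `createdAt` is an ISO-8601 string as in the sample ticket, and all tickets except the first are made up.

```javascript
// Imitates grouping by (assignee_id, day) with $sum: { $cond: [ type matches, 1, 0 ] }.
function countByType(tickets) {
  const groups = new Map();
  for (const t of tickets) {
    const day = t.createdAt.slice(0, 10); // "YYYY-MM-DD" part of the ISO string
    const key = `${t.assignee_id}|${day}`;
    const g = groups.get(key) ||
      { userId: t.assignee_id, groupDate: day, assignedCount: 0, resolvedCount: 0 };
    g.assignedCount += t.type === "assigned" ? 1 : 0;
    g.resolvedCount += t.type === "resolved" ? 1 : 0;
    groups.set(key, g);
  }
  return [...groups.values()];
}

const counts = countByType([
  { assignee_id: "5aeb6b71709f43359e0888bb", type: "assigned", createdAt: "2018-05-02T20:05:59.147Z" },
  { assignee_id: "5aeb6b71709f43359e0888bb", type: "resolved", createdAt: "2018-05-02T08:00:00.000Z" },
  { assignee_id: "5aeb6b71709f43359e0888bb", type: "resolved", createdAt: "2018-05-02T09:00:00.000Z" },
  { assignee_id: "5aeb6b71709f43359e0888bb", type: "assigned", createdAt: "2018-05-03T09:00:00.000Z" },
]);
console.log(counts);
```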

Using the aggregation framework to compare array element overlap

I have a collection with documents structured like below:
{
carrier: "abc",
flightNumber: 123,
dates: [
ISODate("2015-01-01T00:00:00Z"),
ISODate("2015-01-02T00:00:00Z"),
ISODate("2015-01-03T00:00:00Z")
]
}
I would like to search the collection to see if there are any documents with the same carrier and flightNumber that also have dates in the dates array that overlap. For example:
{
carrier: "abc",
flightNumber: 123,
dates: [
ISODate("2015-01-01T00:00:00Z"),
ISODate("2015-01-02T00:00:00Z"),
ISODate("2015-01-03T00:00:00Z")
]
},
{
carrier: "abc",
flightNumber: 123,
dates: [
ISODate("2015-01-03T00:00:00Z"),
ISODate("2015-01-04T00:00:00Z"),
ISODate("2015-01-05T00:00:00Z")
]
}
If the above records were present in the collection I would like to return them because they both have carrier: abc, flightNumber: 123 and they also have the date ISODate("2015-01-03T00:00:00Z") in the dates array. If this date were not present in the second document then neither should be returned.
Typically I would do this by grouping and counting like below:
db.flights.aggregate([
{
$group: {
_id: { carrier: "$carrier", flightNumber: "$flightNumber" },
uniqueIds: { $addToSet: "$_id" },
count: { $sum: 1 }
}
},
{
$match: {
count: { $gt: 1 }
}
}
])
But I'm not sure how I could modify this to look for array overlap. Can anyone suggest how to achieve this?
You $unwind the array if you want to look at the contents as "grouped" within them:
db.flights.aggregate([
{ "$unwind": "$dates" },
{ "$group": {
"_id": { "carrier": "$carrier", "flightNumber": "$flightNumber", "date": "$dates" },
"count": { "$sum": 1 },
"_ids": { "$addToSet": "$_id" }
}},
{ "$match": { "count": { "$gt": 1 } } },
{ "$unwind": "$_ids" },
{ "$group": { "_id": "$_ids" } }
])
That does in fact tell you which documents contain the "overlap", because the "same dates", along with the other grouping key values that you are concerned about, have a "count" greater than one, which indicates the overlap.
Anything after the $match is really just for "presentation" as there is no point reporting the same _id value for multiple overlaps if you just want to see the overlaps. In fact if you want to see them together it would probably be best to leave the "grouped set" alone.
Now you could add a $lookup to that if retrieving the actual documents was important to you:
db.flights.aggregate([
{ "$unwind": "$dates" },
{ "$group": {
"_id": { "carrier": "$carrier", "flightNumber": "$flightNumber", "date": "$dates" },
"count": { "$sum": 1 },
"_ids": { "$addToSet": "$_id" }
}},
{ "$match": { "count": { "$gt": 1 } } },
{ "$unwind": "$_ids" },
{ "$group": { "_id": "$_ids" } },
{ "$lookup": {
"from": "flights",
"localField": "_id",
"foreignField": "_id",
"as": "_ids"
}},
{ "$unwind": "$_ids" },
{ "$replaceRoot": {
"newRoot": "$_ids"
}}
])
And even do a $replaceRoot or $project to make it return the whole document. Or you could have even done $addToSet with $$ROOT if it was not a problem for size.
But the overall point is covered in the first three pipeline stages, or mostly in just the "first". If you want to work with arrays "across documents", then the primary operator is still $unwind.
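The unwind-then-group idea can be sketched in plain JavaScript as follows. The function name `findOverlappingIds` and the sample `_id` values are hypothetical. One difference from the pipeline: this sketch counts distinct documents per key, whereas `$sum: 1` would also count a date repeated inside a single document.

```javascript
// For every (carrier, flightNumber, date) key, collect the _ids that contain it,
// then report the _ids that share at least one key with another document.
function findOverlappingIds(flights) {
  const idsByKey = new Map();
  for (const f of flights) {
    for (const d of f.dates) { // the "$unwind" over dates
      const key = [f.carrier, f.flightNumber, d.toISOString()].join("|");
      if (!idsByKey.has(key)) idsByKey.set(key, new Set());
      idsByKey.get(key).add(f._id); // like $addToSet
    }
  }
  const overlapping = new Set();
  for (const ids of idsByKey.values()) {
    if (ids.size > 1) ids.forEach((id) => overlapping.add(id));
  }
  return [...overlapping];
}

const day = (n) => new Date(`2015-01-0${n}T00:00:00Z`);
const overlappingIds = findOverlappingIds([
  { _id: "a", carrier: "abc", flightNumber: 123, dates: [day(1), day(2), day(3)] },
  { _id: "b", carrier: "abc", flightNumber: 123, dates: [day(3), day(4), day(5)] },
  { _id: "c", carrier: "abc", flightNumber: 456, dates: [day(3)] }, // different flight, no overlap
]);
console.log(overlappingIds);
```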
Alternately for a more "reporting" like format:
db.flights.aggregate([
{ "$addFields": { "copy": "$$ROOT" } },
{ "$unwind": "$dates" },
{ "$group": {
"_id": {
"carrier": "$carrier",
"flightNumber": "$flightNumber",
"dates": "$dates"
},
"count": { "$sum": 1 },
"_docs": { "$addToSet": "$copy" }
}},
{ "$match": { "count": { "$gt": 1 } } },
{ "$group": {
"_id": {
"carrier": "$_id.carrier",
"flightNumber": "$_id.flightNumber",
},
"overlaps": {
"$push": {
"date": "$_id.dates",
"_docs": "$_docs"
}
}
}}
])
Which would report the overlapped dates within each group and tell you which documents contained the overlap:
{
"_id" : {
"carrier" : "abc",
"flightNumber" : 123.0
},
"overlaps" : [
{
"date" : ISODate("2015-01-03T00:00:00.000Z"),
"_docs" : [
{
"_id" : ObjectId("5977f9187dcd6a5f6a9b4b97"),
"carrier" : "abc",
"flightNumber" : 123.0,
"dates" : [
ISODate("2015-01-03T00:00:00.000Z"),
ISODate("2015-01-04T00:00:00.000Z"),
ISODate("2015-01-05T00:00:00.000Z")
]
},
{
"_id" : ObjectId("5977f9187dcd6a5f6a9b4b96"),
"carrier" : "abc",
"flightNumber" : 123.0,
"dates" : [
ISODate("2015-01-01T00:00:00.000Z"),
ISODate("2015-01-02T00:00:00.000Z"),
ISODate("2015-01-03T00:00:00.000Z")
]
}
]
}
]
}

Querying mongoDB for some chart data - my pipeline seems convoluted

This is a long question. If you bother answering, I will be extra grateful.
I have some time series data that I am trying to query to create various charts. The data format isn't the most simple, but I think my aggregation pipeline is getting a bit out of hand. I am planning to use charts.js to visualise the data on the client.
I will post a sample of my data below as well as my pipeline, with the desired output.
My question is in two parts - answering either one could solve the problem.
1. Does charts.js accept data formats other than an array of numbers per row? This would mean my pipeline could try to do less.
2. My pipeline doesn't quite get to the result I need. Can you recommend any alterations to get the correct result from my pipeline? Is there a simpler way to get my desired output format?
Sample data
Here is a real data sample - a brand with one facebook account and one twitter account. There is some data for some dates in June. Lots of null day and month fields have been omitted.
Brand
[{
"_id": "5943f427e7c11ac3ad3652b0",
"name": "Brand1",
"facebookAccounts": [
"5943f427e7c11ac3ad3652ac",
],
"twitterAccounts": [
"5943f427e7c11ac3ad3652aa",
],
}]
FacebookAccounts
[
{
"_id" : "5943f427e7c11ac3ad3652ac",
"name": "Brand 1 Name",
"years": [
{
"date": "2017-01-01T00:00:00.000Z",
"months": [
{
"date": "2017-06-01T00:00:00.000Z",
"days": [
{
"date": "2017-06-16T00:00:00.000Z",
"likes": 904025,
},
{
"date": "2017-06-17T00:00:00.000Z",
"likes": null,
},
{
"date": "2017-06-18T00:00:00.000Z",
"likes": 904345,
},
],
},
],
}
]
}
]
Twitter accounts
[
{
"_id": "5943f427e7c11ac3ad3652aa",
"name": "Brand 1 Name",
"vendorId": "twitterhandle",
"years": [
{
"date": "2017-01-01T00:00:00.000Z",
"months": [
{
"date": "2017-06-01T00:00:00.000Z",
"days": [
{
"date": "2017-06-16T00:00:00.000Z",
"followers": 69390,
},
{
"date": "2017-06-17T00:00:00.000Z",
"followers": 69397,
},
{
"date": "2017-06-18T00:00:00.000Z",
"followers": 69428,
},
{
"date": "2017-06-19T00:00:00.000Z",
"followers": 69457,
},
]
},
],
}
]
}
]
The query
For this example, I want, for each brand, a daily sum of facebook likes and twitter followers between June 16th and June 18th. So here, the required format is:
{
brand: Brand1,
date: ["2017-06-16T00:00:00.000Z", "2017-06-17T00:00:00.000Z", "2017-06-18T00:00:00.000Z"],
stat: [973415, 69397, 973773]
}
The pipeline
The pipeline seems more convoluted due to the population, but I accept that complexity and it is necessary. Here are the steps:
db.getCollection('brands').aggregate([
{ $match: { _id: { $in: [ObjectId("5943f427e7c11ac3ad3652b0") ] } } },
// Unwind all relevant account types. Make one row per account
{ $project: {
accounts: { $setUnion: [ '$facebookAccounts', '$twitterAccounts' ] } ,
name: '$name'
}
},
{ $unwind: '$accounts' },
// populate the accounts.
// These transform the arrays of facebookAccount ObjectIds into the objects described above.
{ $lookup: { from: 'facebookaccounts', localField: 'accounts', foreignField: '_id', as: 'facebookAccounts' } },
{ $lookup: { from: 'twitteraccounts', localField: 'accounts', foreignField: '_id', as: 'twitterAccounts' } },
// unwind the populated accounts. Back to one record per account.
{ $unwind: { path: '$facebookAccounts', preserveNullAndEmptyArrays: true } },
{ $unwind: { path: '$twitterAccounts', preserveNullAndEmptyArrays: true } },
// unwind to the granularity we want. Here it is one record per day per account per brand.
{ $unwind: { path: '$facebookAccounts.years', preserveNullAndEmptyArrays: true } },
{ $unwind: { path: '$facebookAccounts.years.months', preserveNullAndEmptyArrays: true } },
{ $unwind: { path: '$facebookAccounts.years.months.days', preserveNullAndEmptyArrays: true } },
{ $unwind: { path: '$facebookAccounts.years.months.days', preserveNullAndEmptyArrays: true } },
{ $unwind: { path: '$twitterAccounts.years', preserveNullAndEmptyArrays: true } },
{ $unwind: { path: '$twitterAccounts.years.months', preserveNullAndEmptyArrays: true } },
{ $unwind: { path: '$twitterAccounts.years.months.days', preserveNullAndEmptyArrays: true } },
{ $unwind: { path: '$twitterAccounts.years.months.days', preserveNullAndEmptyArrays: true } },
// Filter each one between dates
{ $match: { $or: [
{ $and: [
{ 'facebookAccounts.years.months.days.date': { $gte: new Date('2017-06-16') } } ,
{ 'facebookAccounts.years.months.days.date': { $lte: new Date('2017-06-18') } }
]},
{ $and: [
{ 'twitterAccounts.years.months.days.date': { $gte: new Date('2017-06-16') } } ,
{ 'twitterAccounts.years.months.days.date': { $lte: new Date('2017-06-18') } }
]}
] }},
// Build stats and date arrays for each account
{ $group: {
_id: '$accounts',
brandId: { $first: '$_id' },
brandName: { $first: '$name' },
stat: {
$push: {
$sum: {
$add: [
{ $ifNull: ['$facebookAccounts.years.months.days.likes', 0] },
{ $ifNull: ['$twitterAccounts.years.months.days.followers', 0] }
]
}
}
},
date: { $push: { $ifNull: ['$facebookAccounts.years.months.days.date', '$twitterAccounts.years.months.days.date'] } } ,
}}
])
This gives me the output format
[{
_id: accountId, // facebook
brandName: 'Brand1',
date: ["2017-06-16T00:00:00.000Z", "2017-06-17T00:00:00.000Z", "2017-06-18T00:00:00.000Z"],
stat: [904025, null, 904345]
},
{
_id: accountId, // twitter
brandName: 'Brand1',
date: ["2017-06-16T00:00:00.000Z", "2017-06-17T00:00:00.000Z", "2017-06-18T00:00:00.000Z"],
stat: [69457, 69390, 69397]
}]
So I now need to perform column-wise addition on my stat properties. And then I am stuck: I feel like there should be a more pipeline-friendly way to sum these than column-wise addition.
Note I accept the extra work that the population required and am happy with that. Most of the repetition is done programmatically.
Thank you if you've gotten this far.
I can trim a lot of fat out of this and keep it compatible with MongoDB 3.2 ( which you must be using at least due to preserveNullAndEmptyArrays ) available operators with a few simple actions. Mostly by simply joining the arrays immediately after $lookup, which is the best place to do it:
Short Optimize
db.brands.aggregate([
{ "$lookup": {
"from": "facebookaccounts",
"localField": "facebookAccounts",
"foreignField": "_id",
"as": "facebookAccounts"
}},
{ "$lookup": {
"from": "twitteraccounts",
"localField": "twitterAccounts",
"foreignField": "_id",
"as": "twitterAccounts"
}},
{ "$project": {
"name": 1,
"all": {
"$concatArrays": [ "$facebookAccounts", "$twitterAccounts" ]
}
}},
{ "$match": {
"all.years.months.days.date": {
"$gte": new Date("2017-06-16"), "$lte": new Date("2017-06-18")
}
}},
{ "$unwind": "$all" },
{ "$unwind": "$all.years" },
{ "$unwind": "$all.years.months" },
{ "$unwind": "$all.years.months.days" },
{ "$match": {
"all.years.months.days.date": {
"$gte": new Date("2017-06-16"), "$lte": new Date("2017-06-18")
}
}},
{ "$group": {
"_id": {
"brand": "$name",
"date": "$all.years.months.days.date"
},
"total": {
"$sum": {
"$sum": [
{ "$ifNull": [ "$all.years.months.days.likes", 0 ] },
{ "$ifNull": [ "$all.years.months.days.followers", 0 ] }
]
}
}
}},
{ "$sort": { "_id": 1 } },
{ "$group": {
"_id": "$_id.brand",
"date": { "$push": "$_id.date" },
"stat": { "$push": "$total" }
}}
])
This gives the result:
{
"_id" : "Brand1",
"date" : [
ISODate("2017-06-16T00:00:00Z"),
ISODate("2017-06-17T00:00:00Z"),
ISODate("2017-06-18T00:00:00Z")
],
"stat" : [
973415,
69397,
973773
]
}
With MongoDB 3.4 we could probably speed it up a "little" more by filtering the arrays and breaking them down before we eventually $unwind to make this work across documents, or maybe even not worry about going across documents at all if the "name" from "brands" is unique. The pipeline operations to compact down the arrays "in place" though are quite cumbersome to code, if a "little" better on performance.
You seem to be doing this "per brand" or for a small sample, so it's likely of little consequence.
As for the chartjs data format, I don't seem to be able to get my hands on what I believe is a different data format to the array format here, but again this should have little bearing.
The main point I see addressed is we can easily move away from your previous output that separated the "facebook" and "twitter" data, and simply aggregate by date moving all the data together "before" the arrays are constructed.
That last point then obviates the need for further "convoluted" operations to attempt to "merge" those two documents and the arrays produced.
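The idea of summing by date before building the arrays can be sketched in plain JavaScript. The function name `sumByDate` is hypothetical, and the entries are the sample values from the question with dates kept as ISO strings so that a plain sort orders them.

```javascript
// Imitates grouping by date, summing likes and followers (null or missing count as 0),
// then sorting by date and pushing into parallel "date" and "stat" arrays.
function sumByDate(entries) {
  const totals = new Map();
  for (const e of entries) {
    const value = (e.likes || 0) + (e.followers || 0);
    totals.set(e.date, (totals.get(e.date) || 0) + value);
  }
  const dates = [...totals.keys()].sort();
  return { date: dates, stat: dates.map((d) => totals.get(d)) };
}

const chart = sumByDate([
  { date: "2017-06-16T00:00:00.000Z", likes: 904025 },
  { date: "2017-06-17T00:00:00.000Z", likes: null },
  { date: "2017-06-18T00:00:00.000Z", likes: 904345 },
  { date: "2017-06-16T00:00:00.000Z", followers: 69390 },
  { date: "2017-06-17T00:00:00.000Z", followers: 69397 },
  { date: "2017-06-18T00:00:00.000Z", followers: 69428 },
]);
console.log(chart.stat); // [ 973415, 69397, 973773 ]
```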
Alternate Optimize
As an alternative, where you do not in fact need to aggregate across documents, you can do the "filter" on the array in place and then simply sum and reshape the received result in client code.
db.brands.aggregate([
{ "$lookup": {
"from": "facebookaccounts",
"localField": "facebookAccounts",
"foreignField": "_id",
"as": "facebookAccounts"
}},
{ "$lookup": {
"from": "twitteraccounts",
"localField": "twitterAccounts",
"foreignField": "_id",
"as": "twitterAccounts"
}},
{ "$project": {
"name": 1,
"all": {
"$map": {
"input": { "$concatArrays": [ "$facebookAccounts", "$twitterAccounts" ] },
"as": "all",
"in": {
"years": {
"$map": {
"input": "$$all.years",
"as": "year",
"in": {
"months": {
"$map": {
"input": "$$year.months",
"as": "month",
"in": {
"days": {
"$filter": {
"input": "$$month.days",
"as": "day",
"cond": {
"$and": [
{ "$gte": [ "$$day.date", new Date("2017-06-16") ] },
{ "$lte": [ "$$day.date", new Date("2017-06-18") ] }
]
}
}
}
}
}
}
}
}
}
}
}
}
}}
]).map(doc => {
doc.all = [].concat.apply([],[].concat.apply([],[].concat.apply([],doc.all.map(d => d.years)).map(d => d.months)).map(d => d.days));
doc.all = doc.all.reduce((a,b) => {
if ( a.findIndex( d => d.date.valueOf() == b.date.valueOf() ) != -1 ) {
a[a.findIndex( d => d.date.valueOf() == b.date.valueOf() )].stat += (b.hasOwnProperty('likes')) ? (b.likes || 0) : (b.followers || 0);
} else {
a = a.concat([{ date: b.date, stat: (b.hasOwnProperty('likes')) ? (b.likes || 0) : (b.followers || 0) }]);
}
return a;
},[]);
doc.date = doc.all.map(d => d.date);
doc.stat = doc.all.map(d => d.stat);
delete doc.all;
return doc;
})
This really leaves all the things that "need" to happen on the server, on the server. And it's then a fairly trivial task to "flatten" the array and process to "sum up" and reshape it. This would mean less load on the server, and the data returned is not really that much greater per document.
Gives the same result of course:
[
{
"_id" : ObjectId("5943f427e7c11ac3ad3652b0"),
"name" : "Brand1",
"date" : [
ISODate("2017-06-16T00:00:00Z"),
ISODate("2017-06-17T00:00:00Z"),
ISODate("2017-06-18T00:00:00Z")
],
"stat" : [
973415,
69397,
973773
]
}
]
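The nested `[].concat.apply` calls in the client code only flatten accounts, years, and months down to one list of day entries. On runtimes that provide `Array.prototype.flatMap`, the same flattening could be sketched as below; the function name `flattenDays` is hypothetical and the sample values are made up.

```javascript
// Flatten accounts -> years -> months -> days into a single array of day entries.
function flattenDays(accounts) {
  return accounts.flatMap((account) =>
    account.years.flatMap((year) =>
      year.months.flatMap((month) => month.days)
    )
  );
}

const days = flattenDays([
  { years: [{ months: [{ days: [
    { date: "2017-06-16", likes: 1 },
    { date: "2017-06-17", likes: 2 },
  ] }] }] },
  { years: [{ months: [{ days: [
    { date: "2017-06-16", followers: 3 },
  ] }] }] },
]);
console.log(days.length);
```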
Committing to the Diet
The biggest problem you really have is with the multiple collections and the heavily nested documents. Neither of these is doing you any favors here and will with larger results cause real performance problems.
The nesting in particular is completely unnecessary as well as not being very maintainable since there are limitations to "update" where you have nested arrays. See the positional $ operator documentation, as well as many posts about this.
Instead you really want a single collection with all those "days" entries in it. You can always work with that source easily for query as well as aggregation purposes and it should look something like this:
{
"_id" : ObjectId("5948cd5cd6eb0b7d6ac38097"),
"date" : ISODate("2017-06-16T00:00:00Z"),
"likes" : 904025,
"__t" : "Facebook",
"account" : ObjectId("5943f427e7c11ac3ad3652ac")
}
{
"_id" : ObjectId("5948cd5cd6eb0b7d6ac38098"),
"date" : ISODate("2017-06-17T00:00:00Z"),
"likes" : null,
"__t" : "Facebook",
"account" : ObjectId("5943f427e7c11ac3ad3652ac")
}
{
"_id" : ObjectId("5948cd5cd6eb0b7d6ac38099"),
"date" : ISODate("2017-06-18T00:00:00Z"),
"likes" : 904345,
"__t" : "Facebook",
"account" : ObjectId("5943f427e7c11ac3ad3652ac")
}
{
"_id" : ObjectId("5948cd5cd6eb0b7d6ac3809a"),
"date" : ISODate("2017-06-16T00:00:00Z"),
"followers" : 69390,
"__t" : "Twitter",
"account" : ObjectId("5943f427e7c11ac3ad3652aa")
}
{
"_id" : ObjectId("5948cd5cd6eb0b7d6ac3809b"),
"date" : ISODate("2017-06-17T00:00:00Z"),
"followers" : 69397,
"__t" : "Twitter",
"account" : ObjectId("5943f427e7c11ac3ad3652aa")
}
{
"_id" : ObjectId("5948cd5cd6eb0b7d6ac3809c"),
"date" : ISODate("2017-06-18T00:00:00Z"),
"followers" : 69428,
"__t" : "Twitter",
"account" : ObjectId("5943f427e7c11ac3ad3652aa")
}
{
"_id" : ObjectId("5948cd5cd6eb0b7d6ac3809d"),
"date" : ISODate("2017-06-19T00:00:00Z"),
"followers" : 69457,
"__t" : "Twitter",
"account" : ObjectId("5943f427e7c11ac3ad3652aa")
}
Combining those referenced in the brands collection as well:
{
"_id" : ObjectId("5943f427e7c11ac3ad3652b0"),
"name" : "Brand1",
"accounts" : [
ObjectId("5943f427e7c11ac3ad3652ac"),
ObjectId("5943f427e7c11ac3ad3652aa")
]
}
Then you simply aggregate like this:
db.brands.aggregate([
{ "$lookup": {
"from": "social",
"localField": "accounts",
"foreignField": "account",
"as": "accounts"
}},
{ "$unwind": "$accounts" },
{ "$match": {
"accounts.date": {
"$gte": new Date("2017-06-16"), "$lte": new Date("2017-06-18")
}
}},
{ "$group": {
"_id": {
"brand": "$name",
"date": "$accounts.date"
},
"stat": {
"$sum": {
"$sum": [
{ "$ifNull": [ "$accounts.likes", 0 ] },
{ "$ifNull": [ "$accounts.followers", 0 ] }
]
}
}
}},
{ "$sort": { "_id": 1 } },
{ "$group": {
"_id": "$_id.brand",
"date": { "$push": "$_id.date" },
"stat": { "$push": "$stat" }
}}
])
This is actually the most efficient thing you can do, and it's mostly because of what actually happens on the server. We need to look at the "explain" output to see what happens to the pipeline here:
{
"$lookup" : {
"from" : "social",
"as" : "accounts",
"localField" : "accounts",
"foreignField" : "account",
"unwinding" : {
"preserveNullAndEmptyArrays" : false
},
"matching" : {
"$and" : [
{
"date" : {
"$gte" : ISODate("2017-06-16T00:00:00Z")
}
},
{
"date" : {
"$lte" : ISODate("2017-06-18T00:00:00Z")
}
}
]
}
}
}
This is what happens when you send $lookup -> $unwind -> $match to the server as the latter two stages are "hoisted" into the $lookup itself. This reduces the results in the actual "query" run on the collection to be joined.
Without that sequence, $lookup potentially pulls in "a lot of data" with no constraint, and would break the 16MB BSON limit under most normal loads.
So not only is the process much simpler in the altered form, it actually "scales" where the present structure will not. This is something that you should seriously consider.
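To make the expected result of that aggregation concrete, here is a rough plain JavaScript emulation of the same steps run against the sample documents. It is only a sketch and not MongoDB code: the function name brandStats is made up for illustration, the ObjectIds are shortened to "ac" and "aa", and dates are kept as ISO strings so they compare correctly as text.

```javascript
// In-memory stand-ins for the "social" and "brands" collections shown above.
// Assumption: ids are abbreviated and dates are ISO strings, purely for illustration.
const social = [
  { date: "2017-06-16", likes: 904025, account: "ac" },
  { date: "2017-06-17", likes: null, account: "ac" },
  { date: "2017-06-18", likes: 904345, account: "ac" },
  { date: "2017-06-16", followers: 69390, account: "aa" },
  { date: "2017-06-17", followers: 69397, account: "aa" },
  { date: "2017-06-18", followers: 69428, account: "aa" },
  { date: "2017-06-19", followers: 69457, account: "aa" },
];
const brands = [{ name: "Brand1", accounts: ["ac", "aa"] }];

// Hypothetical helper mirroring $lookup -> $unwind -> $match -> $group -> $sort -> $group.
function brandStats(brands, social, from, to) {
  return brands.map((brand) => {
    const perDate = new Map();
    for (const row of social) {
      // $lookup + $unwind: keep only rows for accounts the brand references
      if (!brand.accounts.includes(row.account)) continue;
      // $match: inclusive date range
      if (row.date < from || row.date > to) continue;
      // $ifNull: a missing or null counter contributes 0 to the sum
      const value = (row.likes ?? 0) + (row.followers ?? 0);
      perDate.set(row.date, (perDate.get(row.date) ?? 0) + value);
    }
    // $sort by date, then $push into two parallel arrays
    const dates = [...perDate.keys()].sort();
    return { _id: brand.name, date: dates, stat: dates.map((d) => perDate.get(d)) };
  });
}

const result = brandStats(brands, social, "2017-06-16", "2017-06-18");
console.log(JSON.stringify(result));
// [{"_id":"Brand1","date":["2017-06-16","2017-06-17","2017-06-18"],"stat":[973415,69397,973773]}]
```

Note how the null "likes" on the 17th simply counts as 0, and the entry for the 19th falls outside the range and is dropped.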

Perform union in mongoDB

I'm wondering how to perform a kind of union in an aggregate in MongoDB. Let's imagine the following document in a collection (the structure is for the sake of the example):
{
linkedIn: {
people : [
{
name : 'Fred'
},
{
name : 'Matilda'
}
]
},
twitter: {
people : [
{
name : 'Hanna'
},
{
name : 'Walter'
}
]
}
}
How can I write an aggregate that returns the union of the people in twitter and linkedIn?
[
{ name: 'Fred', source: 'LinkedIn' },
{ name: 'Matilda', source: 'LinkedIn' },
{ name: 'Hanna', source: 'Twitter' },
{ name: 'Walter', source: 'Twitter' }
]
There are a couple of approaches to this using the aggregate method:
db.collection.aggregate([
// Assign an array of constants to each document
{ "$project": {
"linkedIn": 1,
"twitter": 1,
"source": { "$cond": [1, ["linkedIn", "twitter"],0 ] }
}},
// Unwind the array
{ "$unwind": "$source" },
// Conditionally push the fields based on the matching constant
{ "$group": {
"_id": "$_id",
"data": { "$push": {
"$cond": [
{ "$eq": [ "$source", "linkedIn" ] },
{ "source": "$source", "people": "$linkedIn.people" },
{ "source": "$source", "people": "$twitter.people" }
]
}}
}},
// Unwind that array
{ "$unwind": "$data" },
// Unwind the underlying people array
{ "$unwind": "$data.people" },
// Project the required fields
{ "$project": {
"_id": 0,
"name": "$data.people.name",
"source": "$data.source"
}}
])
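For checking what a pipeline like this should produce for a single document, a small plain JavaScript sketch of the same idea can help. This is not MongoDB code; taggedUnion is a hypothetical helper name, and the sample document is the one from the question.

```javascript
// Sample document from the question.
const doc = {
  linkedIn: { people: [{ name: "Fred" }, { name: "Matilda" }] },
  twitter: { people: [{ name: "Hanna" }, { name: "Walter" }] },
};

// Hypothetical helper: for each source name, walk its "people" array and
// emit one flat { name, source } record per person (the union with a tag).
function taggedUnion(doc, sources) {
  return sources.flatMap((source) =>
    // A missing source or missing "people" array contributes nothing
    (doc[source]?.people ?? []).map((person) => ({ name: person.name, source }))
  );
}

console.log(taggedUnion(doc, ["linkedIn", "twitter"]));
```

The source tag here is the field name itself ("linkedIn", "twitter"), which matches what the pipeline emits rather than the capitalised labels in the question.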
Or with a different approach using some operators from MongoDB 2.6:
db.collection.aggregate([
// Unwind the "linkedIn" people
{ "$unwind": "$linkedIn.people" },
// Tag their source and re-group the array
{ "$group": {
"_id": "$_id",
"linkedIn": { "$push": {
"name": "$linkedIn.people.name",
"source": { "$literal": "linkedIn" }
}},
"twitter": { "$first": "$twitter" }
}},
// Unwind the "twitter" people
{ "$unwind": "$twitter.people" },
// Tag their source and re-group the array
{ "$group": {
"_id": "$_id",
"linkedIn": { "$first": "$linkedIn" },
"twitter": { "$push": {
"name": "$twitter.people.name",
"source": { "$literal": "twitter" }
}}
}},
// Merge the sets with "$setUnion"
{ "$project": {
"data": { "$setUnion": [ "$twitter", "$linkedIn" ] }
}},
// Unwind the union array
{ "$unwind": "$data" },
// Project the fields
{ "$project": {
"_id": 0,
"name": "$data.name",
"source": "$data.source"
}}
])
And of course if you simply did not care what the source was:
db.collection.aggregate([
// Union the two arrays
{ "$project": {
"data": { "$setUnion": [
"$linkedIn.people",
"$twitter.people"
]}
}},
// Unwind the union array
{ "$unwind": "$data" },
// Project the fields
{ "$project": {
"_id": 0,
"name": "$data.name",
}}
])
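One thing to keep in mind with $setUnion is that it treats the arrays as sets: duplicate entries are removed and the order of the output is not guaranteed. So someone listed under both sources would appear only once in the untagged version, while the tagged version keeps both entries because the "source" differs. A rough plain JavaScript sketch of that behaviour follows; setUnion here is a hypothetical stand-in, and comparing elements by their JSON text is a simplification of how MongoDB compares documents.

```javascript
// Hypothetical in-memory stand-in for $setUnion: merge arrays and drop
// duplicates, where two elements count as equal if their JSON text matches.
function setUnion(...arrays) {
  const seen = new Map();
  for (const item of arrays.flat()) {
    const key = JSON.stringify(item);
    if (!seen.has(key)) seen.set(key, item);
  }
  return [...seen.values()];
}

// Untagged: "Fred" is in both lists, so only one copy survives.
const untagged = setUnion(
  [{ name: "Fred" }, { name: "Matilda" }],
  [{ name: "Fred" }, { name: "Hanna" }]
);
console.log(untagged.length); // 3

// Tagged: the two "Fred" entries differ by source, so both are kept.
const tagged = setUnion(
  [{ name: "Fred", source: "linkedIn" }],
  [{ name: "Fred", source: "twitter" }]
);
console.log(tagged.length); // 2
```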
Not sure whether aggregate is recommended over map-reduce for this kind of operation, but the following does what you're asking for (I don't know whether $const can be used without any issue in .aggregate()):
db.collection.aggregate([
{ $project: { linkedIn: '$linkedIn', twitter: '$twitter', idx: { $const: [0,1] }}},
{ $unwind: '$idx' },
{ $group: {
_id: '$_id',
data: { $push: { $cond: [
{ $eq: ['$idx', 0] },
{ source: { $const: 'LinkedIn' }, people: '$linkedIn.people' },
{ source: { $const: 'Twitter' }, people: '$twitter.people' }
] } }
} },
{ $unwind: '$data'},
{ $unwind: '$data.people'},
{ $project: { _id: 0, name: '$data.people.name', source: '$data.source' }}
])