Using the aggregation framework to compare array element overlap - mongodb

I have a collection with documents structured like below:
{
carrier: "abc",
flightNumber: 123,
dates: [
ISODate("2015-01-01T00:00:00Z"),
ISODate("2015-01-02T00:00:00Z"),
ISODate("2015-01-03T00:00:00Z")
]
}
I would like to search the collection to see if there are any documents with the same carrier and flightNumber whose dates arrays also overlap. For example:
{
carrier: "abc",
flightNumber: 123,
dates: [
ISODate("2015-01-01T00:00:00Z"),
ISODate("2015-01-02T00:00:00Z"),
ISODate("2015-01-03T00:00:00Z")
]
},
{
carrier: "abc",
flightNumber: 123,
dates: [
ISODate("2015-01-03T00:00:00Z"),
ISODate("2015-01-04T00:00:00Z"),
ISODate("2015-01-05T00:00:00Z")
]
}
If the above records were present in the collection I would like to return them because they both have carrier: abc, flightNumber: 123 and they also have the date ISODate("2015-01-03T00:00:00Z") in the dates array. If this date were not present in the second document then neither should be returned.
Typically I would do this by grouping and counting like below:
db.flights.aggregate([
{
$group: {
_id: { carrier: "$carrier", flightNumber: "$flightNumber" },
uniqueIds: { $addToSet: "$_id" },
count: { $sum: 1 }
}
},
{
$match: {
count: { $gt: 1 }
}
}
])
But I'm not sure how I could modify this to look for array overlap. Can anyone suggest how to achieve this?

You $unwind the array if you want to group on its elements across documents:
db.flights.aggregate([
{ "$unwind": "$dates" },
{ "$group": {
"_id": { "carrier": "$carrier", "flightNumber": "$flightNumber", "date": "$dates" },
"count": { "$sum": 1 },
"_ids": { "$addToSet": "$_id" }
}},
{ "$match": { "count": { "$gt": 1 } } },
{ "$unwind": "$_ids" },
{ "$group": { "_id": "$_ids" } }
])
That does in fact tell you which documents contain the "overlap": when the same date occurs together with the same values for the other grouping keys, that group has a "count" greater than one, which indicates the overlap.
Anything after the $match is really just for "presentation" as there is no point reporting the same _id value for multiple overlaps if you just want to see the overlaps. In fact if you want to see them together it would probably be best to leave the "grouped set" alone.
Now you could add a $lookup to that if retrieving the actual documents was important to you:
db.flights.aggregate([
{ "$unwind": "$dates" },
{ "$group": {
"_id": { "carrier": "$carrier", "flightNumber": "$flightNumber", "date": "$dates" },
"count": { "$sum": 1 },
"_ids": { "$addToSet": "$_id" }
}},
{ "$match": { "count": { "$gt": 1 } } },
{ "$unwind": "$_ids" },
{ "$group": { "_id": "$_ids" } },
{ "$lookup": {
"from": "flights",
"localField": "_id",
"foreignField": "_id",
"as": "_ids"
}},
{ "$unwind": "$_ids" },
{ "$replaceRoot": {
"newRoot": "$_ids"
}}
])
The final $replaceRoot (or a $project) makes it return the whole document. Alternatively, you could have used $addToSet with $$ROOT in the $group stage, if document size is not a problem.
But the overall point is covered in the first three pipeline stages, or mostly in just the "first". If you want to work with arrays "across documents", then the primary operator is still $unwind.
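As a rough illustration of what those first three stages compute, here is a plain JavaScript sketch that runs in memory without MongoDB. The sample documents, the numeric _id values and the function name findOverlappingIds are made up for the example, and dates are simplified to strings.

```javascript
// In-memory sketch of the $unwind + $group + $match idea (no MongoDB involved).
// The documents and their _id values 1..3 are hypothetical sample data.
const flights = [
  { _id: 1, carrier: "abc", flightNumber: 123, dates: ["2015-01-01", "2015-01-02", "2015-01-03"] },
  { _id: 2, carrier: "abc", flightNumber: 123, dates: ["2015-01-03", "2015-01-04", "2015-01-05"] },
  { _id: 3, carrier: "abc", flightNumber: 456, dates: ["2015-01-03"] }
];

function findOverlappingIds(docs) {
  // "$unwind" + "$group": one bucket per (carrier, flightNumber, date)
  const buckets = new Map();
  for (const doc of docs) {
    for (const date of doc.dates) {
      const key = JSON.stringify([doc.carrier, doc.flightNumber, date]);
      if (!buckets.has(key)) buckets.set(key, new Set());
      buckets.get(key).add(doc._id); // behaves like $addToSet
    }
  }
  // "$match": keep buckets shared by more than one document, then flatten the ids.
  // Note: this checks the number of distinct ids rather than a raw count.
  const overlapping = new Set();
  for (const ids of buckets.values()) {
    if (ids.size > 1) ids.forEach((id) => overlapping.add(id));
  }
  return [...overlapping].sort((a, b) => a - b);
}

const overlappingIds = findOverlappingIds(flights);
console.log(overlappingIds); // [ 1, 2 ]
```

Document 3 shares the date but not the flightNumber, so it is not reported.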
Alternatively, for a more report-like format:
db.flights.aggregate([
{ "$addFields": { "copy": "$$ROOT" } },
{ "$unwind": "$dates" },
{ "$group": {
"_id": {
"carrier": "$carrier",
"flightNumber": "$flightNumber",
"dates": "$dates"
},
"count": { "$sum": 1 },
"_docs": { "$addToSet": "$copy" }
}},
{ "$match": { "count": { "$gt": 1 } } },
{ "$group": {
"_id": {
"carrier": "$_id.carrier",
"flightNumber": "$_id.flightNumber",
},
"overlaps": {
"$push": {
"date": "$_id.dates",
"_docs": "$_docs"
}
}
}}
])
Which would report the overlapped dates within each group and tell you which documents contained the overlap:
{
"_id" : {
"carrier" : "abc",
"flightNumber" : 123.0
},
"overlaps" : [
{
"date" : ISODate("2015-01-03T00:00:00.000Z"),
"_docs" : [
{
"_id" : ObjectId("5977f9187dcd6a5f6a9b4b97"),
"carrier" : "abc",
"flightNumber" : 123.0,
"dates" : [
ISODate("2015-01-03T00:00:00.000Z"),
ISODate("2015-01-04T00:00:00.000Z"),
ISODate("2015-01-05T00:00:00.000Z")
]
},
{
"_id" : ObjectId("5977f9187dcd6a5f6a9b4b96"),
"carrier" : "abc",
"flightNumber" : 123.0,
"dates" : [
ISODate("2015-01-01T00:00:00.000Z"),
ISODate("2015-01-02T00:00:00.000Z"),
ISODate("2015-01-03T00:00:00.000Z")
]
}
]
}
]
}

Related

MongoDB aggregate nested grouping

I have an Asset collection which has data like:
{
"_id" : ObjectId("5bfb962ee2a301554915"),
"users" : [
"abc.abc#abc.com",
"abc.xyz#xyz.com"
],
"remote" : {
"source" : "dropbox",
"bytes" : 1234
}
},
{
"_id" : ObjectId("5bfb962ee2a301554915"),
"users" : [
"pqr.pqr#pqr.com"
],
"remote" : {
"source" : "google_drive",
"bytes" : 785
}
},
{
"_id" : ObjectId("5bfb962ee2a301554915"),
"users" : [
"abc.abc#abc.com",
"abc.xyz#xyz.com"
],
"remote" : {
"source" : "gmail",
"bytes" : 5647
}
}
What I am looking for is to group by users and get the total bytes per source, like:
{
"_id" : "abc.abc#abc.com",
"bytes" : {
"google_drive": 1458,
"dropbox" : 1254
}
}
I do not understand how to get the nested output using grouping.
I have tried the following query:
db.asset.aggregate(
[
{$unwind : '$users'},
{$group:{
_id:
{'username': "$users",
'source': "$remote.source",
'total': {$sum: "$remote.bytes"}} }
}
]
)
This way I am getting the result with the username repeated.
With MongoDB 3.6 and newer, you can use the $arrayToObject operator within a $mergeObjects expression and a $replaceRoot pipeline stage to get the desired result.
You would need to run the following aggregate pipeline though:
db.asset.aggregate([
{ "$unwind": "$users" },
{ "$group": {
"_id": {
"users": "$users",
"source": "$remote.source"
},
"totalBytes": { "$sum": "$remote.bytes" }
} },
{ "$group": {
"_id": "$_id.users",
"counts": {
"$push": {
"k": "$_id.source",
"v": "$totalBytes"
}
}
} },
{ "$replaceRoot": {
"newRoot": {
"$mergeObjects": [
{ "bytes": { "$arrayToObject": "$counts" } },
"$$ROOT"
]
}
} },
{ "$project": { "counts": 0 } }
])
which yields
/* 1 */
{
"bytes" : {
"gmail" : 5647.0,
"dropbox" : 1234.0
},
"_id" : "abc.abc#abc.com"
}
/* 2 */
{
"bytes" : {
"google_drive" : 785.0
},
"_id" : "pqr.pqr#pqr.com"
}
/* 3 */
{
"bytes" : {
"gmail" : 5647.0,
"dropbox" : 1234.0
},
"_id" : "abc.xyz#xyz.com"
}
using the above sample documents.
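To make the shape of that result easier to follow, the same two-level grouping can be sketched in plain JavaScript, in memory and without MongoDB. The function name bytesBySourcePerUser is hypothetical; the data mirrors the sample documents from the question.

```javascript
// In-memory sketch: unwind users, sum bytes per (user, source),
// then nest the per-source totals under each user.
const assets = [
  { users: ["abc.abc#abc.com", "abc.xyz#xyz.com"], remote: { source: "dropbox", bytes: 1234 } },
  { users: ["pqr.pqr#pqr.com"], remote: { source: "google_drive", bytes: 785 } },
  { users: ["abc.abc#abc.com", "abc.xyz#xyz.com"], remote: { source: "gmail", bytes: 5647 } }
];

function bytesBySourcePerUser(docs) {
  const totals = {};
  for (const doc of docs) {
    for (const user of doc.users) { // plays the role of "$unwind"
      if (!totals[user]) totals[user] = {};
      const perSource = totals[user];
      const source = doc.remote.source;
      // plays the role of "$sum" inside the (user, source) group
      perSource[source] = (perSource[source] || 0) + doc.remote.bytes;
    }
  }
  // one output document per user, with the nested "bytes" object
  return Object.entries(totals).map(([user, bytes]) => ({ _id: user, bytes }));
}

const perUser = bytesBySourcePerUser(assets);
console.log(JSON.stringify(perUser[0]));
// {"_id":"abc.abc#abc.com","bytes":{"dropbox":1234,"gmail":5647}}
```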
You have to use $group a couple of times here. First group by the users and the source, and total the number of bytes using $sum.
Then group by the users alone and $push the source and the bytes into an array:
db.collection.aggregate([
{ "$unwind": "$users" },
{ "$group": {
"_id": {
"users": "$users",
"source": "$remote.source"
},
"bytes": { "$sum": "$remote.bytes" }
}},
{ "$group": {
"_id": "$_id.users",
"data": {
"$push": {
"source": "$_id.source",
"bytes": "$bytes"
}
}
}}
])
And if you want to convert the source and the bytes into key/value format, replace the last $group stage with the two stages below:
{ "$group": {
"_id": "$_id.users",
"data": {
"$push": {
"k": "$_id.source",
"v": "$bytes"
}
}
}},
{ "$project": {
"_id": 0,
"username": "$_id",
"bytes": { "$arrayToObject": "$data" }
}}

Mongo Group and sum with two fields

I have documents like:
{
"from":"abc#sss.ddd",
"to" :"ssd#dff.dff",
"email": "Hi hello"
}
How can we calculate the combined count for "from and to" and "to and from", i.e. the number of communications between two people?
I am able to calculate the one-way sum, but I want the sum both ways.
db.test.aggregate([
{ $group: {
"_id":{ "from": "$from", "to":"$to"},
"count":{$sum:1}
}
},
{
"$sort" :{"count":-1}
}
])
Since you need to calculate the number of emails exchanged between two addresses, it would be fair to project a unified "between" field as follows:
db.a.aggregate([
{ $match: {
to: { $exists: true },
from: { $exists: true },
email: { $exists: true }
}},
{ $project: {
between: { $cond: {
if: { $lte: [ { $strcasecmp: [ "$to", "$from" ] }, 0 ] },
then: [ { $toLower: "$to" }, { $toLower: "$from" } ],
else: [ { $toLower: "$from" }, { $toLower: "$to" } ] }
}
}},
{ $group: {
"_id": "$between",
"count": { $sum: 1 }
}},
{ $sort :{ count: -1 } }
])
Unification logic should be quite clear from the example: it is an alphabetically sorted array of both emails. The $match and $toLower parts are optional if you trust your data.
Documentation for operators used in the example:
$match
$exists
$project
$cond
$lte
$strcasecmp
$toLower
$group
$sum
$sort
You basically need to consider the _id for grouping as an "array" of the possible "to" and "from" values, and then of course "sort" them, so that in every document the combination is always in the same order.
Just as a side note, I want to add that "typically" when I am dealing with messaging systems like this, the "to" and "from" sender/recipients are usually both arrays to begin with anyway, so that usually forms the base from which the different variations on this statement come.
First, the most optimal MongoDB 3.2 statement, for single addresses:
db.collection.aggregate([
// Join in array
{ "$project": {
"people": [ "$to", "$from" ],
}},
// Unwind array
{ "$unwind": "$people" },
// Sort array
{ "$sort": { "_id": 1, "people": 1 } },
// Group document
{ "$group": {
"_id": "$_id",
"people": { "$push": "$people" }
}},
// Group people and count
{ "$group": {
"_id": "$people",
"count": { "$sum": 1 }
}}
]);
That's the basics, and now the only variations are in the construction of the "people" array (stage 1 only above).
MongoDB 3.x and 2.6.x - Arrays
{ "$project": {
"people": { "$setUnion": [ "$to", "$from" ] }
}}
MongoDB 3.x and 2.6.x - Fields to array
{ "$project": {
"people": {
"$map": {
"input": ["A","B"],
"as": "el",
"in": {
"$cond": [
{ "$eq": [ "A", "$$el" ] },
"$to",
"$from"
]
}
}
}
}}
MongoDB 2.4.x and 2.2.x - from fields
{ "$project": {
"to": 1,
"from": 1,
"type": { "$const": [ "A", "B" ] }
}},
{ "$unwind": "$type" },
{ "$group": {
"_id": "$_id",
"people": {
"$addToSet": {
"$cond": [
{ "$eq": [ "$type", "A" ] },
"$to",
"$from"
]
}
}
}}
But in all cases:
Get all recipients into a distinct array.
Order the array to a consistent order
Group on the "always in the same order" list of recipients.
Follow that and you cannot go wrong.
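The three steps above can be sketched in plain JavaScript, in memory and without MongoDB. This is only an illustration: the function name countConversations and the third sample message are made up, and the sketch accepts "to" and "from" as either a single string or an array.

```javascript
// In-memory sketch of: collect participants, order them, group on the ordered list.
const messages = [
  { from: "abc#sss.ddd", to: "ssd#dff.dff" },
  { from: "ssd#dff.dff", to: "abc#sss.ddd" },
  { from: "abc#sss.ddd", to: "xyz#aaa.bbb" } // hypothetical extra message
];

function countConversations(docs) {
  const counts = new Map();
  for (const { from, to } of docs) {
    // 1. all participants in one distinct array (concat flattens arrays one level)
    const people = [...new Set([].concat(to, from))];
    // 2. put the array into a consistent order
    people.sort();
    // 3. group on the "always in the same order" list
    const key = JSON.stringify(people);
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return [...counts]
    .map(([key, count]) => ({ _id: JSON.parse(key), count }))
    .sort((a, b) => b.count - a.count);
}

const conversations = countConversations(messages);
console.log(JSON.stringify(conversations[0]));
// {"_id":["abc#sss.ddd","ssd#dff.dff"],"count":2}
```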

Mongodb aggregation, finding within an array of values

I have a schema that creates documents using the following structure:
{
"_id" : "2014-07-16:52TEST",
"date" : ISODate("2014-07-16T23:52:59.811Z"),
"name" : "TEST",
"values" : [
[
1405471921000,
0.737121
],
[
1405471922000,
0.737142
],
[
1405471923000,
0.737142
],
[
1405471924000,
0.737142
]
]
}
In the values, the first index is a timestamp. What I'm trying to do is query a specific timestamp to find the closest value ($gte).
I've tried the following aggregate query:
[
{ "$match": {
"values": {
"$elemMatch": { "0": {"$gte": 1405471923000} }
},
"name" : 'TEST'
}},
{ "$project" : {
"name" : 1,
"values" : 1
}},
{ "$unwind": "$values" },
{ "$match": { "values.0": { "$gte": 1405471923000 } } },
{ "$limit" : 1 },
{ "$sort": { "values.0": -1 } },
{ "$group": {
"_id": "$name",
"values": { "$push": "$values" },
}}
]
This seems to work, but it doesn't pull the closest value. It seems to pull anything greater or equal to and the sort doesn't seem to get applied, so it will pull a timestamp that is far in the future.
Any suggestions would be great!
Thank you
There are a couple of things wrong with the approach here, even though it is a fair effort. You are right that you need to $sort here, but the problem is that you cannot "sort" on an inner element within an array. In order to get a value that can be sorted you must $unwind the array first, as it otherwise will not sort on an array position.
You also certainly do not want $limit in the pipeline. You might be testing this against a single document, but "limit" will actually act on the entire set of documents in the pipeline. So if more than one document was matching your condition then they would be thrown away.
The key thing you want to do here is use $first in your $group stage, which is applied once you have sorted to get the "closest" element that you want.
db.collection.aggregate([
// Documents that have an array element matching the condition
{ "$match": {
"values": { "$elemMatch": { "0": {"$gte": 1405471923000 } } }
}},
// Unwind the top level array
{ "$unwind": "$values" },
// Filter just the elements that match the condition
{ "$match": { "values.0": { "$gte": 1405471923000 } } },
// Take a copy of the inner array
{ "$project": {
"date": 1,
"name": 1,
"values": 1,
"valCopy": "$values"
}},
// Unwind the inner array copy
{ "$unwind": "$valCopy" },
// Filter the inner elements
{ "$match": { "valCopy": { "$gte": 1405471923000 } }},
// Sort on the now "timestamp" values ascending for nearest
{ "$sort": { "valCopy": 1 } },
// Take the "first" values
{ "$group": {
"_id": "$_id",
"date": { "$first": "$date" },
"name": { "$first": "$name" },
"values": { "$first": "$values" },
}},
// Optionally push back to array to match the original structure
{ "$group": {
"_id": "$_id",
"date": { "$first": "$date" },
"name": { "$first": "$name" },
"values": { "$push": "$values" },
}}
])
And this produces your document with just the "nearest" timestamp value matching the original document form:
{
"_id" : "2014-07-16:52TEST",
"date" : ISODate("2014-07-16T23:52:59.811Z"),
"name" : "TEST",
"values" : [
[
1405471923000,
0.737142
]
]
}
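The selection logic itself (filter, sort ascending, take the first) can be checked against the sample document with a small in-memory JavaScript sketch. The function name nearestAtOrAfter is hypothetical and this is not a replacement for the pipeline, just a way to see what "nearest" means here.

```javascript
// In-memory sketch of "filter with $gte, sort ascending, take $first".
const doc = {
  _id: "2014-07-16:52TEST",
  name: "TEST",
  values: [
    [1405471921000, 0.737121],
    [1405471922000, 0.737142],
    [1405471923000, 0.737142],
    [1405471924000, 0.737142]
  ]
};

function nearestAtOrAfter(values, target) {
  const candidates = values
    .filter(([timestamp]) => timestamp >= target) // like the $match on the timestamp
    .sort((a, b) => a[0] - b[0]);                 // like the ascending $sort
  // like $first in the $group stage; null when nothing qualifies
  return candidates.length > 0 ? candidates[0] : null;
}

const nearest = nearestAtOrAfter(doc.values, 1405471923000);
console.log(nearest); // [ 1405471923000, 0.737142 ]
```

Because filter returns a new array, the sort does not reorder the original values array.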

Filtering a list of votes where more than x matches are found

I have the following vote data in a large collection:
{
"user_id" : ObjectId("53ac7bce4eaf6de4d5601c1a"),
"article_id" : ObjectId("53ab27504eaf6de4d5601be5"),
"score" : 5
},
{
"user_id" : ObjectId("53ac7bce4eaf6de4d5601c1b"),
"article_id" : ObjectId("53ab27504eaf6de4d5601be5"),
"score" : 3
},
{
"user_id" : ObjectId("53ac7bce4eaf6de4d5601c1c"),
"article_id" : ObjectId("53ab27504eaf6de4d5601be5"),
"score" : 3
},
...
I'm looking to filter this collection where more than 3 votes have been obtained for a single article (as above) and output as-is (excluding any vote entries on articles < 3 total votes).
Any help much appreciated. This collection can be huge so efficiency would be ideal.
Normally not something you do in a single operation, but you can do this if those really are your only fields and there are not too many matching documents.
db.collection.aggregate([
{ "$group": {
"_id": "$article_id",
"docs": {
"$push": {
"user_id": "$user_id",
"article_id": "$article_id",
"score": "$score"
}
},
"votes": { "$sum": 1 }
}},
{ "$match": { "votes": { "$gt": 3 } } },
{ "$unwind": "$docs" },
{ "$project": {
"user_id": "$docs.user_id",
"article_id": "$docs.article_id",
"score": "$docs.score"
}}
])
You can clean that up a little with MongoDB 2.6 and greater which provides a system variable in the pipeline for $$ROOT:
db.collection.aggregate([
{ "$group": {
"_id": "$article_id",
"docs": {
"$push": "$$ROOT"
},
"votes": { "$sum": 1 }
}},
{ "$match": { "votes": { "$gt": 3 } } },
{ "$unwind": "$docs" },
{ "$project": {
"user_id": "$docs.user_id",
"article_id": "$docs.article_id",
"score": "$docs.score"
}}
])
Otherwise you can accept that you are doing this in a few steps and process the list of "article_id" values returned with a "count" greater than three:
var ids = db.collection.aggregate([
{ "$group": {
"_id": "$article_id",
"votes": { "$sum": 1 }
}},
{ "$match": { "votes": { "$gt": 3 } } },
]).toArray().map(function(x){ return x._id });
db.collection.find({ "article_id": { "$in": ids } })
In versions earlier than 2.6, aggregate returned a single document by default rather than a cursor, so there you would read the array from its "result" key instead of calling .toArray().
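The two-step idea (count per article first, then fetch the votes for the qualifying articles) looks like this as an in-memory JavaScript sketch. The sample votes, the string ids and the function name votesForPopularArticles are made up for illustration.

```javascript
// In-memory sketch of the two-step approach: count, then filter with the ids.
// Hypothetical data: article "a1" has 4 votes, article "a2" has 2.
const votes = [
  { user_id: "u1", article_id: "a1", score: 5 },
  { user_id: "u2", article_id: "a1", score: 3 },
  { user_id: "u3", article_id: "a1", score: 3 },
  { user_id: "u4", article_id: "a1", score: 4 },
  { user_id: "u1", article_id: "a2", score: 2 },
  { user_id: "u2", article_id: "a2", score: 1 }
];

function votesForPopularArticles(docs, minVotes) {
  // Step 1: like $group with { "$sum": 1 } followed by $match on the count
  const counts = new Map();
  for (const vote of docs) {
    counts.set(vote.article_id, (counts.get(vote.article_id) || 0) + 1);
  }
  const ids = new Set(
    [...counts].filter(([, n]) => n > minVotes).map(([id]) => id)
  );
  // Step 2: like find({ "article_id": { "$in": ids } })
  return docs.filter((vote) => ids.has(vote.article_id));
}

const popularVotes = votesForPopularArticles(votes, 3);
console.log(popularVotes.length); // 4
```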

Perform union in mongoDB

I'm wondering how to perform a kind of union in an aggregate in MongoDB. Let's imagine the following document in a collection (the structure is for the sake of the example):
{
linkedIn: {
people : [
{
name : 'Fred'
},
{
name : 'Matilda'
}
]
},
twitter: {
people : [
{
name : 'Hanna'
},
{
name : 'Walter'
}
]
}
}
How can I write an aggregate that returns the union of the people in twitter and linkedIn?
{
{ name :'Fred', source : 'LinkedIn'},
{ name :'Matilda', source : 'LinkedIn'},
{ name :'Hanna', source : 'Twitter'},
{ name :'Walter', source : 'Twitter'},
}
There are a couple of approaches to this that you can use the aggregate method for:
db.collection.aggregate([
// Assign an array of constants to each document
{ "$project": {
"linkedIn": 1,
"twitter": 1,
"source": { "$cond": [1, ["linkedIn", "twitter"],0 ] }
}},
// Unwind the array
{ "$unwind": "$source" },
// Conditionally push the fields based on the matching constant
{ "$group": {
"_id": "$_id",
"data": { "$push": {
"$cond": [
{ "$eq": [ "$source", "linkedIn" ] },
{ "source": "$source", "people": "$linkedIn.people" },
{ "source": "$source", "people": "$twitter.people" }
]
}}
}},
// Unwind that array
{ "$unwind": "$data" },
// Unwind the underlying people array
{ "$unwind": "$data.people" },
// Project the required fields
{ "$project": {
"_id": 0,
"name": "$data.people.name",
"source": "$data.source"
}}
])
Or with a different approach using some operators from MongoDB 2.6:
db.people.aggregate([
// Unwind the "linkedIn" people
{ "$unwind": "$linkedIn.people" },
// Tag their source and re-group the array
{ "$group": {
"_id": "$_id",
"linkedIn": { "$push": {
"name": "$linkedIn.people.name",
"source": { "$literal": "linkedIn" }
}},
"twitter": { "$first": "$twitter" }
}},
// Unwind the "twitter" people
{ "$unwind": "$twitter.people" },
// Tag their source and re-group the array
{ "$group": {
"_id": "$_id",
"linkedIn": { "$first": "$linkedIn" },
"twitter": { "$push": {
"name": "$twitter.people.name",
"source": { "$literal": "twitter" }
}}
}},
// Merge the sets with "$setUnion"
{ "$project": {
"data": { "$setUnion": [ "$twitter", "$linkedIn" ] }
}},
// Unwind the union array
{ "$unwind": "$data" },
// Project the fields
{ "$project": {
"_id": 0,
"name": "$data.name",
"source": "$data.source"
}}
])
And of course if you simply did not care what the source was:
db.collection.aggregate([
// Union the two arrays
{ "$project": {
"data": { "$setUnion": [
"$linkedIn.people",
"$twitter.people"
]}
}},
// Unwind the union array
{ "$unwind": "$data" },
// Project the fields
{ "$project": {
"_id": 0,
"name": "$data.name",
}}
])
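For comparison, the source-tagged union that the first two pipelines build can be sketched in plain JavaScript, in memory and without MongoDB. The function name unionPeople is hypothetical, and the source tag simply reuses the field name, as the pipelines above do.

```javascript
// In-memory sketch of the tagged union: one output entry per person,
// labelled with the field it came from.
const profile = {
  linkedIn: { people: [{ name: "Fred" }, { name: "Matilda" }] },
  twitter: { people: [{ name: "Hanna" }, { name: "Walter" }] }
};

function unionPeople(doc, sources) {
  return sources.flatMap((source) => {
    // tolerate a missing source field or a missing people array
    const people = (doc[source] && doc[source].people) || [];
    return people.map((person) => ({ name: person.name, source }));
  });
}

const people = unionPeople(profile, ["linkedIn", "twitter"]);
console.log(people.map((p) => p.name + ":" + p.source).join(", "));
// Fred:linkedIn, Matilda:linkedIn, Hanna:twitter, Walter:twitter
```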
Not sure if using aggregate is recommended over a map-reduce for that kind of operation but the following is doing what you're asking for (dunno if $const can be used with no issue at all in the .aggregate() function) :
db.collection.aggregate([
{ $project: { linkedIn: '$linkedIn', twitter: '$twitter', idx: { $const: [0,1] }}},
{ $unwind: '$idx' },
{ $group: { _id : '$_id', data: { $push: { $cond:[ {$eq:['$idx', 0]}, { source: {$const: 'LinkedIn'}, people: '$linkedIn.people' } , { source: {$const: 'Twitter'}, people: '$twitter.people' } ] }}}},
{ $unwind: '$data'},
{ $unwind: '$data.people'},
{ $project: { _id: 0, name: '$data.people.name', source: '$data.source' }}
])