Mongo aggregate group by multiple values - mongodb

I have a Mongo query which I want to effectively use the $group in the same way as GROUP BY in SQL.
This isn't working for me unless I set the _id of the new document one of the group categories which doesn't work for me and also, I am not able to get the values I want which come from potentially THREE documents which I am merging together in Mongo.
In SQL, I would write something like to illustrate the grouping and select that I am using as the basis of my aggregation in Mongo:
SELECT entity_id, connection_id, cycle_id, objectOriginAPI,accountBalance
FROM raw_originBusinessData
WHERE objectStatus = 'UPROCESSED'
AND (objectOriginAPI = 'Profit & Loss'
OR objectOriginAPI = 'Balance Sheet'
OR objectOriginAPI = 'Bank Summary')
GROUP BY entity_id, connection_id, cycle_id;
I have paraphrased to simplify what my Mongo script is doing with embedded arrays.
db.getCollection('raw_originBusinessData').aggregate([
{ "$match": {
objectStatus : "UNPROCESSED"
, $or: [
{ objectOriginAPI : "Profit & Loss"}
,{objectOriginAPI : "Balance Sheet"}
,{objectOriginAPI : "Bank Summary"}
]}
},
// don't worry about this, this is all good
{ "$unwind": "$objectRawOriginData.Reports" }
,{ "$unwind": "$objectRawOriginData.Reports.Rows" }
,{ "$unwind": "$objectRawOriginData.Reports.Rows.Rows" },
// this is where I believe I'm having my problem
{ "$group": {"_id": "$entity_id"
// , "$connection_id"
// , "objectCycleID"
, "accountBalances": { "$push": "$objectRawOriginData.Reports.Rows.Rows.Cells.Value" }
}},
{$project: {objectClass: {$literal: "Source Data"}
, objectCategory: {$literal: "Application"}
, objectType: {$literal: "Account Balances"}
, objectOrigin: {$literal: "Xero"}
, entity_ID: "$_id"
, connection_ID: "$connection_ID"
, accountBalances: "$accountBalances"}
}
]
// ,{$out: "std_sourceBusinessData"}
)
So each of the documents I am combining into a single document have the same entity_id, connection_id and cycle_id which I want to put into the new document. I also want to ensure that the new document has it's own unique object_id.
Your help is very much appreciated - Mongo documentation doesn't cover anything about $group other than _id is mandatory but if I don't set the _id to something that I want to group by (in the above script it is set to entity_id) it doesn't group properly.

Put simply, the _id needs to be a "composite" value, and therefore comprised of three "sub-keys":
{ "$group":{
"_id": {
"entity_id": "$entity_id"
"connection_id": "$connection_id",
"objectCycleID": "$objectCycleID"
},
"accountBalances": {
"$push": "$objectRawOriginData.Reports.Rows.Rows.Cells.Value"
}
}},
{ "$project": {
"_id": 0,
"objectClass": { "$literal": "Source Data" },
"objectCategory": { "$literal": "Application"},
"objectType": { "$literal": "Account Balances"},
"objectOrigin": { "$literal": "Xero"},
"entity_ID": "$_id.entity_id",
"connection_ID": "$_id.connection_id",
"accountBalances": "$accountBalances"
}}
And then of course, referncing any of those values in the later $project requires you now prefix with $_id as that is now the parent key.
Just as with any MongoDB document, the _id can be anything that is a valid BSON Object in representation. So in this case, the combination means "group on all these field values".

Related

How to merge two collections on a field using aggregation pipelines in pymongo where the field of one collection has to be edited using regex first

I have two collections: a ratings collection and an episodes collection. The ratings collection has a field tconst which would be a random id value that is different for every row (in a form similar to "tt19821") and the episodes collection has a field episodeLink which would be in the form "/title/tt19821/".
I would like to join these two tables (joining the episodeLink field with the corresponding episodeLink field) using pymongo.
pipeline = [{"$set": {"tconst": {"$regexFind": {"input": "$episodeLink", "regex": "/title/(.*?)/"}}}}, {"$merge": {"into": "ratings", "on": "tconst", "whenMatched": "merge", "whenNotMatched": "discard"}}]
client['data']['episodes'].aggregate(pipeline)
I got the following error for this code: How to overcome 'Cannot find index to verify that join fields will be unique' error using aggregate pipelines in pymongo. I'm not sure how i can make a column that has been set in the pipeline into a unique index?
'Cannot find index to verify that join fields will be unique'
The error says you need to create an index on on value/property of the merge collection, which means create a unique index on tconst of collection ratings,
Execute below command in your database,
db.getCollection('ratings').createIndex({ tconst: 1 }, { unique: true })
There are a few issues in your query,
$regexFind will result:
tconst: {
"match" : "/title/tt19821/",
"idx" : 0,
"captures" : [
"tt19821"
]
}
To access a specific match you need to use the below operation:
$let to declare a variable and get first matched from tconst.captures using $first
$unset _id to avoid the error "failed to update the matching document did you attempt to modify the _id or the shard key?"
$merge as it is
pipeline = [
{
"$set": {
"tconst": {
"$let": {
"vars": {
"t": {
"$regexFind": {
"input": "$episodeLink",
"regex": "/title/(.*?)/"
}
}
},
"in": { "$first": "$$t.captures" }
}
}
}
},
{ "$unset": "_id" },
{
"$merge": {
"into": "ratings",
"on": "tconst",
"whenMatched": "merge",
"whenNotMatched": "discard"
}
}
]
Playground

Trying to fetch data from Nested MongoDB Database?

I am beginner in MongoDB and struck at a place I am trying to fetch data from nested array but is it taking so long time as data is around 50K data, also it is not much accurate data, below is schema structure please see once -
{
"_id": {
"$oid": "6001df3312ac8b33c9d26b86"
},
"City": "Los Angeles",
"State":"California",
"Details": [
{
"Name": "Shawn",
"age": "55",
"Gender": "Male",
"profession": " A science teacher with STEM",
"inDate": "2021-01-15 23:12:17",
"Cars": [
"BMW","Ford","Opel"
],
"language": "English"
},
{
"Name": "Nicole",
"age": "21",
"Gender": "Female",
"profession": "Law student",
"inDate": "2021-01-16 13:45:00",
"Cars": [
"Opel"
],
"language": "English"
}
],
"date": "2021-01-16"
}
Here I am trying to filter date with date and Details.Cars like
db.getCollection('news').find({"Details.Cars":"BMW","date":"2021-01-16"}
it is returning details of other persons too which do not have cars- BMW , Only trying to display details of person like - Shawn which have BMW or special array value and date too not - Nicole, rest should not appear but is it not happening.
Any help is appreciated. :)
A combination of $match on the top-level fields and $filter on the array elements will do what you seek.
db.foo.aggregate([
{$match: {"date":"2021-01-16"}}
,{$addFields: {"Details": {$filter: {
input: "$Details",
as: "zz",
cond: { $in: ['BMW','$$zz.Cars'] }
}}
}}
,{$match: {$expr: { $gt:[{$size:"$Details"},0] } }}
]);
Notes:
$unwind is overly expensive for what is needed here and it likely means "reassembling" the data shape later.
We use $addFields where the new field to add (Details) already exists. This effectively means "overwrite in place" and is a common idiom when filtering an array.
The second $match will eliminate docs where the date matches but not a single entry in Details.Cars is a BMW i.e. the array has been filtered down to zero length. Sometimes you want to know this info so if this is the case, do not add the final $match.
I recommend you look into using real dates i.e. ISODate instead of strings so that you can easily take advantage of MongoDB date math and date formatting functions.
Is a common mistake think that find({nested.array:value}) will return only the nested object but actually, this query return the whole object which has a nested object with desired value.
The query is returning the whole document where value BMW exists in the array Details.Cars. So, Nicole is returned too.
To solve this problem:
To get multiple elements that match the criteria you can do an aggregation stage using $unwind to separate the different objects into array and match by the criteria you want.
db.collection.aggregate([
{
"$match": { "Details.Cars": "BMW", "date": "2021-01-26" }
},
{
"$unwind": "$Details"
},
{
"$match": { "Details.Cars": "BMW" }
}
])
This query first match by the criteria to avoid $unwind over all collection.
Then $unwind to get every document and $match again to get only the documents you want.
Example here
To get only one element (for example, if you match by _id and its unique) you can use $elemMatch in this way:
db.collection.find({
"Details.Cars": "BMW",
"date": "2021-01-16"
},
{
"Details": {
"$elemMatch": {
"Cars": "BMW"
}
}
})
Example here
You can use $elemenMatch into query or projection stage. Docs here and here
Using $elemMatch into query the way is this:
db.collection.find({
"Details": {
"$elemMatch": {
"Cars": "BMW"
}
},
"date": "2021-01-16"
},
{
"Details.$": 1
})
Example here
The result is the same. In the second case you are using positional operator to return, as docs says:
The first element that matches the query condition on the array.
That is, the first element where "Cars": "BMW".
You can choose the way you want.

MongoDB query for finding number of people with conflicting schedules [duplicate]

I have startTime and endTime for all records like this:
{
startTime : 21345678
endTime : 31345678
}
I am trying to find number of all the conflicts. For example if there are two records and they overlap the number of conflict is 1. If there are three records and two of them overlap the conflict is 1. If there are three records and all three overlap the conflicts is 3 i.e [(X1, X2), (X1, X3), (X2, X3)]
As an algorithm I am thinking of sorting the data by start time and for each sorted record checking the end time and finding the records with start time less than the end time. This will be O(n2) time. A better approach will be using interval tree and inserting each record into the tree and finding the counts when overlaps occur. This will be O(nlgn) time.
I have not used mongoDB much so what kind of query can I use to achieve something like this?
As you correctly mention, there are different approaches with varying complexity inherent to their execution. This basically covers how they are done and which one you implement actually depends on which your data and use case is best suited to.
Current Range Match
MongoDB 3.6 $lookup
The most simple approach can be employed using the new syntax of the $lookup operator with MongoDB 3.6 that allows a pipeline to be given as the expression to "self join" to the same collection. This can basically query the collection again for any items where the starttime "or" endtime of the current document falls between the same values of any other document, not including the original of course:
db.getCollection('collection').aggregate([
{ "$lookup": {
"from": "collection",
"let": {
"_id": "$_id",
"starttime": "$starttime",
"endtime": "$endtime"
},
"pipeline": [
{ "$match": {
"$expr": {
"$and": [
{ "$ne": [ "$$_id", "$_id" },
{ "$or": [
{ "$and": [
{ "$gte": [ "$$starttime", "$starttime" ] },
{ "$lte": [ "$$starttime", "$endtime" ] }
]},
{ "$and": [
{ "$gte": [ "$$endtime", "$starttime" ] },
{ "$lte": [ "$$endtime", "$endtime" ] }
]}
]},
]
},
"as": "overlaps"
}},
{ "$count": "count" },
]
}},
{ "$match": { "overlaps.0": { "$exists": true } } }
])
The single $lookup performs the "join" on the same collection allowing you to keep the "current document" values for the "_id", "starttime" and "endtime" values respectively via the "let" option of the pipeline stage. These will be available as "local variables" using the $$ prefix in subsequent "pipeline" of the expression.
Within this "sub-pipeline" you use the $match pipeline stage and the $expr query operator, which allows you to evaluate aggregation framework logical expressions as part of the query condition. This allows the comparison between values as it selects new documents matching the conditions.
The conditions simply look for the "processed documents" where the "_id" field is not equal to the "current document", $and where either the "starttime"
$or "endtime" values of the "current document" falls between the same properties of the "processed document". Noting here that these as well as the respective $gte and $lte operators are the "aggregation comparison operators" and not the "query operator" form, as the returned result evaluated by $expr must be boolean in context. This is what the aggregation comparison operators actually do, and it's also the only way to pass in values for comparison.
Since we only want the "count" of the matches, the $count pipeline stage is used to do this. The result of the overall $lookup will be a "single element" array where there was a count, or an "empty array" where there was no match to the conditions.
An alternate case would be to "omit" the $count stage and simply allow the matching documents to return. This allows easy identification, but as an "array embedded within the document" you do need to be mindful of the number of "overlaps" that will be returned as whole documents and that this does not cause a breach of the BSON limit of 16MB. In most cases this should be fine, but for cases where you expect a large number of overlaps for a given document this can be a real case. So it's really something more to be aware of.
The $lookup pipeline stage in this context will "always" return an array in result, even if empty. The name of the output property "merging" into the existing document will be "overlaps" as specified in the "as" property to the $lookup stage.
Following the $lookup, we can then do a simple $match with a regular query expression employing the $exists test for the 0 index value of output array. Where there actually is some content in the array and therefore "overlaps" the condition will be true and the document returned, showing either the count or the documents "overlapping" as per your selection.
Other versions - Queries to "join"
The alternate case where your MongoDB lacks this support is to "join" manually by issuing the same query conditions outlined above for each document examined:
db.getCollection('collection').find().map( d => {
var overlaps = db.getCollection('collection').find({
"_id": { "$ne": d._id },
"$or": [
{ "starttime": { "$gte": d.starttime, "$lte": d.endtime } },
{ "endtime": { "$gte": d.starttime, "$lte": d.endtime } }
]
}).toArray();
return ( overlaps.length !== 0 )
? Object.assign(
d,
{
"overlaps": {
"count": overlaps.length,
"documents": overlaps
}
}
)
: null;
}).filter(e => e != null);
This is essentially the same logic except we actually need to go "back to the database" in order to issue the query to match the overlapping documents. This time it's the "query operators" used to find where the current document values fall between those of the processed document.
Because the results are already returned from the server, there is no BSON limit restriction on adding content to the output. You might have memory restrictions, but that's another issue. Simply put we return the array rather than cursor via .toArray() so we have the matching documents and can simply access the array length to obtain a count. If you don't actually need the documents, then using .count() instead of .find() is far more efficient since there is not the document fetching overhead.
The output is then simply merged with the existing document, where the other important distinction is that since theses are "multiple queries" there is no way of providing the condition that they must "match" something. So this leaves us with considering there will be results where the count ( or array length ) is 0 and all we can do at this time is return a null value which we can later .filter() from the result array. Other methods of iterating the cursor employ the same basic principle of "discarding" results where we do not want them. But nothing stops the query being run on the server and this filtering is "post processing" in some form or the other.
Reducing Complexity
So the above approaches work with the structure as described, but of course the overall complexity requires that for each document you must essentially examine every other document in the collection in order to look for overlaps. Therefore whilst using $lookup allows for some "efficiency" in reduction of transport and response overhead, it still suffers the same problem that you are still essentially comparing each document to everything.
A better solution "where you can make it fit" is to instead store a "hard value"* representative of the interval on each document. For instance we could "presume" that there are solid "booking" periods of one hour within a day for a total of 24 booking periods. This "could" be represented something like:
{ "_id": "A", "booking": [ 10, 11, 12 ] }
{ "_id": "B", "booking": [ 12, 13, 14 ] }
{ "_id": "C", "booking": [ 7, 8 ] }
{ "_id": "D", "booking": [ 9, 10, 11 ] }
With data organized like that where there was a set indicator for the interval the complexity is greatly reduced since it's really just a matter of "grouping" on the interval value from the array within the "booking" property:
db.booking.aggregate([
{ "$unwind": "$booking" },
{ "$group": { "_id": "$booking", "docs": { "$push": "$_id" } } },
{ "$match": { "docs.1": { "$exists": true } } }
])
And the output:
{ "_id" : 10, "docs" : [ "A", "D" ] }
{ "_id" : 11, "docs" : [ "A", "D" ] }
{ "_id" : 12, "docs" : [ "A", "B" ] }
That correctly identifies that for the 10 and 11 intervals both "A" and "D" contain the overlap, whilst "B" and "A" overlap on 12. Other intervals and documents matching are excluded via the same $exists test except this time on the 1 index ( or second array element being present ) in order to see that there was "more than one" document in the grouping, hence indicating an overlap.
This simply employs the $unwind aggregation pipeline stage to "deconstruct/denormalize" the array content so we can access the inner values for grouping. This is exactly what happens in the $group stage where the "key" provided is the booking interval id and the $push operator is used to "collect" data about the current document which was found in that group. The $match is as explained earlier.
This can even be expanded for alternate presentation:
db.booking.aggregate([
{ "$unwind": "$booking" },
{ "$group": { "_id": "$booking", "docs": { "$push": "$_id" } } },
{ "$match": { "docs.1": { "$exists": true } } },
{ "$unwind": "$docs" },
{ "$group": {
"_id": "$docs",
"intervals": { "$push": "$_id" }
}}
])
With output:
{ "_id" : "B", "intervals" : [ 12 ] }
{ "_id" : "D", "intervals" : [ 10, 11 ] }
{ "_id" : "A", "intervals" : [ 10, 11, 12 ] }
It's a simplified demonstration, but where the data you have would allow it for the sort of analysis required then this is the far more efficient approach. So if you can keep the "granularity" to be fixed to "set" intervals which can be commonly recorded on each document, then the analysis and reporting can use the latter approach to quickly and efficiently identify such overlaps.
Essentially, this is how you would implement what you basically mentioned as a "better" approach anyway, and the first being a "slight" improvement over what you originally theorized. See which one actually suits your situation, but this should explain the implementation and the differences.

Update Array Children Sorted Order

I have a collection containing objects with the following structure
{
"dep_id": "some_id",
"departament": "dep name",
"employees": [{
"name": "emp1",
"age": 31
},{
"name": "emp2",
"age": 35
}]
}
I would like to sort and save the array of employees for the object with id "some_id", by employees.age, descending. The best outcome would be to do this atomically using mongodb's query language. Is this possible?
If not, how can I rearrange the subdocuments without affecting the parent's other data or the data of the subdocuments? In case I have to download the data from the database and save back the sorted array of children, what would happen if something else performs an update to one of the children or children are added or removed in the meantime?
In the end, the data should be persisted to the database like this:
{
"dep_id": "some_id",
"departament": "dep name",
"employees": [{
"name": "emp2",
"age": 35
},{
"name": "emp1",
"age": 31
}]
}
The best way to do this is to actually apply the $sort modifier as you add items to the array. As you say in your comment "My actual objects have a "rank" and 'created_at'", which means that you really should have asked that in your question instead of writing a "contrived" case ( don't know why people do that ).
So for "sorting" by multiple properties, the following reference would adjust like this:
db.collection.update(
{ },
{ "$push": { "employees": { "$each": [], "$sort": { "rank": -1, "created_at": -1 } } } },
{ "multi": true }
)
But to update all the data you presently have "as is shown in the question", then you would sort on "age" with:
db.collection.update(
{ },
{ "$push": { "employees": { "$each": [], "$sort": { "age": -1 } } } },
{ "multi": true }
)
Which oddly uses $push to actually "modify" an array? Yes it's true, since the $each modifier says we are not actually adding anything new yet the $sort modifier is actually going to apply to the array in place and "re-order" it.
Of course this would then explain how "new" updates to the array should be written in order to apply that $sort and ensure that the "largest age" is always "first" in the array:
db.collection.update(
{ "dep_id": "some_id" },
{ "$push": {
"employees": {
"$each": [{ "name": "emp": 3, "age": 32 }],
"$sort": { "age": -1 }
}
}}
)
So what happens here is as you add the new entry to the array on update, the $sort modifier is applied and re-positions the new element between the two existing ones since that is where it would sort to.
This is a common pattern with MongoDB and is typically used in combination with the $slice modifier in order to keep arrays at a "maximum" length as new items are added, yet retain "ordered" results. And quite often "ranking" is the exact usage.
So overall, you can actually "update" your existing data and re-order it with "one simple atomic statement". No looping or collection renaming required. Furthermore, you now have a simple atomic method to "update" the data and maintain that order as you add new array items, or remove them.
In order to get what you want you can use the following query:
db.collection.aggregate({
$unwind: "$employees" // flatten employees array
}, {
$sort: {
"employees.name": -1 // sort all documents by employee name (descending)
}
}, {
$group: { // restore the previous structure
_id: "$_id",
"dep_id": {
$first: "$dep_id"
},
"departament": {
$first: "$departament"
},
"employees": {
$push: "$employees"
},
}
}, {
$out: "output" // write everything out to a separate collection
})
After this step you would want to drop your source table and rename the "output" collection to match your source table name.
This solution will, however, not deal with the concurrency issue. So you should remove write access from the collection first so nobody modifies it during the process and then restore it once you're done with the migration.
You could alternatively query all data first, then sort the employees array on the client side and then use either single update queries or - faster but more complicated - a bulk write operation with all the individual update calls in order to update the existing documents. Here, you could use the entire document that you've initially read as a filter for the update operation. So if an individual update does not modify any document you'd know straight away, that some other change must have modified the document you read before. Those cases you'd need to retry later (or straight away until the update does actually modify a document).

Can I use populate before aggregate in mongoose?

I have two models, one is user
userSchema = new Schema({
userID: String,
age: Number
});
and the other is the score recorded several times everyday for all users
ScoreSchema = new Schema({
userID: {type: String, ref: 'User'},
score: Number,
created_date = Date,
....
})
I would like to do some query/calculation on the score for some users meeting specific requirement, say I would like to calculate the average of score for all users greater than 20 day by day.
My thought is that firstly do the populate on Scores to populate user's ages and then do the aggregate after that.
Something like
Score.
populate('userID','age').
aggregate([
{$match: {'userID.age': {$gt: 20}}},
{$group: ...},
{$group: ...}
], function(err, data){});
Is it Ok to use populate before aggregate? Or I first find all the userID meeting the requirement and save them in a array and then use $in to match the score document?
No you cannot call .populate() before .aggregate(), and there is a very good reason why you cannot. But there are different approaches you can take.
The .populate() method works "client side" where the underlying code actually performs additional queries ( or more accurately an $in query ) to "lookup" the specified element(s) from the referenced collection.
In contrast .aggregate() is a "server side" operation, so you basically cannot manipulate content "client side", and then have that data available to the aggregation pipeline stages later. It all needs to be present in the collection you are operating on.
A better approach here is available with MongoDB 3.2 and later, via the $lookup aggregation pipeline operation. Also probably best to handle from the User collection in this case in order to narrow down the selection:
User.aggregate(
[
// Filter first
{ "$match": {
"age": { "$gt": 20 }
}},
// Then join
{ "$lookup": {
"from": "scores",
"localField": "userID",
"foriegnField": "userID",
"as": "score"
}},
// More stages
],
function(err,results) {
}
)
This is basically going to include a new field "score" within the User object as an "array" of items that matched on "lookup" to the other collection:
{
"userID": "abc",
"age": 21,
"score": [{
"userID": "abc",
"score": 42,
// other fields
}]
}
The result is always an array, as the general expected usage is a "left join" of a possible "one to many" relationship. If no result is matched then it is just an empty array.
To use the content, just work with an array in any way. For instance, you can use the $arrayElemAt operator in order to just get the single first element of the array in any future operations. And then you can just use the content like any normal embedded field:
{ "$project": {
"userID": 1,
"age": 1,
"score": { "$arrayElemAt": [ "$score", 0 ] }
}}
If you don't have MongoDB 3.2 available, then your other option to process a query limited by the relations of another collection is to first get the results from that collection and then use $in to filter on the second:
// Match the user collection
User.find({ "age": { "$gt": 20 } },function(err,users) {
// Get id list
userList = users.map(function(user) {
return user.userID;
});
Score.aggregate(
[
// use the id list to select items
{ "$match": {
"userId": { "$in": userList }
}},
// more stages
],
function(err,results) {
}
);
});
So by getting the list of valid users from the other collection to the client and then feeding that to the other collection in a query is the onyl way to get this to happen in earlier releases.