Merging / aggregating multiple documents in MongoDB

I have a little problem with my aggregations. I have a large collection of flat documents with the following schema:
{
  _id: ObjectId("5dc027d38da295b969eca568"),
  emp_no: 10001,
  salary: 60117,
  from_date: "1986-06-26",
  to_date: "1987-06-26"
}
It's all about annual employee salaries. The data was exported from a relational database, so there are multiple documents with the same value of "emp_no" whose other attributes vary. I need to group them by the value of the "emp_no" attribute so that the result looks like this:
//one document
{
  _id: ObjectId("5dc027d38da295b969eca568"),
  emp_no: 10001,
  salaries: [
    {
      salary: 60117,
      from_date: "1986-06-26",
      to_date: "1987-06-26"
    },
    {
      salary: 62102,
      from_date: "1987-06-26",
      to_date: "1988-06-25"
    },
    ...
  ]
}
//another document
{
  _id: ObjectId("5dc027d38da295b969eca579"),
  emp_no: 10002,
  salaries: [
    {
      salary: 65828,
      from_date: "1996-08-03",
      to_date: "1997-08-03"
    },
    ...
  ]
}
//and so on
Last but not least, there are almost 2.9 million documents, so aggregating by "emp_no" manually is out of the question. Is there a way to do this with Mongo queries alone? How do I do this kind of thing? Thanks in advance for any help.

The $group stage of the aggregation pipeline can be used to produce this kind of aggregate. Specify the attribute you want to group by as the value of the _id field in the $group stage.
How does the query below work for you?
db.collection.aggregate([
  {
    "$group": {
      "_id": "$emp_no",
      "salaries": {
        "$push": {
          "salary": "$salary",
          "from_date": "$from_date",
          "to_date": "$to_date"
        }
      },
      "emp_no": { "$first": "$emp_no" },
      "first_document_id": { "$first": "$_id" }
    }
  },
  {
    "$project": {
      "_id": "$first_document_id",
      "salaries": 1,
      "emp_no": 1
    }
  }
])
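Two hedged, practical notes for a collection of this size (~2.9M documents): $group holds its groups in memory and is subject to the aggregation pipeline's 100 MB per-stage limit, so you may need allowDiskUse; and if you would rather persist the reshaped data than stream it back, a $merge stage (MongoDB 4.2+) can write it to another collection. The output collection name salaries_by_employee below is a placeholder, not anything from the question:
db.collection.aggregate(
  [
    { "$group": {
        "_id": "$emp_no",
        "salaries": { "$push": { "salary": "$salary", "from_date": "$from_date", "to_date": "$to_date" } },
        "first_document_id": { "$first": "$_id" }
    }},
    // Optionally persist the result instead of returning a cursor (MongoDB 4.2+):
    { "$merge": { "into": "salaries_by_employee" } }
  ],
  { allowDiskUse: true } // let $group spill to temporary files past the 100 MB limit
)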

Related

How do I sort results based on a specific array item in MongoDB?

I have an array of documents that looks like this:
patient: {
  conditions: [
    {
      columnToSortBy: "value",
      type: "PRIMARY"
    },
    {
      columnToSortBy: "anotherValue",
      type: "SECONDARY"
    }
  ]
}
I need to be able to $sort by columnToSortBy, but using the item in the array where type is equal to PRIMARY. PRIMARY is not guaranteed to be the first item in the array every time.
How do I set my $sort up to accommodate this? Is there something akin to:
// I know this is invalid. It's for illustration purposes
$sort: "columnToSortBy", {$where: {type: "PRIMARY"}}
Is it possible to sort a field, but only when another field matches a query? I do not want the secondary conditions to affect the sort in any way. I am sorting on that one specific element alone.
You need to use the aggregation framework:
db.collection.aggregate([
  {
    $unwind: "$patient.conditions" // reshape the data
  },
  {
    "$sort": {
      "patient.conditions.columnToSortBy": -1 // sort it
    }
  },
  {
    $group: {
      "_id": "$_id",
      "conditions": { // regroup it
        "$push": "$patient.conditions"
      }
    }
  },
  {
    "$project": { // project it
      "_id": 1,
      "patient.conditions": "$conditions"
    }
  }
])
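Note that the pipeline above sorts the conditions inside each document. If the goal is instead to order the parent documents by the PRIMARY condition's columnToSortBy, leaving the arrays untouched, one hedged sketch (standard operators only, untested against your data) extracts that element with $filter and sorts on a temporary key:
db.collection.aggregate([
  { $addFields: {
      // Pull columnToSortBy out of the condition whose type is PRIMARY.
      primarySortKey: {
        $arrayElemAt: [
          { $map: {
              input: { $filter: {
                input: "$patient.conditions",
                as: "c",
                cond: { $eq: ["$$c.type", "PRIMARY"] }
              }},
              as: "p",
              in: "$$p.columnToSortBy"
          }},
          0
        ]
      }
  }},
  { $sort: { primarySortKey: -1 } },   // order parents by the PRIMARY element only
  { $project: { primarySortKey: 0 } }  // drop the helper field
])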

Efficiently find the most recent filtered document in MongoDB collection using datetime field

I have a large collection of documents with datetime fields in them, and I need to retrieve the most recent document for any given queried list.
Sample data:
[
  { "_id": "42.abc", "ts_utc": "2019-05-27T23:43:16.963Z" },
  { "_id": "42.def", "ts_utc": "2019-05-27T23:43:17.055Z" },
  { "_id": "69.abc", "ts_utc": "2019-05-27T23:43:17.147Z" },
  { "_id": "69.def", "ts_utc": "2019-05-27T23:44:02.427Z" }
]
Essentially, I need to get the most recent record for the "42" group as well as the most recent record for the "69" group. Using the sample data above, the desired result for the "42" group would be document "42.def".
My current solution is to query each group one at a time (looping with PyMongo), sort by the ts_utc field, and limit it to one, but this is really slow.
// Requires official MongoShell 3.6+
db = db.getSiblingDB("someDB");
db.getCollection("collectionName").find(
  { "_id": /^42\..*/ }
).sort(
  { "ts_utc": -1.0 }
).limit(1);
Is there a faster way to get the results I'm after?
Assuming all your documents have the format displayed above, you can split the _id into two parts (on the dot character) and use aggregation to find the max ts_utc per leading (numeric) part.
That way you do it in one shot instead of iterating over each group.
db.foo.aggregate([
{ $project: { id_parts : { $split: ["$_id", "."] }, ts_utc : 1 }},
{ $group: {"_id" : { $arrayElemAt: [ "$id_parts", 0 ] }, max : {$max: "$ts_utc"}}}
])
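For the sample data above, the result would look roughly like this (group order is not guaranteed). Note that this shape carries only the max timestamp per group, not the whole document; the $addFields pipeline below returns full documents:
{ "_id" : "69", "max" : "2019-05-27T23:44:02.427Z" }
{ "_id" : "42", "max" : "2019-05-27T23:43:17.055Z" }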
As @danh mentioned in the comments, the best approach is probably to add an auxiliary field to indicate the grouping. You can then index that field to boost performance (see the sketch after the pipeline below).
Here is an ad-hoc way to derive the field and get the latest result per grouping:
db.collection.aggregate([
  {
    "$addFields": {
      "group": {
        "$arrayElemAt": [
          { "$split": ["$_id", "."] },
          0
        ]
      }
    }
  },
  {
    $sort: { ts_utc: -1 }
  },
  {
    "$group": {
      "_id": "$group",
      "doc": { "$first": "$$ROOT" }
    }
  },
  {
    "$replaceRoot": { "newRoot": "$doc" }
  }
])
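To make the auxiliary field pay off across repeated queries, you could persist it and index it. A hedged sketch, assuming MongoDB 4.2+ (for pipeline-style updates) and that the field name group is otherwise unused:
// Derive and store the grouping prefix once.
db.collection.updateMany(
  {},
  [ { $set: { group: { $arrayElemAt: [ { $split: ["$_id", "."] }, 0 ] } } } ]
)
// A compound index lets "latest per group" queries walk the index directly.
db.collection.createIndex({ group: 1, ts_utc: -1 })
// The latest document for one group is then a cheap indexed find:
db.collection.find({ group: "42" }).sort({ ts_utc: -1 }).limit(1)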

How to group documents of a collection to a map with unique field values as key and count of documents as mapped value in mongodb?

I need a MongoDB query that returns a map whose keys are the unique values of a field (f) in the collection and whose values are the counts of documents having that value in field (f). How can I achieve this?
Example:
Document1: {"id":"1","name":"n1","city":"c1"}
Document2: {"id":"2","name":"n2","city":"c2"}
Document3: {"id":"3","name":"n1","city":"c3"}
Document4: {"id":"4","name":"n1","city":"c5"}
Document5: {"id":"5","name":"n2","city":"c2"}
Document6: {"id":"6","name":"n1","city":"c8"}
Document7: {"id":"7","name":"n3","city":"c9"}
Document8: {"id":"8","name":"n2","city":"c6"}
The query result should be something like this if the group-by field is "name":
{"n1":"4",
"n2":"3",
"n3":"1"}
It would be nice if the result were also sorted in descending order of count.
It's worth noting that using data points as field names (keys) is generally considered an anti-pattern and makes tooling difficult. Nonetheless, if you insist on having data points as field names, you can use this admittedly complicated aggregation to produce the output you want...
Aggregation
db.collection.aggregate([
  {
    $group: { _id: "$name", "count": { "$sum": 1 } }
  },
  {
    $sort: { "count": -1 }
  },
  {
    $group: { _id: null, "values": { "$push": { "name": "$_id", "count": "$count" } } }
  },
  {
    $project: {
      _id: 0,
      results: {
        $arrayToObject: {
          $map: {
            input: "$values",
            as: "pair",
            in: ["$$pair.name", "$$pair.count"]
          }
        }
      }
    }
  },
  {
    $replaceRoot: { newRoot: "$results" }
  }
])
Aggregation Explanation
This is a 5-stage aggregation consisting of the following:
$group - get the count of the data as required, by name.
$sort - sort the results with count descending.
$group - place the results into an array for the next stage.
$project - use $arrayToObject and $map to pivot the data so that a data point can be a field name.
$replaceRoot - make the results the top-level fields.
Sample Results
{ "n1" : 4, "n2" : 3, "n3" : 1 }
For whatever reason, you show desired results having count as a string, but my results show the count as an integer. I assume that is not an issue, and may actually be preferred.
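If an array of name/count pairs is acceptable instead of a map (sidestepping the anti-pattern noted above), the first two stages alone do the job; a minimal sketch:
db.collection.aggregate([
  { $group: { _id: "$name", count: { $sum: 1 } } },
  { $sort: { count: -1 } }
])
// => { _id: "n1", count: 4 }, { _id: "n2", count: 3 }, { _id: "n3", count: 1 }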

mongodb - $sort child documents only

When I do a find-all on a MongoDB collection, I want maintenanceList sorted from the oldest maintenanceDate to the newest.
Sorting by maintenanceDate should not affect the parents' order in the find-all query:
{
  "_id": "507f191e810c19729de860ea",
  "color": "black",
  "brand": "brandy",
  "prixVenteUnitaire": 200.5,
  "maintenanceList": [
    {
      "cost": 100.40,
      "maintenanceDate": "2017-02-07T00:00:00.000+0000"
    },
    {
      "cost": 4000.40,
      "maintenanceDate": "2019-08-07T00:00:00.000+0000"
    },
    {
      "cost": 300.80,
      "maintenanceDate": "2018-08-07T00:00:00.000+0000"
    }
  ]
}
Any idea how to do that? Thank you.
Operators like $project and $group keep fields in the same position they had in the previous pipeline stage, so the aggregation will not change the order of the fields in your result. Likewise, sorting maintenanceList by maintenanceDate inside the aggregation will not affect the parents' order in a find-all query.
So simply doing this should work, assuming the collection is named example:
db.example.aggregate([
  {
    "$unwind": "$maintenanceList"
  },
  {
    "$sort": {
      "_id": 1,
      "maintenanceList.maintenanceDate": 1
    }
  },
  {
    "$group": {
      "_id": "$_id",
      "color": { $first: "$color" },
      "brand": { $first: "$brand" },
      "prixVenteUnitaire": { $first: "$prixVenteUnitaire" },
      "maintenanceList": { "$push": "$maintenanceList" }
    }
  }
])
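On MongoDB 5.2+ there is a simpler option: the $sortArray operator sorts the embedded array in place, with no unwind/regroup round trip. A hedged sketch against the same example collection:
db.example.aggregate([
  { $set: {
      maintenanceList: {
        $sortArray: {
          input: "$maintenanceList",
          sortBy: { maintenanceDate: 1 } // oldest first
        }
      }
  }}
])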

aggregating metrics data in mongodb

I'm trying to pull report data out of a realtime metrics system inspired by the NYC MUG/SimpleReach schema, and maybe my mind is still stuck in SQL mode.
The data is stored in a document like so...
{
  "_id": ObjectId("5209683b915288435894cb8b"),
  "account_id": 922,
  "project_id": 22492,
  "stats": {
    "2009": {
      "04": {
        "17": {
          "10": {
            "sum": { "impressions": 11 }
          },
          "11": {
            "sum": { "impressions": 603 }
          }
        }
      }
    }
  }
}
and I've been trying different variations of the aggregation pipeline with no success.
db.metrics.aggregate({
  $match: {
    'project_id': 22492
  }
}, {
  $group: {
    _id: "$project_id",
    'impressions': {
      // This works, but doesn't sum up the data...
      $sum: '$stats.2009.04.17.10.sum.impressions'
      /* none of these work:
      $sum: ['$stats.2009.04.17.10.sum.impressions',
             '$stats.2009.04.17.11.sum.impressions']
      $sum: {'$stats.2009.04.17.10.sum.impressions',
             '$stats.2009.04.17.11.sum.impressions'}
      $sum: '$stats.2009.04.17.10.sum.impressions',
            '$stats.2009.04.17.11.sum.impressions'
      */
    }
  }
})
Any help would be appreciated.
(P.S. Does anyone have ideas on how to do date-range searches with this document schema?)
$group is designed to be applied to many documents, but here we only have one matched document.
Instead, $project could be used to sum up specific fields, like this:
db.metrics.aggregate([
  { $match: {
      'project_id': 22492
  }},
  { $project: {
      'impressions': {
        $add: [
          '$stats.2009.04.17.10.sum.impressions',
          '$stats.2009.04.17.11.sum.impressions'
        ]
      }
  }}
])
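One caveat: per MongoDB's documented behavior, if any operand of $add refers to a missing field, the whole expression evaluates to null, so a document missing one of the hourly buckets would report null impressions. A hedged variant of the same projection guards each path with $ifNull:
db.metrics.aggregate([
  { $match: { 'project_id': 22492 } },
  { $project: {
      'impressions': {
        $add: [
          // Treat an absent bucket as 0 instead of letting $add return null.
          { $ifNull: ['$stats.2009.04.17.10.sum.impressions', 0] },
          { $ifNull: ['$stats.2009.04.17.11.sum.impressions', 0] }
        ]
      }
  }}
])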
I don't think there is an elegant way to do date-range searches with this schema, because MongoDB operators/predicates are designed to be applied to values rather than to keys in a document. If I understand correctly, the most interesting point in the slides you mentioned is to cache/pre-aggregate the metrics when updating. That is a good idea, but it could be implemented with a different schema. For example, storing date and time as indexed values (which MongoDB supports) would be a good choice for range searches. The aggregation framework also supports date operations, giving more flexibility. A hedged sketch of such an alternative follows.
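Purely as an illustration (the collection name metrics_ts and the field names ts/impressions are assumptions, not part of the original schema), the pre-aggregated buckets could be stored as one document per hour with a real Date value, which makes a range query a plain indexed $match:
// One document per project per hour bucket; ts is a real Date, so it can be indexed and range-scanned.
db.metrics_ts.insertOne({
  account_id: 922,
  project_id: 22492,
  ts: ISODate("2009-04-17T10:00:00Z"),
  impressions: 11
})
db.metrics_ts.createIndex({ project_id: 1, ts: 1 })
// A date-range report is then an ordinary $match + $group:
db.metrics_ts.aggregate([
  { $match: {
      project_id: 22492,
      ts: { $gte: ISODate("2009-04-17T00:00:00Z"), $lt: ISODate("2009-04-18T00:00:00Z") }
  }},
  { $group: { _id: "$project_id", impressions: { $sum: "$impressions" } } }
])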