MongoDB document merge without a priori knowledge of fields

I would like to merge several documents. Most of the fields have the same values, but there may be one or two fields with different values, and these fields are not known beforehand. Ideally, I would like to merge all the documents, keeping the fields that are the same as-is, but creating an array of values only for those fields that vary.
For my first approach, I grouped on a field common to my documents and kept the first document; however, this discards the information that varies in the other fields.
group_documents = {
    "$group": {
        "_id": "$0020000E.Value",
        "doc": { "$first": "$$ROOT" }
    }
}
merge_documents = {
    "$replaceRoot": { "newRoot": "$doc" }
}
write_collection = { "$out": { "db": "database", "coll": "records_nd" } }
pipeline = [group_documents, merge_documents, write_collection]
objects = coll.aggregate(pipeline)
If the fields that have different values were known, I would have done something like this:
merge_sol1, merge_sol2, or merge_sol3.
The third solution is actually very close to my desired output and I could tweak it a bit, but these answers assume a priori knowledge of the fields to be merged.
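For reference, here is a minimal sketch of the kind of known-fields pipeline those answers build, where fieldA and fieldB are hypothetical stand-ins for the fields known to vary:
// Sketch only: fieldA and fieldB are hypothetical stand-ins
db.collection.aggregate([
  {
    "$group": {
      "_id": "$0020000E.Value",
      "fieldA": { "$addToSet": "$fieldA" },  // collect the distinct values of each known field
      "fieldB": { "$addToSet": "$fieldB" },
      "doc": { "$first": "$$ROOT" }          // keep the rest of the document from the first match
    }
  }
])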

You can first convert $$ROOT to an array of k-v tuples with $objectToArray. Then $group each field with $addToSet to collect its distinct values into an array. Next, check the size of the resulting array and conditionally pick the first item if the size is 1 (i.e. the value is the same in every document); otherwise, keep the array. Finally, convert back to the original document form with $arrayToObject.
db.collection.aggregate([
  {
    "$project": {
      "_id": "$key",
      "arr": { "$objectToArray": "$$ROOT" }
    }
  },
  { "$unwind": "$arr" },
  {
    "$match": {
      "arr.k": { "$nin": [ "key", "_id" ] }
    }
  },
  {
    "$group": {
      "_id": { "id": "$_id", "k": "$arr.k" },
      "v": { "$addToSet": "$arr.v" }
    }
  },
  {
    "$project": {
      "_id": "$_id.id",
      "arr": [
        {
          "k": "$_id.k",
          "v": {
            "$cond": {
              "if": { "$gt": [ { "$size": "$v" }, 1 ] },
              "then": "$v",
              "else": { "$first": "$v" }
            }
          }
        }
      ]
    }
  },
  {
    "$project": {
      "doc": { "$arrayToObject": "$arr" }
    }
  },
  {
    "$replaceRoot": {
      "newRoot": {
        "$mergeObjects": [ { "_id": "$_id" }, "$doc" ]
      }
    }
  }
])
Mongo Playground
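To illustrate: given two input documents that share key: 1, say { key: 1, a: 1, b: 2 } and { key: 1, a: 1, b: 3 }, this pipeline outputs { _id: 1, a: 1, b: [ 2, 3 ] } (the order inside b is unspecified, since $addToSet builds the array). Note that $first as an array expression operator requires MongoDB 4.4+; on older versions, { "$arrayElemAt": [ "$v", 0 ] } is an equivalent substitute.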

Related

MongoDB: How to merge all documents into a single document in an aggregation pipeline

I have the current aggregation output as follows:
[
  { "courseCount": 14 },
  { "registeredStudentsCount": 1 }
]
The array has two documents. I would like to combine all the documents into a single document, having all the fields, in MongoDB.
db.collection.aggregate([
  {
    $group: {
      _id: 0,
      merged: { $push: "$$ROOT" }
    }
  },
  {
    $replaceRoot: {
      newRoot: { "$mergeObjects": "$merged" }
    }
  }
])
Explained:
Group the output documents into one field with $push
Replace the document root with the merged objects
Playground
{
  $group: {
    "_id": null,
    data: { $push: "$$ROOT" }
  }
}
When you add this as the last pipeline stage, it will put all the docs under data, but here data will be an array of objects.
In your case it would be:
{ "data":[
{
"courseCount": 14
},
{
"registeredStudentsCount": 1
}
] }
Another approach would be:
db.collection.aggregate([
  {
    $group: {
      "_id": null,
      f: { $first: "$$ROOT" },
      l: { $last: "$$ROOT" }
    }
  },
  {
    "$project": {
      "output": {
        "courseCount": "$f.courseCount",
        "registeredStudentsCount": "$l.registeredStudentsCount"
      },
      "_id": 0
    }
  }
])
It's not as dynamic as the first one, but as you have only two docs, you can use this approach. It outputs:
[
  {
    "output": {
      "courseCount": 14,
      "registeredStudentsCount": 1
    }
  }
]
With an extra stage in the second approach:
{
  "$replaceRoot": { "newRoot": "$output" }
}
You will get the output as
[
  {
    "courseCount": 14,
    "registeredStudentsCount": 1
  }
]
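One caveat with the $mergeObjects approach: when the pushed documents share a field name, the value from the document that comes last in the array wins. With the two documents above there is no overlap, so the merge is lossless.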

MongoDB Aggregate: transform common objects in arrays

I'm stuck on an issue.
I need to transform:
[ { a: 1, b: 2, c: 3 }, { a: 5, b: 6, c: 7 } ]
into:
[ { a: [ 1, 5 ], b: [ 2, 6 ], c: [ 3, 7 ] } ]
i.e. just look for common keys and group them.
I'm not sure if I should use $project + $reduce or $group. Does someone have a tip?
To do this, we should convert the object to an array first, to be able to group by key.
{
  "$project": {
    "_id": 0  // first, eliminate _id and any other fields that we don't want to group
  }
},
{
  "$project": {
    "arr": { "$objectToArray": "$$ROOT" }
  }
},
Then we should unwind this array and group the keys.
{
  "$unwind": "$arr"
},
{
  "$group": {
    "_id": "$arr.k",
    "field": { "$push": "$arr.v" }
  }
}
Finally, we remap the information into the desired output.
{
  $replaceRoot: {
    newRoot: {
      $arrayToObject: [
        [ { k: "$_id", v: "$field" } ]
      ]
    }
  }
}
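Assembled into a single runnable pipeline (the collection name is assumed):
db.collection.aggregate([
  { "$project": { "_id": 0 } },
  { "$project": { "arr": { "$objectToArray": "$$ROOT" } } },
  { "$unwind": "$arr" },
  { "$group": { "_id": "$arr.k", "field": { "$push": "$arr.v" } } },
  { "$replaceRoot": { "newRoot": { "$arrayToObject": [ [ { "k": "$_id", "v": "$field" } ] ] } } }
])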

Find the distinct values in an array field, count them, and write them to another collection as an array of strings

Is it possible to get the distinct values of a field that is an array of strings and write the output of distinct to another collection, as shown below from src_coll to dst_coll?
src_coll
{"_id": ObjectId("61968a26c05149a23ad391f4"),"letters": ["aa", "ab", "ac", "ad", "aa", "af"] , "numbers":[11,12,13,14] }
{"_id": ObjectId("61968a26c05149a23ad391f5"),"letters": ["ab", "af", "ag", "ah", "ai", "aj"] , "numbers":[15,16,17,18] }
{"_id": ObjectId("61968a26c05149a23ad391f6"),"letters": ["ac", "ad", "ae", "af", "ag", "ah"] , "numbers":[16,17,18,19] }
{"_id": ObjectId("61968a26c05149a23ad391f7"),"letters": ["ae", "af", "ag", "ah", "ai", "aj"] , "numbers":[17,18,19,20] }
dst_coll
{"_id": ObjectId("61968a26c05149a23ad391f8"),"all_letters": ["aa", "ab", "ac", "ad", "ae", "af", "ag", "ah", "ai", "aj"] }
I have seen the answer using distinct:
db.src_coll.distinct('letters')
and using aggregate when the collection is huge (I was getting the error Executor error during distinct command :: caused by :: distinct too big, 16mb cap). I used:
db.src_coll.aggregate([ { $group: { _id: "$letters" } }, { $count: "letters_count" } ], { allowDiskUse: true })
I do not know how to write the output of distinct or aggregate as shown in dst_coll.
My collection contains 522 documents with a total size of 314 MB, but the letters field contains thousands of string values per document.
I appreciate your time to reply.
Thanks
Method I
I am assuming you are trying to create a single document containing all the distinct values of the letters field across all documents in src_coll. You can create a collection from aggregation output using either $out or $merge, but $out would replace your collection if it already exists.
Unwinding the array here could run out of memory, in which case you will have to use the { allowDiskUse: true } option (see the usage example after the pipeline below).
db.collection.aggregate([
  { $unwind: "$letters" },
  {
    $group: {
      _id: null,
      all_letters: { "$addToSet": "$letters" }
    }
  },
  { $merge: { into: "dst_coll" } }
])
Demo
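If the $group stage does exceed the memory limit, allowDiskUse is passed as an option in the second argument of aggregate. A minimal usage sketch, where pipeline is simply a placeholder for the stage array above:
db.collection.aggregate(pipeline, { allowDiskUse: true })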
Method II
Another way to do this, without $unwind, is to use $reduce with $setUnion, which is more efficient.
db.collection.aggregate([
  {
    $group: {
      _id: null,
      all_letters: { "$addToSet": "$letters" }
    }
  },
  {
    $project: {
      "all_letters": {
        $reduce: {
          input: "$all_letters",
          initialValue: [],
          in: { $setUnion: [ "$$value", "$$this" ] }
        }
      }
    }
  },
  { $merge: { into: "dst_coll" } }
])
Demo
Method III
Since we are going to create a single document from the whole collection using group, large collections are likely to run into memory issues. A way to avoid this is to break the grouping down into multiple stages, so that no single stage has to keep a lot of documents in memory.
db.collection.aggregate([
  { $unwind: "$letters" },
  {
    $bucketAuto: {
      groupBy: "$_id",
      buckets: 10000, // adjust the bucket count so that this stage outputs multiple documents, each covering a range of input documents
      output: {
        "all_letters": { "$addToSet": "$letters" }
      }
    }
  },
  {
    $bucketAuto: {
      groupBy: "$_id",
      buckets: 1000,
      output: {
        "all_letters": { "$addToSet": "$all_letters" }
      }
    }
  },
  {
    $project: {
      "all_letters": {
        $reduce: {
          input: "$all_letters",
          initialValue: [],
          in: { $setUnion: [ "$$value", "$$this" ] }
        }
      }
    }
  },
  {
    $group: {
      _id: null,
      all_letters: { "$addToSet": "$all_letters" }
    }
  },
  {
    $project: {
      "all_letters": {
        $reduce: {
          input: "$all_letters",
          initialValue: [],
          in: { $setUnion: [ "$$value", "$$this" ] }
        }
      }
    }
  },
  { $merge: { into: "dst_coll" } }
])
Refer to $bucketAuto and Aggregation Pipeline Limits.
Demo
Here's my solution for it, though I'm not sure it's the optimal way.
Algorithm:
Unwind the letters arrays
Group by letters, which yields only unique values
Group them again to get a single result
Use the $out stage to write the result to another collection
Aggregation pipeline:
db.collection.aggregate([
  {
    $project: {
      letters: 1,
      _id: 0
    }
  },
  { $unwind: "$letters" },
  { $group: { _id: "$letters" } },
  {
    $group: {
      _id: null,
      allLetters: { "$addToSet": "$_id" }
    }
  },
  { $out: "your-collection-name" }
])
Kindly see the docs for the $out stage yourself.
See the solution on mongodb playground: Query

Different documents returned for the same query

db.getCollection('rien').aggregate([
  {
    $match: {
      $and: [
        { "id": "10356" },
        {
          $or: [
            { "sys_date": { "$gte": new Date(ISODate().getTime() - 90 * 24 * 60 * 60 * 1000) } },
            { "war_date": { "$gte": new Date(ISODate().getTime() - 90 * 24 * 60 * 60 * 1000) } }
          ]
        }
      ]
    }
  },
  {
    $group: {
      "_id": "$b_id",
      count: { $sum: 1 },
      ads: {
        $addToSet: { "s": "$s", "ca": "$ca" }
      },
      files: {
        $addToSet: { "system": "$system", "hostname": "$hostname" }
      }
    }
  },
  { $sort: { "ads.s": -1 } },
  {
    $group: {
      "_id": "$b_id",
      total_count: { $sum: 1 },
      "data": { "$push": "$$ROOT" }
    }
  },
  {
    $project: {
      "_id": 0,
      "total_count": 1,
      results: { $slice: [ "$data", 0, 50 ] }
    }
  }
])
When I execute this pipeline 5 times, it returns a different set of documents each time. It is a 3-node cluster, no sharding enabled, with 10 million documents. The data is static.
Any ideas about the inconsistent results? I feel I am missing some fundamentals here.
I can see 2 problems:
"ads.s": -1 will not work because ads is an array field, and $sort will not apply to an array field this way.
$addToSet will not maintain sort order, even if the input is ordered by the previous stage:
the $addToSet documentation says "Order of the elements in the output array is unspecified",
the accumulators-group-addToSet page says "Order of the array elements is undefined",
and there are JIRA tickets SERVER-8512 and DOCS-1114.
You can use the $setUnion operator for an ascending order, and $reduce on the $setUnion result for a descending order.
As a workaround I am adding a solution below. I am not sure whether it is a good option, but you can use it if it does not affect the performance of your query.
I am only adding the updated stages here; these remain the same:
{ $match: {} }, // skipped
{ $group: {} }, // skipped
An optional $sort, if you want to order by the main document:
{ $sort: { _id: -1 } },
$setUnion treats arrays as sets: if an array contains duplicate entries, it ignores the duplicates, and it returns the array in ascending order based on the first field we specified in the $group stage (s); make sure every element in the array has s as its first field.
$reduce iterates over the array and concatenates the current element $$this before the accumulated value $$value, which reverses the array into descending order:
{
$addFields: {
ads: {
$reduce: {
input: { $setUnion: "$ads" },
initialValue: [],
in: { $concatArrays: [["$$this"], "$$value"] }
}
},
files: {
$reduce: {
input: { $setUnion: "$files" },
initialValue: [],
in: { $concatArrays: [["$$this"], "$$value"] }
}
}
}
},
These remain the same:
{ $group: {} }, // skipped
{ $project: {} } // skipped
Playground
The $setUnion documentation says "The order of the elements in the output array is unspecified.", yet in every test I ran it returned results perfectly in ascending order; why, I don't know.
I asked about this in the MongoDB Developer Forum (does-setunion-expression-operator-order-array-elements-in-ascending-order); they replied that it will not guarantee the order!
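As an aside: on MongoDB 5.2 or newer, the $sortArray expression guarantees an explicit order, avoiding the unspecified behavior of $setUnion entirely. A minimal sketch for the ads array from the pipeline above (descending by s):
{
  $addFields: {
    ads: { $sortArray: { input: "$ads", sortBy: { s: -1 } } } // explicit, documented ordering
  }
}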

Return original documents only from mongoose group/aggregation operation

I have a filter + group operation on a bunch of documents (books). The grouping is to return only the latest versions of books that share the same book_id (name). The code below works, but it's untidy since it returns redundant information:
return Book.aggregate([
  { $match: generateMLabQuery(rawQuery) },
  {
    $sort: { "published_date": -1 }
  },
  {
    $group: {
      _id: "$book_id",
      books: { $first: "$$ROOT" }
    }
  }
])
I end up with an array of objects that looks like this:
[{ _id: "aedrtgt6854earg864", books: { singleBookObject } }, {...}, {...}]
Essentially I only need the singleBookObject part, which is the original document (and what I'd be getting if I had done only the $match operation). Is there a way to get rid of the redundant _id and books parts within the aggregation pipeline?
You can use $replaceRoot
Book.aggregate([
{ "$match": generateMLabQuery(rawQuery) },
{ "$sort": { "published_date": -1 }},
{ "$group": {
"_id": "$book_id",
"books": { "$first": "$$ROOT" }
}},
{ "$replaceRoot": { "newRoot": "$books" } }
])
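After the $replaceRoot stage, each element of the result array is the original book document itself, i.e. [ { singleBookObject }, { ... } ], with the _id and books grouping wrapper removed.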