MongoDB - duplicate documents removal - mongodb

Context: I have a MongoDB database with some duplicated documents.
Problem: I want to remove all duplicated documents. (For each duplicated document, I only want to save one, which can be arbitrarily chosen.)
Minimal illustrative example:
The documents all have the following fields (there are also other fields, but those are of no relevance here):
{
"_id": {"$oid":"..."},
"name": "string",
"user": {"$oid":"..."},
}
Duplicated documents: A document is considered duplicated if there are two or more documents with the same "name" and "user" (i.e. the document id is of no relevance here).
How can I remove the duplicated documents?

EDIT:
Since MongoDB version 4.2, one option is to use $group and $merge in order to move all unique documents to a new collection:
db.collection.aggregate([
  {
    $group: {
      _id: {name: "$name", user: "$user"},
      // Keep the first document seen for each (name, user) pair
      doc: {$first: "$$ROOT"}
    }
  },
  {$replaceRoot: {newRoot: "$doc"}},
  {$merge: {into: "newCollection"}}
])
See how it works on the playground example
For older versions, you can do the same using $out.
Another option is to get a list of all documents to remove and remove them with another query:
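// Collect the _id of every duplicate except the one kept per (name, user) group
const result = db.collection.aggregate([
  {
    $group: {
      _id: {name: "$name", user: "$user"},
      doc: {$first: "$$ROOT"},
      remove: {$push: "$_id"}
    }
  },
  // Drop the kept document's _id from each group's list
  {
    $set: {
      remove: {
        $filter: {
          input: "$remove",
          cond: {$ne: ["$$this", "$doc._id"]}
        }
      }
    }
  },
  // Gather all per-group lists into one document and flatten them
  {$group: {_id: 0, remove: {$push: "$remove"}}},
  {
    $set: {
      _id: "$$REMOVE",
      remove: {
        $reduce: {
          input: "$remove",
          initialValue: [],
          in: {$concatArrays: ["$$value", "$$this"]}
        }
      }
    }
  }
]).toArray()
// aggregate() returns a cursor; the list is in the single result document
const removeList = result.length ? result[0].remove : []
db.collection.deleteMany({_id: {$in: removeList}})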
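To make the selection rule concrete, here is a minimal sketch in plain JavaScript (no MongoDB involved) of the logic the second option follows: keep the first document seen for each (name, user) pair and collect the ids of the rest. The function name `findDuplicateIds` and the sample data are hypothetical, and plain strings stand in for ObjectIds.

```javascript
// Plain JavaScript sketch (no MongoDB needed) of the remove-list logic:
// keep the first document seen for each (name, user) pair, collect the rest.
function findDuplicateIds(docs) {
  const seen = new Set(); // (name, user) keys already kept
  const remove = [];      // _ids of the extra copies
  for (const doc of docs) {
    const key = JSON.stringify([doc.name, String(doc.user)]);
    if (seen.has(key)) {
      remove.push(doc._id);
    } else {
      seen.add(key);
    }
  }
  return remove;
}

// Hypothetical sample data; string ids stand in for ObjectIds.
const sampleDocs = [
  { _id: "a1", name: "report", user: "u1" },
  { _id: "a2", name: "report", user: "u1" },
  { _id: "a3", name: "report", user: "u2" },
  { _id: "a4", name: "notes", user: "u1" },
  { _id: "a5", name: "report", user: "u1" },
];

const idsToRemove = findDuplicateIds(sampleDocs);
console.log(idsToRemove); // [ 'a2', 'a5' ]
```

Documents a2 and a5 repeat the (report, u1) pair first seen in a1, so only those two ids end up in the list.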

Related

concat array fields of all matching documents mongodb

I have below document structure in mongodb:
{name: String, location: [String]}
Example documents are:
{name: "XYZ", location: ["A","B","C","D"]},
{name: "XYZ", location: ["M","N"]},
{name: "ABC", location: ["P","Q","R","S"]}
I want to write a query that, when searching for a specific name, concatenates all location arrays of the resulting documents. For example, if I search for name XYZ, I should get:
{name:"XYZ",location:["A","B","C","D","M","N"]}
I guess this is possible using aggregation that might use $unwind operator, but I am unable to frame the query.
Please help me to frame the query.
Thanks!
$match - Filter document(s).
$group - Group by name. Add each location array to the location field via $push. This leaves location as a nested array (an array of arrays).
$project - Decorate the output document. The $reduce operator flattens the nested location array by combining its inner arrays into one via $concatArrays.
db.collection.aggregate([
{
$match: {
name: "XYZ"
}
},
{
$group: {
_id: "$name",
location: {
$push: "$location"
}
}
},
{
$project: {
_id: 0,
name: "$_id",
location: {
$reduce: {
input: "$location",
initialValue: [],
in: {
$concatArrays: [
"$$value",
"$$this"
]
}
}
}
}
}
])
Demo @ Mongo Playground
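The three stages can also be followed step by step in plain JavaScript. This is only a sketch of what the pipeline computes, not MongoDB code; the function name `concatLocations` is hypothetical, and unlike the pipeline it returns an empty location array (rather than no result) when nothing matches.

```javascript
// Hypothetical in-memory copies of the example documents.
const locationDocs = [
  { name: "XYZ", location: ["A", "B", "C", "D"] },
  { name: "XYZ", location: ["M", "N"] },
  { name: "ABC", location: ["P", "Q", "R", "S"] },
];

function concatLocations(allDocs, name) {
  // $match: keep only documents with the requested name.
  const matched = allDocs.filter((d) => d.name === name);
  // $group with $push: one array per matched document (a nested array).
  const nested = matched.map((d) => d.location);
  // $reduce with $concatArrays: flatten by concatenating onto an empty array.
  const location = nested.reduce((value, current) => value.concat(current), []);
  return { name, location };
}

const merged = concatLocations(locationDocs, "XYZ");
console.log(JSON.stringify(merged));
// {"name":"XYZ","location":["A","B","C","D","M","N"]}
```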
This should do the trick:
Match the required docs. Unwind the location array. Group by name, and project the necessary output.
db.collection.aggregate([
{
"$match": {
name: "XYZ"
}
},
{
"$unwind": "$location"
},
{
"$group": {
"_id": "$name",
"location": {
"$push": "$location"
}
}
},
{
"$project": {
name: "$_id",
location: 1,
_id: 0
}
}
])
Playground link.

MongoDB aggregations group by and count with a join

I have MongoDB model called candidates
appliedJobs: [
{
job: { type: Schema.ObjectId, ref: "JobPost" },
date:Date
},
],
A candidate may have multiple records in the appliedJobs array, each of which refers to a jobPost.
jobPost has the companyName property:
companyName: String,
What I want is to get the company names together with the number of job applications sent to each. For example:
|Company|Applications|
|--------|---------------|
|Facebook|10 applications|
|Google|5 applications|
I created this query
Candidate.aggregate([
{
$match: {
appliedJobs: { $exists: true },
},
},
{ $group: { _id: '$companyName', count: { $sum: 1 } } },
])
The problem here is I can't access the companyName like this. Because it's on another collection. How do I solve this?
In order to get data from another collection you can use $lookup (more efficient) or populate (mongoose - considered more organized), so one option is:
db.candidate.aggregate([
{$match: {appliedJobs: {$exists: true}}},
{$unwind: "$appliedJobs"},
{$lookup: {
from: "JobPost",
localField: "appliedJobs.job",
foreignField: "_id",
as: "appliedJobs"
}
},
{$project: {companyName: {$first: "$appliedJobs.companyName"}}},
{$group: {_id: {candidate: "$_id", company: "$companyName"}, count: {$sum: 1}}},
{$group: {
_id: "$_id.candidate",
appliedJobs: {$push: {k: "$_id.company", v: "$count"}}
}},
{$project: {appliedJobs: {$arrayToObject: "$appliedJobs"}}}
])
See how it works on the playground example
Simply $unwind the appliedJobs array. Perform $lookup to get the companyName. Then, $group to get count of applications by company.
db.Candidate.aggregate([
{
$match: {
appliedJobs: {
$exists: true
}
}
},
{
$unwind: "$appliedJobs"
},
{
  // Per the schema in the question, the job reference is stored in appliedJobs.job
  "$lookup": {
    "from": "JobPost",
    "localField": "appliedJobs.job",
    "foreignField": "_id",
    "as": "JobPostLookup"
  }
},
{
$unwind: "$JobPostLookup"
},
{
"$group": {
"_id": "$JobPostLookup.companyName",
"Applications": {
"$sum": 1
}
}
}
])
Here is the Mongo Playground for your reference.
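As a rough illustration of the $unwind, $lookup, $group sequence, the same count can be computed over in-memory arrays in plain JavaScript. This is a sketch only: the sample data and the function name `countApplicationsByCompany` are hypothetical, string ids stand in for ObjectIds, and it assumes the job reference lives in `appliedJobs.job` as in the question's schema.

```javascript
// Hypothetical sample data; string ids stand in for ObjectIds.
const jobPosts = [
  { _id: "j1", companyName: "Facebook" },
  { _id: "j2", companyName: "Google" },
  { _id: "j3", companyName: "Facebook" },
];
const candidates = [
  { _id: "c1", appliedJobs: [{ job: "j1" }, { job: "j2" }] },
  { _id: "c2", appliedJobs: [{ job: "j3" }] },
  { _id: "c3" }, // no appliedJobs field, dropped by the $match step
];

function countApplicationsByCompany(candidateDocs, jobPostDocs) {
  const companyByJobId = new Map(jobPostDocs.map((j) => [j._id, j.companyName]));
  const counts = {};
  for (const candidate of candidateDocs) {
    // $match + $unwind: visit each application of candidates that have any.
    for (const application of candidate.appliedJobs ?? []) {
      // $lookup: resolve the referenced job post; skip dangling references.
      const company = companyByJobId.get(application.job);
      if (company === undefined) continue;
      // $group with $sum: 1
      counts[company] = (counts[company] ?? 0) + 1;
    }
  }
  return counts;
}

const applicationCounts = countApplicationsByCompany(candidates, jobPosts);
console.log(applicationCounts); // { Facebook: 2, Google: 1 }
```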

Alternative solution to `$lookup` needed because the collection in the `from` field is sharded

Query with arbitrary number of filter conditions that come from querying the same collection
I am referring to the question above.
Here is an additional requirement:
The score table is sharded. Hence, it can no longer be in the $lookup stage.
Is there an alternative solution that also only makes one trip to the MongoDB API?
One way to do it without lookup is using $group, for example:
db.score.aggregate([
{
$group: {
_id: "$test_id",
highestScore: {$max: "$score"},
results: {
$push: {score: "$score", "tester_id": "$tester_id"}
},
ourTester: {
$push: {score: "$score", "tester_id": "$tester_id"}
}
}
},
{$match: {"ourTester.tester_id": userId}},
{
$project: {
ourTester: {
$filter: {
input: "$ourTester",
as: "item",
cond: {$eq: ["$$item.tester_id", userId]}
}
},
results: {
$filter: {
input: "$results",
as: "item",
cond: {$eq: ["$$item.score", "$highestScore"]}}
}
}
},
{
$project: {
ourTester: {"$arrayElemAt": ["$ourTester", 0]},
highest: {"$arrayElemAt": ["$results", 0]}
}
},
{
$match: {
$expr: {$gt: ["$highest.score", "$ourTester.score"]}
}
},
{
$project: {
score: "$highest.score",
tester_id: "$highest.tester_id",
test_id: "$_id"
}
}
])
As you can see here
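The intent of that pipeline, as far as it can be read from the stages, is to return the tests in which someone scored strictly higher than the given tester. A plain JavaScript sketch of the same logic over an in-memory array may make the steps easier to follow; the sample rows and the function name `testsWhereBeaten` are hypothetical.

```javascript
// Hypothetical score rows; field names follow the pipeline above.
const scores = [
  { test_id: "t1", tester_id: "alice", score: 70 },
  { test_id: "t1", tester_id: "bob", score: 90 },
  { test_id: "t2", tester_id: "alice", score: 95 },
  { test_id: "t2", tester_id: "bob", score: 80 },
  { test_id: "t3", tester_id: "bob", score: 60 },
];

function testsWhereBeaten(rows, userId) {
  // $group by test_id, collecting every row of the test.
  const byTest = new Map();
  for (const row of rows) {
    if (!byTest.has(row.test_id)) byTest.set(row.test_id, []);
    byTest.get(row.test_id).push(row);
  }
  const out = [];
  for (const [testId, results] of byTest) {
    // $match + $filter: the given tester's own row, if they took the test.
    const ours = results.find((r) => r.tester_id === userId);
    if (!ours) continue;
    // $max + $filter + $arrayElemAt: first row holding the highest score.
    const highest = results.reduce((best, r) => (r.score > best.score ? r : best));
    // Final $match: keep tests where someone scored strictly higher.
    if (highest.score > ours.score) {
      out.push({ test_id: testId, tester_id: highest.tester_id, score: highest.score });
    }
  }
  return out;
}

const beaten = testsWhereBeaten(scores, "alice");
console.log(JSON.stringify(beaten));
// [{"test_id":"t1","tester_id":"bob","score":90}]
```

Test t2 is excluded because alice holds the top score there, and t3 is excluded because alice did not take it.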

Get original document field as part of aggregate result

I want to get all of the document fields in my aggregate results, but as soon as I use $group they are gone. Using $project lets me re-add whatever fields I defined in $group, but I have had no luck getting the other fields:
var doc = {
_id: '123',
name: 'Bob',
comments: [],
attendances: [{
answer: 'yes'
}, {
answer: 'no'
}]
}
aggregate({
$unwind: '$attendances'
}, {
$match: {
"attendances.answer": { $ne:"no" }
}
}, {
$group: {
_id: '$_id',
attendances: { $sum: 1 },
comments: { $sum: { $size: { $ifNull: [ "$comments", [] ] }}}
}
}, {
$project: {
comments: 1,
}
})
This results in:
[{
_id: 5317b771b6504bd4a32395be,
comments: 12
},{
_id: 53349213cb41af00009a94d0,
comments: 0
}]
How do I get 'name' in there? I have tried adding to $group as:
name: '$name'
as well as in $project:
name: 1
But neither works.
You can't project fields that are removed during the $group operation.
Since you are grouping by the original document _id and there will only be one name value, you can preserve the name field using $first:
db.sample.aggregate(
{ $group: {
_id: '$_id',
comments: { $sum: { $size: { $ifNull: [ "$comments", [] ] }}},
name: { $first: "$name" }
}}
)
Example output would be:
{ "_id" : "123", "comments" : 0, "name" : "Bob" }
If you are grouping by criteria where there could be multiple values to preserve, you should either $push to an array in the $group or use $addToSet if you only want unique names.
Projecting all the fields
If you are using MongoDB 2.6 or newer and want to get all of the original document fields (not just name) without listing them individually, you can use the aggregation variable $$ROOT in place of a specific field name.
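A minimal sketch of what such a $group stage might look like, assuming the same sample collection as in the example above. The output field name `doc` and the variable name `keepWholeDocumentPipeline` are arbitrary choices made for this illustration.

```javascript
// A sketch, assuming the same "sample" collection as above.
// "doc" is a hypothetical field name chosen for this example.
const keepWholeDocumentPipeline = [
  {
    $group: {
      _id: "$_id",
      comments: { $sum: { $size: { $ifNull: ["$comments", []] } } },
      // $$ROOT refers to the whole input document, so $first keeps every field.
      doc: { $first: "$$ROOT" },
    },
  },
];

// In the mongo shell this would be run as:
//   db.sample.aggregate(keepWholeDocumentPipeline)
```

Each result then carries the original document nested under doc, so name is available as doc.name in later stages.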

MongoDB - Unwind array using aggregation and remove duplicates

I am unwinding an array using MongoDB aggregation framework and the array has duplicates and I need to ignore those duplicates while doing a grouping further.
How can I achieve that?
You can use $addToSet to do this:
db.users.aggregate([
{ $unwind: '$data' },
{ $group: { _id: '$_id', data: { $addToSet: '$data' } } }
]);
It's hard to give you a more specific answer without seeing your actual query.
You have to use $addToSet, but first you have to group by _id, because if you don't you'll get an element per item in the list.
Imagine a collection posts with documents like this:
{
body: "Lorem Ipsum...",
tags: ["stuff", "lorem", "lorem"],
author: "Enrique Coslado"
}
Imagine you want to calculate the most common tag per author. You'd make an aggregate query like this:
db.posts.aggregate([
{$project: {
author: "$author",
tags: "$tags",
post_id: "$_id"
}},
{$unwind: "$tags"},
{$group: {
_id: "$post_id",
author: {$first: "$author"},
tags: {$addToSet: "$tags"}
}},
{$unwind: "$tags"},
{$group: {
_id: {
author: "$author",
tags: "$tags"
},
count: {$sum: 1}
}}
])
That way you'll get documents like this:
{
_id: {
author: "Enrique Coslado",
tags: "lorem"
},
count: 1
}
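To see why the per-post $addToSet step matters, here is a plain JavaScript sketch of the same counting rule: each tag counts at most once per post, however often it is repeated inside that post. The second post and the function name `tagCountsPerAuthor` are hypothetical additions for the illustration.

```javascript
// Hypothetical posts; the first mirrors the example document above.
const posts = [
  { _id: 1, author: "Enrique Coslado", tags: ["stuff", "lorem", "lorem"] },
  { _id: 2, author: "Enrique Coslado", tags: ["lorem"] },
];

function tagCountsPerAuthor(allPosts) {
  const counts = new Map(); // key: [author, tag] as JSON, value: number of posts
  for (const p of allPosts) {
    // $unwind + $group with $addToSet: each tag counts once per post.
    for (const tag of new Set(p.tags)) {
      const key = JSON.stringify([p.author, tag]);
      counts.set(key, (counts.get(key) ?? 0) + 1);
    }
  }
  return counts;
}

const tagCounts = tagCountsPerAuthor(posts);
console.log(tagCounts.get(JSON.stringify(["Enrique Coslado", "lorem"]))); // 2
console.log(tagCounts.get(JSON.stringify(["Enrique Coslado", "stuff"]))); // 1
```

Without the deduplication step, "lorem" would be counted three times here instead of twice.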
Previous answers are correct, but the procedure of doing $unwind -> $group -> $unwind could be simplified.
You could use $addFields + $reduce to pass to the pipeline the filtered array which already contains unique entries and then $unwind only once.
Example document:
{
body: "Lorem Ipsum...",
tags: [{title: 'test1'}, {title: 'test2'}, {title: 'test1'}, ],
author: "First Last name"
}
Query:
db.posts.aggregate([
{$addFields: {
"uniqueTag": {
$reduce: {
input: "$tags",
initialValue: [],
in: {$setUnion: ["$$value", ["$$this.title"]]}
}
}
}},
{$unwind: "$uniqueTag"},
{$group: {
_id: {
author: "$author",
tags: "$uniqueTag"
},
count: {$sum: 1}
}}
])
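The $reduce with $setUnion step can be pictured as a fold that only appends a title it has not seen yet. The following plain JavaScript sketch shows that idea on the example document; the function name `uniqueTitles` is hypothetical, and note that it preserves first-seen order, whereas $setUnion makes no promise about element order.

```javascript
// Hypothetical post, shaped like the example document above.
const post = {
  body: "Lorem Ipsum...",
  tags: [{ title: "test1" }, { title: "test2" }, { title: "test1" }],
  author: "First Last name",
};

// $reduce with $setUnion: fold the tag titles into a duplicate-free array.
function uniqueTitles(tags) {
  return tags.reduce(
    (value, tag) => (value.includes(tag.title) ? value : [...value, tag.title]),
    []
  );
}

const uniqueTag = uniqueTitles(post.tags);
console.log(uniqueTag); // [ 'test1', 'test2' ]
```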