Remove duplicate in MongoDB - mongodb

I have a collection with a field called "contact_id".
In my collection I have duplicate documents with this key.
How can I remove the duplicates so that only one document remains for each key?
I already tried:
db.PersonDuplicate.ensureIndex({"contact_id": 1}, {unique: true, dropDups: true})
But it did not work, because the dropDups option is no longer available in MongoDB 3.x.
I'm using 3.2.

Yes, dropDups is gone for good. But you can definitely achieve your goal with a little bit of effort.
You first need to find all duplicate documents and then remove all of them except the first.
db.dups.aggregate([{$group:{_id:"$contact_id", dups:{$push:"$_id"}, count: {$sum: 1}}},
{$match:{count: {$gt: 1}}}
]).forEach(function(doc){
doc.dups.shift();
db.dups.remove({_id : {$in: doc.dups}});
});
As you can see, doc.dups.shift() removes the first _id from the array, and the remove() call then deletes all documents whose _ids remain in the dups array.
The script above will remove all duplicate documents.
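To see what the loop does without touching a database, here is a minimal in-memory sketch in plain JavaScript. The helper name findDuplicateIds and the sample documents are made up for illustration. It only mirrors the $group / shift() logic above and does not call MongoDB.

```javascript
// Hypothetical helper: returns the _ids that the script above would remove.
function findDuplicateIds(docs, key) {
  const groups = new Map();
  for (const doc of docs) {
    // Equivalent of $group with dups: {$push: "$_id"}
    if (!groups.has(doc[key])) groups.set(doc[key], []);
    groups.get(doc[key]).push(doc._id);
  }
  const toRemove = [];
  for (const ids of groups.values()) {
    if (ids.length > 1) {    // $match: {count: {$gt: 1}}
      ids.shift();           // keep the first _id of the group
      toRemove.push(...ids); // everything left is a duplicate
    }
  }
  return toRemove;
}

// Made-up sample data
const sample = [
  { _id: 1, contact_id: "a" },
  { _id: 2, contact_id: "b" },
  { _id: 3, contact_id: "a" },
  { _id: 4, contact_id: "a" },
];
console.log(findDuplicateIds(sample, "contact_id")); // [ 3, 4 ]
```

Running the same check on a copy of your data first is a cheap way to see which _ids would be deleted before you run remove().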

This is a good pattern for MongoDB 3+ that also ensures that you will not run out of memory, which can happen with really big collections. You can save this to a dedup.js file, customize it, and run it against your desired database with: mongo localhost:27017/YOURDB dedup.js
var duplicates = [];
db.runCommand(
{aggregate: "YOURCOLLECTION",
pipeline: [
{ $group: { _id: { DUPEFIELD: "$DUPEFIELD"}, dups: { "$addToSet": "$_id" }, count: { "$sum": 1 } }},
{ $match: { count: { "$gt": 1 }}}
],
allowDiskUse: true }
)
.result
.forEach(function(doc) {
doc.dups.shift();
doc.dups.forEach(function(dupId){ duplicates.push(dupId); })
})
printjson(duplicates); //optional print the list of duplicates to be removed
db.YOURCOLLECTION.remove({_id:{$in:duplicates}});

We can also use an $out stage to remove duplicates from a collection by replacing the content of the collection with only one occurrence per duplicate.
For instance, to only keep one element per value of x:
// > db.collection.find()
// { "x" : "a", "y" : 27 }
// { "x" : "a", "y" : 4 }
// { "x" : "b", "y" : 12 }
db.collection.aggregate([
{ $group: { _id: "$x", onlyOne: { $first: "$$ROOT" } } },
{ $replaceWith: "$onlyOne" }, // prior to 4.2: { $replaceRoot: { newRoot: "$onlyOne" } }
{ $out: "collection" }
])
// > db.collection.find()
// { "x" : "a", "y" : 27 }
// { "x" : "b", "y" : 12 }
This:
$group groups documents by the field that defines a duplicate (here x) and, for each group, keeps only one document (the $first one encountered) by assigning it the value $$ROOT, which is the document itself. At the end of this stage, we have something like:
{ "_id" : "a", "onlyOne" : { "x" : "a", "y" : 27 } }
{ "_id" : "b", "onlyOne" : { "x" : "b", "y" : 12 } }
$replaceWith replaces all existing fields in the input document with the content of the onlyOne field we created in the $group stage, restoring the original document format. At the end of this stage, we have something like:
{ "x" : "a", "y" : 27 }
{ "x" : "b", "y" : 12 }
$replaceWith is only available starting in Mongo 4.2. With prior versions, we can use $replaceRoot instead:
{ $replaceRoot: { newRoot: "$onlyOne" } }
$out inserts the result of the aggregation pipeline in the same collection. Note that $out conveniently replaces the content of the specified collection, making this solution possible.
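As a rough in-memory analogue of the $group and $replaceWith stages (plain JavaScript, no MongoDB involved), the following sketch keeps the first document seen for each key. The function name keepFirstPerKey is hypothetical, and the sample data is the example from above.

```javascript
// In-memory analogue of $group + $first + $replaceWith (illustration only).
function keepFirstPerKey(docs, key) {
  const firstSeen = new Map();
  for (const doc of docs) {
    // $first: "$$ROOT" keeps the first whole document seen for each key
    if (!firstSeen.has(doc[key])) firstSeen.set(doc[key], doc);
  }
  // $replaceWith: "$onlyOne" promotes the kept document back to the top level
  return Array.from(firstSeen.values());
}

const before = [
  { x: "a", y: 27 },
  { x: "a", y: 4 },
  { x: "b", y: 12 },
];
const after = keepFirstPerKey(before, "x");
console.log(after);
```

Which document counts as "first" depends on the order in which documents reach $group, so if you care which duplicate survives, put a $sort stage in front of it.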

Maybe it would be worth trying to create a temporary collection, create a unique index on it, copy the data over from the source, and as a last step swap the names?
Another idea I had is to collect the duplicated keys into an array (using aggregation) and then loop through it, calling the remove() method with the justOne parameter set to true or 1.
var itemsToDelete = db.PersonDuplicate.aggregate([
{$group: { _id:"$contact_id", count:{$sum:1}}},
{$match: {count: {$gt:1}}},
{$group: { _id:1, ids:{$addToSet:"$_id"}}}
])
and then loop through the ids array.
Does this make sense to you?

I have used this approach:
Take a mongodump of the particular collection.
Clear that collection.
Add a unique key index.
Restore the dump using mongorestore.
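The effect of these steps can be sketched in plain JavaScript. This is only a simulation: UniqueCollection is a made-up stand-in for a collection with a unique index, and it assumes the restore carries on after a duplicate-key error instead of aborting, which is worth checking for your mongorestore version and options.

```javascript
// Hypothetical stand-in for a collection with a unique index on one field.
class UniqueCollection {
  constructor(key) {
    this.key = key;
    this.byKey = new Map();
  }
  // Returns false instead of throwing, mimicking a rejected duplicate-key insert.
  insert(doc) {
    if (this.byKey.has(doc[this.key])) return false;
    this.byKey.set(doc[this.key], doc);
    return true;
  }
  all() {
    return Array.from(this.byKey.values());
  }
}

// Made-up dump contents
const dump = [
  { _id: 1, contact_id: "a" },
  { _id: 2, contact_id: "b" },
  { _id: 3, contact_id: "a" },
];

const restored = new UniqueCollection("contact_id");
const rejected = [];
for (const doc of dump) {
  // Documents whose key is already present are skipped, not restored.
  if (!restored.insert(doc)) rejected.push(doc);
}
console.log(restored.all().length, rejected.length); // 2 1
```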

Related

MongoDB sort by value in embedded document array

I have a MongoDB collection of documents formatted as shown below:
{
"_id" : ...,
"username" : "foo",
"challengeDetails" : [
{
"ID" : ...,
"pb" : 30081,
},
{
"ID" : ...,
"pb" : 23995,
},
...
]
}
How can I write a find query for documents that have a challengeDetails entry with a matching ID and sort them by the corresponding pb?
I have tried the following (this is using the NodeJS driver, which is why the projection syntax is weird):
const result = await collection
.find(
{ "challengeDetails.ID": challengeObjectID},
{
projection: {"challengeDetails.$": 1},
sort: {"challengeDetails.0.pb": 1}
}
)
This returns the correct records (documents with challengeDetails for only the matching ID) but they're not sorted.
I think this doesn't work because as the docs say:
When the find() method includes a sort(), the find() method applies the sort() to order the matching documents before it applies the positional $ projection operator.
But they don't explain how to sort after projecting. How would I write a query to do this? (I have a feeling aggregation may be required but am not familiar enough with MongoDB to write that myself)
You need to use aggregation to sort an array:
$unwind to deconstruct the array
$match to match the value
$sort for sorting
$group to reconstruct the array
Here is the code
db.collection.aggregate([
{ "$unwind": "$challengeDetails" },
{ "$match": { "challengeDetails.ID": 2 } },
{ "$sort": { "challengeDetails.pb": 1 } },
{
"$group": {
"_id": "$_id",
"username": { "$first": "$username" },
"challengeDetails": { $push: "$challengeDetails" }
}
},
// $group does not guarantee the order of its output, so sort the regrouped documents again
{ "$sort": { "challengeDetails.pb": 1 } }
])
Working Mongo playground
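If it helps to see the intended result outside MongoDB, here is a small in-memory sketch in plain JavaScript. The function name sortByMatchingPb and the sample players are invented for illustration, and it assumes each document has at most one entry with the requested ID.

```javascript
// Illustration only: sort documents by the pb of the entry matching targetId.
function sortByMatchingPb(docs, targetId) {
  return docs
    .flatMap((doc) =>
      // $unwind + $match: one row per matching challengeDetails entry
      doc.challengeDetails
        .filter((entry) => entry.ID === targetId)
        .map((entry) => ({
          _id: doc._id,
          username: doc.username,
          challengeDetails: [entry],
        }))
    )
    // $sort on the matched entry's pb, ascending
    .sort((a, b) => a.challengeDetails[0].pb - b.challengeDetails[0].pb);
}

// Made-up sample data
const players = [
  { _id: 1, username: "foo", challengeDetails: [{ ID: 2, pb: 30081 }, { ID: 3, pb: 23995 }] },
  { _id: 2, username: "bar", challengeDetails: [{ ID: 2, pb: 1200 }] },
  { _id: 3, username: "baz", challengeDetails: [{ ID: 3, pb: 500 }] },
];
console.log(sortByMatchingPb(players, 2).map((p) => p.username)); // [ 'bar', 'foo' ]
```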

Aggregate on array of embedded documents

I have a MongoDB collection with multiple documents. Each document has an array with multiple subdocuments (or embedded documents, I guess?). Each of these subdocuments is in this format:
{
"name": string,
"count": integer
}
Now I want to aggregate these subdocuments to find:
1. The top X counts and their names.
2. Same as 1., but the names have to match a regex before sorting and limiting.
I have already tried the following for 1. It does return the top X, but unordered, so I'd have to order them again, which seems somewhat inefficient.
[{
$match: {
_id: id
}
}, {
$unwind: {
path: "$array"
}
}, {
$sort: {
'count': -1
}
}, {
$limit: x
}]
Since I'm rather new to MongoDB, this is pretty confusing for me. Happy for any help. Thanks in advance.
The sort has to include the array name in order to avoid an additional sort later on.
Given the following document to work with:
{
students: [{
count: 4,
name: "Ann"
}, {
count: 7,
name: "Brad"
}, {
count: 6,
name: "Beth"
}, {
count: 8,
name: "Catherine"
}]
}
As an example, the following aggregation query will match any name containing the letter "h" or "e" (the character class /[he]/ matches either one). This needs to happen after the "$unwind" step in order to only keep the ones you need.
db.tests.aggregate([
{$match: {
_id: ObjectId("5c1b191b251d9663f4e3ce65")
}},
{$unwind: {
path: "$students"
}},
{$match: {
"students.name": /[he]/
}},
{$sort: {
"students.count": -1
}},
{$limit: 2}
])
This is the output given the above mentioned input:
{ "_id" : ObjectId("5c1b191b251d9663f4e3ce65"), "students" : { "count" : 8, "name" : "Catherine" } }
{ "_id" : ObjectId("5c1b191b251d9663f4e3ce65"), "students" : { "count" : 6, "name" : "Beth" } }
Both names contain the letters "h" and "e", and the output is sorted from high to low.
When setting the limit to 1, the output is limited to:
{ "_id" : ObjectId("5c1b191b251d9663f4e3ce65"), "students" : { "count" : 8, "name" : "Catherine" } }
In this case only the highest count has been kept after having matched the names.
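For comparison, the same filter, sort and limit steps can be written against a plain JavaScript object. This is just an in-memory sketch to show the order of operations. The helper name topStudents is made up, and the data is the sample document from above.

```javascript
// Illustration only: top-N array entries by count, optionally filtered by name.
function topStudents(doc, limit, namePattern) {
  return doc.students
    .filter((s) => !namePattern || namePattern.test(s.name)) // the $match after $unwind
    .sort((a, b) => b.count - a.count)                       // $sort: {"students.count": -1}
    .slice(0, limit);                                        // $limit
}

const classDoc = {
  students: [
    { count: 4, name: "Ann" },
    { count: 7, name: "Brad" },
    { count: 6, name: "Beth" },
    { count: 8, name: "Catherine" },
  ],
};

console.log(topStudents(classDoc, 2, /[he]/).map((s) => s.name)); // [ 'Catherine', 'Beth' ]
console.log(topStudents(classDoc, 2).map((s) => s.name));         // [ 'Catherine', 'Brad' ]
```

Because filter() returns a new array, sorting it does not reorder the original students array.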
=====================
Edit for the extra question:
Yes, the first $match can be changed to filter on specific universities.
{$match: {
university: "University X"
}},
That will give one or more matching documents (in case you have a document per year or so) and the rest of the aggregation steps would still be valid.
The following match would retrieve the students for the given university for a given academic year in case that would be needed.
{$match: {
university: "University X",
academic_year: "2018-2019"
}},
That should narrow it down to get the correct documents.

Match and Average in mongo keep producing null

I'm using the console to perform an aggregation, using $match to check that a nested field exists and then passing the results to $group with the $avg operator. The match works just fine on the same field, and the equivalent code for a count works too, but when it comes to the average I get null every time.
I'm addressing the first element of an array with .0, for example, and then reading a field of that element. It's very perplexing and difficult to debug. Distinct shows that the values I look at are all numeric, as far as I know. Are there any suggestions for how to debug this?
db.b.aggregate([ {$match: {"x.x.x.0.x": {$exists: true} } }, {$group: {_id: null, myAvg: { $avg: "$x.x.x.0.x"}}}])
Results in:
{ "_id" : null, "myAvg" : null }
This appears to be a limitation of the aggregation framework with respect to where you can actually use the "array.n" notation to access the nth element of an array.
More precisely, given the following sample document:
db.test.insertOne({
"a" : [
{
"x" : 1.0
}
]
})
...you can do the following to retrieve all documents where the first element of the "a" array matches 1:
db.test.aggregate({
$match: {
"a.0.x": 1
}
})
However, you cannot run the following:
db.test.aggregate({
$project: {
"a0x": "$a.0.x"
}
})
Well, you can, but it will return an empty array like this, which is a little surprising:
{
"_id" : ...,
"a0x" : []
}
However, there is a special operator $arrayElemAt to access the nth element in this case like so:
db.test.aggregate({
$project: {
"a0x": { $arrayElemAt: [ "$a.x", 0 ] },
}
})
Kindly note that this will return the nth element only - so not nested inside an array anymore:
{
"a0x" : 1.0
}
So what you probably want to do is this:
db.b.aggregate({
$group: {
_id: null,
myAvg: {
$avg: {
$arrayElemAt: [ "$x.x.x.x", 0 ]
}
}
}
})
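A rough way to picture the difference in plain JavaScript: a path like "$a.x" walks every element of the array and collects its x, and $arrayElemAt then picks one position from that collected array. The sketch below is an in-memory analogy with made-up helper names and data, not MongoDB code.

```javascript
// Analogy only: "$a.x" collects x from every array element,
// and $arrayElemAt picks a single position from that result.
function firstElementField(doc, arrayField, field) {
  const values = doc[arrayField].map((el) => el[field]); // like "$a.x"
  return values[0];                                      // like { $arrayElemAt: [ "$a.x", 0 ] }
}

function average(numbers) {
  return numbers.reduce((sum, n) => sum + n, 0) / numbers.length;
}

// Made-up sample documents
const docs = [
  { a: [{ x: 1.0 }, { x: 9.0 }] },
  { a: [{ x: 3.0 }] },
];
const firsts = docs.map((d) => firstElementField(d, "a", "x"));
console.log(firsts, average(firsts)); // [ 1, 3 ] 2
```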

how to match the last value of array in mongo db? [duplicate]

I have a sample document like shown below
{
"_id" : "docID",
"ARRAY" : [
{
"k" : "value",
"T" : "20:15:35",
"I" : "Hai"
},
{
"K" : "some value",
"T" : "20:16:35",
"I" : "Hello"
},
{
"K" : "some other value",
"T" : "20:15:35",
"I" : "Update"
}
]
}
I am trying to update the last element in "ARRAY" based on the field "ARRAY.T" (which is the only field I know at the point of update), but my problem is that the first element in the array matches the query, so its ARRAY.I field is updated instead.
Query used to update:
db.collection.update( { _id: "docID","ARRAY.T" : "20:15:35"},
{ $set: { "ARRAY.$.I": "Updated value" }
})
I don't know the index of the array element to update, so I have to use ARRAY.T in the query. Is there any way to tell MongoDB to update the first element that matches the query, starting from the end of the array?
I understand what you are saying in that you want to match the last element in this case or in fact process the match in reverse order. There is no way to modify this and the index stored in the positional $ operator will always be the "first" match.
But you can change your approach to this, as the default behavior of $push is to "append" to the end of the array. But MongoDB 2.6 introduced a $position modifier so you can in fact always "pre-pend" to the array meaning your "oldest" item is at the end.
Take this for example:
db.artest.update(
{ "array": { "$in": [5] } },
{ "$push": { "array": { "$each": [5], "$position": 0 } }},
{ "upsert": true }
)
db.artest.update(
{ "array": { "$in": [5] } },
{ "$push": { "array": { "$each": [6], "$position": 0 } }},
{ "upsert": true }
)
This results in a document that is the "reverse" of the normal $push behavior:
{ "_id" : ObjectId("53eaf4517d0dc314962c93f4"), "array" : [ 6, 5 ] }
Alternately you could apply the $sort modifier when updating your documents in order to "order" the elements so they were reversed. But that may not be the best option if duplicate values are stored.
So look into storing your arrays in "reverse" if you intend to match the "newest" items "first". Currently that is your only way of getting your "match from last" behavior.
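Here is a small plain JavaScript sketch of why the storage order matters. It imitates the "first match wins" behaviour of the positional $ operator with findIndex. The helper name updateFirstMatch is invented, and the data is adapted from the question.

```javascript
// Illustration: like the positional $ operator, only the first match is updated.
function updateFirstMatch(array, t, newValue) {
  const index = array.findIndex((el) => el.T === t); // first match only
  if (index !== -1) array[index].I = newValue;
  return index;
}

// Elements in the order a plain $push would store them (oldest first).
const appended = [
  { T: "20:15:35", I: "Hai" },
  { T: "20:16:35", I: "Hello" },
  { T: "20:15:35", I: "Update" },
];
// The same elements stored newest first, as $push with $position: 0 would leave them.
const prepended = appended.map((el) => ({ ...el })).reverse();

updateFirstMatch(appended, "20:15:35", "Updated value");  // changes the oldest element ("Hai")
updateFirstMatch(prepended, "20:15:35", "Updated value"); // changes the newest element ("Update")

console.log(appended.map((el) => el.I));
console.log(prepended.map((el) => el.I));
```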

Query number of sub collections Mongodb

I am new to MongoDB and I am trying to figure out how to count all the matching entries inside an array of documents like the one below:
"impression_details" : [
{
"date" : ISODate("2014-04-24T16:35:46.051Z"),
"ip" : "::1"
},
{
"date" : ISODate("2014-04-24T16:35:53.396Z"),
"ip" : "::1"
},
{
"date" : ISODate("2014-04-25T16:22:20.314Z"),
"ip" : "::1"
}
]
What I would like to do is count how many 2014-04-24 there are (which is 2). At the moment my query is like this and it is not working:
db.banners.find({
"impression_details.date":{
"$gte": ISODate("2014-04-24T00:00:00.000Z"),
"$lte": ISODate("2014-04-24T23:59:59.000Z")
}
}).count()
Not sure what is going on please help!
Thank you.
The concept here is that there is a distinct difference between selecting documents and selecting elements of a sub-document array. So what is happening currently in your query is exactly what should be happening. As the document contains at least one sub-document entry that matches your condition, then that document is found.
In order to "filter" the content of the sub-documents itself for more than one match, then you need to apply the .aggregate() method. And since you are expecting a count then this is what you want:
db.banners.aggregate([
// Matching documents still makes sense
{ "$match": {
"impression_details.date":{
"$gte": ISODate("2014-04-24T00:00:00.000Z"),
"$lte": ISODate("2014-04-24T23:59:59.000Z")
}
}},
// Unwind the array
{ "$unwind": "$impression_details" },
// Actually filter the array contents
{ "$match": {
"impression_details.date":{
"$gte": ISODate("2014-04-24T00:00:00.000Z"),
"$lte": ISODate("2014-04-24T23:59:59.000Z")
}
}},
// Group back to the normal document form and get a count
{ "$group": {
"_id": "$_id",
"impression_details": { "$push": "$impression_details" },
"count": { "$sum": 1 }
}}
])
And that will give you a form that only has the elements that match your query in the array, as well as providing the count of those entries that were matched.
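To make the distinction between matching a document and matching its array elements concrete, here is a minimal in-memory sketch in plain JavaScript. The function name countImpressions is made up, and the data is the sample from the question with ISODate swapped for the standard Date constructor.

```javascript
// Illustration only: count array elements whose date falls inside [start, end].
function countImpressions(doc, start, end) {
  return doc.impression_details.filter(
    (d) => d.date >= start && d.date <= end
  ).length;
}

const banner = {
  impression_details: [
    { date: new Date("2014-04-24T16:35:46.051Z"), ip: "::1" },
    { date: new Date("2014-04-24T16:35:53.396Z"), ip: "::1" },
    { date: new Date("2014-04-25T16:22:20.314Z"), ip: "::1" },
  ],
};

const dayStart = new Date("2014-04-24T00:00:00.000Z");
const dayEnd = new Date("2014-04-24T23:59:59.000Z");
console.log(countImpressions(banner, dayStart, dayEnd)); // 2
```

The document as a whole matches because at least one element is in range, but only two of its three elements are counted.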
Using the $elemMatch operator would do what you want.
Your query finds all the documents whose impression_details field contains a date between ISODate("2014-04-24T00:00:00.000Z") and ISODate("2014-04-24T23:59:59.000Z"). The point is that it returns the whole document, which is not what you want. So if you want to count only the subdocuments that satisfy your condition, check each array element as you iterate:
var start = ISODate("2014-04-24T00:00:00.000Z");
var end = ISODate("2014-04-24T23:59:59.000Z");
var docs = db.banners.find({
"impression_details": {
$elemMatch: {
date: { $gte: start, $lte: end }
}
}
});
var count = 0;
docs.forEach(function(doc) {
// find() returns whole documents, so count only the elements in range
doc.impression_details.forEach(function(detail) {
if (detail.date >= start && detail.date <= end) { count += 1; }
});
});
print(count);