MongoDB: how to count number of keys in a document?

Let's say a document is:
{
  a: 1,
  b: 1,
  c: 2,
  ....
  z: 2
}
How can I count the number of keys in such a document?
Thank you

Quite possible if using MongoDB 3.6 and newer, through the aggregation framework.
Use the $objectToArray operator within an aggregation pipeline to convert the document to an array. The returned array contains an element for each field/value pair in the original document, and each element is a document with two fields, k and v.
The reference to the root document is made possible through the $$ROOT system variable, which references the top-level document currently being processed in the aggregation pipeline stage.
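For illustration, a minimal sketch of what $objectToArray produces (the sample document is invented):
// Given a document { _id: 1, a: 1, b: 2 },
// { "$objectToArray": "$$ROOT" } evaluates to:
// [ { "k": "_id", "v": 1 }, { "k": "a", "v": 1 }, { "k": "b", "v": 2 } ]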
Once you have the array, you can leverage the $addFields pipeline stage to create a field that holds the count, where the actual count is derived with the $size operator.
All this can be done in a single pipeline by nesting the expressions as follows:
db.collection.aggregate([
  { "$addFields": {
    "count": {
      "$size": {
        "$objectToArray": "$$ROOT"
      }
    }
  } }
])
Example Output
{
  "_id" : ObjectId("5a7cd94520a31e44e0e7e282"),
  "a" : 1.0,
  "b" : 1.0,
  "c" : 2.0,
  "z" : 2.0,
  "count" : 5
}
To exclude the _id field, you can use the $filter operator as:
db.collection.aggregate([
  {
    "$addFields": {
      "count": {
        "$size": {
          "$filter": {
            "input": { "$objectToArray": "$$ROOT" },
            "as": "el",
            "cond": { "$ne": [ "$$el.k", "_id" ] }
          }
        }
      }
    }
  }
])
or, as suggested by 0zkr PM, simply add a $project pipeline step at the beginning:
db.collection.aggregate([
  { "$project": { "_id": 0 } },
  { "$addFields": {
    "count": {
      "$size": {
        "$objectToArray": "$$ROOT"
      }
    }
  } }
])

There's no built-in command for that. Fetch this document and count the keys yourself.

In Mongo (as in most NoSQL solutions) it's usually a good idea to pre-calculate such values if you want to query on them later (e.g. "where the number of different keys > 12"), so maybe you should consider adding a new field keyCount which you increment every time a new key is added; a sketch follows below.
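A minimal hedged sketch of that bookkeeping, with a hypothetical field keyCount and a hypothetical new key newKey (someId and someValue are placeholders too):
// Add the key and bump the counter only if the key is not already present.
db.collection.updateOne(
  { _id: someId, newKey: { $exists: false } },
  { $set: { newKey: someValue }, $inc: { keyCount: 1 } }
)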

If you are in the Mongo console, you can write JavaScript to do that:
var doc = db.coll.findOne({});
var nKeys = 0;
for (var k in doc) { nKeys++; }
print(nKeys);
This will still be considered "outside the mongo" though.

Sergio is right, you have to do it outside of mongo.
You can do something like this:
var count = 0;
for (var k in obj) { count++; }
print(count);

const MongoClient = new (require("mongodb").MongoClient)(
  "mongodb://localhost:27017",
  {
    useNewUrlParser: true,
    useUnifiedTopology: true
  }
);

(async () => {
  const connection = await MongoClient.connect();
  const dbStuff = await connection
    .db("db")
    .collection("weedmaps.com/brands")
    .find({})
    .toArray(); // <- chuck it all into an array
  for (let thing of dbStuff) {
    console.log(Object.keys(thing).length);
  }
  return await connection.close();
})();

If you are working with documents like mongoose.Document, you can only count the keys inside the doc after transforming it into a plain object, because before that the doc just has common mongoose properties:
Object.keys(document); // returns [ '$__', 'isNew', 'errors', '_doc', '$locals' ]
Object.keys(document).length; // returns ALWAYS 5
So cast it to a plain object first, then pass it to Object.keys:
console.dir(
  Object.keys(document.toObject()) // Returns ['prop1', 'prop2', 'prop3']
);
And to show the keys count
console.log(
  Object.keys(document.toObject()).length
); // Returns 3
And Be Happy!!!

Related

Query on all nested docs inside a nested doc in MongoDB

I have a collection with documents that look like this:
{
  keyA1: "stringVal",
  keyA2: "stringVal",
  keyA3: {
    keyB1: {
      feild1: intVal,
      feild2: intVal
    },
    keyB2: {
      feild1: intVal,
      feild2: intVal
    }
  }
}
Currently the [keyB1, keyB2, ...] set is 7 keys, the same for all documents in the collection. I want to query the intVals on specific fields for all keyB's. So, for example, I might want to find all documents where field2 has a value greater than 100, regardless of which keyB it falls in.
For any one specific keyB, I simply use dot notation: {"keyA3.keyB2.field2": {$gte: 100}}. Right now, I have the option of looping over all keyB's, but this may not be the case in the future when more keyB values can be added. I don't want to have to modify the code then, and would like to avoid hardcoding those values in any way. I also need the solution to be fairly fast, as the final deployment is expected to have over 20M documents.
How can I write a query that can "skip" the keyB field in the dot notation and just go through all the embedded docs?
FWIW, I'm implementing this in python using pymongo. Thanks.
First, convert the keyA3 object to an array and add it as a new field with $addFields.
Then filter the new array to match elements whose feild2 value is greater than 100.
Then match the docs where the size of the matched array is greater than 0, and finally remove the extra fields we added.
db.collection.aggregate([
  {
    "$addFields": {
      "arr": {
        "$objectToArray": "$keyA3"
      }
    }
  },
  {
    "$addFields": {
      "matchArrSize": {
        $size: {
          "$filter": {
            "input": "$arr",
            "as": "z",
            "cond": {
              $gt: [ "$$z.v.feild2", 100 ]
            }
          }
        }
      }
    }
  },
  {
    $match: {
      matchArrSize: { $gt: 0 }
    }
  },
  {
    $unset: [ "arr", "matchArrSize" ]
  }
])
https://mongoplayground.net/p/VumwL9y7Km1
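If you are on MongoDB 3.6 or newer, a hedged alternative sketch (my own variation, not part of the answer above) does the same check directly in a find() with $expr and $anyElementTrue, so no temporary fields are needed:
db.collection.find({
  $expr: {
    $anyElementTrue: [
      {
        $map: {
          input: { $objectToArray: "$keyA3" },
          as: "z",
          in: { $gt: [ "$$z.v.feild2", 100 ] }
        }
      }
    ]
  }
})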

Encapsulate/Transform an Array into an Array of Objects

I am currently in the process of modifying a schema and I need to do a relatively trivial transform using the aggregation framework and a bulkWrite.
I want to be able to take this array:
{
  ...,
  "images" : [
    "http://example.com/...",
    "http://example.com/...",
    "http://example.com/..."
  ]
}
and aggregate to a similar array where the original value is encapsulated:
{
  ...,
  "images" : [
    { url: "http://example.com/..." },
    { url: "http://example.com/..." },
    { url: "http://example.com/..." }
  ]
}
This slow query works, but it is ridiculously expensive to unwind an entire collection.
[
  {
    $match: {}
  },
  {
    $unwind: {
      path: "$images",
    }
  },
  {
    $group: {
      _id: "$_id",
      images_2: { $addToSet: { url: "$images" } }
    }
  },
]
How can this be achieved with project or some other cheaper aggregation?
The $map expression should do the job; try this:
db.col.aggregate([
  {
    $project: {
      images: {
        $map: {
          input: '$images',
          as: 'url',
          in: {
            url: '$$url'
          }
        }
      }
    }
  }
]);
You don't need to use the bulkWrite() method for this.
You can use the $map aggregation array operator to apply an expression to each element in your array.
Here, the expression simply creates a new object where the value is the item in the array.
let mapExpr = {
  "$map": {
    "input": "$images",
    "as": "imageUrl",
    "in": { "url": "$$imageUrl" }
  }
};
Finally you can use the $out aggregation pipeline operator to overwrite your collection or write the result into a different collection.
Of course, $map is not an aggregation pipeline stage, which means that the $map expression must be used within a pipeline stage.
The way you do this depends on your MongoDB version.
The best way is in MongoDB 3.4 using $addFields to change the value of the "images" field in your document.
db.collection.aggregate([
  { "$addFields": { "images": mapExpr } },
  { "$out": "collection" }
])
In MongoDB 3.2 and earlier, you need to use the $project pipeline stage instead, but then you also need to manually include all the other fields in your document:
db.collection.aggregate([
  { "$project": { "images": mapExpr } },
  { "$out": "collection" }
])

Finding documents based on the minimum value in an array

my document structure is something like :
{
  _id: ...,
  key1: ....
  key2: ....
  ....
  min_value: // should be the minimum of all the values in options
  options: [
    {
      source: 'a',
      value: 12,
    },
    {
      source: 'b',
      value: 10,
    },
    ...
  ]
},
{
  _id: ...,
  key1: ....
  key2: ....
  ....
  min_value: // should be the minimum of all the values in options
  options: [
    {
      source: 'a',
      value: 24,
    },
    {
      source: 'b',
      value: 36,
    },
    ...
  ]
}
The value of the various sources in options will keep getting updated on a frequent basis (every few minutes or hours).
Assume the size of the options array doesn't change, i.e. no extra elements are added to the list.
My queries are of the following type:
-find all documents where the min_value of all the options falls between some limits.
I could first do an unwind on options (and then take the min) and then run comparison queries, but I am new to mongo and not sure how performance is affected by the unwind operation. The number of documents of this type would be about a few million.
Or does anyone have any suggestions around changing the document structure which could help me simplify this query? (apart from creating separate documents per source - that would involve a lot of data duplication)
Thanks!
Using $unwind is indeed quite expensive, most notably so with larger arrays, but there is a cost in all cases of usage. There are a couple of ways to approach not needing $unwind here without real structural changes.
Pure Aggregation
In the basic case, as of the MongoDB 3.2.x release series, the $min operator can work directly on an array of values in a "projection" sense, in addition to its standard grouping accumulator role. This means that with the help of the related $map operator for processing elements of an array, you can get the minimal value without using $unwind:
db.collection.aggregate([
  // Still makes sense to use an index to select only possible documents
  { "$match": {
    "options": {
      "$elemMatch": {
        "value": { "$gte": minValue, "$lt": maxValue }
      }
    }
  }},
  // Provides a logical filter to remove non-matching documents
  { "$redact": {
    "$cond": {
      "if": {
        "$let": {
          "vars": {
            "min_value": {
              "$min": {
                "$map": {
                  "input": "$options",
                  "as": "option",
                  "in": "$$option.value"
                }
              }
            }
          },
          "in": { "$and": [
            { "$gte": [ "$$min_value", minValue ] },
            { "$lt": [ "$$min_value", maxValue ] }
          ]}
        }
      },
      "then": "$$KEEP",
      "else": "$$PRUNE"
    }
  }},
  // Optionally return the min_value as a field
  { "$project": {
    "min_value": {
      "$min": {
        "$map": {
          "input": "$options",
          "as": "option",
          "in": "$$option.value"
        }
      }
    }
  }}
])
The basic case is to get the "minimum" value from the array (done inside $let since we want to use the result "twice" in the logical conditions, which helps us not repeat ourselves). The first step is to extract the "value" data from the "options" array, which is done using $map.
The output of $map is an array with just those values, so this is supplied as the argument to $min, which then returns the minimum value for that array.
Using $redact is sort of like a $match pipeline stage with the difference that rather than needing a field to be "present" in the document being examined, you instead just form a logical condition with calculations.
In this case the condition is $and where "both" the logical forms of $gte and $lt return true against the calculated value ( from $let as "$$min_value" ).
The $redact stage then has the special arguments to apply to $$KEEP the document when the condition is true or $$PRUNE the document from results when it is false.
It's all very much like doing $project and then $match to actually project the value into the document before filtering in another stage, but all done in one stage. Of course you might actually want to $project the resulting field in what you return, but it generally cuts the workload if you remove non-matched documents "first" using $redact instead.
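For comparison, a hedged sketch of that two-stage $project-then-$match form (same minValue/maxValue placeholders as above):
db.collection.aggregate([
  { "$project": {
    "options": 1,
    "min_value": {
      "$min": {
        "$map": { "input": "$options", "as": "option", "in": "$$option.value" }
      }
    }
  }},
  { "$match": {
    "min_value": { "$gte": minValue, "$lt": maxValue }
  }}
])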
Updating Documents
Of course I think the best option is to actually keep the "min_value" field in the document rather than work it out at run-time. So this is a very simple thing to do when adding to or altering array items during update.
For this there is the $min "update" operator. Use it when appending with $push:
db.collection.update(
  { "_id": id },
  {
    "$push": { "options": { "source": "a", "value": 9 } },
    "$min": { "min_value": 9 }
  }
)
Or when updating a value of an element:
db.collection.update(
  { "_id": id, "options.source": "a" },
  {
    "$set": { "options.$.value": 9 },
    "$min": { "min_value": 9 }
  }
)
If the current "min_value" in the document is greater than the argument in $min, or the key does not yet exist, then the value given will be written. If the argument is greater than or equal to the current value, the existing value stays in place since it is already the smaller value.
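A quick hedged illustration of those semantics (the stored value is invented):
// Suppose the stored document is { _id: 1, min_value: 10 }.
db.collection.update({ "_id": 1 }, { "$min": { "min_value": 12 } }) // no change: 10 is already smaller
db.collection.update({ "_id": 1 }, { "$min": { "min_value": 7 } })  // min_value becomes 7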
You can even set all your existing data with a simple "bulk" operations update:
var ops = [];

db.collection.find({ "min_value": { "$exists": false } }).forEach(function(doc) {
  // Queue operations
  ops.push({
    "updateOne": {
      "filter": { "_id": doc._id },
      "update": {
        "$min": {
          "min_value": Math.min.apply(
            null,
            doc.options.map(function(option) {
              return option.value
            })
          )
        }
      }
    }
  });

  // Write once in 1000 documents
  if ( ops.length == 1000 ) {
    db.collection.bulkWrite(ops);
    ops = [];
  }
});

// Clear any remaining operations
if ( ops.length > 0 )
  db.collection.bulkWrite(ops);
Then with a field in place, it is just a simple range selection:
db.collection.find({
  "min_value": { "$gte": minValue, "$lt": maxValue }
})
So it really should be in your best interests to keep a field ( or fields if you regularly need different conditions ) in the document since that provides the most efficient query.
Of course, the new functions of aggregation $min along with $map also make this viable to use without a field, if you prefer more dynamic conditions.

How do I query a mongo document containing subset of nested array

Here is a doc I have:
var docIHave = {
  _id: "someId",
  things: [
    {
      name: "thing1",
      stuff: [1,2,3,4,5,6,7,8,9]
    },
    {
      name: "thing2",
      stuff: [4,5,6,7,8,9,10,11,12,13,14]
    },
    {
      name: "thing3",
      stuff: [1,4,6,8,11,21,23,30]
    }
  ]
}
This is the doc I want:
var docIWant = {
  _id: "someId",
  things: [
    {
      name: "thing1",
      stuff: [5,6,7,8,9]
    },
    {
      name: "thing2",
      stuff: [5,6,7,8,9,10,11]
    },
    {
      name: "thing3",
      stuff: [6,8,11]
    }
  ]
}
The stuff arrays of docIWant should only contain items greater than min=4 and smaller than max=12.
Background:
I have a meteor app and I subscribe to a collection giving me docIHave. Based on parameters min and max I need the docIWant "on the fly". The original document should not be modified. I need a query or procedure that returns me docIWant with the subset of stuff.
A practical code example would be greatly appreciated.
Use the aggregation framework for this. In the aggregation pipeline, consider the $match operator as your first pipeline stage. This is quite necessary to optimize your aggregation as you would need to filter documents that match the given criteria first before passing them on further down the pipeline.
Next use the $unwind operator. This deconstructs the things array field from the input documents to output a document for each element. Each output document is the input document with the value of the array field replaced by the element.
Another $unwind operation would be needed on the things.stuff array as well.
The next pipeline stage would then filter documents where the deconstructed things.stuff match the given min and max criteria. Use a $match operator for this.
A $group operator is then required to group the input documents by a specified identifier expression and apply the accumulator expression $push to each group. This creates an array field on each group.
Typically your aggregation should end up like this (although I haven't actually tested it, this should get you going in the right direction):
db.collection.aggregate([
  {
    "$match": {
      "things.stuff": { "$gt": 4, "$lte": 11 }
    }
  },
  { "$unwind": "$things" },
  { "$unwind": "$things.stuff" },
  {
    "$match": {
      "things.stuff": { "$gt": 4, "$lte": 11 }
    }
  },
  {
    "$group": {
      "_id": {
        "_id": "$_id",
        "things": "$things"
      },
      "stuff": { "$push": "$things.stuff" }
    }
  },
  {
    "$group": {
      "_id": "$_id._id",
      "things": {
        "$push": {
          "name": "$_id.things.name",
          "stuff": "$stuff"
        }
      }
    }
  }
])
If you need to transform the document on the client for display purposes, you could do something like this:
Template.myTemplate.helpers({
  transformedDoc: function() {
    // get the bounds - maybe these are stored in session vars
    var min = Session.get('min');
    var max = Session.get('max');
    // fetch the doc somehow that needs to be transformed
    var doc = SomeCollection.findOne();
    // transform the thing.stuff arrays
    _.each(doc.things, function(thing) {
      thing.stuff = _.reject(thing.stuff, function(n) {
        return (n < min) || (n > max);
      });
    });
    // return the transformed doc
    return doc;
  }
});
Then in your template: {{#each transformedDoc.things}}...{{/each}}
Use mongo aggregation like the following:
First use $unwind; this will unwind stuff. Then use $match to find elements greater than 4. After that, $group the data based on things.name and add the required fields in $project.
The query will be as follows:
db.collection.aggregate([
  { $unwind: "$things" },
  { $unwind: "$things.stuff" },
  {
    $match: {
      "things.stuff": { $gt: 4, $lt: 12 }
    }
  },
  {
    $group: {
      "_id": "$things.name",
      "stuff": { $push: "$things.stuff" }
    }
  },
  {
    $project: {
      "thingName": "$_id",
      "stuff": 1
    }
  }
])
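On MongoDB 3.2+, a hedged alternative sketch avoids $unwind altogether by combining $map over things with $filter on each stuff array (the bounds min=4 and max=12 are inlined from the question):
db.collection.aggregate([
  { $project: {
    things: {
      $map: {
        input: "$things",
        as: "thing",
        in: {
          name: "$$thing.name",
          stuff: {
            $filter: {
              input: "$$thing.stuff",
              as: "n",
              cond: { $and: [ { $gt: [ "$$n", 4 ] }, { $lt: [ "$$n", 12 ] } ] }
            }
          }
        }
      }
    }
  } }
])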

Fastest way to remove duplicate documents in mongodb

I have approximately 1.7M documents in mongodb (10M+ in the future). Some of them represent duplicate entries which I do not want. The structure of a document is something like this:
{
  _id: 14124412,
  nodes: [
    12345,
    54321
  ],
  name: "Some beauty"
}
A document is a duplicate if it has at least one node the same as another document with the same name. What is the fastest way to remove duplicates?
The dropDups: true option is not available in 3.0.
I have a solution with the aggregation framework for collecting duplicates and then removing them in one go.
It might be somewhat slower than system-level "index" changes, but it is good considering the way you want to remove duplicate documents.
a. Remove all documents in one go
var duplicates = [];

db.collectionName.aggregate([
  { $match: {
    name: { "$ne": '' }       // discard selection criteria
  }},
  { $group: {
    _id: { name: "$name" },   // can be grouped on multiple properties
    dups: { "$addToSet": "$_id" },
    count: { "$sum": 1 }
  }},
  { $match: {
    count: { "$gt": 1 }       // Duplicates considered as count greater than one
  }}
],
{allowDiskUse: true}          // For faster processing if set is larger
)                             // You can display the result up to this point and check duplicates
.forEach(function(doc) {
  doc.dups.shift();           // First element skipped for deleting
  doc.dups.forEach(function(dupId) {
    duplicates.push(dupId);   // Getting all duplicate ids
  })
})

// If you want to check all "_id" which you are deleting; else the print statement is not needed
printjson(duplicates);

// Remove all duplicates in one go
db.collectionName.remove({ _id: { $in: duplicates } })
b. You can delete documents one by one.
db.collectionName.aggregate([
  // discard selection criteria; you can remove the "$match" section if you want
  { $match: {
    "source_references.key": { "$ne": '' }
  }},
  { $group: {
    _id: { key: "$source_references.key" },  // can be grouped on multiple properties
    dups: { "$addToSet": "$_id" },
    count: { "$sum": 1 }
  }},
  { $match: {
    count: { "$gt": 1 }       // Duplicates considered as count greater than one
  }}
],
{allowDiskUse: true}          // For faster processing if set is larger
)                             // You can display the result up to this point and check duplicates
.forEach(function(doc) {
  doc.dups.shift();           // First element skipped for deleting
  db.collectionName.remove({ _id: { $in: doc.dups } });  // Delete remaining duplicates
})
Assuming you want to permanently delete docs that contain a duplicate name + nodes entry from the collection, you can add a unique index with the dropDups: true option:
db.test.ensureIndex({name: 1, nodes: 1}, {unique: true, dropDups: true})
As the docs say, use extreme caution with this as it will delete data from your database. Back up your database first in case it doesn't do exactly as you're expecting.
UPDATE
This solution is only valid through MongoDB 2.x as the dropDups option is no longer available in 3.0 (docs).
Create collection dump with mongodump
Clear collection
Add unique index
Restore collection with mongorestore
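A hedged sketch of those four steps (database and collection names are placeholders); by default mongorestore reports duplicate-key errors but keeps going, which is what silently drops the duplicates:
// 1. shell:       mongodump --db mydb --collection mycoll
// 2. mongo shell: db.mycoll.remove({})
// 3. mongo shell: db.mycoll.createIndex({ name: 1, nodes: 1 }, { unique: true })
// 4. shell:       mongorestore --db mydb --collection mycoll dump/mydb/mycoll.bson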
I found this solution that works with MongoDB 3.4:
I'll assume the field with duplicates is called fieldX
db.collection.aggregate([
  {
    // only match documents that have this field
    // you can omit this stage if you don't have missing fieldX
    $match: { "fieldX": { $nin: [null] } }
  },
  {
    $group: { "_id": "$fieldX", "doc": { "$first": "$$ROOT" } }
  },
  {
    $replaceRoot: { "newRoot": "$doc" }
  }
],
{allowDiskUse: true})
Being new to mongoDB, I spent a lot of time and used other lengthy solutions to find and delete duplicates. However, I think this solution is neat and easy to understand.
It works by first matching documents that contain fieldX (I had some documents without this field, and I got one extra empty result).
The next stage groups documents by fieldX, and only inserts the $first document in each group using $$ROOT. Finally, it replaces the whole aggregated group by the document found using $first and $$ROOT.
I had to add allowDiskUse because my collection is large.
You can add this after any number of pipeline stages, and although the documentation for $first mentions a sort stage prior to using $first, it worked for me without it. (I couldn't post a link here, my reputation is less than 10 :( )
You can save the results to a new collection by adding an $out stage...
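If you care which document in each group survives (say, the newest), a hedged variant adds a $sort before the $group; the createdAt field here is an assumption:
db.collection.aggregate([
  { $match: { "fieldX": { $nin: [null] } } },
  { $sort: { createdAt: -1 } },  // assumed timestamp field; $first now keeps the newest
  { $group: { "_id": "$fieldX", "doc": { "$first": "$$ROOT" } } },
  { $replaceRoot: { "newRoot": "$doc" } }
],
{allowDiskUse: true})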
Alternatively, if one is only interested in a few fields, e.g. field1 and field2, and not the whole document, you can collect them in the group stage without replaceRoot:
db.collection.aggregate([
  {
    // only match documents that have this field
    $match: { "fieldX": { $nin: [null] } }
  },
  {
    $group: { "_id": "$fieldX", "field1": { "$first": "$$ROOT.field1" }, "field2": { "$first": "$field2" } }
  }
],
{allowDiskUse: true})
The following Mongo aggregation pipeline does the deduplication and outputs the result back to the same or a different collection.
collection.aggregate([
{ $group: {
_id: '$field_to_dedup',
doc: { $first: '$$ROOT' }
} },
{ $replaceRoot: {
newRoot: '$doc'
} },
{ $out: 'collection' }
], { allowDiskUse: true })
My DB had millions of duplicate records. @somnath's answer did not work as is, so here is the solution that worked for me, for people looking to delete millions of duplicate records.
/** Create an array to store all duplicate record ids */
var duplicates = [];

/** Start the aggregation pipeline */
db.collection.aggregate([
  {
    $match: { /** Add any filter here. Add an index for the filter keys */
      filterKey: { $exists: false }
    }
  },
  {
    $sort: { /** Sort it in such a way that you want to retain the first element */
      createdAt: -1
    }
  },
  {
    $group: {
      _id: {
        key1: "$key1", key2: "$key2" /** These are the keys which define a duplicate. Here documents with the same values for key1 and key2 will be considered duplicates */
      },
      dups: {
        $push: { _id: "$_id" }
      },
      count: { $sum: 1 }
    }
  },
  {
    $match: {
      count: { "$gt": 1 }
    }
  }
],
{
  allowDiskUse: true
}).forEach(function(doc) {
  doc.dups.shift();
  doc.dups.forEach(function(dupId) {
    duplicates.push(dupId._id);
  })
})

/** Delete the duplicates in chunks */
var i, j, temparray, chunk = 100000;
for (i = 0, j = duplicates.length; i < j; i += chunk) {
  temparray = duplicates.slice(i, i + chunk);
  db.collection.bulkWrite([ { deleteMany: { "filter": { "_id": { "$in": temparray } } } } ])
}
Here is a slightly more 'manual' way of doing it:
Essentially, first get a list of all the unique keys you are interested in.
Then perform a search using each of those keys and delete all but one document if that search returns more than one.
db.collection.distinct("key").forEach((num) => {
  var i = 0;
  db.collection.find({ key: num }).forEach((doc) => {
    if (i) db.collection.remove({ key: num }, { justOne: true })
    i++
  })
});
Tips to speed things up, when only a small portion of your documents are duplicated:
You need an index on the field to detect duplicates.
$group does not use the index, but it can take advantage of $sort, and $sort uses the index. So you should put a $sort step at the beginning.
Do an in-place delete_many() instead of $out to a new collection; this will save lots of IO time and disk space.
If you use pymongo you can do:
import itertools

import pymongo
from pymongo import IndexModel

index_uuid = IndexModel(
    [
        ('uuid', pymongo.ASCENDING)
    ],
)
col.create_indexes([index_uuid])
pipeline = [
    {"$sort": {"uuid": 1}},
    {
        "$group": {
            "_id": "$uuid",
            "dups": {"$addToSet": "$_id"},
            "count": {"$sum": 1}
        }
    },
    {
        "$match": {"count": {"$gt": 1}}
    },
]
it_cursor = col.aggregate(
    pipeline, allowDiskUse=True
)
# skip 1st dup of each dups group
dups = list(itertools.chain.from_iterable(map(lambda x: x["dups"][1:], it_cursor)))
col.delete_many({"_id": {"$in": dups}})
Performance
I tested it on a database containing 30M documents, 1TB large.
Without index/sort it takes more than an hour just to get the cursor (I did not even have the patience to wait for it).
With index/sort but using $out to output to a new collection: this is safer if your filesystem does not support snapshots, but it requires lots of disk space and takes more than 40 minutes to finish despite the fact that we are using SSDs. It will be much slower if you are on HDD RAID.
With index/sort and an in-place delete_many, it takes around 5 minutes in total.
The following method merges documents with the same name while only keeping the unique nodes without duplicating them.
I found using the $out operator to be a simple way. I unwind the array and then group it by adding to a set. The $out operator allows the aggregation result to persist [docs].
If you put the name of the collection itself it will replace the collection with the new data. If the name does not exist it will create a new collection.
Hope this helps.
allowDiskUse may have to be added to the pipeline.
db.collectionName.aggregate([
  {
    $unwind: { path: "$nodes" }
  },
  {
    $group: {
      _id: "$name",
      nodes: {
        $addToSet: "$nodes"
      }
    }
  },
  {
    $project: {
      _id: 0,
      name: "$_id",  // _id here is the grouped name string
      nodes: 1
    }
  },
  {
    $out: "collectionNameWithoutDuplicates"
  }
])
Using pymongo, this should work.
Add the fields that need to be unique for the collection in unique_field:
unique_field = {"field1": "$field1", "field2": "$field2"}
cursor = DB.COL.aggregate([
    {"$group": {"_id": unique_field, "dups": {"$push": "$uuid"}, "count": {"$sum": 1}}},
    {"$match": {"count": {"$gt": 1}}},
    {"$group": {"_id": None, "dups": {"$addToSet": {"$arrayElemAt": ["$dups", 1]}}}}
], allowDiskUse=True)
Slice the dups array depending on the duplication count (here I had only one extra duplicate for all):
items = list(cursor)
removeIds = items[0]['dups']
hold.remove({"uuid": {"$in": removeIds}})
I don't know whether this is going to answer the main question, but for others it'll be useful.
1. Query the duplicate row using the findOne() method and store it as an object.
const User = db.User.findOne({_id: "duplicateid"});
2. Execute the deleteMany() method to remove all the rows with the id "duplicateid".
db.User.deleteMany({_id: "duplicateid"});
3. Insert the values stored in the User object.
db.User.insertOne(User);
Easy and fast!!!!
First, you can find all the duplicates and remove those duplicates from the DB. Here we take the id column to check for and remove duplicates.
db.collection.aggregate([
  { "$group": { "_id": "$id", "count": { "$sum": 1 } } },
  { "$match": { "_id": { "$ne": null }, "count": { "$gt": 1 } } },
  { "$sort": { "count": -1 } },
  { "$project": { "name": "$_id", "_id": 0 } }
]).then(data => {
  var dr = data.map(d => d.name);
  console.log("duplicate Records:: ", dr);
  db.collection.remove({ id: { $in: dr } }).then(removedD => {
    console.log("Removed duplicate Data:: ", removedD);
  })
})
The general idea is to use findOne (https://docs.mongodb.com/manual/reference/method/db.collection.findOne/) to retrieve one random id from the duplicate records in the collection, then delete all the records in the collection other than the random id that we retrieved from findOne.
You can do something like this if you are trying to do it in pymongo.
def _run_query():
    try:
        for record in aggregate_based_on_field(collection):
            if not record:
                continue
            _logger.info("Working on Record %s", record)
            try:
                retain = db.collection.find_one({'field1': 'x', 'field2': 'y'}, {'_id': 1})
                _logger.info("_id to retain from duplicates %s", retain['_id'])
                db.collection.remove({'field1': 'x', 'field2': 'y', '_id': {'$ne': retain['_id']}})
            except Exception as ex:
                _logger.error("Error when retaining the record: %s Exception: %s", record, str(ex))
    except Exception as e:
        _logger.error("Mongo error when deleting duplicates %s", str(e))

def aggregate_based_on_field(collection):
    return collection.aggregate([{'$group': {'_id': "$fieldX"}}])
From the shell:
Replace find_one with findOne.
The same remove command should work.
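A hedged mongo shell equivalent of the retain-and-remove step (field1/field2 and the values "x"/"y" are the same placeholders as above):
// Keep one document per (field1, field2) pair and remove the rest.
var retain = db.collection.findOne({ field1: "x", field2: "y" }, { _id: 1 });
db.collection.remove({ field1: "x", field2: "y", _id: { $ne: retain._id } });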