Removing duplicate records using MapReduce - mongodb

I'm using MongoDB and need to remove duplicate records. I have a listing collection that looks like so: (simplified)
[
{ "MlsId": "12345"" },
{ "MlsId": "12345" },
{ "MlsId": "23456" },
{ "MlsId": "23456" },
{ "MlsId": "0" },
{ "MlsId": "0" },
{ "MlsId": "" },
{ "MlsId": "" }
]
A listing is a duplicate if the MlsId is not "" or "0" and another listing has that same MlsId. So in the example above, the 2nd and 4th records would need to be removed.
How would I find all duplicate listings and remove them? I started looking at MapReduce but couldn't find an example that fit my case.
Here is what I have so far, but it doesn't check if the MlsId is "0" or "":
m = function () {
    emit(this.MlsId, 1);
}
r = function (k, vals) {
    return Array.sum(vals);
}
res = db.Listing.mapReduce(m, r);
db[res.result].find({value: {$gt: 1}});
db[res.result].drop();

I have not used MongoDB, but I have used mapreduce, and I think you are on the right track with your mapreduce functions. To exclude the "0" and empty-string values, you can add a check in the map function itself, something like:
m = function () {
    if (this.MlsId != 0 && this.MlsId != "") {
        emit(this.MlsId, 1);
    }
}
In MongoDB, the reduce function should return the aggregated value for a key rather than emit key-value pairs, so your original reduce was already right (note it is Array.sum, not Arrays.sum):
r = function (k, vals) {
    return Array.sum(vals);
}
After this, you should have a set of key-value pairs in the output such that the key is the MlsId and the value is the number of times that particular ID occurs. I am not sure about the drop() part: as you pointed out, it will most probably delete the whole output collection instead of removing only the duplicates. To get around this, maybe you can remove all the documents for a duplicated MlsId first and then re-insert one of them. Will that work for you?
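For what it's worth, a minimal sketch of that delete-and-keep-one step, assuming the map/reduce output is written to a named scratch collection (the "mls_counts" name and the keep variable here are illustrative, not from the original post):
var res = db.Listing.mapReduce(m, r, { out: "mls_counts" });
// for every MlsId that occurs more than once, keep one document and remove the rest
db.mls_counts.find({ value: { $gt: 1 } }).forEach(function (dup) {
    var keep = db.Listing.findOne({ MlsId: dup._id }); // dup._id is the emitted MlsId
    db.Listing.remove({ MlsId: dup._id, _id: { $ne: keep._id } });
});
db.mls_counts.drop(); // discard the scratch collection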

In mongodb you can use a query to restrict documents that are passed in for mapping. You probably want to do that for the ones you don't care about. Then in the reduce function you can ignore the dups and only return one of the docs for each duplicate key.
I'm a little confused about your goal though. If you just want to find duplicates and remove all but one of them then you can just create a unique index on that field and use the dropDups option; the process of creating the index will drop duplicate docs. Keeping the index will ensure that it doesn't happen again.
http://www.mongodb.org/display/DOCS/Indexes#Indexes-DuplicateValues
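For example, a sketch of that index on this question's collection (note: dropDups only exists in MongoDB 2.6 and older, and the "" and "0" values here would themselves collide under a unique index, so those documents would have to be excluded or handled separately first):
db.Listing.ensureIndex({ MlsId: 1 }, { unique: true, dropDups: true })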

You can use an aggregation operation to remove duplicates: unwind, introduce a dummy $group and $sum stage, and ignore the counts in your next stage. Something like this,
db.myCollection.aggregate([
    { $unwind: '$list' },
    {
        $group: {
            '_id': { 'listing_id': '$_id', 'MlsId': '$list.MlsId' },
            'count': { '$sum': 1 }
        }
    },
    {
        $group: {
            '_id': '$_id.listing_id',
            'list': { '$addToSet': { 'MlsId': '$_id.MlsId' } }
        }
    }
]);
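That pipeline dedups an embedded list array; if the goal is instead to delete duplicate top-level documents, a hedged sketch along the same $group lines (reusing the MlsId field and the "" / "0" exclusions from the question) might look like:
db.Listing.aggregate([
    // skip the sentinel values that are allowed to repeat
    { $match: { MlsId: { $nin: ["", "0"] } } },
    // collect the _ids seen for each MlsId
    { $group: { _id: "$MlsId", ids: { $push: "$_id" }, count: { $sum: 1 } } },
    // keep only the keys that actually have duplicates
    { $match: { count: { $gt: 1 } } }
]).forEach(function (group) {
    group.ids.shift(); // keep the first document for each key
    db.Listing.remove({ _id: { $in: group.ids } });
});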

This is how I followed @harri's answer to remove the duplicates:
// the output collection will contain each duplicated id and its number of occurrences
res = db.sampledDB.mapReduce(m, r, { out: "myDupesCollection" });
// iterate through the duplicated ids and remove duplicates (keep one)
db.myDupesCollection.find({ value: { $gt: 1 } }).forEach(function (myDoc) {
    var u_id = myDoc._id; // the emitted key is the MlsId itself
    var counts = myDoc.value;
    // remove() takes a justOne flag, not a count, so remove one at a time:
    // if there are 3 docs, remove 3-1=2 of them
    for (var i = 0; i < counts - 1; i++) {
        db.sampledDB.remove({ MlsId: u_id }, { justOne: true });
    }
});

Related

Split a string during MongoDB aggregate

Currently I have just fullname stored in the User collection in MongoDB. I'd like to run a report that splits out the first and last names, so for now I'm trying to run an aggregation and split the string where a whitespace is found.
Here is what I have now, but I'd like to replace the hard-coded end position with a variable based on where the whitespace is found. Is this possible in an aggregation pipeline?
db.users.aggregate([
    { $project: {
        fullname: { $toUpper: "$fullname" },
        first: { $substr: ["$fullname", 0, 2] },
        _id: 0
    }},
    { $sort: { fullname: 1 } }
]);
The aggregation framework does not have any operator to perform a "split" based on a matched character or any such thing. There is only $substr, which of course requires numeric start and length arguments, and there is no operator to return the "index" of a matched character either.
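(Side note: this answer predates MongoDB 3.4, which did add a $split string operator; on 3.4 and newer, a sketch like the following could replace the hard-coded positions:)
db.users.aggregate([
    { $project: {
        _id: 0,
        fullname: { $toUpper: "$fullname" },
        // everything before the first space, taken as the first name
        first: { $arrayElemAt: [{ $split: ["$fullname", " "] }, 0] }
    }},
    { $sort: { fullname: 1 } }
])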
You could use mapReduce, which can use JavaScript's .split(), but of course there is no "sort stage" in mapReduce other than the main key: results are always pre-sorted by that key before the reduce is attempted (and the reduce would not be applied here, since all keys are unique):
db.users.mapReduce(
    function () {
        var lastName = this.fullname.split(/\s/).reverse()[0].toUpperCase();
        emit({ "lastName": lastName, "orig": this._id }, this);
    },
    function () {}, // never called when all keys are unique
    { "out": { "inline": 1 } }
);
And that will basically extract the last name after a whitespace, convert it to uppercase, and use it as a composite value in the primary key, so results will be sorted by that key (note you cannot use _id as any part of the key name, or it will be sorted by that field instead).
But if your real case here is "sorting", then you are better off storing the data that way, thus giving you a direct value to sort on without calculation:
var bulk = db.users.initializeOrderedBulkOp(),
    count = 0;
db.users.find().forEach(function (user) {
    bulk.find({ "_id": user._id }).updateOne({
        "$set": { "lastName": user.fullname.split(/\s/).reverse()[0].toUpperCase() }
    });
    count++;
    if (count % 1000 == 0) {
        bulk.execute();
        bulk = db.users.initializeOrderedBulkOp();
    }
});
if (count % 1000 != 0)
    bulk.execute();
Then with a solid field in place you just run your sort:
db.users.find().sort({ "lastName": 1 });
Which is going to be a lot faster than trying to calculate a value from which to perform a sort.
Of course if sorting is not the purpose and it's just for presentation, then just perform the split in client code, where it makes the most sense to do so. The aggregation framework cannot restructure the data like that, and while mapReduce "could", its output is very opinionated and not really purposed for such an operation.

Update nested array document

Say I have this model:
{
    _id: 1,
    ref: '1',
    children: [
        {
            ref: '1.1',
            grandchildren: [
                {
                    ref: '1.1.1',
                    visible: true
                }
            ]
        }
    ]
}
I'm aware that positional operator for nested arrays isn't available yet.
https://jira.mongodb.org/browse/SERVER-831
but I wondered whether it's possible to atomically update the document in the nested array?
In my example, I'd like to update the visible flag to false for the document with ref 1.1.1.
I have the children ref == '1.1' and the grandchildren ref == '1.1.1'.
Thanks.
Yes, but this is possible only if you know beforehand the index of the children array element that holds the grandchildren object to be updated; the update query will then use the positional operator as follows:
db.collection.update(
    {
        "children.ref": "1.1",
        "children.grandchildren.ref": "1.1.1"
    },
    {
        "$set": {
            "children.0.grandchildren.$.visible": false
        }
    }
)
However, if you don't know the array index positions beforehand, you should consider creating the $set conditions dynamically by using MapReduce. The basic idea with MapReduce is that it uses JavaScript as its query language, but this tends to be considerably slower than the aggregation framework and is not recommended for real-time data analysis.
In your MapReduce operation, you need to define two steps: the mapping step (which maps an operation onto every document in the collection; the operation can either do nothing or emit some object with keys and projected values) and the reducing step (which takes the list of emitted values and reduces it to a single element).
For the map step, you ideally would want to get for every document in the collection, the index for each children array field and another key that contains the $set keys.
Your reduce step would be a function (which does nothing) simply defined as var reduce = function() {};
The final step in your MapReduce operation will then create a separate collection that contains the emitted array objects along with a field holding the $set conditions. This collection can be refreshed periodically by re-running the MapReduce operation on the original collection.
Altogether, this MapReduce method would look like:
var map = function () {
    for (var i = 0; i < this.children.length; i++) {
        emit(
            { "_id": this._id, "index": i },
            {
                "index": i,
                "children": this.children[i],
                "update": {
                    "ref": "children." + i.toString() + ".grandchildren.$.ref",
                    "visible": "children." + i.toString() + ".grandchildren.$.visible"
                }
            }
        );
    }
};
var reduce = function () {};
db.collection.mapReduce(
    map,
    reduce,
    { "out": { "replace": "update_collection" } }
);
You can then use the cursor from the db.update_collection.find() method to iterate over and update your collection accordingly:
var cur = db.update_collection.find(
    {
        "value.children.ref": "1.1",
        "value.children.grandchildren.ref": "1.1.1"
    }
);
// Iterate through the results and update, using the update query object
// that was set dynamically via the array-index syntax.
while (cur.hasNext()) {
    var doc = cur.next();
    var update = { "$set": {} };
    // set the update query object
    update["$set"][doc.value.update.visible] = false;
    db.collection.update(
        {
            "children.ref": "1.1",
            "children.grandchildren.ref": "1.1.1"
        },
        update
    );
}
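(Side note: since this answer was written, MongoDB 3.6 resolved SERVER-831 by adding filtered positional operators; on 3.6 and newer the nested element can be updated atomically in one statement, sketched here:)
db.collection.update(
    { "_id": 1 },
    { "$set": { "children.$[c].grandchildren.$[g].visible": false } },
    { arrayFilters: [{ "c.ref": "1.1" }, { "g.ref": "1.1.1" }] }
)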

MongoDB Query Help - query on values of any key in a sub-object

I have a collection of documents like:
{
    "things": {
        "thing1": "red",
        "thing2": "blue",
        "thing3": "green"
    }
}
I want to perform a query on this collection to determine which documents have any keys in things that match a certain value. Is this possible?
If you don't know what the keys will be and you need it to be interactive, then you'll need to use the (notoriously performance-challenged) $where operator, like so (in the shell):
db.test.find({$where: function() {
    for (var field in this.things) {
        if (this.things[field] == "red") return true;
    }
    return false;
}})
If you have a large collection, this may be too slow for your purposes, but it's your only option if your set of keys is unknown.
MongoDB 3.6 Update
You can now do this without $where by using the $objectToArray aggregation operator:
db.test.aggregate([
    // Project things as a key/value array, along with the original doc
    { $project: {
        array: { $objectToArray: '$things' },
        doc: '$$ROOT'
    }},
    // Match the docs with a field value of 'red'
    { $match: { 'array.v': 'red' } },
    // Re-project the original doc
    { $replaceRoot: { newRoot: '$doc' } }
])
I'd suggest a schema change so that you can actually do reasonable queries in MongoDB.
From:
{
    "userId": "12347",
    "settings": {
        "SettingA": "blue",
        "SettingB": "blue",
        "SettingC": "green"
    }
}
to:
{
    "userId": "12347",
    "settings": [
        { name: "SettingA", value: "blue" },
        { name: "SettingB", value: "blue" },
        { name: "SettingC", value: "green" }
    ]
}
Then, you could index on "settings.value", and do a query like:
db.settings.ensureIndex({ "settings.value" : 1})
db.settings.find({ "settings.value" : "blue" })
The change really is a simple one: it moves the setting name and setting value into fully indexable fields, and stores the list of settings as an array.
If you can't change the schema, you could try @JohnnyHK's solution, but be warned that it's basically a worst case in terms of performance and it won't work effectively with indexes.
Sadly, none of the previous answers address the fact that Mongo documents can contain nested values in arrays or nested objects.
THIS IS THE CORRECT QUERY:
{$where: function() {
    var deepIterate = function (obj, value) {
        for (var field in obj) {
            if (obj[field] == value) {
                return true;
            }
            var found = false;
            if (typeof obj[field] === 'object') {
                found = deepIterate(obj[field], value);
                if (found) { return true; }
            }
        }
        return false;
    };
    return deepIterate(this, "573c79aef4ef4b9a9523028f");
}}
Since calling typeof on an array or a nested object returns 'object', the query recurses into every nested element until it finds a key holding the value. You can try the previous answers on a nested value and the results will be far from desired.
Stringifying the whole object and searching the string is a hit on performance, since it has to scan the entire representation trying to match it, and it creates a copy of the object as a string in RAM (both inefficient: the query uses more RAM, and it is slow because the function context already has the object loaded).
The query itself can work with an ObjectId, string, int, or any basic JavaScript type you wish.

Counting field names (not values) in mongo database

Is there a way to count field names in mongodb? I have a mongo database of documents with other embedded documents within them. Here is an example of what the data might look like.
{
    "incident": "osint181",
    "summary": "Something happened",
    "actor": {
        "internal": {
            "motive": ["Financial"],
            "notes": "",
            "role": ["Malicious"],
            "variety": ["Cashier"]
        }
    }
}
Another document might look like this:
{
    "incident": "osint182",
    "summary": "Something happened",
    "actor": {
        "external": {
            "motive": ["Financial"],
            "notes": "",
            "role": ["Malicious"],
            "variety": ["Hacker"]
        }
    }
}
As you can see, the actor has changed from internal to external in the second document. What I would like to be able to do is count the number of incidents for each type of actor. My first attempt looked like this:
db.public.aggregate( { $group : { _id : "$actor", count : { $sum : 1 }}} );
But that gave me the entire subdocument and the count reflected how many documents were exactly the same. Rather I was hoping to get a count for internal and a count for external, etc. Is there an elegant way to do that? If not elegant, can someone give me a dirty way of doing that?
The best option for this kind of problem is MongoDB's map-reduce. It allows you to iterate through all the keys of a document, and you can easily add your own logic on top. Check out the map-reduce examples here: http://docs.mongodb.org/manual/applications/map-reduce/
This was the answer I came up with based on the hint from Devesh. I created a map function that looks at the value of actor and checks whether it is an empty JSON object, using an isEmptyObject function that I defined. I then used mapReduce to go through the collection: if the actor field is empty I emit "Unknown", and otherwise, rather than returning the value under each key, I return the key itself, which will be named internal, or external, or whatever.
The magic here is the scope option to mapReduce, which puts my isEmptyObject in scope for the map function. The results are written to a collection which I named temp; after gathering the information I want from the temp collection, I drop it.
var isEmptyObject = function (obj) {
    for (var name in obj) {
        return false;
    }
    return true;
};
var mapFunction = function () {
    if (isEmptyObject(this.actor)) {
        emit("Unknown", 1);
    } else {
        for (var key in this.actor) { emit(key, 1); }
    }
};
var reduceFunction = function (inKeys, counter) {
    return Array.sum(counter);
};
db.public.mapReduce(mapFunction, reduceFunction,
    { out: "temp", scope: { isEmptyObject: isEmptyObject } });
foo = db.temp.aggregate({ $sort: { value: -1 } });
db.temp.drop();
printjson(foo)
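(A hedged aside: on MongoDB 3.4.4 and newer, the same counts can be had without map-reduce by turning the actor sub-document into an array with $objectToArray; a sketch, assuming the same public collection:)
db.public.aggregate([
    // expose the actor sub-document's keys (internal, external, ...) as an array
    { $project: { actorTypes: { $objectToArray: "$actor" } } },
    { $unwind: "$actorTypes" },
    // count incidents per actor type
    { $group: { _id: "$actorTypes.k", count: { $sum: 1 } } },
    { $sort: { count: -1 } }
])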

How to remove duplicates based on a key in Mongodb?

I have a collection in MongoDB with around 3 million records. A sample record looks like:
{
    "_id": ObjectId("50731xxxxxxxxxxxxxxxxxxxx"),
    "source_references": [
        {
            "_id": ObjectId("5045xxxxxxxxxxxxxx"),
            "name": "xxx",
            "key": 123
        }
    ]
}
I have a lot of duplicate records in the collection with the same source_references.key. (By duplicate I mean source_references.key, not the _id.)
I want to remove duplicate records based on source_references.key; I'm thinking of writing some PHP code to traverse each record and remove any duplicate that exists.
Is there a way to remove the duplicates in Mongo Internal command line?
This answer is obsolete: the dropDups option was removed in MongoDB 3.0, so a different approach is required in most cases. For example, you could use aggregation as suggested on: MongoDB duplicate documents even after adding unique key.
If you are certain that the source_references.key identifies duplicate records, you can ensure a unique index with the dropDups:true index creation option in MongoDB 2.6 or older:
db.things.ensureIndex({'source_references.key' : 1}, {unique : true, dropDups : true})
This will keep the first unique document for each source_references.key value, and drop any subsequent documents that would otherwise cause a duplicate key violation.
Important Note: Any documents missing the source_references.key field will be considered as having a null value, so subsequent documents missing the key field will be deleted. You can add the sparse:true index creation option so the index only applies to documents with a source_references.key field.
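For example, a sketch with sparse added (again, this only applies to MongoDB 2.6 or older):
db.things.ensureIndex(
    { 'source_references.key': 1 },
    { unique: true, dropDups: true, sparse: true }
)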
Obvious caution: Take a backup of your database, and try this in a staging environment first if you are concerned about unintended data loss.
This is the easiest query I used on my MongoDB 3.2
db.myCollection.find({}, { myCustomKey: 1 }).sort({ _id: 1 }).forEach(function (doc) {
    db.myCollection.remove({ _id: { $gt: doc._id }, myCustomKey: doc.myCustomKey });
})
Index your custom key before running this to increase speed.
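For example, a one-liner sketch using the field name from the snippet above:
db.myCollection.createIndex({ myCustomKey: 1 })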
While @Stennie's is a valid answer, it is not the only way. In fact the MongoDB manual asks you to be very cautious while doing that. There are two other options:
Let MongoDB do it for you using map-reduce.
Do it programmatically, which is less efficient.
Here is a slightly more 'manual' way of doing it:
Essentially, first get a list of all the unique keys you are interested in.
Then perform a search using each of those keys, and delete if that search returns more than one document.
db.collection.distinct("key").forEach((num)=>{
var i = 0;
db.collection.find({key: num}).forEach((doc)=>{
if (i) db.collection.remove({key: num}, { justOne: true })
i++
})
});
I had a similar requirement but I wanted to retain the latest entry. The following query worked with my collection which had millions of records and duplicates.
/** Create an array to store the ids of all duplicate records */
var duplicates = [];
/** Start the aggregation pipeline */
db.collection.aggregate([
    {
        $match: { /** Add any filter here. Add an index for the filter keys */
            filterKey: { $exists: false }
        }
    },
    {
        $sort: { /** Sort it in such a way that the element you want to retain comes first */
            createdAt: -1
        }
    },
    {
        $group: {
            _id: { /** These are the keys which define a duplicate; here, documents with the same values for key1 and key2 are considered duplicates */
                key1: "$key1", key2: "$key2"
            },
            dups: {
                $push: { _id: "$_id" }
            },
            count: { $sum: 1 }
        }
    },
    {
        $match: {
            count: { "$gt": 1 }
        }
    }
],
{ allowDiskUse: true }
).forEach(function (doc) {
    doc.dups.shift(); /** keep the first (latest) document */
    doc.dups.forEach(function (dupId) {
        duplicates.push(dupId._id);
    });
});
/** Delete the duplicates in chunks */
var i, j, temparray, chunk = 100000;
for (i = 0, j = duplicates.length; i < j; i += chunk) {
    temparray = duplicates.slice(i, i + chunk);
    db.collection.bulkWrite([{ deleteMany: { "filter": { "_id": { "$in": temparray } } } }]);
}
Expanding on Fernando's answer, I found that it was taking too long, so I modified it.
var x = 0;
db.collection.distinct("field").forEach(fieldValue => {
var i = 0;
db.collection.find({ "field": fieldValue }).forEach(doc => {
if (i) {
db.collection.remove({ _id: doc._id });
}
i++;
x += 1;
if (x % 100 === 0) {
print(x); // Every time we process 100 docs.
}
});
});
The improvement is basically using the document _id for removal, which should be faster, and adding progress output for the operation; you can change the iteration value to whatever amount you like.
Also, indexing the field before the operation helps.
pip install mongo_remove_duplicate_indexes
Create a script in any language and iterate over your collection.
Create a new collection, and create a unique index on it. Remember, this index has to be on the same field from which you wish to remove duplicates in your original collection, with the same name.
For example: you have a collection gaming, and in this collection you have a field genre which contains duplicates that you wish to remove. So just create a new collection:
db.createCollection("cname")
Create the unique index:
db.cname.createIndex({ 'genre': 1 }, { unique: true })
Now when you insert a document with a similar genre, only the first will be accepted; the others will be rejected with a duplicate key error.
Now just insert the JSON-format values you received into the new collection, and handle the error with exception handling (for example pymongo.errors.DuplicateKeyError).
Check out the package source code for mongo_remove_duplicate_indexes for a better understanding.
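(The same technique sketched in the mongo shell, reusing the gaming/genre names from the example above; in the legacy shell a duplicate-key error on insert() is reported in the WriteResult rather than thrown, so the loop simply carries on past rejected documents:)
db.createCollection("gaming_dedup");
db.gaming_dedup.createIndex({ genre: 1 }, { unique: true });
db.gaming.find().forEach(function (doc) {
    // only the first document per genre is accepted; later ones fail the unique index and are skipped
    db.gaming_dedup.insert(doc);
});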
If you have enough memory, you can do something like this in Scala (a sketch against a Casbah-style driver, assuming the whole collection fits in memory):
cole.find().toList
    .groupBy(_.customField)
    .filter(_._2.size > 1)
    .flatMap(_._2.tail)
    .map(_.id)
    .foreach(x => cole.remove("id" $eq x))