Coming from MySQL, I am wondering: how can we find duplicates across multiple collections in MongoDB?
Let's say I have two (or more) collections:
human:
_id
firstname
cat:
_id
nickname
What would be an efficient solution to list duplicated names? This includes a name used by 2+ humans only, by 2+ cats only, or by at least one human and one cat. The result should therefore contain duplicates within each collection AND duplicates across those collections (cats and humans with the same name).
Expected result:
The list of the duplicated values; the number of occurrences could be interesting but is not essential.
The question is not about whether the proposed db schema is appropriate in this situation, but about the best MongoDB solution.
Edit
My description of a duplicate was not quite what I intended: even if a name does not exist in one collection but is duplicated within another collection, it still counts as a duplicate.
For two collections with MongoDB 3.2 you can use the $lookup aggregation stage (it's the equivalent of the left outer join you would have used in MySQL):
db.human.aggregate([
    {$group: {_id: "$firstname"}},
    {$lookup: {
        from: "cat",
        localField: "_id",
        foreignField: "nickname",
        as: "cats"
    }},
    {$match: {cats: {$ne: []}}},
    {$project: {catsCount: {$size: "$cats"}}}
])
Stages:
Group humans by name, as there could be several humans with the same name
Attach to each group the array of cats whose nickname matches the group _id (i.e. the human firstname)
Filter out those names which don't have any matches in the cat collection
Project the result to get only the names and the count of matches
The result will look like:
[
{ _id: "Bob", catsCount: 2 },
{ _id: "Alex", catsCount: 1 }
]
NOTE: If you need to join several collections, you can apply the $lookup stage several times.
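For example, a minimal sketch, assuming a hypothetical third collection dog that also stores its name in a nickname field:
db.human.aggregate([
    {$group: {_id: "$firstname"}},
    {$lookup: {from: "cat", localField: "_id", foreignField: "nickname", as: "cats"}},
    {$lookup: {from: "dog", localField: "_id", foreignField: "nickname", as: "dogs"}},
    // keep only names that matched in at least one of the other collections
    {$match: {$or: [{cats: {$ne: []}}, {dogs: {$ne: []}}]}}
])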
A solution I found:
Combine the data into one collection
Find the duplicates in that collection
And export them to a result collection
Code:
mapHuman = function() {
    var values = {
        name: this.firstname
    };
    emit(this._id, values);
};
mapCat = function() {
    var values = {
        name: this.nickname  // the cat collection stores the name in "nickname"
    };
    emit(this._id, values);
};
// reduce is only called when several values share the same key; since the
// keys are unique _ids, it simply passes the single value through
reduce = function(k, values) {
    return values[0];
};
db.human.mapReduce(mapHuman, reduce, {"out": {"reduce": "name"}});
db.cat.mapReduce(mapCat, reduce, {"out": {"reduce": "name"}});
db.name.aggregate([
    {"$group": { "_id": "$value.name", "count": { "$sum": 1 } } },
    {"$match": {"_id": { "$ne": null }, "count": {"$gt": 1} } },
    {"$project": {"name": "$_id", "_id": 0}},
    {"$out": "duplicate"}
])
db.duplicate.find()
Related
I have two models. One is the user:
userSchema = new Schema({
userID: String,
age: Number
});
and the other is the score, recorded several times every day for all users:
ScoreSchema = new Schema({
    userID: {type: String, ref: 'User'},
    score: Number,
    created_date: Date,
    ....
})
I would like to do some queries/calculations on the scores of users meeting a specific requirement; say, I would like to calculate the day-by-day average score of all users older than 20.
My thought is to first populate the scores with the users' ages and then do the aggregation after that.
Something like
Score.
populate('userID','age').
aggregate([
{$match: {'userID.age': {$gt: 20}}},
{$group: ...},
{$group: ...}
], function(err, data){});
Is it OK to use populate before aggregate? Or should I first find all the userIDs meeting the requirement, save them in an array, and then use $in to match the score documents?
No, you cannot call .populate() before .aggregate(), and there is a very good reason why you cannot. But there are different approaches you can take.
The .populate() method works "client side" where the underlying code actually performs additional queries (or more accurately an $in query) to "lookup" the specified element(s) from the referenced collection.
In contrast .aggregate() is a "server side" operation, so you basically cannot manipulate content "client side", and then have that data available to the aggregation pipeline stages later. It all needs to be present in the collection you are operating on.
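To illustrate, here is a rough sketch of what .populate() effectively does behind the scenes (collection and field names follow the schemas above), which is why the joined data never exists inside the server-side pipeline:
// 1. the original query returns the raw score documents
var scores = db.scores.find().toArray();
// 2. a second query is issued, using $in on the collected references
var ids = scores.map(function(s) { return s.userID });
var users = db.users.find({ "userID": { "$in": ids } }).toArray();
// 3. the matched user documents are then stitched in client side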
A better approach here is available with MongoDB 3.2 and later, via the $lookup aggregation pipeline operation. Also probably best to handle from the User collection in this case in order to narrow down the selection:
User.aggregate(
[
// Filter first
{ "$match": {
"age": { "$gt": 20 }
}},
// Then join
{ "$lookup": {
"from": "scores",
"localField": "userID",
"foriegnField": "userID",
"as": "score"
}},
// More stages
],
function(err,results) {
}
)
This is basically going to include a new field "score" within the User object as an "array" of items that matched on "lookup" to the other collection:
{
"userID": "abc",
"age": 21,
"score": [{
"userID": "abc",
"score": 42,
// other fields
}]
}
The result is always an array, as the general expected usage is a "left join" of a possible "one to many" relationship. If no result is matched then it is just an empty array.
To use the content, just work with an array in any way. For instance, you can use the $arrayElemAt operator in order to just get the single first element of the array in any future operations. And then you can just use the content like any normal embedded field:
{ "$project": {
"userID": 1,
"age": 1,
"score": { "$arrayElemAt": [ "$score", 0 ] }
}}
If you don't have MongoDB 3.2 available, then your other option to process a query limited by the relations of another collection is to first get the results from that collection and then use $in to filter on the second:
// Match the user collection
User.find({ "age": { "$gt": 20 } },function(err,users) {
// Get id list
var userList = users.map(function(user) {
return user.userID;
});
Score.aggregate(
[
// use the id list to select items
{ "$match": {
"userId": { "$in": userList }
}},
// more stages
],
function(err,results) {
}
);
});
So getting the list of valid users from one collection back to the client, and then feeding it to the other collection in a query, is the only way to get this to happen in earlier releases.
I have a collection users as follows:
{ "_id" : ObjectId("51780f796ec4051a536015cf"), "userId" : "John" }
{ "_id" : ObjectId("51780f796ec4051a536015d0"), "userId" : "Sam" }
{ "_id" : ObjectId("51780f796ec4051a536015d1"), "userId" : "John1" }
{ "_id" : ObjectId("51780f796ec4051a536015d2"), "userId" : "john2" }
Now I am trying to write code which can provide suggestions for a userId in case the id provided by the user already exists in the DB. In the same routine I just append the values 1 to 5 to the chosen id; for example, in case the user has selected the userId John, the array of suggested user names that needs to be checked against the database will look like this:
["John", "John1", "John2", "John3", "John4", "John5"]
Now I just want to execute this against the DB and find out which of the suggested values do not exist. So instead of selecting any documents, I want to select the values from the suggested array that do not exist in the users collection.
Any pointers are highly appreciated.
Your general approach here is that you want to find the "distinct" values of "userId" that already exist in your collection. You then compare these to your input selection to see the difference.
var test = ["John","John1","John2","John3","John4","John5"];
var res = db.collection.aggregate([
    { "$match": { "userId": { "$in": test } }},
    { "$group": { "_id": "$userId" }}
]).toArray();
// reduce the result documents to a plain array of the matched values
res = res.map(function(x){ return x._id });
// keep only the candidates that were not found in the collection
test.filter(function(x){ return res.indexOf(x) == -1 })
The end result of this is the userId's that do not match in your initial input:
[ "John2", "John3", "John4", "John5" ]
The main operator there is $in, which takes an array as its argument and compares those values against the specified field. The .aggregate() method is one approach; there is also the shorthand .distinct() method, which directly produces just the values in an array, removing the need for a call like .map() to strip out the values for a given key:
var res = db.collection.distinct("userId",{ "userId": { "$in": test } })
It should run notably slower though, especially on large collections.
Also, do not forget to index your "userId". Likely you want it to be "unique" anyway, so you really just want the $match or .find() result:
var res = db.collection.find({ "userId": { "$in": test } }).toArray()
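For completeness, a minimal sketch of creating that unique index from the shell (assuming no duplicate userId values already exist in the collection):
// enforce uniqueness and speed up the $in lookups
db.collection.createIndex({ "userId": 1 }, { unique: true })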
I need to query data from collection a first, then, according to that data, query from collection b. Such as:
For each id queried from a
query data from b where "_id" == id
In SQL, this can be done by joining tables a & b in a single select. But in MongoDB it needs multiple queries, which seems inefficient, doesn't it? Or can it be done with just 2 queries (one for a, another for b, rather than 1 plus n)? I know NoSQL doesn't support joins, but is there a way to batch the queries in the for loop into a single query?
You'll need to do it as two steps.
Look into the $in operator, which allows passing an array of _ids. Many would suggest you do those in batches of, say, 1000 _ids.
db.myCollection.find({ _id : { $in : [ 1, 2, 3, 4] }})
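A rough sketch of that batching idea (the collection name and id list are illustrative):
var ids = [1, 2, 3, 4 /* ... potentially many thousands */];
var BATCH = 1000;
for (var i = 0; i < ids.length; i += BATCH) {
    // one $in query per chunk of at most 1000 ids
    var batch = ids.slice(i, i + BATCH);
    db.myCollection.find({ _id: { $in: batch } }).forEach(printjson);
}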
It's very simple: don't make a separate DB call for each id, as that is very inefficient. It is possible to execute a single query which returns all the documents relevant to each of the ids in a single pass, using the $in operator in MongoDB, which is analogous to the IN syntax in SQL. For example, if you need to find the documents for 5 ids in a single pass:
const ids = ['id1', 'id2', 'id3', 'id4', 'id5'];
const results = db.collectionName.find({ _id : { $in : ids }})
This will get you all the relevant documents in a single pass.
This is basically a join in SQL parlance, and you can do it in Mongo using an aggregation stage called $lookup.
For example if you had some types in collections like these:
interface IFoo {
_id: ObjectId
name: string
}
interface IBar {
_id: ObjectId
fooId: ObjectId
title: string
}
Then query with an aggregate like this:
await db.foos.aggregate([
{
$match: { _id: { $in: ids } } // query criteria here
},
{
$lookup: {
from: 'bars',
localField: '_id',
foreignField: 'fooId',
as: 'bars'
}
}
])
May produce resulting objects like this:
{
"_id": "foo0",
"name": "example foo",
"bars": [
{ _id: "bar0", "fooId": "foo0", title: "crow bar" }
]
}
In the following example, "Algorithms in C++" is present twice.
The $unset modifier can remove a particular field, but how do you remove an entry from an array field?
{
"_id" : ObjectId("4f6cd3c47156522f4f45b26f"),
"favorites" : {
"books" : [
"Algorithms in C++",
"The Art of Computer Programming",
"Graph Theory",
"Algorithms in C++"
]
},
"name" : "robert"
}
As of MongoDB 2.2 you can use the aggregation framework with an $unwind, $group and $project stage to achieve this:
db.users.aggregate([
    {$unwind: '$favorites.books'},
    {$group: {
        _id: '$_id',
        books: {$addToSet: '$favorites.books'},
        name: {$first: '$name'}
    }},
    {$project: {'favorites.books': '$books', name: '$name'}}
])
Note the need for the $project to rename the favorites field, since $group aggregate fields cannot be nested.
The easiest solution is to use $setUnion (the operator is available from Mongo 2.6; the $addFields stage used here requires Mongo 3.4+):
db.users.aggregate([
{'$addFields': {'favorites.books': {'$setUnion': ['$favorites.books', []]}}}
])
Another (more lengthy) version that is based on the idea from #kynan's answer, but preserves all the other fields without explicitly specifying them (Mongo 3.4+):
db.users.aggregate([
{'$unwind': {
'path': '$favorites.books',
// output the document even if its list of books is empty
'preserveNullAndEmptyArrays': true
}},
{'$group': {
'_id': '$_id',
'books': {'$addToSet': '$favorites.books'},
// arbitrary name that doesn't exist on any document
'_other_fields': {'$first': '$$ROOT'},
}},
{
// the field, in the resulting document, has the value from the last document merged for the field. (c) docs
// so the new deduped array value will be used
'$replaceRoot': {'newRoot': {'$mergeObjects': ['$_other_fields', "$$ROOT"]}}
},
// this stage wouldn't be necessary if the field wasn't nested
{'$addFields': {'favorites.books': '$books'}},
{'$project': {'_other_fields': 0, 'books': 0}}
])
{ "_id" : ObjectId("4f6cd3c47156522f4f45b26f"), "name" : "robert", "favorites" :
{ "books" : [ "The Art of Computer Programmning", "Graph Theory", "Algorithms in C++" ] } }
What you have to do is use map reduce to detect and count duplicate tags... then use $set to replace the entire books array for the document matched by { "_id" : ObjectId("4f6cd3c47156522f4f45b26f") }.
This has been discussed several times here; please see:
Removing duplicate records using MapReduce
Fast way to find duplicates on indexed column in mongodb
http://csanz.posterous.com/look-for-duplicates-using-mongodb-mapreduce
http://www.mongodb.org/display/DOCS/MapReduce
How to remove duplicate record in MongoDB by MapReduce?
// return a copy of arr with duplicate values removed, preserving order
function unique(arr) {
    var hash = {}, result = [];
    for (var i = 0, l = arr.length; i < l; ++i) {
        if (!hash.hasOwnProperty(arr[i])) {
            hash[arr[i]] = true;
            result.push(arr[i]);
        }
    }
    return result;
}
// rewrite each document's books array with the deduplicated version
db.collection.find({}).forEach(function (doc) {
    db.collection.update({ _id: doc._id }, { $set: { "favorites.books": unique(doc.favorites.books) } });
})
Starting in Mongo 4.4, the $function aggregation operator allows applying a custom JavaScript function to implement behaviour not supported by the MongoDB Query Language.
For instance, in order to remove duplicates from an array:
// {
// "favorites" : { "books" : [
// "Algorithms in C++",
// "The Art of Computer Programming",
// "Graph Theory",
// "Algorithms in C++"
// ]},
// "name" : "robert"
// }
db.collection.aggregate(
{ $set:
{ "favorites.books":
{ $function: {
body: function(books) { return books.filter((v, i, a) => a.indexOf(v) === i) },
args: ["$favorites.books"],
lang: "js"
}}
}
}
)
// {
// "favorites" : { "books" : [
// "Algorithms in C++",
// "The Art of Computer Programming",
// "Graph Theory"
// ]},
// "name" : "robert"
// }
This has the advantages of:
keeping the original order of the array (if that's not a requirement, then prefer #Dennis Golomazov's $setUnion answer)
being more efficient than a combination of expensive $unwind and $group stages.
$function takes 3 parameters:
body, which is the function to apply, whose parameter is the array to modify.
args, which contains the fields from the record that the body function takes as parameters. In our case, "$favorites.books".
lang, which is the language in which the body function is written. Only js is currently available.
Given the following MongoDB collection of documents:
{
    title : 'shirt one',
    tags : [
        'shirt',
        'cotton',
        't-shirt',
        'black'
    ]
},
{
    title : 'shirt two',
    tags : [
        'shirt',
        'white',
        'button down collar'
    ]
},
{
    title : 'shirt three',
    tags : [
        'shirt',
        'cotton',
        'red'
    ]
},
...
How do you retrieve a list of items matching a list of tags, ordered by the total number of matched tags? For example, given this list of tags as input:
['shirt', 'cotton', 'black']
I'd want to retrieve the items ranked in desc order by total number of matching tags:
item         total matches
-----------  ---------------------------------
Shirt One    3 (matched shirt + cotton + black)
Shirt Three  2 (matched shirt + cotton)
Shirt Two    1 (matched shirt)
In a relational schema, tags would be a separate table, and you could join against that table, count the matches, and order by the count.
But, in Mongo... ?
It seems this approach could work:
break the input tags into multiple "IN" statements
query for items by "OR"'ing together the tag inputs
i.e. where ( 'shirt' IN items.tags ) OR ( 'cotton' IN items.tags )
this would return, for example, three instances of "Shirt One", two instances of "Shirt Three", etc.
map/reduce that output
map: emit(this._id, {...});
reduce: count total occurrences of _id
finalize: sort by counted total
But I'm not clear on how to implement this as a Mongo query, or if this is even the most efficient approach.
As I answered in "In MongoDB search in an array and sort by number of matches", it's possible using the Aggregation Framework.
Assumptions
tags attribute is a set (no repeated elements)
Query
This approach forces you to unwind the results and re-evaluate the match predicate on the unwound results, so it's fairly inefficient.
db.test_col.aggregate([
    {$match: {tags: {$in: ["shirt","cotton","black"]}}},
    {$unwind: "$tags"},
    {$match: {tags: {$in: ["shirt","cotton","black"]}}},
    {$group: {
        _id: {"_id": "$_id"},
        matches: {$sum: 1}
    }},
    {$sort: {matches: -1}}
]);
Expected Results
{
"result" : [
{
"_id" : {
"_id" : ObjectId("5051f1786a64bd2c54918b26")
},
"matches" : 3
},
{
"_id" : {
"_id" : ObjectId("5051f1726a64bd2c54918b24")
},
"matches" : 2
},
{
"_id" : {
"_id" : ObjectId("5051f1756a64bd2c54918b25")
},
"matches" : 1
}
],
"ok" : 1
}
Right now, it isn't possible to do this unless you use MapReduce. The only problem with MapReduce is that it is slow (compared to a normal query).
The aggregation framework is slated for 2.2 (so should be available in 2.1 dev release) and should make this sort of thing much easier to do without MapReduce.
Personally, I do not think using M/R is an efficient way to do it. I would rather query for all the matching documents and do the calculations on the application side, as sketched below. It is easier and cheaper to scale your app servers than it is to scale your database servers, so let the app servers do the number crunching. Of course, this approach may not work for you given your data access patterns and requirements.
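A minimal sketch of that application-side approach, reusing the shirts collection from the question (the ranking logic itself is illustrative):
var terms = ['shirt', 'cotton', 'black'];
// fetch every document containing at least one search term,
// then count and rank the matches client side
var ranked = db.shirts.find({ tags: { $in: terms } }).toArray()
    .map(function(doc) {
        var matches = doc.tags.filter(function(t) {
            return terms.indexOf(t) !== -1;
        }).length;
        return { title: doc.title, matches: matches };
    })
    .sort(function(a, b) { return b.matches - a.matches; });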
An even simpler approach may be to just include a count property in each of your tag objects, and whenever you $push a new tag to the array, you also $inc the count property. This is a common pattern in the MongoDB world, at least until the aggregation framework arrives.
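A rough sketch of that pattern, assuming the tags were restructured as subdocuments with a count field (the items collection and itemId are hypothetical):
// bump the counter if the tag already exists on the item
db.items.update(
    { _id: itemId, "tags.name": "cotton" },   // itemId is a placeholder for the document's _id
    { $inc: { "tags.$.count": 1 } }
)
// otherwise push the tag with an initial count of 1
db.items.update(
    { _id: itemId, "tags.name": { $ne: "cotton" } },
    { $push: { tags: { name: "cotton", count: 1 } } }
)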
I'll second @Bryan in saying that MapReduce is the only possible way at the moment (and it's far from perfect). But in case you desperately need it, here you go :-)
var m = function() {
var searchTerms = ['shirt', 'cotton', 'black'];
var me = this;
this.tags.forEach(function(t) {
searchTerms.forEach(function(st) {
if(t == st) {
emit(me._id, {matches : 1});
}
})
})
};
var r = function(k, vals) {
var result = {matches : 0};
vals.forEach(function(v) {
result.matches += v.matches;
})
return result;
};
db.shirts.mapReduce(m, r, {out: 'found01'});
db.found01.find();