Fast way to find duplicates on indexed column in MongoDB

I have a collection of MD5 hashes in MongoDB and I'd like to find all duplicates. The md5 column is indexed. Do you know a fast way to do that using map-reduce?
Or should I just iterate over all records and check for duplicates manually?
My current approach using map-reduce iterates over the collection almost twice (assuming there are very few duplicates):
res = db.files.mapReduce(
    function () {
        emit(this.md5, 1);
    },
    function (key, vals) {
        return Array.sum(vals);
    }
)
// A value of 1 means the md5 is unique, so duplicates are those with value > 1.
db[res.result].find({value: {$gt: 1}}).forEach(
    function (obj) {
        out.duplicates.insert(obj)
    });

I personally found that on big databases (1 TB and more) the accepted answer is terribly slow. Aggregation is much faster. An example is below:
db.places.aggregate([
    { $group : { _id : "$extra_info.id", total : { $sum : 1 } } },
    { $match : { total : { $gte : 2 } } },
    { $sort : { total : -1 } },
    { $limit : 5 }
]);
It finds the values of extra_info.id that occur two or more times, sorts them by count in descending order, and returns the first 5.
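The same stages can be wrapped in a small helper so that the grouped field and the number of rows returned become parameters. This is only a sketch: `duplicatePipeline` is a hypothetical name, not part of MongoDB, and it merely builds the array you would pass to `aggregate()`.

```javascript
// Hypothetical helper (not a MongoDB API): builds the duplicate-finding
// stages for an arbitrary field path and result limit.
function duplicatePipeline(fieldPath, limit) {
  return [
    // Count how many documents share each value of the field.
    { $group: { _id: "$" + fieldPath, total: { $sum: 1 } } },
    // Keep only values that occur at least twice.
    { $match: { total: { $gte: 2 } } },
    // Most frequent values first.
    { $sort: { total: -1 } },
    { $limit: limit }
  ];
}

// In the shell, for the md5 question: db.files.aggregate(duplicatePipeline("md5", 5))
```

Dropping the `$limit` stage would return every duplicated value rather than only the most frequent ones.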

The easiest way to do it in one pass is to sort by md5 and then process appropriately.
Something like:
var previous_md5;
db.files.find( {"md5" : {$exists:true} }, {"md5" : 1} ).sort( { "md5" : 1} ).forEach( function(current) {
    // After sorting, equal md5 values are adjacent, so comparing with the previous one is enough.
    if (current.md5 == previous_md5) {
        db.duplicates.update( {"_id" : current.md5}, { "$inc" : {count:1} }, true); // true = upsert
    }
    previous_md5 = current.md5;
});
That little script sorts the md5 entries and loops through them in order. If an md5 is repeated, the repeats will be "back-to-back" after sorting. So we just keep a pointer to previous_md5 and compare it to current.md5. When we find a duplicate, we drop it into the duplicates collection (using $inc to count the number of duplicates).
This script means that you only have to loop through the primary data set once. Then you can loop through the duplicates collection and perform clean-up.
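The comparison logic can be tried without a database. The following is a minimal in-memory sketch of the same sorted scan; `countRepeats` and the sample values are made up for illustration, and a `Map` stands in for the duplicates collection.

```javascript
// In-memory sketch of the sorted scan; `countRepeats` is a hypothetical name.
// The input must already be sorted so that equal values are adjacent.
function countRepeats(sortedValues) {
  const repeats = new Map();
  let previous; // undefined before the first element
  for (const current of sortedValues) {
    if (current === previous) {
      // Mirrors the $inc upsert: one increment per repeated occurrence.
      repeats.set(current, (repeats.get(current) || 0) + 1);
    }
    previous = current;
  }
  return repeats;
}

const repeats = countRepeats(["a1", "b2", "b2", "b2", "c3", "d4", "d4"]);
// repeats holds b2 -> 2 and d4 -> 1
```

Note that the stored count is the number of occurrences minus one, because the first occurrence of each value is never counted.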

You can do a group by that field and then query to get the duplicates (those having a count > 1). http://www.mongodb.org/display/DOCS/Aggregation#Aggregation-Group
However, the fastest thing might be to run a query which returns only that field and then do the aggregation in the client. Group/Map-Reduce need to provide access to the whole document, which is much more costly than just providing the data from the index (covered queries are supported in 1.7.3+).
If this is a general problem you need to run periodically, you might want to keep a collection which is just {md5:value, count:value} so you can skip the aggregation, and it will be extremely fast when you need to cull duplicates.

Related

Sort before querying

Is it possible to run a sort on a Mongo collection before running the filtering query? I have older code that gets a random result from the database by storing a field holding a random float between 0 and 1, then querying with findOne to get the first document with a value greater than a random float generated at query time. The sample set was small, so I didn't notice a problem at the time, but I recently noticed that one query almost always returned the same document. The "first" document had a random > .9, so nearly every query matched it first.
I realized that for this solution to work, I need to sort by random and then find the first value greater than my random. As I understand it, this solution is less necessary than it used to be, since $sample exists as of 3.2, but I figure it would be good to learn how to do this anyway. Plus, my understanding is that $sample can return the same document multiple times (when N > 1, obviously, so not directly applicable to my question).
So for example, the following data:
> db.links.find()
{ "_id" : ObjectId("553c072bc87652a80e00002a"), "random" : 0.9162904409691691 }
{ "_id" : ObjectId("553c3332c87652c80700002a"), "random" : 0.00427396921440959 }
{ "_id" : ObjectId("553c3c5cc87652a80e00002b"), "random" : 0.2409569111187011 }
{ "_id" : ObjectId("553c3c66c876521c10000029"), "random" : 0.35101076657883823 }
{ "_id" : ObjectId("553c3c6ec87652200700002e"), "random" : 0.3234482416883111 }
{ "_id" : ObjectId("553c68d5c87652a80e00002c"), "random" : 0.5221220930106938 }
Any attempt to run db.links.findOne({'random': {'$gte': x}}) where x is any value up to .91 always returns the first object (_id 553c072). Anything greater returns nothing. If I could sort by the random value in ascending order and then filter, it would keep searching until it found the correct value.
I would strongly recommend dropping your custom solution and simply switching to the built-in MongoDB $sample stage, which returns a random result from your collection.
EDIT based on your comment:
Here's how you can do what you originally asked for:
db.links.find({ "random": { $gte: /* put your value here */ } })
.sort({ "random": 1 /* sort by "random" field in ascending order */ })
.limit(1)
You can also use the aggregation framework, though you don't need to:
db.links.aggregate([{
    $match: {
        "random": {
            $gte: /* put your value here */ // filter the collection
        }
    }
}, {
    $sort: {
        "random": 1 // sort by "random" field in ascending order
    }
}, {
    $limit: 1 // return only the first element
}])
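To see why the sort matters, here is a small in-memory sketch using the sample documents from the question (ids and values shortened). `pickAtOrAbove` is a hypothetical stand-in for the filter, sort and limit combination, not a MongoDB function.

```javascript
// Documents in the order the question lists them (ids and values shortened).
const links = [
  { id: "553c072b", random: 0.9162 },
  { id: "553c3332", random: 0.0042 },
  { id: "553c3c5c", random: 0.2409 },
  { id: "553c3c66", random: 0.3510 },
  { id: "553c3c6e", random: 0.3234 },
  { id: "553c68d5", random: 0.5221 }
];

// Hypothetical stand-in for find({random: {$gte: x}}).sort({random: 1}).limit(1).
function pickAtOrAbove(docs, x) {
  const matches = docs.filter((d) => d.random >= x); // filter() returns a copy
  matches.sort((a, b) => a.random - b.random);       // ascending by "random"
  return matches.length > 0 ? matches[0] : null;
}

const picked = pickAtOrAbove(links, 0.3);
// picked.id is "553c3c6e" (random 0.3234, the smallest value >= 0.3)
```

Without the sort step, the first match in list order would be 553c072b for any x up to 0.9162, which is the behaviour described in the question.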

MongoDB: phantom records in $unwind results under heavy load

I have a simple collection of elements like this
{_id: n, xs: [...]}
I'm trying to count the total number of elements in all arrays:
db.testRace.aggregate([{ $unwind : "$xs" }, { $group : { _id : null, count : { $sum : 1 } } }])
It works great unless I start doing massive updates to this collection. Under a heavy load of update operations I get a wrong total, slightly bigger than it should be.
It can be easily reproduced.
First, generate some test data:
for (var i = 1; i <= 1000000; i++) {
    db.testRace.insert({_id: i, xs: [i]});
}
Then simulate a lot of updates:
while (true) {
    var id = Math.floor((Math.random() * 1000000) + 1);
    var obj = db.testRace.find({_id: id}).next();
    obj.some = "change";
    db.testRace.update({_id: id}, obj);
}
While that is running, run the aggregate $unwind query.
Without load I get the right result: 1000000. But when there are a lot of updates I get bigger numbers, like 1001456.
And if I run query like this
db.testRace.aggregate([{ $unwind : "$xs" }, {$group: {_id:"$xs", count:{$sum: 1}}}, { $sort : { count : -1 } }, { $limit : 2 }]);
I get
"result" : [
{
"_id" : 996972,
"count" : 2
},
{
"_id" : 997789,
"count" : 2
}
],
So it seems aggregate counts some records twice.
Is this expected behaviour, or am I doing the aggregation wrong?
I tested on a local MongoDB instance, version 2.4.9.
It's expected behavior due to the way MongoDB handles read isolation. When you have a long-running query (and an aggregation that reads every single document is a long-running query) and the data is updated during the query, the updates may affect whether or not the updated data is returned: depending on what happens when, you could miss a document, receive it once, or receive it twice.
From the source code:
Any data inserted, deleted, or modified during a yield that should be
returned by a query may or may not be returned by that query. The
query could return: nothing; the data before; the data after; or both
the data before and the data after.
In short, there is no isolation between a query and an
insert/delete/update. AKA, READ_UNCOMMITTED.
https://github.com/mongodb/mongo/blob/master/src/mongo/db/exec/plan_stage.h
Your aggregation query is yielding mid query, during which some of the data is updated. This impacts the results of the query.

Can MongoDB aggregate "top x" results in this document schema?

{
"_id" : "user1_20130822",
"metadata" : {
"date" : ISODate("2013-08-22T00:00:00.000Z"),
"username" : "user1"
},
"tags" : {
"abc" : 19,
"123" : 2,
"bca" : 64,
"xyz" : 14,
"zyx" : 12,
"321" : 7
}
}
Given the schema example above, is there a way to query this to retrieve the top "x" tags: E.g., Top 3 "tags" sorted descending?
Is this possible in a single document? e.g., top tags for a user on a given day
What if I have multiple documents that need to be combined together before getting the top? e.g., top tags for a user in a given month
I know this can be done by using a "document per user per tag per day" or by making "tags" an array, but I'd like to be able to do this as above, as it makes in-place $inc's easier (many more of these happen than reads).
Or do I need to return the whole document and defer to the client for the sorting/limiting?
When you use object keys as tag names, you are making this kind of reporting very difficult. The aggregation framework has no $unwind equivalent for objects. But there is always MapReduce.
Have your map function emit one document for each key/value pair in the tags subdocument. It should look something like this:
var mapFunction = function() {
    for (var key in this.tags) {
        emit(key, this.tags[key]);
    }
}
Your reduce function would then sum up the values emitted for the same key.
var reduceFunction = function(key, values) {
    var sum = 0;
    for (var i = 0; i < values.length; i++) {
        sum += values[i];
    }
    return sum;
}
The complete MapReduce command would look something like this:
db.runCommand(
    {
        mapReduce: "yourcollection", // the collection where your data is stored
        query: { _id : "user1_20130822" }, // or however you want to limit the results
        map: mapFunction,
        reduce: reduceFunction,
        out: { inline: 1 }, // means that the output is returned directly.
    }
)
This will return all tags in unpredictable order. MapReduce has a sort and a limit option, but these only work on a field which has an index in the original collection, so you can't use them on a computed field. To get only the top 3, you would have to sort the results at the application level. If you insist on doing the sorting and limiting in the database, define an output collection to store the mapReduce results in (with the out option set to out: { replace: "temporaryCollectionName" }) and then query that collection with sort and limit afterwards.
Keep in mind that when you use an intermediate collection, you must make sure that no two users run MapReduces with different queries into the same collection. When you have multiple users who want to view your top-3 list, you could let them query the output collection and run the MapReduce in the background at regular intervals.
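If the sorting and limiting is left to the application, it could look roughly like the sketch below. It sums the tag counts across whatever documents were fetched (one day, or a month's worth) and keeps the top n. `topTags` is a hypothetical helper, and the sample document reuses the tag values from the question.

```javascript
// Hypothetical client-side step: sum tag counts across documents,
// then sort descending and keep the top n.
function topTags(docs, n) {
  const totals = {};
  for (const doc of docs) {
    for (const key of Object.keys(doc.tags)) {
      totals[key] = (totals[key] || 0) + doc.tags[key];
    }
  }
  return Object.keys(totals)
    .map((key) => ({ tag: key, count: totals[key] }))
    .sort((a, b) => b.count - a.count)
    .slice(0, n);
}

const day = { tags: { abc: 19, "123": 2, bca: 64, xyz: 14, zyx: 12, "321": 7 } };
const top3 = topTags([day], 3);
// top3 lists bca (64), abc (19), xyz (14)
```

Passing several daily documents in the array gives the combined ranking, at the cost of transferring each whole tags subdocument to the client.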

MongoDB - Aggregation Framework (Total Count)

When running a normal "find" query on MongoDB I can get the total result count (regardless of limit) by running "count" on the returned cursor. So, even if I limit the result set to 10 (for example) I can still know that the total number of results was 53 (again, for example).
If I understand it correctly, the aggregation framework, however, doesn't return a cursor but simply the results. And so, if I used the $limit pipeline operator, how can I know the total number of results regardless of said limit?
I guess I could run the aggregation twice (once to count the results via $group, and once with $limit for the actual limited results), but this seems inefficient.
An alternative approach could be to attach the total number of results to the documents (via $group) prior to the $limit operation, but this also seems inefficient as this number will be attached to every document (instead of just returned once for the set).
Am I missing something here? Any ideas? Thanks!
For example, if this is the query:
db.article.aggregate([
    { $group : {
        _id : "$author",
        posts : { $sum : 1 }
    }},
    { $sort : { posts: -1 } },
    { $limit : 5 }
]);
How would I know how many results are available (before $limit)? The result isn't a cursor, so I can't just run count on it.
There is a solution using $push and $slice: https://stackoverflow.com/a/39784851/4752635 (@emaniacs mentions it here as well).
But I prefer using 2 queries. The solution that pushes $$ROOT and uses $slice runs into the 16MB document size limit for large collections. Also, for large collections two queries together seem to run faster than the one with $$ROOT pushing. You can run them in parallel as well, so you are limited only by the slower of the two queries (probably the one which sorts).
The first query filters and then groups to get the number of filtered elements. Do not sort here; it is unnecessary.
The second query filters, sorts and paginates.
I have settled on this solution using 2 queries and the aggregation framework (note: I use node.js in this example):
var aggregation = [
    {
        // If you can match fields at the beginning, match as many as early as possible.
        $match: {...}
    },
    {
        // Projection.
        $project: {...}
    },
    {
        // Some things you can match only after projection or grouping, so do it now.
        $match: {...}
    }
];
// Copy the filtering stages: they are the same for the counting query and the pagination query.
var aggregationPaginated = aggregation.slice(0);
// Count filtered elements.
aggregation.push(
    {
        $group: {
            _id: null,
            count: { $sum: 1 }
        }
    }
);
// Sort in pagination query.
aggregationPaginated.push(
    {
        $sort: sorting
    }
);
// Paginate.
aggregationPaginated.push(
    {
        $limit: skip + length
    },
    {
        $skip: skip
    }
);
// I use mongoose.
// Get total count.
model.count(function(errCount, totalCount) {
    // Count filtered.
    model.aggregate(aggregation)
        .allowDiskUse(true)
        .exec(
            function(errFind, documents) {
                if (errFind) {
                    // Errors.
                    res.status(503);
                    return res.json({
                        'success': false,
                        'response': 'err_counting'
                    });
                }
                else {
                    // Number of filtered elements (no result document means nothing matched).
                    var numFiltered = documents.length > 0 ? documents[0].count : 0;
                    // Filter, sort and paginate.
                    model.aggregate(aggregationPaginated)
                        .allowDiskUse(true)
                        .exec(
                            function(errFindP, documentsP) {
                                if (errFindP) {
                                    // Errors.
                                    res.status(503);
                                    return res.json({
                                        'success': false,
                                        'response': 'err_pagination'
                                    });
                                }
                                else {
                                    return res.json({
                                        'success': true,
                                        'recordsTotal': totalCount,
                                        'recordsFiltered': numFiltered,
                                        'response': documentsP
                                    });
                                }
                            });
                }
            });
});
Assaf, there are going to be some enhancements to the aggregation framework in the near future that may allow you to do your calculations in one pass easily, but right now it is best to perform your calculations by running two queries in parallel: one to aggregate the #posts for your top authors, and another aggregation to calculate the total posts for all authors. Also, note that if all you need is a count of documents, using the count function is a very efficient way of performing the calculation. MongoDB caches counts within btree indexes, allowing for very quick counts on queries.
If these aggregations turn out to be slow there are a couple of strategies. First off, keep in mind that you want to start the query with a $match, if applicable, to reduce the result set. $match stages can also be sped up by indexes. Secondly, you can perform these calculations as pre-aggregations. Instead of possibly running these aggregations every time a user accesses some part of your app, have the aggregations run periodically in the background and store the results in a collection that contains pre-aggregated values. This way, your pages can simply query the pre-calculated values from this collection.
The $facet aggregation stage can be used with MongoDB versions >= 3.4.
It lets you fork a pipeline at a particular stage into multiple sub-pipelines; in this case, one sub-pipeline counts the documents and another sorts, skips and limits.
This avoids running the same stages multiple times in multiple requests.
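As a rough sketch of that idea applied to the query from the question: `pagedWithTotal` and the facet names `total` and `page` are made up for this example, and the helper only builds the pipeline array to pass to `aggregate()`.

```javascript
// Hypothetical helper: one pipeline that returns both the total number of
// groups and a single page of them, using $facet (MongoDB 3.4+).
function pagedWithTotal(skip, limit) {
  return [
    { $group: { _id: "$author", posts: { $sum: 1 } } },
    { $sort: { posts: -1 } },
    {
      $facet: {
        // Sub-pipeline 1: count every group produced so far.
        total: [{ $count: "count" }],
        // Sub-pipeline 2: the requested page.
        page: [{ $skip: skip }, { $limit: limit }]
      }
    }
  ];
}

// In the shell: db.article.aggregate(pagedWithTotal(0, 5))
// yields one document shaped like { total: [ { count: ... } ], page: [ ... ] }.
```

The `$facet` output is a single document, so it is subject to the 16MB document size limit; keeping the page small avoids that. When nothing matches, `total` comes back as an empty array rather than a zero count.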
If you don't want to run two queries in parallel (one to aggregate the #posts for your top authors, and another aggregation to calculate the total posts for all authors) you can just remove $limit from the pipeline and work on the results:
totalCount = results.length;
page = results.slice(skip, skip + limit);
For example:
db.article.aggregate([
    { $group : {
        _id : "$author",
        posts : { $sum : 1 }
    }},
    { $sort : { posts: -1 } }
    // { $skip : yourSkip },   // remove this
    // { $limit : yourLimit }, // remove this too
]).exec(function(err, results){
    var totalCount = results.length; // get the total count here
    var page = results.slice(yourSkip, yourSkip + yourLimit); // slice() returns a new array
});
I had the same problem and solved it with $project, $slice and $$ROOT.
db.article.aggregate(
    { $group : {
        _id : '$author',
        posts : { $sum : 1 },
        articles: {$push: '$$ROOT'},
    }},
    { $sort : { posts: -1 } },
    { $project: {total: '$posts', articles: {$slice: ['$articles', from, to]}}},
).toArray(function(err, result){
    var articles = result[0].articles;
    var total = result[0].total;
});
You need to declare the from and to variables. Note that the third argument of $slice is the number of elements to return, not an end index.
https://docs.mongodb.com/manual/reference/operator/aggregation/slice/
In my case, we use the $out stage to dump the result set from the aggregation into a temp/cache collection, then count it. Since we need to sort and paginate results, we add an index on the temp collection and save its name in the session, removing the collection when the session closes or the cache times out.
I get total count with aggregate().toArray().length

How to get the last N records in mongodb?

I can't find anywhere that this is documented. By default, the find() operation returns records from the beginning. How can I get the last N records in MongoDB?
Edit: I also want the returned result ordered from least recent to most recent, not the reverse.
If I understand your question, you need to sort in ascending order.
Assuming you have some id or date field called "x" you would do ...
.sort()
db.foo.find().sort({x:1});
The 1 will sort ascending (oldest to newest) and -1 will sort descending (newest to oldest.)
If you use the auto created _id field it has a date embedded in it ... so you can use that to order by ...
db.foo.find().sort({_id:1});
That will return back all your documents sorted from oldest to newest.
Natural Order
You can also use a Natural Order mentioned above ...
db.foo.find().sort({$natural:1});
Again, using 1 or -1 depending on the order you want.
Use .limit()
Lastly, it's good practice to add a limit when doing this sort of wide open query so you could do either ...
db.foo.find().sort({_id:1}).limit(50);
or
db.foo.find().sort({$natural:1}).limit(50);
The last N added records, from less recent to most recent, can be seen with this query:
db.collection.find().skip(db.collection.count() - N)
If you want them in the reverse order:
db.collection.find().sort({ $natural: -1 }).limit(N)
If you install Mongo-Hacker you can also use:
db.collection.find().reverse().limit(N)
If you get tired of writing these commands all the time you can create custom functions in your ~/.mongorc.js. E.g.
function last(N) {
    return db.collection.find().skip(db.collection.count() - N);
}
Then from a mongo shell just type last(N).
Sorting, skipping and so on can be pretty slow depending on the size of your collection.
Better performance can be achieved if your collection is indexed by some criteria; then you can use the min() cursor method:
First, index your collection with db.collectionName.createIndex( yourIndex )
You can use ascending or descending order, which is handy because you always want the "N last items": if you index in descending order, that is the same as getting the "first N items".
Then you find the first item of your collection and use its index field values as the min criteria in a search like:
db.collectionName.find().min(minCriteria).hint(yourIndex).limit(N)
Here's the reference for min() cursor: https://docs.mongodb.com/manual/reference/method/cursor.min/
In order to get last N records you can execute below query:
db.yourcollectionname.find({$query: {}, $orderby: {$natural : -1}}).limit(yournumber)
if you want only one last record:
db.yourcollectionname.findOne({$query: {}, $orderby: {$natural : -1}})
Note: In place of $natural you can use one of the columns from your collection.
db.collection.find().sort({$natural: -1 }).limit(5)
@bin-chen,
You can use an aggregation for the latest n entries of a subset of documents in a collection. Here's a simplified example without grouping (which you would be doing between stages 4 and 5 in this case).
This returns the latest 20 entries (based on a field called "timestamp"), sorted ascending. It then projects each documents _id, timestamp and whatever_field_you_want_to_show into the results.
var pipeline = [
    {
        "$match": { //stage 1: filter out a subset
            "first_field": "needs to have this value",
            "second_field": "needs to be this"
        }
    },
    {
        "$sort": { //stage 2: sort the remainder last-first
            "timestamp": -1
        }
    },
    {
        "$limit": 20 //stage 3: keep only 20 of the descending order subset
    },
    {
        "$sort": {
            "timestamp": 1 //stage 4: sort back to ascending order
        }
    },
    {
        "$project": { //stage 5: add any fields you want to show in your results
            "_id": 1,
            "timestamp" : 1,
            "whatever_field_you_want_to_show": 1
        }
    }
]
yourcollection.aggregate(pipeline, function resultCallBack(err, result) {
    // account for (err)
    // do something with (result)
});
so, result would look something like:
{
"_id" : ObjectId("5ac5b878a1deg18asdafb060"),
"timestamp" : "2018-04-05T05:47:37.045Z",
"whatever_field_you_want_to_show" : -3.46000003814697
}
{
"_id" : ObjectId("5ac5b878a1de1adsweafb05f"),
"timestamp" : "2018-04-05T05:47:38.187Z",
"whatever_field_you_want_to_show" : -4.13000011444092
}
Hope this helps.
You can try this method:
Get the total number of records in the collection with
db.dbcollection.count()
Then use skip:
db.dbcollection.find().skip(db.dbcollection.count() - 1).pretty()
You can't "skip" based on the size of the collection, because it will not take the query conditions into account.
The correct solution is to sort from the desired end-point, limit the size of the result set, then adjust the order of the results if necessary.
Here is an example, based on real-world code.
var query = collection.find( { conditions } ).sort({$natural : -1}).limit(N);
query.exec(function(err, results) {
    if (err) {
        // handle the error
    }
    else if (results.length == 0) {
        // nothing matched the conditions
    }
    else {
        results.reverse(); // put the results into the desired order
        results.forEach(function(result) {
            // do something with each result
        });
    }
});
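The sort, limit, then reverse sequence can be illustrated on a plain array. This is an in-memory analogue only: `lastN` is a hypothetical helper, the array order stands in for insertion order, and the predicate plays the role of the query conditions.

```javascript
// Hypothetical in-memory analogue: `docs` is in insertion order and
// `matches` plays the role of the query conditions.
function lastN(docs, matches, n) {
  // filter() returns a copy, so reversing it does not mutate `docs`.
  const newestFirst = docs.filter(matches).reverse(); // like sort({$natural: -1})
  return newestFirst.slice(0, n).reverse();           // limit(n), then oldest-first again
}

const docs = [1, 2, 3, 4, 5, 6, 7, 8].map((i) => ({ seq: i, even: i % 2 === 0 }));
const result = lastN(docs, (d) => d.even, 3).map((d) => d.seq);
// result is [4, 6, 8]
```

Skipping `docs.length - 3` entries instead would return 6, 7 and 8 before any filtering, which is why skip based on the collection size gives the wrong answer once conditions are involved.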
You can use sort(), limit() and skip() to get the last N records, starting from any skipped value:
db.collections.find().sort({key: value}).limit(intValue).skip(someIntValue);
Look under Querying: Sorting and Natural Order, http://www.mongodb.org/display/DOCS/Sorting+and+Natural+Order
as well as sort() under Cursor Methods
http://www.mongodb.org/display/DOCS/Advanced+Queries
You may want to be using the find options :
http://docs.meteor.com/api/collections.html#Mongo-Collection-find
db.collection.find({}, {sort: {createdAt: -1}, skip:2, limit: 18}).fetch();
Use .sort() and .limit() for that
Use Sort in ascending or descending order and then use limit
db.collection.find({}).sort({ any_field: -1 }).limit(last_n_records);
If you use MongoDB Compass, you can use the sort field to filter.
Use the $slice operator to limit array elements:
GeoLocation.find({}, {name: 1, geolocation: {$slice: -5}})
    .then((result) => {
        res.json(result);
    })
    .catch((err) => {
        res.status(500).json({ success: false, msg: `Something went wrong. ${err}` });
    });
Here geolocation is an array of data, and we get the last 5 records from it.
db.collection.find().hint( { $natural : -1 } ).sort({ field: 1 /* or -1 */ }).limit(n)
According to the MongoDB documentation:
You can specify { $natural : 1 } to force the query to perform a forwards collection scan.
You can also specify { $natural : -1 } to force the query to perform a reverse collection scan.
Last function should be sort, not limit.
Example:
db.testcollection.find().limit(3).sort({timestamp:-1});