I can't find this documented anywhere. By default, the find() operation gets records from the beginning. How can I get the last N records in MongoDB?
Edit: I also want the returned results ordered from least recent to most recent, not the reverse.
If I understand your question, you need to sort in ascending order.
Assuming you have some id or date field called "x" you would do ...
.sort()
db.foo.find().sort({x:1});
The 1 will sort ascending (oldest to newest) and -1 will sort descending (newest to oldest).
If you use the auto-created _id field, it has a timestamp embedded in it ... so you can use that to order by ...
db.foo.find().sort({_id:1});
That will return all your documents sorted from oldest to newest.
Natural Order
You can also use the natural order mentioned above ...
db.foo.find().sort({$natural:1});
Again, using 1 or -1 depending on the order you want.
Use .limit()
Lastly, it's good practice to add a limit when doing this sort of wide-open query, so you could do either ...
db.foo.find().sort({_id:1}).limit(50);
or
db.foo.find().sort({$natural:1}).limit(50);
The last N added records, from least recent to most recent, can be seen with this query:
db.collection.find().skip(db.collection.count() - N)
If you want them in the reverse order:
db.collection.find().sort({ $natural: -1 }).limit(N)
If you install Mongo-Hacker you can also use:
db.collection.find().reverse().limit(N)
If you get tired of writing these commands all the time you can create custom functions in your ~/.mongorc.js. E.g.
function last(N) {
return db.collection.find().skip(db.collection.count() - N);
}
Then from a mongo shell just type last(N).
Sorting, skipping and so on can be pretty slow depending on the size of your collection.
You will get better performance if your collection is indexed by some criteria; then you can use the min() cursor:
First, index your collection with db.collectionName.createIndex( yourIndex )
You can use ascending or descending order, which is convenient because you always want the "last N items"... if you index in descending order, that is the same as getting the "first N items".
Then find the first item of your collection and use its index field values as the min criteria in a search like:
db.collectionName.find().min(minCriteria).hint(yourIndex).limit(N)
Here's the reference for min() cursor: https://docs.mongodb.com/manual/reference/method/cursor.min/
To get the last N records you can run the query below:
db.yourcollectionname.find({$query: {}, $orderby: {$natural : -1}}).limit(yournumber)
If you want only the last record:
db.yourcollectionname.findOne({$query: {}, $orderby: {$natural : -1}})
Note: In place of $natural you can use one of the fields from your collection.
db.collection.find().sort({$natural: -1 }).limit(5)
@bin-chen,
You can use an aggregation for the latest n entries of a subset of documents in a collection. Here's a simplified example without grouping (which you would be doing between stages 4 and 5 in this case).
This returns the latest 20 entries (based on a field called "timestamp"), sorted ascending. It then projects each document's _id, timestamp and whatever_field_you_want_to_show into the results.
var pipeline = [
{
"$match": { //stage 1: filter out a subset
"first_field": "needs to have this value",
"second_field": "needs to be this"
}
},
{
"$sort": { //stage 2: sort the remainder last-first
"timestamp": -1
}
},
{
"$limit": 20 //stage 3: keep only 20 of the descending order subset
},
{
"$sort": {
"rt": 1 //stage 4: sort back to ascending order
}
},
{
"$project": { //stage 5: add any fields you want to show in your results
"_id": 1,
"timestamp" : 1,
"whatever_field_you_want_to_show": 1
}
}
]
yourcollection.aggregate(pipeline, function resultCallBack(err, result) {
// account for (err)
// do something with (result)
});
So, the result would look something like:
{
"_id" : ObjectId("5ac5b878a1deg18asdafb060"),
"timestamp" : "2018-04-05T05:47:37.045Z",
"whatever_field_you_want_to_show" : -3.46000003814697
}
{
"_id" : ObjectId("5ac5b878a1de1adsweafb05f"),
"timestamp" : "2018-04-05T05:47:38.187Z",
"whatever_field_you_want_to_show" : -4.13000011444092
}
Hope this helps.
You can try this method:
Get the total number of records in the collection with
db.dbcollection.count()
Then use skip:
db.dbcollection.find().skip(db.dbcollection.count() - 1).pretty()
You can't "skip" based on the size of the collection, because it will not take the query conditions into account.
The correct solution is to sort from the desired end-point, limit the size of the result set, then adjust the order of the results if necessary.
Here is an example, based on real-world code.
var query = collection.find( { conditions } ).sort({$natural : -1}).limit(N);
query.exec(function(err, results) {
if (err) {
// handle the error
}
else if (results.length == 0) {
// no matching documents
}
else {
results.reverse(); // put the results into the desired order
results.forEach(function(result) {
// do something with each result
});
}
});
You can use sort(), limit() and skip() to get the last N records, starting from any skipped value:
db.collection.find().sort({ field: -1 }).limit(N).skip(M);
Look under Querying: Sorting and Natural Order, http://www.mongodb.org/display/DOCS/Sorting+and+Natural+Order
as well as sort() under Cursor Methods
http://www.mongodb.org/display/DOCS/Advanced+Queries
You may want to use the find options:
http://docs.meteor.com/api/collections.html#Mongo-Collection-find
db.collection.find({}, {sort: {createdAt: -1}, skip:2, limit: 18}).fetch();
Use .sort() and .limit() for that
Sort in ascending or descending order and then use limit:
db.collection.find({}).sort({ any_field: -1 }).limit(last_n_records);
If you use MongoDB Compass, you can use the sort field to filter.
Use the $slice projection operator to limit array elements:
GeoLocation.find({},{name: 1, geolocation:{$slice: -5}})
.then((result) => {
res.json(result);
})
.catch((err) => {
res.status(500).json({ success: false, msg: `Something went wrong. ${err}` });
});
where geolocation is an array field; this returns its last 5 elements.
db.collection.find().hint({ $natural: -1 }).sort({ field: 1 }).limit(n) // use 1 or -1 in sort
According to the MongoDB documentation:
You can specify { $natural : 1 } to force the query to perform a forwards collection scan.
You can also specify { $natural : -1 } to force the query to perform a reverse collection scan.
The last function should be sort, not limit.
Example:
db.testcollection.find().limit(3).sort({timestamp:-1});
Related
Let's say I have four documents in my collection:
{u'a': {u'time': 3}}
{u'a': {u'time': 5}}
{u'b': {u'time': 4}}
{u'b': {u'time': 2}}
Is it possible to sort them by the field 'time' which is common in both 'a' and 'b' documents?
Thank you
No, you should put your data into a common format so you can sort it on a common field. It can still be nested if you want but it would need to have the same path.
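For example (a hypothetical reshaping, not the question's actual data), if the type becomes a field value and time always lives at the same path, a plain sort works:
db.events.insert([
    { type: "a", time: 3 },
    { type: "a", time: 5 },
    { type: "b", time: 4 },
    { type: "b", time: 2 }
]);
// With a common path, an ordinary (indexable) sort does the job:
db.events.find().sort({ time: 1 });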
You can use aggregation; the following code has been tested.
db.test.aggregate({
$project: {
time: {
"$cond": [{
"$gt": ["$a.time", null]
}, "$a.time", "$b.time"]
}
}
}, {
$sort: {
time: -1
}
});
Or, if you also want the original fields returned: gist
Alternatively, you can sort once you get the result back, using a customized compare function (not tested, for illustration purposes only):
db.eval(function() {
    return db.mycollection.find().toArray().sort(function(doc1, doc2) {
        var time1 = doc1.a ? doc1.a.time : doc1.b.time,
            time2 = doc2.a ? doc2.a.time : doc2.b.time;
        return time1 - time2;
    });
});
You can, using the aggregation framework.
The trick here is to $project a common field to all the documents so that the $sort stage can use the value in that field to sort the documents.
The $ifNull operator can be used to check whether a.time exists; if it does, the record will be sorted by that value, otherwise by b.time.
code:
db.t.aggregate([
{$project:{"a":1,"b":1,
"sortBy":{$ifNull:["$a.time","$b.time"]}}},
{$sort:{"sortBy":-1}},
{$project:{"a":1,"b":1}}
])
Consequences of this approach:
The aggregation pipeline won't be covered by any of the indexes you create.
The performance will be very poor for very large data sets.
What you could ideally do is to ask the source system that is sending you the data to standardize its format, something like:
{"a":1,"time":5}
{"b":1,"time":4}
That way your query can make use of the index if you create one on the time field.
db.t.ensureIndex({"time":-1});
code:
db.t.find({}).sort({"time":-1});
When running a normal "find" query on MongoDB I can get the total result count (regardless of limit) by running "count" on the returned cursor. So, even if I limit to result set to 10 (for example) I can still know that the total number of results was 53 (again, for example).
If I understand it correctly, the aggregation framework, however, doesn't return a cursor but simply the results. And so, if I used the $limit pipeline operator, how can I know the total number of results regardless of said limit?
I guess I could run the aggregation twice (once to count the results via $group, and once with $limit for the actual limited results), but this seems inefficient.
An alternative approach could be to attach the total number of results to the documents (via $group) prior to the $limit operation, but this also seems inefficient as this number will be attached to every document (instead of just returned once for the set).
Am I missing something here? Any ideas? Thanks!
For example, if this is the query:
db.article.aggregate(
{ $group : {
_id : "$author",
posts : { $sum : 1 }
}},
{ $sort : { posts: -1 } },
{ $limit : 5 }
);
How would I know how many results are available (before $limit)? The result isn't a cursor, so I can't just run count on it.
There is a solution using push and slice: https://stackoverflow.com/a/39784851/4752635 (@emaniacs mentions it here as well).
But I prefer using 2 queries. The solution that pushes $$ROOT and uses $slice runs into the 16 MB document memory limit for large collections. Also, for large collections, two queries together seem to run faster than the one with $$ROOT pushing. You can run them in parallel as well, so you are limited only by the slower of the two queries (probably the one which sorts).
The first query filters and then groups by ID to get the number of filtered elements; there is no need to sort here.
The second query filters, sorts and paginates.
I have settled with this solution using 2 queries and aggregation framework (note - I use node.js in this example):
var aggregation = [
{
// If you can match fields at the beginning, match as many as possible, as early as possible.
$match: {...}
},
{
// Projection.
$project: {...}
},
{
// Some things you can match only after projection or grouping, so do it now.
$match: {...}
}
];
// Copy the filtering stages of the pipeline - they are the same for both the count query and the pagination query.
var aggregationPaginated = aggregation.slice(0);
// Count filtered elements.
aggregation.push(
{
$group: {
_id: null,
count: { $sum: 1 }
}
}
);
// Sort in pagination query.
aggregationPaginated.push(
{
$sort: sorting
}
);
// Paginate.
aggregationPaginated.push(
{
$limit: skip + length
},
{
$skip: skip
}
);
// I use mongoose.
// Get total count.
model.count(function(errCount, totalCount) {
// Count filtered.
model.aggregate(aggregation)
.allowDiskUse(true)
.exec(
function(errFind, documents) {
if (errFind) {
// Errors.
res.status(503);
return res.json({
'success': false,
'response': 'err_counting'
});
}
else {
// Number of filtered elements.
var numFiltered = documents[0].count;
// Filter, sort and paginate.
model.aggregate(aggregationPaginated)
.allowDiskUse(true)
.exec(
function(errFindP, documentsP) {
if (errFindP) {
// Errors.
res.status(503);
return res.json({
'success': false,
'response': 'err_pagination'
});
}
else {
return res.json({
'success': true,
'recordsTotal': totalCount,
'recordsFiltered': numFiltered,
'response': documentsP
});
}
});
}
});
});
Assaf, there are going to be some enhancements to the aggregation framework in the near future that may allow you to do your calculations in one pass easily, but right now it is best to perform your calculations by running two queries in parallel: one to aggregate the #posts for your top authors, and another aggregation to calculate the total posts for all authors. Also, note that if all you need to do is a count of documents, using the count function is a very efficient way of performing the calculation. MongoDB caches counts within btree indexes, allowing for very quick counts on queries.
If these aggregations turn out to be slow, there are a couple of strategies. First off, keep in mind that you want to start the query with a $match if applicable to reduce the result set. $match stages can also be sped up by indexes. Secondly, you can perform these calculations as pre-aggregations. Instead of possibly running these aggregations every time a user accesses some part of your app, have the aggregations run periodically in the background and store the results in a collection that contains pre-aggregated values. This way, your pages can simply query the pre-calculated values from this collection.
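A minimal sketch of that pre-aggregation idea (the author_post_counts collection name is made up; this would run periodically in the background, not per request):
// Hypothetical background job: recompute per-author post counts and store them.
db.article.aggregate([
    { $group: { _id: "$author", posts: { $sum: 1 } } },
    { $out: "author_post_counts" } // $out replaces the pre-aggregated collection on each run
]);
// Pages then read the cheap pre-calculated values instead of re-aggregating:
db.author_post_counts.find().sort({ posts: -1 }).limit(5);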
The $facet aggregation stage can be used with MongoDB versions >= 3.4.
It lets you fork the pipeline at a particular stage into multiple sub-pipelines, which in this case allows building one sub-pipeline to count the number of documents and another one for sorting, skipping and limiting.
This avoids running the same stages multiple times across multiple requests.
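As a sketch based on the question's example pipeline (MongoDB 3.4+; the facet names results and totalCount are arbitrary), the page of results and the pre-$limit count come back in a single request:
db.article.aggregate([
    { $group: { _id: "$author", posts: { $sum: 1 } } },
    { $facet: {
        results: [ // sub-pipeline 1: the sorted, limited page
            { $sort: { posts: -1 } },
            { $limit: 5 }
        ],
        totalCount: [ // sub-pipeline 2: count of all grouped documents before the limit
            { $count: "count" }
        ]
    }}
]);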
If you don't want to run two queries in parallel (one to aggregate the #posts for your top authors, and another aggregation to calculate the total posts for all authors), you can just remove $limit from the pipeline, and on the results you can use:
totalCount = results.length;
results.slice(numberOfSkip, numberOfSkip + numberOfLimit);
Example:
db.article.aggregate([
{ $group : {
_id : "$author",
posts : { $sum : 1 }
}},
{ $sort : { posts: -1 } }
//{$skip : yourSkip}, //--remove this
//{ $limit : yourLimit }, // remove this too
]).exec(function(err, results){
var totalCount = results.length; // Get total count here
results.slice(yourSkip,yourSkip+yourLimit);
});
I had the same problem, and solved it with $project, $slice and $$ROOT.
db.article.aggregate(
{ $group : {
_id : '$author',
posts : { $sum : 1 },
articles: {$push: '$$ROOT'},
}},
{ $sort : { posts: -1 } },
{ $project: {total: '$posts', articles: {$slice: ['$articles', from, to]}}},
).toArray(function(err, result){
var articles = result[0].articles;
var total = result[0].total;
});
You need to declare the from and to variables yourself.
https://docs.mongodb.com/manual/reference/operator/aggregation/slice/
In my case, we use the $out stage to dump the result set from the aggregation into a temp/cache collection, then count it. And, since we need to sort and paginate the results, we add an index on the temp collection and save the collection name in the session, removing the collection on session close/cache timeout. A rough sketch of this is below.
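A rough sketch of that approach (collection and field names are illustrative; session bookkeeping and cleanup are assumed to happen elsewhere):
// Materialize the aggregation once into a per-session temp collection,
// then count, index, sort and paginate the cached copy.
db.article.aggregate([
    { $match: { /* your filters */ } },
    { $group: { _id: "$author", posts: { $sum: 1 } } },
    { $out: "tmp_results_session_123" }
]);
db.tmp_results_session_123.createIndex({ posts: -1 });
var total = db.tmp_results_session_123.count();
db.tmp_results_session_123.find().sort({ posts: -1 }).skip(20).limit(10);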
I get the total count with aggregate().toArray().length.
I have a collection in MongoDB with around ~3 million records. A sample record looks like:
{
    "_id" : ObjectId("50731xxxxxxxxxxxxxxxxxxxx"),
    "source_references" : [
        {
            "_id" : ObjectId("5045xxxxxxxxxxxxxx"),
            "name" : "xxx",
            "key" : 123
        }
    ]
}
I have a lot of duplicate records in the collection with the same source_references.key. (By duplicate I mean the same source_references.key, not the same _id.)
I want to remove duplicate records based on source_references.key; I'm thinking of writing some PHP code to traverse each record and remove the duplicates it finds.
Is there a way to remove the duplicates from the Mongo shell?
This answer is obsolete: the dropDups option was removed in MongoDB 3.0, so a different approach will be required in most cases. For example, you could use aggregation as suggested on: MongoDB duplicate documents even after adding unique key.
If you are certain that the source_references.key identifies duplicate records, you can ensure a unique index with the dropDups:true index creation option in MongoDB 2.6 or older:
db.things.ensureIndex({'source_references.key' : 1}, {unique : true, dropDups : true})
This will keep the first unique document for each source_references.key value, and drop any subsequent documents that would otherwise cause a duplicate key violation.
Important Note: Any documents missing the source_references.key field will be considered as having a null value, so subsequent documents missing the key field will be deleted. You can add the sparse:true index creation option so the index only applies to documents with a source_references.key field.
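For example, the same index with the sparse option added (still MongoDB 2.6 or older, since it relies on dropDups):
db.things.ensureIndex({'source_references.key' : 1}, {unique : true, dropDups : true, sparse : true})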
Obvious caution: Take a backup of your database, and try this in a staging environment first if you are concerned about unintended data loss.
This is the easiest query I used on my MongoDB 3.2:
db.myCollection.find({}, {myCustomKey:1}).sort({_id:1}).forEach(function(doc){
db.myCollection.remove({_id:{$gt:doc._id}, myCustomKey:doc.myCustomKey});
})
Index your myCustomKey field before running this to increase speed.
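For example (using the same hypothetical myCustomKey field as above):
db.myCollection.createIndex({ myCustomKey: 1 })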
While @Stennie's is a valid answer, it is not the only way. In fact, the MongoDB manual asks you to be very cautious while doing that. There are two other options:
Let MongoDB do it for you using map-reduce (a sketch follows this list)
Another way
Do it programmatically, which is less efficient.
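A minimal map-reduce sketch of the first option (the output collection name dup_keys is made up), collecting the source_references.key values that occur more than once:
db.things.mapReduce(
    function () {
        // emit every key found in this document's source_references array
        (this.source_references || []).forEach(function (ref) { emit(ref.key, 1); });
    },
    function (key, values) { return Array.sum(values); },
    { out: "dup_keys" }
);
db.dup_keys.find({ value: { $gt: 1 } }); // keys that appear more than once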
Here is a slightly more 'manual' way of doing it:
Essentially, first get a list of all the unique key values you are interested in.
Then perform a search using each of those keys and delete all but one document if that search returns more than one.
db.collection.distinct("key").forEach((num)=>{
var i = 0;
db.collection.find({key: num}).forEach((doc)=>{
if (i) db.collection.remove({key: num}, { justOne: true })
i++
})
});
I had a similar requirement but I wanted to retain the latest entry. The following query worked with my collection which had millions of records and duplicates.
/** Create an array to store the ids of all duplicate records */
var duplicates = [];
/** Start Aggregation pipeline*/
db.collection.aggregate([
{
$match: { /** Add any filter here. Add index for filter keys*/
filterKey: {
$exists: false
}
}
},
{
$sort: { /** Sort in such a way that the element you want to retain comes first */
createdAt: -1
}
},
{
$group: {
_id: {
key1: "$key1", key2:"$key2" /** These are the keys which define the duplicate. Here document with same value for key1 and key2 will be considered duplicate*/
},
dups: {
$push: {
_id: "$_id"
}
},
count: {
$sum: 1
}
}
},
{
$match: {
count: {
"$gt": 1
}
}
}
],
{
allowDiskUse: true
}).forEach(function(doc){
doc.dups.shift();
doc.dups.forEach(function(dupId){
duplicates.push(dupId._id);
})
})
/** Delete the duplicates*/
var i,j,temparray,chunk = 100000;
for (i=0,j=duplicates.length; i<j; i+=chunk) {
temparray = duplicates.slice(i,i+chunk);
db.collection.bulkWrite([{deleteMany:{"filter":{"_id":{"$in":temparray}}}}])
}
Expanding on Fernando's answer, I found that it was taking too long, so I modified it.
var x = 0;
db.collection.distinct("field").forEach(fieldValue => {
var i = 0;
db.collection.find({ "field": fieldValue }).forEach(doc => {
if (i) {
db.collection.remove({ _id: doc._id });
}
i++;
x += 1;
if (x % 100 === 0) {
print(x); // Every time we process 100 docs.
}
});
});
The improvement is basically using the document id for removing, which should be faster, and also printing the progress of the operation; you can change the iteration value to your desired amount.
Also, indexing the field before the operation helps.
pip install mongo_remove_duplicate_indexes
Create a script in any language.
Iterate over your collection.
Create a new collection and create a new index on this collection with unique set to true. Remember, this index has to be the same as the index you wish to remove duplicates from in your original collection, with the same name.
For example: you have a collection gaming, and in this collection you have a field genre which contains duplicates that you wish to remove, so just create a new collection:
db.createCollection("cname")
Create a new index:
db.cname.createIndex({'genre': 1}, {unique: true})
Now when you insert a document with an already-seen genre, only the first will be accepted; the others will be rejected with a duplicate key error.
Now just insert the JSON-format values you received into the new collection and handle the exceptions using exception handling, for example pymongo.errors.DuplicateKeyError.
Check out the package source code for mongo_remove_duplicate_indexes for a better understanding.
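The same idea can also be sketched directly from the mongo shell (gaming_dedup is a made-up name): copy documents into a collection that has the unique index and skip duplicate key errors:
// gaming_dedup has a unique index on genre, so only the first document per genre is kept.
db.createCollection("gaming_dedup");
db.gaming_dedup.createIndex({ genre: 1 }, { unique: true });
db.gaming.find().forEach(function (doc) {
    var res = db.gaming_dedup.insert(doc);
    if (res.hasWriteError()) {
        // duplicate key error (code 11000): this genre was already inserted, skip it
    }
});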
If you have enough memory, you can do something like this in Scala:
cole.find().groupBy(_.customField).filter(_._2.size > 1).map(_._2.tail).flatten.map(_.id)
  .foreach(x => cole.remove({id $eq x}))
I have a collection of md5 values in MongoDB. I'd like to find all duplicates. The md5 field is indexed. Do you know any fast way to do that using map-reduce?
Or should I just iterate over all records and check for duplicates manually?
My current approach using map-reduce iterates over the collection almost twice (assuming that there is a very small number of duplicates):
res = db.files.mapReduce(
function () {
emit(this.md5, 1);
},
function (key, vals) {
return Array.sum(vals);
}
)
db[res.result].find({value: {$gt: 1}}).forEach(
function (obj) {
out.duplicates.insert(obj)
});
I personally found that on big databases (1 TB and more) the accepted answer is terribly slow. Aggregation is much faster. An example is below:
db.places.aggregate(
{ $group : {_id : "$extra_info.id", total : { $sum : 1 } } },
{ $match : { total : { $gte : 2 } } },
{ $sort : {total : -1} },
{ $limit : 5 }
);
It searches for documents whose extra_info.id is used two or more times, sorts the results in descending order of the count, and prints the first 5 of them.
The easiest way to do it in one pass is to sort by md5 and then process appropriately.
Something like:
var previous_md5;
db.files.find( {"md5" : {$exists:true} }, {"md5" : 1} ).sort( { "md5" : 1} ).forEach( function(current) {
if(current.md5 == previous_md5){
db.duplicates.update( {"_id" : current.md5}, { "$inc" : {count:1} }, true);
}
previous_md5 = current.md5;
});
That little script sorts the md5 entries and loops through them in order. If an md5 is repeated, the copies will be "back-to-back" after sorting. So we just keep a pointer to previous_md5 and compare it to current.md5. If we find a duplicate, we drop it into the duplicates collection (and use $inc to count the number of duplicates).
This script means that you only have to loop through the primary data set once. Then you can loop through the duplicates collection and perform clean-up, as sketched below.
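The clean-up pass could then look something like this sketch, which keeps one document per duplicated md5 and removes the rest:
db.duplicates.find().forEach(function (dup) {
    // dup._id holds the duplicated md5 value (see the script above)
    db.files.find({ md5: dup._id }).skip(1).forEach(function (extra) {
        db.files.remove({ _id: extra._id });
    });
});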
You can do a group by on that field and then query to get the duplicates (having a count > 1). http://www.mongodb.org/display/DOCS/Aggregation#Aggregation-Group
Although, the fastest thing might be to just do a query which only returns that field and then do the aggregation in the client. Group/map-reduce needs to provide access to the whole document, which is much more costly than just providing the data from the index (which is now covered in 1.7.3+).
If this is a general problem you need to run periodically, you might want to keep a collection which is just {md5: value, count: value} so you can skip the aggregation; it will be extremely fast when you need to cull duplicates. A sketch of that follows.
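A sketch of maintaining such a counts collection (md5_counts is a made-up name; newFile stands for the document being inserted into files):
// Bump the count for this md5 every time a file document is inserted.
db.md5_counts.update(
    { _id: newFile.md5 },
    { $inc: { count: 1 } },
    { upsert: true }
);
// Culling duplicates later is then a cheap lookup instead of an aggregation:
db.md5_counts.find({ count: { $gt: 1 } });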