MongoDB - Aggregation Framework (Total Count)

When running a normal "find" query on MongoDB I can get the total result count (regardless of limit) by running "count" on the returned cursor. So, even if I limit the result set to 10 (for example), I can still know that the total number of results was 53 (again, for example).
If I understand it correctly, the aggregation framework, however, doesn't return a cursor but simply the results. And so, if I used the $limit pipeline operator, how can I know the total number of results regardless of said limit?
I guess I could run the aggregation twice (once to count the results via $group, and once with $limit for the actual limited results), but this seems inefficient.
An alternative approach could be to attach the total number of results to the documents (via $group) prior to the $limit operation, but this also seems inefficient as this number will be attached to every document (instead of just returned once for the set).
Am I missing something here? Any ideas? Thanks!
For example, if this is the query:
db.article.aggregate(
    { $group : {
        _id : "$author",
        posts : { $sum : 1 }
    }},
    { $sort : { posts : -1 } },
    { $limit : 5 }
);
How would I know how many results are available (before $limit)? The result isn't a cursor, so I can't just run count on it.

There is a solution using push and slice: https://stackoverflow.com/a/39784851/4752635 (@emaniacs mentions it here as well).
But I prefer using 2 queries. The solution that pushes $$ROOT and uses $slice runs into the 16 MB document size limit for large collections. Also, for large collections, two queries together seem to run faster than the one pushing $$ROOT. You can run them in parallel as well, so you are limited only by the slower of the two queries (probably the one which sorts).
The first query filters and then groups to get the number of filtered elements. Do not sort here; it is unnecessary for counting.
The second query filters, sorts and paginates.
I have settled with this solution using 2 queries and aggregation framework (note - I use node.js in this example):
var aggregation = [
    {
        // If you can match fields at the beginning, match as many as early as possible.
        $match: {...}
    },
    {
        // Projection.
        $project: {...}
    },
    {
        // Some things you can match only after projection or grouping, so do it now.
        $match: {...}
    }
];
// Copy the filtering stages of the pipeline - they are the same for both the count query and the pagination query.
var aggregationPaginated = aggregation.slice(0);
// Count filtered elements.
aggregation.push(
    {
        $group: {
            _id: null,
            count: { $sum: 1 }
        }
    }
);
// Sort in the pagination query.
aggregationPaginated.push(
    {
        $sort: sorting
    }
);
// Paginate.
aggregationPaginated.push(
    {
        $limit: skip + length
    },
    {
        $skip: skip
    }
);
// I use mongoose.
// Get total count.
model.count(function (errCount, totalCount) {
    // Count filtered.
    model.aggregate(aggregation)
        .allowDiskUse(true)
        .exec(function (errFind, documents) {
            if (errFind) {
                // Errors.
                res.status(503);
                return res.json({
                    'success': false,
                    'response': 'err_counting'
                });
            }
            else {
                // Number of filtered elements.
                var numFiltered = documents[0].count;
                // Filter, sort and paginate.
                model.aggregate(aggregationPaginated)
                    .allowDiskUse(true)
                    .exec(function (errFindP, documentsP) {
                        if (errFindP) {
                            // Errors.
                            res.status(503);
                            return res.json({
                                'success': false,
                                'response': 'err_pagination'
                            });
                        }
                        else {
                            return res.json({
                                'success': true,
                                'recordsTotal': totalCount,
                                'recordsFiltered': numFiltered,
                                'response': documentsP
                            });
                        }
                    });
            }
        });
});

Assaf, there are going to be some enhancements to the aggregation framework in the near future that may allow you to do your calculations in one pass, but right now it is best to perform your calculations by running two queries in parallel: one to aggregate the #posts for your top authors, and another aggregation to calculate the total posts for all authors. Also, note that if all you need is a count of documents, using the count function is a very efficient way of performing the calculation. MongoDB caches counts within btree indexes, allowing for very quick counts on queries.
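For example, here is a rough sketch of the two-queries-in-parallel approach with the Node.js driver (hedged: the collection name and pipelines follow the question, and Promise.all is just one way to run them concurrently):
// Top 5 authors (the limited result set).
const topAuthorsPipeline = [
    { $group: { _id: "$author", posts: { $sum: 1 } } },
    { $sort: { posts: -1 } },
    { $limit: 5 }
];
// Number of groups before $limit, i.e. the total number of authors.
const totalAuthorsPipeline = [
    { $group: { _id: "$author" } },
    { $group: { _id: null, count: { $sum: 1 } } }
];
Promise.all([
    db.collection('article').aggregate(topAuthorsPipeline).toArray(),
    db.collection('article').aggregate(totalAuthorsPipeline).toArray()
]).then(([topAuthors, totals]) => {
    const totalCount = totals.length ? totals[0].count : 0;
    // topAuthors is the limited page; totalCount is the count regardless of $limit.
});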
If these aggregations turn out to be slow, there are a couple of strategies. First off, keep in mind that you want to start the query with a $match, if applicable, to reduce the result set. $match stages can also be sped up by indexes. Secondly, you can perform these calculations as pre-aggregations. Instead of possibly running these aggregations every time a user accesses some part of your app, have the aggregations run periodically in the background and store the results in a collection of pre-aggregated values. This way, your pages can simply query the pre-calculated values from this collection.
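A minimal sketch of the pre-aggregation idea in the mongo shell, assuming MongoDB 2.6+ for $out and a target collection named authorStats (the name is arbitrary); run it from a periodic background job:
// Recompute per-author post counts and replace the pre-aggregated collection.
db.article.aggregate([
    { $group: { _id: "$author", posts: { $sum: 1 } } },
    { $out: "authorStats" } // overwrites authorStats with the fresh results
]);
// Pages then read the cheap, pre-calculated values:
db.authorStats.find().sort({ posts: -1 }).limit(5);
db.authorStats.count(); // total number of authors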

The $facet aggregation stage can be used with MongoDB versions >= 3.4.
It lets you fork a pipeline at a particular stage into multiple sub-pipelines, which in this case allows you to build one sub-pipeline that counts the documents and another that sorts, skips and limits.
This avoids running the same stages multiple times across multiple requests.
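For instance, here is a hedged sketch applying $facet to the question's author example (MongoDB 3.4+):
db.article.aggregate([
    { $group: { _id: "$author", posts: { $sum: 1 } } },
    { $facet: {
        // One sub-pipeline counts every grouped result (before any limit)...
        totalCount: [ { $count: "count" } ],
        // ...while the other sorts and limits for the current page.
        results: [
            { $sort: { posts: -1 } },
            { $limit: 5 }
        ]
    }}
]);
// Returns a single document such as:
// { totalCount: [ { count: 53 } ], results: [ /* 5 docs */ ] }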

If you don't want to run two queries in parallel (one to aggregate the #posts for your top authors, and another aggregation to calculate the total posts for all authors), you can just remove $limit from the pipeline, and on the results you can use
totalCount = results.length;
results.slice(yourSkip, yourSkip + yourLimit);
For example:
db.article.aggregate([
    { $group : {
        _id : "$author",
        posts : { $sum : 1 }
    }},
    { $sort : { posts : -1 } }
    //{ $skip : yourSkip },   //--remove this
    //{ $limit : yourLimit }, //--remove this too
]).exec(function (err, results) {
    var totalCount = results.length; //--get total count here
    results.slice(yourSkip, yourSkip + yourLimit);
});

I ran into the same problem, and solved it with $project, $slice and $$ROOT.
db.article.aggregate(
    { $group : {
        _id : '$author',
        posts : { $sum : 1 },
        articles : { $push : '$$ROOT' }
    }},
    { $sort : { posts : -1 } },
    { $project : { total : '$posts', articles : { $slice : ['$articles', from, to] } } }
).toArray(function (err, result) {
    var articles = result[0].articles;
    var total = result[0].total;
});
You need to declare the from and to variables (in this three-argument form, $slice takes a starting position and the number of elements to return).
https://docs.mongodb.com/manual/reference/operator/aggregation/slice/

In my case, we use an $out stage to dump the result set from the aggregation into a temp/cache collection, then count it. And, since we need to sort and paginate the results, we add an index on the temp collection, save the collection name in the session, and remove the collection on session close/cache timeout.
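A rough sketch of that approach in the mongo shell (the per-session collection name is hypothetical, and cleanup is driven by your own session/timeout logic):
// Materialize the aggregation result into a per-session cache collection.
db.article.aggregate([
    { $group: { _id: "$author", posts: { $sum: 1 } } },
    { $out: "cache_session_1234" } // hypothetical per-session name
]);
// Index it so sorted, paginated reads are cheap.
db.cache_session_1234.createIndex({ posts: -1 });
var total = db.cache_session_1234.count();
var page = db.cache_session_1234.find().sort({ posts: -1 }).skip(20).limit(10).toArray();
// On session close / cache timeout:
db.cache_session_1234.drop();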

I get the total count with aggregate().toArray().length.

Result from "aggregate with unwind" is different from the "find with count"?

Here are a few documents from my collection:
{"make":"Lenovo", "model":"Thinkpad T430"},
{"make":"Lenovo", "model":"Thinkpad T430", "problems":["Battery"]},
{"make":"Lenovo", "model":"Thinkpad T430", "problems":["Battery","Brakes"]}
As you can see, some documents have no problems, some have only one problem, and some have a few problems in a list.
I want to calculate how many reviews have a specific "problem" (like "Battery") in problems list.
I have tried to use the following aggregate command:
db.reviews.aggregate(
    { $match : { model : "Thinkpad T430" } },
    { $unwind : "$problems" },
    { $group : {
        _id : '$problems',
        count : { $sum : 1 }
    }}
)
For the "Battery" problem the count was 382. I also decided to double-check this result with find() and count():
db.reviews.find({model:"Thinkpad T430",problems:"Battery"}).count()
Result was 362.
Why do I have this difference? And what is the right way to calculate it?
You likely have documents in the collection where problems contains more than one "Battery" string in the array.
When using $unwind, these will each result in their own doc, so the subsequent $group operation will count them separately.
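One hedged way to make the aggregation agree with find().count() is to de-duplicate each document's problems before counting (a sketch reusing the question's field names):
db.reviews.aggregate(
    { $match : { model : "Thinkpad T430" } },
    { $unwind : "$problems" },
    // Collapse repeated problem strings within the same document first...
    { $group : { _id : { review: "$_id", problem: "$problems" } } },
    // ...then count one entry per (document, problem) pair.
    { $group : { _id : "$_id.problem", count: { $sum: 1 } } }
)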

Mongoose limit/offset and count query

Bit of an odd one on query performance... I need to run a query which does a total count of documents, and can also return a result set that can be limited and offset.
So, I have 57 documents in total, and the user wants 10 documents offset by 20.
I can think of 2 ways of doing this: the first is to query for all 57 documents (returned as an array), then use array.slice to return the documents they want. The second option is to run 2 queries, the first one using mongo's native 'count' method, then a second query using mongo's native $limit and $skip aggregators.
Which do you think would scale better? Doing it all in one query, or running two separate ones?
Edit:
// 1 query
var limit = 10;
var offset = 20;
Animals.find({}, function (err, animals) {
    if (err) {
        return next(err);
    }
    res.send({ count: animals.length, animals: animals.slice(offset, limit + offset) });
});
// 2 queries
// Mongoose's find signature is (conditions, projection, options, callback).
Animals.find({}, null, { limit: 10, skip: 20 }, function (err, animals) {
    if (err) {
        return next(err);
    }
    Animals.count({}, function (err, count) {
        if (err) {
            return next(err);
        }
        res.send({ count: count, animals: animals });
    });
});
I suggest you use 2 queries:
db.collection.count() will return the total number of items. This value is stored somewhere in Mongo and it is not calculated.
db.collection.find().skip(20).limit(10) - here I assume you use a sort by some field, so do not forget to add an index on this field. This query will be fast too.
I think that you shouldn't query all items and then perform skip and take, because later, when you have big data, you will have problems with data transferring and processing.
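As a brief sketch of the two-query approach with a recent Mongoose (hedged: assumes a version where queries return promises; countDocuments is the modern replacement for count, and the sort field/index are placeholders):
// Run the count and the page query concurrently.
const [count, animals] = await Promise.all([
    Animals.countDocuments({}),
    Animals.find({}).sort({ name: 1 }).skip(20).limit(10).exec() // assumes an index on { name: 1 }
]);
res.send({ count: count, animals: animals });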
Instead of using 2 separate queries, you can use aggregate() in a single query:
With $facet you can fetch the total count and the data (with skip & limit) more quickly:
db.collection.aggregate([
    //{ $sort : { ... } },
    //{ $match : { ... } },
    { $facet : {
        "stage1" : [ { "$group" : { _id : null, count : { $sum : 1 } } } ],
        "stage2" : [ { "$skip" : 0 }, { "$limit" : 2 } ]
    }},
    { $unwind : "$stage1" },
    // output projection
    { $project : {
        count : "$stage1.count",
        data : "$stage2"
    }}
]);
The output looks as follows:
[{
    count: 50,
    data: [
        {...},
        {...}
    ]
}]
Also, have a look at https://docs.mongodb.com/manual/reference/operator/aggregation/facet/
db.collection_name.aggregate([
    { '$match' : { } },
    { '$sort' : { '_id' : -1 } },
    { '$facet' : {
        metadata : [ { $count: "total" } ],
        data : [ { $skip: 1 }, { $limit: 10 }, { '$project' : { "_id": 0 } } ] // add a projection here if you wish to re-shape the docs
    }}
])
Instead of using two queries to find the total count and to skip the matched records, $facet is the best and most optimized way:
Match the records
Find the total_count
Skip the records
And you can also reshape the data according to your needs in the query.
There is a library that will do all of this for you, check out mongoose-paginate-v2
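A quick usage sketch with mongoose-paginate-v2 (hedged; check the library's docs for the exact option names in your version):
const mongoose = require('mongoose');
const mongoosePaginate = require('mongoose-paginate-v2');

const animalSchema = new mongoose.Schema({ name: String });
animalSchema.plugin(mongoosePaginate); // adds Model.paginate()
const Animal = mongoose.model('Animal', animalSchema);

// Returns docs for the requested page plus totalDocs, totalPages, etc.
Animal.paginate({}, { page: 3, limit: 10 }).then(function (result) {
    console.log(result.totalDocs, result.docs.length);
});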
After having to tackle this issue myself, I would like to build upon user854301's answer.
With Mongoose ^4.13.8 I was able to use a function called toConstructor(), which allowed me to avoid building the query multiple times when filters are applied. I know this function is available in older versions too, but you'll have to check the Mongoose docs to confirm this.
The following uses Bluebird promises:
let schema = Query.find({ name: 'bloggs', age: { $gt: 30 } });
// save the query as a 'template'
let query = schema.toConstructor();

return Promise.join(
    schema.count().exec(),
    query().limit(limit).skip(skip).exec(),
    function (total, data) {
        return { data: data, total: total };
    }
);
Now the count query will return the total records it matched and the data returned will be a subset of the total records.
Please note the () around query() which constructs the query.
You don't have to use two queries or one complicated query with aggregate and such.
You can use one query. Example:
const getNames = async (queryParams) => {
    const cursor = db.collection.find(queryParams).skip(20).limit(10);
    return {
        count: await cursor.count(),
        data: await cursor.toArray()
    };
};
Mongo returns a cursor that has predefined functions such as count, which will return the full count of the queried results regardless of skip and limit.
So in the count property you will get the full count of documents matching the query, and in data you will get just the chunk with an offset of 20 and a limit of 10 documents.
Thanks Igor Igeto Mitkovski; the best solution is to use a native connection.
The documentation is here: https://docs.mongodb.com/manual/reference/method/cursor.count/#mongodb-method-cursor.count
Mongoose doesn't support it (https://github.com/Automattic/mongoose/issues/3283), so we have to use the native connection.
const query = StudentModel.collection.find(
    {
        age: 13
    },
    {
        projection: { _id: 0 }
    }
).sort({ time: -1 });

const count = await query.count();
const records = await query.skip(20).limit(10).toArray();

How to get the last N records in mongodb?

I can't find this documented anywhere. By default, the find() operation gets records from the beginning. How can I get the last N records in MongoDB?
Edit: I also want the returned result ordered from least recent to most recent, not the reverse.
If I understand your question, you need to sort in ascending order.
Assuming you have some id or date field called "x" you would do ...
.sort()
db.foo.find().sort({x:1});
The 1 will sort ascending (oldest to newest) and -1 will sort descending (newest to oldest).
If you use the auto-created _id field, it has a date embedded in it ... so you can use that to order by ...
db.foo.find().sort({_id:1});
That will return all your documents sorted from oldest to newest.
Natural Order
You can also use the natural order mentioned above ...
db.foo.find().sort({$natural:1});
Again, using 1 or -1 depending on the order you want.
Use .limit()
Lastly, it's good practice to add a limit when doing this sort of wide-open query, so you could do either ...
db.foo.find().sort({_id:1}).limit(50);
or
db.foo.find().sort({$natural:1}).limit(50);
The last N added records, from least recent to most recent, can be seen with this query:
db.collection.find().skip(db.collection.count() - N)
If you want them in the reverse order:
db.collection.find().sort({ $natural: -1 }).limit(N)
If you install Mongo-Hacker you can also use:
db.collection.find().reverse().limit(N)
If you get tired of writing these commands all the time you can create custom functions in your ~/.mongorc.js. E.g.
function last(N) {
    return db.collection.find().skip(db.collection.count() - N);
}
then from a mongo shell just type last(N)
Sorting, skipping and so on can be pretty slow, depending on the size of your collection.
Better performance can be achieved if your collection is indexed by some criteria; then you can use the min() cursor:
First, index your collection with db.collectionName.createIndex( yourIndex ).
You can use ascending or descending order, which is handy, because you always want the "N last items" ... so if you index in descending order, it is the same as getting the "first N items".
Then you find the first item of your collection and use its index field values as the min criteria in a search like:
db.collectionName.find().min(minCriteria).hint(yourIndex).limit(N)
Here's the reference for min() cursor: https://docs.mongodb.com/manual/reference/method/cursor.min/
In order to get the last N records you can execute the query below:
db.yourcollectionname.find({$query: {}, $orderby: {$natural : -1}}).limit(yournumber)
if you want only one last record:
db.yourcollectionname.findOne({$query: {}, $orderby: {$natural : -1}})
Note: in place of $natural you can use one of the fields from your collection.
db.collection.find().sort({$natural: -1 }).limit(5)
@bin-chen,
You can use an aggregation for the latest n entries of a subset of documents in a collection. Here's a simplified example without grouping (which you would be doing between stages 4 and 5 in this case).
This returns the latest 20 entries (based on a field called "timestamp"), sorted ascending. It then projects each document's _id, timestamp and whatever_field_you_want_to_show into the results.
var pipeline = [
    {
        "$match": { // stage 1: filter out a subset
            "first_field": "needs to have this value",
            "second_field": "needs to be this"
        }
    },
    {
        "$sort": { // stage 2: sort the remainder last-first
            "timestamp": -1
        }
    },
    {
        "$limit": 20 // stage 3: keep only 20 of the descending-order subset
    },
    {
        "$sort": { // stage 4: sort back to ascending order
            "timestamp": 1
        }
    },
    {
        "$project": { // stage 5: add any fields you want to show in your results
            "_id": 1,
            "timestamp": 1,
            "whatever_field_you_want_to_show": 1
        }
    }
];
yourcollection.aggregate(pipeline, function resultCallBack(err, result) {
    // account for (err)
    // do something with (result)
});
So, the result would look something like:
{
"_id" : ObjectId("5ac5b878a1deg18asdafb060"),
"timestamp" : "2018-04-05T05:47:37.045Z",
"whatever_field_you_want_to_show" : -3.46000003814697
}
{
"_id" : ObjectId("5ac5b878a1de1adsweafb05f"),
"timestamp" : "2018-04-05T05:47:38.187Z",
"whatever_field_you_want_to_show" : -4.13000011444092
}
Hope this helps.
You can try this method:
Get the total number of records in the collection with
db.dbcollection.count()
Then use skip:
db.dbcollection.find().skip(db.dbcollection.count() - 1).pretty()
You can't "skip" based on the size of the collection, because it will not take the query conditions into account.
The correct solution is to sort from the desired end-point, limit the size of the result set, then adjust the order of the results if necessary.
Here is an example, based on real-world code.
var query = collection.find( { conditions } ).sort({ $natural: -1 }).limit(N);
query.exec(function (err, results) {
    if (err) {
        // handle the error
    }
    else if (results.length == 0) {
        // no matching documents
    }
    else {
        results.reverse(); // put the results into the desired order
        results.forEach(function (result) {
            // do something with each result
        });
    }
});
You can use sort(), limit() and skip() to get the last N records, starting from any skipped value:
db.collection.find().sort({ key: value }).limit(intValue).skip(someIntValue);
Look under Querying: Sorting and Natural Order, http://www.mongodb.org/display/DOCS/Sorting+and+Natural+Order
as well as sort() under Cursor Methods
http://www.mongodb.org/display/DOCS/Advanced+Queries
You may want to use the find options:
http://docs.meteor.com/api/collections.html#Mongo-Collection-find
db.collection.find({}, {sort: {createdAt: -1}, skip:2, limit: 18}).fetch();
Use .sort() and .limit() for that
Use sort in ascending or descending order and then use limit:
db.collection.find({}).sort({ any_field: -1 }).limit(last_n_records);
If you use MongoDB Compass, you can use the sort field to filter.
Use the $slice operator to limit array elements:
GeoLocation.find({}, { name: 1, geolocation: { $slice: -5 } })
    .then((result) => {
        res.json(result);
    })
    .catch((err) => {
        res.status(500).json({ success: false, msg: `Something went wrong. ${err}` });
    });
where geolocation is an array of data; from it we get the last 5 records.
db.collection.find().hint({ $natural : -1 }).sort({ field: 1 }).limit(n) // sort 1 or -1 as needed
According to the MongoDB documentation:
You can specify { $natural : 1 } to force the query to perform a forwards collection scan.
You can also specify { $natural : -1 } to force the query to perform a reverse collection scan.
The last function in the chain should be sort, not limit.
Example:
db.testcollection.find().limit(3).sort({timestamp:-1});

Fast way to find duplicates on indexed column in mongodb

I have a collection of md5 hashes in MongoDB. I'd like to find all duplicates. The md5 column is indexed. Do you know any fast way to do that using map-reduce?
Or should I just iterate over all records and check for duplicates manually?
My current approach using map-reduce iterates over the collection almost twice (assuming that there is a very small number of duplicates):
res = db.files.mapReduce(
    function () {
        emit(this.md5, 1);
    },
    function (key, vals) {
        return Array.sum(vals);
    },
    { out: "md5_counts" } // the shell requires an output target; the name is arbitrary
);
db[res.result].find({ value: { $gt: 1 } }).forEach( // only values > 1 are duplicates
    function (obj) {
        db.duplicates.insert(obj);
    });
I personally found that on big databases (1 TB and more) the accepted answer is terribly slow. Aggregation is much faster. An example is below:
db.places.aggregate(
    { $group : { _id : "$extra_info.id", total : { $sum : 1 } } },
    { $match : { total : { $gte : 2 } } },
    { $sort : { total : -1 } },
    { $limit : 5 }
);
It searches for documents whose extra_info.id is used twice or more, sorts the results in descending order of the count, and prints the first 5 of them.
The easiest way to do it in one pass is to sort by md5 and then process appropriately.
Something like:
var previous_md5;
db.files.find( { "md5" : { $exists: true } }, { "md5" : 1 } ).sort( { "md5" : 1 } ).forEach(function (current) {
    if (current.md5 == previous_md5) {
        db.duplicates.update( { "_id" : current.md5 }, { "$inc" : { count: 1 } }, true );
    }
    previous_md5 = current.md5;
});
That little script sorts the md5 entries and loops through them in order. If an md5 is repeated, the copies will be "back-to-back" after sorting. So we just keep a pointer to previous_md5 and compare it to current.md5. If we find a duplicate, I drop it into the duplicates collection (and use $inc to count the number of duplicates).
This script means that you only have to loop through the primary data set once. Then you can loop through the duplicates collection and perform clean-up.
You can do a group by on that field and then query to get the duplicates (having a count > 1). http://www.mongodb.org/display/DOCS/Aggregation#Aggregation-Group
Although, the fastest thing might be to just do a query which only returns that field and then do the aggregation in the client. Group/map-reduce need access to the whole document, which is much more costly than just providing the data from the index (which is now covered in 1.7.3+).
If this is a general problem you need to run periodically, you might want to keep a collection which is just {md5: value, count: value}, so you can skip the aggregation; it will be extremely fast when you need to cull duplicates.
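A minimal sketch of that idea (the md5_counts collection name is hypothetical; bump the counter whenever a document is inserted into files):
// On every insert into files, upsert the per-hash counter.
db.md5_counts.update(
    { _id: newFile.md5 },
    { $inc: { count: 1 } },
    true // upsert: creates the counter at 1 on first sight
);
// Culling duplicates is then a cheap lookup on the counters:
db.md5_counts.find({ count: { $gt: 1 } });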