Is it possible to count distinct Documents in MongoTemplate? - mongodb

Is it possible to somehow chain distinct(...) and countDocuments(...) in mongoTemplate?
Something like this:
mongoTemplate.getCollection("foo").distinct("bar", Foo.class).countDocuments();
Keep in mind that I will have a few million results, so I don't want to create a bottleneck in the JVM by loading all distinct entities into an array and then taking its size. I would rather get a single number from MongoDB and not burden the JVM.

Yes, it is possible to get the count of distinct documents using mongoTemplate.
Mongo shell query:
db.foo.aggregate([{
    $group: {
        _id: "$bar"
    }
}, {
    $count: "total"
}]);
The output of this query will be:
{
"total" : 8
}
To get this result using MongoTemplate:
GroupOperation groupOperation = Aggregation.group("bar");
CountOperation countOperation = Aggregation.count().as("total");
Aggregation aggregation = Aggregation.newAggregation(groupOperation, countOperation);
Document result = mongoTemplate.aggregate(aggregation, "foo", Document.class)
.getUniqueMappedResult();
Integer total = Objects.nonNull(result) ? result.getInteger("total") : 0;

As far as I remember, I used aggregation pipeline operators for this: first group the collection (which yields the distinct values), then count the groups.
For example:
Aggregation pipeline = newAggregation(
group(fields("foo","bar")),
group("_id.bar").count().as("distinctCount")
);
Otherwise, use the following one-liner:
return mongoTemplate.aggregate(aggregation, Class.COLLECTION_NAME, BasicDBObject.class).getUniqueMappedResult();
NOTE: getUniqueMappedResult() returns null when the aggregation produces no documents. If the method extracts the count from this result and returns it, declare the return type as Integer or Long rather than int or long: primitives cannot hold null, so the no-data case could not be represented.

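You can use a MongoDB aggregate with $group.
db.foo.aggregate([{
    '$group': {
        '_id': '_id',
        'count': {
            '$sum': 1
        }
    }
}]);
You will get:
{ "_id": "_id", "count": 12 }
Note that '_id' here is a literal string rather than a field reference, so every document falls into a single group and the result is the total number of documents, not the number of distinct values.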

Related

Mongodb aggregate $count

I would like to count the number of documents returned by an aggregation.
I'm sure my initial aggregation works, because I use it later in my program. To do so I created a pipeline variable (here called pipelineTest; ask me if you want to see it in detail, but it's quite long, which is why I don't include it here).
To count the number of documents returned, I push the following stage onto my pipeline:
{$count: "totalCount"}
Now I would like to get (or log) the totalCount value. What should I do?
Here is the aggregation :
pipelineTest.push({$count: "totalCount"});
cursorTest = collection.aggregate(pipelineTest, options)
console.log(cursorTest.?)
Thanks for your help. I have read a lot of documentation about aggregation and I still don't understand how to read the result of an aggregation...
Assuming you're using async/await syntax, you need to await the result of the aggregation.
You can convert the cursor to an array, get the first element of that array and access totalCount.
pipelineTest.push({$count: "totalCount"});
cursorTest = await collection.aggregate(pipelineTest, options).toArray();
console.log(cursorTest[0].totalCount);
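One caveat with reading cursorTest[0].totalCount directly: when no documents reach the $count stage, the stage emits nothing, so the array is empty and cursorTest[0] is undefined. A minimal sketch of a guard follows; readCount is a hypothetical helper name, and the sample arrays are assumed stand-ins for what toArray() would return.

```javascript
// Hypothetical helper: extract the value produced by a trailing $count stage.
// `docs` is the array you get from `await cursor.toArray()`.
function readCount(docs, field = "totalCount") {
  // $count emits no document at all when its input is empty,
  // so fall back to 0 instead of reading a property of undefined.
  return docs.length > 0 ? docs[0][field] : 0;
}

// Assumed sample data standing in for real query output.
console.log(readCount([{ totalCount: 3 }]));
console.log(readCount([]));
```

With this in place, the last line of the answer above could become console.log(readCount(cursorTest)).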
Aggregation
db.mycollection.aggregate([
{
$count: "totalCount"
}
])
Result
[ { totalCount: 3 } ]
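Your Particulars
Try the following (inside an async function); the count is a field of the first result document, not a property of the cursor:
pipelineTest.push({$count: "totalCount"});
cursorTest = await collection.aggregate(pipelineTest, options).toArray();
console.log(cursorTest[0].totalCount);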

Split a string during MongoDB aggregate

Currently, I have just fullname stored in the User collection in MongoDB. I'd like to run a report that splits the first and last name so for now I'm trying to run an aggregate and split the string when a whitespace is found.
Here is what I have now, but I'd like to replace the hard coded end position with a variable based on where whitespace is found. Is this possible in an aggregate pipeline?
db.users.aggregate([{
    $project: {
        fullname: { $toUpper: "$fullname" },
        first: { $substr: [ "$fullname", 0, 2 ] },
        _id: 0
    }
}, {
    $sort: { fullname: 1 }
}]);
Before MongoDB 3.4, the aggregation framework does not have any operator to perform a "split" on a matched character. There is only $substr, which requires an index, and there is no operator to return the index of a matched character either.
You could use mapReduce, which can use JavaScript .split(). There is no "sort stage" in mapReduce, but results are always sorted by the emitted key before any reduce is attempted (and reduce is never applied here, since all keys are unique):
db.users.mapReduce(
function() {
var lastName = this.fullname.split(/\s/).reverse()[0].toUpperCase();
emit({ "lastName": lastName, "orig": this._id },this);
},
function(){}, // Never called on all unique
{ "out": { "inline": 1 } }
);
And that will basically extract the last name after a whitespace, convert it to uppercase and use it as a composite value in the primary key so results will be sorted by that key ( note you cannot use _id as any part of the key name or it will be sorted by that field instead ).
But if your real case here is "sorting", then you are better off storing the data that way, thus giving you a direct value to sort on without calculation:
var bulk = db.users.initializeOrderedBulkOp(),
    count = 0;
db.users.find().forEach(function(user) {
    bulk.find({ "_id": user._id }).updateOne({
        "$set": { "lastName": user.fullname.split(/\s/).reverse()[0].toUpperCase() }
    });
    count++;
    // Send the queued updates in batches of 1000.
    if ( count % 1000 == 0 ) {
        bulk.execute();
        bulk = db.users.initializeOrderedBulkOp();
    }
});
// Flush any remaining queued updates.
if ( count % 1000 != 0 )
    bulk.execute();
Then with a solid field in place you just run your sort:
db.users.find().sort({ "lastName": 1 });
Which is going to be a lot faster than trying to calculate a value from which to perform a sort.
Of course, if sorting is not the purpose and it's just for presentation, then perform the split in client code, where it makes the most sense to do so. In the versions discussed here, the aggregation framework cannot restructure the data like that, and while mapReduce "could", its output is very opinionated and not really intended for such an operation.
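On MongoDB 3.4 or later, the $split and $arrayElemAt operators make the split possible inside the pipeline. The sketch below only builds the pipeline as plain objects and does not talk to a server. The fullname field comes from the question; buildNameSplitPipeline is a hypothetical name, and the sketch assumes the first whitespace-separated token is the first name and the last token is the last name.

```javascript
// Hypothetical helper: build a pipeline that splits a full name on spaces.
// Assumes MongoDB 3.4+ ($split and $arrayElemAt with negative indexes).
function buildNameSplitPipeline(field = "fullname") {
  const parts = { $split: ["$" + field, " "] };
  return [
    {
      $project: {
        _id: 0,
        fullname: { $toUpper: "$" + field },
        first: { $arrayElemAt: [parts, 0] },  // first token
        last: { $arrayElemAt: [parts, -1] },  // last token
      },
    },
    { $sort: { fullname: 1 } },
  ];
}

// Print the pipeline; in the shell it would be passed to db.users.aggregate(...).
console.log(JSON.stringify(buildNameSplitPipeline(), null, 2));
```

Names with more than two parts lose their middle tokens in this sketch, so storing a dedicated lastName field as described above remains the more robust option when sorting matters.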

Cast string to number in Mongodb [duplicate]

I have a collection of documents that have a value that is known to be a number, but is stored as a string. It is out of my control to change the type of the field, but I want to use that field in an aggregation (say, to average it).
It seems that I should be using a projection prior to grouping, and in that projection convert the field as needed. I can't seem to get the syntax just right - everything I try either gives me NaN, or the new field is simply missing from the next step in the aggregation.
$project: {
value: '$value',
valueasnumber: ????
}
Given the very simple example above, where the contents of $value in all documents are string type, but will parse to a number, what do I do to make valueasnumber a new (non-existing) field that is of type double with the parsed version of $value in it?
I've tried things like the examples below (and about a dozen similar things):
{ $add: new Number('$value').valueOf() }
new Number('$value').valueOf()
Am I barking up the wrong tree entirely? Any help would be greatly appreciated!
(To be 100% clear, below is how I would like to use the new field).
$group: {
score: {
$avg: '$valueasnumber'
}
}
One way I can think of is to use a mongo shell script to add a new numeric field, valueasnumber (the number conversion of the existing string field 'value'), either to the existing document or to a new one, and then use this numeric field for further calculations.
db.numbertest.find().forEach(function(doc) {
doc.valueasnumber = new NumberInt(doc.value);
db.numbertest.save(doc);
});
Using the valueasnumber field for numeric calculation
db.numbertest.aggregate([{$group :
{_id : null,
"score" : {$avg : "$valueasnumber"}
}
}]);
The core operation, converting the value from string to number, cannot currently be handled in an aggregation pipeline.
mapReduce is an alternative, as shown below.
db.c.mapReduce(function() {
emit( this.groupId, {score: Number(this.value), count: 1} );
}, function(key, values) {
var score = 0, count = 0;
for (var i = 0; i < values.length; i++) {
score += values[i].score;
count += values[i].count;
}
return {score: score, count: count};
}, {finalize: function(key, value) {
return {score: value.score / value.count};
}, out: {inline: 1}});
Conversion operators such as $toInt are now available in aggregation (MongoDB 4.0 and later); see:
https://jira.mongodb.org/browse/SERVER-11400
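With MongoDB 4.0 or later, the conversion can therefore happen in the pipeline itself. The following sketch builds such a pipeline as plain objects. It uses $convert with onError and onNull rather than $toDouble, so that unparseable or missing strings become null (which $avg ignores) instead of failing the whole aggregation. buildAveragePipeline is a hypothetical name; value, valueasnumber and score come from the question.

```javascript
// Hypothetical helper: average a numeric value that is stored as a string.
// Assumes MongoDB 4.0+, where the $convert operator is available.
function buildAveragePipeline(field = "value") {
  return [
    {
      $project: {
        valueasnumber: {
          // Bad or missing input becomes null instead of raising an error.
          $convert: { input: "$" + field, to: "double", onError: null, onNull: null },
        },
      },
    },
    // $avg skips non-numeric values such as null.
    { $group: { _id: null, score: { $avg: "$valueasnumber" } } },
  ];
}

console.log(JSON.stringify(buildAveragePipeline(), null, 2));
```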

sorting documents in mongodb

Let's say I have four documents in my collection:
{u'a': {u'time': 3}}
{u'a': {u'time': 5}}
{u'b': {u'time': 4}}
{u'b': {u'time': 2}}
Is it possible to sort them by the field 'time' which is common in both 'a' and 'b' documents?
Thank you
No, you should put your data into a common format so you can sort it on a common field. It can still be nested if you want but it would need to have the same path.
You can use aggregation; the following code has been tested.
db.test.aggregate({
$project: {
time: {
"$cond": [{
"$gt": ["$a.time", null]
}, "$a.time", "$b.time"]
}
}
}, {
$sort: {
time: -1
}
});
Or if you also want the original fields returned back: gist
Alternatively, you can sort after getting the results back, using a custom compare function (not tested, for illustration purposes only):
db.eval(function() {
return db.mycollection.find().toArray().sort( function(doc1, doc2) {
var time1 = doc1.a? doc1.a.time:doc1.b.time,
time2 = doc2.a?doc2.a.time:doc2.b.time;
return time1 -time2;
})
});
You can, using the aggregation framework.
The trick here is to $project a common field to all the documents so that the $sort stage can use the value in that field to sort the documents.
The $ifNull operator can be used to check whether a.time exists. If it does, the record is sorted by that value; otherwise, by b.time.
Code:
db.t.aggregate([
{$project:{"a":1,"b":1,
"sortBy":{$ifNull:["$a.time","$b.time"]}}},
{$sort:{"sortBy":-1}},
{$project:{"a":1,"b":1}}
])
Consequences of this approach:
The aggregation pipeline won't be covered by any index you create.
The performance will be very poor for very large data sets.
What you could ideally do is to ask the source system that is sending you the data to standardize its format, something like:
{"a":1,"time":5}
{"b":1,"time":4}
That way your query can make use of the index if you create one on the time field.
db.t.ensureIndex({"time":-1});
code:
db.t.find({}).sort({"time":-1});

MongoDB - Aggregation Framework (Total Count)

When running a normal "find" query on MongoDB I can get the total result count (regardless of limit) by running "count" on the returned cursor. So, even if I limit the result set to 10 (for example) I can still know that the total number of results was 53 (again, for example).
If I understand it correctly, the aggregation framework, however, doesn't return a cursor but simply the results. And so, if I used the $limit pipeline operator, how can I know the total number of results regardless of said limit?
I guess I could run the aggregation twice (once to count the results via $group, and once with $limit for the actual limited results), but this seems inefficient.
An alternative approach could be to attach the total number of results to the documents (via $group) prior to the $limit operation, but this also seems inefficient as this number will be attached to every document (instead of just returned once for the set).
Am I missing something here? Any ideas? Thanks!
For example, if this is the query:
db.article.aggregate(
{ $group : {
_id : "$author",
posts : { $sum : 1 }
}},
{ $sort : { posts: -1 } },
{ $limit : 5 }
);
How would I know how many results are available (before $limit)? The result isn't a cursor, so I can't just run count on it.
There is a solution using $push and $slice: https://stackoverflow.com/a/39784851/4752635 (@emaniacs mentions it here as well).
But I prefer using 2 queries. The solution that pushes $$ROOT and uses $slice runs into the 16MB document size limit for large collections. Also, for large collections, two queries together seem to run faster than the single one that pushes $$ROOT. You can run them in parallel as well, so you are limited only by the slower of the two queries (probably the one that sorts).
The first query filters and then groups to get the number of filtered elements. Do not sort here; it is unnecessary.
The second query filters, sorts and paginates.
I have settled on this solution using 2 queries and the aggregation framework (note: I use node.js in this example):
var aggregation = [
{
// If you can match fields at the beginning, match as many as possible as early as possible.
$match: {...}
},
{
// Projection.
$project: {...}
},
{
// Some things you can match only after projection or grouping, so do it now.
$match: {...}
}
];
// Copy the filtering stages from the pipeline; they are the same for both the count query and the pagination query.
var aggregationPaginated = aggregation.slice(0);
// Count filtered elements.
aggregation.push(
{
$group: {
_id: null,
count: { $sum: 1 }
}
}
);
// Sort in pagination query.
aggregationPaginated.push(
{
$sort: sorting
}
);
// Paginate.
aggregationPaginated.push(
{
$limit: skip + length
},
{
$skip: skip
}
);
// I use mongoose.
// Get total count.
model.count(function(errCount, totalCount) {
// Count filtered.
model.aggregate(aggregation)
.allowDiskUse(true)
.exec(
function(errFind, documents) {
if (errFind) {
// Errors.
res.status(503);
return res.json({
'success': false,
'response': 'err_counting'
});
}
else {
// Number of filtered elements ($group emits nothing when no documents match).
var numFiltered = documents.length ? documents[0].count : 0;
// Filter, sort and paginate.
model.request.aggregate(aggregationPaginated)
.allowDiskUse(true)
.exec(
function(errFindP, documentsP) {
if (errFindP) {
// Errors.
res.status(503);
return res.json({
'success': false,
'response': 'err_pagination'
});
}
else {
return res.json({
'success': true,
'recordsTotal': totalCount,
'recordsFiltered': numFiltered,
'response': documentsP
});
}
});
}
});
});
Assaf, there are going to be some enhancements to the aggregation framework in the near future that may allow you to do your calculations in one pass easily, but right now it is best to run two queries in parallel: one to aggregate the #posts for your top authors, and another aggregation to calculate the total posts for all authors. Also, note that if all you need is a count of documents, the count function is a very efficient way of performing the calculation. MongoDB caches counts within btree indexes, allowing for very quick counts on queries.
If these aggregations turn out to be slow, there are a couple of strategies. First, start the query with a $match where applicable to reduce the result set; a $match can also be sped up by indexes. Second, you can perform these calculations as pre-aggregations. Instead of possibly running these aggregations every time a user accesses some part of your app, have them run periodically in the background and store the results in a collection of pre-aggregated values. This way, your pages can simply query the pre-calculated values from this collection.
The $facet aggregation stage can be used for MongoDB versions >= 3.4.
It forks the pipeline at a particular stage into multiple sub-pipelines, which in this case lets you build one sub-pipeline to count the number of documents and another for sorting, skipping and limiting.
This avoids running the same stages multiple times in multiple requests.
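A minimal sketch of that idea, built as plain objects so it can be inspected without a server. withTotalCount and unpackFacet are hypothetical helper names, and the facet keys total and page are arbitrary choices; the $group and $sort stages are taken from the question. Keep in mind that $facet returns everything in a single document, so the page branch must stay under the 16MB document limit.

```javascript
// Hypothetical helper: append a $facet stage that forks the pipeline into a
// counting branch and a paging branch (assumes MongoDB 3.4+).
function withTotalCount(basePipeline, skip, limit) {
  return [
    ...basePipeline,
    {
      $facet: {
        total: [{ $count: "count" }],
        page: [{ $skip: skip }, { $limit: limit }],
      },
    },
  ];
}

// Hypothetical helper: unpack the single document a $facet stage returns.
function unpackFacet(docs) {
  const doc = docs[0] || { total: [], page: [] };
  // The "total" branch is an empty array when nothing reached $count.
  return { total: doc.total.length ? doc.total[0].count : 0, page: doc.page };
}

const base = [
  { $group: { _id: "$author", posts: { $sum: 1 } } },
  { $sort: { posts: -1 } },
];
const pipeline = withTotalCount(base, 0, 5);
console.log(JSON.stringify(pipeline, null, 2));
```

The array returned by running this pipeline would be passed to unpackFacet to obtain both the total number of groups and the requested page from one request.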
If you don't want to run two queries in parallel (one to aggregate the #posts for your top authors, and another aggregation to calculate the total posts for all authors), you can just remove $skip and $limit from the pipeline and apply them to the results on the client:
totalCount = results.length;
results.slice(number of skip,number of skip + number of limit);
Example:
db.article.aggregate([
{ $group : {
_id : "$author",
posts : { $sum : 1 }
}},
{ $sort : { posts: -1 } }
// { $skip : yourSkip },   // remove this
// { $limit : yourLimit }, // remove this too
]).exec(function(err, results){
var totalCount = results.length; // get the total count here
var page = results.slice(yourSkip, yourSkip + yourLimit); // slice() returns a new array, so keep the result
});
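The client-side part of this approach can be sketched as a small pure function. paginate is a hypothetical helper name and the sample array is assumed data standing in for the aggregation output. Note that every group is still transferred to the client, so this only suits result sets that fit comfortably in memory.

```javascript
// Hypothetical helper: paginate an already-fetched result array on the client.
function paginate(results, skip, limit) {
  return {
    totalCount: results.length,               // count before paging
    page: results.slice(skip, skip + limit),  // slice() does not modify `results`
  };
}

// Assumed sample data standing in for the output of the aggregation.
const results = [
  { _id: "ann", posts: 9 },
  { _id: "bob", posts: 7 },
  { _id: "cy", posts: 4 },
];
const { totalCount, page } = paginate(results, 1, 1);
console.log(totalCount, page);
```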
I had the same problem and solved it with $project, $slice and $$ROOT.
db.article.aggregate(
{ $group : {
_id : '$author',
posts : { $sum : 1 },
articles: {$push: '$$ROOT'},
}},
{ $sort : { posts: -1 } },
{ $project: {total: '$posts', articles: {$slice: ['$articles', from, to]}}}
).toArray(function(err, result){
var articles = result[0].articles;
var total = result[0].total;
});
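You need to declare the from and to variables. Note that in the three-argument form of $slice the second argument is the start position and the third is the number of elements to return, not an end index.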
https://docs.mongodb.com/manual/reference/operator/aggregation/slice/
In my case, we use the $out stage to dump the result set from the aggregation into a temp/cache table, then count it. Since we need to sort and paginate the results, we add an index on the temp table, save the table name in the session, and remove the table when the session closes or the cache times out.
I get total count with aggregate().toArray().length