Spring WebFlux + MongoDB: Tailable Cursor and Aggregation - mongodb

I´m new with WebFlux and MongoDB. I´m trying to use aggregation in a capped collection with tailable cursor, but I´m nothing getting sucessful.
I´d like to execute this mongoDB query:
db.structures.aggregate(
[
{
$match: {
id: { $in: [8244, 8052]}
}
},
{ $sort: { id: 1, lastUpdate: 1} },
{
$group:
{
_id: {id: "$id"},
lastUpdate: { $last: "$lastUpdate" }
}
}
]
)
ReactiveMongoOperations gives me option to "tail" or "aggregation".
I´m able to execute aggregation:
MatchOperation match = new MatchOperation(Criteria.where("id").in(8244, 8052));
GroupOperation group = Aggregation.group("id", "$id").last("$lastUpdate").as("lastUpdate");
Aggregation aggregate = Aggregation.newAggregation(match, group);
Flux<Structure> result = mongoOperation.aggregate(aggregate,
"structures", Structure.class);
Or tail cursor
Query query = new Query();
query.addCriteria(Criteria.where("id").in(8244, 8052));
Flux<Structure> result = mongoOperation.tail(query, Structure.class);
Is it possible? Tail and Aggregation together?
Using aggregation was the way that I found to get only the last inserted document for each id.
Without aggregation I get:
query without aggregation
With aggregation:
query with aggregation
Tks in advance

The tailable cursor query creates a Flux that never completes (never emits onComplete event) and that Flux emits records as they are inserted in the database. Because of that fact I would think aggregations are not allowed by the database engine with the tailable cursor.
So the aggregation doesn't make sense in a way because on every newly inserted record the aggregation would need to be recomputed. Technically you can do a running aggregation where for every returned record you compute the wanted aggregate record and send it downstream.
One possible solution would be to do the aggregations programmatically on the returned "infinite" Flux:
mongoOperation.tail(query, Structure.class)
.groupBy(Structure::id) // create independent Fluxes based on id
.flatMap(groupedFlux ->
groupedFlux.scan((result, nextStructure) -> { // scan is like reduce but emits intermediate results
log.info("intermediate result is: {}", result);
if (result.getLastUpdate() > nextStructure.getLastUpdate()) {
return result;
} else {
result.setLastUpdate(nextStructure.getLastUpdate());
return result;
}
}));
On the other hand you should probably revisit your use case and what you need to accomplish here and see if something other than capped collection should be used or maybe the aggregation part is redundant (i.e. if newly inserted records always have the lastUpdate property larger then the previous record).

Related

what is the difference between MongoDB find and aggregate in below queries?

select records using aggregate:
db.getCollection('stock_records').aggregate(
[
{
"$project": {
"info.created_date": 1,
"info.store_id": 1,
"info.store_name": 1,
"_id": 1
}
},
{
"$match": {
"$and": [
{
"info.store_id": "563dcf3465512285781608802a"
},
{
"info.created_date": {
$gt: ISODate("2021-07-18T21:07:42.313+00:00")
}
}
]
}
}
])
select records using find:
db.getCollection('stock_records').find(
{
'info.store_id':'563dcf3465512285781608802a',
'info.created_date':{ $gt:ISODate('2021-07-18T21:07:42.313+00:00')}
})
What is difference between these queries and which is best for select by id and date condition?
I think your question should be rephrased to "what's the difference between find and aggregate".
Before I dive into that I will say that both commands are similar and will perform generally the same at scale. If you want specific differences is that you did not add a project option to your find query so it will return the full document.
Regarding which is better, generally speaking unless you need a specific aggregation operator it's best to use find instead, it performs better
Now why is the aggregation framework performance "worse"? it's simple. it just does "more".
Any pipeline stage needs aggregation to fetch the BSON for the document then convert them to internal objects in the pipeline for processing - then at the end of the pipeline they are converted back to BSON and sent to the client.
This, especially for large queries has a very significant overhead compared to a find where the BSON is just sent back to the client.
Because of this, if you could execute your aggregation as a find query, you should.
Aggregation is slower than find.
In your example, Aggregation
In the first stage, you are returning all the documents with projected fields
For example, if your collection has 1000 documents, you are returning all 1000 documents each having specified projection fields. This will impact the performance of your query.
Now in the second stage, You are filtering the documents that match the query filter.
For example, out of 1000 documents from the stage 1 you select only few documents
In your example, find
First, you are filtering the documents that match the query filter.
For example, if your collection has 1000 documents, you are returning only the documents that match the query condition.
Here You did not specify the fields to return in the documents that match the query filter. Therefore the returned documents will have all fields.
You can use projection in find, instead of using aggregation
db.getCollection('stock_records').find(
{
'info.store_id': '563dcf3465512285781608802a',
'info.created_date': {
$gt: ISODate('2021-07-18T21:07:42.313+00:00')
}
},
{
"info.created_date": 1,
"info.store_id": 1,
"info.store_name": 1,
"_id": 1
}
)

Is it possible to count distinct Documents in MongoTemplate?

Is it possible to somehow chain distinct(...) and countDocuments(...) in mongoTemplate.
Something like this
mongoTemplate.getCollection("foo").distinct("bar", Foo.class).countDocuments();
To keep in mind I will have a few million results, so I dont want to create a bottleneck in the jvm by getting all all distinct entities into an array and then getting the size of it. I rather want to get a number from MongoDB and dont bother JVM.
Yes, It is possible to get count of distinct documents using mongoTemplate.
Mongo shell query
db.foo.aggregate([{
$group: {
_id: "$bar"
}
}, {
$count: "total"
}]);
Output of this query will be
{
"total" : 8
}
To get this result using MongoTemplate:
GroupOperation groupOperation = Aggregation.group("bar");
CountOperation countOperation = Aggregation.count().as("total");
Aggregation aggregation = Aggregation.newAggregation(groupOperation, countOperation);
Document result = mongoTemplate.aggregate(aggregation, "foo", Document.class)
.getUniqueMappedResult();
Integer total = Objects.nonNull(result) ? result.getInteger("total") : 0;
Last time I remember that I used the Aggregation Pipeline Operators by which I grouped the collection(which will give you distinct values) and then use count() on top of it.
For Example:
Aggregation pipeline = newAggregation(
group(fields("foo","bar")),
group("_id.bar").count().as("distinctCount")
);
Else use the following one liner:
return mongoTemplate.aggregate(aggregation,Class.COLLECTION_NAME,BasicDBObject.class).getMappedResult();
// in this case make sure this function's return type is Integer or Long not int or long
NOTE: in this case, make sure the function's return type is Integer or Long not int or long as int and long are primitive data types and they do not contain null. However, in case, there is no data, the aggregation logic might return null hence the use of Long or Integer (object could be null)
You can use Mongo Aggrigate with $group.
db.foo.aggregate([{
'$group': {
'_id': '_id',
'count': {
'$sum': 1
}
}]);
You will get:
{ "_id":"_id", "count":12}

How to extract the creation date of _id and add it as a new field using the aggregation framework?

I have been trying for a while to extract the insertion date of a mongodb document and add it as a new field to the same document.
I'm trying to do it using the mongo and mongo shell aggregation framework without getting good results.
Here is my query
db.getCollection('my_collection').aggregate(
[
{
$match: {
MY QUERY CRITERIA
}
},
{
$addFields: { "insertTime": "$_id.getTimestamp()" }
}
]
)
I am trying to extracr insertion time from _id using the function getTimestamp() but for sure there is somtehing about aggregation framework syntax that I am missing because I can not do what I am trying to do in my query.
This works perfect:
ObjectId("5c34f746ccb26800019edd53").getTimestamp()
ISODate("2019-01-08T19:17:26Z")
But this does not work at all:
"$_id.getTimestamp()"
What I am missing?
Thanks in advance

How to get pointer of current document to update in updateMany

I have a latest mongodb 3.2 and there is a collection of many items that have timeStamp.
A need to convert milliseconds to Date object and now I use this function:
db.myColl.find().forEach(function (doc) {
doc.date = new Date(doc.date);
db.myColl.save(doc);
})
It took very long time to update 2 millions of rows.
I try to use updateMany (seems it is very fast) but how I can get access to a current document? Is there any chance to rewrite the query above by using updateMany?
Thank you.
You can leverage other bulk update APIs like the bulkWrite() method which will allow you to use an iterator to access a document, manipulate it, add the modified document to a list and then send the list of the update operations in a batch to the server for execution.
The following demonstrates this approach, in which you would use the cursor's forEach() method to iterate the colloction and modify the each document at the same time pushing the update operation to a batch of about 1000 documents which can then be updated at once using the bulkWrite() method.
This is as efficient as using the updateMany() since it uses the same underlying bulk write operations:
var cursor = db.myColl.find({"date": { "$exists": true, "$type": 1 }}),
bulkUpdateOps = [];
cursor.forEach(function(doc){
var newDate = new Date(doc.date);
bulkUpdateOps.push({
"updateOne": {
"filter": { "_id": doc._id },
"update": { "$set": { "date": newDate } }
}
});
if (bulkUpdateOps.length == 1000) {
db.myColl.bulkWrite(bulkUpdateOps);
bulkUpdateOps = [];
}
});
if (bulkUpdateOps.length > 0) { db.myColl.bulkWrite(bulkUpdateOps); }
Current query is the only one solution to set field value by itself or other field value (one could compute some data using more than one field from document).
There is a way to improve performance of that query - when it is executed vis mongo shell directly on server (no data is passed to client).

MongoDB - Aggregation Framework (Total Count)

When running a normal "find" query on MongoDB I can get the total result count (regardless of limit) by running "count" on the returned cursor. So, even if I limit to result set to 10 (for example) I can still know that the total number of results was 53 (again, for example).
If I understand it correctly, the aggregation framework, however, doesn't return a cursor but simply the results. And so, if I used the $limit pipeline operator, how can I know the total number of results regardless of said limit?
I guess I could run the aggregation twice (once to count the results via $group, and once with $limit for the actual limited results), but this seems inefficient.
An alternative approach could be to attach the total number of results to the documents (via $group) prior to the $limit operation, but this also seems inefficient as this number will be attached to every document (instead of just returned once for the set).
Am I missing something here? Any ideas? Thanks!
For example, if this is the query:
db.article.aggregate(
{ $group : {
_id : "$author",
posts : { $sum : 1 }
}},
{ $sort : { posts: -1 } },
{ $limit : 5 }
);
How would I know how many results are available (before $limit)? The result isn't a cursor, so I can't just run count on it.
There is a solution using push and slice: https://stackoverflow.com/a/39784851/4752635 (#emaniacs mentions it here as well).
But I prefer using 2 queries. Solution with pushing $$ROOT and using $slice runs into document memory limitation of 16MB for large collections. Also, for large collections two queries together seem to run faster than the one with $$ROOT pushing. You can run them in parallel as well, so you are limited only by the slower of the two queries (probably the one which sorts).
First for filtering and then grouping by ID to get number of filtered elements. Do not filter here, it is unnecessary.
Second query which filters, sorts and paginates.
I have settled with this solution using 2 queries and aggregation framework (note - I use node.js in this example):
var aggregation = [
{
// If you can match fields at the begining, match as many as early as possible.
$match: {...}
},
{
// Projection.
$project: {...}
},
{
// Some things you can match only after projection or grouping, so do it now.
$match: {...}
}
];
// Copy filtering elements from the pipeline - this is the same for both counting number of fileter elements and for pagination queries.
var aggregationPaginated = aggregation.slice(0);
// Count filtered elements.
aggregation.push(
{
$group: {
_id: null,
count: { $sum: 1 }
}
}
);
// Sort in pagination query.
aggregationPaginated.push(
{
$sort: sorting
}
);
// Paginate.
aggregationPaginated.push(
{
$limit: skip + length
},
{
$skip: skip
}
);
// I use mongoose.
// Get total count.
model.count(function(errCount, totalCount) {
// Count filtered.
model.aggregate(aggregation)
.allowDiskUse(true)
.exec(
function(errFind, documents) {
if (errFind) {
// Errors.
res.status(503);
return res.json({
'success': false,
'response': 'err_counting'
});
}
else {
// Number of filtered elements.
var numFiltered = documents[0].count;
// Filter, sort and pagiante.
model.request.aggregate(aggregationPaginated)
.allowDiskUse(true)
.exec(
function(errFindP, documentsP) {
if (errFindP) {
// Errors.
res.status(503);
return res.json({
'success': false,
'response': 'err_pagination'
});
}
else {
return res.json({
'success': true,
'recordsTotal': totalCount,
'recordsFiltered': numFiltered,
'response': documentsP
});
}
});
}
});
});
Assaf, there's going to be some enhancements to the aggregation framework in the near future that may allow you to do your calculations in one pass easily, but right now, it is best to perform your calculations by running two queries in parallel: one to aggregate the #posts for your top authors, and another aggregation to calculate the total posts for all authors. Also, note that if all you need to do is a count on documents, using the count function is a very efficient way of performing the calculation. MongoDB caches counts within btree indexes allowing for very quick counts on queries.
If these aggregations turn out to be slow there are a couple of strategies. First off, keep in mind that you want start the query with a $match if applicable to reduce the result set. $matches can also be speed up by indexes. Secondly, you can perform these calculations as pre-aggregations. Instead of possible running these aggregations every time a user accesses some part of your app, have the aggregations run periodically in the background and store the aggregations in a collection that contains pre-aggregated values. This way, your pages can simply query the pre-calculated values from this collection.
$facets aggregation operation can be used for Mongo versions >= 3.4.
This allows to fork at a particular stage of a pipeline in multiple sub-pipelines allowing in this case to build one sub pipeline to count the number of documents and another one for sorting, skipping, limiting.
This allows to avoid making same stages multiple times in multiple requests.
If you don't want to run two queries in parallel (one to aggregate the #posts for your top authors, and another aggregation to calculate the total posts for all authors) you can just remove $limit on pipeline and on results you can use
totalCount = results.length;
results.slice(number of skip,number of skip + number of limit);
ex:
db.article.aggregate([
{ $group : {
_id : "$author",
posts : { $sum : 1 }
}},
{ $sort : { posts: -1 } }
//{$skip : yourSkip}, //--remove this
//{ $limit : yourLimit }, // remove this too
]).exec(function(err, results){
var totalCount = results.length;//--GEt total count here
results.slice(yourSkip,yourSkip+yourLimit);
});
I got the same problem, and solved with $project, $slice and $$ROOT.
db.article.aggregate(
{ $group : {
_id : '$author',
posts : { $sum : 1 },
articles: {$push: '$$ROOT'},
}},
{ $sort : { posts: -1 } },
{ $project: {total: '$posts', articles: {$slice: ['$articles', from, to]}},
).toArray(function(err, result){
var articles = result[0].articles;
var total = result[0].total;
});
You need to declare from and to variable.
https://docs.mongodb.com/manual/reference/operator/aggregation/slice/
in my case, we use $out stage to dump result set from aggeration into a temp/cache table, then count it. and, since we need to sort and paginate results, we add index on the temp table and save table name in session, remove the table on session closing/cache timeout.
I get total count with aggregate().toArray().length