Using MapReduce as a Stage in MongoDB Aggregation Pipeline

I want to use MongoDB's mapReduce functionality along with an aggregation query.
Below are the stages which I see could be part of the aggregation pipeline:
1. Filter docs the user has access to, based on content in the docs and the passed security context (roles of the user) (using $redact)
2. Filter based on one or more criteria (using $match)
3. Tokenize the words in the docs returned by the above filtering and populate a collection (using mapReduce), or return the docs inline
4. Query the populated collection/inline docs for words matching user criteria using a like query ($regex), and return the words along with their locations
I am able to achieve steps 1, 2 and 4 in the aggregation pipeline.
I am able to achieve stage 3 separately by using the mapReduce functionality in MongoDB.
I want to make the mapReduce operation a stage in the aggregation pipeline as well, so that it receives the filtered docs from the earlier steps and passes the processed result to the next step.
The mapReduce operation is based on a sample map and reduce operation. I intend to use the map, reduce and finalize functions shared in the Stack Overflow question below:
Implement auto-complete feature using MongoDB search
My question is: I do not know whether a MapReduce operation can be part of the MongoDB aggregation pipeline, and if so, whether we can use its inline output and pass it to the next stage.
I am using Spring Data MongoDB to implement the aggregation solution.
If someone has implemented the same please help me on this.
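Stages 1 and 2 above can be sketched as an aggregation pipeline in pymongo. This is a minimal sketch only: the field names (securityTags, status) and the role list are assumptions for illustration, not taken from the question, and should be adapted to the actual document schema.

```python
# Roles from the caller's security context (illustrative values).
user_roles = ["analyst", "auditor"]

pipeline = [
    # Stage 1: $redact descends into subdocuments whose securityTags
    # intersect the user's roles and prunes everything else.
    {"$redact": {
        "$cond": {
            "if": {"$gt": [
                {"$size": {"$setIntersection": ["$securityTags", user_roles]}},
                0,
            ]},
            "then": "$$DESCEND",
            "else": "$$PRUNE",
        }
    }},
    # Stage 2: $match applies ordinary filter criteria.
    {"$match": {"status": "active"}},
]

# With a live connection this would run as:
# docs = list(db.mycollection.aggregate(pipeline))
```

The remaining tokenization step would still have to run as a separate mapReduce command on the filtered output, since the pipeline above only covers the filtering stages.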

Related

Get data between stages in an aggregation Pipeline

Is it possible to retrieve the documents between stages in a mongo aggregation pipeline?
Imagine that I have an aggregation pipeline running in pymongo with 10 stages, and I want to be able to retrieve some info available after stage 8 that will not be available in the last stage. Is it possible?
The idea is quite similar to this question, and looking at the answers I found $facet, but it wasn't clear to me whether, if the first stage of all outputFields is the same, it will be executed only once and perform as expected. Also, as I saw in the docs, $facet does not support indexes, which is a problem in my case.
To retrieve values of particular fields which are changed in subsequent stages, use $set to duplicate those values into new fields.
To retrieve the result set exactly as it exists after the 8th stage, send the first 8 stages as their own pipeline.
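Both suggestions can be sketched in pymongo. The stage contents and field names below are placeholders standing in for the real 10-stage pipeline, not actual code from the question.

```python
# Option 1: before later stages overwrite a value, snapshot it into a
# new field with $set, so it survives to the final result.
full_pipeline = [
    {"$match": {"type": "order"}},               # ... stages 1..8 ...
    {"$set": {"totalAfterStage8": "$total"}},    # snapshot taken here
    {"$group": {"_id": "$customer",              # later stages may
                "total": {"$sum": "$total"}}},   # replace $total
]

# Option 2: run only the leading stages as their own pipeline to see the
# intermediate result set exactly as it exists at that point.
first_stages = full_pipeline[:2]
# intermediate = list(db.orders.aggregate(first_stages))
```

Note that option 2 runs a second query, so under concurrent writes the two runs may not see identical data; option 1 keeps everything in one pipeline.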

Can you add an aggregate pipeline to document .save() action?

I use mongoose with mongodb and while updating a document, I first find the document, modify the resultant document object and then do a .save() on the document.
Now I want to add an aggregation pipeline to the save operation so as to better control the document response, and I was wondering if this is possible.
I read that the update query can have the pipeline attached to it but does that also apply to the save action?
As far as I know, in the current version of MongoDB (4.4) the only methods that accept aggregation pipelines for updates are update and findAndModify (and their variants), which limits what Mongoose can offer here. What I would recommend in your case is to use the aggregation pipeline with Model.findOneAndUpdate(). Here is an example that you might follow: example of aggregate using Model.findOneAndUpdate()
Also, you might notice that this is the documentation for MongoDB and not Mongoose. I tend to find it difficult to find useful information for more specific use cases like this one in the Mongoose docs, hence the MongoDB link. It will work the same with a Mongoose model, so take a shot!
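The same idea can be sketched in pymongo, since MongoDB 4.2+ accepts an aggregation pipeline (a list of stages) as the update argument. The field names below are assumptions for illustration, and the call itself is shown commented because it needs a live server.

```python
# An update expressed as a pipeline: $set can use aggregation
# expressions, which a plain update document cannot.
update_pipeline = [
    {"$set": {
        "lastModified": "$$NOW",
        "fullName": {"$concat": ["$firstName", " ", "$lastName"]},
    }},
]

# With a live connection (pymongo):
# from pymongo import ReturnDocument
# doc = db.users.find_one_and_update(
#     {"_id": user_id},
#     update_pipeline,                      # a list, i.e. a pipeline
#     return_document=ReturnDocument.AFTER, # return the updated doc
# )
```

Passing `ReturnDocument.AFTER` gives back the post-update document, which covers the "control the document response" part of the question without a separate read.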

How can I use same MongoDb pipeline for two queries with common first stage?

I have 2 queries for which I use the aggregate() pipeline:
[filter,sort1,A,B,C]
[filter,sort2,a,b,c]
Is it possible to share the first stage of the pipeline between both queries? It would also make the two queries transactional if the first stage were common.
I am using the aggregate() function and 4 stages. Similarly, I have 1 more query where I just have to get an aggregate count after the initial filter.
I want both the queries to have consistent data as my database is updating quickly. So I would want to use only 1 pipeline initially for the first stage.
collection.aggregate([initialFilter,sortInitial,group,sortFinal])
I am using pymongo client.
Also how to make 2 queries transactional/atomic in pymongo?
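One way to run both queries over the same snapshot of the data is $facet: the shared leading stages run once, and each facet sub-pipeline continues from their output. A sketch in pymongo, with placeholder stage contents; note the caveat from the docs that facet sub-pipelines cannot use indexes, though the shared $match before $facet still can.

```python
# Shared leading stages, executed once.
initial_filter = {"$match": {"status": "open"}}
sort_initial = {"$sort": {"createdAt": 1}}

pipeline = [
    initial_filter,
    sort_initial,
    {"$facet": {
        # First query: the grouped results.
        "grouped": [
            {"$group": {"_id": "$category", "n": {"$sum": 1}}},
            {"$sort": {"n": -1}},
        ],
        # Second query: just the count after the initial filter.
        "totalCount": [
            {"$count": "count"},
        ],
    }},
]

# With a live connection:
# result = next(collection.aggregate(pipeline))
# result["grouped"] holds the grouped docs,
# result["totalCount"][0]["count"] the count.
```

Because both facets are computed inside one aggregate() call, they see a consistent view of the collection without needing an explicit multi-document transaction.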

Is it possible to get the textScore on mongodb MapReduce?

If you created a text index on MongoDB 2.6, then when you run a find or use the aggregation pipeline framework, you can get the textScore for a given query with the projection:
{'$meta': "textScore"}
This allows you to operate with the textScore in further operations.
Is it possible to access such a value during a map-reduce operation?
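For reference, the aggregation-pipeline usage the question describes looks like this in pymongo. The collection and field names are invented for the example, and the pipeline requires a text index on the collection.

```python
# Project and sort by the relevance score of a $text search.
pipeline = [
    {"$match": {"$text": {"$search": "mongodb"}}},
    {"$project": {"title": 1, "score": {"$meta": "textScore"}}},
    {"$sort": {"score": {"$meta": "textScore"}}},
]

# With a live connection:
# results = list(db.articles.aggregate(pipeline))
```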

MongoDB aggregation comparison: group(), $group and MapReduce

I am somewhat confused about when to use group(), aggregate with $group, or mapReduce. I read the documentation at http://www.mongodb.org/display/DOCS/Aggregation for group() and http://docs.mongodb.org/manual/reference/aggregation/group/#_S_group for $group. Is sharding the only situation where group() won't work? Also, I get the feeling that $group is more powerful than group() because it can be used in conjunction with other pipeline operators from the Aggregation Framework. How does $group compare with mapReduce? I read somewhere that it doesn't generate any temporary collection whereas mapReduce does. Is that so?
Can someone present an illustration or guide me to a link where these three concepts are explained together, taking the same sample data, so I can compare them easily?
EDIT: Also, it would be great if you could point out anything new in these commands since the 2.2 release came out.
It is somewhat confusing since the names are similar, but the group() command is a different feature and implementation from the $group pipeline operator in the Aggregation Framework.
The group() command, Aggregation Framework, and MapReduce are collectively aggregation features of MongoDB. There is some overlap in features, but I'll attempt to explain the differences and limitations of each as at MongoDB 2.2.0.
Note: inline result sets mentioned below refer to queries that are processed in memory with results returned at the end of the function call. Alternative output options (currently only available with MapReduce) could include saving results to a new or existing collection.
group() Command
Simple syntax and functionality for grouping .. analogous to GROUP BY in SQL.
Returns result set inline (as an array of grouped items).
Implemented using the JavaScript engine; custom reduce() functions can be written in JavaScript.
Current Limitations
Will not group into a result set with more than 20,000 keys.
Results must fit within the limitations of a BSON document (currently 16MB).
Takes a read lock and does not allow any other threads to execute JavaScript while it is running.
Does not work with sharded collections.
See also: group() command examples.
MapReduce
Implements the MapReduce model for processing large data sets.
Can choose from one of several output options (inline, new collection, merge, replace, reduce).
MapReduce functions are written in JavaScript.
Supports non-sharded and sharded input collections.
Can be used for incremental aggregation over large collections.
MongoDB 2.2 implements much better support for sharded map reduce output.
Current Limitations
A single emit can only hold half of MongoDB's maximum BSON document size (16MB).
There is a JavaScript lock so a mongod server can only execute one JavaScript function at a point in time .. however, most steps of the MapReduce are very short so locks can be yielded frequently.
MapReduce functions can be difficult to debug. You can use print() and printjson() to include diagnostic output in the mongod log.
MapReduce is generally not intuitive for programmers trying to translate relational query aggregation experience.
See also: Map/Reduce examples.
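To make the MapReduce model concrete, here is a hedged sketch in pymongo of a per-key count (the collection and field names are invented for the example). The JavaScript map/reduce functions are passed to the server as strings, and the command itself is shown commented because it needs a live mongod.

```python
# Map: emit one (key, 1) pair per document; Reduce: sum the values per key.
map_fn = "function() { emit(this.city, 1); }"
reduce_fn = "function(key, values) { return Array.sum(values); }"

# With a live connection (pymongo):
# result = db.command(
#     "mapReduce", "people",
#     map=map_fn, reduce=reduce_fn,
#     out={"inline": 1},   # inline output; could also be a collection name
# )
# result["results"] then holds documents of the form
# {"_id": <city>, "value": <count>}.
```

The `out` option is where MapReduce's extra output flexibility lives: swapping `{"inline": 1}` for `{"merge": "cityCounts"}` or `{"reduce": "cityCounts"}` enables the incremental-aggregation pattern mentioned above.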
Aggregation Framework
New feature in the MongoDB 2.2.0 production release (August, 2012).
Designed with specific goals of improving performance and usability.
Returns result set inline.
Supports non-sharded and sharded input collections.
Uses a "pipeline" approach where objects are transformed as they pass through a series of pipeline operators such as matching, projecting, sorting, and grouping.
Pipeline operators need not produce one output document for every input document: operators may also generate new documents or filter out documents.
Using projections you can add computed fields, create new virtual sub-objects, and extract sub-fields into the top-level of results.
Pipeline operators can be repeated as needed (for example, multiple $project or $group steps).
Current Limitations
Results are returned inline, so are limited to the maximum document size supported by the server (16MB)
Doesn't support as many output options as MapReduce
Limited to operators and expressions supported by the Aggregation Framework (i.e. can't write custom functions)
Newest server feature for aggregation, so has more room to mature in terms of documentation, feature set, and usage.
See also: Aggregation Framework examples.
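As a small illustration of the pipeline approach (pymongo syntax; the collection and field names are invented for the example), the equivalent of an SQL GROUP BY with a count and an average is:

```python
# Match, group, and sort: each stage transforms the documents flowing
# through the pipeline.
pipeline = [
    {"$match": {"active": True}},
    {"$group": {
        "_id": "$city",               # grouping key
        "count": {"$sum": 1},         # documents per city
        "avgAge": {"$avg": "$age"},   # computed aggregate per group
    }},
    {"$sort": {"count": -1}},
]

# With a live connection:
# for doc in db.people.aggregate(pipeline):
#     print(doc)
```

Compared with the MapReduce equivalent, no JavaScript is involved: the whole pipeline is declarative, which is the main source of the performance and usability gains described above.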
Can someone present an illustration or guide me to a link where these three concepts are explained together, taking the same sample data, so I can compare them easily?
You generally won't find examples where it would be useful to compare all three approaches, but here are previous StackOverflow questions which show variations:
group() versus Aggregation Framework
MapReduce versus Aggregation Framework