Running map reduce functions as a job in MongoDB. Is it possible?
If I update the collection with some data, the map reduce functions should run automatically as a job and produce the result in the output collection with the latest data.
Can we achieve this in mongodb?
No. You would need to schedule these outside MongoDB.
What you are asking for sounds like it may be better suited to being a View within Couchbase.
Related
I have an ETL pipeline that is sinking time-series records to MongoDB.
I need to compute time-based aggregations at daily, weekly, and similar resolutions. I assumed MongoDB's aggregation framework would be the way to go, so after writing the aggregation queries for each resolution I wrapped them in MongoDB views like "daily_view", "weekly_view", etc.
There is a REST service that fetches from MongoDB. Depending on the period resolution requested, it pulls from the corresponding view, filtering by start and end dates.
The response times with these views/aggregations are quite poor, around 10-15 seconds. That delay might not be outrageous for batch-computing a report, but in my case the service issues these requests live to serve the frontend, so a 10-second wait is too much.
From the MongoDB reference I know that views are computed on demand during read operations, but I'm a bit disappointed with such response times, because the same aggregations took a fraction of a second in Elasticsearch or InfluxDB, which unfortunately are not options for me at the moment.
I have also exhausted my research into optimizing the queries; there is no room for further improvement there.
My intuition tells me that if the aggregations have to be done with the aggregation framework, I need the pipelines to run continuously in the background (so the views already hold records for the service), as opposed to being run ad hoc on every request.
I've tried dropping the views and instead having an aggregation whose last stage is an $out to a real collection... but I still have the same problem: it needs to be run on demand. I composed the pipelines using the Compass UI, and in the $out stage it presents a button to run the aggregation.
Would there be a way to schedule such pipelines/aggregation queries?
Something I can think of is copy-pasting the aggregation code and turning it into JavaScript functions in the REST service... but something would still have to invoke those functions at a regular interval. I know there are libraries I could bring into the service for scheduling, but that option makes me a bit uncomfortable in architectural terms.
In the worst-case scenario, my backup plan is to implement the time-based aggregations as part of the logic of the initial ETL and sink all the different resolutions to different collections, so the service will find the records it needs already waiting in the aggregated collections. But the intention was to delegate the time aggregations to the datastore engine.
I'm having a bit of last-minute architecture distress now.
$out aggregation stage. Documentation.
Takes the documents returned by the aggregation pipeline and writes them to a specified collection. The $out operator must be the last stage in the pipeline.
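As a rough sketch (collection and field names here are hypothetical), a pipeline that materializes a daily rollup could look like this:

    // Aggregate raw events into daily buckets and write them to "daily_agg".
    // "events", "ts" and "value" are placeholder names for this example.
    db.events.aggregate([
      { $group: {
          _id: { $dateToString: { format: "%Y-%m-%d", date: "$ts" } },
          count: { $sum: 1 },
          avgValue: { $avg: "$value" }
      } },
      { $out: "daily_agg" }   // must be the last stage; replaces daily_agg on each run
    ])

The service can then read from daily_agg directly instead of recomputing the pipeline on every request.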
The mongo shell accepts a JavaScript file as an argument, so this is the easiest way to package your aggregation. Reference.
mongo file.js --username username --password
Then, to execute it on a schedule, common tools like cron come to the rescue.
You might need to account for the differences between the mongo shell and plain JavaScript, such as using db = db.getSiblingDB('<db>') instead of use <db>. See Write Scripts for the mongo Shell.
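Putting the two together, a minimal sketch (the script name, database name, and pipeline are assumptions for illustration):

    // daily_agg.js -- run the pipeline and materialize the result
    db = db.getSiblingDB('metrics');              // instead of `use metrics`
    db.events.aggregate([
      { $group: {
          _id: { $dateToString: { format: "%Y-%m-%d", date: "$ts" } },
          count: { $sum: 1 }
      } },
      { $out: "daily_agg" }
    ]);

with a crontab entry such as (the 15-minute interval is just an example):

    */15 * * * * mongo daily_agg.js --username username --password secret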
Say I have a Mongo aggregation that I know I will use frequently, for example, finding the average of a dataset.
Essentially, I want to provide an API over the database such that someone could type db.collection.average() in the mongo shell and get the result of that function, so that a user without much knowledge of the aggregation framework could easily get the average (or the result of any complicated aggregation function I create). What is the best way to achieve this?
As of MongoDB 3.4, you can create views that wrap a defined aggregation pipeline. Sounds like a perfect fit for your use case.
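For example, a minimal sketch (collection and field names are hypothetical) that exposes an average through a view:

    // Create a view whose pipeline computes the average of "score"
    // over the "scores" collection.
    db.createView(
      "scores_average",     // view name users will query
      "scores",             // source collection
      [ { $group: { _id: null, average: { $avg: "$score" } } } ]
    )

    // Users can now read the result with a plain find(), no knowledge
    // of the aggregation framework required:
    db.scores_average.find()

It is not quite db.collection.average(), but it hides the pipeline behind a queryable name.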
I want to know whether I can execute map reduce on the result of a previous map reduce function, like a pipeline, without writing it to a collection. Thanks all.
Chaining of map reduce is not supported at this time without storing intermediate data in some kind of collection.
Also, map reduce in MongoDB is not very efficient; MongoDB recommends exporting the data and running map reduce in a proper framework like Hadoop if you have to.
Yes, but this could cost you a lot of performance: you have to store the first result in a new collection, then run the next map-reduce on that output collection. See this for more information.
However, you can still pipe query results through the aggregation pipeline; see this. So consider converting your map-reduce to an aggregation.
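A sketch of both options, assuming a hypothetical collection "raw" with documents like { key: ..., value: ... }:

    // Option 1: chain map-reduce through an intermediate collection.
    var mapFn = function () { emit(this.key, this.value); };
    var reduceFn = function (key, values) { return Array.sum(values); };

    db.raw.mapReduce(mapFn, reduceFn, { out: "stage1" });   // first pass
    db.stage1.mapReduce(
      function () { emit(null, this.value); },              // m/r output docs are { _id, value }
      reduceFn,
      { out: "final" }                                      // second pass
    );

    // Option 2: the aggregation pipeline chains stages in memory,
    // with no intermediate collection.
    db.raw.aggregate([
      { $group: { _id: "$key", total: { $sum: "$value" } } },
      { $group: { _id: null, grandTotal: { $sum: "$total" } } }
    ]);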
I have two collections: one with the raw data from the devices, one with the device configuration.
I use map reduce to process the raw data from the devices. I need to access the configuration parameters inside the reduce step.
Is it possible?
Thanks in advance
UPDATE:
I have a lot of data to process: 400,000 documents per day and around 4,000 configuration documents.
Do I therefore have to join the two collections and inject the result into the map/reduce?
Is this the best way to do it?
No. Map/reduce should always use the data of the collection it is invoked on. If you need the configuration data, you will have to make sure it is inside your raw device data before invoking the map/reduce. Since m/r is just JavaScript executed server-side, it is technically possible to query other collections, but it can break (sharding comes to mind).
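A rough sketch of that "denormalize first" approach, with hypothetical collections "raw" (device data keyed by deviceId) and "config" (one document per device; the "threshold" field is a placeholder):

    // Copy the needed configuration parameters into the raw documents
    // before running map/reduce, so m/r never has to leave its collection.
    db.config.find().forEach(function (cfg) {
      db.raw.update(
        { deviceId: cfg._id },
        { $set: { threshold: cfg.threshold } },
        { multi: true }
      );
    });

    // The map/reduce now only reads the raw collection:
    db.raw.mapReduce(
      function () { emit(this.deviceId, this.value > this.threshold ? 1 : 0); },
      function (key, values) { return Array.sum(values); },
      { out: "alerts_per_device" }
    );

With around 4,000 configuration documents, this pre-join loop is cheap compared to the 400,000 daily raw documents.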
Is it possible to run MongoDB commands, like a query to grab additional data or an update, from within MongoDB's MapReduce command? Either in the Map or the Reduce function?
Is this completely ludicrous to do anyway? Currently I have some documents that refer to separate collections using MongoDB DBRefs.
Thanks for the help!
Is it possible to run MongoDB commands... from within MongoDB's MapReduce command.
In theory, this is possible. In practice there are lots of problems with this.
Problem #1: exponential work. M/R is already pretty intense and poorly logged. Adding queries can easily make M/R run out of control.
Problem #2: context. Imagine that you're running a sharded M/R and you are querying into an unsharded collection. Does the current context even have that connection?
You're basically trying to implement JOIN logic and MongoDB has no joins. Instead, you may need to build the final data in a couple of phases by running a few loops on a few sets of data.
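As a rough illustration of that phased approach (all collection and field names here are hypothetical):

    // Phase 1: aggregate the main collection into a scratch collection.
    db.orders.aggregate([
      { $group: { _id: "$customerId", total: { $sum: "$amount" } } },
      { $out: "customer_totals" }
    ]);

    // Phase 2: loop over the results and resolve the references with
    // ordinary queries, instead of querying from inside map/reduce.
    db.customer_totals.find().forEach(function (doc) {
      var customer = db.customers.findOne({ _id: doc._id });
      db.customer_totals.update(
        { _id: doc._id },
        { $set: { customerName: customer ? customer.name : null } }
      );
    });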