Continuously run MongoDB aggregation pipeline

I have an ETL pipeline that sinks time-series records into MongoDB.
I need to compute time-based aggregations at daily, weekly, and similar resolutions. I assumed MongoDB's aggregation framework would be the way to go, so once I had the aggregation queries for each resolution I wrapped them in MongoDB views such as "daily_view", "weekly_view", etc.
There is a REST service that fetches from MongoDB. Depending on the period resolution requested, it pulls from the corresponding view, filtering by start and end dates.
Response times with these views/aggregations are quite poor, around 10-15 seconds. That lapse might not be outrageous for batch-computing a report, but in my case the service issues these requests live to serve the frontend, so a 10-second wait is too much.
From the MongoDB reference I know that views are computed on demand during read operations, but I'm a bit disappointed by such response times, because the same aggregations took fractions of a second in Elasticsearch or InfluxDB, which unfortunately are not options for me at the moment.
I have also exhausted the research on optimizing the queries; there is no room for more improvement there.
My intuition tells me that if the aggregations have to be done via the aggregation framework, the pipelines need to be executing continuously on the fly (so the views already hold records for the service), as opposed to being run ad hoc every time.
I've tried dropping the views and instead ending the aggregation with an $out stage into a real collection... but I still have the same problem: it needs to be run on demand. I composed the pipelines using the Compass UI, and the $out stage presents a button to run the aggregation.
Is there a way to schedule such pipelines/aggregation queries?
One option I can think of is copy-pasting the aggregation code into JavaScript functions in the REST service... but something would still have to invoke those functions at a regular interval. I know there are libraries I could bring into the service for scheduling, but this option makes me a bit uncomfortable architecturally.
In the worst case, my backup plan is to implement the time-based aggregations as part of the logic of the initial ETL and sink the different resolutions into separate collections, so the service will find the aggregated records already waiting. But the intention was to delegate the time aggregations to the datastore engine.
I'm having a bit of last-minute architecture distress now.

$out aggregation stage. Documentation.
Takes the documents returned by the aggregation pipeline and writes them to a specified collection. The $out operator must be the last stage in the pipeline.
The mongo shell accepts a JavaScript file as an argument, so this is the easiest way to package your aggregation. Reference.
mongo file.js --username username --password
Then, to execute it on a schedule, common tools like cron come to the rescue.
You might need to account for the differences between the mongo shell and plain JavaScript, such as using db = db.getSiblingDB('<db>') instead of use <db>. See Write Scripts for the mongo Shell.
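For example, a minimal sketch of such a scheduled script (the database mydb, the events collection, the ts and value fields, and the daily_agg output collection are all hypothetical placeholders for your own schema):

// aggregate_daily.js -- materializes the daily rollup into a real collection
db = db.getSiblingDB('mydb'); // script equivalent of `use mydb`
db.events.aggregate([
  { $group: {
      _id: { $dateToString: { format: '%Y-%m-%d', date: '$ts' } },
      total: { $sum: '$value' },
      count: { $sum: 1 }
  } },
  { $out: 'daily_agg' } // replaces daily_agg with the fresh results
]);

A crontab entry in the spirit of the command above could then refresh the collection, say, every 15 minutes, so the REST service always reads from an already-materialized collection:

*/15 * * * * mongo aggregate_daily.js --username username --password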

Related

Does a running MongoDB aggregation pipeline slow down reads and writes to the affected collection?

As the title suggests, I'd like to know if reads and writes to a collection are delayed/paused while a MongoDB aggregation pipeline is running. I'm considering adding a pipeline in a user collection, and I think the query could sometimes affect a lot of users (possibly tens of thousands), or just run for longer than I expect. So I'm wondering if that will "block" reads and writes to the collection. The server isn't live, so I don't have real user data to inform this decision. I'd appreciate any feedback or suggestions, thanks!
Each server has a certain resource capacity. If you send a query to the server, it has less capacity remaining for other work (be that other queries or writes).
For locking and concurrency in MongoDB, see https://docs.mongodb.com/manual/faq/concurrency/.
If you are planning for high load/high throughput you need to benchmark your specific use case.
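As a rough aid before benchmarking, the shell's explain option shows what a pipeline will do without running it to completion (the users collection and fields here are hypothetical):

// Returns the query plan instead of the full result set
db.users.aggregate(
  [
    { $match: { active: true } },
    { $group: { _id: '$country', n: { $sum: 1 } } }
  ],
  { explain: true }
);

That only reveals the plan, though; actual impact on concurrent reads and writes still has to be measured under representative load.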

MongoDB Aggregation vs simple query performance?

I am re-asking this question because I thought it deserved a separate thread from this one: in-mongodb-know-index-of-array-element-matched-with-in-operator.
I am using MongoDB, and until now I wrote all of my queries as simple queries (find, update, etc., no aggregations). Then I read many SO posts (see this one for example: mongodb-aggregation-match-vs-find-speed) and started thinking about the computation load on the application server: the more I compute there, the higher its load becomes. So I tried using aggregations and thought I was going in the right direction. But later, on my previous question, andreas-limoli told me not to use aggregations because they are slow, and to use simple queries and compute on the application server instead. Now I am literally in a dilemma about which to use. I have been working with MongoDB for a year now, but I have no knowledge of how its performance behaves as data size increases, so I simply don't know which one to pick.
Also, one thing I couldn't find anywhere: if aggregation is slower, is that because of $lookup? $lookup is the main reason I considered aggregation, because otherwise I would have to execute many queries serially and then compute on the server, which looks very poor compared to aggregation.
I also read about the 100MB restriction on MongoDB aggregation when passing data from one pipeline stage to another. How do people handle that case efficiently? And if they turn on disk usage (allowDiskUse), which slows everything down, how do they handle that?
I also fetched a sample collection of 30,000 documents and ran an aggregation with $match against an equivalent find query; the aggregation was slightly faster, taking 180 ms to execute versus 220 ms for find.
Please help me out; it would be really helpful for me.
Aggregation pipelines are costly queries. They can impact your performance as data grows because of CPU and memory usage. If you can achieve the result with a find query, go for it, because aggregation becomes costlier as the data in the DB increases.
The aggregation framework in MongoDB offers functionality similar to join operations in SQL (via $lookup). Aggregation pipelines are generally resource-intensive operations, so if simple queries satisfy your needs, you should use them in the first place.
However, if it is absolutely necessary, for example when you need to fetch data from multiple collections, you can use aggregation pipelines.
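To make the comparison concrete, here is a minimal sketch of the variants discussed above (the users and orders collections and their fields are hypothetical):

// Simple query:
db.users.find({ status: 'active' });

// Equivalent single-stage aggregation:
db.users.aggregate(
  [ { $match: { status: 'active' } } ],
  { allowDiskUse: true } // lets a stage spill to disk past the 100MB limit
);

// $lookup, the join-like stage that usually motivates aggregation:
db.orders.aggregate([
  { $lookup: {
      from: 'users',
      localField: 'userId',
      foreignField: '_id',
      as: 'user'
  } }
]);

For a single indexed $match like this, find and aggregate should perform comparably, which matches your 180 ms vs 220 ms measurement.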

MongoDB: If MapReduce is set to non-atomic, what happens to newly written data?

I'm planning on building an application as follows:
Node server receives logs from mobile devices and inserts them into Mongo as they arrive.
An incremental MapReduce job is run to calculate new fields from the data.
The data is then pre-aggregated by minutes, hours, days, etc.
All the while, the data in mongo is queried by a front-end visualization app.
I have a couple of concerns:
If I set the nonAtomic flag to true, what happens if new data is being written to the db as the MapReduce job runs?
Is it written to the db? If so, I'm assuming this data wouldn't be included in the current incremental MapReduce job.
Or, is the database locked and the write is lost?
As the MapReduce job and then the time aggregations run, can existing data already in the database be served to my front-end?
Thanks!
The following describes MongoDB 2.6. nonAtomic is an option for the out portion of map/reduce. It's not related to how map/reduce is ingesting documents from the source collection, only how it is outputting documents to the target collection.
Map/reduce uses a cursor over the input documents (created from query, sort, limit), so the rules for cursors apply to input documents to map/reduce.
When nonAtomic is false, during the out stage of the map/reduce, the output database is locked, so writes to that database will have to wait, and will possibly time out as failures on the client.
If nonAtomic is true, while the out stage of a map/reduce is running, data can be read from the database and served to the front end, but since the reads can interleave with the output from the map/reduce, the data served may be in an intermediate state.
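As an illustration, a minimal sketch of such an incremental job with a non-atomic, merging out stage (the logs collection, the ts, sessionId and duration fields, and the session_stats output are hypothetical):

// Timestamp of the previous run, tracked by the application
var lastRun = ISODate('2014-01-01T00:00:00Z');

db.logs.mapReduce(
  function () { emit(this.sessionId, this.duration); }, // map
  function (key, values) { return Array.sum(values); }, // reduce
  {
    query: { ts: { $gt: lastRun } }, // only documents written since lastRun
    out: { merge: 'session_stats', nonAtomic: true }
  }
);

Note that nonAtomic is only valid with the merge and reduce output actions, not with replace.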

(Real time) Small data aggregation MongoDB: triggers?

What is a reliable and efficient way to aggregate small data in MongoDB?
Currently, the data that needs to be aggregated is under 1 GB, but it can grow as high as 10 GB. I'm looking for a real-time strategy, or near real time (aggregation every 15 minutes).
It seems like the likes of Map/Reduce, Hadoop, and Storm are all overkill. I know that triggers don't exist, but I found one post that may be ideal for my situation. Is creating a trigger in MongoDB an ideal solution for real-time small data aggregation?
MongoDB has two built-in options for aggregating data - the aggregation framework and map-reduce.
The aggregation framework is faster (executing as native C++ code as opposed to a JavaScript map-reduce job) but more limited in the sorts of aggregations that are supported. Map-reduce is very versatile and can support very complex aggregations but is slower than the aggregation framework and can be more difficult to code.
Either of these would be a good option for near real time aggregation.
One further consideration to take into account is that as of the 2.4 release the aggregation framework returns a single document containing its results and is therefore limited to returning 16MB of data. In contrast, MongoDB map-reduce jobs have no such limitation and may output directly to a collection. In the upcoming 2.6 release of MongoDB, the aggregation framework will also gain the ability to output directly to a collection, using the new $out operator.
Based on the description of your use case, I would recommend using map-reduce as I assume you need to output more than 16MB of data. Also, note that after the first map-reduce run you may run incremental map-reduce jobs that run only on the data that is new/changed and merge the results into the existing output collection.
As you know, MongoDB doesn't support triggers, but you may easily implement triggers in the application by tailing the MongoDB oplog. This blog post and this SO post cover the topic well.
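As a rough sketch, tailing the oplog from the shell might look like this (it assumes a replica set, and the mydb.events namespace is hypothetical):

// Tailable cursor over the oplog, filtered to inserts into mydb.events
var oplog = db.getSiblingDB('local').oplog.rs;
var cursor = oplog.find({ ns: 'mydb.events', op: 'i' })
                  .addOption(DBQuery.Option.tailable)
                  .addOption(DBQuery.Option.awaitData);

while (cursor.hasNext()) {
  var entry = cursor.next();
  printjson(entry.o); // entry.o is the inserted document; react to it here
}

In practice you would do this from a driver rather than the shell, but the principle is the same.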

MongoDB concurrent map-reduce

I have a collection of events, indexed by a time field. Can I run more than one incremental map-reduce job (with output merged into another collection) on it in parallel, for example every five minutes?
The JavaScript engine is single-threaded per shard or mongod, so even though you could schedule another MR to run on the same machine, I do not believe it will run until the currently running MR has completed.
That being said, I do think V8 allows for the concurrency features your question requires: http://jira.mongodb.org/browse/SERVER-2407 is something you will want to watch.
Since MongoDB 2.4, the V8 JavaScript engine became the default, and it allows multiple JavaScript operations to execute at the same time.
So yes, you can execute several map-reduce jobs concurrently, in parallel.
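For example, each scheduled run could launch a five-minute incremental job like the sketch below (the events collection, the time and type fields, and the event_counts output are hypothetical; the reduce output action is used so counts accumulate across windows, whereas merge would overwrite existing keys):

var windowEnd = new Date();
var windowStart = new Date(windowEnd.getTime() - 5 * 60 * 1000); // last five minutes

db.events.mapReduce(
  function () { emit(this.type, 1); }, // map: count by event type
  function (key, values) { return Array.sum(values); }, // reduce
  {
    query: { time: { $gte: windowStart, $lt: windowEnd } },
    out: { reduce: 'event_counts' } // re-reduce with existing docs to accumulate
  }
);

On 2.4+ with V8, two such invocations from separate connections can execute concurrently.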