I have a collection with events, indexed by a time field. Can I run more than one incremental map-reduce job (with output merged into another collection) on it in parallel (for example, one every five minutes)?
The JavaScript engine is single-threaded per shard or mongod, so even though you could schedule another MR to run on the same machine, I do not believe it will run until the currently running MR has completed.
That being said, I do think V8 allows for the concurrency features your question requires: http://jira.mongodb.org/browse/SERVER-2407 is something you will want to watch.
Since MongoDB 2.4, the V8 JavaScript engine has been the default, and it allows multiple JavaScript operations to execute at the same time.
So yes, you can execute several map-reduce jobs in parallel.
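As a minimal sketch of what one such scheduled run could look like (assuming an events collection with a time field, as in the question; the map/reduce logic and the "type" field are illustrative):
// Incremental run over the last five minutes, merging into a results collection.
var end = new Date();
var start = new Date(end.getTime() - 5 * 60 * 1000);
db.events.mapReduce(
    function () { emit(this.type, 1); },                    // map (a hypothetical "type" field)
    function (key, values) { return Array.sum(values); },   // reduce: sum the counts per key
    {
        query: { time: { $gte: start, $lt: end } },         // only the new five-minute window
        out: { merge: "event_counts" }                      // merge into the output collection
    }
);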
I have an ETL pipeline that is sinking time-series records to MongoDB.
I need to compute time-based aggregations: daily, weekly, and the like. I assumed MongoDB's aggregation engine would be the way to go, so once I had the aggregation queries for each resolution I wrapped them in MongoDB views like "daily_view", "weekly_view", etc.
A REST service fetches from MongoDB. Depending on which period resolution is requested, it pulls from the corresponding view, filtering by start and end dates.
The response times with these views/aggregations are quite poor, around 10-15 seconds. That latency might not be outrageous for batch-computing a report, but in my case the service has to issue these requests live to serve the frontend, so a 10-second wait is too much.
From the MongoDB reference I know that views are computed on demand during read operations, but I'm a bit disappointed by such response times, because the same aggregations took a fraction of a second in Elasticsearch or InfluxDB, which unfortunately are not an option for me at the moment.
I have also exhausted the research on optimizing the queries; there is no room for further improvement there.
My intuition tells me that if the aggregations have to be done via the aggregation engine, the pipelines need to be executing continuously on the fly (so the views already hold records for the service), as opposed to being run ad hoc every time.
I've tried dropping the views and instead having an aggregation whose last stage is an $out to a real collection... but I still have the same problem: it needs to be run "on demand". I composed the pipelines using the Compass UI, and in the $out stage it presents a button to run the aggregation.
Would there be a way to schedule such pipelines/aggregation queries?
Something I can think of is copy-pasting the aggregation code into JavaScript functions of the REST service... but something would still have to invoke those functions at a regular interval. I know there are scheduling libraries I could bring into the service, but this option makes me a bit uncomfortable in terms of architecture.
In the worst-case scenario, my backup plan is to implement the time-based aggregations as part of the logic of the initial ETL and sink all the different resolutions to different collections, so the service will find the records it needs already waiting in the aggregated collections. But the intention was to delegate time aggregations to the datastore engine.
I'm having a bit of last-minute architecture distress now.
The $out aggregation stage. From the documentation:
Takes the documents returned by the aggregation pipeline and writes them to a specified collection. The $out operator must be the last stage in the pipeline.
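For instance, a daily rollup could be materialized like this (collection and field names are illustrative, not taken from the question):
// Group time-series records by calendar day and write the result to a collection.
db.records.aggregate([
    { $group: {
        _id: { $dateToString: { format: "%Y-%m-%d", date: "$timestamp" } },
        total: { $sum: "$value" }
    } },
    { $out: "daily_agg" }    // must be the last stage
]);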
The mongo shell accepts a JavaScript file as an argument, so this is the easiest way to package your aggregation. Reference:
mongo file.js --username username --password
Then, to execute it on a schedule, common tools like cron jobs come to the rescue.
You might need to account for the differences between the mongo shell and plain JavaScript, such as using db = db.getSiblingDB('<db>') instead of use <db>. See Write Scripts for the mongo Shell.
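Putting it together, a sketch (the file name, database name, collection names, and credentials are placeholders): save the pipeline as a script and let cron invoke it.
// daily_agg.js -- executed by the mongo shell, not Node.js
db = db.getSiblingDB('mydb');        // instead of "use mydb"
db.records.aggregate([
    { $group: { _id: { $dateToString: { format: "%Y-%m-%d", date: "$timestamp" } },
                total: { $sum: "$value" } } },
    { $out: "daily_agg" }            // rewrite the materialized collection each run
]);
And a crontab entry to refresh it every five minutes:
*/5 * * * * mongo daily_agg.js --username username --password secret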
I would like to know how long MongoDB's internal cache persists. I have a scenario in which I have about one million records and have to perform a search on them using the mongo-java driver.
The initial search takes a lot of time (nearly one minute), whereas consecutive searches with the same query take much less (a few seconds) thanks to MongoDB's internal caching mechanism.
But I do not know how long this cache persists: until the system reboots, until the collection undergoes a write operation, or something like that?
Any help in understanding this is appreciated!
PS:
Regarding the fields the search is performed on, some are indexed and some are not.
MongoDB version used: 2.6.1.
It will depend on a lot of factors, but the most prominent are the amount of memory in the server and how active the server is, since MongoDB leaves much of the caching to the OS (by memory-mapping files).
You need to take a long hard look at your log files for the initial query and try to figure out why it takes nearly a minute.
In most cases there is an internal cache-invalidation mechanism that will drop your cached query data when a write operation occurs. That is the simplest description of the process, just from my own experience.
But, as mentioned earlier, there are many factors besides simple invalidation that can come into play.
MongoDB automatically uses all free memory on the machine as its cache. It would be better to use MongoDB 3.0+, because it ships with two storage engines, MMAPv1 and WiredTiger.
The major difference between the two is locking granularity: a write operation under MMAPv1 locks far more coarsely (at the database level, collection level as of 3.0), whereas WiredTiger locks at the document level.
If you are using MongoDB 2.6 you can check query performance and execution time with the explain() cursor method; in version 3.0+ use explain("executionStats") in the shell.
You need an index on the field you query to get results faster. A single collection cannot have more than 64 indexes, and the more indexes a collection has, the greater the performance impact on write/update operations.
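For example (collection and field names are illustrative):
// Index the field you filter on (ensureIndex in 2.6; createIndex in 3.0+):
db.mycollection.ensureIndex({ timestamp: 1 });
// Inspect the query plan and timing:
var cutoff = new Date(Date.now() - 24 * 60 * 60 * 1000);
db.mycollection.find({ timestamp: { $gte: cutoff } }).explain();                    // 2.6
db.mycollection.find({ timestamp: { $gte: cutoff } }).explain("executionStats");    // 3.0+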
I have a question about MongoDB's mapReduce function. Say a mapReduce is currently running that will take a long time. What happens when a user tries to access the collection the mapReduce is writing to?
Does the map-reduce write all the data after it has finished, or does it write while running?
Long-running read and write operations, such as queries, updates, and deletes, yield under many conditions. MongoDB operations can also yield locks between individual document modifications in write operations that affect multiple documents, like update() with the multi parameter.
In map-reduce, MongoDB takes read and write locks, unless operations are specified as non-atomic. Portions of map-reduce jobs can run concurrently.
See the concurrency page for details on MongoDB locking. For your case, the map-reduce command takes read and write locks on the relevant collections while it is running. Portions of the map-reduce command can run concurrently, but in the general case it holds locks while running.
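A sketch of the non-atomic output mentioned above (collection and field names are illustrative); note that nonAtomic is only allowed with the merge and reduce output actions:
var mapFn = function () { emit(this.userId, 1); };                     // "userId" is illustrative
var reduceFn = function (key, values) { return Array.sum(values); };
db.events.mapReduce(mapFn, reduceFn, {
    out: { merge: "mr_results", nonAtomic: true }   // don't hold the lock for the whole post-processing step
});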
Does MongoDB map reduce lock a collection when performing an operation on it?
I have some collections that are widely and intensively used by an application. A map/reduce runs in the background every 10 minutes via a cron job, on that widely and intensively used collection.
I want to know if there is a high probability that the map/reduce won't perform well because other operations are in progress (inserts, updates, and mostly reads) on that collection. In particular, I want to know whether the map/reduce interferes with the normal operations users perform on the collection.
MapReduce, if outputting to a collection, will take multiple write locks out as it writes (as any operation that creates/updates a collection would). If you do an in-line MR instead, you avoid that locking (but are limited in result size). Even so, there are still read locks and the JavaScript lock (server-side JS in MongoDB is single-threaded right now).
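For comparison, an in-line MR returns its results in memory instead of writing to a collection, so the output write locks are avoided (names are illustrative; the result set must fit within the 16 MB document limit):
var res = db.mycollection.mapReduce(
    function () { emit(this.category, 1); },                 // "category" is illustrative
    function (key, values) { return Array.sum(values); },
    { out: { inline: 1 } }                                   // return results in memory
);
printjson(res.results);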
This is all explained (and will be updated if it changes) here:
http://www.mongodb.org/display/DOCS/How+does+concurrency+work#Howdoesconcurrencywork-MapReduce
Note: the SpiderMonkey-to-V8 JS engine migration issues are ones to watch if multi-threading is something you are concerned about.
Is it possible to run MongoDB commands, like a query to grab additional data or an update, from within MongoDB's mapReduce command, either in the map or the reduce function?
Is this completely ludicrous to do anyway? Currently I have some documents that refer to separate collections using MongoDB DBRefs.
Thanks for the help!
Is it possible to run MongoDB commands... from within MongoDB's MapReduce command.
In theory, this is possible. In practice there are lots of problems with this.
Problem #1: exponential work. M/R is already pretty intense and poorly logged. Adding queries can easily make M/R run out of control.
Problem #2: context. Imagine that you're running a sharded M/R and you are querying into an unsharded collection. Does the current context even have that connection?
You're basically trying to implement JOIN logic, and MongoDB has no joins. Instead, you may need to build the final data in a couple of phases by running a few loops over a few sets of data.
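As a sketch of that phased approach (collection and field names are illustrative): aggregate first, then loop over the result to attach the referenced data.
// Phase 1: summarize orders per customer.
db.orders.aggregate([
    { $group: { _id: "$customerId", total: { $sum: "$amount" } } },
    { $out: "order_totals" }
]);
// Phase 2: loop over the summary and pull in the referenced documents (the "join").
db.order_totals.find().forEach(function (doc) {
    var customer = db.customers.findOne({ _id: doc._id });
    db.order_totals.update(
        { _id: doc._id },
        { $set: { customerName: customer ? customer.name : null } }
    );
});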