I am using mongodb/mongo-hadoop
(https://github.com/mongodb/mongo-hadoop/wiki/Spark-Usage#python-example) but am confused about how to do parallel read operations.
By parallel read operations I mean concurrent read operations against my MongoDB index. I have indexed my collection on Timestamp, and I want to query the data between T1-T2, T3-T4, etc. in parallel.
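For illustration, here is a minimal PySpark sketch along the lines of the wiki's Python example, assuming the mongo-hadoop connector jar is on the Spark classpath. The URI, database, collection, and field names are placeholders, and the mongo.input.query and mongo.input.split_size keys are recalled from the mongo-hadoop configuration rather than taken from the question, so treat them as assumptions. The connector computes input splits over the collection, and Spark then reads those splits concurrently across executors.

```python
from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("mongo-parallel-read"))

config = {
    # placeholder URI / database / collection
    "mongo.input.uri": "mongodb://localhost:27017/mydb.events",
    # assumed config keys: restrict the input (here assuming Timestamp is a
    # numeric epoch value) and control the split size in MB
    "mongo.input.query": '{"Timestamp": {"$gte": 1420070400}}',
    "mongo.input.split_size": "64",
}

# Each computed split becomes an RDD partition; partitions are read in
# parallel by the executors, giving concurrent range reads over the index.
rdd = sc.newAPIHadoopRDD(
    inputFormatClass="com.mongodb.hadoop.MongoInputFormat",
    keyClass="org.apache.hadoop.io.Text",
    valueClass="org.apache.hadoop.io.MapWritable",
    conf=config,
)

print(rdd.getNumPartitions(), rdd.count())
```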
I want to read a full MongoDB collection into Spark using the Mongo Spark connector (Scala API) as efficiently as possible in terms of disk I/O.
After reading the connector docs and code, I understand that the partitioners are all designed to compute the minimum and maximum boundaries of an indexed field. My understanding is (and my tests using explain show) that each cursor will scan the index for document keys within the computed boundaries and then fetch the corresponding documents.
My concern is that this index-scan approach will result in random disk reads and, ultimately, more IOPS than necessary. In my case the problem is accentuated because the collection is larger than available RAM (I know that's not recommended). Wouldn't it be orders of magnitude faster to use a natural-order cursor to read the documents as they are stored on disk? How can I accomplish this?
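Outside the Spark connector, a natural-order scan is just a cursor sorted on $natural. A minimal pymongo sketch, using placeholder database and collection names and a hypothetical process() function, shows what such a read looks like; whether it actually reduces I/O depends on the storage engine and on how the documents are laid out on disk.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
coll = client.mydb.mycollection                    # placeholder names

# A $natural sort asks the server to walk the collection in on-disk order
# instead of driving the read through a secondary index.
cursor = coll.find().sort("$natural", 1)

for doc in cursor:
    process(doc)  # hypothetical processing function
```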
I am new to MongoDB and I have this question because I read in the following link, "How can I improve MongoDB bulk performance?", that MongoDB internally breaks the list down into 1,000 operations at a time. However, I could not find this information in the MongoDB documentation.
Does this mean that when I insert 50,000 documents into a MongoDB collection using the Bulk API, MongoDB will internally break the list into batches of 1,000 and perform the bulk insert operation 50 times? If so, would I achieve the same performance by breaking the list of 50,000 documents into sublists of 1,000 documents and running the bulk insert operation in a for loop? Which is the better approach?
Please help me understand.
Thanks.
Yes, if you insert 50,000 documents into a MongoDB collection using the Bulk API, MongoDB will break the list down into groups of at most 1,000 operations.
Ideally you could make the batches of 1,000 yourself and do the inserts, but in this case it will not make any real difference, because the data is already in memory.
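If you do want to drive the batching yourself, a minimal sketch with pymongo might look like the following; the connection URI, database, and collection names are placeholders, and the generated documents merely stand in for your real 50,000.

```python
from pymongo import MongoClient, InsertOne

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
coll = client.mydb.mycollection                    # placeholder names

docs = [{"seq": i} for i in range(50000)]          # stand-in documents

BATCH_SIZE = 1000
for start in range(0, len(docs), BATCH_SIZE):
    batch = docs[start:start + BATCH_SIZE]
    # Unordered bulk write: the server does not stop at the first error,
    # which usually helps raw insert throughput.
    coll.bulk_write([InsertOne(d) for d in batch], ordered=False)
```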
You should not be accepting this much data in a single request in a production-ready system. The client should be able to send small chunks of data so that you can store them in a queue on the server and process them in the background (on another thread).
I'm planning on building an application as follows:
A Node server receives logs from mobile devices and inserts them into Mongo as they arrive.
An incremental MapReduce job is run to calculate new fields from the data.
The data is then pre-aggregated by minute, hour, day, etc. (a per-minute sketch follows this list).
All the while, the data in mongo is queried by a front-end visualization app.
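Purely as an illustration of that pre-aggregation step (this is not taken from the question itself), a per-minute roll-up can be expressed with the aggregation pipeline; the database, collection, and field names below are placeholders:

```python
from pymongo import MongoClient

client = MongoClient()
db = client.logs_db                  # placeholder database name

# Roll raw log documents up into one counter per device per minute;
# "logs", "device", and "ts" are placeholder names.
per_minute = db.logs.aggregate([
    {"$group": {
        "_id": {
            "device": "$device",
            "y":  {"$year": "$ts"},
            "mo": {"$month": "$ts"},
            "d":  {"$dayOfMonth": "$ts"},
            "h":  {"$hour": "$ts"},
            "mi": {"$minute": "$ts"},
        },
        "count": {"$sum": 1},
    }}
])

for bucket in per_minute:
    print(bucket)
```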
I have a couple concerns:
If I set the nonAtomic flag to true, what happens if new data is being written to the db as the MapReduce job runs?
Is it written to the db? If so, I'm assuming this data wouldn't be included in the current incremental MapReduce job.
Or is the database locked and the write lost?
As the MapReduce job and then the time aggregations run, can existing data already in the database be served to my front-end?
Thanks!
The following describes MongoDB 2.6. nonAtomic is an option for the out portion of map/reduce. It's not related to how map/reduce is ingesting documents from the source collection, only how it is outputting documents to the target collection.
Map/reduce uses a cursor over the input documents (created from query, sort, limit), so the rules for cursors apply to input documents to map/reduce.
When nonAtomic is false, during the out stage of the map/reduce, the output database is locked, so writes to that database will have to wait, and will possibly time out as failures on the client.
If nonAtomic is true, while the out stage of a map/reduce is running, data can be read from the database and served to the front end, but since the reads can interleave with the output from the map/reduce, the data served may be in an intermediate state.
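For concreteness, here is a hedged sketch of what an incremental map/reduce with a non-atomic output stage might look like through pymongo's generic command interface. The database, collection, and field names are placeholders, and last_run_ts stands in for whatever high-water mark your job tracks.

```python
from datetime import datetime, timedelta

from bson.code import Code
from pymongo import MongoClient

client = MongoClient()
db = client.logs_db                                   # placeholder database name
last_run_ts = datetime.utcnow() - timedelta(hours=1)  # placeholder high-water mark

map_fn = Code("function () { emit(this.device, 1); }")
reduce_fn = Code("function (key, values) { return Array.sum(values); }")

db.command(
    "mapReduce", "events",                            # placeholder source collection
    map=map_fn,
    reduce=reduce_fn,
    query={"ts": {"$gte": last_run_ts}},              # incremental: only new documents
    # Reduce the new results into the existing output collection without
    # holding the exclusive lock for the entire out stage.
    out={"reduce": "events_per_device", "nonAtomic": True},
)
```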
I have a question about MongoDB's mapReduce function. Say there is a mapReduce currently running that will take a long time. What will happen when a user tries to access the same collection the mapReduce is writing to?
Does the map-reduce write all the data after it is finished, or does it write while it is running?
Long running read and write operations, such as queries, updates, and deletes, yield under many conditions. MongoDB operations can also yield locks between individual document modifications in write operations that affect multiple documents like update() with the multi parameter.
In map-reduce, MongoDB takes a read and write lock, unless operations are specified as non-atomic. Portions of map-reduce jobs can run concurrently.
See the concurrency page for details on MongoDB locking. For your case, the map-reduce command takes a read and write lock on the relevant collections while it is running. Portions of the map-reduce command can run concurrently, but in the general case it is locked while running.
Changed in version 2.2: The use of yielding expanded greatly in MongoDB 2.2. Including the “yield for page fault.” MongoDB tracks the contents of memory and predicts whether data is available before performing a read. If MongoDB predicts that the data is not in memory a read operation yields its lock while MongoDB loads the data to memory. Once data is available in memory, the read will reacquire the lock to complete the operation.
Taken from "Does a read or write operation ever yield the lock?"
Imagine that a write operation acquires the lock. How can a write be performed (i.e. change a certain collection) while MongoDB is reading data from the same collection? What happens if part of the read collection was not changed by the write and part of it was changed?
Imagine that a write operation acquires the lock. How can a write be performed (i.e. change a certain collection) while MongoDB is reading data from the same collection?
It cannot; MongoDB has a writer-greedy read/write database-level lock (http://docs.mongodb.org/manual/faq/concurrency/#what-type-of-locking-does-mongodb-use).
While a single-document write is occurring, MongoDB is blocked from reading on a database-wide level.
What happens if part of the read collection was not changed by the write and part of it was changed?
MongoDB is transactional only at the level of a single document, which means that if part of a large multi-update fails, it simply fails at that point; there is no rollback and no atomicity or transaction spanning the lifetime of the multi-update.
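A small pymongo sketch (placeholder database and collection; the failing document is contrived) can make that behaviour concrete: if a multi-document update hits an error partway through, the documents it already modified stay modified, and nothing is rolled back. Exactly which documents were touched before the error depends on scan order.

```python
from pymongo import MongoClient
from pymongo.errors import WriteError

coll = MongoClient().test.items          # placeholder database/collection

coll.delete_many({})
coll.insert_many([
    {"_id": 1, "qty": 5},
    {"_id": 2, "qty": "five"},           # $inc on a string fails here
    {"_id": 3, "qty": 7},
])

try:
    coll.update_many({}, {"$inc": {"qty": 1}})   # multi-document update
except WriteError:
    pass

# Documents updated before the failure keep their new values; there is no
# rollback of the partial multi-update.
print(list(coll.find()))
```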