MongoDB: enforcing sequential execution of inserts and updates - mongodb

I'm doing two consecutive writes into the MongoDB (no shards, no replicas):
insert data into db
find and modify data inserted in 1.
when performing step 2), is it granted, that the command sees the data insertion from step 1)? What is the minimal WriteConcern I should use in step 1) to ensure this?
As for my use-case, I know, I could merge 1 and 2 into one simple step; however, my real use-case is much more complicated and cannot be solved such easily.

Your use case will work given you are using a write concern of Acknowledged. This is the default write concern in MongoDB 2.2 or later given you are using a recent driver (see here for the minimum driver version required).
http://docs.mongodb.org/manual/release-notes/drivers-write-concern/

Related

Is the update order preserved in MongoDB?

If I trigger two updates which are just 1 nanosecond apart. Is it possible that the updates could be done out-of-order? Say if the first update was more complex than the second.
I understand that MongoDB is eventually consistent, but I'm just not clear on whether or not the order of writes is preserved.
NB I am using a legacy system with an old version of MongoDB that doesn't have the newer transaction stuff
In MongoDB, write operations are atomic on document level as every document in a collection is independent & individual on it's own. So when an operation is executing write on a document then the second operation has to wait until first one finishes writing to the document.
From their docs :
a write operation is atomic on the level of a single document, even if
the operation modifies multiple embedded documents within a single
document.
Ref : atomicity-in-mongodb
So when this can be an issue ? - On reads, This is when if your application is so ready heavy. As reads can happen during updates - if your read happened before update finishes then your app will see old data or in other case reading from secondary can also result in inconsistent data.
In general MongoDB is usually hosted as replica-set (A MongoDB is generally a set of atleast 3 servers/shards/nodes) in which writes has to definitely be targeted to primary shard & by default reads as well are targeted to primary shard but if you overwrite read preference to read from Secondary shards to make primary free(maybe in general for app reporting), then you might see few to zero issues.
But why ? In general in background data gets sync'd from Primary to Secondary, if at any case this got delayed or not done by the time your application reads then you'll see the issue but chances can be low. Anyway all of this is until MongoDB version 4.0 - From 4.0 secondary read preference enabled apps will read from WiredTiger snapshot of data.
Ref : replica-set-data-synchronization

Best way to query entire MongoDB collection for ETL

We want to query an entire live production MongoDB collection (v2.6, around 500GB of data on around 70M documents).
We're wondering what's the best approach for this:
A single query with no filtering to open a cursor and get documents in batches of 5/6k
Iterate with pagination, using a logic of find().limit(5000).skip(currentIteration * 5000)
We're unsure what's the best practice and will yield the best results with minimum impact on performance.
I would go with 1. & 2. mixed if possible: Iterate over your huge dataset in pages but access those pages by querying instead of skipping over them as this may be costly as also pointed out by the docs.
The cursor.skip() method is often expensive because it requires the
server to walk from the beginning of the collection or index to get
the offset or skip position before beginning to return results. As the
offset (e.g. pageNumber above) increases, cursor.skip() will become
slower and more CPU intensive. With larger collections, cursor.skip()
may become IO bound.
So if possible build your pages on an indexed field and process those batches of data with an according query range.
The brutal way
Generally speaking, most drivers load batches of documents anyway. So your languages equivalent of
var docs = db.yourcoll.find()
docs.forEach(
function(doc){
//whatever
}
)
will actually just create a cursor initially, and will then, when the current batch is close to exhaustion, load a new batch transparently. So doing this pagination manually while planning to access every document in the collection will have little to no advantage, but hold the overhead of multiple queries.
As for ETL, manually iterating over the documents to modify and then store them in a new instance does under most circumstances not seem reasonable to me, as you basically reinvent the wheel.
Alternate approach
Generally speaking, there is no one-size-fits all "best" way. The best way is the one that best fits your functional and non-functional requirements.
When doing ETL from MongoDB to MongoDB, I usually proceed as follows:
ET…
Unless you have very complicated transformations, MongoDB's aggregation framework is a surprisingly capable ETL tool. I use it regularly for that purpose and have yet to find a problem not solvable with the aggregation framework for in-MongoDB ETL. Given the fact that in general each document is processed one by one, the impact on your production environment should be minimal, if noticeable at all. After you did your transformation, simply use the $out stage to save the results in a new collection.
Even collection spanning transformations can be achieved, using the $lookup stage.
…L
After you did the extract and transform on the old instance, for loading the data to the new MongoDB instance, you have several possibilities:
Create a temporary replica set, consisting of the old instance, the new instance and an arbiter. Make sure your old instance becomes primary, do the ET part, have the primary step down so your new instance becomes primary and remove the old instance and the arbiter from the replica set. The advantage is that you facilitate MongoDB's replication mechanics to get the data from your old instance to your new instance, without the need to worry about partially executed transfers and such. And you can use it the other way around: Transfer the data first, make the new instance the primary, remove the other members from the replica set perform your transformations and remove the "old" data, then.
Use db.CloneCollection(). The advantage here is that you only transfer the collections you need, at the expense of more manual work.
Use db.cloneDatabase() to copy over the entire DB. Unless you have multiple databases on the original instance, this method has little to now advantage over the replica set method.
As written, without knowing your exact use cases, transformations and constraints, it is hard to tell which approach makes the most sense for you.
MongoDB 3.4 support Parallel Collection Scan. I never tried this myself yet. But looks interesting to me.
This will not work on sharded clusters. If we have parallel processing setup this will speed up the scanning for sure.
Please see the documentation here: https://docs.mongodb.com/manual/reference/command/parallelCollectionScan/

How long will a mongo internal cache sustain?

I would like to know how long a mongo internal cache would sustain. I have a scenario in which i have some one million records and i have to perform a search on them using the mongo-java driver.
The initial search takes a lot of time (nearly one minute) where as the consecutive searches of same query reduces the computation time (to few seconds) due to mongo's internal caching mechanism.
But I do not know how long this cache would sustain, like is it until the system reboots or until the collection undergoes any write operation or things like that.
Any help in understanding this is appreciated!
PS:
Regarding the fields with which search is performed, some are indexed
and some are not.
Mongo version used 2.6.1
It will depend on a lot of factors, but the most prominent are the amount of memory in the server and how active the server is as MongoDB leaves much of the caching to the OS (by MMAP'ing files).
You need to take a long hard look at your log files for the initial query and try to figure out why it takes nearly a minute.
In most cases there is some internal cache invalidation mechanism that will drop your cached query internal record when write operation occurs. It is the simplest describing of process. Just from my own expirience.
But, as mentioned earlier, there are many factors besides simple invalidation that can have place.
MongoDB automatically uses all free memory on the machine as its cache.It would be better to use MongoDB 3.0+ versions because it comes with two Storage Engines MMAP and WiredTiger.
The major difference between these two is that whenever you perform a write operation in MMAP then the whole database is going to lock and whereas the locking mechanism is upto document level in WiredTiger.
If you are using MongoDB 2.6 version then you can also check the query performance and execution time taking to execute the query by explain() method and in version 3.0+ executionStats() in DB Shell Commands.
You need to index on a particular field which you will query to get results faster. A single collection cannot have more than 64 indexes. The more index you use in a collection there is performance impact in write/update operations.

Solution to Bulk FindAndModify in MongoDB

My use case is as follows -
I have a collection of documents in mongoDB which I have to send for analysis.
The format of the documents are as follows -
{ _id:ObjectId("517e769164702dacea7c40d8") ,
date:"1359911127494",
status:"available",
other_fields... }
I have a reader process which picks first 100 documents with status:available sorted by date and modifies them with status:processing.
ReaderProcess sends the documents for analysis. Once the analysis is complete the status is changed to processed.
Currently reader process first fetch 100 documents sorted by date and then update the status to processing for each document in a loop. Is there any better/efficient solution for this case?
Also, in future for scalability, we might go with more than one reader process.
In this case, I want that 100 documents picked by one reader process should not get picked by another reader process. But fetching and updating are seperate queries right now, so it is very much possible that multiple reader processes pick same documents.
Bulk findAndModify (with limit) would have solved all these problems. But unfortunately it is not provided in MongoDB yet. Is there any solution to this problem?
As you mention there is currently no clean way to do what you want. The best approach at this time for operations like the one you need is this :
Reader selects X documents with appropriate limit and sorting
Reader marks the documents returned by 1) with it's own unique reader ID (e.g. update({_id:{$in:[<result set ids>]}, state:"available", $isolated:1}, {$set:{readerId:<your reader's ID>, state:"processing"}}, false, true))
Reader selects all documents marked as processing and with it's own reader ID. At this point it is guaranteed that you have exclusive access to the resulting set of documents.
Offer the resultset from 3) for your processing.
Note that this even works in highly concurrent situations as a reader can never reserve documents not already reserved by another reader (note that step 2 can only reserve currently available documents, and writes are atomic). I would add a timestamp with reservation time as well if you want to be able to time out reservations (for example for scenarios where readers might crash/fail).
EDIT: More details :
All write operations can occasionally yield for pending operations if the write takes a relatively long time. This means that step 2) might not see all documents marked by step 1) unless you take the following steps :
Use an appropriate "w" (write concern) value, meaning 1 or higher. This will ensure that the connection on which the write operation is invoked will wait for it to complete regardless of it yielding.
Make sure you do the read in step 2 on the same connection (only relevant for replicasets with slaveOk enabled reads) or thread so that they are guaranteed to be sequential. The former can be done in most drivers with the "requestStart" and "requestDone" methods or similar (Java documentation here).
Add the $isolated flag to your multi-updates to ensure it cannot be interleaved with other write operations.
Also see comments for discussion regarding atomicity/isolation. I incorrectly assumed multi-updates were isolated. They are not, or at least not by default.

Restrict querying MongoDB collection to inactive chunks only

I am building an application which will perform 2 phases.
Execute Phase - First phase is very
INSERT intensive (as many inserts
as the hardware can possibly can
execute in a second). This is
essentially a logging trail of work
performed.
Validation Phase - Next
phase will query the logs generated
by phase 1 and compare to an
external source and perform an
UPDATE on the record to store some
statistics. This process is second priority to phase 1.
I'm trying to see if its feasible to do them in parallel and keep write locking to a minimum for the execution phase. I thought one way to do this would be to restrict my Validation phase to only query from older records which are not in the chunk currently being inserted to by the execution phase. Is there something in MongoDB that restricts a find() to only query from chunks that have not been accessed in some configurable amount of time?
You probably want to set up replica set. Insert into the master and fetch from secondaries. In that way, your insert won't be blocked at all.
You can use the mentioned replica set with slaveOk, and update in the master.
You can use a timestamp field or an ObjectId (which already contains a timestamp) for filtering.