What happens when a million threads try to read from and write to MongoDB at the same time? Does locking happen at the db level, table level, or row level?
It happens at the db level; however, with Mongo 2.0 there are a few methods for concurrency, such as inserting/updating by the _id field.
You might run into concurrency problems, especially if you're working with a single MongoDB instance rather than a sharded cluster. The threads would likely start blocking each other as they wait for writes and other operations to complete and locks to be released.
Locking in MongoDB happens at the global level of the instance, but some operations since v2.0 will yield their locks (update by _id, remove, long cursor iteration). Collection-level locking will probably be added sometime soon.
If you need to have a large number of threads accessing MongoDB, consider placing a queue in front to absorb the impact of the concurrency contention, then execute the queued operations sequentially from a single thread.
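The queue-in-front idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name run_serialized is made up for the example, and the list called applied stands in for the database, where a real worker would call the MongoDB driver instead of appending to a list.

```python
import queue
import threading


def run_serialized(operations, num_producers=4):
    """Accept operations from many producer threads, apply them from one thread."""
    pending = queue.Queue()
    applied = []  # stand-in for the database; only the worker thread touches it

    def worker():
        while True:
            op = pending.get()
            if op is None:  # sentinel: producers are done
                break
            applied.append(op)  # a real worker would issue the write here

    def producer(chunk):
        for op in chunk:
            pending.put(op)

    consumer = threading.Thread(target=worker)
    consumer.start()

    # Split the operations across the producers to mimic concurrent clients.
    producers = [
        threading.Thread(target=producer, args=(operations[i::num_producers],))
        for i in range(num_producers)
    ]
    for t in producers:
        t.start()
    for t in producers:
        t.join()

    pending.put(None)  # tell the worker to stop once the queue drains
    consumer.join()
    return applied
```

The producers never wait on each other beyond the brief queue insert, and every operation is applied exactly once by the single worker, at the cost of the extra latency the queue adds.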
In the mongodb documentation, it says:
Beginning with version 2.2, MongoDB implements locks on a per-database basis for most read and write operations. Some global operations, typically short lived operations involving multiple databases, still require a global “instance” wide lock. Before 2.2, there is only one “global” lock per mongod instance.
Does this mean that if I have, say, 3 connections to mongodb://localhost/test from different apps running on the network, only one could be writing at a time? Or is it just per connection?
IOW: Is it per connection, or is the whole /test database locked while it writes?
MongoDB Locking is Different
Locking in MongoDB does not work like locking in an RDBMS, so a bit of explanation is in order. In earlier versions of MongoDB, there was a single global reader/writer latch. Starting with MongoDB 2.2, there is a reader/writer latch for each database.
The readers-writer latch
The latch is multiple-reader, single-writer, and is writer-greedy. This means that:
There can be an unlimited number of simultaneous readers on a database
There can only be one writer at a time on any collection in any one database (more on this in a bit)
Writers block out readers
By "writer-greedy", I mean that once a write request comes in, all readers are blocked until the write completes (more on this later)
Note that I call this a "latch" rather than a "lock". This is because it's lightweight, and in a properly designed schema the write lock is held on the order of a dozen or so microseconds. See here for more on readers-writer locking.
In MongoDB you can run as many simultaneous queries as you like: as long as the relevant data is in RAM they will all be satisfied without locking conflicts.
Atomic Document Updates
Recall that in MongoDB the level of transaction is a single document. All updates to a single document are atomic. MongoDB achieves this by holding the write latch for only as long as it takes to update a single document in RAM. If there is any slow-running operation (in particular, if a document or an index entry needs to be paged in from disk), then that operation will yield the write latch. When the operation yields the latch, the next queued operation can proceed.
This does mean that the writes to all documents within a single database get serialized. This can be a problem if you have a poor schema design, and your writes take a long time, but in a properly-designed schema, locking isn't a problem.
Writer-Greedy
A few more words on being writer-greedy:
Only one writer can hold the latch at one time; multiple readers can hold the latch at a time. In a naive implementation, writers could starve indefinitely if there was a single reader in operation. To avoid this, in the MongoDB implementation, once any single thread makes a write request for a particular latch:
All subsequent readers needing that latch will block
That writer will wait until all current readers are finished
The writer will acquire the write latch, do its work, and then release the write latch
All the queued readers will now proceed
The actual behavior is complex, since this writer-greedy behavior interacts with yielding in ways that can be non-obvious. Recall that, starting with release 2.2, there is a separate latch for each database, so writes to any collection in database 'A' will acquire a different latch from writes to any collection in database 'B'.
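To make the writer-greedy steps above concrete, here is a small Python sketch of a multiple-reader, single-writer latch in which a waiting writer blocks newly arriving readers. It is an illustration only, not MongoDB's implementation: the class name WriterGreedyLatch and its methods are invented for the example, and yielding is not modeled at all.

```python
import threading


class WriterGreedyLatch:
    """Many readers or one writer; a waiting writer holds back new readers."""

    def __init__(self):
        self._cond = threading.Condition()
        self._active_readers = 0
        self._writer_active = False
        self._waiting_writers = 0

    def acquire_read(self):
        with self._cond:
            # New readers block while a writer is active or merely waiting.
            while self._writer_active or self._waiting_writers > 0:
                self._cond.wait()
            self._active_readers += 1

    def release_read(self):
        with self._cond:
            self._active_readers -= 1
            self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            self._waiting_writers += 1  # from now on, new readers queue up
            # Wait for current readers (and any active writer) to finish.
            while self._writer_active or self._active_readers > 0:
                self._cond.wait()
            self._waiting_writers -= 1
            self._writer_active = True

    def release_write(self):
        with self._cond:
            self._writer_active = False
            self._cond.notify_all()  # queued readers and writers re-check
```

Keeping one such latch per database name, for example in a dictionary, would mirror the per-database behavior described for 2.2, while a single shared instance would mirror the older global latch.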
Specific questions
Regarding the specific questions:
Locks (actually latches) are held by the MongoDB kernel for only as long as it takes to update a single document
If you have multiple connections coming in to MongoDB, and each one of them is performing a series of writes, the latch will be held on a per-database basis for only as long as it takes for that write to complete
Multiple connections coming in performing writes (update/insert/delete) will all be interleaved
While this sounds like it would be a big performance concern, in practice it doesn't slow things down. With a properly designed schema and a typical workload, MongoDB will saturate the disk I/O capacity -- even for an SSD -- before lock percentage on any database goes above 50%.
The highest capacity MongoDB cluster that I am aware of is currently performing 2 million writes per second.
It is not per connection, it is per mongod. In other words the lock will exist across all connections to the test database on that server.
It is also a read/write lock, so if a write is occurring then a read must wait; otherwise, how can MongoDB know it is a consistent read?
However, I should mention that MongoDB locks are very different from the SQL/transactional locks you may be used to, and normally a lock will be held for about a microsecond between average updates.
Mongo 3.0 now supports collection-level locking.
In addition to this, Mongo now has a storage engine API that allows new storage engines to be created. Mongo 3.0 comes with 2 storage engines:
MMAPv1: the default storage engine and the one used in the previous versions. Comes with collection-level locking.
WiredTiger: the new storage engine, comes with document-level locking and compression. (Only available for the 64-bit version)
MongoDB 3.0 release notes
WiredTiger
I know the question is pretty old but still some people are confused....
Starting in MongoDB 3.0, the WiredTiger storage engine (which uses document-level concurrency) is available in the 64-bit builds.
WiredTiger uses document-level concurrency control for write operations. As a result, multiple clients can modify different documents of a collection at the same time.
For most read and write operations, WiredTiger uses optimistic concurrency control. WiredTiger uses only intent locks at the global, database and collection levels. When the storage engine detects conflicts between two operations, one will incur a write conflict causing MongoDB to transparently retry that operation.
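The detect-a-conflict-and-retry pattern described above can be illustrated with a toy versioned store in Python. This is a sketch of optimistic concurrency in general, not of WiredTiger's internals: VersionedStore, WriteConflict and update_with_retry are hypothetical names, and the small lock here only protects the brief commit step of the toy store.

```python
import threading


class WriteConflict(Exception):
    """The document changed between the read and the commit."""


class VersionedStore:
    def __init__(self):
        self._docs = {}  # key -> (version, value)
        self._commit_guard = threading.Lock()  # held only for the short commit

    def read(self, key):
        return self._docs.get(key, (0, None))

    def commit(self, key, expected_version, new_value):
        with self._commit_guard:
            current_version, _ = self._docs.get(key, (0, None))
            if current_version != expected_version:
                raise WriteConflict(key)  # someone else committed first
            self._docs[key] = (current_version + 1, new_value)


def update_with_retry(store, key, fn):
    """Apply fn to the current value, retrying transparently on conflict."""
    attempts = 0
    while True:
        attempts += 1
        version, value = store.read(key)
        try:
            store.commit(key, version, fn(value))
            return attempts
        except WriteConflict:
            continue  # re-read the fresh value and try again
```

Writers touching different keys never conflict, which is the point of document-level concurrency; only writers racing on the same key pay for a retry.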
Some global operations, typically short lived operations involving multiple databases, still require a global “instance-wide” lock. Some other operations, such as dropping a collection, still require an exclusive database lock.
Document Level Concurrency
In replica mode each write operation to any collection in any DB, also writes to the oplog collection.
Now, when writing to multiple DBs in parallel, all these write operations also write to the oplog.
My question: do these write operations require locking the oplog? (I'm using w:1 write concern.) If they do, this is kind of similar to having a global lock between all the write operations to all the different DBs, isn't it?
I'd be happy to get any hints on this.
According to the documentation, in replication, when MongoDB writes to a collection on the primary, MongoDB also writes to the primary’s oplog, which is a special collection in the local database. Therefore, MongoDB must lock both the collection’s database and the local database. The mongod must lock both databases at the same time to keep the database consistent and ensure that write operations, even with replication, are “all-or-nothing” operations.
This means that writing to multiple databases in parallel on the primary can result in global locks between all the write operations. This is not applicable to the secondary, as MongoDB does not apply writes serially to secondaries, but instead collects oplog entries in batches and then applies those batches in parallel.
Disclaimer: This is all off the top of my head, so please do not crucify me if I have made a mistake. However, please correct me.
Why should they?
Premise: Databases, by definition, are not interconnected
oplog entries are always idempotent
The Oplog is a capped collection, with a guarantee of preserving the insert order
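As a rough illustration of what idempotent means in the premises above, the Python sketch below records an increment as "set the field to the resulting value", so that replaying the entry leaves the document unchanged. The entry layout and the helper names to_idempotent_entry and apply_entry are assumptions made for the example and do not reproduce the real oplog format.

```python
def to_idempotent_entry(doc, field, increment):
    """Record 'add increment to field' as 'set field to the resulting value'."""
    return {"_id": doc["_id"], "set": {field: doc.get(field, 0) + increment}}


def apply_entry(doc, entry):
    """Apply a recorded entry to a copy of the document."""
    updated = dict(doc)
    updated.update(entry["set"])
    return updated


doc = {"_id": "foo", "n": 1}
entry = to_idempotent_entry(doc, "n", 5)
once = apply_entry(doc, entry)
twice = apply_entry(once, entry)  # replaying the entry changes nothing
```

Replaying a raw increment twice would add 5 twice, which is why the order and form of the recorded entries matter for the argument that follows.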
Let's assume true parallelism of queries being applied. So, we have two queries arriving at the very same time and we'd need to decide which one to insert into the oplog first. The first one taking the lock will write first, right? Except, there is a problem. Let's assume the first query is a simple one, db.collection.update({_id:"foo"},{$set:{"bar":"baz"}}), while the other query is more complicated and therefore takes longer to evaluate for correctness. The simple query could then reach the oplog first even if it arrived second. To prevent that, a lock would have to be taken on arrival and released after the idempotent oplog entry was written.
Here is where I have to rely on my memory
However, queries aren't applied in parallel. Queries are queued and evaluated in order of arrival. The database gets locked when the queries are applied, after they have run through the query optimizer. During that lock the idempotent oplog entries are written to the oplog. Since databases are not interconnected and only one query can be applied to a database at any given time, the lock on the database is sufficient. No two data-changing queries can be applied to the same database concurrently anyway, so why should a lock be set on the oplog?
Apparently, a lock is taken on the local database. However, since a lock is already taken on the data, I do not see the reason why. *scratchingMyHead*
We're using MongoDB 2.2.0 at work. The DB contains about 51GB of data (at the moment) and I'd like to do some analytics on the user data that we've collected so far. Problem is, it's the live machine and we can't afford another slave at the moment. I know MongoDB has a read lock which may affect any writes that happen especially with complex queries. Is there a way to tell MongoDB to treat my (particular) query with the lowest priority?
In MongoDB reads and writes do affect each other. Read locks are shared, but read locks block write locks from being acquired and of course no other reads or writes are happening while a write lock is held. MongoDB operations yield periodically to keep other threads waiting for locks from starving. You can read more about the details of that here.
What does that mean for your use case? There is no way to tell MongoDB to access the data without a read lock, nor is there a way to prioritize the requests (at least not yet). So whether the reads significantly affect the performance of your writes depends on how much "headroom" you have available while write activity is going on.
One suggestion I can make is when figuring out how to run analytics, rather than scanning the entire data set (i.e. doing an aggregation query over all historical data) try running smaller aggregation queries on short time slices. This will accomplish two things:
read jobs will be shorter-lived and will therefore finish sooner, which gives you a chance to assess what impact the queries have on your "live" performance.
you won't be pulling all old data into RAM at once - by spacing out these analytical queries over time you will minimize the impact it will have on current write performance.
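The time-slicing suggestion above could be driven by something like the following Python sketch. It only generates the windows and the matching range filter; the field name created_at used in the usage note is hypothetical, and running the filter against a collection (and pausing between slices) is left to whatever driver and schedule you use.

```python
from datetime import datetime, timedelta


def time_slices(start, end, step):
    """Yield half-open (lo, hi) windows that together cover [start, end)."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    lo = start
    while lo < end:
        hi = min(lo + step, end)  # the last window may be shorter
        yield lo, hi
        lo = hi


def slice_filter(field, lo, hi):
    """Build a range filter document for one window on the given field."""
    return {field: {"$gte": lo, "$lt": hi}}


windows = list(
    time_slices(datetime(2013, 1, 1), datetime(2013, 1, 2), timedelta(hours=6))
)
filters = [slice_filter("created_at", lo, hi) for lo, hi in windows]
```

Each filter can serve as the match stage of a small aggregation, so that every run touches only one slice of the historical data, and an index on the time field keeps each slice from scanning the whole collection.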
Depending on what it is you can't afford about getting another server - you might consider getting a short lived AWS instance which may be not very powerful but would be available to run a long analytical query against a copy of your data set. Just be careful when making it a copy of your data - doing a full sync off of the production system will place a heavy load on it (more effective way would be to use a recent backup/file snapshot to resume from).
Such operations are best left for slaves of a replica set. For one thing, read locks can be shared to allow many reads at once, but write locks will block reads. And, while you can't prioritize queries, MongoDB yields long-running read/write queries. Their concurrency docs should help.
If you can't afford another server, you can set up a slave on the same machine, provided you have some spare RAM/disk headroom and you use the slave lightly/occasionally. You must be careful though: your disk I/O will increase significantly.
I just want to know: if my database handles about 50 queries per second in production, will it still be able to run normally while doing these operations?
My server info is:
A normal two-machine replica set with no sharding.
A capped collection used for logging purposes only. No reads; it's write-heavy.
8GB of RAM
The documentation [1] about concurrency is quite clear:
"When a read lock exists, many read operations may use this lock. However, when a write lock exists, a single write operation holds the lock exclusively, and no other read or write operations may share the lock."
Insert, update, and delete operations use a write lock.
Basically that means all your inserts will happen sequentially, so it's a matter of how fast Mongo writes the data.
[1] http://docs.mongodb.org/manual/faq/concurrency/#which-administrative-commands-lock-the-database