Reuse mongo internal distributed locks - mongodb

I need an distributed lock implementation for my application. I have a number of independent worker processes and I need to enforce a restriction that they can only work on a single account at one time.
The application is writen in c# with a mongo db layer. I noticed that mongo's cluster balancer uses a distributed lock mechanism to control which mongos is doing the balancing and I was wondering if I could reuse the same mechanism in my app?
I'd rather not have the overhead of implementing my own distributed lock mechanism and since all the worker processes alreading interface with mongo so it would be great if I could reuse their implementation.

There is no inherent document-level locking or distributed lock driver API in MongoDB.
MongoDB's internal locks for sharding splits & migrations use a two phase commit pattern against the shard cluster's config servers. You could implement a similar pattern yourself, and there is an example in the MongoDB documentation: Perform Two Phase Commits.
This is likely overkill if you just need a semaphore to prevent workers simultaneously updating the same account document. A more straightforward approach would be to add an advisory lock field (or embedded doc) to your account document to indicate the worker process that is currently using the document. The lock could be set when the worker starts and removed when it finishes. You probably want the lock information to include both a worker process ID and timestamp, so stale locks can be found & removed.
Note that any approach requires coordination amongst your worker processes to check and respect your locking implementation.

Related

MongoDB load balancer for the Replica set

In replica set cluster of MongoDB how can i ensure quick response for a concurent users when my primary is busy in serving another request?
Do i need to use load balancer, or the mongodb itself route the query to available Secondary?
Thanks
You don't need to use a load balancer, or to route queries to secondary nodes; the primary node can handle concurrent queries by itself:
MongoDB supports concurrent queries, both reads and writes, using a granular locking system
It is not advised to use secondaries to provide extra read capacity, as replication design makes this inefficient and unreliable for most use cases
If your primary is taking a long time serving a single request, in such a way that it locks out other requests, that should be addressed by redesigning an inefficient query or adding suitable indexes.
If your server is struggling to serve multiple users despite the queries being optimised, look at whether your hardware is insufficient for the job
If you still find that you need to scale out your reads and writes, the recommended way to do that is by sharding, not by using other nodes of a replica set.
Normally writes are handled by master and reads should be send to secondaries by setting read preference. Although it might take some negligible time to get data propagated to secondaries, as secondaries use oplog copy for data replication.
You do not need any load balancer, Mongo is capable of doing these things.
Read more about it here -
https://docs.mongodb.com/manual/replication/

To what level does MongoDB lock on writes? (or: what does it mean by "per connection"

In the mongodb documentation, it says:
Beginning with version 2.2, MongoDB implements locks on a per-database basis for most read and write operations. Some global operations, typically short lived operations involving multiple databases, still require a global “instance” wide lock. Before 2.2, there is only one “global” lock per mongod instance.
Does this mean that in the situation that I Have, say, 3 connections to mongodb://localhost/test from different apps running on the network - only one could be writing at a time? Or is it just per connection?
IOW: Is it per connection, or is the whole /test database locked while it writes?
MongoDB Locking is Different
Locking in MongoDB does not work like locking in an RDBMS, so a bit of explanation is in order. In earlier versions of MongoDB, there was a single global reader/writer latch. Starting with MongoDB 2.2, there is a reader/writer latch for each database.
The readers-writer latch
The latch is multiple-reader, single-writer, and is writer-greedy. This means that:
There can be an unlimited number of simultaneous readers on a database
There can only be one writer at a time on any collection in any one database (more on this in a bit)
Writers block out readers
By "writer-greedy", I mean that once a write request comes in, all readers are blocked until the write completes (more on this later)
Note that I call this a "latch" rather than a "lock". This is because it's lightweight, and in a properly designed schema the write lock is held on the order of a dozen or so microseconds. See here for more on readers-writer locking.
In MongoDB you can run as many simultaneous queries as you like: as long as the relevant data is in RAM they will all be satisfied without locking conflicts.
Atomic Document Updates
Recall that in MongoDB the level of transaction is a single document. All updates to a single document are Atomic. MongoDB achieves this by holding the write latch for only as long as it takes to update a single document in RAM. If there is any slow-running operation (in particular, if a document or an index entry needs to be paged in from disk), then that operation will yield the write latch. When the operation yields the latch, then the next queued operation can proceed.
This does mean that the writes to all documents within a single database get serialized. This can be a problem if you have a poor schema design, and your writes take a long time, but in a properly-designed schema, locking isn't a problem.
Writer-Greedy
A few more words on being writer-greedy:
Only one writer can hold the latch at one time; multiple readers can hold the latch at a time. In a naive implementation, writers could starve indefinitely if there was a single reader in operation. To avoid this, in the MongoDB implementation, once any single thread makes a write request for a particular latch
All subsequent readers needing that latch will block
That writer will wait until all current readers are finished
The writer will acquire the write latch, do its work, and then release the write latch
All the queued readers will now proceed
The actual behavior is complex, since this writer-greedy behavior interacts with yielding in ways that can be non-obvious. Recall that, starting with release 2.2, there is a separate latch for each database, so writes to any collection in database 'A' will acquire a separate latch than writes to any collection in database 'B'.
Specific questions
Regarding the specific questions:
Locks (actually latches) are held by the MongoDB kernel for only as long as it takes to update a single document
If you have multiple connections coming in to MongoDB, and each one of them is performing a series of writes, the latch will be held on a per-database basis for only as long as it takes for that write to complete
Multiple connections coming in performing writes (update/insert/delete) will all be interleaved
While this sounds like it would be a big performance concern, in practice it doesn't slow things down. With a properly designed schema and a typical workload, MongoDB will saturate the disk I/O capacity -- even for an SSD -- before lock percentage on any database goes above 50%.
The highest capacity MongoDB cluster that I am aware of is currently performing 2 million writes per second.
It is not per connection, it is per mongod. In other words the lock will exist across all connections to the test database on that server.
It is also a read/write lock so if a write is occuring then a read must wait, otherwise how can MongoDB know it is a consistent read?
However I should mention that MongoDB locks are very different to SQL/normal transactional locks you get and normally a lock will be held for about a microsecond between average updates.
Mongo 3.0 now supports collection-level locking.
In addition to this, now Mongo created an API that allows to create a storage engine. Mongo 3.0 comes with 2 storage engines:
MMAPv1: the default storage engine and the one use in the previous versions. Comes with collection-level locking.
WiredTiger: the new storage engine, comes with document-level locking and compression. (Only available for the 64-bit version)
MongoDB 3.0 release notes
WiredTiger
I know the question is pretty old but still some people are confused....
Starting in MongoDB 3.0, the WiredTiger storage engine (which uses document-level concurrency) is available in the 64-bit builds.
WiredTiger uses document-level concurrency control for write operations. As a result, multiple clients can modify different documents of a collection at the same time.
For most read and write operations, WiredTiger uses optimistic concurrency control. WiredTiger uses only intent locks at the global, database and collection levels. When the storage engine detects conflicts between two operations, one will incur a write conflict causing MongoDB to transparently retry that operation.
Some global operations, typically short lived operations involving multiple databases, still require a global “instance-wide” lock. Some other operations, such as dropping a collection, still require an exclusive database lock.
Document Level Concurrency

In Mongo what is the difference between sharding and replication?

Replication seems to be a lot simpler than sharding, unless I am missing the benefits of what sharding is actually trying to achieve. Don't they both provide horizontal scaling?
In the context of scaling MongoDB:
replication creates additional copies of the data and allows for automatic failover to another node. Replication may help with horizontal scaling of reads if you are OK to read data that potentially isn't the latest.
sharding allows for horizontal scaling of data writes by partitioning data across multiple servers using a shard key. It's important to choose a good shard key. For example, a poor choice of shard key could lead to "hot spots" of data only being written on a single shard.
A sharded environment does add more complexity because MongoDB now has to manage distributing data and requests between shards -- additional configuration and routing processes are added to manage those aspects.
Replication and sharding are typically combined to created a sharded cluster where each shard is supported by a replica set.
From a client application point of view you also have some control in relation to the replication/sharding interaction, in particular:
Read preferences
Write concerns
Consider you have a great music collection on your hard disk, you store the music in logical order based on year of release in different folders.
You are concerned that your collection will be lost if drive fails.
So you get a new disk and occasionally copy the entire collection keeping the same folder structure.
Sharding >> Keeping your music files in different folders
Replication >> Syncing your collection to other drives
Replication is a mostly traditional master/slave setup, data is synced to backup members and if the primary fails one of them can take its place. It is a reasonably simple tool. It's primarily meant for redundancy, although you can scale reads by adding replica set members. That's a little complicated, but works very well for some apps.
Sharding sits on top of replication, usually. "Shards" in MongoDB are just replica sets with something called a "router" in front of them. Your application will connect to the router, issue queries, and it will decide which replica set (shard) to forward things on to. It's significantly more complex than a single replica set because you have the router and config servers to deal with (these keep track of what data is stored where).
If you want to scale Mongo horizontally, you'd shard. 10gen likes to call the router/config server setup auto-sharding. It's possible to do a more ghetto form of sharding where you have the app decide which DB to write to as well.
Sharding
Sharding is a technique of splitting up a large collection amongst multiple servers. When we shard, we deploy multiple mongod servers. And in the front, mongos which is a router. The application talks to this router. This router then talks to various servers, the mongods. The application and the mongos are usually co-located on the same server. We can have multiple mongos services running on the same machine. It's also recommended to keep set of multiple mongods (together called replica set), instead of one single mongod on each server. A replica set keeps the data in sync across several different instances so that if one of them goes down, we won't lose any data. Logically, each replica set can be seen as a shard. It's transparent to the application, the way MongoDB chooses to shard is we choose a shard key.
Assume, for student collection we have stdt_id as the shard key or it could be a compound key. And the mongos server, it's a range based system. So based on the stdt_id that we send as the shard key, it'll send the request to the right mongod instance.
So, what do we need to really know as a developer?
insert must include a shard key, so if it's a multi-parted shard key, we must include the entire shard key
we've to understand what the shard key is on collection itself
for an update, remove, find - if mongos is not given a shard key - then it's going to have to broadcast the request to all the different shards that cover the collection.
for an update - if we don't specify the entire shard key, we have to make it a multi update so that it knows that it needs to broadcast it
Whenever you're thinking about sharding or replication, you need to think in the context of writers/update operations. If you don't need to scale writes then replications, as it fairly simpler, is a good choice for you.
On the other hand, if you workload mostly updates/writes then at some point you'll hit a write bottleneck. If write request comes Mongo blocks other writes request. Those write request blocks until the first request will be done. If you want to scale this writes and want parallelize it then you need to implement sharding.
Just to put this somewhere...
The most basic way to run mongo is as standalone server.
You write a config (file or cli options)
initiate the server using mongod
For this picture, I didn't include the "client". Check the next one.
A replica set is a set of servers initialized exactly as above with a different config file.
To link them, we connect to one of them, and initialize the replica set mode.
They will mirror each other (in the most common configuration). This system guarantees high availability of data.
The initialization of the replica set is represented in the red border box.
Sharding is not about replicating data, but about fragmenting data.
Each fragment of data is called chunk and goes to a different shard. shard = each replica set.
"main" server, running mongos instead of mongod. This is a router for queries from the client.
Obvious: The trade-off is a more complex architecture.
Novelty: configuration server (again, a different config file).
There is much more to add, but apart from the words the pictures hold much the same.
Even mongoDB recommends to study your case carefully before going sharding. Vertical scaling (vs) is probably a good idea at least once before horizontal scaling (hs).
vs is done upgrading hardware (cpu, ram, etc). hs is needs more computers (but could be cheap computers).
Both replication and sharding can be used (individually or together) for horizontal scaling of a MongoDB installation.
Sharding is MongoDB's solution for meeting the demands of data growth. Sharding stores data records across multiple servers to provide faster throughput on read and write queries, particularly for very large data sets.
Any of the servers in the sharded cluster can respond to a read or write operation, which greatly speeds up query responses.
Replication is MongoDB's solution for providing stability, backup, and disaster recovery to a MongoDB installation. This process copies and synchronizes the replica data set across multiple servers. This prevents downtime if one server goes offline.
Any of the secondary servers can respond to read queries, but only the primary server will perform write operations. The results of the write operation will then be propagated out to the secondary servers.
Scenario 1: Fault-Tolerance
In this scenario, the user is storing billing data in a MongoDB installation. This data is mission-critical to the user's business, and needs to be available 24/7, even if a server crashes or is taken offline.
MongoDB replication is the best solution for this user. With replication, the entire data set is mirrored on multiple servers. If a server fails or is taken offline, the other servers in the cluster take over.
Scenario 2: High Performance
In this scenario, the user is running a social networking site which is run from a MongoDB database. As the social network grows, the MongoDB data set has grown along with it. The user is seeing query times and page loads increase beyond an acceptable point. It is critical that the user's MongoDB installation receives a major performance boost.
Setting up a sharded MongoDB cluster is the best solution for this user. The sharded cluster will break up the user's data set and store parts of it on separate secondary servers. Each secondary server can respond to read or write queries on its portion of the data, which greatly increases the installation's response time
MongoDB Atlas is a Database as a service in could. It support three major cloud providers such as Azure , AWS and GCP. In cloud environment , we usually talk about high availability and scalability. In Atlas “clusters”, can be either a replica set or a sharded cluster.
These two address high availability and scalability features of our cloud environment.
In general Cluster is a group of servers used to achieve a specific task. So sharded clusters are used to store data in across multiple machines to meet the demand of data growth. As the size of the data increases, a single machine may not be sufficient to store the data nor provide an acceptable read and write throughput. Sharded clusters supports the horizontal scalability of the underling cloud environment.
A replica set in MongoDB is a group of mongod processes that maintain the same data set. Replica sets provide redundancy and high availability, and are the basis for all production deployments.In a replica, one node is a primary node that receives all write operations. All other instances, such as secondaries, apply operations from the primary so that they have the same data set. Replica set mainly focus on the availability of data.
Please check the documentation
Thank You.

Does MongoDB require at least 2 server instances to prevent the loss of data?

I have decided to start developing a little web application in my spare time so I can learn about MongoDB. I was planning to get an Amazon AWS micro instance and start the development and the alpha stage there. However, I stumbled across a question here on Stack Overflow that concerned me:
But for durability, you need to use at least 2 mongodb server
instances as master/slave. Otherwise you can lose the last minute of
your data.
Is that true? Can't I just have my box with everything installed on it (Apache, PHP, MongoDB) and rely on the data being correctly stored? At least, there must be a config option in MongoDB to make it behave reliably even if installed on a single box - isn't there?
The information you have on master/slave setups is outdated. Running single-server MongoDB with journaling is a durable data store, so for use cases where you don't need replica sets or if you're in development stage, then journaling will work well.
However if you're in production, we recommend using replica sets. For the bare minimum set up, you would ideally run three (or more) instances of mongod, a 'primary' which receives reads and writes, a 'secondary' to which the writes from the primary are replicated, and an arbiter, a single instance of mongod that allows a vote to take place should the primary become unavailable. This 'automatic failover' means that, should your primary be unable to receive writes from your application at a given time, the secondary will become the primary and take over receiving data from your app.
You can read more about journaling here and replication here, and you should definitely familiarize yourself with the documentation in general in order to get a better sense of what MongoDB is all about.
Replication provides redundancy and increases data availability. With multiple copies of data on different database servers, replication protects a database from the loss of a single server. Replication also allows you to recover from hardware failure and service interruptions. With additional copies of the data, you can dedicate one to disaster recovery, reporting, or backup.
In some cases, you can use replication to increase read capacity. Clients have the ability to send read and write operations to different servers. You can also maintain copies in different data centers to increase the locality and availability of data for distributed applications.
Replication in MongoDB
A replica set is a group of mongod instances that host the same data set. One mongod, the primary, receives all write operations. All other instances, secondaries, apply operations from the primary so that they have the same data set.
The primary accepts all write operations from clients. Replica set can have only one primary. Because only one member can accept write operations, replica sets provide strict consistency. To support replication, the primary logs all changes to its data sets in its oplog. See primary for more information.

MongoDb - Utilizing multi CPU server for a write heavy application

I am currently evaluating MongoDb for our write heavy application...
Currently MongoDb uses single thread for write operation and also uses global lock whenever it is doing the write... Is it possible to exploit multiple CPU on a multi-CPU server to get better write performance? What are your workaround for global write lock?
No, it is still recommended to use sharding to utilize multiple CPU cores.
As stated in the FAQ
Sharding improves concurrency by distributing collections over multiple mongod instances, allowing shard servers (i.e. mongos processes) to perform any number of operations concurrently to the various downstream mongod instances.
Each mongod instance is independent of the others in the shard cluster and uses the MongoDB readers-writer lock). The operations on one mongod instance do not block the operations on any others.
Sharding on a single box has its issues, as one user stated in the mongodb-user mailing list
After some significant experimentation, I've found a single MongoDB shard daemon CANNOT use more than one CPU. On a 24 CPU box, performance scales up until we hit about 8 shards, then another limit kicks in.
So right now, the easy solution is to shard.
Yes, normally sharding is done across servers. However, it is completely possible to shard on a single box. You simply fire up the shards on different ports and provide them with different folders. Here's a sample configuration of 2 shards on one box.
The MongoDB team recognizes that this is kind of sub-par, and I know from talking to them that they're looking at better ways to do this.
Obviously once you get multiple shards on one box and increase your write threads, you will have to be wary of disk IO. In my experience, I've been able to saturate disks with a single write thread. If your inserts/updates are relatively simple, you may find that extra write threads don't do anything. (Map-Reduces are the exception here, sharding definitely helps there)