Consider that I have a collection of tasks in MongoDB 5.2 and a number of independent worker processes that need to take batches of tasks and process them.
Is there a way to do this in MongoDB in a concurrency-safe way? That is, only one worker should work on a specific set of tasks, and other workers should be able to get unclaimed tasks and process them in parallel without stepping on each other's toes.
PostgreSQL has a very convenient SELECT ... FOR UPDATE SKIP LOCKED query that takes unclaimed records and locks them at the same time. Is it possible to implement a similar system using MongoDB?
Related
I'm running:
select *
from pg_stat_activity
It shows 2 rows with the same query content (in the query field), both in the active state,
but one row shows client backend (backend_type) and the other row shows parallel worker (backend_type).
Why do I have 2 instances of the same query? (I have run just one query in my app.)
What is the difference between client backend and parallel worker?
Since PostgreSQL 9.6 there is parallel processing for queries:
If the optimizer decides it is a good idea and there are enough resources, PostgreSQL will start parallel worker processes that execute the query together with your client backend. Eventually, the client backend will gather all the information from the parallel workers and finish query processing.
This speeds up query processing, but uses more resources on the database server.
The parameters that govern this are, among others, max_parallel_workers, which limits the total number of parallel worker processes, and max_parallel_workers_per_gather, which limits the number of parallel workers for a single query.
In replica set mode, each write operation to any collection in any DB also writes to the oplog collection.
So when writing to multiple DBs in parallel, all these write operations also write to the oplog.
My question: do these write operations require locking the oplog? (I'm using the w:1 write concern.) If they do, this is similar to having a global lock between all the write operations to all the different DBs, isn't it?
I'd be happy to get any hints on this.
According to the documentation, in replication, when MongoDB writes to a collection on the primary, MongoDB also writes to the primary’s oplog, which is a special collection in the local database. Therefore, MongoDB must lock both the collection’s database and the local database. The mongod must lock both databases at the same time to keep the database consistent and ensure that write operations, even with replication, are “all-or-nothing” operations.
This means that concurrent writes to multiple databases on the primary can result in global locks between all the write operations. This is not applicable to the secondary, as MongoDB does not apply writes serially to secondaries, but instead collects oplog entries in batches and then applies those batches in parallel.
Disclaimer: This is all off the top of my head, so please do not crucify me if I have made a mistake. However, please correct me.
Why should they?
Premise: Databases, by definition, are not interconnected
oplog entries are always idempotent
The Oplog is a capped collection, with a guarantee of preserving the insert order
Let's assume true parallelism of queries being applied. So, we have two queries arriving at the very same time and we'd need to decide which one to insert into the oplog first. The first one taking the lock will write first, right? Except, there is a problem. Let's assume the first query is a simple one, db.collection.update({_id:"foo"},{$set:{"bar":"baz"}}), while the other query is more complicated and therefore takes longer to evaluate for correctness. So in order to keep the oplog entries in arrival order, a lock would have to be taken on arrival and released after the idempotent oplog entry was written.
Here is where I have to rely on my memory
However, queries aren't applied in parallel. Queries are queued and evaluated in order of arrival. The database gets locked when the queries are applied, after they have run through the query optimizer. During that lock the idempotent oplog entries are written to the oplog. Since databases are not interconnected and only one query can be applied to a database at any given time, the lock on the database is sufficient. No two data-changing queries can be applied to the same database concurrently anyway, so why should a lock be set on the oplog?
Apparently, a lock is taken on the local database. However, since a lock is already taken on the data, I do not see the reason why. *scratchingMyHead*
I have a crawler which consists of 6+ processes. Half of the processes are masters which crawl the web, and when they find jobs they put them in the jobs collection. Most of the time masters save 100 jobs at once (that is, they get 100 jobs and save each one of them separately as fast as possible).
The other half of the processes are slaves which constantly check whether new jobs of some type are available for them. If so, a slave marks them in_process (this is done using findOneAndUpdate), then processes the job and saves the result in another collection.
Moreover, from time to time the master processes have to read a lot of data from the jobs collection to synchronize.
So to sum up, there are a lot of read and write operations on the db. When the db was small it was working OK, but now that I have ~700k job records (a job document is small, it has 8 fields and has proper indexes / compound indexes) my db is slow. I can observe it when displaying the "stats" page, which basically executes ~16 count operations with some conditions (on indexed fields).
When the master/slave processes are not running, the stats page displays after 2 seconds. When the masters/slaves are running, the same page takes around 1 minute to display and sometimes does not display at all (timeout).
So how can I make my db handle more requests per second? Do I have to replicate it?
I need a distributed lock implementation for my application. I have a number of independent worker processes and I need to enforce a restriction that they can only work on a single account at one time.
The application is written in C# with a MongoDB layer. I noticed that MongoDB's cluster balancer uses a distributed lock mechanism to control which mongos is doing the balancing, and I was wondering if I could reuse the same mechanism in my app?
I'd rather not have the overhead of implementing my own distributed lock mechanism, and since all the worker processes already interface with MongoDB, it would be great if I could reuse its implementation.
There is no inherent document-level locking or distributed lock driver API in MongoDB.
MongoDB's internal locks for sharding splits & migrations use a two phase commit pattern against the shard cluster's config servers. You could implement a similar pattern yourself, and there is an example in the MongoDB documentation: Perform Two Phase Commits.
This is likely overkill if you just need a semaphore to prevent workers simultaneously updating the same account document. A more straightforward approach would be to add an advisory lock field (or embedded doc) to your account document to indicate the worker process that is currently using the document. The lock could be set when the worker starts and removed when it finishes. You probably want the lock information to include both a worker process ID and timestamp, so stale locks can be found & removed.
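The advisory lock field described above could be sketched roughly as follows in Python. This is only a sketch under stated assumptions: the `lock` embedded document, its `worker_id` and `acquired_at` fields, the five-minute stale timeout, and all helper names are hypothetical and not part of any MongoDB API. The filter and update documents are intended for pymongo's `find_one_and_update`, so that checking for an existing lock and setting a new one happen in a single atomic update of the account document.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical timeout after which a lock counts as stale (an assumption).
STALE_AFTER = timedelta(minutes=5)


def claim_filter(account_id, now, stale_after=STALE_AFTER):
    """Match the account only if it is unlocked or its lock is stale."""
    return {
        "_id": account_id,
        "$or": [
            {"lock": None},  # matches a missing field or an explicit null
            {"lock.acquired_at": {"$lt": now - stale_after}},
        ],
    }


def claim_update(worker_id, now):
    """Record which worker holds the lock and when it was taken."""
    return {"$set": {"lock": {"worker_id": worker_id, "acquired_at": now}}}


def release_filter(account_id, worker_id):
    """Only the worker that holds the lock may release it."""
    return {"_id": account_id, "lock.worker_id": worker_id}


RELEASE_UPDATE = {"$unset": {"lock": ""}}


def try_claim(accounts, account_id, worker_id):
    """Return True if this worker obtained the lock.

    `accounts` is assumed to be a pymongo Collection (or anything with a
    compatible find_one_and_update method).
    """
    now = datetime.now(timezone.utc)
    previous = accounts.find_one_and_update(
        claim_filter(account_id, now), claim_update(worker_id, now)
    )
    # find_one_and_update returns None when no document matched the filter,
    # i.e. when another worker holds a lock that is not yet stale.
    return previous is not None
```

A worker would call `try_claim(...)` before touching an account and, when done, run `accounts.update_one(release_filter(account_id, worker_id), RELEASE_UPDATE)`. The lock is purely advisory: it only works if every worker goes through the same helpers.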
Note that any approach requires coordination amongst your worker processes to check and respect your locking implementation.
What happens when a million threads try to read from and write to MongoDB at the same time? Does locking happen at the db level, table level or row level?
It happens at db-level, however with Mongo 2.0 there are a few methods for concurrency, such as inserting/updating by the _id field.
You might run into concurrency problems, especially if you're working with a single MongoDB instance rather than a sharded cluster. The threads would likely start blocking each other as they wait for writes and other operations to complete and locks to be released.
Locking in MongoDB happens at the global level of the instance, but some operations since v2.0 will yield their locks (update by _id, remove, long cursor iteration). Collection-level locking will probably be added sometime soon.
If you need to have a large number of threads accessing MongoDB, consider placing a queue in front to absorb the impact of the concurrency contention, then execute the queued operations sequentially from a single thread.
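The queue-in-front idea can be sketched with Python's standard library. This is a minimal sketch, not a production design: `SerialExecutor` and its methods are hypothetical names, the operations are plain callables standing in for whatever MongoDB calls the application would make, and error handling and result delivery are left out.

```python
import queue
import threading

_STOP = object()  # sentinel telling the consumer thread to exit


class SerialExecutor:
    """Accepts operations from many threads, runs them on one thread."""

    def __init__(self):
        self._pending = queue.Queue()
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def submit(self, operation):
        """Enqueue a zero-argument callable; safe to call from any thread."""
        self._pending.put(operation)

    def _drain(self):
        # Single consumer: operations run strictly one after another,
        # in the order they were taken off the queue.
        while True:
            operation = self._pending.get()
            if operation is _STOP:
                break
            operation()

    def close(self):
        """Run everything queued so far, then stop the consumer thread."""
        self._pending.put(_STOP)
        self._thread.join()
```

Producer threads call `executor.submit(lambda: ...)` with the database operation wrapped in a callable, and the application calls `executor.close()` at shutdown. The trade-off is that throughput is capped by the single consumer, which is the point of the suggestion: contention moves out of the database and into an in-process queue.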