MongoDB optimistic concurrency control for update

I am simulating multiple concurrent requests for MongoDB's "update".
Here is the setup: I insert a document with amount=1000 into MongoDB, and every time I trigger the API, it reads the document, adds 50 to the amount (amount += 50), and saves it back to the database. Basically it is a find-and-update operation on a single document.
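// step 1: read the current document for the account
err := globalDB.C("bank").Find(bson.M{"account": account}).One(&entry)
if err != nil {
	panic(err)
}
// step 2: sleep for a random number of milliseconds to simulate processing time
wait := Random(1, 100)
time.Sleep(time.Duration(wait) * time.Millisecond)
// step 3: add 50 to the balance read in step 1 and write the whole document back
entry.Amount = entry.Amount + 50.000
err = globalDB.C("bank").UpdateId(entry.ID, &entry)
if err != nil {
	panic(err)
}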
Here is the source code for the project.
I am simulating requests using Vegeta:
If I set -rate=10 (which means the API is triggered 10 times per second, so the expected amount is 1000 + 50 * 10 = 1500), the data is correct:
echo "GET http://localhost:8000" | \
vegeta attack -rate=10 -connections=1 -duration=1s | \
tee results.bin | \
vegeta report
But -rate=100 (which means the API is triggered 100 times per second, so the expected amount is 1000 + 50 * 100 = 6000) produces a very confusing result:
echo "GET http://localhost:8000" | \
vegeta attack -rate=100 -connections=1 -duration=1s | \
tee results.bin | \
vegeta report
In short, what I want to know is this: I thought MongoDB uses optimistic concurrency control, which means that if there is a write conflict it should retry, so the latency goes up but the data is guaranteed to be correct.
Why does the result look as if data correctness is not guaranteed at all in MongoDB?
I know some of you might notice the sleep at lines 41 and 42 of the full source (the time.Sleep call shown above), but even with it commented out, the result is still not correct when I test with -rate=500.
Any clues as to why this is happening?

Generally you should extract the relevant segment of the code into the question. It is inconsiderate to ask people to locate the 5 relevant lines in your 76 line program.
Your test is performing concurrent find-and-modify operations. Let's suppose there are two concurrent processes A and B that each increment account balance by 50. Starting balance is 0. The order of operations could be:
A: what is the current balance for account 1234?
B: what is the current balance for account 1234?
DB -> A: balance for account 1234 is 0
DB -> B: balance for account 1234 is 0
A: new balance is 0+50 = 50
A: set balance for account 1234 to 50
DB -> A: ok, new balance for account 1234 is 50
B: new balance is 0+50 = 50
B: set balance for account 1234 to 50
DB -> B: ok, new balance for account 1234 is 50
From the database's perspective, there are no "write conflicts" here. You asked to set the balance to 50 for the given account twice.
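The interleaving above can be replayed deterministically. This is a minimal sketch that uses a plain dictionary as a stand-in for the collection; read_balance and write_balance are hypothetical helpers, not MongoDB driver calls.

```python
# In-memory stand-in for the "bank" collection (assumption: no real database).
db = {"1234": 0}


def read_balance(account):
    """Return the balance currently stored for the account."""
    return db[account]


def write_balance(account, value):
    """Unconditionally overwrite the balance, like writing back a full document."""
    db[account] = value


# Both clients read before either one writes.
seen_by_a = read_balance("1234")
seen_by_b = read_balance("1234")

# Each client adds 50 to the value it saw and writes the result back.
write_balance("1234", seen_by_a + 50)
write_balance("1234", seen_by_b + 50)

final_balance = db["1234"]
print(final_balance)  # 50, not 100: one increment was lost
```

Both writes succeed, and neither overlaps with the other at the storage level, which is why no conflict is reported.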
There are different ways of solving this issue. One is to use conditional updates such that the process looks like this:
A: what is the current balance for account 1234?
B: what is the current balance for account 1234?
DB -> A: balance for account 1234 is 0
DB -> B: balance for account 1234 is 0
A: new balance is 0+50 = 50
A: if balance in account 1234 is 0, set balance to 50
DB -> A: ok, new balance for account 1234 is 50
B: new balance is 0+50 = 50
B: if balance in account 1234 is 0, set balance to 50
DB -> B: balance is not 0, no update was performed
B: err, let's start over
B: what is the current balance for account 1234?
DB -> B: balance for account 1234 is 50
B: new balance is 50+50 = 100
B: if balance in account 1234 is 50, set balance to 100
DB -> B: ok, new balance for account 1234 is 100
As you see, the database must support the conditional update and the application must handle the possibility of concurrent updates and retry the operation.
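A minimal sketch of this read, conditional update, retry loop follows. The Accounts class is a toy in-memory store invented for illustration; its internal lock only models the fact that a single conditional update is applied atomically. With a real driver the condition would typically go into the update's query filter, and the application would check whether any document matched.

```python
import threading


class Accounts:
    """Toy store (hypothetical); the lock models atomicity of one update."""

    def __init__(self):
        self._balances = {}
        self._lock = threading.Lock()

    def get(self, account):
        with self._lock:
            return self._balances.get(account, 0)

    def update_if(self, account, expected, new):
        """Set the balance to `new` only if it still equals `expected`."""
        with self._lock:
            if self._balances.get(account, 0) != expected:
                return False  # someone else changed it; nothing was written
            self._balances[account] = new
            return True


def deposit(store, account, amount):
    """Retry the whole read-compute-write cycle until the update applies."""
    attempts = 0
    while True:
        attempts += 1
        current = store.get(account)
        if store.update_if(account, current, current + amount):
            return attempts


store = Accounts()
threads = [
    threading.Thread(target=deposit, args=(store, "1234", 50))
    for _ in range(100)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

total = store.get("1234")
print(total)  # 5000: every deposit is applied exactly once
```

The retry lives in the application, not in the store: a failed conditional update simply reports that nothing was written.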
If the balance can go up and down, this is not a practically useful way of writing a debit & credit system (but if balance can only increase or only decrease, this would in fact work quite fine). In real systems you'd use a special field whose purpose is to identify the specific version of the document that was in existence at the moment the application retrieved some data; the update is conditioned on the current version of the document staying the same, and each update increments the version. Concurrent updates would then be detected because the version number is wrong rather than a content field.
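The version-field idea can be sketched as follows, again with a hypothetical in-memory store (VersionedStore and its methods are illustrative names, not a MongoDB API). The example shows a case where the balance returns to its earlier value, so a check on the balance alone could not tell that the document had changed, while the version check can.

```python
import threading


class VersionedStore:
    """Toy document store where every successful write bumps a version."""

    def __init__(self):
        self._docs = {}
        self._lock = threading.Lock()

    def insert(self, key, fields):
        with self._lock:
            self._docs[key] = {"version": 0, **fields}

    def find(self, key):
        with self._lock:
            return dict(self._docs[key])  # copy, so callers hold a snapshot

    def replace_if_version(self, key, expected_version, fields):
        """Write only if nobody has written since `expected_version` was read."""
        with self._lock:
            if self._docs[key]["version"] != expected_version:
                return False
            self._docs[key] = {"version": expected_version + 1, **fields}
            return True


store = VersionedStore()
store.insert("1234", {"balance": 100})

# Client A reads the document (version 0, balance 100) and then stalls.
stale = store.find("1234")

# Meanwhile other clients credit 50 and then debit 50.
for delta in (50, -50):
    doc = store.find("1234")
    store.replace_if_version(
        "1234", doc["version"], {"balance": doc["balance"] + delta}
    )

current = store.find("1234")
balance_unchanged = current["balance"] == stale["balance"]

# Client A's write is rejected because the version moved from 0 to 2.
stale_write_accepted = store.replace_if_version(
    "1234", stale["version"], {"balance": stale["balance"] + 50}
)
print(balance_unchanged, stale_write_accepted)  # True False
```

After the rejection, client A would re-read the document and repeat its calculation, exactly as in the conditional-update flow.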
There are ways to produce a "write conflict" on the database side, for example by using transactions as supported by MongoDB 4.0+. In principle this works the same way, except that the "version" is called a "transaction identifier" and is stored in a different place (not inline in the document being operated on). In this case the database would inform you that there was a write conflict, but you'd still need to reissue the operations.
Update:
I think you also need to distinguish between "optimistic concurrency control" as a concept, its implementation, and what the implementation applies to. https://docs.mongodb.com/manual/faq/concurrency/#how-granular-are-locks-in-mongodb for example says:
For most read and write operations, WiredTiger uses optimistic concurrency control. WiredTiger uses only intent locks at the global, database and collection levels. When the storage engine detects conflicts between two operations, one will incur a write conflict causing MongoDB to transparently retry that operation.
Reading this statement carefully, it applies to write operations on storage engine level. I imagine when MongoDB performs something like $set, or other atomic write operations, this would apply. But this doesn't apply to application-level operation sequences like you've given in your example.
If you try your example code with your favorite relational DBMS, I think you'll find it produces roughly the same result as you've seen with MongoDB, if you issue a transaction around each individual read and write (such that balance read and write are in different transactions), for the same reason - RDBMSes lock data (or use techniques like MVCC) for the lifetime of a transaction, but not across transactions.
Similarly if you put both balance read and balance write on the same account into a transaction in MongoDB, you may find that you are receiving transient errors when other transactions modify the account in question concurrently.
Lastly, the API that MongoDB implements for transactions (with retries) is described here. If you look at it carefully you'll find that it expects the application to reissue not just the transaction commit command, but to repeat the entire transaction operation. This is because generally, if there is a "write conflict", the starting data has changed, and simply attempting the final write again isn't enough: calculations in the application may need to be redone, and even side effects of that process may change as a result.
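The shape of such a retry loop can be sketched without a database. Everything here is hypothetical: WriteConflict stands in for whatever transient error a driver raises, and the body receives the attempt number only so that the example can simulate two conflicts. The point is that the whole body, including the read, runs again on each attempt.

```python
class WriteConflict(Exception):
    """Hypothetical stand-in for a driver's transient transaction error."""


def run_transaction(body, max_attempts=5):
    """Re-run the entire transaction body when it reports a write conflict."""
    for attempt in range(1, max_attempts + 1):
        try:
            return body(attempt)
        except WriteConflict:
            if attempt == max_attempts:
                raise  # give up and surface the error to the caller


balance = {"1234": 0}
calls = []


def add_fifty(attempt):
    calls.append(attempt)
    current = balance["1234"]  # the read is repeated on every attempt
    if attempt < 3:
        raise WriteConflict()  # simulated conflict on the first two attempts
    balance["1234"] = current + 50
    return balance["1234"]


result = run_transaction(add_fifty)
print(result, calls)  # 50 [1, 2, 3]
```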

Related

How to detect a negative integer in firebase?

I'm using ServerValue.increment in Flutter to update an inventory amount in Firebase. It is a nice solution when my users are offline, but I need to handle the following case:
User 1 reads the inventory of 40 (for example) and immediately goes offline.
User 2 reads the same inventory (40) and spends 10, so the online inventory is updated to 30.
User 1 spends 35 (less than 40). When he/she goes online again, the inventory is updated to -5 (30 - 35).
I would like to detect this negative number in order to execute a procedure. How can I detect it in Firebase?
I'm using ServerValue.increment in this way:
db.child('quantityInStock')
.set(ServerValue.increment(-quantityToReduce.round()));
How can I detect when quantityInStock ends up being a negative number in order to execute a new procedure automatically?

If the new value depends on the existing value in the way you describe, you have two options:
Use security rules to ensure the write operation is only allowed when there's enough inventory.
".write": "newData.val() >= 0"
Use a transaction to ensure that your client can actively check the current value, to determine the new value.
dataRef.runTransaction((MutableData transaction) async {
  if (transaction.value >= 40) {
    transaction.value = transaction.value - 40;
  }
  return transaction;
});
Both approaches have advantages and disadvantages.
For example: using security rules in your scenario with an offline user may prevent your application code from knowing the write was rejected, as completion listeners are not persisted across app restarts.
Using a transaction you won't have this problem, but in that case your app will only work when the user is connected to the database. Transactions don't work when the user is offline.
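The check that either approach enforces can be replayed on the scenario from the question. This is plain arithmetic with a hypothetical helper, apply_spend, and does not call Firebase.

```python
def apply_spend(current_stock, quantity):
    """Return the new stock, or None to reject a spend that would go negative."""
    if current_stock >= quantity:
        return current_stock - quantity
    return None


stock = 40
after_user2 = apply_spend(stock, 10)        # user 2 spends 10 while online
after_user1 = apply_spend(after_user2, 35)  # user 1's spend arrives later
print(after_user2, after_user1)  # 30 None
```

The decision is made against the value at the time the write is applied (30), not the value user 1 saw before going offline (40).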

How to avoid concurrent writes to a row using Slick?

I have the following table:
case class Project(id: Int, name: String, locked: Boolean)
Users can request some processing to be done on the project - but I'd like to make sure only one processing job is being run on the project at a time.
My approach right now is to set locked = true on the project whenever a job begins. If a user (malicious or otherwise) tries to start a second job, the code should check whether locked is already true, and if so respond with an error message saying 'please wait' or similar.
I think I need to do this using transactions, so that race conditions are avoided and a malicious user can't send concurrent requests and have multiple jobs start because they all saw locked = false (as they started simultaneously).
How can I do this with Slick? My current attempt looks like this:
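def lock(id: Int): Future[Int] = {
  // Select the locked column only for the row that is currently unlocked
  val select = for { p <- projects if p.id === id && p.locked === false } yield p.locked
  // The update yields the number of rows changed: 1 if we took the lock, 0 otherwise
  val q = select.update(true).transactionally // attempting to use transactions
  db.run(q)
}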
I believe db.run will return the number of rows which were updated, and if p.locked === false condition fails, then the number of rows updated will be 0, and I could use that to determine if project was successfully locked. And the .transactionally should perhaps make this run in transactions so concurrent requests won't be an issue.
Are my assumptions / reasoning here correct? Is there a better way to do this?

The meaning of .transactionally here depends on which database you are using.
Without specifying anything else, you are using your database's default transaction isolation level. For example, with Postgres the default is READ COMMITTED, which means that, given two concurrent transactions, each can see data committed by the other while it is still running.
I suggest always specifying the isolation level explicitly with .transactionally.withTransactionIsolation(transactionLevel) to avoid concurrency problems.
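The question's idea of using the affected-row count to decide who got the lock can be tried out in isolation. This sketch swaps Slick for Python's built-in sqlite3 module and an in-memory database, so it only illustrates the conditional UPDATE itself and says nothing about isolation levels in other databases; the table layout mirrors the Project case class.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE projects ("
    "id INTEGER PRIMARY KEY, name TEXT, locked INTEGER NOT NULL)"
)
conn.execute("INSERT INTO projects VALUES (1, 'demo', 0)")
conn.commit()


def lock(conn, project_id):
    """Try to take the lock; True only if this call flipped locked from 0 to 1."""
    cur = conn.execute(
        "UPDATE projects SET locked = 1 WHERE id = ? AND locked = 0",
        (project_id,),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows means the project was already locked


first = lock(conn, 1)
second = lock(conn, 1)
print(first, second)  # True False
```

The check and the write happen in one statement, so there is no separate read whose result could go stale before the update runs.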

REST, Pagination with filters dependent on external system and sql

I have a REST web-service which is expected to expose a paginated GET call.
For example: I have a list of students ("Name", "Age", "Class") in my SQL table, and I have to expose a paginated API to get all students in a given class. So far so good: a typical REST API does the job and pagination can be achieved by the SQL query.
Now suppose we have the same requirement, except that we need to send back only students who are from a particular state. This information is hosted by a web service, S2. S2 has an API which, given a list of student names and a state "X", returns the students that belong to state X.
Here is where I'm finding it difficult to support pagination.
For example: I get a request with page_size 10, a class C and a state X, which results in 10 students from class C from my DB. Now I make a call to S2 with these 10 students and state X; the result may include 0 students, all 10 students, or any number in between from state X.
How do I support pagination in this case?
The brute-force approach would be to keep making DB calls and S2 calls until the page size is met, and only then reply. I don't like this approach.
Is there a common practice for this, a general rule of thumb, or is this architecture a bad service design?
(EDIT): Please also explain how to manage the offset value.
If we go with some such approach and get the result set, how can I manage the offset for the next page request?
Thanks for reading :)

Your service should handle the pagination itself rather than hand it off to SQL. Follow these steps:
Get all students from S1 (SQL database) where class = C.
Using the result, get all students from S2 that are in the result and where state = X.
Sort the second result in a stable way.
Get the requested page you want from the sorted result.
All this is done in the code that calls both S1 and S2. Only it has the knowledge to build the pages.
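The four steps can be sketched with in-memory stand-ins. STUDENTS plays the role of the SQL table and STATE_OF the data held by S2; both, along with the function names, are made up for illustration.

```python
# Stand-in for the SQL table (assumption: only name and class matter here).
STUDENTS = [
    {"name": "Ann", "class": "C"},
    {"name": "Bob", "class": "C"},
    {"name": "Cid", "class": "C"},
    {"name": "Dee", "class": "D"},
    {"name": "Eve", "class": "C"},
    {"name": "Fay", "class": "C"},
    {"name": "Gus", "class": "C"},
]

# Stand-in for the data that only S2 knows.
STATE_OF = {"Ann": "X", "Bob": "Y", "Cid": "X", "Dee": "X",
            "Eve": "X", "Fay": "Y", "Gus": "X"}


def students_in_class(class_name):
    """Step 1: all students of the class, as the SQL query would return them."""
    return [s["name"] for s in STUDENTS if s["class"] == class_name]


def filter_by_state(names, state):
    """Step 2: what S2 would return for the given names and state."""
    return [n for n in names if STATE_OF.get(n) == state]


def get_page(class_name, state, offset, page_size):
    names = students_in_class(class_name)
    matching = filter_by_state(names, state)
    ordered = sorted(matching)                 # step 3: stable ordering
    return ordered[offset:offset + page_size]  # step 4: slice out the page


page1 = get_page("C", "X", 0, 2)
page2 = get_page("C", "X", 2, 2)
print(page1, page2)  # ['Ann', 'Cid'] ['Eve', 'Gus']
```

Because the offset applies to the filtered and sorted list, the next page is simply offset + page_size, at the cost of fetching and filtering the whole class on every request.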

Not doing the pagination in SQL can lead to performance problems with large databases.
An in-between solution can be applied. I assume that the pagination parameters (offset, page size) are configurable for both services, yours and the external one.
You can implement prefetch logic for both services; let's say the prefetch chunk size is 100.
The frontend can be served with the required page size of 10.
If the prefetched chunks do not yield a full frontend page of 10, the backend should prefetch another chunk until the frontend can be served with 10 students.
This approach requires more logic in the backend to calculate the next offsets for prefetching, but if you want both performance and pagination solved you must invest some effort.
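One possible shape for the prefetch logic, including the offset bookkeeping, is sketched below. It assumes the SQL side can return students in a stable order by offset and limit; query_students and filter_by_state are hypothetical stand-ins for the database and S2. The function returns the page together with the database offset at which the next page request should resume.

```python
# Stand-in data: 25 students in class "C", every third one lives in state "X".
ALL_NAMES = [f"s{i:02d}" for i in range(25)]
IN_STATE_X = {name for i, name in enumerate(ALL_NAMES) if i % 3 == 0}


def query_students(class_name, offset, limit):
    """Stand-in for: SELECT name ... WHERE class = ? ORDER BY name LIMIT/OFFSET."""
    return ALL_NAMES[offset:offset + limit]


def filter_by_state(names, state):
    """Stand-in for the S2 call."""
    return [n for n in names if n in IN_STATE_X]


def fetch_page(class_name, state, db_offset, page_size, chunk_size=100):
    """Return (page, next_db_offset); next_db_offset is None when exhausted."""
    page = []
    while len(page) < page_size:
        chunk = query_students(class_name, db_offset, chunk_size)
        if not chunk:
            return page, None  # no more rows in the database
        kept = set(filter_by_state(chunk, state))
        for i, name in enumerate(chunk):
            if name in kept:
                page.append(name)
                if len(page) == page_size:
                    # Resume right after the last row that was consumed.
                    return page, db_offset + i + 1
        db_offset += len(chunk)
    return page, db_offset


page1, next1 = fetch_page("C", "X", 0, 4, chunk_size=5)
page2, next2 = fetch_page("C", "X", next1, 4, chunk_size=5)
page3, next3 = fetch_page("C", "X", next2, 4, chunk_size=5)
print(page1, next1)  # ['s00', 's03', 's06', 's09'] 10
```

The client treats the returned offset as an opaque cursor and sends it back with the next request, so it never needs to know how many rows were filtered out.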

Is querying MongoDB faster than Redis?

I have some data stored in a database (MongoDB) and in a distributed cache (Redis).
When querying the repository, I use a lazy-loading approach: it first looks for the data in the cache; if it is not there, it fetches it from the database and updates the cache, so that the next request for the same data finds it in the cache.
Sample Model Used:
Person ( id, name, age, address (Reference))
Address (id, place)
PersonCacheModel extends Person with addressId.
I do not store the parent object together with its child object in the cache. That is why I created PersonCacheModel with addressId: I store this object in the cache, and when reading, the PersonCacheModel is converted to a Person and a call is made to the address repository (and its address cache) to fill in the person's address details.
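The lazy-loading flow described above is commonly called cache-aside. A minimal sketch with dictionaries in place of Redis and MongoDB (all names are hypothetical) makes the number of lookups per request visible: each person read is followed by a separate address read, and each of those goes through its own cache.

```python
# Dictionaries stand in for MongoDB collections and Redis (assumption).
person_db = {"alice": {"id": 1, "name": "alice", "age": 30, "addressId": 7}}
address_db = {7: {"id": 7, "place": "Pune"}}
person_cache = {}
address_cache = {}
stats = {"db_reads": 0, "cache_reads": 0}


def cached_get(cache, database, key):
    """Look in the cache first; on a miss, read the database and fill the cache."""
    stats["cache_reads"] += 1
    if key in cache:
        return cache[key]
    stats["db_reads"] += 1
    value = database.get(key)
    if value is not None:
        cache[key] = value
    return value


def find_person_by_name(name):
    cached = cached_get(person_cache, person_db, name)
    if cached is None:
        return None
    # Second lookup: the address is cached separately, keyed by addressId.
    address = cached_get(address_cache, address_db, cached["addressId"])
    return {"id": cached["id"], "name": cached["name"],
            "age": cached["age"], "address": address}


first = find_person_by_name("alice")
second = find_person_by_name("alice")
print(stats)  # {'db_reads': 2, 'cache_reads': 4}
```

With a networked cache, each of those cache reads is a round trip, which is one thing worth counting when comparing the two paths.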
As far as I understand:
personRepository.findPersonByName(NAME + randomNumber);
Access Data from Cache = network time + cache access time + deserialize time
Access Data from database = network time + database query time + object mapping time
When I ran the above approach for 1000 rows, accessing data from the database was faster than accessing it from the cache. I believed cache access time must be smaller than MongoDB access time.
Please let me know if there's an issue with the approach or whether this is the expected behavior.

To have a valid benchmark we need to consider both the hardware side and the data-processing side:
hardware - do both have the same configuration: RAM, CPU count, OS, etc.
process - how the data is transformed (single thread, multiple threads, per object, per request)
Performing a load test on your data set will give you a good overview of which process is faster in a particular use case.
It is hard to judge what the result should be until the points above are known.
The other thing is to have more than one test scenario and to run each under stress for different durations, say 10 seconds, a minute, 5 minutes, an hour, so you get numbers that tell you the truth.

Mongo transactions and updates

If I've got an environment with multiple instances of the same client connecting to a MongoDB server and I want a simple locking mechanism to ensure single client access for a short time, can I safely use my own lock object?
Say I have one object with a lockState that can be "locked" or "unlocked" and the plan is everyone checks that it is "unlocked" before doing "stuff". To lock the system I say:
db.collection.update( { "lockState": "unlocked" }, { "lockState": "locked" })
(aka UPDATE lockObj SET lockState = 'locked' WHERE lockState = 'unlocked')
If two clients try to lock the system at the same time, is it possible that both clients can end up thinking they "have the lock"?
Both clients find the record by the query parameter of the update
Client 1 updates the record (which is an atomic operation)
update returns success
Client 2 updates the document (it's already found it before client 1 modified it)
update returns success
I realize this is probably a very contrived case that would be very hard to reproduce, but is it possible or does mongo somehow make client 2's update fail?
Alternative approach
Use insert instead of update. insert is atomic and will fail if a document with the same unique key already exists.
To lock the system: db.locks.insert({someId: 27, state: "locked"}).
If the insert succeeds - I've got the lock, and since the insert was atomic, no one else can have it.
If the insert fails - someone else must have the lock.

If two clients try to lock the system at the same time, is it possible that both clients can end up thinking they "have the lock"?
No. Only one client at a time writes to the lock space (global, database, collection or document, depending on your version and configuration), and operations on that lock space are sequential. Each document is held for either reading or writing, not both, so other connections cannot pick up a document in an in-between state and mistakenly conclude that it is not locked by another client.
All operations on a single document are atomic, whether update or insert.
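The behavior can be modeled with a small simulation. LockDocument is a hypothetical in-memory object whose mutex plays the role of the database's per-document atomicity; its update method returns the number of documents modified. The detail worth noting is that an update whose query matches nothing is not an error, so each client has to look at the modified count to learn whether it actually took the lock.

```python
import threading


class LockDocument:
    """Toy lock document; the mutex models atomic single-document updates."""

    def __init__(self):
        self._state = "unlocked"
        self._mutex = threading.Lock()

    def update(self, expected, new):
        """Apply the change if the state matches; return documents modified."""
        with self._mutex:
            if self._state != expected:
                return 0  # query matched nothing: no error, nothing modified
            self._state = new
            return 1


lock_doc = LockDocument()
winners = []


def client(name):
    # Match and modify happen as one atomic step inside update().
    if lock_doc.update("unlocked", "locked") == 1:
        winners.append(name)


threads = [threading.Thread(target=client, args=(f"client-{i}",))
           for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(winners))  # 1: exactly one client holds the lock
```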
