Preventing race conditions with transaction locking in MongoDB

I have a REST API (distributed across multiple hosts/containers) that uses MongoDB as a database. Among the collections in my database, I want to focus on the Users and Games collections in this example.
Let's say I have an endpoint (called by the user/client) called /join_game. This endpoint steps through the following logic:
Check if game is open (query the Games model)
If the game is open, allow the user to join (continue with below logic)
Add player to the participants field in the Games model and update that document
Update some fields in the Users document (stats, etc.)
And let there be another endpoint (called by a cron job on the server) called /close_game, which steps through the following logic:
Close the game (update the Games Model)
Determine the winner & update their stats (update in the Users model)
Now I know that the following race condition is possible between two concurrent requests handled by each of the endpoints:
request A to /join_game called by a client - /join_game controller checks if game is open (it is so it proceeds with the rest of the endpoint logic)
request B to /close_game called internally by the server - /close_game controller sets the game as closed within the game's document
If these requests are concurrent and request A arrives just before request B, then the remaining /join_game logic might still execute even though the game has technically been closed. This is obviously behavior I don't want, and it can introduce many errors/unexpected outcomes.
To prevent this, I looked into using the transactions API, since it makes all database operations within the transaction atomic. However, I'm not sure transactions actually solve my case, because I'm not sure whether they place a complete lock on the documents being queried and modified (I read that MongoDB uses shared locks for reads and exclusive locks for writes). And if they do put a complete lock in place, would other database calls to those documents simply wait for the transaction to complete? I also read that transactions abort if they wait longer than a certain period of time, which can also lead to unwanted behavior.
If transactions are not the way to go about preventing race conditions across multiple different endpoints, I'd like to know of any good alternative methods.
Originally, I was using an in-memory queue to handle these race conditions, which seemed to work while the REST API ran on a single node. But as I scale up, managing this queue across distributed servers will become more of an issue, so I'd like to handle these race conditions directly within Mongo if possible.
Thanks!

From my understanding, it looks like you don't need to use transactions in MongoDB; you can use MongoDB's atomic update operations instead.
Each write to a single document is applied atomically, one at a time. So if a close-game update executes before your join-game update, the join's filter simply won't match anymore and the player won't be added.
This is why schema design is important: you'll need to figure out how to model your documents so that atomic updates solve the concurrency issues. For updating other collections that are essentially views (e.g. how many times a player has won/lost or how many games they've joined), I'd make those eventually consistent, driven by events.
You can also use the findAndModify / findOneAndUpdate operations, which can return the new state of the document once the update has completed. That becomes really useful for dealing with concurrency.
// Seed a few games (mongo-shell syntax).
db.test.insert({ _id: 1, name: "Game 1", players: [], isClosed: false })
db.test.insert({ _id: 2, name: "Game 2", players: [], isClosed: false })
db.test.insert({ _id: 3, name: "Game 3", players: [], isClosed: false })
// Join: the filter only matches while the game is still open.
db.test.update({ _id: 1, isClosed: false }, { $push: { players: 20 } })
// Close: $set avoids replacing the whole document; again the filter only matches an open game.
db.test.update({ _id: 1, isClosed: false }, { $set: { isClosed: true, players: [] } })
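If you also need the resulting state of the game (for example, to tell the caller whether the join actually went through), a findOneAndUpdate sketch along these lines could work; the collection name test and the player id 20 simply follow the example above:
var game = db.test.findOneAndUpdate(
    { _id: 1, isClosed: false },      // only matches while the game is still open
    { $push: { players: 20 } },       // add the player atomically
    { returnNewDocument: true }       // return the post-update document
);
// game is null if the game was already closed, so the endpoint can reject the join.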

Related

CQRS and Event Sourcing coupled with Relational Database Design

Let me start by saying that I do not have real-world experience with CQRS, and that is the basis for this question.
Background:
I am building a system where a new key requirement is allowing admins to "playback" user actions (admins want to be able to step through every action that has happened in the system up to any particular point). The caveats are: the company already has reports that are generated off their current SQL db and that they will not change (at least not in parallel with this new requirement), so the storage of record will be SQL. I do not have access to SQL Server's Change Data Capture, and creating a bunch of history tables with triggers instead would be incredibly difficult to maintain, so I'd like to avoid that if at all possible. Lastly, there are potentially (not currently) a lot of data entry points that go through a versioning lifecycle that will result in changes to the SQL db (adding/removing fields), so if I tried to implement change tracking in SQL, I'd have to maintain the tables that handled the older versions of the data (a nightmare).
Potential Solution
I am thinking about using NoSQL (Azure DocumentDB) to handle data storage (writes) and then have command handlers handle updating the current SQL (Azure SQL) with the relevant data to be queried (reads). That way the audit trail is created and that idea of "playing back" can be handled while also not disturbing the current back end functionality that is provided.
This approach would handle the requirement and satisfy the caveats. I wouldn't use CQRS for the entire app, just for the pieces where I need this "playback" functionality. I know that I would have to mitigate failure points along the Client -> Write to DocumentDB -> Respond to user with success/fail -> Write to SQL on successful DocumentDB write path, but my novice CQRS eyes can't see a reason why this isn't a great way to handle it.
Any advice would be greatly appreciated.
This article explains the CQRS pattern and provides an example of a CQRS implementation; please refer to it.
I am thinking about using NoSQL (Azure DocumentDB) to handle data storage (writes) and then have command handlers handle updating the current SQL (Azure SQL) with the relevant data to be queried (reads).
Here is my suggestion: when a user performs a write operation to update a record, we could always do an insert before an admin audits the user's operation. For example, if a user wants to update a record, instead of updating it directly we could insert a new version of the entity with a property that indicates whether the current operation has been audited by an admin.
Original data in document
{
    "version1_data": {
        "data": {
            "id": "1",
            "name": "jack",
            "age": 28
        },
        "isaudit": true
    }
}
To update the age field, we insert an entity with the updated information instead of modifying the original data directly:
{
    "version1_data": {
        "data": {
            "id": "1",
            "name": "jack",
            "age": 28
        },
        "isaudit": true
    },
    "version2_data": {
        "data": {
            "id": "1",
            "name": "jack",
            "age": 29
        },
        "isaudit": false
    }
}
The admin can then check the current document to audit the user's operations and decide whether the update should be written to the SQL database.
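A minimal sketch of that versioned-insert idea, written in mongo-shell syntax purely for illustration (DocumentDB's own API differs, and the collection name audits and the _id value are assumptions):
// Add version2_data alongside the original version instead of overwriting it.
db.audits.updateOne(
    { _id: "1" },
    { $set: { version2_data: { data: { id: "1", name: "jack", age: 29 }, isaudit: false } } }
);
// The admin later flips isaudit to true once the change has been reviewed.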
One potential way to think about this is creating a transaction object that has a unique id and represents the work that needs to be done. The transaction in this case would be "write an object to DocumentDB" or "write an object to the SQL db". It could contain the in-memory object to be written and the destination db (doc db, SQL, etc.) connection parameters.
Once you define your transaction, you need to adjust your workflow for proper CQRS. Instead of the client writing to the doc db directly and waiting on the result of that call, let the client create a transaction with a unique id (something like date-time tick counts or an incremental transaction id, for instance) and write that transaction to a message queue like Azure Queue or Service Bus. Once the transaction is written to the queue, return success to the user. Then create worker roles that read the transaction messages from the queue and process them, writing the objects to the doc db. That is not overwriting the same entity in the doc db, but just writing the transaction with its unique incremental id to the doc db for that particular entity. You could also use Azure Table Storage for that, afaik.
After successfully updating the doc db, the same worker role can write this transaction to a different message queue, which is processed by its own set of worker roles that update the entity in the SQL db. If anything goes wrong in the interim, keep an error table, record failures in it, and query and retry them later on.
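A hedged sketch of what such a transaction message might look like while it sits on the queue; every field name here is an assumption for illustration rather than something prescribed above:
{
    "transactionId": "637822512000000001",   // e.g. date-time tick count or incremental id
    "destination": "documentdb",             // second-stage messages would say "sql"
    "operation": "upsert",
    "entityId": "1",
    "payload": { "id": "1", "name": "jack", "age": 29 }
}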

MongoDB and Write/Read Consistency

I've passed a MongoU course and am midway through a second. I've read what I can and done what I can to learn, but have failed to learn how to handle what I consider a standard situation.
Consider a booking system for Hotel Rooms. There is a collection bookings:
4/1 - Room 1
4/1 - Room 2
4/1 - Room 3
Then when a client checks the bookings collection for { date: "4/1", Room: 3 }, they will find a booking and the application can deny the request.
However, say two users look for { date: "4/1", Room: 4 } at the same time; the application will proceed with the booking for both clients, meaning both will try to create the booking.
What happens next? Does one of the clients get the booking while the other fails (somewhat of a race condition)? Or does one client get the booking and the other person overwrite it?
Can this be prevented with a write concern? Or some other lock? Or is this a better case for a more atomic database?
All the demos I see have to do with a blog, which have very little concerns for unique data.
In general, you need to be careful with your data models and – of course – your application flow.
One way to prevent duplicate bookings would be to use a compound _id field:
{
    _id: { room: 1, date: "4/1" }
}
So once room 1 is booked for 4/1, there is no way a duplicate booking for room 1 can be created, as _id values are guaranteed to be unique: the second insert will fail. Alternatively, you can create a unique index on an arbitrary compound field.
Be aware, though, that your application should not perform upserts or updates on this document without proper permission checks, which does not apply to MongoDB only. In the most simple case, for updates, you would need to check whether the user trying to update the document actually is the user who booked the room. So our model needs to be expanded:
{
    _id: { room: "1", date: "4/1" },
    reservationFor: "userID"
}
So now it becomes pretty easy: before inserting a reservation, you check for an existing one. If the result is not empty, there already is a reservation for that room. If an exception is thrown because of a duplicate _id on insert, a reservation was made in the meantime. Before doing an update, you need to check whether reservationFor holds the userID of the current user.
How to react to this kind of situation heavily depends on the language and framework(s) used to develop your application. What I tend to do is catch the corresponding exceptions and set an errorMessage accordingly.
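For example, in the mongo shell the duplicate-key case surfaces as an error with code 11000, which you can catch and translate into a message; the exact error shape varies by shell/driver version, and the collection name bookings is assumed:
try {
    db.bookings.insertOne({
        _id: { room: 1, date: "4/1" },   // the compound _id guarantees uniqueness
        reservationFor: "userID"
    });
    print("Booking created");
} catch (e) {
    if (e.code === 11000) {              // duplicate key: someone booked it first
        print("Room 1 is already booked for 4/1");
    } else {
        throw e;
    }
}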

Many to many relationship on Mongodb based e-learning webapp?

I am relatively new to NoSQL databases. I am designing a data structure for an e-learning web app. There will be X courses and Y users.
Every user will be able to take any number of courses.
Every course will be compound of many sections (each section may be a video or a quiz).
I will need to keep track of every section a user takes, so I think the whole course should be part of each user's document, like so:
{
    _id: "ed",
    name: "Eduardo Ibarra",
    courses: [
        {
            name: "Node JS",
            progress: "100%",
            section: [
                { name: "Introduction", passed: "100%", field3: "x", field4: "" },
                { name: "Quiz 1", passed: "75%", questions: [...], field3: "x", field4: "" }
            ]
        },
        {
            name: "MongoDB",
            progress: "65%",
            ...
        }
    ]
}
Is this the best way to do it?
I would say: design your database depending upon your queries. One thing is for sure - you will have to do some embedding.
If you are going to perform more queries on what a user is doing, then make the user the primary entity and embed the courses within it. You don't need to embed the entire course info, because the info about a course is static. For example, the data about the Node JS course (the content, the author of the course, exercise files, etc.) will not change, so you can keep the courses' info separately in another collection. But how much of a course a user has completed depends on the individual user, so you should only keep the id of the course (which is stored in the separate 'courses' collection), and for each user store the information related to that (user, course) pair embedded in the user document itself.
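A minimal sketch of that split, where all collection and field names are just illustrative assumptions:
// Static course info lives in its own collection.
db.courses.insertOne({
    _id: "nodejs",
    name: "Node JS",
    sections: [ { name: "Introduction" }, { name: "Quiz 1" } ]
});
// Each user stores only a reference to the course plus their own progress.
db.users.insertOne({
    _id: "ed",
    name: "Eduardo Ibarra",
    courses: [
        { courseId: "nodejs", progress: "100%", sections: [ { name: "Introduction", passed: "100%" } ] }
    ]
});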
Now the most important question: what to do if you have to perform queries which require a 'join' of the user and course collections? For this you can use JavaScript to first get the courses (and maybe store them in an array or list) and then fetch the user for each of those courses from the courses collection, or vice versa. There are a few drivers available online to help you accomplish this; one is UnityJDBC, which is available here.
From my experience, knowing what you are going to query from MongoDB is very helpful in designing your database, because the NoSQL nature of MongoDB means there is no single correct design; a design is only wrong if it does not let you accomplish your task. So clearly, knowing beforehand what you will do with the database (i.e. what you will query) is the only guide.

MongoDB findAndModify from multiple clients

My MongoDB collection is used as a job queue, and there are 3 C++ machines that read from this collection. The problem is that those three must not perform the same job; every job needs to be done only once.
I fetch all un-done jobs by searching the collection for all records with 'isDone: false' and then update each document to 'isDone: true'. But if 2 machines find the same document at the same time, they will both do the same job. How can I avoid this?
Edit: My question is - does findAndModify really solve that problem?
(After reading A way to ensure exclusive reads in MongoDb's findAndModify?)
Yes, findAndModify solves it.
Ref: MongoDB findAndModify documentation
"...
Note: This command obtains a write lock on the affected database and will block other operations until it has completed; however, typically the write lock is short lived and equivalent to other similar update() operations.
..."
Ref: http://docs.mongodb.org/manual/reference/method/db.collection.update/#db.collection.update
"...
For unsharded collections, you can override this behavior with the $isolated isolation operator, which isolates the update operation and blocks other write operations during the update. See the isolation operator.
..."
Ref: http://docs.mongodb.org/manual/reference/operator/isolated/
Regards,
Moacy
Yes, find-and-modify will solve your problem:
db.collection.findAndModify( {
    query: { isDone: false },
    update: { $set: { isDone: true } },
    new: true,
    upsert: false  // never create new docs
} );
This will return a single document that it just updated from false to true.
But you have a serious problem if your C++ clients ever have a hiccup (the box dies, they are killed, the code has an error, etc.). Imagine your TCP connection dropping just after the update on the server, but before the C++ code gets the job. It's generally better to have a multi-phase approach:
change "isDone" to "isInProgress", then delete the document when the job is done. (Now you can see the stack of "todo" and "being done". If something is "being done" for a long time, the client probably died.)
change "isDone" to "phase" and atomically set it from "new" to "started" (and later set it to "finished"). Now you can see if something has been "started" for a long time, meaning the client may have died (see the sketch after this list).
If you're really sophisticated, you can make a partial index. For example, only index documents with phase: { $ne: 'finished' }. Now you don't need to waste space indexing the millions of finished documents; the index only holds the handful of new/in-progress documents, so it's smaller and faster.
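As a sketch of the "phase" approach above, a worker could claim a job atomically like this (the collection name jobs and the fields startedAt and worker are assumptions):
var job = db.jobs.findAndModify({
    query:  { phase: "new" },
    update: { $set: { phase: "started", startedAt: new Date(), worker: "worker-1" } },
    new:    true
});
// job is null when there is nothing to claim; otherwise this worker owns it and
// later sets phase to "finished" (or the job is re-queued if startedAt is too old).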

MongoDB database schema design

I have a website with 500k users (running on SQL Server 2008). I now want to include activity streams of users and their friends. After testing a few things on SQL Server it became apparent that an RDBMS is not a good choice for this kind of feature; it's slow (even when I heavily de-normalized my data). So after looking at other NoSQL solutions, I've figured that I can use MongoDB for this. I'll be following a data structure based on the activitystrea.ms JSON specification for activity streams.
So my question is: what would be the best schema design for an activity stream in MongoDB? (With this many users you can pretty much predict that it will be very heavy on writes, hence my choice of MongoDB - it has great write performance.) I've thought about 3 types of structures; please tell me if these make sense or whether I should use other schema patterns.
1 - Store each activity with all friends/followers in this pattern:
{
    _id: 'activ123',
    actor: {
        id: person1
    },
    verb: 'follow',
    object: {
        objecttype: 'person',
        id: 'person2'
    },
    updatedon: Date(),
    consumers: [
        person3, person4, person5, person6, ... so on
    ]
}
2 - Second design: Collection name- activity_stream_fanout
{
    _id: 'activ_fanout_123',
    personId: person3,
    activities: [
        {
            _id: 'activ123',
            actor: {
                id: person1
            },
            verb: 'follow',
            object: {
                objecttype: 'person',
                id: 'person2'
            },
            updatedon: Date()
        },
        // activity feed 2
    ]
}
3 - This approach would be to store the activity items in one collection, and the consumers in another. In activities, you might have a document like:
{ _id: "123",
actor: { person: "UserABC" },
verb: "follow",
object: { person: "someone_else" },
updatedOn: Date(...)
}
And then, for followers, I would have the following "notifications" documents:
{ activityId: "123", consumer: "someguy", updatedOn: Date(...) }
{ activityId: "123", consumer: "otherguy", updatedOn: Date(...) }
{ activityId: "123", consumer: "thirdguy", updatedOn: Date(...) }
Your answers are greatly appreciated.
I'd go with the following structure:
Use one collection for all actions that happened, Actions
Use another collection for who follows whom, Subscribers
Use a third collection, Newsfeed for a certain user's news feed, items are fanned-out from the Actions collection.
The Newsfeed collection will be populated by a worker process that asynchronously processes new Actions. Therefore, news feeds won't populate in real-time. I disagree with Geert-Jan in that real-time is important; I believe most users don't care for even a minute of delay in most (not all) applications (for real time, I'd choose a completely different architecture).
If you have a very large number of consumers, the fan-out can take a while, true. On the other hand, putting the consumers right into the object won't work with very large follower counts either, and it will create overly large objects that take up a lot of index space.
Most importantly, however, the fan-out design is much more flexible and allows relevancy scoring, filtering, etc. I have just recently written a blog post about news feed schema design with MongoDB where I explain some of that flexibility in greater detail.
Speaking of flexibility, I'd be careful about that activitystrea.ms spec. It seems to make sense as a specification for interop between different providers, but I wouldn't store all that verbose information in my database as long as you don't intend to aggregate activities from various applications.
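A rough sketch of that three-collection fan-out in mongo-shell syntax; the collection and field names (actions, subscribers, newsfeed, follower/followee) are illustrative assumptions:
// 1. Record the action itself.
var action = { _id: 'activ123', actor: 'person1', verb: 'follow', object: 'person2', updatedOn: new Date() };
db.actions.insertOne(action);
// 2. Worker process: fan the action out to each follower's news feed.
db.subscribers.find({ followee: action.actor }).forEach(function (sub) {
    db.newsfeed.insertOne({
        consumer: sub.follower,
        actionId: action._id,
        updatedOn: action.updatedOn
    });
});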
I believe you should look at your access patterns: what queries are you likely to perform most on this data, etc.
To me the use-case that needs to be fastest is being able to push a certain activity to the 'wall' (in FB terms) of each of the 'activity consumers', and to do so immediately when the activity comes in.
From this standpoint (I haven't given it much thought) I'd go with 1, since 2 seems to batch activities for a certain user before processing them, and thereby fails the need for 'immediate' updates. Moreover, I don't see the advantage of 3 over 1 for this use-case.
Some enhancements on 1? Ask yourself if you really need the flexibility of defining an array of consumers for every activity. Is there really a need to specify this at such a fine-grained scale? Wouldn't a reference to the 'friends' of the 'actor' suffice instead? (This would save a lot of space in the long run, since I see the consumers array becoming the bulk of the entire message for each activity when consumers typically range in the hundreds.)
On a somewhat related note: depending on how you might want to implement realtime notifications for these activity streams, it might be worth looking at Pusher - http://pusher.com/ - and similar solutions.
hth