Detecting concurrent data modification of document between read and write - mongodb

I'm interested in a scenario where a document is fetched from the database, some computations are run based on some external conditions, one of the fields of the document gets updated and then the document gets saved, all in a system that might have concurrent threads accessing the DB.
To make it easier to understand, here's a very simplistic example. Suppose I have the following document:
{
...
items_average: 1234,
last_10_items: [10,2187,2133, ...]
...
}
Suppose a new item (X) comes in, five things will need to be done:
1. read the document from the DB
2. remove the first (oldest) item in the last_10_items
3. add X to the end of the array
4. re-compute the average* and save it in items_average.
5. write the document to the DB
* NOTE: the average computation was chosen as a very simple example, but the question should take into account more complex operations based on data existing in the document and on new data (i.e. not something solvable with the $inc operator)
This certainly is something easy to implement in a single-threaded system, but in a concurrent system, if 2 threads would like to follow the above steps, inconsistencies might occur since both will update the last_10_items and items_average values without considering and/or overwriting the concurrent changes.
So, my question is: how can such a scenario be handled? Is there a way to check for, or react to, the fact that the underlying document was changed between steps 1 and 5? Is there such a thing as WATCH from redis or a 'Concurrent Modification Error' from relational DBs?
Thanks

Database systems typically handle this with a memory inspection and rollback scheme, similar to transactional memory.
Briefly, the system monitors the shared memory you specify and applies a primitive such as compare-and-swap, load-link/store-conditional, or test-and-set.
Therefore, if any of that memory content changes during the transaction, the transaction aborts and retries until no conflicting operation touches the shared memory.
For example, GCC provides the following built-ins:
https://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html
type __sync_lock_test_and_set (type *ptr, type value, ...)
type __sync_val_compare_and_swap (type *ptr, type oldval, type newval, ...)
For more information about transactional memory, see:
http://en.wikipedia.org/wiki/Software_transactional_memory
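Applied to the document scenario in the question, the compare-and-swap idea is usually expressed as optimistic concurrency control with a version field. The following is only a minimal sketch under stated assumptions: `VersionedStore`, `replace_if_version`, `add_item` and the `version` field are hypothetical names, and an in-memory dictionary stands in for the database so the example runs on its own. With a real MongoDB driver, the same check would normally be a conditional update whose filter includes the version that was read (for example pymongo's `update_one`, retrying when nothing matched).

```python
import copy
import threading


class VersionedStore:
    """In-memory stand-in for a document collection (hypothetical, not a MongoDB API)."""

    def __init__(self):
        self._docs = {}
        self._lock = threading.Lock()  # models the atomicity of a single-document write

    def insert(self, doc_id, doc):
        with self._lock:
            self._docs[doc_id] = dict(copy.deepcopy(doc), version=0)

    def read(self, doc_id):
        with self._lock:
            return copy.deepcopy(self._docs[doc_id])

    def replace_if_version(self, doc_id, expected_version, new_doc):
        """Compare-and-swap: write only if the version is still the one we read."""
        with self._lock:
            if self._docs[doc_id]["version"] != expected_version:
                return False  # someone else wrote in between
            self._docs[doc_id] = dict(copy.deepcopy(new_doc), version=expected_version + 1)
            return True


def add_item(store, doc_id, x, max_retries=1000):
    """Steps 1-5 from the question, retried until no concurrent change is detected."""
    for _ in range(max_retries):
        doc = store.read(doc_id)                         # step 1: read
        items = doc["last_10_items"][1:] + [x]           # steps 2-3: drop oldest, append X
        doc["last_10_items"] = items
        doc["items_average"] = sum(items) / len(items)   # step 4: recompute
        if store.replace_if_version(doc_id, doc["version"], doc):  # step 5: conditional write
            return True
    raise RuntimeError("too many concurrent modifications")
```

A writer whose read has gone stale gets `False` back from `replace_if_version` and simply repeats the read-compute-write cycle, which is the abort-and-retry behaviour described above.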


DDD, Event Sourcing, and the shape of the Aggregate state

I'm having a hard time understanding the shape of the state that's derived applying that entity's events vs a projection of that entity's data.
Is an Aggregate's state ONLY used for determining whether or not a command can successfully be applied? Or should that state be usable in other ways?
An example - I have a Post entity for a standard blog post. I might have events like postCreated, postPublished, postUnpublished, etc. For the projections that I'll be persisting in my read tables, I need a projection for the base posts (which will include all posts, regardless of status, with lots of detail) as well as a published_posts projection (which will only represent posts that are currently published, with only the information necessary for rendering).
In the situation above, is my aggregate state ONLY supposed to be used to determine, for example, if a post can be published or unpublished, etc? If this is the case, is the shape of my state within the aggregate purely defined by what's required for these validations? For example, in my base post projection, I want to have a list of all users that have made a change to the post. In terms of validation for the aggregate/commands, I couldn't care less about the list of users that have made changes. Does that mean that this list should not be a part of my state within my aggregate?
TL;DR: yes - limit the "state" in the aggregate to that data that you choose to cache in support of data change.
In my aggregates, I distinguish two different ideas:
the history, aka the sequence of events that describes the changes in the lifetime of the aggregate
the cache, aka the data values we tuck away because querying the event history every time kind of sucks.
There's not a lot of value in caching results that we are never going to use.
One of the underlying lessons of CQRS is that we don't need aggregates everywhere.
An AGGREGATE is a cluster of associated objects that we treat as a unit for the purpose of data changes. -- Evans, 2003
If we aren't changing the data, then we can safely work directly with immutable copies of the data.
The only essential purpose of the aggregate is to determine what events, if any, need to be applied to bring the aggregate's state in line with a command (if the aggregate can be brought so in line). All state that's not needed for that purpose can be offloaded to a read-side, which can be thought of as a remix of the event stream (with each read-side only maintaining the state it needs).
That said, there are, in practice, reasons to use the aggregate state directly, the primary one being a desire for stronger consistency for the aggregate: CQRS is inherently eventually consistent. As with all questions of consistent updates, it's important to recognize that consistency isn't free and very often isn't even cheap; I tend to think of a project as having a consistency budget and I'm pretty miserly about spending it.
In your case, there's probably no reason to include the list of users changing a post in the aggregate state, unless e.g. there's something like "no single user can modify a given post more than n times".
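To make the split concrete, here is a rough sketch of the blog-post example in Python. The class names, the dictionary event shape and the `user` field are assumptions made for illustration only; the point is that the aggregate caches nothing beyond what its commands need, while the projection remixes the same events into whatever the read side wants.

```python
from dataclasses import dataclass, field


@dataclass
class PostAggregate:
    """Write side: caches only what command validation needs."""
    published: bool = False

    def apply(self, event):
        if event["type"] == "postPublished":
            self.published = True
        elif event["type"] == "postUnpublished":
            self.published = False

    def publish(self, user):
        """Decide which events (if any) the publish command produces."""
        if self.published:
            raise ValueError("post is already published")
        return [{"type": "postPublished", "user": user}]


@dataclass
class PostProjection:
    """Read side: built from the same events, with the detail queries want."""
    published: bool = False
    editors: list = field(default_factory=list)

    def apply(self, event):
        if event["user"] not in self.editors:
            self.editors.append(event["user"])  # irrelevant to validation, useful for display
        if event["type"] == "postPublished":
            self.published = True
        elif event["type"] == "postUnpublished":
            self.published = False


history = [
    {"type": "postCreated", "user": "alice"},
    {"type": "postPublished", "user": "bob"},
    {"type": "postUnpublished", "user": "alice"},
]

aggregate, projection = PostAggregate(), PostProjection()
for event in history:
    aggregate.apply(event)
    projection.apply(event)

new_events = aggregate.publish("carol")  # allowed: the post is currently unpublished
```

If a rule such as "no single user can modify a given post more than n times" were introduced, the per-user counts would then have a reason to move into `PostAggregate`.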

CQRS, Event-Sourcing and Web-Applications

As I am reading some CQRS resources, there is a recurrent point I do not catch. For instance, let's say a client emits a command. This command is integrated by the domain, so it can refresh its domain model (DM). On the other hand, the command is persisted in an Event-Store. That is the most common scenario.
1) When we say the DM is refreshed, I suppose data is persisted in the underlying database (if any). Am I right? Otherwise, we would be dealing with a memory-transient model, which I suppose would not be a good thing (state is not supposed to remain in memory on the server side outside a client request).
2) If data is persisted, I suppose the read model that relies on it is automatically updated, as each client that requests it generates a new "state/context" in the application (in the case of a web application or a RESTful architecture)?
3) If the command is persisted, does that mean we are dealing with Event Sourcing (by construction when we use CQRS)? Does Event Sourcing invalidate the database update process (since, if state is reconstructed from the Event Store, maintaining the database seems useless)?
Does CQRS only apply to multi-database systems (where data is propagated to separate databases), and, if it deals with memory-transient models, does that fit well with web applications or RESTful services?
1) As already said, the only things that are really stored are the events.
Commands only perform consistency checks before raising events. In pseudo-code:
public void BorrowBook(BorrowableBook dto) {
    if (dto is valid)
        RaiseEvent(new BookBorrowedEvent(dto))
    else
        throw exception
}
public void Apply(BookBorrowedEvent evt) {
    this.aProperty = evt.aProperty;
    ...
}
The current state is retrieved by applying the events sequentially. Because of this, you have to pay close attention in the design phase, since there are common pitfalls to avoid (maybe you have already read it, but let me suggest this article by Martin Fowler).
So far so good, but this is just Event Sourcing. CQRS comes into play if you decide to use a different database to persist the state of an aggregate.
In my project we have a projection that, every x minutes, applies the new events (from the event store) to the aggregate and saves the results to a separate MongoDB instance (the presentation layer reads from this DB). This model is clearly eventually consistent, but this way you really separate Command (write) from Query (read).
2) If you have decided to separate the write model from the read model, there are various options for keeping them synchronized:
Every x seconds, apply the events since the last checkpoint (some solutions offer snapshots to avoid reapplying heavy commands)
A projection that subscribes to events and updates the read model as soon as an event is raised
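The first option, catching up from a checkpoint, could look roughly like the sketch below. Everything here is illustrative: a Python list stands in for the event store, a dictionary for the read model, and the event shapes (including the `BookReturned` event) are hypothetical.

```python
def catch_up(event_store, read_model, checkpoint):
    """Apply the events recorded after `checkpoint` and return the new checkpoint."""
    for position in range(checkpoint, len(event_store)):
        event = event_store[position]
        if event["type"] == "BookBorrowed":
            read_model[event["book_id"]] = {"borrowed_by": event["member"]}
        elif event["type"] == "BookReturned":
            read_model.pop(event["book_id"], None)
    return len(event_store)


event_store = [
    {"type": "BookBorrowed", "book_id": "b1", "member": "ann"},
    {"type": "BookBorrowed", "book_id": "b2", "member": "raj"},
]
read_model = {}

# First scheduled run: processes both events.
checkpoint = catch_up(event_store, read_model, 0)

# A new event arrives; the next run only touches what came after the checkpoint.
event_store.append({"type": "BookReturned", "book_id": "b1"})
checkpoint = catch_up(event_store, read_model, checkpoint)
```

Between two runs the read model lags behind the event store, which is the eventual consistency mentioned above. The subscription option removes the polling interval but follows the same apply-per-event logic.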
3) The only things stored are the events. In fact, we have an event store, not a command store :)
Is a database useless? It depends! How many events do you need to reapply to bring the aggregate to its current state?
Three? Maybe you don't need a database for the read model.
The thing to grok is that the ONLY thing stored is the events*. The domain model is rebuilt from the events.
So yes, the domain model is memory-transient, as you say, in that no representation of the domain model is stored*, only the events that happened to the domain to put the model in its current state.
When an element from the domain model is loaded, a new instance of the element is created and the events that affect that instance are then replayed one after the other, in the right order, to put the element into the correct state.
You could keep instances of your domain objects around and subscribe them to new events so that they stay up to date without being reloaded from all the events every time, but usually it's quick enough to load all the events from the database and apply them every time, in the same way that you might load the instance from the database on every call to your web service.
*Unless you have snapshots of your domain object to reduce the number of events you need to load/process
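As a small illustration of replaying events, with and without a snapshot, consider the sketch below. The account-style events and the snapshot layout (`version` plus `state`) are assumptions chosen to keep the example short, not something prescribed by any particular event store.

```python
def apply(state, event):
    """Fold step: return the next state after one event."""
    if event["type"] == "Deposited":
        return {"balance": state["balance"] + event["amount"]}
    if event["type"] == "Withdrawn":
        return {"balance": state["balance"] - event["amount"]}
    return state  # unknown events leave the state untouched


def load(events, snapshot=None):
    """Rebuild state by replaying events, starting from a snapshot if one exists."""
    if snapshot is None:
        state, start = {"balance": 0}, 0
    else:
        # `version` is the number of events already folded into the snapshot.
        state, start = dict(snapshot["state"]), snapshot["version"]
    for event in events[start:]:
        state = apply(state, event)
    return state


events = [
    {"type": "Deposited", "amount": 100},
    {"type": "Withdrawn", "amount": 30},
    {"type": "Deposited", "amount": 5},
]

# Snapshot taken after the first two events; only the third is replayed on load.
snapshot = {"version": 2, "state": load(events[:2])}
```

Both `load(events)` and `load(events, snapshot)` arrive at the same state; the snapshot only shortens the replay.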
Persistence of data is not strictly needed. It might be sufficient to have enough copies in enough different locations (GigaSpaces). So no, a database is not required. This is (at least was a few years ago) used in production by the Dutch eBay equivalent.

How to handle application death and other mid-operation faults with Mongo DB

Since Mongo doesn't have transactions that can be used to ensure that nothing is committed to the database unless it's consistent (non-corrupt) data, if my application dies between making a write to one document and making a related write to another document, what techniques can I use to remove the corrupt data and/or recover in some way?
The greater idea behind NoSQL was to use a carefully modeled data structure for a specific problem, instead of hitting every problem with a hammer. That is also true for transactions, which should be referred to as 'short-lived transactions', because the typical RDBMS transaction hardly helps with 'real', long-lived transactions.
The kind of transaction supported by RDBMSs is often required only because the limited data model forces you to store the data across several tables, instead of using embedded arrays (think of the typical invoice / invoice items examples).
In MongoDB, try to use write-heavy, de-normalized data structures and keep data in a single document which improves read speed, data locality and ensures consistency. Such a data model is also easier to scale, because a single read only hits a single server, instead of having to collect data from multiple sources.
However, there are cases where the data must be read in a variety of contexts and de-normalization becomes unfeasible. In that case, you might want to take a look at Two-Phase Commits or choose a completely different concurrency approach, such as MVCC (in a sentence, that's what the likes of svn, git, etc. do). The latter, however, is hardly a drop-in replacement for RDBMSs; it exposes a completely different kind of concurrency to a higher level of the application, if not to the user.
Thinking about this myself, I want to identify some categories of effects:
1. Your operation has only one database save (saving data into one document)
2. Your operation has two database saves (updates, inserts, or deletions), A and B
2.1. They are independent
2.2. B is required for A to be valid
2.3. They are interdependent (A is required for B to be valid, and B is required for A to be valid)
3. Your operation has more than two database saves
I think this is a full list of the general possibilities. In case 1, you have no problem - one database save is atomic. In case 2.1, same thing, if they're independent, they might as well be two separate operations.
For case 2.2, if you do B first and then A, at worst you will have some extra data (B data) that takes up space in your system but is otherwise harmless. In case 2.3, you'll likely have some corrupt data in the event of a catastrophic failure. And case 3 is just a composition of case 2s.
Some examples for the different cases:
1.0. You change a car document's color to 'blue'
2.1. You change the car document's color to 'red' and the driver's hair color to 'red'
2.2. You create a new engine document and add its ID to the car document
2.3.a. You change your car's 'gasType' to 'diesel', which requires changing your engine to a 'diesel' type engine.
2.3.b. Another example of 2.3: You hitch car document A to another car document B, A getting the "towedBy" property set to B's ID, and B getting the "towing" property set to A's ID
3.0. I'll leave examples of this to your imagination
In many cases, it's possible to turn a 2.3 scenario into a 2.2 scenario. In the 2.3.a example, the car document and engine are separate documents. Let's ignore the possibility of putting the engine inside the car document for this example. It's invalid both to have a diesel engine with non-diesel gas and to have a non-diesel engine with diesel gas. So they both have to change. But it may be valid to have no engine at all and have diesel gas. So you could add a step that keeps the whole thing valid at all points. First, remove the engine, then replace the gas, then change the type of the engine, and lastly add the engine back onto the car.
If you will get corrupt data from a 2.3 scenario, you'll want a way to detect the corruption. In example 2.3.b, things might break if one document has the "towing" property, but the other document doesn't have a corresponding "towedBy" property. So this might be something to check after a catastrophic failure. Find all documents that have "towing" but the document with the id in that property doesn't have its "towedBy" set to the right ID. The choices there would be to delete the "towing" property or set the appropriate "towedBy" property. They both seem equally valid, but it might depend on your application.
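A check like the one described for example 2.3.b might be sketched as follows. A plain dictionary of car documents keyed by ID stands in for the collection, and the function names are hypothetical; against a real database the same logic would be a query over all documents that have a "towing" property.

```python
def find_broken_towing(cars):
    """Return (tower_id, towed_id) pairs whose back-reference is missing or wrong."""
    broken = []
    for car_id, car in cars.items():
        towed_id = car.get("towing")
        if towed_id is None:
            continue
        towed = cars.get(towed_id)
        if towed is None or towed.get("towedBy") != car_id:
            broken.append((car_id, towed_id))
    return broken


def repair(cars, broken, set_back_reference=True):
    """Apply one of the two repair choices to every broken pair."""
    for tower_id, towed_id in broken:
        if set_back_reference and towed_id in cars:
            cars[towed_id]["towedBy"] = tower_id  # complete the half-finished hitch
        else:
            del cars[tower_id]["towing"]          # undo the half-finished hitch


cars = {
    "A": {"towedBy": "B"},
    "B": {"towing": "A"},   # consistent pair
    "C": {"towing": "D"},   # the application died before D was updated
    "D": {},
}

broken = find_broken_towing(cars)
repair(cars, broken)
```

Which repair is right (completing the operation or undoing it) depends on the application, as noted above; the `set_back_reference` flag just makes that choice explicit.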
In some situations, you might be able to find corrupt data like this, but you won't know what the data was before those things were set. In those cases, setting a default is probably better than nothing. Some types of corruption are better than others (particularly the kind that will cause errors in your application rather than simply incorrect display data).
If the above kind of code analysis or corruption repair becomes unfeasible, or if you want to avoid any data corruption at all, your last resort would be to take mnemosyn's suggestion and implement Two-Phase Commits, MVCC, or something similar that allows you to identify and roll back changes in an indeterminate state.

Applying updates to a KDB table in thread safe manner in C

I need to update a KDB table with new/updated/deleted rows while it is being read by other threads. Since writing to K structures while other threads access will not be thread safe, the only way I can think of is to clone the whole table and apply new changes to that. Even to do that, I need to first clone the table, then find a way to insert/update/delete rows from it.
I'd like to know if there are functions in C to:
1. Clone the whole table
2. Delete existing rows
3. Insert new rows easily
4. Update existing rows
Appreciate suggestions on new approaches to the same problem as well.
Based on the comments...
You need to do a set of operations on the KDB database "atomically"
You don't have "control" of the database, so you can't set functions (though you don't actually need to be an admin to do this, but that's a different story)
You have a separate C process that is connecting to the database to do the operations you require. (Given you said you don't have "access" to the database as admin, you can't get KDB to load your C binary to use within-process anyway).
Firstly, I'm going to assume you know how to connect to KDB+ and issue queries via the C API (found here).
All you need to do then is to concatenate your "atomic" operation into a set of statements that you are going to issue in one call from C. For example say you want to update a table and then delete some entry. This is what your call might look like:
{update name:`me from `table where name=`you; delete from `table where name=`other;}[]
(Caution: this is just a dummy example, I've assumed your table is in-memory so that the delete operation here would work just fine, and not saved to disk, etc. If you need specific help with the actual statements you require in your use case then that's a different question for this forum).
Notice that this is an anonymous function that will get called immediately on issue ([]). There is the assumption that your operations within the function will succeed. Again, if you need actual q query help it's a different question for this forum.
Even if your KDB database is multithreaded (started with -s or negative port number), it will not let you update global variables inside a peach thread. Therefore your operation should work just fine. But just in case there's something else that could interfere with your new anonymous function, you can wrap the function with protected evaluation.

How do I model a queue on top of a key-value store efficiently?

Suppose I have a key-value database, and I need to build a queue on top of it. How could I achieve this without getting bad performance?
One idea might be to store the queue inside an array, and simply store the array under a fixed key. This is quite a simple implementation, but it is very slow, as the complete array must be loaded / saved for every read or write access.
I could also implement a linked list with random keys, plus one fixed key that acts as the starting point to element 1. Depending on whether I prefer fast reads or fast writes, I could let the fixed element point to the first or the last entry in the queue (so I have to traverse it forward / backward).
Or, to take that further, I could also have two fixed pointers: one for the first and one for the last item.
Any other suggestions on how to do this effectively?
Fundamentally, a key-value structure is very similar to raw memory, where the physical address plays the role of the key. So any type of data structure, including a linked list, can be modeled on top of key-value storage.
A linked list is a list of nodes, each holding the index of the previous or the following node. The node itself can also be viewed as a small key-value structure. By adding a prefix to the key, the information in each node can be stored separately in a flat table of key-value pairs.
Going further, a special suffix on the key also makes it possible to get rid of the redundant pointer information. Such a list might look something like this:
pilot-last-index: 5
pilot-0: Rei Ayanami
pilot-1: Shinji Ikari
pilot-2: Soryu Asuka Langley
pilot-3: Touji Suzuhara
pilot-5: Makinami Mari
The corresponding algorithm is easy to imagine, I think. If you can have a daemon thread that manipulates these keys, pilot-5 could be renamed to pilot-4 in the above example. Even if an additional thread is not allowed in some situations, the result of the queue itself is not affected; there is just some overhead at the gap in the sequence.
Which of the two approaches above to apply is a matter of balancing the cost of storage space against the overhead of CPU time.
Thread safety is indeed a problem, though an old one. Just as with the classes implementing the ConcurrentMap interface in the JDK, atomic operations on key-value data are readily available. Some key-value middleware, such as memcached, offers similar methods that let you update a key or a value separately and in a thread-safe way. However, these implementations are a matter of the algorithm rather than of the key-value structure itself.
I think it depends on the kind of queue you want to implement, and no solution will be perfect because a key-value store is not the right data structure for this kind of task. There will be always some kind of hack involved.
For a simple first-in, first-out queue you could use a few key-value entries like the following:
{
oldestIndex:5,
newestIndex:10
}
In this example there would be 6 items in the queue (5, 6, 7, 8, 9, 10). Items 0 to 4 are already done, whereas there is no item 11 yet. The producer worker would increment newestIndex and save its item under the key 11. The consumer takes the item under the key 5 and increments oldestIndex.
Note that this approach can lead to problems if you have multiple consumers/producers, or if the queue is never empty so that you can't reset the index.
But the multithreading problem is also true for linked lists etc.
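The index-based FIFO queue described above can be sketched in a few lines. This is only an illustration: a Python dictionary stands in for the key-value store, the `item-` key prefix and the `KVQueue` class are made-up names, and nothing here is safe for multiple producers or consumers without atomic operations from the underlying store.

```python
class KVQueue:
    """FIFO queue layered on a plain key-value mapping (a dict stands in for the store)."""

    def __init__(self, store):
        self.store = store
        self.store.setdefault("oldestIndex", 0)
        self.store.setdefault("newestIndex", -1)  # -1 means nothing has been enqueued yet

    def enqueue(self, item):
        index = self.store["newestIndex"] + 1
        self.store[f"item-{index}"] = item   # write the item first...
        self.store["newestIndex"] = index    # ...then make it visible to consumers

    def dequeue(self):
        index = self.store["oldestIndex"]
        if index > self.store["newestIndex"]:
            raise IndexError("queue is empty")
        item = self.store.pop(f"item-{index}")
        self.store["oldestIndex"] = index + 1
        return item

    def __len__(self):
        return self.store["newestIndex"] - self.store["oldestIndex"] + 1


# Same situation as in the example: items 5..10 are waiting.
store = {"oldestIndex": 5, "newestIndex": 10}
store.update({f"item-{i}": f"job {i}" for i in range(5, 11)})

queue = KVQueue(store)
queue.enqueue("job 11")      # stored under item-11, newestIndex becomes 11
first = queue.dequeue()      # takes item-5, oldestIndex becomes 6
```

Each operation touches only two keys regardless of queue length, which is the advantage over storing the whole array under one fixed key.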