What is the difference between serializability and linearizability? - distributed-computing

I am very confused between these two consistency models. Please give some timeline examples along with explanation.
http://en.wikipedia.org/wiki/Consistency_model

It was hard to find information about this subject. However, at some point I found a statement that explained it clearly:
Linearizability gives isolation at the level of operations, while Serializability gives isolation at the level of transactions.
(summarized from the in-depth description found here)
As an example:
Here, A, B and C are three different transactions running at the same time. r(varname) means that the current transaction reads the value of varname, and w(varname) means that the current transaction writes a value to varname.
Now, to create a linearized history of these events, we have to make sure that no two operations happen at the same time. An operation that started after another operation had already started should appear after that operation in the history.
In this case:
Log1: A.r(X), B.r(X), B.r(Y), A.w(X), C.r(Y)
To create a serialized history of these events, one has to group the operations of each of the transactions A, B and C so that no operations from other transactions are interleaved with them.
From our example this could result in:
Log2: A.r(X), A.w(X), B.r(X), B.r(Y), C.r(Y)
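The difference between the two logs can be checked mechanically. The following is a minimal sketch, assuming log entries are strings of the form "A.r(X)"; the class and method names are made up for illustration. It only tests whether a log is serial (each transaction's operations are contiguous), which is the property Log2 has and Log1 lacks. It does not test whether an interleaved log is equivalent to some serial one.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ScheduleCheck {

    // Extracts the transaction name from an entry such as "A.r(X)" -> "A".
    static String transactionOf(String operation) {
        return operation.substring(0, operation.indexOf('.'));
    }

    // A log is serial when each transaction's operations form one contiguous
    // run, i.e. no transaction reappears after another one has taken over.
    static boolean isSerial(List<String> log) {
        Set<String> finished = new HashSet<>();
        String current = null;
        for (String operation : log) {
            String tx = transactionOf(operation);
            if (!tx.equals(current)) {
                if (finished.contains(tx)) {
                    return false; // tx resumed after another transaction ran
                }
                if (current != null) {
                    finished.add(current);
                }
                current = tx;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> log1 = Arrays.asList("A.r(X)", "B.r(X)", "B.r(Y)", "A.w(X)", "C.r(Y)");
        List<String> log2 = Arrays.asList("A.r(X)", "A.w(X)", "B.r(X)", "B.r(Y)", "C.r(Y)");
        System.out.println("Log1 serial? " + isSerial(log1));
        System.out.println("Log2 serial? " + isSerial(log2));
    }
}
```

In Log1, transaction A is interrupted by B and then resumes, so the check rejects it; in Log2 every transaction runs to completion before the next one starts.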

Please have a look at this video: https://www.youtube.com/watch?v=noUNH3jDLC0&t
Martin Kleppmann is the author of Designing Data-Intensive Applications, which is a great book, and I'd highly recommend it to anyone interested in either serializability or linearizability.

Related

Composite unique constraint on business fields with Axon

We leverage AxonIQ Framework in our system. We've faced a problem implementing a composite unique constraint based on aggregate business fields.
Consider the following Aggregate:
@Aggregate
public class PersonnelCardAggregate {
    @AggregateIdentifier
    private UUID personnelCardId;
    private String personnelNumber;
    private Boolean archived;
}
We want to avoid personnelNumber duplicates in the scope of NOT-archived (archived == false) records. At the same time personnelNumber duplicates may exist in the scope of archived records.
A Query Side check does NOT seem to be an option. Given the eventually consistent nature of our system, more than one creation request with the same personnelNumber may exist at the same time, and the Query Side may be behind.
What would the solution be?
What you're asking is an issue that can occur as soon as you start implementing an application along the CQRS paradigm and DDD modeling techniques.
The PersonnelCardAggregate in your scenario maintains the consistency boundary of a single "Personnel Card". You are, however, looking to expand this scope to enforce a uniqueness constraint across all Personnel Cards in your system.
I feel that this blog explains the problem of "Set Based Consistency Validation" you are encountering quite nicely.
I will not reiterate his entire blog, but he sums it up as having four options for resolving the problem:
1) Introduce locking, transactions and database constraints for your Personnel Card
2) Use a hybrid locking field prior to issuing your command
3) Rely on the eventually consistent Query Model
4) Re-examine the domain model
To be fair, option 1 won't do if you're using the Event-Driven approach to updating your Command and Query Model.
Option 3 has been pushed back by you in the original question.
Option 4 is something I cannot deduce for you given that I am not a domain expert, but I am guessing that the PersonnelCardAggregate does not belong to a larger encapsulating Aggregate Root. Maybe the business constraint you've stated, and thus the option to reuse personnelNumbers, could be dropped or adjusted? Like I said though, I cannot state this as a factual answer for you, as I am not the domain expert.
That leaves option 2, which in my eyes would be the most pragmatic approach too.
I feel this would require a cache at your command dispatching side to deal with quick successions of commands, which resolves the eventual consistency issue. To capture the case where a duplicate still comes through accidentally, I'd introduce some form of Event Handler that (1) knows the entire set of "PersonnelCards" from a personnelNumber/archived point of view and (2) can react to a faulty introduction by dispatching a compensating action.
You'd thus introduce some business logic on the event handling side of your application, which I'd strongly recommend to segregate from the application part which updates your query models (as the use cases are entirely different).
Concluding though, this is a difficult topic with several ways around it.
It's not so much an Axon specific problem by the way, but more an occurrence of modeling your application through DDD and CQRS.
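The command-side cache from option 2 could look roughly like the sketch below. It is plain Java and uses no Axon API; the class PersonnelNumberReservations and its methods are hypothetical names, and the map lives in a single JVM, so a clustered deployment would need a shared store with the same atomic "claim if absent" behaviour.

```java
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper (not part of Axon): tracks the personnel numbers that
// are currently claimed by non-archived cards.
class PersonnelNumberReservations {

    private final ConcurrentHashMap<String, UUID> activeNumbers = new ConcurrentHashMap<>();

    // Atomically claims the number for the given card before the create
    // command is dispatched. Returns false when a different non-archived
    // card already holds the number.
    boolean tryReserve(String personnelNumber, UUID personnelCardId) {
        UUID holder = activeNumbers.putIfAbsent(personnelNumber, personnelCardId);
        return holder == null || holder.equals(personnelCardId);
    }

    // Called when the card is archived or its creation fails, so that the
    // number becomes available again.
    void release(String personnelNumber, UUID personnelCardId) {
        activeNumbers.remove(personnelNumber, personnelCardId);
    }

    public static void main(String[] args) {
        PersonnelNumberReservations reservations = new PersonnelNumberReservations();
        UUID first = UUID.randomUUID();
        UUID second = UUID.randomUUID();
        System.out.println("first claim accepted:  " + reservations.tryReserve("PN-1", first));
        System.out.println("second claim accepted: " + reservations.tryReserve("PN-1", second));
        reservations.release("PN-1", first); // the first card gets archived
        System.out.println("claim after archiving: " + reservations.tryReserve("PN-1", second));
    }
}
```

A command gateway wrapper would call tryReserve before sending the create command and reject the request when it returns false; the compensating Event Handler described above remains the safety net for anything that slips past.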

Is there a hard definition of how CQRS should be applied and CQRS questions

I have some trouble understanding what the CQRS pattern really is at its core, meaning what are the red lines that, when crossed, mean we are no longer implementing the CQRS pattern.
I clearly understand the CQS pattern.
Suppose that we implement microservices with CQRS, without event sourcing.
1) The first question is: does the CQRS pattern only apply to client I/O? Meaning, if I get this right, that for example the client updates through controllers that write to database A, but reads through controllers that query database B (B is eventually updated and may aggregate information from multiple models using events sent by the controller of A).
Or, this is not about the client, but anything, for example another microservice reading / writing? And if the latter, what is really the borderline that defines who is the reader / writer that causes the segregation?
Does this maybe have to do with the domains in DDD?
This is an important question in my mind, because without it, CQRS is just an interdependence, of model B being updated by model A, but not the reverse. And why wouldn't this be propagated from a model B to a model C for example?
I have also read articles stating that some people implement CQRS by having one Command Service and one Query Service, which even more complicates this.
2) Similar to the first question, why do some references speak of events as if they are the Commands of CQRS? This complicates CQRS in my mind, because, technically, with one request event we can ask a service "Hey please give me the information X" and the service can respond with an event that contains the payload, effectively doing a query. Is this a hard rule, or just an example to state that, we can update using events and we can query using REST?
3) What if, in most cases I write to model A, and I read from model B, but in some cases I read directly from model A?
Am I breaking CQRS?
What if my querying needs are very simple, should I duplicate model A in this situation?
4) What if, as stated in question 1), I read from model A to emit events, to produce model B, but then I like to read some information from model B because it's valuable because it is aggregated, in order to produce model C?
Am I breaking CQRS?
What is the controller that populates model B doing in that case, e.g. if it also emits events to populate model C? Is it simply a Command anyway because it is not the part that queries, so we still apply CQRS?
Additionally, what if, we read from model A to emit events, to produce model B, but while we produce model B, we emit events, to send client notifications. Is that still CQRS?
5) When is CQRS broken?
If I have a controller that reads from model B, but also emits a message that updates model A, is that it?
Finally, if that controller, e.g. a REST controller, reads from model B and simultaneously emits a message to update model A, but without containing any information from what was read from model B, (so the operation is two in one, but it does not use information from B to update A), is that still CQRS?
And, what if a REST controller, that updates model A, also returns some information to the client, that has been read from A, does that break CQRS? What if this is just an id? And what if the id is not read from A, but it is just a reference number that is randomly generated? Is there a problem in that case because the REST controller updates, but also returns something to the user?
I will really appreciate your patience for replying as it can be seen that I'm still quite confused and that I'm in the process of learning!
Is there a hard definition of how CQRS should be applied and CQRS questions
Yes, start with Greg Young.
CQRS is simply the creation of two objects where there was previously only one. The separation occurs based upon whether the methods are a command or a query (the same definition that is used by Meyer in Command and Query Separation, a command is any method that mutates state and a query is any method that returns a value). -- Greg Young 2010
It's "just a pattern", born of the fact that data representations which are effective for queries are frequently not the representations that are effective for tracking change. For example, using an RDBMS for storing business data may be our best choice for maintaining data integrity, but for certain kinds of queries we might want to use a replica of that data in a graph database.
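Taken literally, that definition is a small refactoring. The sketch below is only an illustration with invented names (Inventory, InventoryCommands, InventoryQueries): one object that used to expose both kinds of methods is split into a command object whose methods mutate state and return nothing, and a query object whose methods return values and mutate nothing. Both still share one store here; separate read and write stores are a further step, not a requirement.

```java
import java.util.HashMap;
import java.util.Map;

class CqrsSplit {

    // Shared storage, standing in for whatever the application persists to.
    static class Inventory {
        final Map<String, Integer> stock = new HashMap<>();
    }

    // Command object: every method mutates state and returns nothing.
    static class InventoryCommands {
        private final Inventory inventory;

        InventoryCommands(Inventory inventory) {
            this.inventory = inventory;
        }

        void addStock(String sku, int quantity) {
            inventory.stock.merge(sku, quantity, Integer::sum);
        }
    }

    // Query object: every method returns a value and mutates nothing.
    static class InventoryQueries {
        private final Inventory inventory;

        InventoryQueries(Inventory inventory) {
            this.inventory = inventory;
        }

        int stockOf(String sku) {
            return inventory.stock.getOrDefault(sku, 0);
        }
    }

    public static void main(String[] args) {
        Inventory inventory = new Inventory();
        InventoryCommands commands = new InventoryCommands(inventory);
        InventoryQueries queries = new InventoryQueries(inventory);
        commands.addStock("sku-1", 5);
        commands.addStock("sku-1", 3);
        System.out.println("sku-1 in stock: " + queries.stockOf("sku-1"));
    }
}
```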
why do some references speak of events as if they are the Commands of CQRS
HandleEvent is a command. CommandReceived is an event. It's very easy for readers (and authors!) to confuse the contexts that are being described. They are all "just" messages, the semantics of one or the other really depend on the direction the message is traveling relative to the authority for the information in the message.
For example, if I fill out a form on an e-commerce site and submit; is the corresponding message an OrderSubmitted event? or is it a PlaceOrder command? Either spelling could be the correct one, depending on how you choose to model the ordering process.
What if, in most cases I write to model A, and I read from model B, but in some cases I read directly from model A? Am I breaking CQRS?
The CQRS police are not going to come after you if you read from write models. In many architectures, the business logic is executed in a stateless component, and will depend on reading the "current" state from some storage appliance -- in other words, to support write often requires a read.
Pessimizing a write model to support read use cases is the thing we are trying to avoid.
Also: horses for courses. It's entirely appropriate to restrict the use of CQRS to those cases where you can profit from it. When GET/PUT semantics of a single model work, you should prefer them to separate models for reads and writes.

How to handle the two signals depending on each other?

I read Deprecating the Observer Pattern with Scala.React and found reactive programming very interesting.
But there is a point I can't figure out: the author describes the signals as the nodes in a DAG (directed acyclic graph). Then what if you have two signals (or event sources, or models, whatever) depending on each other, i.e. a 'two-way binding', like a model and a view in web front-end programming?
Sometimes it's just inevitable because the user can change view, and the back-end(asynchronous request, for example) can change model, and you hope the other side to reflect the change immediately.
The loop dependencies in a reactive programming language can be handled with a variety of semantics. The one that appears to have been chosen in scala.React is that of synchronous reactive languages and specifically that of Esterel. You can have a good explanation of this semantics and its alternatives in the paper "The synchronous languages 12 years later" by Benveniste, A. ; Caspi, P. ; Edwards, S.A. ; Halbwachs, N. ; Le Guernic, P. ; de Simone, R. and available at http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=1173191&tag=1 or http://virtualhost.cs.columbia.edu/~sedwards/papers/benveniste2003synchronous.pdf.
Replying to @Matt Carkci here, because a comment wouldn't suffice
In the paper section 7.1 Change Propagation you have
Our change propagation implementation uses a push-based approach based on a topologically ordered dependency graph. When a propagation turn starts, the propagator puts all nodes that have been invalidated since the last turn into a priority queue which is sorted according to the topological order, briefly level, of the nodes. The propagator dequeues the node on the lowest level and validates it, potentially changing its state and putting its dependent nodes, which are on greater levels, on the queue. The propagator repeats this step until the queue is empty, always keeping track of the current level, which becomes important for level mismatches below. For correctly ordered graphs, this process monotonically proceeds to greater levels, thus ensuring data consistency, i.e., the absence of glitches.
and later at section 7.6 Level Mismatch
We therefore need to prepare for an opaque node n to access another node that is on a higher topological level. Every node that is read from during n’s evaluation, first checks whether the current propagation level which is maintained by the propagator is greater than the node’s level. If it is, it proceed as usual, otherwise it throws a level mismatch exception containing a reference to itself, which is caught only in the main propagation loop. The propagator then hoists n by first changing its level to a level above the node which threw the exception, reinserting n into the propagation queue (since it’s level has changed) for later evaluation in the same turn and then transitively hoisting all of n’s dependents.
While there's no mention about any topological constraint (cyclic vs acyclic), something is not clear. (at least to me)
First arises the question of how is the topological order defined.
And then the implementation suggests that mutually dependent nodes would loop forever in the evaluation through the exception mechanism explained above.
What do you think?
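To make the quoted section 7.1 concrete, here is a minimal sketch of level-ordered push propagation. It is not Scala.React's code: the Node class, the hand-assigned levels and the propagate method are assumptions for illustration, and the hoisting from section 7.6 is left out. Note that the levels have to be assigned before propagation starts, and for two mutually dependent nodes no assignment exists in which each sits on a greater level than the other.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.IntSupplier;

class PropagationSketch {

    static class Node {
        final String name;
        final int level;            // topological level, assigned by hand here
        final IntSupplier compute;  // recomputes this node from its inputs
        final List<Node> dependents = new ArrayList<>();
        int value;

        Node(String name, int level, IntSupplier compute) {
            this.name = name;
            this.level = level;
            this.compute = compute;
        }
    }

    // One propagation turn: validates nodes in ascending level order and
    // returns the names of the nodes in the order they were validated.
    static List<String> propagate(List<Node> invalidated) {
        PriorityQueue<Node> queue =
                new PriorityQueue<>(Comparator.comparingInt((Node n) -> n.level));
        queue.addAll(invalidated);
        List<String> order = new ArrayList<>();
        while (!queue.isEmpty()) {
            Node node = queue.poll();
            order.add(node.name);
            int newValue = node.compute.getAsInt();
            if (newValue != node.value) {
                node.value = newValue;
                for (Node dependent : node.dependents) {
                    if (!queue.contains(dependent)) {
                        queue.add(dependent); // dependents sit on greater levels
                    }
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        int[] input = {1};
        Node a = new Node("a", 0, () -> input[0]);
        Node b = new Node("b", 1, () -> a.value + 1);
        Node c = new Node("c", 1, () -> a.value * 2);
        Node d = new Node("d", 2, () -> b.value + c.value);
        a.dependents.addAll(Arrays.asList(b, c));
        b.dependents.add(d);
        c.dependents.add(d);
        System.out.println("validation order: " + propagate(Arrays.asList(a)));
        System.out.println("d = " + d.value);
    }
}
```

In this diamond-shaped graph, d is validated once, after both b and c, which is the glitch freedom the quoted paragraph describes.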
After scanning the paper, I can't find where they mention that it must be acyclic. There's nothing stopping you from creating cyclic graphs in dataflow/reactive programming. Acyclic graphs only allow you to create Pipeline Dataflow (e.g. Unix command line pipes).
Feedback and cycles are a very powerful mechanism in dataflow. Without them you are restricted to the types of programs you can create. Take a look at Flow-Based Programming - Loop-Type Networks.
Edit after second post by pagoda_5b
One statement in the paper made me take notice...
For correctly ordered graphs, this process
monotonically proceeds to greater levels, thus ensuring data
consistency, i.e., the absence of glitches.
To me that says that loops are not allowed within the Scala.React framework. A cycle between two nodes would seem to cause the system to continually try to raise the level of both nodes forever.
But that doesn't mean that you have to encode the loops within their framework. It could be possible to have one path from the item you want to observe and then another, separate, path back to the GUI.
To me, it always seems that too much emphasis is placed on a programming system completing and giving one answer. Loops make it difficult to determine when to terminate. Libraries that use the term "reactive" tend to subscribe to this thought process. But that is just a result of the Von Neumann architecture of computers... a focus of solving an equation and returning the answer. Libraries that shy away from loops seem to be worried about program termination.
Dataflow doesn't require a program to have one right answer or ever terminate. The answer is the answer at this moment of time due to the inputs at this moment. Feedback and loops are expected if not required. A dataflow system is basically just a big loop that constantly passes data between nodes. To terminate it, you just stop it.
Dataflow doesn't have to be so complicated. It is just a very different way to think about programming. I suggest you look at J. Paul Morrison's book "Flow-Based Programming" for a field-tested version of dataflow or my book (once it's done).
Check your MVC knowledge. The view doesn't update the model, so it won't send signals to it. The controller updates the model. For a C/F converter, you would have two controllers (one for the F control, one for the C control). Both controllers would send signals to a single model (which stores the only real temperature, Kelvin, in a lossless format). The model sends signals to two separate views (one for C view, one for F view). No cycles.
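The C/F converter described above can be sketched as follows. All class names are invented for illustration and the "signals" are plain method calls: two controllers write Kelvin into one model, and the model pushes to two views that never write back, so the data only flows in one direction.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.DoubleConsumer;

class TemperatureMvc {

    // Model: the single source of truth, stored in Kelvin.
    static class TemperatureModel {
        private double kelvin;
        private final List<DoubleConsumer> views = new ArrayList<>();

        void subscribe(DoubleConsumer view) {
            views.add(view);
        }

        void setKelvin(double kelvin) {
            this.kelvin = kelvin;
            for (DoubleConsumer view : views) {
                view.accept(this.kelvin); // push the new value to every view
            }
        }
    }

    // Views: only receive values from the model and never write back.
    static class CelsiusView {
        double shown;

        void render(double kelvin) {
            shown = kelvin - 273.15;
        }
    }

    static class FahrenheitView {
        double shown;

        void render(double kelvin) {
            shown = (kelvin - 273.15) * 9.0 / 5.0 + 32.0;
        }
    }

    // Controllers: translate user input into Kelvin and update the model.
    static class CelsiusController {
        private final TemperatureModel model;

        CelsiusController(TemperatureModel model) {
            this.model = model;
        }

        void userEntered(double celsius) {
            model.setKelvin(celsius + 273.15);
        }
    }

    static class FahrenheitController {
        private final TemperatureModel model;

        FahrenheitController(TemperatureModel model) {
            this.model = model;
        }

        void userEntered(double fahrenheit) {
            model.setKelvin((fahrenheit - 32.0) * 5.0 / 9.0 + 273.15);
        }
    }

    public static void main(String[] args) {
        TemperatureModel model = new TemperatureModel();
        CelsiusView celsiusView = new CelsiusView();
        FahrenheitView fahrenheitView = new FahrenheitView();
        model.subscribe(celsiusView::render);
        model.subscribe(fahrenheitView::render);

        new FahrenheitController(model).userEntered(212.0);
        System.out.println("C view: " + celsiusView.shown);
        System.out.println("F view: " + fahrenheitView.shown);
    }
}
```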
Based on the answer from @pagoda_5b, I'd say that you are likely allowed to have cycles (7.6 should handle it, at the cost of performance) but you must guarantee that there is no infinite regress. For example, you could have the controllers also receive signals from the model, as long as you guaranteed that receipt of said signal never caused a signal to be sent back to the model.
I think the above is a good description, but it uses the word "signal" in a non-FRP style. "Signals" in the above are really messages. If the description in 7.1 is correct and complete, loops in the signal graph would always cause infinite regress as processing the dependents of a node would cause the node to be processed and vice-versa, ad inf.
As @Matt Carkci said, there are FRP frameworks that allow loops, at least to a limited extent. They will either not be push-based, use non-strictness in interesting ways, enforce monotonicity, or introduce "artificial" delays so that when the signal graph is expanded on the temporal dimension (turning it into a value graph) the cycles disappear.

CQRS sagas - did I understand them right?

I'm trying to understand sagas, and meanwhile I have a specific way of thinking of them - but I am not sure whether I got the idea right. Hence I'd like to elaborate and have others tell me whether it's right or wrong.
In my understanding, sagas are a solution to the question of how to model long-running processes. Long-running means: Involving multiple commands, multiple events and possibly multiple aggregates. The process is not modeled inside one of the participating aggregates to avoid dependencies between them.
Basically, a saga is nothing more than a command / event handler that reacts to internal and external commands / events. It does not contain its own logic, it's just a (finite) state machine, and therefore provides tasks such as When event X happens, send command Y.
Sagas are persisted to the event store as well as aggregates, are correlated to a specific aggregate instance, and hence are reloaded when this specific aggregate (or set of aggregates) is used.
Is this right?
There are different ways of implementing Sagas, ranging from stateless event handlers that publish commands all the way to carrying all the state and basically being the domain's aggregates themselves. Udi Dahan once wrote an article about Sagas being the only Aggregates in a (in his specific case) correctly modeled system. I'll look it up and update this answer.
There's also the concept of document-based sagas.
Your definition of Sagas sounds right to me, and I would define them the same way.
The only change I would make to your description is that a saga is only an event handler (not a command handler): it reacts to event(s) and, based on the received event and its internal state, constructs a command and sends it to the CommandBus for execution.
Normally a Saga has only a single event that starts it (StartByEvent), multiple events that transition it to the next state (TransitionByEvent), and multiple events that can end it (EndByEvent).
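As a rough illustration of that description, the sketch below models a saga as a small state machine that only handles events and only emits commands. It does not use any CQRS framework: the event and command names (OrderPlaced, RequestPayment and so on) are invented, and a plain list stands in for the command bus.

```java
import java.util.ArrayList;
import java.util.List;

class OrderSagaSketch {

    enum State { NOT_STARTED, AWAITING_PAYMENT, AWAITING_SHIPMENT, ENDED }

    private State state = State.NOT_STARTED;
    private final List<String> commandBus; // stand-in for a real command bus

    OrderSagaSketch(List<String> commandBus) {
        this.commandBus = commandBus;
    }

    State state() {
        return state;
    }

    // Reacts to an event based on the current state; events that do not fit
    // the current state are ignored.
    void on(String event) {
        switch (state) {
            case NOT_STARTED:
                if (event.equals("OrderPlaced")) {         // starting event
                    commandBus.add("RequestPayment");
                    state = State.AWAITING_PAYMENT;
                }
                break;
            case AWAITING_PAYMENT:
                if (event.equals("PaymentReceived")) {     // transition event
                    commandBus.add("ShipOrder");
                    state = State.AWAITING_SHIPMENT;
                }
                break;
            case AWAITING_SHIPMENT:
                if (event.equals("OrderShipped")) {        // ending event
                    state = State.ENDED;
                }
                break;
            default:
                break; // ENDED: nothing left to do
        }
    }

    public static void main(String[] args) {
        List<String> bus = new ArrayList<>();
        OrderSagaSketch saga = new OrderSagaSketch(bus);
        saga.on("OrderPlaced");
        saga.on("PaymentReceived");
        saga.on("OrderShipped");
        System.out.println("commands sent: " + bus);
        System.out.println("final state:   " + saga.state());
    }
}
```

The saga holds no business rules of its own: it only maps "event X in state S" to "send command Y and move to state T".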
On MSDN, Sagas are referred to as process managers:
The term saga is commonly used in discussions of CQRS to refer to a
piece of code that coordinates and routes messages between bounded
contexts and aggregates. However, for the purposes of this guidance we
prefer to use the term process manager to refer to this type of code
artifact. There are two reasons for this: There is a well-known,
pre-existing definition of the term saga that has a different meaning
from the one generally understood in relation to CQRS. The term
process manager is a better description of the role performed by this
type of code artifact. Although the term saga is often used in the
context of the CQRS pattern, it has a pre-existing definition. We have
chosen to use the term process manager in this guidance to avoid
confusion with this pre-existing definition. The term saga, in
relation to distributed systems, was originally defined in the paper
"Sagas" by Hector Garcia-Molina and Kenneth Salem. This paper proposes
a mechanism that it calls a saga as an alternative to using a
distributed transaction for managing a long-running business process.
The paper recognizes that business processes are often comprised of
multiple steps, each of which involves a transaction, and that overall
consistency can be achieved by grouping these individual transactions
into a distributed transaction. However, in long-running business
processes, using distributed transactions can impact on the
performance and concurrency of the system because of the locks that
must be held for the duration of the distributed transaction.
reference: http://msdn.microsoft.com/en-us/library/jj591569.aspx

"Life Beyond Transactions" Entity-Message-Activity Model in Practice?

Over vacation I read Pat Helland's "Life Beyond Transactions" (yes, vacation was that good :). To sum it up briefly, it advocates limiting the scope of transactions to a single entity and then using groups of "activities" that have the ability to update the entity or cancel a task anytime a change takes place that would make that task invalid.
(E.g. Shipping Order A requires some amount of Item 1. The Shipping Orders and Items are stored as entities and have their own activities. Shipping Order B ships with the last of Item 1 before A finishes. The activity for Item 1 cancels Shipping Order A.)
I had thought I was printing out the Dynamo paper, so forgive me if I conflate the two here. I've seen quite a few "NoSQL" projects influenced by Dynamo and BigTable, particularly in how they address entities by keys and partition data. I was wondering if this Entity-Message-Activity model has influenced any of them?
Or, to put it in more concrete terms, if I have an operation in HBase, Cassandra, Riak, etc. that spans multiple entities, do I need to implement an Activity all by myself (as more of a design pattern in the application), or is there some kind of existing framework? Or do they do something else completely that renders this entire question moot?
Thanks!
I can add my 2 cents here just from a Cassandra point of view (I haven't used the other NoSQL engines available). Cassandra is primarily designed to be a fast read-write structure. Twitter is a great use case for Cassandra (check the Twitter clone Twissandra for this).
Assuming I have understood your question correctly: yes you will have to implement the activity yourself. To understand the modeling of Column/SuperColumnFamilies I would suggest reading this great article WTF is a SuperColumn?
Cheers!