Can you publish to a queue and write to the db within a Slick transaction, and still guarantee atomicity? - scala

I have a Slick server for which I'd like to use the transactionally keyword to perform a double-write of data to my database and a RabbitMQ message queue. My code looks something like this:
val a = (for {
  _ <- coffees.map(c => (c.name, c.supID, c.price)) += (("Colombian_Decaf", 101, 8.99))
  // RabbitMQ's basicPublish takes (exchange, routingKey, props, body)
  _ = {
    ch.txSelect()
    ch.basicPublish("", QUEUE_NAME, MessageProperties.PERSISTENT_BASIC, "SELL Colombian_Decaf".getBytes())
    ch.txCommit()
  }
} yield ()).transactionally
My question is: Is it possible for the queue publish action to commit successfully, but have the DB insert fail? In this case, my system would be in an inconsistent state, as I would only want the message to be published to the queue if the value was successfully inserted into the database, and vice versa.
Thanks!

Unfortunately for you, the answer is that you can't easily guarantee consistency for such a system. What you want are distributed transactions, and they are fundamentally hard. To see why, try the following thought experiment: what happens if your computer blows up (or, less radically, loses power) at the most unfortunate moment? For this code, one such bad moment is right after the line ch.txCommit() has fully executed but before the outer DB transaction has committed. Fundamentally there is nothing you can do about this scenario unless the two concurrent transactions are somehow aware of each other, and I don't know of any distributed transaction coordinator that covers both traditional SQL DBs and RabbitMQ. So your choices are:
Give up and do nothing (and develop a procedure to recover from disastrous events afterwards, in manual mode).
Implement a distributed transaction algorithm such as two-phase commit yourself. This most probably requires some re-design and a complicated implementation.
Re-design your system to use some form of eventual consistency. This probably requires a bigger re-design, but it still might be easier to implement than #2; one common shape of this is the "transactional outbox" pattern, sketched below.
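A minimal sketch of the outbox idea, adapted to the Slick code above (the outbox table and its mapping are assumptions for illustration, not part of the question):

val insertAndRecord = (for {
  _ <- coffees.map(c => (c.name, c.supID, c.price)) += (("Colombian_Decaf", 101, 8.99))
  // The outgoing message is written in the SAME DB transaction as the insert,
  // so either both are committed or neither is.
  _ <- outbox.map(o => (o.payload, o.published)) += (("SELL Colombian_Decaf", false))
} yield ()).transactionally

// A separate relay process polls the outbox, publishes each unpublished row to
// RabbitMQ, and only afterwards marks it as published. A crash between publish
// and mark causes a re-send, so consumers must tolerate duplicates
// (at-least-once delivery), but nothing is ever published without being stored.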

Related

Designing event-based architecture for the customer service

Being a developer with solid experience, I am only now entering the world of microservices and event-driven architecture. Things like loose coupling, independent scalability and proper implementation of asynchronous business processes are something that I feel should get simpler compared with the traditional monolith approach. So I'm giving it a try, making a simple PoC for myself.
I am considering a simple application where a user can register, log in and change their customer details. However, I want to react to certain events asynchronously:
customer logs in - we send them an email if the IP address used is new to the system.
customer changes their name - we send them an email notifying them of the change.
The idea is to make a separate application that reacts to "CustomerLoggedIn" and "CustomerChangeName" events.
Here I can think of three approaches to implementing this simple functionality, each with some drawbacks. So, when a customer submits their name change:
Store the change, then emit the event from the application. The changed name is stored in the DB, and an event is sent to Kafka when the DB transaction completes. One of the big problems here is that if a customer had two tabs open and almost simultaneously submitted a change from the initial name "Bob" to "Alice" in one tab and from "Bob" to "Jim" in the other, at the database level one of the updates overwrites the other, which is OK; however, we cannot guarantee that the order of the events matches. We can add checks ensuring that the DB update only happens when "the last version" has been seen, preventing the second update entirely, so that only one event is emitted. But in the general case this pattern will not let us preserve the same order of events in the DB as in Kafka, unless we do the DB change and the Kafka event sending in one distributed transaction, which is an anti-pattern AFAIK.
Change the name in the DB, and use Debezium or similar DB CDC to capture the event and stream it. Here we get a single event source, so the ordering problem is solved; however, what bothers me is that I lose the ability to enrich the events with business information. Another related drawback is that CDC will stream all updates to the "customer" table regardless of the business meaning of the event. So, in this case, I will probably need to build a Kafka Streams application to convert the DB CDC events into business events and decouple the DB structure from the event structure. The potential benefit of this approach is that I will be able to capture "direct" DB changes in the same manner as those originating in the application.
Emit the event from the application, without storing it in the DB. One of the subscribers might do the DB persistence, another will do the email sending, etc. The biggest problem I see here is: what do I return to the client? I cannot say "OK, your name is changed"; it's more like "OK, your request has been recorded and will be processed". If the customer quickly hits refresh, they expect to see their new name, and we don't want to have to explain eventual consistency to customers, do we? Also, the order in which the "email sender" and "DB updater" process the same event is not guaranteed, so I could send an email before the change is persisted.
I am looking for advice regarding any of these three approaches (and maybe others I am missing), and perhaps the use cases where one is preferable over the others.
It sounds to me like you want event sourcing. In event sourcing, all you need to store is the event: the current state of a customer is derived from replaying the events (either from the beginning of time, or since a snapshot; the snapshot is just an optional optimization). Some other process (there are a few ways to go about this) can then project the events to Kafka for consumption by interested parties. Since every event has a sequence number, you can use the sequence number to prevent concurrent modification (alternatively, the more actor-model-flavored event-sourcing implementations can use techniques like cluster sharding in Akka to achieve the same end).
Doing this, you can have a "write-side" which processes the updates in a strongly consistent manner and can respond to queries which only involve a single customer having seen every update to that point (the consistency boundary basically makes customer in this case an aggregate in domain-driven-design terms). "Read-sides" consuming events are eventually consistent: the latencies are typically fairly short: in this case your services sending emails are read-sides (as would be a hypothetical panel showing names of all customers), but the customer's view of their own data could be served by the write-side.
(The separation into read-sides and a write-side (the pluralization is significant) is Command Query Responsibility Segregation, which sometimes gets interpreted as "reads can only be served by a read-side". This is not totally accurate: for one thing, a write-side's model needs to be read in order for the write-side to perform its task of validating commands and synchronizing updates, so nearly any CQRS-using project violates that interpretation. CQRS should instead be interpreted as "serve reads from the model that makes the most sense, and avoid overcomplicating any model (including the write-side's model) to support a new read".)
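To make the write-side concrete, here is a minimal, framework-free sketch of an event-sourced customer aggregate (all names are illustrative, not taken from Akka or any particular library):

sealed trait CustomerEvent
case class CustomerRegistered(name: String) extends CustomerEvent
case class CustomerNameChanged(newName: String) extends CustomerEvent

case class Customer(name: String, seqNr: Long) {
  // State transitions are pure: current state is just a fold over the events.
  def applyEvent(event: CustomerEvent): Customer = event match {
    case CustomerRegistered(n)  => copy(name = n, seqNr = seqNr + 1)
    case CustomerNameChanged(n) => copy(name = n, seqNr = seqNr + 1)
  }
}

object Customer {
  val empty = Customer("", 0L)
  def replay(events: Seq[CustomerEvent]): Customer =
    events.foldLeft(empty)(_.applyEvent(_))
}

// The write-side replays (or caches) the aggregate, validates a command against
// it, and appends the new event only if the expected sequence number matches.
// Two tabs submitting "Alice" and "Jim" against the same seqNr: one append
// wins, the other is rejected and can be retried against the fresh state.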
I think I'm qualified to answer this, having used Debezium extensively to simplify the architecture.
I would prefer option 2:
Every transaction always results in an event emitted in the correct order.
Options 1 and 3 have a corner case: what if the transaction succeeds, but the application fails to emit the event?
To your point:
Another related drawback is that CDC will stream all updates to the "customer" table regardless of the business meaning of the event. So, in this case, I will probably need to build a Kafka Streams application to convert the DB CDC events into business events and decouple the DB structure from the event structure.
I really don't think that is a roadblock. A benefit you get is that other use cases may crop up where another consumer of this topic wants to read other columns of the table.
Options 1 and 3 only tie this into your core application logic, and that does you no favors from a simplification point of view. With option 2, a developer can work on the events independently, with zero code changes to the core application APIs and no need to understand the core logic.
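As a rough illustration of the Kafka Streams conversion step the question mentions, here is a sketch assuming the kafka-streams-scala wrapper and hypothetical topic names (a real Debezium envelope carries structured "before"/"after" row images that you would parse properly rather than string-match):

import java.util.Properties
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.serialization.Serdes._

val builder = new StreamsBuilder()
builder
  .stream[String, String]("dbserver.public.customer") // raw CDC topic
  // keep only changes that are business-meaningful (placeholder check)
  .filter((_, envelope) => envelope.contains("\"name\""))
  // re-shape the raw table row into a business-level event
  .mapValues(envelope => s"""{"type":"CustomerNameChanged","payload":$envelope}""")
  .to("customer-name-changed")

val props = new Properties()
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "customer-events")
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
new KafkaStreams(builder.build(), props).start()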

Scala objects and thread safety

I am new to Scala.
I am trying to figure out how to ensure thread safety with functions in a Scala object (aka singleton)
From what I have read so far, it seems that I should keep visibility to function scope (or below) and use immutable variables wherever possible. However, I have not seen examples of where thread safety is violated, so I am not sure what other precautions should be taken.
Can someone point me to a good discussion of this issue, preferably with examples of where thread safety is violated?
Oh man. This is a huge topic. Here's a Scala-based intro to concurrency, and Oracle's Java lessons actually have a pretty good intro as well. What follows is a brief overview that motivates why concurrent reading and writing of shared state (of which Scala objects are a particular case) is a problem, and a quick survey of common solutions.
There are two (fundamentally related) classes of problems when it comes to thread safety and state mutation:
Clobbering (missing) writes
Inaccurate (changing out from under you) reads
Let's look at each of these in turn.
First, clobbering writes:
object WritesExample {
  var myList: List[Int] = List.empty
}
Imagine we had two threads concurrently accessing WritesExample, each of which executes the following updateList:
def updateList(x: WritesExample.type): Unit =
  WritesExample.myList = 1 :: WritesExample.myList
You'd probably hope when both threads are done that WritesExample.myList has a length of 2. Unfortunately, that might not be the case if both threads read WritesExample.myList before the other thread has finished a write. If when both threads read WritesExample.myList it is empty, then both will write back a list of length 1, with one write overwriting the other, so that in the end WritesExample.myList only has a length of one. Hence we've effectively lost a write we were supposed to execute. Not good.
Now let's look at inaccurate reads.
object ReadsExample {
  val myMutableList: collection.mutable.MutableList[Int] = collection.mutable.MutableList.empty
}
Once again, let's say we had two threads concurrently accessing ReadsExample. This time each of them executes updateList2 repeatedly.
def updateList2(x: ReadsExample.type): Unit =
  ReadsExample.myMutableList += ReadsExample.myMutableList.length
In a single-threaded context, you would expect updateList2, when repeatedly called, to simply generate an ordered list of incrementing numbers, e.g. 0, 1, 2, 3, 4,.... Unfortunately, when multiple threads are accessing ReadsExample.myMutableList with updateList2 at the same time, it's possible that between when ReadsExample.myMutableList.length is read and when the write is finally persisted, ReadsExample.myMutableList has already been modified by another thread. So in theory you could see something like 0, 0, 1, 1 or potentially if one thread takes longer to write than another 0, 1, 2, 1 (where the slower thread finally writes to the list after the other thread has already accessed and written to the list three times).
What happened is that the read was inaccurate/out-of-date; the actual data structure that was updated was different from the one that was read, i.e. was changed out from under you in the middle of things. This is also a huge source of bugs because many invariants you might expect to hold (e.g. every number in the list corresponds exactly to its index or every number appears only once) hold in a single-threaded context, but fail in a concurrent context.
Now that we've motivated some of the problems, let's dive into some of the solutions. You mentioned immutability so let's talk about that first. You might notice that in my example of clobbering writes I use an immutable data structure whereas in my inconsistent reads example I use a mutable data structure. That is intentional. They are in a sense dual to one another.
With immutable data structures you cannot have an "inaccurate" read in the sense I laid out above because you never mutate data structures, but rather place a new copy of a data structure in the same location. The data structure cannot change out from under you because it cannot change! However you can lose a write in the process by placing a version of a data structure back to its original location that does not incorporate a change made previously by another process.
With mutable data structures on the other hand, you cannot lose a write because all writes are in-place mutations of the data structure, but you can end up executing a write to a data structure whose state differs from when you analyzed it to formulate the write.
If it's a "pick your poison" kind of scenario, why do you often hear advice to go with immutable data structures to help with concurrency? Well, immutable data structures make it easier to ensure that invariants about the state being modified hold, even if writes are lost. For example, if I rewrote the ReadsExample example to use an immutable List (and a var instead), as sketched below, then I could confidently say that the integer elements of the list will always correspond to the indices of the list. This means that your program is much less likely to enter an inconsistent state (e.g. it's not hard to imagine that a naive mutable set implementation could end up with non-unique elements when mutated concurrently). And it turns out that modern techniques for dealing with concurrency are usually pretty good at dealing with missing writes.
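For concreteness, that rewrite might look like the following sketch; writes can still be lost under contention, but every list any thread ever observes satisfies the invariant:

object ReadsExampleImmutable {
  var myList: List[Int] = List.empty
}

def updateList3(x: ReadsExampleImmutable.type): Unit =
  // Each write publishes a complete immutable list derived from one
  // consistent snapshot, so element i always equals i even if some
  // increments are clobbered.
  ReadsExampleImmutable.myList =
    ReadsExampleImmutable.myList :+ ReadsExampleImmutable.myList.length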
Let's look at some of those approaches that deal with shared state concurrency. At their hearts they can all be summed up as various ways of serializing read/write pairs.
Locks (a.k.a. directly try to serialize read/write pairs): This is usually the one you'll hear first as a fundamental way of dealing with concurrency. Every process that wants to access state first places a lock on it. Any other process is now excluded from accessing that state. The process then writes to that state and on completion releases the lock. Other processes are now free to repeat the procedure. In our WritesExample, updateList would acquire the lock before executing and release it afterwards; this would prevent other processes from reading WritesExample.myList until the write was completed, thereby preventing them from seeing old versions of myList that would lead to clobbering writes (note that there are more sophisticated locking procedures that allow simultaneous reads, but let's stick with the basics for now).
Locks often do not scale well to multiple pieces of state. With multiple locks, often you need to acquire and release locks in a certain order otherwise you can end up deadlocking or livelocking.
The Oracle and Twitter docs linked at the beginning have good overviews of this approach.
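For example, a minimal sketch of the locking approach using the JVM's intrinsic lock (any object's monitor, via synchronized):

def updateListLocked(x: WritesExample.type): Unit =
  // Only one thread at a time can hold WritesExample's monitor, so the
  // read-modify-write below can no longer interleave with another thread's.
  WritesExample.synchronized {
    WritesExample.myList = 1 :: WritesExample.myList
  }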
Describe Your Action, Don't Execute It (a.k.a. build up a serial representation of your actions and have someone else process it): Instead of accessing and modifying state directly, you describe an action of how to do this and then give it to someone else to actually execute the action. For example, you might pass messages to an object (e.g. actors in Scala) that queues up these requests and then executes them one-by-one on some internal state that it never directly exposes to anyone else. In the particular case of actors, this improves the situation over locks by removing the need to explicitly acquire and release locks. As long as you encapsulate all the state you need to access at once in a single object, message passing works great. Actors break down when you distribute state across multiple objects (and as such this is heavily discouraged in this paradigm).
Akka actors are one good example of this in Scala.
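A minimal sketch with classic Akka actors (names are illustrative):

import akka.actor.{Actor, ActorSystem, Props}

class ListHolder extends Actor {
  // This state is never exposed; it is only touched inside receive, which
  // Akka invokes for one message at a time.
  private var myList: List[Int] = List.empty

  def receive: Receive = {
    case "prepend" => myList = 1 :: myList
    case "length"  => sender() ! myList.length
  }
}

// Callers describe the action as a message instead of executing it themselves.
val system = ActorSystem("example")
val holder = system.actorOf(Props[ListHolder](), "list-holder")
holder ! "prepend"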
Transactions (a.k.a. temporarily isolate some reads and writes from others and let the isolation system serialize things for you): Wrap all your reads/writes in transactions that ensure that, during the course of your reads and writes, your view of the world is isolated from any other changes. There are usually two ways of achieving this. Either you go for an approach similar to locks, where you prevent other people from accessing the data while a transaction is running, or you restart a transaction from the very beginning whenever you detect that a change has occurred to the shared state, throwing away any progress you've made (usually the latter, for performance reasons). On the one hand, transactions, unlike locks and actors, scale to disparate pieces of state very well: just wrap all your accesses in transactions and you're good to go. On the other hand, your reads and writes have to be side-effect-free, because they might be thrown away and retried many times and you can't really undo most side effects.
And if you're really unlucky, although you usually can't truly deadlock with a good implementation of transactions, a long-lived transaction can constantly be interrupted by other short-lived transactions such that it keeps getting thrown away and retried and never actually succeeds (which amounts to something like livelocking). In effect you're giving up direct control of serialization order and hoping your transaction system orders things sensibly.
The ScalaSTM library is a good example of this approach.
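A minimal sketch with ScalaSTM:

import scala.concurrent.stm._

val myList = Ref(List.empty[Int])

def updateListStm(): Unit =
  // Reads and writes inside `atomic` are isolated; on conflict the whole block
  // is discarded and retried, which is why it must be side-effect-free.
  atomic { implicit txn =>
    myList() = 1 :: myList()
  }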
Remove Shared State: The final "solution" is to rethink the problem altogether and try to think about whether you truly need global, shared state that is writable. If you don't need writable shared state, then concurrency problems go away altogether!
Everything in life is about trade-offs and concurrency is no exception. When thinking about concurrency first understand what state you have and what invariants you want to preserve about that state. Then use that to guide your decision as to what kind of tools you want to use to tackle the problem.
The Thread Safety Problem section within this Scala concurrency article might be of interest to you. In essence, it illustrates the thread safety problem using a simple example and outlines 3 different approaches to tackle the problem, namely synchronization, volatile and AtomicReference:
When you enter synchronized points, access volatile references, or dereference AtomicReferences, Java forces the processor to flush their cache lines and provide a consistent view of data.
There is also a brief overview comparing the cost of the 3 approaches:
AtomicReference is the most costly of these two choices since you have to go through method dispatch to access values. volatile and synchronized are built on top of Java’s built-in monitors. Monitors cost very little if there’s no contention. Since synchronized allows you more fine-grained control over when you synchronize, there will be less contention so synchronized tends to be the cheapest option.
This is not specific to Scala: if your object contains state that can be modified concurrently, thread safety can be violated depending on the implementation. For example:
object BankAccount {
  private var balance: Long = 0L
  def deposit(amount: Long): Unit = balance += amount
}
In this case the object is not thread-safe. There are many approaches to making it thread-safe, for example using Akka or synchronized blocks. For simplicity I will write it using a synchronized block:
object BankAccount {
  private var balance: Long = 0L

  def deposit(amount: Long): Unit =
    this.synchronized {
      balance += amount
    }
}

Doctrine: avoid collision in update

I have a product table accessed by many applications, with several users in each one. I want to avoid collisions, and in a very small portion of code I have detected that collisions can occur.
$item = $em->getRepository('MyProjectProductBundle:Item')
    ->findOneBy(array('product' => $this, 'state' => 1));

if ($item) {
    $item->setState(3);
    $item->setDateSold(new \DateTime("now"));
    $item->setDateSent(new \DateTime("now"));

    $dateC = new \DateTime("now");
    $dateC->add(new \DateInterval('P1Y'));
    $item->setDateGuarantee($dateC);

    $em->persist($item);
    $em->flush();

    // ...after this, set up customer data, etc.
}
One option could be to do two persist() and flush() calls, the first one just after the state change, but before doing that I would like to know if there is an approach that offers a stronger guarantee.
I don't think a transaction is the solution, as there are actually many other actions involved in the process, so wrapping them all in a transaction would force many rollbacks and failed sales, making things worse.
The database is PostgreSQL.
Any other ideas?
My first thought would be to look at optimistic locking. The end result is that if someone changes the underlying data out from under you, Doctrine will throw an exception on flush. However, this might not be easy, since you say you have multiple applications operating on a central database. It's not clear whether you can control those applications or not, and you'll need to, because they'll all have to play along with the optimistic locking scheme and update the version column when they run updates.
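To illustrate the scheme itself (Doctrine implements this for you via a @Version column; the sketch below is plain JDBC in Scala, with a hypothetical item table, just to show what a version-checked update boils down to):

import java.sql.Connection

// The UPDATE only succeeds if the version column still holds the value we
// read earlier; zero affected rows means another writer got there first and
// the caller should re-read and retry (or abort the sale).
def sellItem(conn: Connection, itemId: Long, seenVersion: Int): Boolean = {
  val stmt = conn.prepareStatement(
    "UPDATE item SET state = 3, version = version + 1 WHERE id = ? AND version = ?")
  try {
    stmt.setLong(1, itemId)
    stmt.setInt(2, seenVersion)
    stmt.executeUpdate() == 1
  } finally stmt.close()
}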

J Oliver EventStore V2.0 questions

I am embarking upon an implementation of a project using CQRS and intend to use the J Oliver EventStore V2.0 as my persistence engine for events.
1) In the documentation, ExampleUsage.cs uses 3 serializers in "BuildSerializer". I presume this is just to show the flexibility of the deserialization process?
2) In the "Restart after failure" case where some events were not dispatched I believe I need startup code that invokes GetUndispatchedCommits() and then dispatch them, correct?
3) Again, in "ExampleUsage.cs" it would be useful if "TakeSnapshot" added a third event to the event store, and if "LoadFromSnapShotForward" not only retrieved the most recent snapshot but also the events after that snapshot, to simulate rebuilding an aggregate.
4) I'm failing to see the use of retaining older snapshots. Can you give a use case where they would be useful?
5) If I have a service that handles receipt of commands and generation of events, what is a suggested strategy for keeping track of the number of events since the last snapshot for a given aggregate? I certainly don't want to invoke "GetStreamsToSnapshot" too often.
6) In the SqlPersistence.SqlDialects namespace, the SQL statement name is "GetStreamsRequiringSnaphots" rather than "GetStreamsRequiringSnapShots".
1) There are a few "base" serializers, such as the Binary, JSON, and BSON serializers. The other two in the example, the GZip/compression and encryption serializers, are wrapping serializers and are only meant to modify what's already been serialized into a byte stream. In the example, I'm just showing flexibility. You don't have to encrypt if you don't want to. In fact, I've got stuff running in production that uses simple JSON, which makes debugging very easy because everything is text.
2) The SynchronousDispatcher and AsynchronousDispatcher implementations are both configured to query and find any undispatched commits. You shouldn't have to do anything special.
3) Greg Young talked about how he used to "inline" his snapshots with the main event stream, but there were a number of optimistic concurrency and race conditions in high-performance systems that came up. He therefore decided to move them "out of band". I have followed this decision for many of the same reasons.
In addition, snapshots are really only a performance consideration when you have extremely low SLAs. If you have a stream with a few thousand events on it and you don't have low SLAs, why not just take the minimal performance hit instead of adding additional complexity to your system? In other words, snapshots are an "ancillary" concept. They're in the EventStore API, but they're optional and should only be considered for certain use cases.
4) Let's suppose you had an aggregate with tens of millions of events and you wanted to run a "what if" scenario from before your most recent snapshot. It's a lot cheaper to go from another snapshot forward. The really nice thing about snapshots being a secondary concept is that if you wanted to drop older snapshots you could and it wouldn't affect your system at all.
5) There is a method in each implementation of IPersistStreams called GetStreamsRequiringSnapshots. You provide a threshold of 50, for example which finds all streams having 50 or more events since their last snapshot. This can (and probably should) be done asynchronously from your normal processing.
6) "Snapshots" is the correct casing for that word. Much like "website" used to be "Web site" but because of common usage it became "website".

Undoable sets of changes

I'm looking for a way to manage consistent sets of changes across several data sources, including, but not limited to, a database, some network control tools, and probably other SOAP-based services.
If one change fails for some reason (e.g. real-world app says "no", or a database insert fails), I want the whole set to be undone. So that's like transactions, just not limited to a DB.
I came up with a module that stacks up "change" objects which in turn have their init, commit, and rollback methods. When the set is DESTROYed, it rolls uncommitted changes back. This kinda works.
Still, I can't shake the feeling that I'm reinventing the wheel. Is there a standard CPAN module, or a well-described common method, for performing such a task? (At least GoF's "command" pattern and the RAII principle come to mind...)
There are a couple of approaches to executing a Distributed transaction (which is what you're describing):
The standard pattern is called "Two-phase commit protocol".
At the moment I'm not aware of any Perl module which implements Two-phase commit, which is kind of surprising and may likely be due to a lapse in my Googling. The only thing I found was Env::Transaction but I have no clue how stable/good/functional it is.
For certain cases, a solution involving rollback via "Compensating transactions" is possible.
This is basically a special case of general rollback where, when generating a task list A designed to change the target system state from S1 to S2, you at the same time generate a "compensating" task list A-neg designed to change the target system state from S2 back to S1. This is obviously only possible for certain systems, and moreover only a small subset of those are commutative (meaning that you can execute a transaction and its compensating transaction non-contiguously, e.g. the result of A + B + A-neg + B-neg is an invariant state).
Please notice that a compensating transaction does NOT always have to be engineered as a "transaction": one clever approach (again, only possible in certain subject domains) involves storing your data with a special "finalized" flag, then periodically harvesting and destroying data whose "finalized" flag is false and whose "originating transaction timestamp" is older than some threshold.
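For what it's worth, the questioner's stack of change objects is essentially GoF's command pattern with compensation; a minimal sketch of the idea (in Scala rather than Perl, purely for illustration):

trait Change {
  def apply(): Unit     // perform the change against the external system
  def rollback(): Unit  // the compensating action, restoring the prior state
}

class ChangeSet {
  private var applied: List[Change] = Nil // newest first

  // Apply changes in order; on the first failure, roll back everything
  // already applied, newest first, then re-throw.
  def run(changes: Seq[Change]): Unit =
    changes.foreach { c =>
      try {
        c.apply()
        applied = c :: applied
      } catch {
        case e: Throwable =>
          applied.foreach(_.rollback())
          applied = Nil
          throw e
      }
    }
}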