I'm building a Scala REST API to handle orders and I have been following https://github.com/alexandru/scala-best-practices to ensure I am following best practices as it's been a while since I wrote any Scala code.
So far, all of my code is thread-safe and non-blocking, using Futures where applicable and immutability throughout. Now I've integrated a data store that handles the creation of orders. My problem is: how do I deal with transactions (for example, two identical orders being created at the same time)?
Are there any patterns or examples that I can follow for handling transactions whilst preserving thread safety and avoiding blocking code? I come from a Java background and would typically use synchronized blocks, but that would not respect:
4.8. https://github.com/alexandru/scala-best-practices/blob/master/sections/4-concurrency-parallelism.md
Meet Amdahl's Law. Synchronizing with locks drastically limits the parallelization possible and thus vertical scalability. Reads are embarrassingly parallelizable, so avoid at all times doing this:
def fetch = synchronized { someValue }
Come up with better synchronization schemes that does not involve synchronizing reads, like atomic references or STM. If you aren't able to do that, then avoid this altogether by using proper abstractions.
I would appreciate guidance and/or examples of how this can be achieved.
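As a minimal sketch of the "atomic references" idea from the quoted rule, the duplicate-order case can be handled with a compare-and-set retry loop instead of a synchronized block. The Order and OrderStore names below are hypothetical, and the store is in-memory only; with an external data store the same check usually has to be enforced by the store itself (for example with a unique key), which this sketch does not cover.

```scala
import java.util.concurrent.atomic.AtomicReference

// Hypothetical domain type, used only for illustration.
final case class Order(id: String, item: String, quantity: Int)

// Hypothetical in-memory store: the whole state is one immutable Map
// held in an AtomicReference.
final class OrderStore {
  private val state = new AtomicReference(Map.empty[String, Order])

  // Reads never lock: they dereference the current immutable snapshot.
  def find(id: String): Option[Order] = state.get().get(id)

  // Returns true if this call created the order, false if it already existed.
  @annotation.tailrec
  final def create(order: Order): Boolean = {
    val current = state.get()
    if (current.contains(order.id)) false
    else if (state.compareAndSet(current, current + (order.id -> order))) true
    else create(order) // another thread changed the map first; retry on the fresh snapshot
  }
}
```

If several threads call create with the same order at once, only one of them sees true; the others either find the order already present or lose the compare-and-set and retry.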
In this picture, we can see that Saga is the one that implements transactions and CQRS the one that implements queries. As far as I know, a transaction is a set of queries that follow ACID properties.
So can I consider CQRS as an advanced version of saga which increases the speed of reads?
Note that this diagram is not meant to explain what Sagas and CQRS are. In fact, looked at that way it is quite confusing. What this diagram is telling you is which patterns you can use to read and write data that spans multiple microservices. It is saying that in order to write data (somewhat transactionally) across multiple microservices you can use Sagas, and in order to read data that belongs to multiple microservices you can use CQRS. But that doesn't mean that Sagas and CQRS have anything in common. They are two different patterns that solve completely different problems (writes and reads). To make an analogy, it's like saying that to make pizzas (write) you can use an oven and to view the pizza menu (read) you can use a tablet.
On the specific patterns:
Sagas: you can see them as a process manager or state machine. Note that they do not implement transactions in the RDBMS sense. Basically, they allow you to create a process that tells each microservice to perform a write operation and, if one of the operations fails, tells the other microservices to roll back (or compensate) the action they performed. So these "transactions" won't be atomic, because while the process is running some microservices will already have modified their data and others won't. And it is not guaranteed that whatever has succeeded can successfully be rolled back or compensated.
CQRS (Command Query Responsibility Segregation): suggests the separation of Commands (writes) and Queries (reads). The reason is the one I gave before: reads and writes are two very different operations. By separating them, you can implement each with the patterns that best fit it. The reason why CQRS is shown in your diagram as a solution for reading data that comes from multiple microservices is that one way of implementing queries is to listen to Domain Events coming from multiple microservices and store the information in a single database, so that when it's time to query the data you can find it all in a single place. An alternative to this would be Data Composition, which means that when the query arrives you submit queries to multiple microservices at that moment and compose the response from their responses.
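The process described above can be sketched roughly as follows. This is an illustration only: SagaStep and Saga are hypothetical names, every step is a local function rather than a call to a real microservice, and the sketch assumes compensations never fail, which, as noted, is not guaranteed in practice.

```scala
// Hypothetical step: a forward action plus the compensation that undoes it.
final case class SagaStep(
    name: String,
    action: () => Either[String, Unit],
    compensate: () => Unit
)

object Saga {
  // Runs the steps in order. On the first failure, compensates the steps that
  // already completed, newest first, and returns the error.
  def run(steps: List[SagaStep]): Either[String, Unit] = {
    @annotation.tailrec
    def loop(remaining: List[SagaStep], done: List[SagaStep]): Either[String, Unit] =
      remaining match {
        case Nil => Right(())
        case step :: rest =>
          step.action() match {
            case Right(_) => loop(rest, step :: done)
            case Left(error) =>
              done.foreach(_.compensate()) // `done` is already in newest-first order
              Left(s"${step.name} failed: $error")
          }
      }
    loop(steps, Nil)
  }
}
```

Between the first action and the last compensation, other readers can observe the partially applied state, which is exactly why a saga is not atomic.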
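A minimal sketch of the event-listening variant might look like this. The event types, OrderView and OrderProjection are hypothetical, and the "single database" is reduced to an immutable in-memory Map so the example stays self-contained.

```scala
// Hypothetical domain events, imagined as coming from two different microservices.
sealed trait DomainEvent
final case class OrderPlaced(orderId: String, customerId: String) extends DomainEvent
final case class PaymentReceived(orderId: String, amount: BigDecimal) extends DomainEvent

// Denormalised row kept by the query side.
final case class OrderView(orderId: String, customerId: String, paid: BigDecimal)

object OrderProjection {
  // Applies one event to the read model and returns the updated model.
  def applyEvent(view: Map[String, OrderView], event: DomainEvent): Map[String, OrderView] =
    event match {
      case OrderPlaced(id, customer) =>
        view.updated(id, OrderView(id, customer, BigDecimal(0)))
      case PaymentReceived(id, amount) =>
        view.get(id) match {
          case Some(row) => view.updated(id, row.copy(paid = row.paid + amount))
          case None      => view // payment for an unknown order: ignored in this sketch
        }
    }

  // Rebuilds the whole read model from a stream of events.
  def replay(events: Seq[DomainEvent]): Map[String, OrderView] =
    events.foldLeft(Map.empty[String, OrderView])(applyEvent)
}
```

A query then becomes a plain lookup in the projected Map, with no call to the services that own the data.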
So can I consider CQRS as an advanced version of saga which increases the speed of reads?
Personally I would not mix the concepts of CQRS and Sagas. I think this can really confuse you. Consider both patterns as two completely different things and try to understand them both independently.
I'm just wondering if the design I will be trying to implement is valid CQRS.
I'm going to have a query handler that itself will send more queries to other sub-handlers. Its main task is going to aggregate results from multiple services.
Is it OK to send queries from within handlers? I can already think of 3-level-deep hierarchies of these in my application.
No, MediatR is designed for a single level of requests and handlers. A better design may be to create a service/manager of some kind which invokes multiple, isolated queries using MediatR and aggregate the results. The implementation may be similar to what you have in mind, except that it's not a request handler itself but rather an aggregation of multiple request handlers.
This will badly affect the system's resilience and compute time and it will increase coupling.
If any of the sub-handlers fails then the entire handler will fail. If the queries are sent synchronously then the total compute time is the sum of the individual query times.
One way to reuse the sub-handlers is to query them in the background, outside the client's request, if possible. That way, when a client request comes in you already have the data locally, which improves both resilience and compute time. You will be left only with the coupling, but it could be worth it if the benefit of reuse outweighs the cost of coupling.
I don't know if any of this is possible in MediatR; these are just general principles of system architecture.
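To illustrate the two general principles (not MediatR itself), here is a small Scala sketch using the standard library's Futures. The Aggregation object, the Summary type and the fallback values are all hypothetical. The sub-queries are passed in as already started Futures, so they run concurrently, and each one degrades to a default instead of failing the whole aggregate.

```scala
import scala.concurrent.{ExecutionContext, Future}
import scala.util.control.NonFatal

object Aggregation {
  // Hypothetical aggregate result built from two independent sub-queries.
  final case class Summary(customerName: String, openOrders: Int)

  // Both futures are expected to be running already when this is called, so the
  // total time is roughly that of the slower query rather than the sum of both.
  def aggregate(name: Future[String], orders: Future[Int])(
      implicit ec: ExecutionContext): Future[Summary] = {
    // A failed sub-query falls back to a placeholder instead of failing everything.
    val safeName = name.recover { case NonFatal(_) => "unknown" }
    val safeOrders = orders.recover { case NonFatal(_) => -1 }
    safeName.zip(safeOrders).map { case (n, o) => Summary(n, o) }
  }
}
```

Whether a placeholder is acceptable depends on the query; for some results failing the whole request is the correct behaviour.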
While building a large multi-threaded application for the financial services industry, I utilized immutable classes and an Actor model for workflow everywhere I could. I'm pretty pleased with the outcome. It uses a fair amount of heap space (it's in Java, btw) but the JVM's GC works pretty well with short-lived immutable classes.
I am just wondering if there are any downsides to using this kind of pattern going forward. When debugging a teammate's code, I often find myself recommending this pattern in one way or another. I guess once one has a hammer, everything looks like a nail. So the question is: when would this design pattern (paradigm?) work poorly?
My hunch is when memory usage is a big issue or when the project restrictions require something along the lines of low-level C, etc.
Many scientific simulation codes are really memory intensive. For example, for cellular automata models fast memory access is more important than CPU power. In that case, accessing and modifying a mutable array in place is always faster (at least in all my trials).
It all depends on your project design.
If you have some resource that a lot of actors use, the common pattern is to design an accessor actor. When some other actor needs the resource, it asks the accessor actor, and the answer is copied through the message channel.
Now imagine you have a really heavy resource (e.g. map[String, BigObject]) and other actors frequently ask for some BigObject: you waste your bandwidth.
A better idea would be to share the resource with all actors in read-only mode, and have one actor perform the writes.
Another example would be a database connector that connects to a database with a lot of blob data. When the database connector is thread-safe (as it normally is), it's better to share the connector object reference with all actors than to design some actor which provides the access.
All you need to remember is that communication between actors is done by copying messages.
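The "share read-only, write through one actor" idea can be sketched without any actor library. In the hypothetical SharedResource below, a single-threaded executor stands in for the one writing actor, and readers share a reference to an immutable Map rather than receiving copies; BigObject is a placeholder for the heavy value.

```scala
import java.util.concurrent.Executors
import java.util.concurrent.atomic.AtomicReference
import scala.concurrent.{ExecutionContext, Future}

// Hypothetical heavy value.
final case class BigObject(payload: Vector[Byte])

final class SharedResource {
  // The single writer: every update runs on this one thread, much like
  // messages processed one at a time by a single actor.
  private val writer =
    ExecutionContext.fromExecutorService(Executors.newSingleThreadExecutor())
  private val snapshot = new AtomicReference(Map.empty[String, BigObject])

  // Readers get a reference to the immutable value; nothing is copied.
  def read(key: String): Option[BigObject] = snapshot.get().get(key)

  // Writes are serialised on the writer thread, so get-then-set cannot interleave.
  def write(key: String, value: BigObject): Future[Unit] =
    Future(snapshot.set(snapshot.get() + (key -> value)))(writer)

  def shutdown(): Unit = writer.shutdown()
}
```

Because the Map and its values are immutable, handing the same reference to many readers is safe, which is what makes the read-only sharing cheap.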
Scala 2.9 introduced parallel collections. They are a really great tool for certain tasks. However, how do they work internally and am I able to influence the behavior/configuration?
What method do they use to figure out the optimal number of threads? If I am not satisfied with the result are there any configuration parameters to adjust?
I'm not only interested in how many threads are actually created; I am also interested in how the actual work is distributed amongst them, how the results are collected, and how much magic is going on behind the scenes. Does Scala somehow test if a collection is large enough to benefit from parallel processing?
Briefly, there are two orthogonal aspects to how your operations are parallelized:
1. The extent to which your collection is split into chunks (i.e. the size of the chunks) for a parallelizable operation (such as map or filter).
2. The number of threads to use for the underlying fork-join pool (on which the parallel tasks are executed).
For #2, this is managed by the pool itself, which discovers the "ideal" level of parallelism at runtime (see java.lang.Runtime.getRuntime.availableProcessors).
For #1, this is a separate problem and the Scala parallel collections API does this via the concept of work-stealing (adaptive scheduling). That is, when a particular piece of work is done, a worker will attempt to steal work from other work-queues. If none is available, this is an indication that all of the processors are very busy and hence a bigger chunk of work should be taken.
Aleksandar Prokopec, who implemented the library, gave a talk at this year's ScalaDays which will be online shortly. He also gave a great talk at ScalaDays2010 where he describes in detail how the operations are split and re-joined (there are a number of issues that are not immediately obvious and some lovely bits of cleverness in there too!).
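To make the two aspects concrete, here is a sketch that uses java.util.concurrent.ForkJoinPool directly. It is not the parallel collections implementation: SumTask and ParallelSum are hypothetical names, and the chunk size is a fixed threshold parameter, whereas the library is described above as adapting it. It only shows how a collection is split into tasks, how idle workers can steal forked halves, and how partial results are joined.

```scala
import java.util.concurrent.{ForkJoinPool, RecursiveTask}

// Sums data(from until until) by splitting the range until a chunk
// holds at most `threshold` elements.
final class SumTask(data: Array[Long], from: Int, until: Int, threshold: Int)
    extends RecursiveTask[Long] {
  override def compute(): Long =
    if (until - from <= threshold) {
      var i = from
      var acc = 0L
      while (i < until) { acc += data(i); i += 1 }
      acc
    } else {
      val mid = from + (until - from) / 2
      val left = new SumTask(data, from, mid, threshold)
      val right = new SumTask(data, mid, until, threshold)
      left.fork()                   // queue the left half; an idle worker may steal it
      right.compute() + left.join() // work on the right half, then combine the results
    }
}

object ParallelSum {
  // `parallelism` is the number of worker threads in the pool (aspect #2),
  // `threshold` is the chunk size (aspect #1).
  def sum(data: Array[Long], parallelism: Int, threshold: Int): Long = {
    require(threshold > 0, "threshold must be positive")
    val pool = new ForkJoinPool(parallelism)
    try pool.invoke(new SumTask(data, 0, data.length, threshold))
    finally pool.shutdown()
  }
}
```

Passing Runtime.getRuntime.availableProcessors as parallelism mirrors the default mentioned above; a smaller threshold produces more, finer tasks for workers to steal.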
A more comprehensive answer is available in the PDF describing the parallel collections API.
It seems that there has been a recent rising interest in STM (software transactional memory) frameworks and language extensions. Clojure in particular has an excellent implementation which uses MVCC (multi-version concurrency control) rather than a rolling commit log. GHC Haskell also has an extremely elegant STM monad which also allows transaction composition. Finally, so as to toot my own horn just a bit, I've recently implemented an STM framework for Scala which statically enforces reference restrictions.
All of these are interesting experiments, but they seem to be confined to that sphere alone (experimentation). So my question is: have any of you seen or used STM in the real world? If so, why? What sort of benefits did it bring? What about performance? (there seems to be a great deal of conflicting information on this point) Would you use STM again or would you prefer to use some other concurrency abstraction like actors?
I participated in the hobbyist development of the BitTorrent client in Haskell (named conjure). It uses STM quite heavily to coordinate different threads (1 per peer + 1 for storage management + 1 for overall management).
Benefits: fewer locks, readable code.
Speed was not an issue, at least not due to STM usage.
Hope this helps
We use it pretty routinely for high concurrency apps at Galois (in Haskell). It works, it's used widely in the Haskell world, and it doesn't deadlock (though of course you can have too much contention). Sometimes we rewrite things to use MVars, if we've got the design right -- as they're faster.
Just use it. It's no big deal. As far as I'm concerned, STM in Haskell is "solved". There's no further work to do. So we use it.
The article "Software Transactional Memory: Why is it Only a Research Toy?" (Călin Caşcaval et al., Communications of the ACM, Nov. 2008) fails to look at the Haskell implementation, which is a really big omission. The problem for STM, as the article points out, is that implementations must choose between either making all variable accesses transactional unless the compiler can prove them safe (which kills performance) or letting the programmer indicate which ones are to be transactional (which kills simplicity and reliability). However, the Haskell implementation uses the purity of Haskell to avoid the need to make most variable uses transactional, while the type system provides a simple model together with effective enforcement for the transactional mutation operations. Thus a Haskell program can use STM for those variables that are truly shared between threads whilst guaranteeing that non-transactional memory use is kept safe.
We, factis research GmbH, are using Haskell STM with GHC in production. Our server receives a stream of messages about new and modified "objects" from a clinical "data server", transforms this event stream on the fly (by generating new objects, modifying objects, aggregating things, etc.) and calculates which of these new objects should be synchronized to connected iPads. It also receives form inputs from iPads, which are processed, merged with the "main stream" and also synchronized to the other iPads. We're using STM for all channels and mutable data structures that need to be shared between threads. Threads are very lightweight in Haskell so we can have lots of them without impacting performance (at the moment 5 per iPad connection). Building a large application is always a challenge and there were many lessons to be learned, but we never had any problems with STM. It always worked as you'd naively expect. We had to do some serious performance tuning but STM was never a problem. (80% of the time we were trying to reduce short-lived allocations and overall memory usage.)
STM is one area where Haskell and the GHC runtime really shine. It's not just an experiment and not for toy programs only.
We're building a different component of our clinical system in Scala and have been using Actors so far, but we're really missing STM. If anybody has experience of what it's like to use one of the Scala STM implementations in production I'd love to hear from you. :-)
We have implemented our entire system (in-memory database and runtime) on top of our own STM implementation in C. Prior to this, we had some log and lock based mechanism to deal with concurrency, but this was a pain to maintain. We are very happy with STM since we can treat every operation the same way. Almost all locks could be removed. We use STM now for almost anything at any size; we even have a memory manager implemented on top.
The performance is fine, but to speed things up we have now developed a custom operating system in collaboration with ETH Zurich. The system natively supports transactional memory.
But there are some challenges caused by STM as well, especially with larger transactions and hotspots that cause unnecessary transaction conflicts. If, for example, two transactions put an item into a linked list, an unnecessary conflict will occur that could have been avoided using a lock-free data structure.
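For readers unfamiliar with the term, a lock-free linked structure can be sketched in a few lines. The LockFreeStack below is a hypothetical illustration written in Scala, not the C implementation described above: two concurrent pushes never block each other, and the one that loses the compare-and-set simply retries against the new head.

```scala
import java.util.concurrent.atomic.AtomicReference

// A stack backed by an immutable linked list whose head is swapped atomically.
final class LockFreeStack[A] {
  private val head = new AtomicReference[List[A]](Nil)

  @annotation.tailrec
  final def push(value: A): Unit = {
    val current = head.get()
    // If another thread changed the head in the meantime, retry with the new head.
    if (!head.compareAndSet(current, value :: current)) push(value)
  }

  @annotation.tailrec
  final def pop(): Option[A] = head.get() match {
    case Nil => None
    case current @ (top :: rest) =>
      if (head.compareAndSet(current, rest)) Some(top) else pop()
  }

  def size: Int = head.get().length
}
```

The retry touches only the single head reference, so a conflict costs one extra loop iteration rather than the re-execution of a whole transaction.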
I'm currently using Akka in some PGAS systems research. Akka is a Scala library for developing scalable concurrent systems using Actors, STM, and built-in fault tolerance capabilities modeled after Erlang's "Let It Fail/Crash/Crater/ROFL" philosophy. Akka's STM implementation is supposedly built around a Scala port of Clojure's STM implementation. An overview of Akka's STM module can be found here.