Our current application uses the Akka event stream and its publish/subscribe mechanism for a use case that imports a lot of data: for each row received, it publishes an event, and a subscriber handles it. This design runs the risk of losing events if something goes wrong with either the publisher or the subscriber.
I am wondering if using Akka Persistence makes sense here, for a few reasons:
1) Persist events
2) Audit history
3) Recreate a scenario from a snapshot
Note that there isn't a shared/global state in the system (the use case generally described in almost all Akka Persistence blogs/examples).
Does Akka persistence make sense here?
If I understand your scenario correctly, I'd say no for 1), yes for 2), no for 3):
1) If the message is lost due to a problem with the pub/sub mediator (which you don't really control), it will never reach your persistent actors, will therefore never be saved in the journal, and thus will never be replayed.
2) Recorded messages can be looked up during an audit.
3) If your actors are stateless processors, what scenario are you going to recreate/save in the snapshot?
I'd suggest you work around 1) by using a confirmation/retry mechanism in which you resend the message at regular intervals until you receive an ack from the consumer.
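A minimal sketch of such a mechanism with classic Akka actors; the RowImported/Ack messages and the 2-second interval are illustrative assumptions, not part of your system (Akka Persistence also ships a ready-made variant of this pattern, AtLeastOnceDelivery):

import akka.actor.{Actor, ActorRef, Cancellable}
import scala.concurrent.duration._

// Hypothetical messages for this sketch
case class RowImported(id: Long, data: String)
case class Ack(id: Long)

class ReliablePublisher(subscriber: ActorRef) extends Actor {
  import context.dispatcher

  // unacknowledged messages, keyed by id, each with its retry timer
  private var pending = Map.empty[Long, Cancellable]

  def receive: Receive = {
    case msg @ RowImported(id, _) =>
      // deliver now and keep resending every 2 seconds until an Ack arrives
      val timer = context.system.scheduler.scheduleWithFixedDelay(
        Duration.Zero, 2.seconds, subscriber, msg)
      pending += (id -> timer)

    case Ack(id) =>
      pending.get(id).foreach(_.cancel()) // consumer confirmed: stop retrying
      pending -= id
  }
}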
Related
One of the great promises of Event Sourcing is the ability to replay events. When there's no relationship between entities (e.g. blob storage, user profiles) it works great, but how do we replay quickly when there are important relationships to check?
For example: Product(id, name, quantity) and Order(id, list of productIds). If we have a CreateProduct and then a CreateOrder event, the order will succeed (the product is available in the warehouse). This is easy to implement e.g. with Kafka (one topic with n1 partitions for products, another with n2 partitions for orders).
During replay everything happens more quickly, and Kafka may reorder the events (e.g. CreateOrder and then CreateProduct), which gives us different behavior than the original run (CreateOrder will now fail because the product doesn't exist yet). This is because Kafka guarantees ordering only within one partition of one topic. The easy solution would be putting everything into one huge topic with a single partition, but this would be completely unscalable, as a single-threaded replay of bigger databases could take days at least.
Is there any existing, better solution for quickly replaying related entities? Or should we forget about event sourcing and replay when we need to check relationships in our databases, and accept that replaying is good only for unrelated data?
As a practical necessity when event sourcing, you need the ability to conjure up the stream of events for a particular entity, so that you can apply your event handler to build up its state. For Kafka, outside of the case where you have so few entities that you can assign an entire topic partition to the events of a single entity, this entails a linear scan and filter through a partition. So while Kafka is very likely to be a critical part of any event-driven/event-based system, relaying events published by a service for consumption by other services (at which point, if we consider the event vs. command dichotomy, we're talking about commands from the perspective of the consuming service), it's not well suited to the role of an event store: event stores are defined by their ability to quickly give you an ordered stream of the events for a particular entity.
The most popular purpose-built event store is, probably, the imaginatively named Event Store (at least partly due to the involvement of a few prominent advocates of event sourcing in its design and implementation). Alternatively, there are libraries/frameworks like Akka Persistence (JVM with a .Net port) which use existing DBs (e.g. relational SQL DBs, Cassandra, Mongo, Azure Cosmos, etc.) in a way which facilitates their use as an event store.
Event sourcing also, as a practical necessity, tends to lead to CQRS (they go together very well: event sourcing is arguably the simplest possible persistence model capable of being a write model, while it's nearly useless as a read model). The typical pattern is that the command-processing component of the system enforces constraints like "product exists before being added to the cart" before writing events to the event store (how those constraints are enforced is generally a question of whatever concurrency model is in use: the actor model has a high level of mechanical sympathy with this approach, but other models are possible). Events read back from the event store can then be assumed to have been valid as of the time they were written (though it's possible to later decide that a compensating event needs to be recorded). The events in the event store can be projected to a Kafka topic for communication to another service (the command-processing component is the single source of truth for events).
From the perspective of that other service, as noted, the projected events in the topic are commands (the implicit command accompanying an event is "update your model to account for this event"). Semantically, their provenance as events means that they've been validated and are undeniable (though they can be ignored). If some model validation needs to occur, that generally entails either a conscious decision to ignore the command or waiting until another command is received which allows the first one to be accepted.
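A minimal sketch of that "check the constraint, then persist" pattern with Akka Persistence; the cart/product names and the in-memory knownProducts set are illustrative assumptions, not a definitive design:

import akka.persistence.PersistentActor

// Hypothetical commands/events for the cart example
case class AddToCart(productId: String)
case class ProductAdded(productId: String)
case object Rejected

class CartActor(knownProducts: Set[String]) extends PersistentActor {
  def persistenceId: String = "cart-" + self.path.name

  private var items = List.empty[String]

  def receiveCommand: Receive = {
    case AddToCart(id) if knownProducts.contains(id) =>
      // The constraint is enforced here, before the event is written,
      // so anything read back from the journal was valid when written.
      persist(ProductAdded(id)) { evt => items = evt.productId :: items }
    case AddToCart(_) =>
      sender() ! Rejected // an invalid command never becomes an event
  }

  def receiveRecover: Receive = {
    case ProductAdded(id) => items = id :: items
  }
}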
OK, you are still thinking about how we developed applications over the last 20 years instead of how we should develop them in the future. There are frameworks that actually fit the coming paradigms; one of those, mentioned above, is Akka, and more importantly one of its subcomponents, Akka FSM (Finite State Machine). FSMs are a concept we have ignored in software development for years, but the future looks more and more event-based, and we can't ignore them anymore.
So how will these help you? Akka is a framework based on the Actor concept: every Actor is a unique entity with a mailbox. Say you have an Order Actor with id 123456789: every event for Order id 123456789 will be processed by this Actor, and its messages are ordered in its mailbox on a first-in, first-out principle, so you don't need any synchronisation logic anymore. But you can have millions of Order Actors in your system, and they work in parallel: while Order Actor 123456789 is processing its events, Order Actor 987654321 can process its own. There is your parallelism and scalability. As long as Kafka guarantees the order of every message for keys 123456789 and 987654321, everything is green.
Now you can ask where the Finite State Machine comes into play. As you mentioned, the problem arises when an addProduct event arrives before the createOrder event (the two being on different Kafka topics). At that point, the state machine behaves differently depending on whether the Order Actor is in the CREATED or the INITIALISING state: in CREATED it will just add the product; in INITIALISING it will probably just stash the message until the createOrder event arrives.
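A minimal sketch of that stashing behaviour with Akka FSM; the states, messages, and OrderData type are illustrative assumptions:

import akka.actor.{FSM, Stash}

// Hypothetical protocol and states for the Order example
case class CreateOrder(orderId: String)
case class AddProduct(productId: String)

sealed trait OrderState
case object Initialising extends OrderState
case object Created extends OrderState
case class OrderData(products: List[String])

class OrderActor extends FSM[OrderState, OrderData] with Stash {
  startWith(Initialising, OrderData(Nil))

  when(Initialising) {
    case Event(CreateOrder(_), data) =>
      unstashAll() // re-deliver any AddProduct that arrived too early
      goto(Created) using data
    case Event(_: AddProduct, _) =>
      stash()      // arrived before CreateOrder: keep it for later
      stay()
  }

  when(Created) {
    case Event(AddProduct(id), data) =>
      stay() using data.copy(products = id :: data.products)
  }

  initialize()
}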
These concepts are explained really well in this video, and if you want to see a practical example, I have a blog post for it, and this one for a more direct dive.
I think I found a solution for scalable (multi-partition) event sourcing:
create a topic named messages in Kafka (or in a similar system)
assign users to partitions (e.g. by murmurHash(login) % partitionCount; see the sketch below)
if a piece of data is mutable (e.g. Product, Order), every partition should contain its own copy of the data
if we have e.g. 256 pieces of a product in our warehouse and 64 partitions, we can initially 'give' every partition 4 pieces, so most CreateOrder events will be processed quickly without leaving the user's partition
if a user (a partition) sometimes needs to mutate data in another partition, it should send a message there:
for example, in the Product/Order domain, partitions could work like Walmart/Tesco stores around a country, and the messages sent between partitions ('stores') could be e.g. CreateProduct, UpdateProduct, CreateOrder, SendProductToMyPartition, ProductSentToYourPartition
the message will become an 'event', as if it was generated by a user
the message shouldn't be sent during replay (already sent, no need to do it twice)
This way even when Kafka (or any other event sourcing system) chooses to reorder messages between partitions, we'll still be ok, because we don't ever read any data outside our single-threaded 'island'.
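A minimal sketch of the partition-assignment rule mentioned in the list above (UserPartitioner is a hypothetical name; MurmurHash3 comes from the Scala standard library):

import scala.util.hashing.MurmurHash3

object UserPartitioner {
  // partitionCount is assumed to equal the topic's partition count
  def partitionFor(login: String, partitionCount: Int): Int =
    // stringHash may be negative; mask the sign bit before taking the modulo
    (MurmurHash3.stringHash(login) & Int.MaxValue) % partitionCount
}

// e.g. partitionFor("alice", 64) always routes alice's events to the same
// partition, preserving their relative order during both live runs and replay.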
EDIT: As @LeviRamsey noted, this 'single-threaded island' is basically the actor model, and frameworks like Akka can make it a bit easier.
I have some actors that kill themselves when idle or when other system constraints require them to. The actors that hold ActorRefs to them watch for their Terminated(ref), but there is a race condition: messages meant for an actor can be sent before the Terminated signal arrives, and I'm trying to figure out a clean way to handle that.
I was considering subscribing to DeadLetter and using that to signal the sender that their ref is stale and that they need to get or spawn a new target ActorRef.
However, in Akka Typed, I cannot find any way to get to dead letters other than using the untyped co-existence path, so I figure I'm likely approaching this wrong.
Is there a better pattern for dealing with dead downstream refs and redirecting messages to new downstream refs, short of requiring some kind of ack handshake for every message?
Consider dead letters a debugging tool rather than something to implement delivery guarantees with (true for both Akka Typed and untyped).
If an actor needs to be certain that a message was delivered, the message protocol will need to include an ack. To do resending, the actor will also need to keep a buffer of in-flight/not-yet-acknowledged messages so it can resend them.
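A minimal sketch of that ack-plus-resend-buffer protocol in Akka Typed; the message types and the 2-second retry interval are illustrative assumptions, not an Akka API:

import scala.concurrent.duration._
import akka.actor.typed.{ActorRef, Behavior}
import akka.actor.typed.scaladsl.Behaviors

object ReliableSender {
  sealed trait Command
  final case class Deliver(id: Long, payload: String) extends Command
  final case class Ack(id: Long) extends Command
  private case object Resend extends Command

  def apply(downstream: ActorRef[Deliver]): Behavior[Command] =
    Behaviors.withTimers { timers =>
      timers.startTimerWithFixedDelay(Resend, 2.seconds)

      // inFlight buffers every message until the downstream acks it
      def active(inFlight: Map[Long, Deliver]): Behavior[Command] =
        Behaviors.receiveMessage {
          case d: Deliver =>
            downstream ! d
            active(inFlight + (d.id -> d))
          case Ack(id) =>
            active(inFlight - id) // confirmed, drop from the buffer
          case Resend =>
            inFlight.values.foreach(downstream ! _) // retry everything unacked
            active(inFlight)
        }

      active(Map.empty)
    }
}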
We have some ideas for an abstraction over different levels of message-delivery reliability; we'll see whether that fits into Akka 2.6 or happens later. It is prototyped in: https://github.com/akka/akka/pull/25099
The naive approach to the use case of enriching an incoming stream of events stored in Kafka with reference data is to call, from a map() operator, an external service's REST API that provides this reference data, for each incoming event.
eventStream.map((key, event) -> /* query the external service here, then return the enriched event */)
Another approach is to have a second event stream with the reference data, store it in a KTable, which becomes a lightweight embedded "database", and then join the main event stream with it.
KStream<String, Object> eventStream = builder.stream(..., "event-topic");
KTable<String, Object> referenceDataTable = builder.table(..., "reference-data-topic");
eventStream
    .leftJoin(referenceDataTable, (event, referenceData) -> /* return the enriched event */)
    .map((key, enrichedEvent) -> new KeyValue<>(/* new key */, enrichedEvent))
    .to("enriched-event-topic", ...);
Can the "naive" approach be considered an anti-pattern? Can the "KTable" approach be recommended as the preferred one?
Kafka can easily manage millions of messages per minute. A service called from the map() operator should therefore be capable of handling high load too, and should also be highly available. These are extra requirements for the service implementation. But if the service satisfies these criteria, can the "naive" approach be used?
Yes, it is OK to do RPC inside Kafka Streams operations such as map(). You just need to be aware of the pros and cons of doing so, see below. Also, you should do any such RPC calls synchronously from within your operations (I won't go into the details of why here; if needed, I'd suggest creating a new question).
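For concreteness, a sketch of such a synchronous per-record lookup, here written in Scala against the Java DSL; the endpoint URL, types, and topic names are hypothetical, and serde configuration is omitted:

import org.apache.kafka.streams.StreamsBuilder
import org.apache.kafka.streams.kstream.{KStream, ValueMapper}

object NaiveEnrichment {
  case class Event(id: String, payload: String)
  case class EnrichedEvent(event: Event, referenceData: String)

  val builder = new StreamsBuilder()
  val events: KStream[String, Event] = builder.stream[String, Event]("event-topic")

  // The "naive" approach: one blocking RPC call per record, done synchronously
  val enrich: ValueMapper[Event, EnrichedEvent] = event => {
    val referenceData =
      scala.io.Source.fromURL(s"http://reference-service/lookup/${event.id}").mkString
    EnrichedEvent(event, referenceData)
  }

  events.mapValues(enrich).to("enriched-event-topic")
}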
Pros of doing RPC calls from within Kafka Streams operations:
Your application will fit more easily into an existing architecture, e.g. one where the use of REST APIs and request/response paradigms is commonplace. This means you can make progress more quickly for a first proof-of-concept or MVP.
The approach is, in my experience, easier to understand for many developers (particularly those who are just starting out with Kafka) because they are familiar with doing RPC calls in this manner from their past projects. Think: it helps to move gradually from request-response architectures to event-driven architectures (powered by Kafka).
Nothing prevents you from starting with RPC calls and request-response, and then later migrating to a more Kafka-idiomatic approach.
Cons:
You are coupling the availability, scalability, and latency/throughput of your Kafka Streams powered application to the availability, scalability, and latency/throughput of the RPC service(s) you are calling. This is relevant also for thinking about SLAs.
Related to the previous point, Kafka and Kafka Streams scale very well. If you are running at large scale, your Kafka Streams application might end up DDoS'ing your RPC service(s) because the latter probably can't scale as much as Kafka. You should be able to judge pretty easily whether or not this is a problem for you in practice.
An RPC call (like from within map()) is a side-effect and thus a black box for Kafka Streams. The processing guarantees of Kafka Streams do not extend to such side effects.
Example: Kafka Streams (by default) processes data based on event-time (= based on when an event happened in the real world), so you can easily re-process old data and still get back the same results as when the old data was still new. But the RPC service you are calling during such reprocessing might return a different response than "back then". Ensuring the latter is your responsibility.
Example: In the case of failures, Kafka Streams will retry operations, and it will guarantee exactly-once processing (if enabled) even in such situations. But it can't guarantee, by itself, that an RPC call you are doing from within map() will be idempotent. Ensuring the latter is your responsibility.
Alternatives
In case you are wondering what other alternatives you have: if, for example, you are doing RPC calls to look up data (e.g. to enrich an incoming stream of events with side/context information), you can address the downsides above by making the lookup data available in Kafka directly. If the lookup data is in MySQL, you can set up a Kafka connector to continuously ingest the MySQL data into a Kafka topic (think: CDC). In Kafka Streams, you can then read the lookup data into a KTable and enrich your input stream via a stream-table join.
I suspect most of the advice you hear from the internet is along the lines of, "OMG, if this REST call takes 200ms, how will I ever process 100,000 Kafka messages per second to keep up with my demand?"
Which is technically true: even if you scale up the servers for your REST service, if responses from that app routinely take 200ms (because it talks to a server 70ms away; the speed of light is kinda slow if that server is across the continent from you, and the calling microservice takes 130ms even if you measure right at the source) you will never keep up...
With KStreams the problem may be worse than it appears. Maybe you get 100,000 messages a second coming into your stream pipeline, but some KStream operator flatMaps, and that operation in your app creates 2 messages for every one object... so now you really have 200,000 messages a second crashing through your REST server.
BUT maybe you're using KStreams in an app that sees 100 messages a second, or you can partition your data so that you get a message per partition maybe even just once a second. In that case, you might be fine.
Maybe your Kafka data just needs to go somewhere else: i.e. the end of the stream is back into a good ol' RDBMS. In which case, yes, there's some careful balancing to do on the best way to deal with potentially "slow" systems, while making sure you don't DDoS yourself and can work your way out of a backlog.
So is it an anti-pattern? Eh, probably, if your Kafka cluster is LinkedIn-sized. Does it matter for you? That depends on how many messages per second you need to drive, how fast your REST service really is, and how efficiently it can scale (e.g. when your new KStreams pipeline suddenly delivers 5x the normal traffic to it...).
I have a Java web-service that I am going to reimplement from scratch in Scala. I have an actor-based design for the new code, with around 10-20 actors. One of the use-cases has a flow like this:
Actor A gets a message a, creates tens of b messages to be handled by Actor B (possibly multiple instances, for load balancing), producing multiple c messages for Actor C, and so on.
In the scenario above, one message a could lead to a few thousand messages being sent back and forth, but I don't expect more than a handful of a messages a day (yes, it is not a busy service at the moment).
I have the following requirements:
Messages should not be lost or repeated. I mean that if the system is restarted in the middle of processing b messages, the unprocessed ones should be picked up after the restart. On the other hand, the processed ones should not be picked up again (these messages will ultimately start some big computation, and repeating them is costly).
It should be easily extensible. I mean that in the future I may want to add other components to the system that can read all the communication (or parts of it) and, for example, make a log of what has happened, or count how many b messages were processed, or do something new with the b messages (in addition to what is already happening), etc. Note that these "components" could be independent applications written in other languages.
I am new to message bus technologies, but from what I have read, these requirements sound like what "message buses" such as RabbitMQ, Kafka, and Kestrel offer. However, I also see that Akka offers some means of persistence.
My problem is that, given the huge range of possibilities, I am lost as to which technology to use. I have read that something like Kafka is probably overkill for my application. But I am also not sure whether Akka Persistence answers my two requirements (especially the extensibility).
My question is: should I go for an enterprise message bus, something like Kafka? Or will something like Akka Persistence do?
Or would it be faster and more appropriate to implement something myself (with support for, say, AMQP, to allow extensibility)?
Of course, specific technology suggestions are also welcome if you know of something that fits this purpose.
A Message Bus (typically called a Message Broker), such as RabbitMQ, can handle out of the box all of the messaging mechanisms you describe in your question. Specifically:
RabbitMQ has the ability, out of the box:
To deliver each message until it is acknowledged, without redelivering messages that have already been acknowledged.
To extend the system with extra consumers for logging and statistics like you describe (e.g. by binding additional queues to the same exchange).
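A minimal sketch of the acknowledgement side using RabbitMQ's Java client from Scala; the queue name, host, and process function are assumptions:

import com.rabbitmq.client.{CancelCallback, ConnectionFactory, DeliverCallback}

object BMessageConsumer extends App {
  val factory = new ConnectionFactory()
  factory.setHost("localhost")
  val channel = factory.newConnection().createChannel()

  // durable = true: the queue survives broker restarts
  channel.queueDeclare("b-messages", true, false, false, null)

  val onDeliver: DeliverCallback = (_, delivery) => {
    process(new String(delivery.getBody, "UTF-8"))
    // ack only after successful processing; unacked messages are redelivered
    channel.basicAck(delivery.getEnvelope.getDeliveryTag, false)
  }
  val onCancel: CancelCallback = _ => ()

  // autoAck = false: the broker keeps each message until we ack it
  channel.basicConsume("b-messages", false, onDeliver, onCancel)

  def process(body: String): Unit = println(s"processing $body")
}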
My application needs to log all messages processed by an actor, and sometimes replay the messages between minSequenceNr and maxSequenceNr.
Is akka-persistence a good fit for this use case? If yes, how can I force a replay of messages from the journal? I can use Persistence(actorSystem).journalFor("x") to get the journal's ActorRef, but I can't send JournalProtocol.ReplayMessages to it because JournalProtocol is private to akka.persistence.
This question was asked and answered on akka-user already: https://groups.google.com/forum/#!topic/akka-user/AJjdIt_bztM
On Akka 2.3.x (very old version)
Have you read the docs about recovery http://doc.akka.io/docs/akka/2.3.4/scala/persistence.html#recovery ?
You can start recovery by sending a Recover(toSequenceNr: Long) message to yourself.
We do not support ranged playback (as in "from 200 to 400"); skipping events (the "from N" part) does not match the event-sourcing philosophy very well. On the other hand, you can easily issue a replay "to 400" and simply have your actor ignore any event with a seqNr lower than 200, which achieves the same end result you're after.
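A sketch of that against the old Akka 2.3 API referenced above (Recover was removed in later versions); MsgLogged and the sequence-number bounds are illustrative names:

import akka.persistence.{PersistentActor, Recover}

case class MsgLogged(text: String)

class MessageLog(minSeqNr: Long, maxSeqNr: Long) extends PersistentActor {
  def persistenceId = "message-log"

  // replace the default full recovery with one bounded from above
  override def preStart(): Unit = self ! Recover(toSequenceNr = maxSeqNr)

  def receiveRecover: Receive = {
    case evt: MsgLogged if lastSequenceNr >= minSeqNr =>
      println(s"replayed: $evt") // only events in [minSeqNr, maxSeqNr] are handled
    case _ => // events below minSeqNr are delivered by the journal but ignored
  }

  def receiveCommand: Receive = {
    case text: String => persist(MsgLogged(text)) { _ => /* update state */ }
  }
}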
On Akka 2.4.x
Since entering its stable release in 2.4, Akka Persistence disallows arbitrary replays in the middle of an actor's lifetime. We found this caused more bugs than benefits for people. Please read http://doc.akka.io/docs/akka/2.4.5/scala/persistence.html
I hope this helps, happy hakking!