If I specify a changelog backing a RocksDB table in Samza, is there a configuration to control the async write interval to the changelog? I want to reduce it to a shorter time, but I cannot see anything in the Config reference.
The scenario is to write to a changelog from a stream after bridging a legacy JMS connection. This legacy connection provides partial updates, and I want to merge the partial updates into a fuller message, building a cache of these messages in the Samza streaming application and writing them down to a changelog.
If I use a changelog configured with stores.store-name.changelog, then changes I make to the Samza Table API table are eventually written to the changelog, but not quickly enough for my needs, so I want to configure the maximum wait time before they propagate to the changelog.
Alternatively, it seems that using withSideInputs to bootstrap my table each time and then using sendTo will update faster; I can keep a local store to read and write the cache to, and always have the changelog as the golden source.
The reason I also want the changelog to be written quickly is that other applications are reading from it.
Yes, you can configure the interval at which it commits changes to the changelog using the config:
task.commit.ms
(See the Samza configuration documentation for details.)
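For example, a sketch of the relevant job properties (the store name, changelog topic and interval below are illustrative; the default task.commit.ms is 60000):

# hypothetical store and changelog names
stores.profile-store.factory=org.apache.samza.storage.kv.RocksDbKeyValueStorageEngineFactory
stores.profile-store.changelog=kafka.profile-store-changelog
# commit (and changelog flush) interval in milliseconds
task.commit.ms=5000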
Then writes to the store will be flushed when the commit happens:
profileTable.put(message.key, message.value)
A note on this: higher volumes of input appear to result in changes reaching the changelog topic before this commit interval elapses. Also, be careful not to set it too low, as that will slow down overall throughput massively at higher volumes.
You can also use the low-level API to commit on a particular stream task: the TaskCoordinator provides a commit API for manual commits.
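A minimal sketch of that, assuming a low-level StreamTask and a store named profile-store (the store wiring via InitableTask is omitted); keep the throughput warning above in mind if you commit this aggressively:

import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class ProfileCacheTask implements StreamTask {
  // obtained via InitableTask.init(...) / context.getStore("profile-store") in a real job
  private KeyValueStore<String, String> profileStore;

  @Override
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
                      TaskCoordinator coordinator) {
    profileStore.put((String) envelope.getKey(), (String) envelope.getMessage());
    // Ask Samza to commit this task now, flushing the store and its changelog.
    coordinator.commit(TaskCoordinator.RequestScope.CURRENT_TASK);
  }
}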
There are a lot of articles across the internet about the usage of Kafka Streams, but almost nothing about how it works internally.
Does it use any features inside Kafka outside the standard set (let's call "standard" the librdkafka implementation)?
If it saves the state in RocksDB (or any custom StateStore), how does it guarantee that saving the state and committing the offset happen in one transaction?
The same question applies when the state is saved in a compacted log (the commit and the log update should be in one transaction).
Thank you.
I found the answers by combining information from several threads here.
It uses transactions (see https://stackoverflow.com/a/54593467/414016), which are not (yet) supported by librdkafka.
It doesn't really rely on RocksDB; instead, it saves the state changes into the commit log (see https://stackoverflow.com/a/50264900/414016).
It does so using the transactions mentioned above.
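As a concrete illustration (a sketch; the app id and broker address are placeholders), enabling that transactional behaviour in Kafka Streams is a single config switch:

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class EosConfig {
  static Properties streamsConfig() {
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");     // placeholder
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder
    // Turns on the producer transactions mentioned above, so changelog writes,
    // output records and consumer offsets are committed atomically.
    props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
    return props;
  }
}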
I am trying to implement event sourcing/CQRS/DDD for the first time, mostly for learning purposes. There is the idea of an event store and a message queue such as Apache Kafka, and you have events flowing from event store => Kafka Connect JDBC/Debezium CDC => Kafka.
I am wondering why there needs to be a separate event store when it sounds like its purpose can be fulfilled by Kafka itself with its main features and log compaction or configuring log retention for permanent storage. Should I store my events in a dedicated store such as an RDBMS that feeds into Kafka, or should I feed them straight into Kafka?
Much of the literature on event sourcing and CQRS comes from the domain-driven design community; in its earliest form, CQRS was called DDDD: distributed domain-driven design.
One of the common patterns in domain driven design is to have a domain model ensuring the integrity of the data in your durable storage, which is to say, ensuring that there are no internal contradictions...
I am wondering why there needs to be a separate event store when it sounds like its purpose can be fulfilled by Kafka itself with its main features and log compaction or configuring log retention for permanent storage.
So if we want an event stream with no internal contradictions, how do we achieve that? One way is to ensure that only a single process has permission to modify the stream. Unfortunately, that leaves you with a single point of failure -- the process dies, and everything comes to an end.
On the other hand, if you have multiple processes updating the same stream, then you risk concurrent writes, data races, and contradictions being introduced because one writer couldn't yet see what the other one did.
With an RDBMS or an Event Store, we can solve this problem using transactions or compare-and-swap semantics: an attempt to extend the stream with new events is rejected if there has been a concurrent modification.
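To make the compare-and-swap idea concrete, here is a toy sketch (not any particular product's API) of an append that is rejected on concurrent modification:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InMemoryEventStore {
  private final Map<String, List<Object>> streams = new HashMap<>();

  // expectedVersion is the number of events the writer last saw in the stream.
  public synchronized void append(String streamId, long expectedVersion, List<Object> newEvents) {
    List<Object> stream = streams.computeIfAbsent(streamId, id -> new ArrayList<>());
    if (stream.size() != expectedVersion) {
      // A concurrent writer extended the stream first: reject, re-read, retry.
      throw new IllegalStateException("Concurrent modification of stream " + streamId);
    }
    stream.addAll(newEvents);
  }
}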
Furthermore, because of its DDD heritage, it is common for the durable store to be divided into many very fine grained partitions (aka "aggregates"). One single shopping cart might reasonably have four streams dedicated to it.
If Kafka lacks those capabilities, then it is going to be a lousy replacement for an event store. KAFKA-2260 has been open for more than four years now, so we seem to be lacking the first. From what I've been able to discern from the Kafka literature, it isn't happy about fine-grained streams either (although it's been a while since I checked, perhaps things have changed).
See also: Jesper Hammarbäck writing about this 18 months ago, and reaching similar conclusions to those expressed here.
Kafka can be used as a DDD event store, but doing so comes with some complications because of the features it is missing.
Two key features that people use with event sourcing of aggregates are:
Load an aggregate, by reading the events for just that aggregate
When concurrently writing new events for an aggregate, ensure only one writer succeeds, to avoid corrupting the aggregate and breaking its invariants.
Kafka can't currently do either of these: 1 fails because you generally need to have one stream per aggregate type (it doesn't scale to one stream per aggregate, and this wouldn't necessarily be desirable anyway), so there is no way to load just the events for one aggregate; and 2 fails because https://issues.apache.org/jira/browse/KAFKA-2260 has not been implemented.
So you have to write the system in such a way that capabilities 1 and 2 aren't needed. This can be done as follows:
Rather than invoking command handlers directly, write the commands to streams. Have a command stream per aggregate type, sharded by aggregate id (these don't need permanent retention). This ensures that you only ever process a single command for a particular aggregate at a time.
Write snapshotting code for all your aggregate types
When processing a command message, do the following:
Load the aggregate snapshot
Validate the command against it
Write the new events (or return failure)
Apply the events to the aggregate
Save a new aggregate snapshot, including the current stream offset for the event stream
Return success to the client (via a reply message perhaps)
The only other problem is handling failures (such as the snapshotting failing). This can be handled during startup of a particular command-processing partition: it simply needs to replay any events since the last successful snapshot, and update the corresponding snapshots, before resuming command processing.
Kafka Streams appears to have the features to make this very simple - you have a KStream of commands that you transform into a KTable (containing snapshots, keyed by aggregate id) and a KStream of events (and possibly another stream containing responses). Kafka allows all this to work transactionally, so there is no risk of failing to update the snapshot. It will also handle migrating partitions to new servers, etc. (automatically loading the snapshot KTable into a local RocksDB when this happens).
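A rough sketch of that topology (Command, Event and Snapshot and their serdes are hypothetical types; topic and store names are placeholders; tracking the event-stream offset inside the snapshot, as described above, is omitted):

import java.util.ArrayList;
import java.util.List;
import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

public class CommandProcessorTopology {
  static Topology build(Serde<Snapshot> snapshotSerde) {
    StreamsBuilder builder = new StreamsBuilder();

    // Local snapshot store, backed by a changelog topic for fault tolerance.
    builder.addStateStore(Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore("aggregate-snapshots"),
        Serdes.String(), snapshotSerde));

    // Command topic keyed (sharded) by aggregate id.
    KStream<String, Command> commands = builder.stream("cart-commands");

    KStream<String, Event> events = commands.flatTransform(
        () -> new Transformer<String, Command, Iterable<KeyValue<String, Event>>>() {
          private KeyValueStore<String, Snapshot> store;

          @Override
          @SuppressWarnings("unchecked")
          public void init(ProcessorContext context) {
            store = (KeyValueStore<String, Snapshot>) context.getStateStore("aggregate-snapshots");
          }

          @Override
          public Iterable<KeyValue<String, Event>> transform(String aggregateId, Command command) {
            Snapshot snapshot = store.get(aggregateId);          // 1. load the snapshot
            if (snapshot == null) {
              snapshot = Snapshot.initial();                     // hypothetical factory for a new aggregate
            }
            List<Event> newEvents = snapshot.handle(command);    // 2. validate (empty list on rejection)
            store.put(aggregateId, snapshot.apply(newEvents));   // 3. save the updated snapshot
            List<KeyValue<String, Event>> out = new ArrayList<>();
            for (Event e : newEvents) {
              out.add(KeyValue.pair(aggregateId, e));            // 4. emit the new events
            }
            return out;
          }

          @Override
          public void close() { }
        },
        "aggregate-snapshots");

    events.to("cart-events");
    return builder.build();
  }
}

With processing.guarantee set to exactly_once, the snapshot update, the emitted events and the consumed command offset are then committed atomically.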
there is the idea of an event store and a message queue such as Apache Kafka, and you have events flowing from event store => Kafka Connect JDBC/Debezium CDC => Kafka
In the essence of DDD-flavoured event sourcing, there's no place for message queues as such. One of the DDD tactical patterns is the aggregate pattern, which serves as a transactional boundary. DDD doesn't care how the aggregate state is persisted, and usually, people use state-based persistence with relational or document databases. When applying events-based persistence, we need to store new events as one transaction to the event store in a way that we can retrieve those events later in order to reconstruct the aggregate state. Thus, to support DDD-style event sourcing, the store needs to be able to index events by the aggregate id and we usually refer to the concept of the event stream, where such a stream is uniquely identified by the aggregate identifier, and where all events are stored in order, so the stream represents a single aggregate.
Because we rarely can live with a database that only allows us to retrieve a single entity by its id, we need to have some place where we can project those events into, so we can have a queryable store. That is what your diagram shows on the right side, as materialised views. More often, it is called the read side and models there are called read-models. That kind of store doesn't have to keep snapshots of aggregates. Quite the opposite, read-models serve the purpose to represent the system state in a way that can be directly consumed by the UI/API and often it doesn't match with the domain model as such.
As mentioned in one of the answers here, the typical command handler flow is:
Load one aggregate's state by id, by reading all events for that aggregate. This already requires the event store to support that kind of load, which Kafka cannot do.
Call the domain model (aggregate root method) to perform some action.
Store new events to the aggregate stream, all or none.
If you now start to write events to the store and publish them somewhere else, you get a two-phase commit issue, which is hard to solve. So, we usually prefer using products like EventStore, which has the ability to create a catch-up subscription for all written events. Kafka supports that too. It is also beneficial to be able to create new event indexes in the store, linking to existing events, especially if you have several systems using one store. In EventStore this can be done using internal projections; you can also do it with Kafka Streams.
I would argue that indeed you don't need any messaging system between write and read sides. The write side should allow you to subscribe to the event feed, starting from any position in the event log, so you can build your read-models.
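For instance (a sketch with placeholder topic and partition; error handling and deserialization of real event types are omitted), a read-model projector can simply replay the event topic with a plain consumer:

import java.time.Duration;
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReadModelProjector {
  public static void project(KafkaConsumer<String, String> consumer) {
    TopicPartition tp = new TopicPartition("cart-events", 0);     // placeholder topic/partition
    consumer.assign(Collections.singletonList(tp));
    consumer.seekToBeginning(Collections.singletonList(tp));      // or seek(tp, savedPosition)
    while (true) {
      ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
      for (ConsumerRecord<String, String> record : records) {
        // apply the event to the read-model (e.g. update a row in a query database)
      }
    }
  }
}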
However, Kafka only works in systems that don't use the aggregate pattern, because it is essential to be able to use events, not a snapshot, as the source of truth, although this is of course debatable. Consider the need to change the way events affect the entity state (to fix a bug, for example): when you use events to reconstruct the entity state you will be just fine, but snapshots will stay the same, and you would need to apply correction events to fix all the snapshots.
I personally also prefer not to be tightly coupled to any infrastructure in my domain model; in fact, my domain models have zero dependencies on the infrastructure. By bringing the snapshotting logic into the Kafka Streams builder, I would be immediately coupled to it, which from my point of view is not the best solution.
Theoretically you can use Kafka as an event store, but as many people have mentioned above, you will have several restrictions, the biggest being that you can only read events by offset in Kafka, not by any other criteria.
For this reason, there are frameworks that deal with the event sourcing and CQRS parts of the problem.
Kafka is only part of the toolchain, providing the capability to replay events and a back-pressure mechanism that protects you from overload.
If you want to see how it all fits together, I have a blog post about it.
We've started experimenting with Kafka to see if it can be used to aggregate our application data. I think our use case is a match for Kafka Streams, but we aren't sure if we are using the tool correctly. The proof of concept we've built seems to be working as designed, but I'm not sure that we are using the APIs appropriately.
Our proof of concept uses Kafka Streams to keep a running tally of information about a program in an output topic, e.g.
{
"numberActive": 0,
"numberInactive": 0,
"lastLogin": "01-01-1970T00:00:00Z"
}
Computing the tally is easy; it is essentially a compare-and-swap (CAS) operation based on the input topic and output field.
The local state contains the most recent program for a given key. We join an input stream against the state store and run the CAS operation using a TransformerSupplier, which explicitly writes the data to the state store using
context.put(...)
context.commit();
Is this an appropriate use of the local state store? Is there another approach to keeping a stateful running tally in a topic?
Your design sounds right to me (I presume you are using the Processor API, not the Streams DSL): you are reading in one stream and calling transform() on it, with a state store associated with the operator. Since your update logic seems to be key-dependent only, it can be embarrassingly parallelized by the Streams library based on key partitioning.
One thing to note: it seems you are calling context.commit() after every single put call, which is not a recommended pattern. The commit() operation is a pretty heavy call that involves flushing the state store, sending a commit-offset request to the Kafka broker, etc.; calling it for every record would result in very low throughput. It is recommended to call commit() only after a batch of records has been processed, or you can just rely on the Streams config commit.interval.ms and let the Streams library call commit() internally at each interval. Note that this will not affect your processing semantics on graceful shutdown, since Streams always enforces a commit() call on shutdown.
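For example, a sketch of such a configuration (the app id, broker address and interval are illustrative); with this in place, the per-record context.commit() call can simply be removed:

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class TallyAppConfig {
  static Properties props() {
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "program-tally");      // placeholder
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder
    // Let the Streams library commit (and flush state stores) on this interval
    // instead of calling context.commit() for every record.
    props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 5000);
    return props;
  }
}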
Since I have not seen any examples of using Akka.NET journals and snapshot stores, I assume I have to use both types of actors to implement an event store and CQRS.
Is the snapshot store expected to be updated every time the actor state changes, or should it be updated on a schedule, such as every 10 seconds?
Should the snapshot store actors talk to the journal actors only, so that the actors holding the state do not talk to journals and snapshot stores at the same time? I'm thinking along the lines of separation of concerns.
Assume I have to shut down the server and bring it back up. A user tries to access a product (say, computers) through a web UI, and at that time the product actor does not exist in the actor system. To retrieve the state of the product, shouldn't I go to the snapshot store instead of replaying all the journal events to recreate the state?
In Akka.Persistence, both the journal and the snapshot store are in fact actors used to abstract your actors from a particular persistence provider. You will almost never have to use them directly; PersistentView and PersistentActor use them automatically under the hood.
Snapshot stores are only a way to speed up actor recovery when your persistent actor has a lot of events to recover from. In a distributed environment, snapshotting without event sourcing is not a means of achieving persistence. A good idea is to have a counter that produces a snapshot after every X events processed by the persistent actor. Time-based updates make no sense: in many cases the actor probably hasn't changed over the specified interval, and the performance is also bad (lots of unnecessary cycles).
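As an illustration of that counter pattern, here is a sketch written against the JVM Akka Persistence Java API (Akka.NET mirrors the same concepts under slightly different names); ProductState, ProductCommand and ProductEvent are hypothetical domain types, and the threshold is arbitrary:

import akka.persistence.AbstractPersistentActor;
import akka.persistence.SnapshotOffer;

public class ProductActor extends AbstractPersistentActor {
  private ProductState state = new ProductState();   // hypothetical domain state
  private int eventsSinceSnapshot = 0;
  private static final int SNAPSHOT_EVERY = 100;     // arbitrary threshold

  @Override
  public String persistenceId() {
    return "product-" + getSelf().path().name();
  }

  @Override
  public Receive createReceiveRecover() {
    return receiveBuilder()
        .match(SnapshotOffer.class, offer -> state = (ProductState) offer.snapshot())
        .match(ProductEvent.class, evt -> state = state.apply(evt))
        .build();
  }

  @Override
  public Receive createReceive() {
    return receiveBuilder()
        .match(ProductCommand.class, cmd -> {
          ProductEvent evt = state.handle(cmd);       // validate and produce an event
          persist(evt, e -> {
            state = state.apply(e);
            if (++eventsSinceSnapshot >= SNAPSHOT_EVERY) {
              saveSnapshot(state);                    // purely a recovery-speed optimisation
              eventsSinceSnapshot = 0;
            }
          });
        })
        .build();
  }
}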
Snapshot stores and journals are unaware of each other. Akka.Persistence persistent actors have a built-in recovery mechanism which handles the actor's state recovery from snapshot stores and journals, and exposes methods to communicate with them.
As I said, you probably don't want to communicate with snapshot stores and journals directly; this is what persistent actors/persistent views are for. Of course, you could just read the actor state directly from the backend storage, but then you would have to check whether there are any events after the latest saved snapshot, etc. Recreating the persistent actor/view on a different worker node is, in my opinion, a better solution.