In the Reactive Streams paradigm, we've so far focused primarily on cold streams: static, fixed-length streams that are easy to deal with. A more realistic reactive use case is something that happens infinitely. For example, we could have a stream of mouse movements that constantly needs to be reacted to, or a Twitter feed. These types of streams are called hot streams: they are always running and can be subscribed to at any point in time, missing the start of the data.
So how can we implement a hot stream? One way is to use a ConnectableFlux, as follows:
import reactor.core.publisher.ConnectableFlux;
import reactor.core.publisher.Flux;

ConnectableFlux<Object> publish = Flux.create(fluxSink -> {
    while (true) {
        // Emit continuously, regardless of how many subscribers there are.
        fluxSink.next(System.currentTimeMillis());
    }
})
.publish();
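Nothing flows until connect() is called on the ConnectableFlux; subscribers registered beforehand only start receiving values at that point. A minimal usage sketch (note that with the busy loop above, connect() will block its calling thread):

publish.subscribe(System.out::println); // registered, but receives nothing yet
publish.connect();                      // starts emission to all current subscribers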
You could also use MongoDB's capped collections to create a hot stream. Annotating a query method on such a collection with @Tailable creates a publisher that emits every new entry. With .share() you can multicast that publisher, so that not every subscription creates a new database connection.
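As a sketch of that idea, assuming Spring Data Reactive MongoDB, a capped collection of events, and illustrative names throughout:

import org.springframework.data.mongodb.repository.ReactiveMongoRepository;
import org.springframework.data.mongodb.repository.Tailable;
import reactor.core.publisher.Flux;

// Repository over a capped collection; @Tailable keeps the cursor open so
// every newly inserted document is pushed to subscribers.
public interface EventRepository extends ReactiveMongoRepository<Event, String> {
    @Tailable
    Flux<Event> findByType(String type);
}

// share() multicasts one underlying tailable cursor to all subscribers.
Flux<Event> hotEvents = eventRepository.findByType("order").share();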
I am converting a monolithic application that handled CRUD events into multiple microservices, each designed as a Kafka Streams application with a specific function. The idea is to allow each microservice to scale individually.
The challenge is that the processing flows for create and update events differ: a create event has to go through multiple functional microservices in the pipeline (filtering, enrichment), whereas an update event does not need these processors. But to process create and update events in order, update events also have to flow through all the microservices (reading from and writing to the Kafka topics); otherwise an update event could get published even before its create event is processed.
event publisher (create/update event) => (kafka topic) => filter => (kafka topic) => enrichment => ... => (output topic)
Is it OK to have a dummy pipe/service in the update event flow to achieve this ordering?
Another approach is, instead of designing each microservice as a Kafka Streams application, to give each microservice a synchronous REST interface and call it from a single Kafka Streams application, which can then manage the order in which create and update events are processed.
event publisher (create/update event) => (kafka topic) => event processor => (output topic)
where the event processor invokes each functional microservice over its REST interface.
In this case, although I still get a separate microservice per function that can scale individually, invoking REST calls inside an asynchronous streaming application might hurt its scalability and cause other problems.
Any comments or suggestions on the two approaches, or recommendations for an alternative approach, would help.
I'm using Cassandra and Kafka for event-sourcing, and it works quite well. But I've just recently discovered a potentially major flaw in the design/set-up. A brief intro to how it is done:
The aggregate command handler is basically a Kafka consumer, which consumes messages of interest on a topic:
1.1 When it receives a command, it loads all events for the aggregate and replays the aggregate event handler for each event to bring the aggregate up to its current state.
1.2 Based on the command and the business logic, it then applies one or more events to the event store. This involves inserting the new event(s) into the event store table in Cassandra. The events are stamped with a version number for the aggregate, starting at version 0 for a new aggregate, making projections possible. In addition, it sends the event to another topic (for projection purposes).
1.3 A Kafka consumer listens on the topic to which these events are published. This consumer acts as a projector. When it receives an event of interest, it loads the current read model for the aggregate, checks that the version of the event it has received is the expected version, and then updates the read model.
This seems to work very well. The problem arises when I want what EventStore calls category projections. Take an Order aggregate as an example: I can easily project one or more read models per Order. But if I want, for example, a projection containing a customer's 30 most recent orders, then I need a category projection.
I'm just scratching my head over how to accomplish this. I'm curious to know whether anyone else is using Cassandra and Kafka for event sourcing. I've read in a couple of places that some people discourage it; maybe this is the reason.
I know EventStore has support for this built in. Maybe using Kafka as the event store would be a better solution.
With this kind of architecture, you have to choose between:
Global event stream per type - simple
Partitioned event stream per type - scalable
Unless your system has fairly high throughput (say, at least tens or hundreds of events per second sustained for the stream type in question), the global stream is the simpler approach. Some systems (such as Event Store) give you the best of both worlds by having very fine-grained streams (such as per aggregate instance) with the ability to combine them into larger streams (per stream type/category/partition, per multiple stream types, etc.) in a performant and predictable way out of the box, while still staying simple by only requiring you to keep track of a single global event position.
If you go partitioned with Kafka:
Your projection code will need to handle concurrent consumer groups accessing the same read models when processing events from different partitions that need to go into the same models. Depending on your target store for the projection, there are many ways to handle this (transactions, optimistic concurrency, atomic operations, etc.), but it would be a problem for some target stores.
Your projection code will need to keep track of the stream position of each partition, not just a single position. If your projection reads from multiple streams, it has to keep track of lots of positions.
Using a global stream removes both of those concerns; performance is usually good enough.
In either case, you'll likely also want to get the stream position into the long term event storage (i.e. Cassandra) - you could do this by having a dedicated process reading from the event stream (partitioned or global) and just updating the events in Cassandra with the global or partition position of each event. (I have a similar thing with MongoDB - I have a process reading the 'oplog' and copying oplog timestamps into events, since oplog timestamps are totally ordered).
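A rough sketch of such a dedicated process, with a hypothetical events table and helper (assumes the plain Kafka consumer API and a DataStax CqlSession; props and session are configured elsewhere):

import java.time.Duration;
import java.util.List;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Read the event stream and stamp each stored event with its partition and
// offset, giving projections a stable position to order and track by.
try (KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(List.of("events"));
    while (true) {
        for (ConsumerRecord<String, byte[]> rec : consumer.poll(Duration.ofMillis(500))) {
            session.execute(
                "UPDATE events SET partition_no = ?, position = ? WHERE event_id = ?",
                rec.partition(), rec.offset(), eventIdOf(rec)); // eventIdOf is hypothetical
        }
    }
}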
Another option is to drop Cassandra from the initial command processing and use Kafka Streams instead:
Partitioned command stream is processed by joining with a partitioned KTable of aggregates
Command result and events are computed
Atomically, the KTable is updated with the changed aggregate, events are written to the event stream, and the command response is written to the command response stream.
You would then have a downstream event processor that copies the events into Cassandra for easier querying, etc. (and which can add the Kafka stream position to each event as it does so, to give the category ordering). This can help with catch-up subscriptions, etc. if you don't want to use Kafka for long-term event storage. (To catch up, you'd just read as far as you can from Cassandra and then switch to streaming from Kafka from the position of the last Cassandra event.) On the other hand, Kafka itself can store events forever, so this isn't always necessary.
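The catch-up itself would look roughly like this (the store API and helpers are hypothetical):

// Read history from Cassandra first, then stream live from Kafka starting
// at the position of the last event seen in Cassandra.
long lastPosition = -1L;
for (Event e : cassandraStore.readEventsInOrder()) { // hypothetical ordered query
    apply(e);
    lastPosition = e.position();
}
consumeKafkaFrom(lastPosition + 1, this::apply);      // hypothetical helper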
I hope this helps a bit with understanding the tradeoffs and problems you might encounter.
I am trying to implement event sourcing/CQRS/DDD for the first time, mostly for learning purposes. There is the idea of an event store and a message queue such as Apache Kafka, with events flowing from event store => Kafka Connect JDBC/Debezium CDC => Kafka.
I am wondering why there needs to be a separate event store when it sounds like its purpose can be fulfilled by Kafka itself, with its main features plus log compaction or log retention configured for permanent storage. Should I store my events in a dedicated store such as an RDBMS to feed into Kafka, or should I feed them straight into Kafka?
Much of the literature on event sourcing and CQRS comes from the domain-driven design community; in its earliest form, CQRS was called DDDD: distributed domain-driven design.
One of the common patterns in domain-driven design is to have a domain model ensuring the integrity of the data in your durable storage, which is to say, ensuring that there are no internal contradictions...
I am wondering why there needs to be a separate event store when it sounds like its purpose can be fulfilled by Kafka itself with its main features and log compaction or configuring log retention for permanent storage.
So if we want an event stream with no internal contradictions, how do we achieve that? One way is to ensure that only a single process has permission to modify the stream. Unfortunately, that leaves you with a single point of failure -- the process dies, and everything comes to an end.
On the other hand, if you have multiple processes updating the same stream, then you have risk of concurrent writes, and data races, and contradictions being introduced because one writer couldn't yet see what the other one did.
With an RDBMS or an Event Store, we can solve this problem by using transactions or compare-and-swap semantics; an attempt to extend the stream with new events is rejected if there has been a concurrent modification.
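For example, with an RDBMS the compare-and-swap can be a unique constraint over (aggregate_id, version); a minimal sketch with an assumed schema:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Appending at expectedVersion + 1 fails if another writer got there first,
// because (aggregate_id, version) is declared UNIQUE.
void append(Connection conn, String aggregateId, long expectedVersion, byte[] payload)
        throws SQLException {
    String sql = "INSERT INTO events (aggregate_id, version, payload) VALUES (?, ?, ?)";
    try (PreparedStatement ps = conn.prepareStatement(sql)) {
        ps.setString(1, aggregateId);
        ps.setLong(2, expectedVersion + 1);
        ps.setBytes(3, payload);
        ps.executeUpdate(); // constraint violation => concurrent modification rejected
    }
}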
Furthermore, because of its DDD heritage, it is common for the durable store to be divided into many very fine grained partitions (aka "aggregates"). One single shopping cart might reasonably have four streams dedicated to it.
If Kafka lacks those capabilities, then it is going to be a lousy replacement for an event store. KAFKA-2260 has been open for more than four years now, so we seem to be lacking the first. From what I've been able to discern from the Kafka literature, it isn't happy about fine-grained streams either (although it's been a while since I checked; perhaps things have changed).
See also: Jesper Hammarbäck writing about this 18 months ago, and reaching similar conclusions to those expressed here.
Kafka can be used as a DDD event store, but there are some complications if you do so due to the features it is missing.
Two key features that people use with event sourcing of aggregates are:
Load an aggregate, by reading the events for just that aggregate
When concurrently writing new events for an aggregate, ensure only one writer succeeds, to avoid corrupting the aggregate and breaking its invariants.
Kafka currently can't do either of these. (1) fails because you generally need one stream per aggregate type (one stream per aggregate doesn't scale, and wouldn't necessarily be desirable anyway), so there's no way to load just the events for one aggregate; (2) fails because https://issues.apache.org/jira/browse/KAFKA-2260 has not been implemented.
So you have to write the system in such a way that capabilities 1 and 2 aren't needed. This can be done as follows:
Rather than invoking command handlers directly, write commands to streams. Have a command stream per aggregate type, sharded by aggregate id (these don't need permanent retention). This ensures that you only ever process a single command for a particular aggregate at a time.
Write snapshotting code for all your aggregate types
When processing a command message, do the following:
Load the aggregate snapshot
Validate the command against it
Write the new events (or return failure)
Apply the events to the aggregate
Save a new aggregate snapshot, including the current stream offset for the event stream
Return success to the client (via a reply message perhaps)
The only other problem is handling failures (such as snapshotting failing). This can be handled during startup of a particular command-processing partition: it simply needs to replay any events since the last successful snapshot and update the corresponding snapshots before resuming command processing.
Kafka Streams appears to have the features to make this very simple - you have a KStream of commands that you transform into a KTable (containing snapshots, keyed by aggregate id) and a KStream of events (and possibly another stream containing responses). Kafka allows all this to work transactionally, so there is no risk of failing to update the snapshot. It will also handle migrating partitions to new servers, etc. (automatically loading the snapshot KTable into a local RocksDB when this happens).
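A rough sketch of that topology (not a complete program: serdes, the Command/Snapshot/CommandResult types, and the handle function are all assumed):

import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

StreamsBuilder builder = new StreamsBuilder();

// Commands keyed by aggregate id; snapshots as a changelog-backed KTable.
KStream<String, Command> commands = builder.stream("commands");
KTable<String, Snapshot> snapshots = builder.table("snapshots");

// Join each command with the current snapshot and compute events + response.
KStream<String, CommandResult> results =
    commands.leftJoin(snapshots, (command, snapshot) -> handle(command, snapshot));

results.flatMapValues(CommandResult::events).to("events");           // event stream
results.mapValues(CommandResult::snapshot).to("snapshots");           // updated snapshots
results.mapValues(CommandResult::response).to("command-responses");   // replies

new KafkaStreams(builder.build(), streamsProps).start();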
there is the idea of an event store and a message queue such as Apache Kafka, and you have events flowing from event store => Kafka Connect JDBC/Debezium CDC => Kafka
In the essence of DDD-flavoured event sourcing, there's no place for message queues as such. One of the DDD tactical patterns is the aggregate pattern, which serves as a transactional boundary. DDD doesn't care how the aggregate state is persisted, and usually people use state-based persistence with relational or document databases. When applying event-based persistence, we need to store new events as one transaction to the event store, in a way that lets us retrieve those events later in order to reconstruct the aggregate state. Thus, to support DDD-style event sourcing, the store needs to be able to index events by aggregate id, and we usually refer to the concept of the event stream, where such a stream is uniquely identified by the aggregate identifier and all events are stored in order, so the stream represents a single aggregate.
Because we can rarely live with a database that only allows us to retrieve a single entity by its id, we need some place to project those events into, so we can have a queryable store. That is what your diagram shows on the right side, as materialised views. More often, it is called the read side, and the models there are called read models. That kind of store doesn't have to keep snapshots of aggregates. Quite the opposite: read models serve the purpose of representing the system state in a way that can be directly consumed by the UI/API, and often it doesn't match the domain model as such.
As mentioned in one of the answers here, the typical command handler flow is:
Load one aggregate's state by id, by reading all events for that aggregate. This already requires the event store to support that kind of load, which Kafka cannot do.
Call the domain model (aggregate root method) to perform some action.
Store new events to the aggregate stream, all or none.
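Put as code, that flow looks roughly like this (the event-store API and the Order aggregate are hypothetical; step 1 is exactly the load Kafka cannot serve):

// 1. Load the aggregate's own events and rebuild its state.
List<Event> history = eventStore.loadStream(command.aggregateId());
Order order = Order.replay(history);

// 2. Let the domain model (aggregate root method) perform the action.
List<Event> newEvents = order.handle(command);

// 3. Append the new events atomically: all or none.
eventStore.append(command.aggregateId(), order.version(), newEvents);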
If you now start to write events to the store and publish them somewhere else, you get a two-phase commit issue, which is hard to solve. So, we usually prefer using products like EventStore, which has the ability to create a catch-up subscription for all written events. Kafka supports that too. It is also beneficial to have the ability to create new event indexes in the store, linking to existing events, especially if you have several systems using one store. In EventStore it can be done using internal projections, you can also do it with Kafka streams.
I would argue that indeed you don't need any messaging system between write and read sides. The write side should allow you to subscribe to the event feed, starting from any position in the event log, so you can build your read-models.
However, Kafka only works in systems that don't use the aggregate pattern, because it is essential to be able to use events, not a snapshot, as the source of truth, although that is of course debatable. Consider the possibility of changing the way events affect the entity state (fixing a bug, for example): when you use events to reconstruct the entity state you will be just fine, but snapshots will stay the same and you'll need to apply correction events to fix all the snapshots.
I personally also prefer not to be tightly coupled to any infrastructure in my domain model. In fact, my domain models have zero dependencies on the infrastructure. By bringing the snapshotting logic into the Kafka Streams builder, I would be immediately coupled, and from my point of view that is not the best solution.
Theoretically you can use Kafka as an event store, but as many people have mentioned above, you will have several restrictions, the biggest being that you can only read events by offset in Kafka, not by any other criteria.
For this reason there are frameworks dealing with the event sourcing and CQRS parts of the problem.
Kafka is only part of the toolchain, providing the capability of replaying events and the back-pressure mechanism that protects you from overload.
If you want to see how it all fits together, I have a blog post about it.
Microsoft says, "developers represent asynchronous data streams with Observables." I'm trying to reason through the idea. If I were to tackle the concept implicitly, I would imagine that it's just, anything that could be observed in the data stream. Code should be more precise.
How would I know an "observable" if I saw it? Could you give me a better explanation of what an "observable" is?
Microsoft says, "developers represent asynchronous data streams with
Observables." I'm trying to reason through the idea. If I were to
tackle the concept implicitly, I would imagine that it's just,
anything that could be observed in the data stream. Code should be
more precise.
The code actually is more precise. An observable is represented by the IObservable<T> interface, whose main job is to handle IObserver<T>s. The two work in tandem: an IObservable<T> represents a stream of type T that can be subscribed to, and an IObserver<T> represents a handler that subscribes to the observable to handle those events.
There are three types of events that an observable can emit:
OnNext: The next instance of T
OnCompleted: A non-error (empty-message) terminator.
OnError: An error terminator.
However, observables don't emit these messages directly; rather, they emit them only to subscribed observers.
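In Java terms, the same contract can be sketched with RxJava (the Java analogue of IObservable<T>/IObserver<T>); this assumes RxJava 3 on the classpath:

import io.reactivex.rxjava3.core.Observable;

Observable<Long> prices = Observable.just(101L, 102L, 103L);

prices.subscribe(
    price -> System.out.println("OnNext: " + price),  // next T in the stream
    error -> System.err.println("OnError: " + error), // error terminator
    () -> System.out.println("OnCompleted"));         // non-error terminator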
How would I know an "observable" if I saw it? Could you give me a better explanation of what an "observable" is?
Imagine a service that reports the latest Apple stock price. You can think of the service as an observable. To get this information, you would have to subscribe to the service. Once subscribed, the service could emit one of three messages:
The next (most recent) stock price
Market closed
Some sort of failure (connection failure would be most typical)
You would in turn write a handler to handle these three types of messages. That handler would be an observer to the observable stream of prices.
From Wikipedia:
The observer pattern is a software design pattern in which an object, called the subject, maintains a list of its dependents, called observers, and notifies them automatically of any state changes, usually by calling one of their methods.
This definition is clear when applied to the events used in user interfaces: you observe button clicks by providing an event handler which the button calls when it is clicked. In this case, the button is an observable, which notifies a number of observers in the form of event handlers.
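For concreteness, the button case in plain Java (Swing):

import javax.swing.JButton;

JButton button = new JButton("Click me");                     // the observable
button.addActionListener(e -> System.out.println("clicked")); // an observer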
Applied to reactive programming, an observable is just a stream of events that you can subscribe to, i.e., observe. Think of it as a pipe through which events traverse and that you can peek into. You do so by observing the stream and handling the events you are interested in. Furthermore, operations can be performed over streams, for instance merging a couple of streams into a new one.
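For instance, merging two streams in RxJava (a minimal sketch, assuming RxJava 3):

import io.reactivex.rxjava3.core.Observable;

Observable<String> a = Observable.just("a1", "a2");
Observable<String> b = Observable.just("b1", "b2");
Observable.merge(a, b).subscribe(System.out::println); // one combined stream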
Both the publishing of events to the stream and the handling of those events by your observer can be done asynchronously, which promotes scalability.
Similar concepts are those of messages, topics, and subscribers: a stakeholder can publish messages to a topic, to which many different stakeholders can subscribe. These correspond, respectively, to events, the observable stream, and observers.
Microsoft uses the terms Observer and Observable, while other reactive frameworks may use different terms. The Getting Started section of Introduction to Rx can help you clarify these concepts further, and the whole book is a free gem. Note that the book prefers the term sequence to refer to a stream of events.
I would imagine that it's just, anything that could be observed in the data stream.
That's right. Actually, in Microsoft's Rx, the core is just the two interfaces defining the contract between observers and observables; the rest is pretty much abstracted away.
I think the terminology varies, but if you search for functional reactive programming papers, e.g. on Google Scholar, you will find definitions of the basic concepts behavior and event. I think the following two definitions from the paper Functional Reactive Programming from First Principles are representative:
A behavior is a value of type a that changes over time.
An event is a time-ordered sequence of event occurrences.
Intuitively, a behavior is a stream transformer: a function that takes an infinite stream of sample times and yields an infinite stream of values. Similarly, an event is a stream transformer, and can be thought of as a behavior where, at each time t, the event either occurs or does not occur.
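In the semantic notation common in these papers, the two definitions are roughly: Behavior a ≈ Time → a (a time-varying value), while Event a ≈ [(Time, a)] (a time-ordered list of occurrences).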
It seems MS fuses both into the concept of an Observable.
I think it is good to read some background papers to get the terminology. The papers by Conal Elliott are a good start. Or you could enroll in the Principles of Reactive Programming course on Coursera if you want a more interactive introduction.
I watched the video and I know the general principles - hot happens even when nobody is subscribed, cold happens "on demand".
Also, Publish() converts cold to hot and Defer() converts hot to cold.
But still, I feel I am missing the details. Here are some questions I'd like to have answered:
Can you give a comprehensive definition for these terms?
Does it ever make sense to call Publish on a hot observable or Defer on a cold?
What are the aspects of Hot/Cold conversions - do you lose messages, for example?
Are there differences between hot and cold definitions for IObservable and IEnumerable?
What are the general principles you should take into account when programming for cold or hot?
Any other tips on hot/cold observables?
From Anton Moiseev's book "Angular Development with TypeScript, Second Edition":
Hot and cold observables
There are two types of observables: hot and cold. The main difference is that a cold observable creates a data producer for each subscriber, whereas a hot observable creates a data producer first, and each subscriber gets the data from one producer, starting from the moment of subscription.
Let's compare watching a movie on Netflix to going into a movie theater. Think of yourself as an observer. Anyone who decides to watch Mission: Impossible on Netflix will get the entire movie, regardless of when they hit the play button. Netflix creates a new producer to stream a movie just for you. This is a cold observable.
If you go to a movie theater and the showtime is 4 p.m., the producer is created at 4 p.m., and the streaming begins. If some people (subscribers) are late to the show, they miss the beginning of the movie and can only watch it starting from the moment of arrival. This is a hot observable.
A cold observable starts producing data when some code invokes a subscribe() function on it. For example, your app may declare an observable providing a URL on the server to get certain products. The request will be made only when you subscribe to it. If another script makes the same request to the server, it'll get the same set of data.
A hot observable produces data even if no subscribers are interested in the data. For example, an accelerometer in your smartphone produces data about the position of your device even if no app subscribes to this data. A server can produce the latest stock prices even if no user is interested in this stock.
Hot observables are ones that push events even when you are not subscribed to the observable, like mouse moves or timer ticks. Cold observables are ones that start pushing only when you subscribe, and they start over if you subscribe again.
I hope this helps.
Can you give a comprehensive definition for these terms?
See my blog post at: https://leecampbell.com/2010/08/19/rx-part-7-hot-and-cold-observables
Does it ever make sense to call Publish on a hot observable or Defer on a cold?
No, not that I can think of.
What are the aspects of Hot/Cold conversions - do you lose messages, for example?
It is possible to "lose" messages when the Observable is Hot, as "events" happen regardless of subscribers.
Are there differences between hot and cold definitions for IObservable and IEnumerable?
I don't really understand the question, but I hope this analogy helps. I would compare a hot observable to an eagerly evaluated IEnumerable: a List or an Array are both eagerly evaluated and have been populated even if no one enumerates over them. An iterator that gets values from a file or a database with the yield keyword is lazily evaluated. While lazy evaluation can be good, it will, by default, be re-evaluated if a second enumerator runs over it. Comparing these to observables: a hot observable might be an event (a button click) or a feed of temperatures; these events happen regardless of subscribers and are shared if multiple subscriptions are made to the same observable. Observable.Interval is a good example of a cold observable: it only starts producing values when a subscription is made, and if multiple subscriptions are made, the sequence is re-evaluated and the "events" occur at separate times (depending on the time between subscriptions).
What are the general principles you should take into account when programming for cold or hot?
Refer to the link in point one. I would also recommend looking into Publish used in conjunction with RefCount. This gives you the lazy-evaluation semantics of cold observables together with the event sharing of hot observables.
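In RxJava the same combination is publish().refCount(); a minimal sketch, assuming RxJava 3:

import java.util.concurrent.TimeUnit;
import io.reactivex.rxjava3.core.Observable;

// interval() is cold; publish().refCount() connects on the first subscription
// and shares that single sequence with every later subscriber.
Observable<Long> shared = Observable.interval(1, TimeUnit.SECONDS)
    .publish()
    .refCount();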
Any other tips on hot/cold observables?
Get your hands dirty and have a play with them. Once you have read about them for more than 30 minutes, time spent coding with them is far more productive than reading any more :)
Not pretending to give a comprehensive answer, I'd like to summarize in the simplest form what I have learned since the time of this question.
A hot observable is an exact match for an event. With events, values are usually fed into the handler even if no subscribers are listening, and all subscribers receive the same set of values. Because they follow the "event" pattern, hot observables are easier to understand than cold ones.
A cold observable is also like an event, but with a twist: a cold observable's event is not a property on a shared instance; it is a property on an object that is produced from a factory each time somebody subscribes. In addition, the subscription starts the production of the values. Because of this, multiple subscribers are isolated, and each receives its own set of values.
The most common mistake Rx beginners make is creating a cold observable (well, thinking they are creating a cold observable) using some state variable within a function (e.g., an accumulated total) and not wrapping it in a .Defer() statement. As a result, multiple subscribers share these variables and cause side effects between each other.
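Sketched in RxJava (defer() being the analogue of .Defer(); assumes RxJava 3), the mistake and its fix look like this:

import io.reactivex.rxjava3.core.Observable;

// Broken: the accumulator is created once, so all subscribers mutate it.
int[] total = {0};
Observable<Integer> sharedState = Observable.range(1, 3).map(i -> total[0] += i);

// Fixed: defer() runs the factory per subscriber, giving each fresh state.
Observable<Integer> isolated = Observable.defer(() -> {
    int[] sum = {0};
    return Observable.range(1, 3).map(i -> sum[0] += i);
});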