I am implementing an event-sourcing application that handles a large number of original and derived data points. In short, we have a PersistentActor functioning as an Aggregate Root accepting commands:
UpdateValue(name, value, timestamp)
UpdateValue(name, value, timestamp)
UpdateValue(name, value, timestamp)
After these commands are verified, they produce events which are persisted and update the state:
ValueUpdated(name, value, timestamp)
ValueUpdated(name, value, timestamp)
ValueUpdated(name, value, timestamp)
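For reference, a boiled-down sketch of what the aggregate root might look like (the field types, and the use of Double for the value, are illustrative rather than the real model):

import akka.persistence.PersistentActor

case class UpdateValue(name: String, value: Double, timestamp: Long)
case class ValueUpdated(name: String, value: Double, timestamp: Long)

class ValueAggregate extends PersistentActor {
  override def persistenceId: String = "value-aggregate"

  private var state = Map.empty[String, Double]

  override def receiveCommand: Receive = {
    case UpdateValue(name, value, ts) =>
      // validate the command, then persist and apply the resulting event
      persist(ValueUpdated(name, value, ts)) { event =>
        state += event.name -> event.value
      }
  }

  override def receiveRecover: Receive = {
    case event: ValueUpdated => state += event.name -> event.value
  }
}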
In a PersistentView we listen to these events and calculate derived values:
case v @ ValueUpdated("value_i_care_about", _, _) => calculate_derived_values(v)
case v @ ValueUpdated("another_value_i_care_about", _, _) => calculate_derived_values(v)
But this recalculated value is itself a new value on which other views could depend, which means we have to send a new command back to the aggregate root to process it; that command can in turn produce events that are picked up by this or other views.
Is it acceptable for a view to produce events or commands? I would think a view's responsibility is to update a state based on events, not produce events or commands. Also, the order in which events arrive can influence the new events being broadcast during replay.
Is it necessary to produce commands instead of events? Since the original command updated the initial value, you could argue that the derived values are simply events resulting from that command being processed, although they are generated in a distributed fashion and not directly by the aggregate root.
I've been looking at Akka Streams (Akka's Reactive Streams implementation), which could be used to string these actors together, and I have also looked at the idea of Sagas as presented here: http://blog.jonathanoliver.com/cqrs-sagas-with-event-sourcing-part-i-of-ii/. In that post Jonathan mentions:
Sagas listen to events and dispatch commands while aggregates receive commands and publish events.
That seems like a sensible approach as well: implement all these actors as FSMs that wait 5 seconds for related events, recalculate everything, dispatch a command, wait 5 seconds for events, and so on.
To make things a little more interesting, the streams of values can be out of order and incomplete, but derived values should still be produced at each point in time. So if I receive values A & B:
A1, B1, B2, A2, B3, A4, B4
it should produce derived values D:
D1 (A1 * B1), D2 (B2 * A2), D3 (B3 * A2, there is no A3), D4 (A4 * B4)
This means I have to keep track of order, and sometimes reissue a derived value if a missing value comes in.
Thanks!
An alternative is to persist both events at once in your aggregate, but this assumes the aggregate has the information available to do so. If the second calculation relies on a bunch of query-side data, then of course this isn't an option.
But this truly sounds outside the realm of the aggregate if you are doing it simply because other subsystems are interested. You can just publish a non-persistent event to the event stream and have other views/listeners subscribe.
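For illustration, a minimal sketch of that non-persistent alternative, assuming Akka's built-in system event stream; ValueUpdated mirrors the question, while DerivedValueCalculated and the placeholder calculation are hypothetical stand-ins:

import akka.actor.Actor

case class ValueUpdated(name: String, value: Double, timestamp: Long)
case class DerivedValueCalculated(name: String, value: Double, timestamp: Long)

// the view that calculates the derived value broadcasts it without persisting it
class DerivedValuePublisher extends Actor {
  def receive: Receive = {
    case v @ ValueUpdated("value_i_care_about", _, _) =>
      val derived = DerivedValueCalculated(s"derived_${v.name}", v.value * 2, v.timestamp) // placeholder calculation
      context.system.eventStream.publish(derived)
  }
}

// any other view/listener interested in derived values simply subscribes
class DerivedValueConsumer extends Actor {
  override def preStart(): Unit =
    context.system.eventStream.subscribe(self, classOf[DerivedValueCalculated])

  def receive: Receive = {
    case d: DerivedValueCalculated => () // update this view's state here
  }
}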
The Confluent documentation here states:
And Kafka exploits this duality in many ways: for example, to make your applications elastic, to support fault-tolerant stateful processing, or to run Kafka Streams Interactive Queries against your application’s latest processing results.
I wonder if there are more details on how the duality of streams/tables is used in these scenarios. I'm looking for a simple explanation rather than long design docs.
A stream can be considered a log, and a table can be considered a snapshot of that log at a given instant in time.
A stream is a flow of data: new data keeps coming, we process it as it arrives, and we store the processed results in a table for querying.
A table's data changes over time. At any given instant, we get a snapshot of the data as it is at that instant. A table, therefore, can be used for performing queries and retrieving results on demand, which is not the case with 'just' streams.
For example,
User comments on a video can be a stream of events: new comments keep coming and they simply get displayed in the UI. Nothing to query here (typically).
But there are also other use cases, like:
Cricket updates: for every new ball, we get the number of runs for that ball, and we need to add them to the score. We certainly need to store the previous score and update it with every new ball. We also need to query the score at any given instant (on demand). For performing queries or updating the score, we can use a table.
In the Kafka context, an event is a log message, and every message is immutable.
Consider an example of a user's information getting updated.
{user_id: 101, name: X}
{user_id: 101, name: Y}
The name of user_id=101 is updated from X to Y. When you perform the update directly on a database and then query it, you see only name: Y; you no longer have the user's previous name, because it has been overwritten with the new value.
In Kafka, we have two messages, 'X' and 'Y'.
At times, this may be useful and even critical. A hacker could have changed all the user's information, leaving the legitimate user with no way of proving his identity to reclaim his account. But if there is previous information about his account that he can present as proof, he can reclaim it.
So for those who use Kafka, there can be use cases for storing data as a table (or a map) and then retrieving it with queries.
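To make the duality concrete, here is a tiny sketch using the Kafka Streams Scala DSL (the topic name is an assumption, and the serde import path varies slightly across Kafka versions): the same changelog can be consumed as a stream of every update, or rolled up into a table that holds only the latest value per key.

import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.Serdes._

val builder = new StreamsBuilder()

// stream view: every change event for a user ({name: X}, then {name: Y})
val updates = builder.stream[String, String]("user-updates")

// table view: only the latest value per user_id ({name: Y}), queryable on demand
val latest = updates.groupByKey.reduce((_, newValue) => newValue)

// and back again: the table's changelog is itself a stream
val changelog = latest.toStream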
I have a large number of SQLite databases, represented as a Source[File, NotUsed]. For each db, I want to paginate through the results. Memory limits mean I cannot do this eagerly. Say the result type is Foo; I'm trying to figure out how to create a Flow[File, Foo, NotUsed] that internally uses a lazy, recursive call on the resource.
I see that the Source.unfold method allows me to do this, but it can only create a Source, which means I can't feed it the necessary input of File. I can't see how to convert a Source to a Flow (except via fromSinkAndSource, but that doesn't pipe the values through). I'm not sure if this path of inquiry will yield anything.
It was suggested to me that I should use the GraphDSL and Merge, but I'm stuck trying to understand how many input ports the Merge should have and how I would actually wire it together.
I think you're looking for the flatMapConcat operator:
Signature
def flatMapConcat[T, M](f: Out ⇒ Graph[SourceShape[T], M]): Repr[T]
Description
Transform each input element into a Source whose elements are then flattened into the output stream through concatenation. This means each source is fully consumed before consumption of the next source starts.
emits when the current consumed substream has an element available
backpressures when downstream backpressures
completes when upstream completes and all consumed substreams complete
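Here is a rough sketch of how that could look for the File-to-Foo case, pairing flatMapConcat with Source.unfoldResource so each database is opened lazily and read one element at a time; DbHandle, openDb and nextRow are hypothetical stand-ins for your sqlite access code:

import java.io.File
import akka.NotUsed
import akka.stream.scaladsl.{Flow, Source}

final case class Foo(row: String)                 // assumed result type

trait DbHandle {
  def nextRow(): Option[Foo]                      // None once the db is exhausted
  def close(): Unit
}
def openDb(file: File): DbHandle = ???            // hypothetical sqlite accessor

val fileToFoos: Flow[File, Foo, NotUsed] =
  Flow[File].flatMapConcat { file =>
    // each File becomes a lazily evaluated Source that is fully consumed
    // (and then closed) before the next File is touched
    Source.unfoldResource[Foo, DbHandle](
      create = () => openDb(file),
      read   = handle => handle.nextRow(),
      close  = handle => handle.close()
    )
  }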
This is the context:
There is an input event stream.
There are some methods to apply to the stream; each applies different logic to evaluate each event, marking it as a "good" or "bad" event.
An event can be a truly "good" one only if it passes all the methods; otherwise it is a "bad" event.
There is an output event stream that carries the result for each event along with its eventID.
To solve this problem, I have two ideas:
We can apply each method sequentially to each event. But this is a kind of batch processing and doesn't exploit the advantages of stream processing; at the same time it takes Time(Method1) + Time(Method2) + Time(Method3) + ..., which may not be suitable for real-time processing.
We can pass the input stream to each method and run the methods in parallel; each method saves bad events into permanent storage, and then the Main method queries that storage to get the result for each event. But this has some problems to solve:
How to execute the methods in parallel in the programming language (e.g. Scala), and what the performance impact would be (network, CPUs, memory).
How to solve the synchronization problem: those methods need some time to calculate and save their flags into permanent storage, but the Main method needs much less time to query the flags, so a delay issue occurs.
etc.
This is not so much a specific technical question as a design question; I would like to hear your ideas for solving the problem. Looking forward to your opinions.
Parallel streams, each doing the full set of evaluations sequentially, is the more straightforward solution. But if that introduces too much latency, then you can fan out the evaluations to be done in parallel, and then bring the results back together again to make a decision.
To do the fan-out, look at the split operation on DataStream, or use side outputs. But before doing this n-way fan-out, make sure that each event has a unique ID. If necessary, add a field containing a random number to each event to use as the unique ID. Later we will use this unique ID as a key to gather back together all of the partial results for each event.
Once the event stream is split, each copy of the stream can use a MapFunction to compute one of the evaluation methods.
Gathering all of these separate evaluations of a given event back together is a bit more complex. One reasonable approach here is to union all of the result streams together, and then key the unioned stream by the unique ID described above. This will bring together all of the individual results for each event. Then you can use a RichFlatMapFunction (using Flink's keyed, managed state) to gather the results for the separate evaluations in one place. Once the full set of evaluations for a given event has arrived at this stateful flatmap operator, it can compute and emit the final result.
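A hedged sketch of that fan-out / fan-in wiring in Flink's Scala API follows; Event, Evaluation, the placeholder evaluation logic and numMethods = 3 are all assumptions, and the fan-out here is done simply by applying three separate map operators to the same stream:

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

case class Event(id: String, payload: String)           // assumes each event carries a unique id
case class Evaluation(id: String, good: Boolean)

val numMethods = 3
val env = StreamExecutionEnvironment.getExecutionEnvironment
val events: DataStream[Event] = ???                      // your input stream

// fan out: each copy of the stream runs one evaluation method (placeholder logic)
val eval1 = events.map(e => Evaluation(e.id, good = e.payload.nonEmpty))
val eval2 = events.map(e => Evaluation(e.id, good = e.payload.length < 100))
val eval3 = events.map(e => Evaluation(e.id, good = !e.payload.contains("error")))

// fan in: union the partial results, key by the event id, and gather them with keyed state
val verdicts: DataStream[(String, Boolean)] = eval1.union(eval2, eval3)
  .keyBy(_.id)
  .flatMap(new RichFlatMapFunction[Evaluation, (String, Boolean)] {
    @transient private var acc: ValueState[(Int, Int)] = _   // (evaluations seen, bad count)

    override def open(parameters: Configuration): Unit =
      acc = getRuntimeContext.getState(
        new ValueStateDescriptor("acc", classOf[(Int, Int)]))

    override def flatMap(e: Evaluation, out: Collector[(String, Boolean)]): Unit = {
      val (seen, bad) = Option(acc.value()).getOrElse((0, 0))
      val updated = (seen + 1, if (e.good) bad else bad + 1)
      if (updated._1 == numMethods) {                         // all evaluations have arrived
        out.collect((e.id, updated._2 == 0))                  // good only if no method flagged it
        acc.clear()
      } else acc.update(updated)
    }
  })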
I have some simple test code for Akka Streams (written in C#, but the Scala version isn't much different):
var source = Source.From(Enumerable.Range(1, 3));
var flow = Flow.FromFunction(new Func<int, string>(x => (x * 2).ToString()));
var sink = Sink.ForEach<string>(output.Add);
var runnable = source.Via(flow).To(sink);
Since the Via helper method is just a shortcut for ViaMaterialized(flow, Keep.Left), I can rewrite the code like this:
var source = Source.From(Enumerable.Range(1, 3));
var flow = Flow.FromFunction(new Func<int, string>(x => (x * 2).ToString()));
var sink = Sink.ForEach<string>(output.Add);
var runnable = source.ViaMaterialized(flow, Keep.Left).To(sink);
The Keep property (Left, Right, Both or None) tells the stream materializer that it should preserve the value on a specified side of the stream operation. But I notice that if I change Keep.Left to Keep.Right, Keep.Both or even Keep.None, nothing changes in the execution outcome: the sink always receives the output according to the flow transformation function.
I thought that using a non-None Keep value for Flow stages in a stream graph was necessary to ensure the values get sent to the sink. I must have misunderstood its meaning, so my question is: why does a stream flow work even when materialization is 'disabled' for both sides? And can you give an example where changing the Keep value between Left, Right, Both and None affects the values that reach the sink?
You are confusing the fact that a stream gets materialized with the fact that it has a materialized value.
A flow (or more generally a graph) is a blueprint for a stream. When you use the run() method on a runnable graph, a stream is materialized using this blueprint. This stream does whatever is expected of it without any regard for materialized values.
What is a materialized value? When you use the run() method, a value is returned. That's the materialized value for your stream. Most of the time (for simple built-in stages), the materialized value is unimportant (it's called NotUsed in Scala; I don't know about .NET). A non-trivial example is Sink.ignore, which is materialized as a Future[Done]. It gives you a handle on when the particular stream you have materialized will have completely consumed its input (or thrown an error). More generally, the materialized value gives you some circumstantial information on what's going on in your stream (sorry about the vagueness of this statement, but the principle at hand is too general for me to be more explicit).
When building a graph, you put together different pieces that each have their own materialized value. Since you can only have one for your runnable graph, you need to combine them in some way. Keep.{right, left, both, none} are simple functions that combine those values by keeping only one of them, or both, or none. However, this does not change the fact that all the pieces will be materialized and their values generated, even if you decide not to keep them.
Keep.* functions don't influence the materialization process itself, only what you get out of it.
More specifically, at materialization time (i.e. when run() is called), each and every stage of your stream (in your example, source, flow and sink) will always be materialized - and therefore produce a materialized value under the hood. You can clearly see what that value will be from their last type parameter.
For the user's convenience, since most likely you will not be interested in all of them, you can use Keep.* to select which ones to keep around. This is directly reflected in the return type of run().
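To make this concrete, here is a small Scala sketch (assuming Akka 2.6+, where the implicit ActorSystem provides the materializer). Both variants print exactly the same elements; only the value returned by run() differs:

import akka.{Done, NotUsed}
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Keep, Sink, Source}
import scala.concurrent.Future

implicit val system: ActorSystem = ActorSystem("keep-demo")

val source = Source(1 to 3).map(x => (x * 2).toString)
val sink   = Sink.foreach[String](println)        // materializes as a Future[Done]

// Keep.left: run() returns the source's materialized value (NotUsed)
val left: NotUsed = source.toMat(sink)(Keep.left).run()

// Keep.right: run() returns the sink's Future[Done], which completes when the stream finishes
val right: Future[Done] = source.toMat(sink)(Keep.right).run()

// either way, "2", "4" and "6" are printed: the stream itself is unaffected by Keep.*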
I am trying to integrate ReactiveX into my GUI using RxPY. This is a general ReactiveX question.
Say I have a visualization that depends on multiple Observable streams using combine_latest(stream1, stream2, plot_function). This works great when one Observable changes, like when the user modifies a value; the visualization updates using the new values. However, sometimes both Observables are updated simultaneously, like when the user loads data for both streams from a single file. Technically, one Observable will be updated before the other (whichever comes first in the import function). As a result, the plot will be updated twice, but for all intents and purposes it need only be updated once.
Some of the visualizations I have are expensive to compute, so I want to make sure that if both streams are updated simultaneously, then the combined stream only emits one value. I can think of a few options:
Use debounce() with a small timeout (like 50ms) on the combined stream. This approach seems dirty to me.
Don't use combine_latest directly. Wrap the two streams in a new object that also has some sort of updating flag. If I set the updating flag to True, then don't emit anything until I set the updating flag to False. This approach feels too stateful, and it ruins the composability of the streams.
Tell all visualizations not to update until all the streams are updated. Again, this breaks encapsulation because the visualization shouldn't care what is happening upstream. It should just receive the new values from the combined stream and make a pretty picture.
Make the visualizations fine-grained enough that one stream updating first only introduces a small performance penalty. This is impossible for some visualizations, like a visualization that computes a mesh based on points and a mesh size. If the points or the mesh size changes, then the whole mesh needs to be recomputed.
Is there some facility in Rx to handle "simultaneously" updating multiple streams? I feel like what I'm asking for is magic.
For anyone who has made GUI programs using Rx: is there some better architecture I should be using for models besides sending new values through streams?
If this question is unclear, please tell me in a comment and I will try to make a more concrete example.
Example
Here is a sample Python RxPY program:
import rx
stream1 = rx.subjects.BehaviorSubject(1)
stream2 = rx.subjects.BehaviorSubject(2)
rx.Observable\
    .combine_latest(stream1, stream2, lambda x, y: (x, y))\
    .subscribe(print)
stream1.on_next(3)
stream2.on_next(4)
This prints:
(1, 2)
(3, 2)
(3, 4)
How could I update the values of stream1 and stream2 simultaneously so that the following becomes the result?
(1, 2)
(3, 4)
In other words, how could I modify combine_latest in such a way that I can tell it downstream "hey, wait a second while I update the other streams before you emit your next value"?
I have found one possible answer, but it is not the best and I would like others.
I discovered the pausable combinator. By passing in a stream that emits True or False, you can control whether a sequence is paused. Here is a modification of my example:
import rx
stream1 = rx.subjects.BehaviorSubject(1)
stream2 = rx.subjects.BehaviorSubject(2)
pauser = rx.subjects.BehaviorSubject(True)
rx.Observable\
    .combine_latest(stream1, stream2, lambda x, y: (x, y))\
    .pausable(pauser)\
    .subscribe(print)
# Begin updating simultaneously
pauser.on_next(False)
stream1.on_next(3)
stream2.on_next(4)
# Update done, resume combined stream
pauser.on_next(True)
# Prints:
# (1, 2)
# (3, 4)
To apply to my GUI, I can create a BehaviorSubject called updating in my model that emits whether or not the whole model is being updated. For example, if stream1 and stream2 are being simultaneously updated, then I can set updating to True. On any visualizations that are expensive to produce, I can apply the value of updating to pause the combined stream.
This works in Rx for C#:
var throttled = source.Publish(hot => hot.Buffer(() => hot.Throttle(dueTime)));
The dueTime value here is a TimeSpan in .NET. It merely specifies the window of inactivity you want before a value is produced. This basically gobbles up values produced "simultaneously" within that margin of time.
The source in this case would be your .combine_latest(...) observable.