Why doesn't Keep.None affect the Akka stream execution result? - scala

I have some simple test code for Akka Streams (written in C#, but the Scala version isn't much different):
var source = Source.From(Enumerable.Range(1, 3));
var flow = Flow.FromFunction(new Func<int, string>(x => (x * 2).ToString()));
var sink = Sink.ForEach<string>(output.Add);
var runnable = source.Via(flow).To(sink);
Since the Via helper method is just a shortcut for ViaMaterialized(flow, Keep.Left), I can rewrite the code like this:
var source = Source.From(Enumerable.Range(1, 3));
var flow = Flow.FromFunction(new Func<int, string>(x => (x * 2).ToString()));
var sink = Sink.ForEach<string>(output.Add);
var runnable = source.ViaMaterialized(flow, Keep.Left).To(sink);
The Keep property (Left, Right, Both or None) tells the stream materializer that it should preserve the value on a specified side of the stream operation. But I notice that if I change Keep.Left to Keep.Right, Keep.Both or even Keep.None, nothing changes in the execution outcome: the sink always receives the output produced by the flow transformation function.
I thought that using a non-None Keep value for Flow stages in a stream graph was necessary to ensure the values get sent to the sink. I must have misunderstood the meaning of this, so my question is: why does a stream flow work even when materialization is disabled for both sides? And can you give an example where changing the Keep value between Left, Right, Both and None affects the values that reach the sink?

You are confusing the fact that a stream gets materialized with the fact that it has a materialized value.
A flow (or more generally a graph) is a blueprint for a stream. When you call the run() method on a runnable graph, a stream is materialized from this blueprint. This stream does whatever is expected of it without any regard for materialized values.
What is a materialized value? When you call run(), a value is returned. That's the materialized value of your stream. Most of the time (for simple built-in stages), the materialized value is unimportant (it's called NotUsed in Scala, I don't know about .NET). A non-trivial example is Sink.ignore, which is materialized as a Future[Done]. It gives you a handle on when the particular stream you have materialized has completely consumed its input (or thrown an error). More generally, the materialized value gives you some circumstantial information on what's going on in your stream (sorry about the vagueness of this statement, but the principle at hand is too general for me to be more explicit).
When building a graph, you put together different pieces that all have a different materialized value. Since you can only have one for your runnable graph, you need to combine them in some way. Keep.{right, left, both, none} are simple functions that combine those values by keeping only one of the values, or both, or none. However, it does not change the fact that both graphs will be materialized, and the values generated, even if you decide not to keep them.

Keep.* functions don't influence the materialization process itself, only what you get out of it.
More specifically, at materialization time (i.e. when run() is called), each and every stage of your stream (in your example, source, flow and sink) will always be materialized - and therefore produce a materialized value under the hood. You can clearly see what that value will be from their last type parameter.
For the user's convenience, as most likely you will not be interested in all of them, you can use Keep.* accordingly to select what to keep around. This directly reflects on the return type of run().
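To make the distinction concrete, here is a minimal toy model in plain Python. It is not the Akka API: run_stream, the keep_* functions, and the two "materialized values" are all hypothetical stand-ins. It only illustrates that the combiner chooses what run() hands back, while the elements reach the sink either way.

```python
# Toy model only: these names mimic the idea of Keep.*, not the Akka API.
def keep_left(left, right):
    return left

def keep_right(left, right):
    return right

def keep_both(left, right):
    return (left, right)

def keep_none(left, right):
    return None

def run_stream(elements, transform, combine):
    """Run every stage, then pick the materialized value with `combine`."""
    received = []                       # what the sink ends up with
    for element in elements:
        received.append(transform(element))
    upstream_mat = "upstream-handle"    # hypothetical materialized value
    sink_mat = len(received)            # hypothetical: number of elements consumed
    return received, combine(upstream_mat, sink_mat)

# Same source, flow and sink; only the combiner changes.
results = {
    combine.__name__: run_stream(range(1, 4), lambda x: str(x * 2), combine)
    for combine in (keep_left, keep_right, keep_both, keep_none)
}
for name, (received, materialized) in results.items():
    print(name, received, materialized)
```

In every case the sink receives the same three strings; only the second component, the value returned to the caller, differs.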

Related

Flink processWindow function emits records with partial information

We are seeing some weird behaviour with a processWindow function emitting two records: the first record contains complete information using aggregated data present in the window, and the second record contains partial information with some information removed from the record.
The processWindow function is using state(MapState) as follows:
override def open(parameters: Configuration): Unit = {
  cfState = getRuntimeContext.getMapState(
    new MapStateDescriptor[(String, Int), mutable.Map[Int, mutable.Set[Int]]](
      "customFieldsState",
      classOf[(String, Int)],
      classOf[mutable.Map[Int, mutable.Set[Int]]]
    )
  )
}
and the process function manipulates the above state using records present in the window.
Is this an anti-pattern? Using state within a processWindow function? Are there any other recommendations to using state within a processWindow function?
We need to maintain state in this case as we don't capture all fields in a single window and we need to aggregate the records per user, hence the use of a window function.
Thanks
If you want to maintain state beyond the lifetime of a single window instance, you should use
KeyedStateStore ProcessWindowFunction.Context#globalState
All other state is cleared when the window is closed.
Since globalState is never cleared by Flink, you should set state TTL on the state descriptor you use if you will have keys that go stale, in order to avoid leaking state over time.

Converting paging function to a Flow

I have a large quantity of sqlite databases, represented as Source[File, NotUsed]. For each db, I want to paginate through the results. Memory limits mean I cannot do this eagerly. Say that the result type is Foo, then I'm trying to figure out how to create a Flow[File, Foo, NotUsed] that internally uses a lazy, recursive call on the resource.
I see that the Source.unfold method allows me to do this, but it can only create a Source, which means I can't feed it the necessary input of File. I can't see how to convert a Source to a Flow (except via fromSinkAndSource, but that doesn't pipe the values through). I'm not sure if this path of inquiry will yield anything.
It was suggested to me that I should use the GraphDSL and Merge, but I'm stuck trying to understand how many input ports the Merge should have and how I would actually wire it together.
I think you're looking for the flatMapConcat operator:
Signature
def flatMapConcat[T, M](f: Out ⇒ Graph[SourceShape[T], M]): Repr[T]
Description
Transform each input element into a Source whose elements are then flattened into the output stream through concatenation. This means each source is fully consumed before consumption of the next source starts.
emits when the current consumed substream has an element available
backpressures when downstream backpressures
completes when upstream completes and all consumed substreams complete
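As a rough, library-free analogy of how flatMapConcat combines with an unfold-style pager, the following Python sketch uses generators. Everything in it is an assumption made for illustration: FAKE_DBS stands in for the sqlite files, fetch_page for a paged query, and unfold and flat_map_concat only mimic the semantics of the Akka operators of similar names.

```python
from itertools import islice

PAGE_SIZE = 2
FAKE_DBS = {"a.db": [1, 2, 3, 4, 5], "b.db": [6, 7]}  # hypothetical data
fetch_log = []  # records every page request, to show that paging is lazy

def fetch_page(db, offset):
    """Hypothetical paged query: returns at most PAGE_SIZE rows."""
    fetch_log.append((db, offset))
    return FAKE_DBS[db][offset:offset + PAGE_SIZE]

def unfold(state, step):
    """Yield elements until `step` returns None (in the spirit of Source.unfold)."""
    while True:
        result = step(state)
        if result is None:
            return
        state, element = result
        yield element

def rows(db):
    """All rows of one database, fetched one page at a time."""
    def step(offset):
        page = fetch_page(db, offset)
        return None if not page else (offset + len(page), page)
    for page in unfold(0, step):
        yield from page

def flat_map_concat(upstream, to_source):
    """Fully consume the sub-stream of each element before moving to the next."""
    for item in upstream:
        yield from to_source(item)

first_three = list(islice(flat_map_concat(["a.db", "b.db"], rows), 3))
calls_for_first_three = len(fetch_log)  # only the pages actually needed
all_rows = list(flat_map_concat(["a.db", "b.db"], rows))
```

Because each database is turned into its own lazily paged sub-stream, only as many pages are fetched as the downstream consumer asks for.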

Parallel design of program working with Flink and scala

This is the context:
There is an input event stream,
There are some methods to apply to the stream, each of which applies different logic to evaluate each event and decide whether it is a "good" or "bad" event.
An event can be a real "good" one only if it passes all the methods, otherwise it is a "bad" event.
There is an output event stream that carries the result for each event together with its eventID.
To solve this problem, I have two ideas:
We can apply each method sequentially to each event. But this is a kind of batch processing and doesn't take advantage of stream processing; at the same time, it takes Time(M(ethod)1) + Time(M2) + Time(M3) + ..., which may not be suitable for real-time processing.
We can pass the input stream to each method and run the methods in parallel. Each method saves the bad events into permanent storage, and the Main method can then query the permanent storage to get the result for each event. But this raises some problems:
How do we execute methods in parallel in the programming language (e.g. Scala), and what about the performance (network, CPUs, memory)?
How do we solve the synchronization problem? The methods need some time to calculate and save the flag into permanent storage, but Main needs less time to query the flag, so a delay issue occurs.
etc.
This is a general technology and design question, and I would like to hear your ideas: do you have any new approaches, or ideas for solving the problems above? Looking forward to your opinions.
Parallel streams, each doing the full set of evaluations sequentially, is the more straightforward solution. But if that introduces too much latency, then you can fan out the evaluations to be done in parallel, and then bring the results back together again to make a decision.
To do the fan-out, look at the split operation on DataStream, or use side outputs. But before doing this n-way fan-out, make sure that each event has a unique ID. If necessary, add a field containing a random number to each event to use as the unique ID. Later we will use this unique ID as a key to gather back together all of the partial results for each event.
Once the event stream is split, each copy of the stream can use a MapFunction to compute one of the evaluation methods.
Gathering all of these separate evaluations of a given event back together is a bit more complex. One reasonable approach here is to union all of the result streams together, and then key the unioned stream by the unique ID described above. This will bring together all of the individual results for each event. Then you can use a RichFlatMapFunction (using Flink's keyed, managed state) to gather the results for the separate evaluations in one place. Once the full set of evaluations for a given event has arrived at this stateful flatmap operator, it can compute and emit the final result.
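The gather step described above can be sketched without Flink in a few lines of Python. This is only a simulation under stated assumptions: the checks, the events, and the Gatherer class are hypothetical, and the dictionary plays the role that Flink's keyed, managed state would play in a RichFlatMapFunction.

```python
from collections import defaultdict

# Hypothetical evaluation methods; each one would be its own MapFunction.
CHECKS = {
    "positive": lambda value: value > 0,
    "small": lambda value: value < 100,
}

def fan_out(events, checks):
    """Simulate the unioned result streams: one partial result per (event, check)."""
    for name, check in checks.items():
        for event_id, value in events:
            yield (event_id, name, check(value))

class Gatherer:
    """Stand-in for a stateful flatMap keyed by the unique event ID."""

    def __init__(self, expected):
        self.expected = expected          # number of evaluations per event
        self.pending = defaultdict(dict)  # event_id -> {check name: passed}

    def on_result(self, event_id, name, passed):
        partial = self.pending[event_id]
        partial[name] = passed
        if len(partial) < self.expected:
            return []                     # still waiting for other branches
        del self.pending[event_id]        # clear state once the event is decided
        verdict = "good" if all(partial.values()) else "bad"
        return [(event_id, verdict)]

events = [(1, 5), (2, -3), (3, 200)]      # (unique ID, payload)
gatherer = Gatherer(expected=len(CHECKS))
final = []
for event_id, name, passed in fan_out(events, CHECKS):
    final.extend(gatherer.on_result(event_id, name, passed))
```

An event is emitted only once every evaluation for its ID has arrived, and its buffered partial results are dropped at that point so the state does not grow without bound.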

What do the type parameters to Source<Out,Mat> mean?

I'm trying to understand the Source type for Akka streams, specified here.
Unfortunately, the documentation and examples I've found don't explain what each of the type parameters actually means. I'm guessing that Out is the type that the source emits when materialized. Is that correct? What is the other type parameter, Mat?
Out
You are correct, this is the type of the elements which are emitted by the Source.
Mat
It is the type of the Source's materialisation. Note that every stage (Flows, Sinks, etc.) will materialise to a value as well.
This is essentially a byproduct of the stage itself after it is run.
You can picture it as a means of interacting with the stage while it's running. Looking at examples of ready-made Sources offered by Akka is a good way of getting the gist of it.
Source.single will materialise to NotUsed. You have no means of interacting with the source, as it will produce only one element straightaway and then complete.
Source.queue will materialise to a SourceQueue. This is a more interesting case, as you can interact with the source by offering messages to it. The messages you offer will be emitted by the source.
Source.maybe will materialise to a Promise. You can use the Promise to control the source: complete it with an element to make the source emit that element, or with None to make it complete without emitting.
When you concatenate different stages, note that every stage can potentially have a useful materialized value. You get to choose which ones to keep by using viaMat/toMat and Keep DSL.
One or more materialized values will be returned when run() is called on the composed graph.
Taking a look at the types in the examples below is the best way to get the gist of the API:
val source: Source[Int, MatSrc]
val sink: Sink[Int, MatSnk]
val matSrc: MatSrc = source.toMat(sink)(Keep.left).run()
val matSnk: MatSnk = source.toMat(sink)(Keep.right).run()
val (m1: MatSrc, m2: MatSnk) = source.toMat(sink)(Keep.both).run()
val n: NotUsed = source.toMat(sink)(Keep.none).run()
Note that the more succinct DSL which you can find in many examples is actually a shortcut for the above, where only the materialized value of the last stage (e.g. the sink) is kept.
val mat3: Mat3 = source.viaMat(flow)(Keep.right).toMat(sink)(Keep.right).run()
is the same as
val mat3: Mat3 = source.via(flow).runWith(sink)
See the docs below for further reading.
http://doc.akka.io/docs/akka/2.4/java/stream/stream-quickstart.html#Materialized_values

Combining Observables when both change simultaneously

I am trying to integrate ReactiveX into my GUI using RxPY. This is a general ReactiveX question.
Say I have a visualization that depends on multiple Observable streams using combine_latest(stream1, stream2, plot_function). This works great when one Observable changes, like when the user modifies a value; the visualization updates using the new values. However, sometimes both Observables are updated simultaneously, like when the user loads data for both streams from a single file. Technically, one Observable will be updated before the other (whichever comes first in the import function). As a result, the plot will be updated twice, but for all intents and purposes it need only be updated once.
Some of the visualizations I have are expensive to compute, so I want to make sure that if both streams are updated simultaneously, then the combined stream only emits one value. I can think of a few options:
Use debounce() with a small timeout (like 50ms) on the combined stream. This approach seems dirty to me.
Don't use combine_latest directly. Wrap the two streams in a new object that also has some sort of updating flag. If I set the updating flag to True, then don't emit anything until I set the updating flag to False. This approach feels too stateful, and it ruins the composability of the streams.
Tell all visualizations not to update until all the streams are updated. Again, this breaks encapsulation because the visualization shouldn't care what is happening upstream. It should just receive the new values from the combined stream and make a pretty picture.
Make the visualizations fine-grained enough that one stream updating first only introduces a small performance penalty. This is impossible for some visualizations, like a visualization that computes a mesh based on points and a mesh size. If the points or the mesh size changes, then the whole mesh needs to be recomputed.
Is there some facility in Rx to handle "simultaneously" updating multiple streams? I feel like what I'm asking for is magic.
For anyone who has made GUI programs using Rx: is there some better architecture I should be using for models besides sending new values through streams?
If this question is unclear, please tell me in a comment and I will try to make a more concrete example.
Example
Here is a sample Python RxPY program:
import rx
stream1 = rx.subjects.BehaviorSubject(1)
stream2 = rx.subjects.BehaviorSubject(2)
rx.Observable\
.combine_latest(stream1, stream2, lambda x, y: (x, y))\
.subscribe(print)
stream1.on_next(3)
stream2.on_next(4)
This prints:
(1, 2)
(3, 2)
(3, 4)
How could I update the values of stream1 and stream2 simultaneously so that the following becomes the result?
(1, 2)
(3, 4)
In other words, how could I modify combine_latest in such a way that I can tell it downstream "hey, wait a second while I update the other streams before you emit your next value"?
I have found one possible answer, but it is not the best and I would like others.
I discovered the pausable combinator. By passing in a stream that emits True or False, you can control if a sequence will be paused. Here is a modification of my example:
import rx
stream1 = rx.subjects.BehaviorSubject(1)
stream2 = rx.subjects.BehaviorSubject(2)
pauser = rx.subjects.BehaviorSubject(True)
rx.Observable\
.combine_latest(stream1, stream2, lambda x, y: (x, y))\
.pausable(pauser)\
.subscribe(print)
# Begin updating simultaneously
pauser.on_next(False)
stream1.on_next(3)
stream2.on_next(4)
# Update done, resume combined stream
pauser.on_next(True)
# Prints:
# (1, 2)
# (3, 4)
To apply to my GUI, I can create a BehaviorSubject called updating in my model that emits whether or not the whole model is being updated. For example, if stream1 and stream2 are being simultaneously updated, then I can set updating to True. On any visualizations that are expensive to produce, I can apply the value of updating to pause the combined stream.
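For readers without RxPY at hand, the batching behaviour that the pauser achieves can be imitated in plain Python. The sketch below is not RxPY and does not reproduce how pausable is implemented; CombinedLatest and its batch() context manager are hypothetical names, used only to show "suppress intermediate combinations, emit once when the update is done".

```python
from contextlib import contextmanager

class CombinedLatest:
    """Holds the latest value of each input and reports combinations to a callback."""

    def __init__(self, initial, callback):
        self._values = list(initial)
        self._callback = callback
        self._paused = False
        self._dirty = False
        self._emit()                      # like BehaviorSubjects, emit the initial pair

    def _emit(self):
        self._callback(tuple(self._values))

    def set(self, index, value):
        self._values[index] = value
        if self._paused:
            self._dirty = True            # remember that something changed
        else:
            self._emit()

    @contextmanager
    def batch(self):
        """Suppress emissions inside the block, then emit once if anything changed."""
        self._paused = True
        try:
            yield self
        finally:
            self._paused = False
            if self._dirty:
                self._dirty = False
                self._emit()

seen = []
combined = CombinedLatest([1, 2], seen.append)
with combined.batch():                    # "updating" both inputs together
    combined.set(0, 3)
    combined.set(1, 4)
after_batch = list(seen)
combined.set(0, 5)                        # a single update still emits immediately
```

After the batch, seen holds (1, 2) and (3, 4) with no intermediate (3, 2); the later single update appends (5, 4) right away.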
This works in Rx for C#:
var throttled = source.Publish(hot => hot.Buffer(() => hot.Throttle(dueTime)));
The dueTime value here is a TimeSpan in .NET. It specifies the window of inactivity that must elapse before a value is produced. This basically gobbles up values produced "simultaneously" within a margin of time.
The source in this case would be your .combine_latest(...) observable.