I have a large number of SQLite databases, represented as a Source[File, NotUsed]. For each db, I want to paginate through the results. Memory limits mean I cannot do this eagerly. Say the result type is Foo; I'm trying to figure out how to create a Flow[File, Foo, NotUsed] that internally uses a lazy, recursive call on the resource.
I see that the Source.unfold method allows me to do this, but it can only create a Source, which means I can't feed it the necessary input of File. I can't see how to convert a Source to a Flow (except via fromSinkAndSource, but that doesn't pipe the values through). I'm not sure if this path of inquiry will yield anything.
It was suggested to me that I should use the GraphDSL and Merge, but I'm stuck trying to understand how many input ports the Merge should have and how I would actually wire it together.
I think you're looking for the flatMapConcat operator:
Signature
def flatMapConcat[T, M](f: Out ⇒ Graph[SourceShape[T], M]): Repr[T]
Description
Transform each input element into a Source whose elements are then flattened into the output stream through concatenation. This means each source is fully consumed before consumption of the next source starts.
emits when the current consumed substream has an element available
backpressures when downstream backpressures
completes when upstream completes and all consumed substreams complete
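A minimal sketch of how this can look for your setup, assuming a hypothetical readPage helper that reads one page of Foos from a SQLite file and returns an empty list once past the last row (Foo here is just a placeholder case class):

import java.io.File
import akka.NotUsed
import akka.stream.scaladsl.{Flow, Source}

case class Foo(id: Long) // placeholder for the question's result type

// Hypothetical pagination helper (an assumption, not an existing API):
def readPage(db: File, offset: Long, pageSize: Int): List[Foo] = ???

val pageSize = 100

// For each File, Source.unfold lazily produces one page per pull; flatMapConcat
// flattens the per-file sources into a single Flow[File, Foo, NotUsed],
// fully draining one database before starting the next.
val dbToFoos: Flow[File, Foo, NotUsed] =
  Flow[File].flatMapConcat { db =>
    Source
      .unfold(0L) { offset =>
        val page = readPage(db, offset, pageSize)
        if (page.isEmpty) None
        else Some((offset + pageSize, page))
      }
      .mapConcat(identity)
  }

This keeps the pagination lazy and demand-driven: a page is only read when downstream pulls for more elements.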
The Streams DSL documentation includes a caveat about using the aggregate method to transform a KGroupedTable → KTable, as follows (emphasis mine):
When subsequent non-null values are received for a key (e.g., UPDATE), then (1) the subtractor is called with the old value as stored in the table and (2) the adder is called with the new value of the input record that was just received. The order of execution for the subtractor and adder is not defined.
My interpretation of that last line is that one of three things can happen:
subtractor can be called before adder
adder can be called before subtractor
adder and subtractor could be called at the same time
Here is the question I'm looking to get answered:
Are all 3 scenarios above actually possible when using the aggregate method on a KGroupedTable?
Or am I misinterpreting the documentation? For my use-case (detailed below), it would be ideal if the subtractor were always called before the adder.
Why is this question important?
If the adder and subtractor are non-commutative operations and the order in which they are executed can vary, you can end up with different results depending on the order of execution of adder and subtractor. An example of a useful non-commutative operation would be aggregating records into a Set:
.aggregate[Set[Animal]](Set.empty)(
  adder = (zooKey, animalValue, setOfAnimals) => setOfAnimals + animalValue,
  subtractor = (zooKey, animalValue, setOfAnimals) => setOfAnimals - animalValue
)
In this example, for duplicated events, if the adder is called before the subtractor you would end up removing the value entirely from the set (which would be problematic for most use-cases I imagine).
Why am I doubting the documentation (assuming my interpretation of it is correct)?
Seems like an unusual design choice
When I've run unit tests (using TopologyTestDriver and EmbeddedKafka), I always see the subtractor is called before the adder. Unfortunately, if there is some kind of race condition involved, it's entirely possible that I would never hit the other scenarios.
I did try looking into the kafka-streams codebase as well. The KTableProcessorSupplier that calls the user-supplied adder/subtractor functions appears to be this one: https://github.com/apache/kafka/blob/18547633697a29b690a8fb0c24e2f0289ecf8eeb/streams/src/main/java/org/apache/kafka/streams/kstream/internals/KTableAggregate.java#L81 and on line 92, you can even see a comment saying "first try to remove the old value". Seems like this would answer my question definitively, right? Unfortunately, in my own testing, what I saw was that the process function itself is called twice: first with a Change<V> value that includes only the old value, and then the process function is called again with a Change<V> value that includes only the new value. Unfortunately, I haven't been able to dig deep enough to find the internal code that generates the old-value record and the new-value record (upon receiving an update) to determine whether it actually produces those records in that order.
The order is hard-coded (ie, no race condition), but there is no guarantee that the order won't change in future releases without notice (ie, it's not a public contract and no KIP is needed to change it). I guess there would be a Jira about it... But as a matter of fact, it does not really matter (detail below).
For the three scenarios you mentioned, the 3rd one cannot happen though: aggregators are executed in a single thread (per shard) and thus either the adder or the subtractor is called first.
first with a Change value that includes only the old value and then the process function is called again with a Change value that includes only the new value.
In general, both records might be processed by different threads, and thus it's not possible to send only one record. It's just that the TTD simulates a single-threaded execution, so both records always end up in the same processor.
Cf TopologyTestDriver sending incorrect message on KTable aggregations
However, the order actually only matters if both records really end up in the same processor (if the grouping key did not change during the upstream update).
Furthermore, the order actually depends not on the downstream aggregate implementation, but on the order of writes into the repartition topic of the groupBy(), and with multiple parallel upstream processors, those writes are interleaved anyway. Thus, in general, you should think of the "add" and "subtract" parts as independent entities and not make any assumption about their order (also, even if the key did not change, both records might be interleaved by other records...).
The only guarantee provided (given that you configured the producer correctly to avoid re-ordering during send()) is that if the grouping key does not change, the sends of the old and new value will not be re-ordered relative to each other. The order of the sends is hard-coded in the upstream processor though:
https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/kstream/internals/KTableRepartitionMap.java#L93-L99
Thus, the order of the downstream aggregate processor is actually meaningless.
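If you want to avoid relying on the order at all, one option (a sketch in the same fragment style as the question's snippet, not something taken from the Kafka Streams API itself) is to make the aggregation commutative, e.g. by keeping counts (a multiset) instead of a Set, so that adder and subtractor give the same result in either order:

.aggregate[Map[Animal, Int]](Map.empty)(
  adder = (zooKey, animal, counts) =>
    counts.updated(animal, counts.getOrElse(animal, 0) + 1),
  subtractor = (zooKey, animal, counts) =>
    counts.getOrElse(animal, 0) match {
      case n if n <= 1 => counts - animal            // last occurrence: drop the key
      case n           => counts.updated(animal, n - 1)
    }
)

With counts, a duplicated update ends in the same state whether the adder or the subtractor runs first.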
This is the context:
There is an input event stream.
There are several methods to apply to the stream, each of which applies different logic to evaluate an event as a "good" or "bad" event.
An event is a real "good" one only if it passes all the methods; otherwise it is a "bad" event.
There is an output event stream that carries the result for each event along with its eventID.
To solve this problem, I have two ideas:
We can apply each method sequentially to each event. But this is a kind of batch processing and doesn't take advantage of stream processing; at the same time, it takes Time(Method1) + Time(Method2) + Time(Method3) + ..., which may not be suitable for real-time processing.
We can pass the input stream to each method and run the methods in parallel; each method saves bad events into permanent storage, and the Main method then queries the permanent storage to get the result for each event. But this has some problems to solve:
how to execute the methods in parallel in the programming language (e.g. Scala), and what the performance cost is (network, CPU, memory)
how to solve the synchronization problem: those methods need some time to compute and save their flags into permanent storage, while the Main method needs much less time to query the flags, so a delay issue occurs
etc.
This is more of an open design question than a narrow technical one; I would like to hear your ideas, if you have any, for solving the problem. Looking forward to your opinions.
Parallel streams, each doing the full set of evaluations sequentially, is the more straightforward solution. But if that introduces too much latency, then you can fan out the evaluations to be done in parallel, and then bring the results back together again to make a decision.
To do the fan-out, look at the split operation on DataStream, or use side outputs. But before doing this n-way fan-out, make sure that each event has a unique ID. If necessary, add a field containing a random number to each event to use as the unique ID. Later we will use this unique ID as a key to gather back together all of the partial results for each event.
Once the event stream is split, each copy of the stream can use a MapFunction to compute one of the evaluation methods.
Gathering all of these separate evaluations of a given event back together is a bit more complex. One reasonable approach here is to union all of the result streams together, and then key the unioned stream by the unique ID described above. This will bring together all of the individual results for each event. Then you can use a RichFlatMapFunction (using Flink's keyed, managed state) to gather the results for the separate evaluations in one place. Once the full set of evaluations for a given event has arrived at this stateful flatmap operator, it can compute and emit the final result.
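Here is a rough sketch of that shape using Flink's Scala DataStream API. Event, PartialResult, the three evaluateN methods and the 3-way fan-out are assumptions standing in for the real evaluation logic:

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

case class Event(id: String, payload: String)
case class PartialResult(id: String, ok: Boolean)
case class FinalResult(id: String, good: Boolean)

def evaluate1(e: Event): Boolean = ???
def evaluate2(e: Event): Boolean = ???
def evaluate3(e: Event): Boolean = ???

val events: DataStream[Event] = ??? // e.g. obtained from a Kafka source

// Fan out: each copy of the stream applies one evaluation method.
val r1 = events.map(e => PartialResult(e.id, evaluate1(e)))
val r2 = events.map(e => PartialResult(e.id, evaluate2(e)))
val r3 = events.map(e => PartialResult(e.id, evaluate3(e)))

// Fan in: union the partial results, key by the unique event ID, and gather
// them in keyed state until all three evaluations have arrived.
val finalResults: DataStream[FinalResult] = r1.union(r2, r3)
  .keyBy(_.id)
  .flatMap(new RichFlatMapFunction[PartialResult, FinalResult] {
    @transient private var seen: ValueState[(Int, Boolean)] = _

    override def open(parameters: Configuration): Unit =
      seen = getRuntimeContext.getState(
        new ValueStateDescriptor("seen", classOf[(Int, Boolean)]))

    override def flatMap(r: PartialResult, out: Collector[FinalResult]): Unit = {
      val (count, allGood) = Option(seen.value()).getOrElse((0, true))
      val updated = (count + 1, allGood && r.ok)
      if (updated._1 == 3) { // all evaluations arrived: emit and clear state
        out.collect(FinalResult(r.id, updated._2))
        seen.clear()
      } else {
        seen.update(updated)
      }
    }
  })

In production you would also want some kind of timeout (e.g. a timer in a KeyedProcessFunction) so that state for events with missing partial results does not linger forever.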
I want to implement the following functions in the most reactive way. I need these for implementing the bijections for automatic conversion between the said types.
def convertScalaRXObservableToTwitterFuture[A](a: Observable[A]): TwitterFuture[A] = ???
def convertScalaRXObservableToTwitterFutureList[A](a: Observable[A]): TwitterFuture[List[A]] = ???
I came across this article on a related subject but I can't get it working.
Unfortunately, the claim in that article is not correct, and there can't be a true bijection between Observable and anything like Future. The thing is that Observable is a more powerful abstraction that can represent things which can't be represented by Future. For example, an Observable might actually represent an infinite sequence (see Observable.interval); obviously there is no way to represent something like this with a Future. The documentation for the Observable.toList call used in that article explicitly mentions that:
Returns a Single that emits a single item, a list composed of all the items emitted by the finite source ObservableSource.
and later it says:
Sources that are infinite and never complete will never emit anything through this operator and an infinite source may lead to a fatal OutOfMemoryError.
Even if you limit yourself to only finite Observables, a Future still can't fully express the semantics of an Observable. Consider Observable.intervalRange, which generates a limited range one by one over some time period. With Observable, the first event comes after initialDelay and then you get an event each period. With Future you can get only one event, and only once the sequence is fully generated and the Observable has completed. It means that by transforming Observable[A] into Future[List[A]] you immediately break the main benefit of Observable - reactivity: you can't process events one by one, you have to process them all in a single bunch.
To sum up, the claim in the first paragraph of the article:
convert between the two, without loosing asynchronous and event-driven nature of them.
is false, because the conversion Observable[A] -> Future[List[A]] loses exactly the "event-driven nature" of Observable, and there is no way to work around this.
P.S. Actually, the fact that Future is less powerful than Observable should not be a big surprise. If it were not, why would anybody have created Observable in the first place?
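That said, the one direction that does work (for finite Observables only, and at the cost of the per-event reactivity discussed above) can be sketched roughly like this, using a Twitter Promise as the bridge; the function name follows the question:

import com.twitter.util.{Future => TwitterFuture, Promise => TwitterPromise}
import rx.lang.scala.Observable

def convertScalaRXObservableToTwitterFutureList[A](a: Observable[A]): TwitterFuture[List[A]] = {
  val p   = new TwitterPromise[List[A]]
  val buf = List.newBuilder[A]
  a.subscribe(
    (x: A) => buf += x,                  // onNext: accumulate elements
    (e: Throwable) => p.setException(e), // onError: fail the future
    () => p.setValue(buf.result())       // onCompleted: fulfil with all elements
  )
  p
}

For an infinite Observable this future simply never completes (and the buffer grows without bound), which is exactly the limitation described above.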
I have some simple test code for Akka Streams (written in C#, but the Scala version isn't much different):
var source = Source.From(Enumerable.Range(1, 3));
var flow = Flow.FromFunction(new Func<int, string>(x => (x * 2).ToString()));
var sink = Sink.ForEach<string>(output.Add);
var runnable = source.Via(flow).To(sink);
Since the Via helper method is just a shortcut for ViaMaterialized(flow, Keep.Left), I can rewrite the code like this:
var source = Source.From(Enumerable.Range(1, 3));
var flow = Flow.FromFunction(new Func<int, string>(x => (x * 2).ToString()));
var sink = Sink.ForEach<string>(output.Add);
var runnable = source.ViaMaterialized(flow, Keep.Left).To(sink);
The Keep property (Left, Right, Both or None) tells the stream materializer that it should preserve the value on the specified side of the stream operation. But I notice that if I change Keep.Left to Keep.Right, Keep.Both or even Keep.None, nothing changes in the execution outcome: the sink always receives the output according to the flow transformation function.
I thought that using a non-None Keep value for Flow stages in a stream graph was necessary to ensure the values get sent to the sink. I must have misunderstood the meaning of this, so my question is: why does a stream flow work even when materialization is "disabled" for both sides? And can you give an example where changing the Keep value between Left, Right, Both and None affects the values that reach the sink?
You are confusing the fact that a stream gets materialized with the fact that it has a materialized value.
A flow (or more generally a graph) is a blueprint for a stream. When you use the run() method on a runnable graph, a stream is materialized using this blueprint. This stream does whatever is expected of it without any regards for materialized values.
What is a materialized value? When you use the method run(), a value is returned. That's the materialized value for your stream. Most of the time (for simple built-in stages), the materialized value is unimportant (it's called NotUsed in Scala; I don't know about .NET). A non-trivial example is Sink.ignore, which is materialized as a Future[Done]. It gives you a handle on when the particular stream you have materialized will have completely consumed its input (or thrown an error). More generally, the materialized value gives you some circumstantial information on what's going on in your stream (sorry about the vagueness of this statement, but the principle at hand is too general for me to be more explicit).
When building a graph, you put together different pieces that all have a different materialized value. Since you can only have one for your runnable graph, you need to combine them in some way. Keep.{right, left, both, none} are simple functions that combine those values by keeping only one of the values, or both, or none. However, it does not change the fact that both graphs will be materialized, and the values generated, even if you decide not to keep them.
Keep.* functions don't influence the materialization process itself, only what you get out of it.
More specifically, at materialization time (i.e. when run() is called), each and every stage of your stream (in your example, source, flow and sink) will always be materialized - and therefore produce a materialized value under the hood. You can clearly see what that value will be from their last type parameter.
For the user's convenience, as most likely you will not be interested in all of them, you can use Keep.* accordingly to select what to keep around. This directly reflects on the return type of run().
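A small Scala sketch of that last point (Akka 2.4/2.5-era API; the names here are assumptions, not taken from the question's code). The sink prints "2", "4", "6" in every variant; only the type of the value returned by run() changes:

import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Flow, Keep, Sink, Source}
import akka.{Done, NotUsed}
import scala.concurrent.Future

implicit val system: ActorSystem = ActorSystem("keep-demo")
implicit val materializer: ActorMaterializer = ActorMaterializer()

val source = Source(1 to 3)
val flow   = Flow[Int].map(x => (x * 2).toString)
val sink   = Sink.foreach[String](println)

// Same stream wiring each time; Keep.* only selects what run() returns.
val keptLeft: NotUsed       = source.viaMat(flow)(Keep.left).toMat(sink)(Keep.left).run()
val keptRight: Future[Done] = source.viaMat(flow)(Keep.left).toMat(sink)(Keep.right).run()
val keptBoth: (NotUsed, Future[Done]) =
  source.viaMat(flow)(Keep.left).toMat(sink)(Keep.both).run()

Keeping the sink's Future[Done] is the useful choice here, since it tells you when the stream has finished processing all elements.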
I'm trying to understand the Source type for Akka streams, specified here.
Unfortunately, the documentation and examples I've found don't explain what each of the type parameters actually means. I'm guessing that Out is the type that the source emits when materialized. Is that correct? What is the other type parameter, Mat?
Out
You are correct, this is the type of the elements which are emitted by the Source.
Mat
It is the type of the Source's materialisation. Note that every stage (Flows, Sinks, etc.) will materialise to a value as well.
This is essentially a byproduct of the stage itself after it is run.
You can picture it as a means of interacting with the stage while it's running. Looking at examples of ready-made Sources offered by Akka is a good way of getting the gist of it (a few are sketched in code right after the list below).
Source.single will materialise to NotUsed. You have no means of interacting with the source, as it will produce only one element straightaway and then complete.
Source.queue will materialise to a SourceQueue. This is a more interesting case, as you can interact with the source by offering messages to it. The messages you offer will be emitted by the source.
Source.maybe will materialise to a Promise. You can use the Promise to control the source and make it emit one element, or None.
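For instance, a quick sketch of the materialized types for the sources mentioned above (type names as in recent Akka releases; the buffer size and overflow strategy for Source.queue are arbitrary):

import akka.NotUsed
import akka.stream.OverflowStrategy
import akka.stream.scaladsl.{Source, SourceQueueWithComplete}
import scala.concurrent.Promise

val single: Source[Int, NotUsed]                     = Source.single(1)
val queue: Source[Int, SourceQueueWithComplete[Int]] = Source.queue[Int](16, OverflowStrategy.backpressure)
val maybe: Source[Int, Promise[Option[Int]]]         = Source.maybe[Int]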
When you concatenate different stages, note that every stage can potentially have a useful materialized value. You get to choose which ones to keep by using the viaMat/toMat and Keep DSL.
One or more materialized values will be returned when run() is called on the composed graph.
Taking a look at the types in the examples below is the best way to get the gist of the API:
val source: Source[Int, MatSrc]
val sink: Sink[Int, MatSnk]
val matSrc: MatSrc = source.toMat(sink)(Keep.left).run()
val matSnk: MatSnk = source.toMat(sink)(Keep.right).run()
val (m1: MatSrc, m2: MatSnk) = source.toMat(sink)(Keep.both).run()
val n: NotUsed = source.toMat(sink)(Keep.none).run()
Note that the more succinct DSL which you can find in many examples is actually a shortcut for the above, where only the materialized value of the last stage (e.g. the sink) is kept.
val mat3: Mat3 = source.viaMat(flow)(Keep.right).toMat(sink)(Keep.right).run()
is the same as
val mat3: Mat3 = source.via(flow).runWith(sink)
See the docs below for further reading.
http://doc.akka.io/docs/akka/2.4/java/stream/stream-quickstart.html#Materialized_values