We are seeing some odd behaviour with a processWindow function that emits two records for a window.
The first record contains complete information built from the aggregated data in the window; the second contains only partial information, with some fields missing.
The processWindow function uses keyed state (MapState) as follows:
override def open(parameters: Configuration): Unit = {
  cfState = getRuntimeContext.getMapState(
    new MapStateDescriptor[(String, Int), mutable.Map[Int, mutable.Set[Int]]](
      "customFieldsState",
      classOf[(String, Int)],
      classOf[mutable.Map[Int, mutable.Set[Int]]]
    )
  )
}
and the process function manipulates the above state using records present in the window.
Is using state within a processWindow function an anti-pattern? Are there any other recommendations for using state within a processWindow function?
We need to maintain state in this case as we don't capture all fields in a single window and we need to aggregate the records per user, hence the use of a window function.
Thanks
If you want to maintain state beyond the lifetime of a single window instance, you should use ProcessWindowFunction.Context#globalState(), which returns a KeyedStateStore.
All other state is cleared when the window is closed.
Since globalState is never cleared by Flink, you should set state TTL on the state descriptor you use if you will have keys that go stale, in order to avoid leaking state over time.
Apache Flink and Kafka Streams have the concept of a session window.
The window is defined based on the time between two consecutive messages from the same key.
If the time between two consecutive messages is less than the specified session gap, then the messages are considered to belong to the same session.
If the gap is larger than the session gap, the window is emitted and a new window is started.
It is trivial to configure a session window in both Flink and Kafka Streams:
.window(EventTimeSessionWindows.withGap(Time.minutes(10)))
.windowedBy(SessionWindows.with(Duration.ofMinutes(5)).grace(Duration.ofSeconds(30)))
I tried to do the same thing with Reactor but could not find a way, probably because my knowledge of Reactor is too limited.
I see that Reactor has multiple variations of the window operation, like windowWhile, windowUntil, windowUntilChanged.
But the predicates that they take as arguments evaluate only the current key, not the gap to the previous key.
Thanks!
I can answer my own question, after reading through the Reactor docs and searching for the relevant operator.
The session window functionality can be achieved with either of the two operators bufferUntilChanged and windowUntilChanged, which have the following signatures:
<V> Flux<List<T>> bufferUntilChanged(Function<? super T,? extends V> keySelector, BiPredicate<? super V,? super V> keyComparator)
<V> Flux<Flux<T>> windowUntilChanged(Function<? super T,? extends V> keySelector, BiPredicate<? super V,? super V> keyComparator)
The first Function argument selects the key that should be compared.
In case of a session window, this key should be the event time stored in the event.
The second BiPredicate argument compares the current key with the previous key.
For a session window, I subtract the previous event time from the current event time. If the difference is within the session gap, I return true and the current item is added to the ongoing buffer/window; if it is larger than the gap, I return false, which emits the ongoing buffer/window and starts a new one.
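To make the comparator rule concrete, here is a minimal plain-Java sketch with no Reactor dependency. It models the cut rule on a list of event times: a new buffer starts whenever the predicate returns false for two consecutive items. SessionBuffers, split and sessions are hypothetical names invented for this illustration, and the (previous, current) argument order is an assumption of the sketch, so check the Reactor Javadoc before relying on it with bufferUntilChanged. The sketch also ignores per-key grouping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

// Hypothetical helper; a model of the idea, not Reactor code.
final class SessionBuffers {

    // Splits items into consecutive buffers. A new buffer starts whenever
    // sameSession.test(previous, current) returns false.
    static <T> List<List<T>> split(List<T> items, BiPredicate<T, T> sameSession) {
        List<List<T>> buffers = new ArrayList<>();
        List<T> current = new ArrayList<>();
        T previous = null;
        for (T item : items) {
            if (!current.isEmpty() && !sameSession.test(previous, item)) {
                buffers.add(current);          // emit the ongoing buffer
                current = new ArrayList<>();   // and start a new one
            }
            current.add(item);
            previous = item;
        }
        if (!current.isEmpty()) {
            buffers.add(current);              // emit the last open buffer
        }
        return buffers;
    }

    // Session rule: two consecutive event times belong to the same session
    // when they are at most gapMillis apart.
    static List<List<Long>> sessions(List<Long> eventTimesMillis, long gapMillis) {
        return split(eventTimesMillis, (prev, cur) -> cur - prev <= gapMillis);
    }
}
```

With a gap of 5000 ms, the event times 0, 1000, 2000, 20000, 21000, 60000 fall into three sessions, because the jumps from 2000 to 20000 and from 21000 to 60000 both exceed the gap.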
The Streams DSL documentation includes a caveat about using the aggregate method to transform a KGroupedTable → KTable, as follows (emphasis mine):
When subsequent non-null values are received for a key (e.g., UPDATE), then (1) the subtractor is called with the old value as stored in the table and (2) the adder is called with the new value of the input record that was just received. The order of execution for the subtractor and adder is not defined.
My interpretation of that last line is that one of three things can happen:
1. the subtractor is called before the adder
2. the adder is called before the subtractor
3. the adder and the subtractor are called at the same time
Here is the question I'm looking to get answered:
Are all 3 scenarios above actually possible when using the aggregate method on a KGroupedTable?
Or am I misinterpreting the documentation? For my use-case (detailed below), it would be ideal if the subtractor were always called before the adder.
Why is this question important?
If the adder and subtractor are non-commutative operations and their execution order can vary, you can end up with different results depending on that order. A useful example of such a non-commutative pair is aggregating records into a Set:
.aggregate[Set[Animal]](Set.empty)(
  adder = (zooKey, animalValue, setOfAnimals) => setOfAnimals + animalValue,
  subtractor = (zooKey, animalValue, setOfAnimals) => setOfAnimals - animalValue
)
In this example, for duplicated events, if the adder is called before the subtractor you would end up removing the value entirely from the set (which would be problematic for most use-cases I imagine).
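The effect of the ordering can be reproduced without Kafka Streams. The following plain-Java sketch applies a set-based adder and subtractor in both orders to a duplicated update (old value equal to new value). AggregateOrderDemo and its method names are hypothetical and only model the two user-supplied functions; they say nothing about which order Kafka Streams actually uses.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of a Set-based adder/subtractor pair.
final class AggregateOrderDemo {

    // Adder: returns a copy of the aggregate with the value included.
    static Set<String> add(Set<String> aggregate, String value) {
        Set<String> result = new HashSet<>(aggregate);
        result.add(value);
        return result;
    }

    // Subtractor: returns a copy of the aggregate with the value removed.
    static Set<String> subtract(Set<String> aggregate, String value) {
        Set<String> result = new HashSet<>(aggregate);
        result.remove(value);
        return result;
    }

    // Update handled as: subtract the old value, then add the new one.
    static Set<String> subtractThenAdd(Set<String> aggregate, String oldValue, String newValue) {
        return add(subtract(aggregate, oldValue), newValue);
    }

    // Update handled as: add the new value, then subtract the old one.
    static Set<String> addThenSubtract(Set<String> aggregate, String oldValue, String newValue) {
        return subtract(add(aggregate, newValue), oldValue);
    }
}
```

Starting from a set containing only "lion" and updating "lion" to "lion", subtractThenAdd leaves the set unchanged, while addThenSubtract returns an empty set.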
Why am I doubting the documentation (assuming my interpretation of it is correct)?
1. It seems like an unusual design choice.
2. When I've run unit tests (using TopologyTestDriver and EmbeddedKafka), I always see the subtractor called before the adder. Unfortunately, if there is some kind of race condition involved, it's entirely possible that I would never hit the other scenarios.
I did try looking into the kafka-streams codebase as well. The KTableProcessorSupplier that calls the user-supplied adder/subtractor functions appears to be this one: https://github.com/apache/kafka/blob/18547633697a29b690a8fb0c24e2f0289ecf8eeb/streams/src/main/java/org/apache/kafka/streams/kstream/internals/KTableAggregate.java#L81 and on line 92 you can even see a comment saying "first try to remove the old value". It seems like this would answer my question definitively, right? Unfortunately, in my own testing, I saw that the process function itself is called twice: first with a Change<V> value that includes only the old value, and then again with a Change<V> value that includes only the new value. I haven't been able to dig deep enough to find the internal code that generates the old-value record and the new-value record (upon receiving an update) to determine whether it actually produces those records in that order.
The order is hard-coded (i.e., there is no race condition), but there is no guarantee that the order won't change in future releases without notice (i.e., it's not a public contract and no KIP is needed to change it). I guess there would be a Jira about it... But as a matter of fact, it does not really matter (details below).
Of the three scenarios you mentioned, the third cannot happen: aggregators are executed in a single thread (per shard), so either the adder or the subtractor is called first.
"first with a Change<V> value that includes only the old value and then the process function is called again with a Change<V> value that includes only the new value."
In general, both records might be processed by different threads, and thus it's not possible to send only one record. It's just that the TopologyTestDriver (TTD) simulates single-threaded execution, so both records always end up in the same processor.
Cf. "TopologyTestDriver sending incorrect message on KTable aggregations".
However, the order actually only matters if both records really end up in the same processor (if the grouping key did not change during the upstream update).
Furthermore, the order does not depend on the downstream aggregate implementation but on the order of writes into the repartition topic of the groupBy(), and with multiple parallel upstream processors those writes are interleaved anyway. Thus, in general, you should think of the "add" and "subtract" parts as independent entities and not make any assumptions about their order (also, even if the key did not change, both records might be interleaved with other records...).
The only guarantee provided is (given that you configured the producer correctly to avoid re-ordering during send()), that if the grouping key does not change, the send of the old and new value will not be re-ordered relative to each other. The order of the send is hard-coded in the upstream processor though:
https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/kstream/internals/KTableRepartitionMap.java#L93-L99
Thus, the order of the downstream aggregate processor is actually meaningless.
I have some simple test code for Akka Streams (written in C# with Akka.NET, but the Scala version isn't much different):
var source = Source.From(Enumerable.Range(1, 3));
var flow = Flow.FromFunction(new Func<int, string>(x => (x * 2).ToString()));
var sink = Sink.ForEach<string>(output.Add);
var runnable = source.Via(flow).To(sink);
Since Via helper method is just a shortcut for ViaMaterialized(flow, Keep.Left) I can rewrite the code like this:
var source = Source.From(Enumerable.Range(1, 3));
var flow = Flow.FromFunction(new Func<int, string>(x => (x * 2).ToString()));
var sink = Sink.ForEach<string>(output.Add);
var runnable = source.ViaMaterialized(flow, Keep.Left).To(sink);
The Keep property (Left, Right, Both or None) tells the stream materializer that it should preserve the value on the specified side of the stream operation. But I notice that if I change Keep.Left to Keep.Right, Keep.Both or even Keep.None, nothing changes in the execution outcome: the sink always receives the output of the flow transformation function.
I thought that using a non-None Keep value for Flow stages in a stream graph was necessary to ensure the values get sent to the sink. I must have misunderstood the meaning of this, so my question is: why does a stream flow work even when materialization is disabled for both sides? And can you give an example where changing the Keep value between Left, Right, Both and None affects the values that reach the sink?
You are confusing the fact that a stream gets materialized and the fact that it has a materialized value.
A flow (or more generally a graph) is a blueprint for a stream. When you use the run() method on a runnable graph, a stream is materialized using this blueprint. This stream does whatever is expected of it without any regards for materialized values.
What is a materialized value? When you call run(), a value is returned. That's the materialized value of your stream. Most of the time (for simple built-in stages), the materialized value is unimportant (it's called NotUsed in Scala; I don't know about .NET). A non-trivial example is Sink.ignore, which is materialized as a Future[Done]. It gives you a handle on when the particular stream you have materialized has completely consumed its input (or thrown an error). More generally, the materialized value gives you some circumstantial information about what's going on in your stream (sorry about the vagueness of this statement, but the principle at hand is too general for me to be more explicit).
When building a graph, you put together different pieces that all have a different materialized value. Since you can only have one for your runnable graph, you need to combine them in some way. Keep.{right, left, both, none} are simple functions that combine those values by keeping only one of the values, or both, or none. However, it does not change the fact that both graphs will be materialized, and the values generated, even if you decide not to keep them.
Keep.* functions don't influence the materialization process itself, only what you get out of it.
More specifically, at materialization time (i.e. when run() is called), each and every stage of your stream (in your example, source, flow and sink) will always be materialized - and therefore produce a materialized value under the hood. You can clearly see what that value will be from their last type parameter.
For the user's convenience, as most likely you will not be interested in all of them, you can use Keep.* accordingly to select what to keep around. This directly reflects on the return type of run().
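A toy model may help separate the two ideas. The plain-Java sketch below is NOT the Akka or Akka.NET API: KeepModel, Stage, via and runOn are hypothetical names made up for this illustration. Each stage carries an element transformation and a materialized value; composing two stages always wires the elements through, while the combine function (standing in for Keep.Left, Keep.Right and so on) only selects which materialized value the composed stage exposes.

```java
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Function;
import java.util.stream.Collectors;

// Toy model only; this is NOT the Akka or Akka.NET API.
final class KeepModel {

    // A stage blueprint: how elements are transformed, plus the value
    // the stage would hand back when the stream is run.
    static final class Stage<I, O, M> {
        final Function<I, O> transform;
        final M mat;

        Stage(Function<I, O> transform, M mat) {
            this.transform = transform;
            this.mat = mat;
        }

        // Element wiring is unconditional; `combine` only decides which
        // materialized value the composed stage exposes.
        <O2, M2, M3> Stage<I, O2, M3> via(Stage<O, O2, M2> next,
                                          BiFunction<M, M2, M3> combine) {
            return new Stage<>(transform.andThen(next.transform),
                               combine.apply(mat, next.mat));
        }

        // "Runs" the stage by pushing every input element through it.
        List<O> runOn(List<I> input) {
            return input.stream().map(transform).collect(Collectors.toList());
        }
    }
}
```

Composing the same two stages with (left, right) -> left or with (left, right) -> right gives identical runOn output; only the mat field of the result differs.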
A good question for Spark experts.
I am processing data in a map operation (RDD). Within the mapper function, I need to look up objects of class A to use when processing the elements of the RDD.
Since this will be performed on executors AND creation of elements of type A (that will be looked up) happens to be an expensive operation, I want to pre-load and cache these objects on each executor. What is the best way of doing it?
One idea is to broadcast a lookup table, but class A is not serializable (no control over its implementation).
Another idea is to load them up in a singleton object. However, I want to control what gets loaded into that lookup table (e.g. possibly different data on different Spark jobs).
Ideally, I want to specify what will be loaded on executors once (including the case of Streaming, so that the lookup table stays in memory between batches), through a parameter that will be available on the driver during its start-up, before any data gets processed.
Is there a clean and elegant way of doing it or is it impossible to achieve?
This is exactly the targeted use case for broadcast. Broadcasted variables are transmitted once and use torrents to move efficiently to all executors, and stay in memory / local disk until you no longer need them.
Serialization often pops up as an issue when using others' interfaces. If you can enforce that the objects you consume are serializable, that's going to be the best solution. If this is impossible, your life gets a little more complicated. If you can't serialize the A objects, then you have to create them on the executors for each task. If they're stored in a file somewhere, this would look something like:
rdd.mapPartitions { it =>
  val lookupTable = loadLookupTable(path)
  it.map(elem => fn(lookupTable, elem))
}
Note that if you're using this model, then you have to load the lookup table once per task -- you can't benefit from the cross-task persistence of broadcast variables.
EDIT: Here's another model, which I believe lets you share the lookup table across tasks per JVM.
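class BroadcastableLookupTable(path: String) extends Serializable {
  // @transient: the table itself is never serialized with the broadcast
  @transient private var lookupTable: LookupTable[A] = null
  // Loads the table on first use in each JVM; loadLookupTable is the same
  // placeholder loader used in the mapPartitions snippet above
  def get: LookupTable[A] = synchronized {
    if (lookupTable == null)
      lookupTable = loadLookupTable(path)
    lookupTable
  }
}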
This class can be broadcast (nothing substantial is transmitted) and the first time it's called per JVM, you'll load the lookup table and return it.
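The transient-field idea can be checked without Spark. The plain-Java sketch below serializes a holder whose table is already loaded and shows that the copy arrives empty and reloads on first use. LazyLookupHolder, its load method and the round-trip helper are hypothetical stand-ins: the table is a Map holding a non-serializable Object in place of A, and the Java serialization round trip only approximates what shipping a broadcast value involves.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Map;

// Hypothetical holder: only `path` travels when the object is serialized.
final class LazyLookupHolder implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String path;                    // small, serializable config
    private transient Map<String, Object> table;  // skipped by serialization

    LazyLookupHolder(String path) {
        this.path = path;
    }

    // Loads the table on first call in this JVM, then reuses it.
    synchronized Map<String, Object> get() {
        if (table == null) {
            table = load(path);
        }
        return table;
    }

    synchronized boolean isLoaded() {
        return table != null;
    }

    // Placeholder loader: stands in for the expensive construction of A objects.
    private static Map<String, Object> load(String path) {
        return Map.of(path, new Object());
    }

    // Serializes and deserializes the holder in memory.
    static LazyLookupHolder roundTrip(LazyLookupHolder holder) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(holder);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (LazyLookupHolder) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because the table field is transient, serializing a loaded holder succeeds even though the table contains a non-serializable object, and the deserialized copy starts unloaded.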
If serialisation turns out to be impossible, how about storing the lookup objects in a database? It's not the easiest solution, granted, but it should work just fine. I could recommend checking e.g. spark-redis, but I am sure there are better solutions out there.
Since A is not serializable, the easiest solution is to create your own serializable type A1 with all the data from A required for the computation. Then use the new lookup table in a broadcast.
If I want to read EventStore (http://geteventstore.com/) stream from e.g. an event number 123, I simply use ReadStreamEventsForwardAsync and specify the starting stream position (setting it to 123 in this case).
I'm wondering if there's a similar function that allows a user to read all the events created after a specified date (e.g. I want all the events created after 20-Dec-2014).
A naive implementation would be just to read a whole stream and then filter the result by ResolvedEvent's "Created" field.
Edit:
I've just implemented the naive solution and noticed that it makes the new function return "ResolvedEvent[]" instead of "StreamEventsSlice", so it introduces an unnecessary inconsistency.