How to aggregate events in flink stream before merging with current state by reduce function? - scala

My events are like: case class Event(user: User, stats: Map[StatType, Int])
Every event contains +1 or -1 values in it.
I have my current pipeline that works fine but produces new event for every change of statistics.
eventsStream
.keyBy(extractKey)
.reduce(reduceFunc)
.map(prepareRequest)
.addSink(sink)
I'd like to aggregate these increments in a time window before merging them with the current state. So I want the same rolling reduce but with a time window.
Current simple rolling reduce:
500 – last reduced value
+1
-1
+1
Emitted events: 501, 500, 501
Rolling reduce with a window:
500 – last reduced value
v-- window
+1
-1
+1
^-- window
Emitted events: 501
I've tried naive solution to put time window just before reduce but after reading the docs I see that reduce now has different behavior.
eventsStream
.keyBy(extractKey)
.timeWindow(Time.minutes(2))
.reduce(reduceFunc)
.map(prepareRequest)
.addSink(sink)
It seems that I should make keyed stream and reduce it after reducing my time window:
eventsStream
.keyBy(extractKey)
.timeWindow(Time.minutes(2))
.reduce(reduceFunc)
.keyBy(extractKey)
.reduce(reduceFunc)
.map(prepareRequest)
.addSink(sink)
Is it the right pipeline to solve a problem?

There's probably different options, but one would be to implement a WindowFunction and then run apply after the windowing:
eventsStream
.keyBy(extractKey)
.timeWindow(Time.minutes(2))
.apply(new MyWindowFunction)
(WindowFuntion takes type parameters for the type of the input value, the type of the output value and the type of the key.)
There's an example of that in here. Let me copy the relevant snippet:
/** User-defined WindowFunction to compute the average temperature of SensorReadings */
class TemperatureAverager extends WindowFunction[SensorReading, SensorReading, String, TimeWindow] {
/** apply() is invoked once for each window */
override def apply(
sensorId: String,
window: TimeWindow,
vals: Iterable[SensorReading],
out: Collector[SensorReading]): Unit = {
// compute the average temperature
val (cnt, sum) = vals.foldLeft((0, 0.0))((c, r) => (c._1 + 1, c._2 + r.temperature))
val avgTemp = sum / cnt
// emit a SensorReading with the average temperature
out.collect(SensorReading(sensorId, window.getEnd, avgTemp))
}
I don't know how your data looks so I can't attempt a full answer, but that should serve as inspiration.

Yes, your proposed pipeline will have the desired effect. The window will reduce together the 2-minute batches. The results of those batches will flow into the final reduce, which will produce an updated result after each of its inputs (which are the window results).

Related

Scala - divide the dataset into dataset of arrays with a fixed size

I have a function whose purpose is to divide a dataset into arrays of a given size.
For example - I have a dataset with 123 objects of the Foo type, I provide to the function arraysSize 10 so as a result I will have a Dataset[Array[Foo]] with 12 arrays with 10 Foo's and 1 array with 3 Foo.
Right now function is working on collected data - I would like to change it on dataset based because of performance but I dont know how.
This is my current solution:
private def mapToFooArrays(data: Dataset[Foo],
arraysSize: Int): Dataset[Array[Foo]]= {
data.collect().grouped(arraysSize).toSeq.toDS()
}
The reason for doing this transformation is because the data will be sent in the event. Instead of sending 1 million events with information about 1 object, I prefer to send, for example, 10 thousand events with information about 100 objects
IMO, this is a weird use case. I can not think of any efficient solution to do this, as it is going to require a lot of shuffling no matter how we do it.
But, the following is still better, as it avoids collecting to the driver node and will thus be more scalable.
Things to keep in mind -
what is the value of data.count() ?
what is the size of a single Foo ?
what is the value of arraySize ?
what is your executor configuration ?
Based on these factors you will be able to come up with the desiredArraysPerPartition value.
val desiredArraysPerPartition = 50
private def mapToFooArrays(
data: Dataset[Foo],
arraysSize: Int
): Dataset[Array[Foo]] = {
val size = data.count()
val numArrays = (size.toDouble / arrarySize).ceil
val numPartitions = (numArrays.toDouble / desiredArraysPerPartition).ceil
data
.repartition(numPartitions)
.mapPartitions(_.grouped(arrarySize).map(_.toArray))
}
After reading the edited part, I think that 100 size in 10 thousand events with information about 100 objects is not really important. As it is referred as about 100. There can be more than one events with less than 100 Foo's.
If we are not very strict about that 100 size, then there is no need of reshuffle.
We can locally group the Foo's present in each of the partitions. As this grouping is being done locally and not globally, this might result in more than one (potentially one for each partition) Arrays with less than 100 Foo's.
private def mapToFooArrays(
data: Dataset[Foo],
arraysSize: Int
): Dataset[Array[Foo]] =
data
.mapPartitions(_.grouped(arrarySize).map(_.toArray))

Kafka Streams - GroupBy - Late Event - persistentWindowStore - WindowBy with Grace Period and Suppress

My purpose to calculate success and fail message from source to destination per second and sum their results in daily bases.
I had two options to do that ;
stream events then group them time#source#destination
KeyValueBytesStoreSupplier streamStore = Stores.persistentKeyValueStore("store-name");
sourceStream.selectKey((k, v) -> v.getDataTime() + KEY_SEPERATOR + SRC + KEY_SEPERATOR + DEST ).groupByKey().aggregate(
DO SOME Aggregation,
Materialized.<String, AggregationObject>as(streamStore)
.withKeySerde(Serdes.String())
.withValueSerde(AggregationObjectSerdes));
After trying this approach above we noticed that state store is getting increase because of number of unique keys are increasing and if i am correct, because of state topics are only "compact" they are never expires.
NumberOfUniqueKeys = 86.400 seconds in a day X SOURCE X DESTINATION
Then we thought that if we do not put a time field in a KEY block, we can reduce state store size. We tried windowing operation as second approach.
using windowing operation with persistentWindowStore, CustomTimeStampExtractor, WindowBy, Suppress
WindowBytesStoreSupplier streamStore = Stores.persistentWindowStore("store-name", Duration.ofHours(6), Duration.ofSeconds(1), false);
sourceStream.selectKey((k, v) -> SRC + KEY_SEPERATOR + DEST)
.groupByKey() .windowedBy(TimeWindows.of(Duration.ofSeconds(1)).grace(Duration.ofSeconds(5)))
.aggregate(
{
DO SOME Aggregation
}, Materialized.<String, AggregationObject>as(streamStore)
.withKeySerde(Serdes.String())
.withValueSerde(AggregationObjectSerdes))
.suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded())).toStream();`
After trying that second approach, we reduced state store size but now we had problem with late arrive events. Then we added grace period with 5 seconds with suppress operation and in addition using grace period and suppress operation did not guarantee to handle all late arrived events, another side effect of suppress operation is a latency because it emits result of aggregation after window grace period.
BTW
using windowing operation caused a getting WARNING message like
"WARN 1 --- [-StreamThread-2] o.a.k.s.state.internals.WindowKeySchema : Warning: window end time was truncated to Long.MAX"
I checked the reason from source code and I found from here
https://github.com/a0x8o/kafka/blob/master/streams/src/main/java/org/apache/kafka/streams/state/internals/WindowKeySchema.java
/**
* Safely construct a time window of the given size,
* taking care of bounding endMs to Long.MAX_VALUE if necessary
*/
static TimeWindow timeWindowForSize(final long startMs,
final long windowSize) {
long endMs = startMs + windowSize;
if (endMs < 0) {
LOG.warn("Warning: window end time was truncated to Long.MAX");
endMs = Long.MAX_VALUE;
}
return new TimeWindow(startMs, endMs);
}
BUT actually it does not make any sense to me that how endMs can be lower than 0...
Questions ?
What if we go through with approach 1, how can we reduce state store size ? In approach 1, It was guaranteed that all event will be processed and there will be no missing event because of latency.
What if we go through with approach 2, how should i tune my logic and catch late arrival data and reduce latency ?
Why do i get Warning message in approach 2 although all time fields are positive in my model ?
What can be other options that you can suggest other then these two approaches ?
I need some expert help :)
BR,
According to mail kafka mail group about warning message
WARNING message like "WARN 1 --- [-StreamThread-2] o.a.k.s.state.internals.WindowKeySchema : Warning: window end time was truncated to Long.MAX"
It was written to me :
You can get this message "o.a.k.s.state.internals.WindowKeySchema :
Warning: window end time was truncated to Long.MAX"" when your
TimeWindowDeserializer is created without a windowSize. There are two
constructors for a TimeWindowDeserializer, are you using the one with
WindowSize?
https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/kstream/TimeWindowedDeserializer.java#L46-L55
It calls WindowKeySchema with a Long.MAX_VALUE
https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/kstream/TimeWindowedDeserializer.java#L84-L90

How to create WindowSpec to count rows per type before and after the current row?

I have had to implement an event centric Windowing batch, with a varying number of event names.
The rule is as follows, for a certain event, every time it occurs, we sum all other events according to certain time windows.
action1 00:01
action2 00:02
action1 00:03
action3 00:04
action3 00:05
For the above dataset, it should be:
window_before: Map(action1 -> 1)
window_after: Map(action1 -> 1, action3 -> 2)
In order to achieve this, we use WindowSpec and a custom udaf that aggregates all counters into a map. The udaf is necessary because the number of action names is completely arbitrary.
Of course at first, the UDAF used Spark's catalyst converters, which was horrendously slow.
Now I've reached what I think is a decent optimum, where I just maintain an array of keys and values with immutable lists (lower GC times, lower iterator overhead) all serialized as binary, so the Scala runtime handles boxing/unboxing and not Spark, using byte arrays instead of strings.
The problem is that some stragglers are very problematic, and the workload cannot be parallelized, unlike when we had a static number of columns and were just summing/counting numeric columns.
I tried to test another technique where I created a number of columns equal to the max cardinality of events and then aggregated back to a map, but the number of columns in the projection was simply killing spark (think a thousand columns easily).
One of the problems, is the huge stragglers, where most of the time a single partition (something like userid, app) will take 100 times longer than the median, even though everything is properly repartitioned.
Has anyone else come to a similar problem ?
Example WindowSpec:
val windowSpec = Window
.partitionBy($"id", $"product_id")
.orderBy("time")
.rangeBetween(-30days, -1)
then
df.withColumn("over30days", myUdaf("name", "count").over(windowSpec))
A naive version of the UDAF:
class UDAF[A] {
private var zero: A = ev.zero
val dt = schemaFor[A].dataType
override def bufferSchema: StructType =
StructType(StructField("actions", MapType(StringType, dt) :: Nil)
override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
name = row.get(0)
count = row.get(1)
buffer.update(name, buffer.getOrElse(name, ev.zero) + count)
}
}
My current version is less readable than the above naive version but effectively does the same, two binary arrays to circumvent CatalystConverters.

spark filter on external variable

I have a series of sales records which are in an RDD like so,
case class salesRecord(startDate: Int, startTime: Int, itemNumber: Int)
val transactions: RDD[salesRecords]
each day is separated by a startDate and startTime which are the seconds since midnight. I need to be able to filter the start time between two bands which have been input by the user.
so imagine the user has input
val timeBandStart: Int //example 100
val timeBandEnd: Int // 5000
and they only want the records of each day between these bands, so to do this I've tried the following.
val timeFiltered = transactions.filter { record =>
record.startTime >= timeBandStart && record.startTime <= timeBandEnd
}
the issue i'm facing is that I get nothing on my output, I know for sure the records are within these time bands, so should appear on my output. To try and debug this I've tried extract what the timebands which i'm trying to filter on, with this.
val test = transactions.map( (record.startTime, timeBandStart, timeBandEnd)
My output of this is the following (32342, 0, 0), (32455, 0, 0).
Why are my timeBands not being set within the filter? I thought this could be something to with the variables not being broadcast to all of the nodes, so I tried placing the timebands with a broadcast variable. But that didn't work....
Probably something really stupid, can somebody point out what i'm doing wrong?
Cheers!
I've fixed my bug, it's related to this issue. https://issues.apache.org/jira/browse/SPARK-4170
I'm now not extending from App and instead using the main method. My filter is now being applied correctly.
Thanks for your help!

Unexpected spark caching behavior

I've got a spark program that essentially does this:
def foo(a: RDD[...], b: RDD[...]) = {
val c = a.map(...)
c.persist(StorageLevel.MEMORY_ONLY_SER)
var current = b
for (_ <- 1 to 10) {
val next = some_other_rdd_ops(c, current)
next.persist(StorageLevel.MEMORY_ONLY)
current.unpersist()
current = next
}
current.saveAsTextFile(...)
}
The strange behavior that I'm seeing is that spark stages corresponding to val c = a.map(...) are happening 10 times. I would have expected that to happen only once because of the immediate caching on the next line, but that's not the case. When I look in the "storage" tab of the running job, very few of the partitions of c are cached.
Also, 10 copies of that stage immediately show as "active". 10 copies of the stage corresponding to val next = some_other_rdd_ops(c, current) show up as pending, and they roughly alternate execution.
Am I misunderstanding how to get Spark to cache RDDs?
Edit: here is a gist containing a program to reproduce this: https://gist.github.com/jfkelley/f407c7750a086cdb059c. It expects as input the edge list of a graph (with edge weights). For example:
a b 1000.0
a c 1000.0
b c 1000.0
d e 1000.0
d f 1000.0
e f 1000.0
g h 1000.0
h i 1000.0
g i 1000.0
d g 400.0
Lines 31-42 of the gist correspond to the simplified version above. I get 10 stages corresponding to line 31 when I would only expect 1.
The problem here is that calling cache is lazy. Nothing will be cached until an action is triggered and the RDD is evaluated. All the call does is set a flag in the RDD to indicate that it should be cached when evaluated.
Unpersist however, takes effect immediately. It clears the flag indicating that the RDD should be cached and also begins a purge of data from the cache. Since you only have a single action at the end of your application, this means that by the time any of the RDDs are evaluated, Spark does not see that any of them should be persisted!
I agree that this is surprising behaviour. The way that some Spark libraries (including the PageRank implementation in GraphX) work around this is by explicitly materializing each RDD between the calls to cache and unpersist. For example, in your case you could do the following:
def foo(a: RDD[...], b: RDD[...]) = {
val c = a.map(...)
c.persist(StorageLevel.MEMORY_ONLY_SER)
var current = b
for (_ <- 1 to 10) {
val next = some_other_rdd_ops(c, current)
next.persist(StorageLevel.MEMORY_ONLY)
next.foreachPartition(x => {}) // materialize before unpersisting
current.unpersist()
current = next
}
current.saveAsTextFile(...)
}
Caching doesn't reduce stages, it just won't recompute the stage every time.
In the first iteration, in the stage's "Input Size" you can see that the data is coming from Hadoop, and that it reads shuffle input. In subsequent iterations, the data is coming from memory and no more shuffle input. Also, execution time is vastly reduced.
New map stages are created whenever shuffles have to be written, for example when there's a change in partitioning, in your case adding a key to the RDD.