Apache Beam (Python SDK): late (or early) events discarded and triggers. How to know how many were discarded and why?

I have a streaming pipeline connected to a Pub/Sub subscription (with around 2 million elements every hour). I need to collect them in groups and then extract some information.
def expand(self, pcoll):
    return (
        pcoll
        | beam.WindowInto(
            FixedWindows(10),
            trigger=AfterAny(AfterCount(2000), AfterProcessingTime(30)),
            allowed_lateness=10,
            trigger=AfterAny(
                AfterCount(8000),
                AfterProcessingTime(30),
                AfterWatermark(
                    early=AfterProcessingTime(60),
                    late=AfterProcessingTime(60)
                )
            ),
            allowed_lateness=60 * 60 * 24,
            accumulation_mode=AccumulationMode.DISCARDING)
        | "Group by Key" >> beam.GroupByKey()
    )
I try my best NOT to miss any data, but I found out that I have around 4% missing data.
As you can see in the code, I trigger any time I hit 8k elements or every 30 seconds.
Allowed lateness is 1 day, and it should trigger whether the pipeline is processing early or late events.
Still, those 4% are missing. So, is there a way to know if the pipeline is discarding some data? How many elements? For what reason?
Thank you so much in advance

First, I see you have two triggers in the sample code; I assume this is a typo, though.
It looks like you are dropping elements due to not using Repeatedly, so all elements after the first trigger firing get lost (the composite trigger needs to be wrapped in Repeatedly(...) so that it keeps firing). There's an official doc on this from Beam.
Allow me to post an example:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.testing.test_pipeline import TestPipeline
from apache_beam.testing.test_stream import TestStream
from apache_beam.transforms import trigger
from apache_beam.transforms.window import FixedWindows, TimestampedValue

test_stream = (TestStream()
               .add_elements([
                   TimestampedValue('in_time_1', 0),
                   TimestampedValue('in_time_2', 0)])
               .advance_watermark_to(9)
               .advance_processing_time(9)
               .add_elements([TimestampedValue('late_but_in_window', 8)])
               .advance_watermark_to(10)
               .advance_processing_time(10)
               .add_elements([TimestampedValue('in_time_window2', 12)])
               .advance_watermark_to(20)  # Past window time
               .advance_processing_time(20)
               .add_elements([TimestampedValue('late_window_closed', 9),
                              TimestampedValue('in_time_window2_2', 12)])
               .advance_watermark_to_infinity())

class RecordFn(beam.DoFn):
    def process(
            self,
            element=beam.DoFn.ElementParam,
            timestamp=beam.DoFn.TimestampParam):
        yield ("key", (element, timestamp))

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with TestPipeline(options=options) as p:
    records = (p | test_stream
                 | beam.ParDo(RecordFn())
                 | beam.WindowInto(FixedWindows(10),
                                   allowed_lateness=0,
                                   # trigger=trigger.AfterCount(1),
                                   trigger=trigger.Repeatedly(trigger.AfterCount(1)),
                                   accumulation_mode=trigger.AccumulationMode.DISCARDING)
                 | beam.GroupByKey()
                 | beam.Map(print)
               )
If we have the trigger trigger.Repeatedly(trigger.AfterCount(1)), all elements are fired as they come, with no dropped elements (except late_window_closed, which is expected since it was late):
('key', [('in_time_1', Timestamp(0)), ('in_time_2', Timestamp(0))]) # these two are together since they arrived together
('key', [('late_but_in_window', Timestamp(8))])
('key', [('in_time_window2', Timestamp(12))])
('key', [('in_time_window2_2', Timestamp(12))])
If we use trigger.AfterCount(1) (no Repeatedly), we only get the first elements that arrived in the pipeline:
('key', [('in_time_1', Timestamp(0)), ('in_time_2', Timestamp(0))])
('key', [('in_time_window2', Timestamp(12))])
Note that both in_time_1 and in_time_2 appear in the first fired pane because they arrived at the same time (0); had one of them arrived later, it would have been dropped.

Related

How to aggregate events in flink stream before merging with current state by reduce function?

My events are like: case class Event(user: User, stats: Map[StatType, Int])
Every event contains +1 or -1 values in it.
I have a current pipeline that works fine, but it produces a new event for every change of the statistics:
eventsStream
  .keyBy(extractKey)
  .reduce(reduceFunc)
  .map(prepareRequest)
  .addSink(sink)
I'd like to aggregate these increments in a time window before merging them with the current state. So I want the same rolling reduce but with a time window.
Current simple rolling reduce:
500 – last reduced value
+1
-1
+1
Emitted events: 501, 500, 501
Rolling reduce with a window:
500 – last reduced value
v-- window
+1
-1
+1
^-- window
Emitted events: 501
I've tried a naive solution of putting a time window just before the reduce, but after reading the docs I see that reduce now has a different behavior:
eventsStream
  .keyBy(extractKey)
  .timeWindow(Time.minutes(2))
  .reduce(reduceFunc)
  .map(prepareRequest)
  .addSink(sink)
It seems that I should make a keyed stream and reduce it again after reducing my time window:
eventsStream
  .keyBy(extractKey)
  .timeWindow(Time.minutes(2))
  .reduce(reduceFunc)
  .keyBy(extractKey)
  .reduce(reduceFunc)
  .map(prepareRequest)
  .addSink(sink)
Is this the right pipeline to solve the problem?
There are probably different options, but one would be to implement a WindowFunction and then run apply after the windowing:
eventsStream
  .keyBy(extractKey)
  .timeWindow(Time.minutes(2))
  .apply(new MyWindowFunction)
(WindowFunction takes type parameters for the type of the input value, the type of the output value, the type of the key and the type of the window.)
There's an example of that here. Let me copy the relevant snippet:
/** User-defined WindowFunction to compute the average temperature of SensorReadings */
class TemperatureAverager extends WindowFunction[SensorReading, SensorReading, String, TimeWindow] {

  /** apply() is invoked once for each window */
  override def apply(
      sensorId: String,
      window: TimeWindow,
      vals: Iterable[SensorReading],
      out: Collector[SensorReading]): Unit = {

    // compute the average temperature
    val (cnt, sum) = vals.foldLeft((0, 0.0))((c, r) => (c._1 + 1, c._2 + r.temperature))
    val avgTemp = sum / cnt

    // emit a SensorReading with the average temperature
    out.collect(SensorReading(sensorId, window.getEnd, avgTemp))
  }
}
I don't know how your data looks so I can't attempt a full answer, but that should serve as inspiration.
Yes, your proposed pipeline will have the desired effect. The window will reduce together the 2-minute batches. The results of those batches will flow into the final reduce, which will produce an updated result after each of its inputs (which are the window results).
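In case it helps, here is a minimal sketch of what reduceFunc could look like for the Event shape given in the question, assuming the intent is to sum the per-StatType increments (the User and StatType types come from the question, and keeping the first event's user is an assumption):

// Sketch only: assumes case class Event(user: User, stats: Map[StatType, Int]) from the question.
// Keeps the user of the first event and sums the stat counters of both events, stat type by stat type.
val reduceFunc: (Event, Event) => Event = (a, b) => {
  val mergedStats = (a.stats.keySet ++ b.stats.keySet).map { statType =>
    statType -> (a.stats.getOrElse(statType, 0) + b.stats.getOrElse(statType, 0))
  }.toMap
  Event(a.user, mergedStats)
}

The same function can then be passed both to the windowed reduce and to the final keyed reduce in the proposed pipeline.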

Kafka Streams - GroupBy - Late Event - persistentWindowStore - WindowBy with Grace Period and Suppress

My purpose is to calculate success and fail messages from source to destination per second and to sum their results on a daily basis.
I had two options to do that:
Option 1: stream events, then group them by time#source#destination.
KeyValueBytesStoreSupplier streamStore = Stores.persistentKeyValueStore("store-name");

sourceStream
    .selectKey((k, v) -> v.getDataTime() + KEY_SEPERATOR + SRC + KEY_SEPERATOR + DEST)
    .groupByKey()
    .aggregate(
        DO SOME Aggregation,
        Materialized.<String, AggregationObject>as(streamStore)
            .withKeySerde(Serdes.String())
            .withValueSerde(AggregationObjectSerdes));
After trying the approach above, we noticed that the state store keeps growing because the number of unique keys is increasing and, if I am correct, because the state topics are only "compact" they never expire.
NumberOfUniqueKeys = 86,400 seconds in a day X SOURCE X DESTINATION
Then we thought that if we did not put a time field in the key, we could reduce the state store size. We tried a windowing operation as the second approach.
Option 2: use a windowing operation with persistentWindowStore, CustomTimeStampExtractor, windowedBy and Suppress.
WindowBytesStoreSupplier streamStore =
    Stores.persistentWindowStore("store-name", Duration.ofHours(6), Duration.ofSeconds(1), false);

sourceStream
    .selectKey((k, v) -> SRC + KEY_SEPERATOR + DEST)
    .groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofSeconds(1)).grace(Duration.ofSeconds(5)))
    .aggregate(
        {
            DO SOME Aggregation
        },
        Materialized.<String, AggregationObject>as(streamStore)
            .withKeySerde(Serdes.String())
            .withValueSerde(AggregationObjectSerdes))
    .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
    .toStream();
After trying the second approach, we reduced the state store size, but now we had a problem with late-arriving events. We then added a 5-second grace period together with the suppress operation, but using the grace period and suppress did not guarantee that all late-arriving events would be handled. Another side effect of the suppress operation is latency, because it only emits the aggregation result after the window's grace period has passed.
By the way, using the windowing operation caused us to get a WARNING message like:
"WARN 1 --- [-StreamThread-2] o.a.k.s.state.internals.WindowKeySchema : Warning: window end time was truncated to Long.MAX"
I checked the reason in the source code and found it here:
https://github.com/a0x8o/kafka/blob/master/streams/src/main/java/org/apache/kafka/streams/state/internals/WindowKeySchema.java
/**
 * Safely construct a time window of the given size,
 * taking care of bounding endMs to Long.MAX_VALUE if necessary
 */
static TimeWindow timeWindowForSize(final long startMs,
                                    final long windowSize) {
    long endMs = startMs + windowSize;
    if (endMs < 0) {
        LOG.warn("Warning: window end time was truncated to Long.MAX");
        endMs = Long.MAX_VALUE;
    }
    return new TimeWindow(startMs, endMs);
}
But it actually does not make any sense to me how endMs can be lower than 0...
Questions:
If we go with approach 1, how can we reduce the state store size? In approach 1 it was guaranteed that all events would be processed and no events would be missed because of latency.
If we go with approach 2, how should I tune my logic to catch late-arriving data and reduce latency?
Why do I get the warning message in approach 2, although all time fields are positive in my model?
What other options can you suggest besides these two approaches?
I need some expert help :)
BR,
According to the Kafka mailing list, regarding the warning message
"WARN 1 --- [-StreamThread-2] o.a.k.s.state.internals.WindowKeySchema : Warning: window end time was truncated to Long.MAX"
this is what was written to me:
You can get this message "o.a.k.s.state.internals.WindowKeySchema :
Warning: window end time was truncated to Long.MAX" when your
TimeWindowDeserializer is created without a windowSize. There are two
constructors for a TimeWindowDeserializer, are you using the one with
WindowSize?
https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/kstream/TimeWindowedDeserializer.java#L46-L55
It calls WindowKeySchema with a Long.MAX_VALUE
https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/kstream/TimeWindowedDeserializer.java#L84-L90
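To illustrate what that reply is pointing at, here is a minimal sketch (Scala, using the Java Kafka Streams API) of creating the windowed key deserializer with an explicit window size; the 1-second size is taken from the TimeWindows.of(Duration.ofSeconds(1)) in the question, and whether this is the right place to set it depends on where the windowed keys are actually deserialized in your setup:

import java.time.Duration
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.kstream.TimeWindowedDeserializer

// Assumed window size: 1 second, matching TimeWindows.of(...) in the question.
val windowSizeMs: Long = Duration.ofSeconds(1).toMillis

// Constructor that takes the window size explicitly, so the window end is
// computed as start + windowSizeMs instead of falling back to Long.MAX_VALUE.
val windowedKeyDeserializer =
  new TimeWindowedDeserializer[String](Serdes.String().deserializer(), windowSizeMs)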

10 most common female first names - order changes

I am running through an exercise in Databricks, and the code below returns firstName in a different order every time I run it. Please explain why the order is not the same for every run:
val peopleDF = spark.read.parquet("/mnt/training/dataframes/people-10m.parquet")
The schema of peopleDF is:
id:integer
firstName:string
middleName:string
lastName:string
gender:string
birthDate:timestamp
ssn:string
salary:integer
/* Create a DataFrame called top10FemaleFirstNamesDF that contains the 10 most common female first names out of the people data set.*/
import org.apache.spark.sql.functions.count
val top10FemaleFirstNamesDF_1 = peopleDF
  .filter($"gender" === "F")
  .groupBy($"firstName")
  .agg(count($"firstName").alias("cnt_firstName"))
  .withColumn("cnt_firstName", $"cnt_firstName".cast("Int"))
  .sort($"cnt_firstName".desc)
  .limit(10)
val top10FemaleNamesDF = top10FemaleFirstNamesDF_1.orderBy($"firstName")
In some runs the assertion passes and in some runs it fails:
lazy val results = top10FemaleNamesDF.collect()
dbTest("DF-L2-names-0", Row("Alesha", 1368), results(0))
// dbTest("DF-L2-names-1", Row("Alice", 1384), results(1))
// dbTest("DF-L2-names-2", Row("Bridgette", 1373), results(2))
// dbTest("DF-L2-names-3", Row("Cristen", 1375), results(3))
// dbTest("DF-L2-names-4", Row("Jacquelyn", 1381), results(4))
// dbTest("DF-L2-names-5", Row("Katherin", 1373), results(5))
// dbTest("DF-L2-names-5", Row("Lashell", 1387), results(6))
// dbTest("DF-L2-names-7", Row("Louie", 1382), results(7))
// dbTest("DF-L2-names-8", Row("Lucille", 1384), results(8))
// dbTest("DF-L2-names-9", Row("Sharyn", 1394), results(9))
println("Tests passed!")
The problem might be the limit(10). Due to the distributed nature of Spark, you can't assume that every time it runs the limit function it is going to give you the same result. Spark might pick a different partition in different runs to give you the 10 elements.
If the underlying data is split across multiple partitions, then every time you evaluate it, limit might be pulling from a different partition.
However, I do realize you are sorting the data first and then limiting on that. The limit function is supposed to return deterministic results when the underlying RDD is sorted; it might be non-deterministic for unsorted data.
It will be worthwhile to see the physical plan of your query.
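Along the lines of that last suggestion, explain is a quick way to inspect the plan (it exists on any DataFrame; the extended flag also prints the logical plans). As a rough guide, sort(...).limit(n) usually compiles to a TakeOrderedAndProject node, which computes a global top-n, whereas a limit without a sort can pull different rows from different partitions across runs:

// Print the parsed/analyzed/optimized logical plans and the physical plan.
top10FemaleFirstNamesDF_1.explain(true)
top10FemaleNamesDF.explain(true)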

PySpark filtering gives inconsistent behavior

So I have a data set where I do some transformations and the last step is to filter out rows that have a 0 in a column called frequency. The code that does the filtering is super simple:
def filter_rows(self, name: str = None, frequency_col: str = 'frequency', threshold: int = 1):
    df = getattr(self, name)
    df = df.where(df[frequency_col] >= threshold)
    setattr(self, name, df)
    return self
The problem is a very strange behavior where if I put a rather high threshold like 10, it works fine, filtering out all the rows below 10. But if I make the threshold just 1, it does not remove the 0s! Here is an example of the former (threshold=10):
{"user":"XY1677KBTzDX7EXnf-XRAYW4ZB_vmiNvav7hL42BOhlcxZ8FQ","domain":"3a899ebbaa182778d87d","frequency":12}
{"user":"lhoAWb9U9SXqscEoQQo9JqtZo39nutq3NgrJjba38B10pDkI","domain":"3a899ebbaa182778d87d","frequency":9}
{"user":"aRXbwY0HcOoRT302M8PCnzOQx9bOhDG9Z_fSUq17qtLt6q6FI","domain":"33bd29288f507256d4b2","frequency":23}
{"user":"RhfrV_ngDpJex7LzEhtgmWk","domain":"390b4f317c40ac486d63","frequency":14}
{"user":"qZqqsNSNko1V9eYhJB3lPmPp0p5bKSq0","domain":"390b4f317c40ac486d63","frequency":11}
{"user":"gsmP6RG13azQRmQ-RxcN4MWGLxcx0Grs","domain":"f4765996305ccdfa9650","frequency":10}
{"user":"jpYTnYjVkZ0aVexb_L3ZqnM86W8fr082HwLliWWiqhnKY5A96zwWZKNxC","domain":"f4765996305ccdfa9650","frequency":15}
{"user":"Tlgyxk_rJF6uE8cLM2sArPRxiOOpnLwQo2s","domain":"f89838b928d5070c3bc3","frequency":17}
{"user":"qHu7fpnz2lrBGFltj98knzzbwWDfU","domain":"f89838b928d5070c3bc3","frequency":11}
{"user":"k0tU5QZjRkBwqkKvMIDWd565YYGHfg","domain":"f89838b928d5070c3bc3","frequency":17}
And now here is some of the data with threshold=1:
{"user":"KuhSEPFKACJdNyMBBD2i6ul0Nc_b72J4","domain":"d69cb6f62b885fec9b7d","frequency":0}
{"user":"EP1LomZ3qAMV3YtduC20","domain":"d69cb6f62b885fec9b7d","frequency":0}
{"user":"UxulBfshmCro-srE3Cs5znxO5tnVfc0_yFps","domain":"d69cb6f62b885fec9b7d","frequency":1}
{"user":"v2OX7UyvMVnWlDeDyYC8Opk-va_i8AwxZEsxbk","domain":"d69cb6f62b885fec9b7d","frequency":0}
{"user":"4hu1uE2ucAYZIrNLeOY2y9JMaArFZGRqjgKzlKenC5-GfxDJQQbLcXNSzj","domain":"68b588cedbc66945c442","frequency":0}
{"user":"5rFMWm_A-7N1E9T289iZ65TIR_JG_OnZpJ-g","domain":"68b588cedbc66945c442","frequency":1}
{"user":"RLqoxFMZ7Si3CTPN1AnI4hj6zpwMCJI","domain":"68b588cedbc66945c442","frequency":1}
{"user":"wolq9L0592MGRfV_M-FxJ5Wc8UUirjqjMdaMDrI","domain":"68b588cedbc66945c442","frequency":0}
{"user":"9spTLehI2w0fHcxyvaxIfo","domain":"68b588cedbc66945c442","frequency":1}
I should note that before this step I perform some other transformations, and I've noticed weird behavior in Spark in the past: sometimes doing very simple things like this after a join or a union can give very strange results, and eventually the only solution is to write out the data, read it back in again, and do the operation in a completely separate script. I hope there is a better solution than this!

spark filter on external variable

I have a series of sales records which are in an RDD like so,
case class salesRecord(startDate: Int, startTime: Int, itemNumber: Int)
val transactions: RDD[salesRecords]
Each day is separated by a startDate and a startTime, which is the seconds since midnight. I need to be able to filter the start time between two bands which have been input by the user.
So imagine the user has input:
val timeBandStart: Int //example 100
val timeBandEnd: Int // 5000
and they only want the records of each day between these bands. To do this I've tried the following:
val timeFiltered = transactions.filter { record =>
  record.startTime >= timeBandStart && record.startTime <= timeBandEnd
}
The issue I'm facing is that I get nothing in my output. I know for sure that the records are within these time bands, so they should appear in my output. To try and debug this, I've tried to extract the time bands I'm filtering on, with this:
val test = transactions.map(record => (record.startTime, timeBandStart, timeBandEnd))
My output of this is the following: (32342, 0, 0), (32455, 0, 0).
Why are my timeBands not being set within the filter? I thought this could be something to do with the variables not being broadcast to all of the nodes, so I tried passing the time bands as a broadcast variable. But that didn't work...
It's probably something really stupid; can somebody point out what I'm doing wrong?
Cheers!
I've fixed my bug; it's related to this issue: https://issues.apache.org/jira/browse/SPARK-4170
I'm now not extending from App and instead using the main method. My filter is now being applied correctly.
Thanks for your help!
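For anyone hitting the same thing, here is a minimal sketch of the change described above (the object name, parser and file path are made up for illustration). With extends App, vals defined in the object body can still be uninitialized (0 here) when the object is loaded on the executors, which matches the (32342, 0, 0) output; moving the code into a regular main method initializes them before the closure is shipped:

import org.apache.spark.{SparkConf, SparkContext}

case class SalesRecord(startDate: Int, startTime: Int, itemNumber: Int)

object SalesJob {
  // Hypothetical parser, only so the sketch is self-contained.
  def parseSalesRecord(line: String): SalesRecord = {
    val Array(d, t, i) = line.split(",")
    SalesRecord(d.toInt, t.toInt, i.toInt)
  }

  // A plain main method instead of `object SalesJob extends App { ... }` (SPARK-4170).
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sales-filter"))

    val timeBandStart = 100  // example values from the question
    val timeBandEnd = 5000

    val transactions = sc.textFile("sales.txt").map(parseSalesRecord)
    val timeFiltered = transactions.filter { record =>
      record.startTime >= timeBandStart && record.startTime <= timeBandEnd
    }
    println(timeFiltered.count())
    sc.stop()
  }
}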