Kafka windowed stream: make grace and suppress key-aware - apache-kafka

I currently have a simple stream of data, for example:
|-----|--------|-------|
| Key | TS(ms) | Value |
|-----|--------|-------|
| A   | 1000   | 0     |
| A   | 1000   | 0     |
| A   | 61000  | 0     |
| A   | 61000  | 0     |
| A   | 121000 | 0     |
| A   | 121000 | 0     |
| A   | 181000 | 10    |
| A   | 181000 | 10    |
| A   | 241000 | 10    |
| A   | 241000 | 10    |
| B   | 1000   | 0     |
| B   | 1000   | 0     |
| B   | 61000  | 0     |
| B   | 61000  | 0     |
| B   | 121000 | 0     |
| B   | 121000 | 0     |
| B   | 181000 | 10    |
| B   | 181000 | 10    |
| B   | 1000   | 10    |
| B   | 241000 | 10    |
| B   | 241000 | 10    |
|-----|--------|-------|
This is also the order in which I publish the data to the topic. The value isn't really an integer but an Avro record, and the key is a string.
My code is this:
KStream<Windowed<String>, Long> aggregatedStream = inputStream
    .groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofMinutes(1)).grace(Duration.ZERO))
    .count()
    .toStream();

aggregatedStream.print(Printed.toSysOut());
The output of the print is:
[KTABLE-TOSTREAM-0000000003]: [A#0/60000], 1
[KTABLE-TOSTREAM-0000000003]: [A#0/60000], 2
[KTABLE-TOSTREAM-0000000003]: [A#60000/120000], 1
[KTABLE-TOSTREAM-0000000003]: [A#60000/120000], 2
[KTABLE-TOSTREAM-0000000003]: [A#120000/180000], 1
[KTABLE-TOSTREAM-0000000003]: [A#120000/180000], 2
[KTABLE-TOSTREAM-0000000003]: [A#180000/240000], 1
[KTABLE-TOSTREAM-0000000003]: [A#180000/240000], 2
[KTABLE-TOSTREAM-0000000003]: [A#240000/300000], 1
[KTABLE-TOSTREAM-0000000003]: [A#240000/300000], 2
[KTABLE-TOSTREAM-0000000003]: [B#240000/300000], 1
[KTABLE-TOSTREAM-0000000003]: [B#240000/300000], 2
It seems that the grace period is applied globally, independently of the key of the stream. What I expect instead (if possible) is to receive all 10 window counts for key A and all 10 window counts for key B, i.e. grace should only close windows based on the key of the stream.
Is that possible?

It seems that grace and suppress use a single stream-time timestamp per partition, so it's not possible to have a different one per key.
What you can do instead is drop the grace period and, instead of the regular suppress, use a custom transformer that suppresses per key.
For example this is part of our code:
KStream<String, ...> aggregatedStream = pairsStream
    .groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofMinutes(1)))
    .aggregate(...your aggregation logic...)
    .toStream()
    .flatTransform(new TransformerSupplier<Windowed<String>, AggregateOutput, Iterable<KeyValue<String, SuppressedOutput>>>() {
        @Override
        public Transformer<Windowed<String>, AggregateOutput, Iterable<KeyValue<String, SuppressedOutput>>> get() {
            return new Transformer<Windowed<String>, AggregateOutput, Iterable<KeyValue<String, SuppressedOutput>>>() {
                KeyValueStore<String, SuppressedOutput> store;

                @SuppressWarnings("unchecked")
                @Override
                public void init(ProcessorContext context) {
                    store = (KeyValueStore<String, SuppressedOutput>) context.getStateStore("suppress-store");
                }

                @Override
                public Iterable<KeyValue<String, SuppressedOutput>> transform(Windowed<String> window, AggregateOutput sequenceList) {
                    String messageKey = window.key();
                    long windowEndTimestamp = window.window().endTime().toEpochMilli();
                    SuppressedOutput currentSuppressedOutput = new SuppressedOutput(windowEndTimestamp, sequenceList);
                    SuppressedOutput storeValue = store.get(messageKey);
                    if (storeValue == null) {
                        // First time we receive a window for that key
                    } else if (windowEndTimestamp > storeValue.getTimestamp()) {
                        // Received a new window: the previous window for this key can now be considered closed
                    } else if (windowEndTimestamp < storeValue.getTimestamp()) {
                        // Window older than the last window we've received for this key
                    }
                    store.put(messageKey, currentSuppressedOutput);
                    // Whatever is returned here is forwarded downstream; an empty list suppresses the update
                    return new ArrayList<>();
                }

                @Override
                public void close() {
                }
            };
        }
    }, "suppress-store");

Related

Conditionally lag value over multiple rows

I am trying to find cases where one type of error causes multiple sequential instances of a second type of error on a vehicle. For example, if there are two vehicles, 'a' and 'b', and vehicle a has an error of type 1 ('error_1') on day 0, it can cause errors of type 2 ('error_2') on days 1, 2, 3, and 4. I want to create a variable named cascading_error that shows every consecutive error_2 following an error_1. Note that in the case of vehicle b, it is possible to have an error_2 without a preceding error_1, in which case the value for cascading_error should be 0.
Here's what I've tried:
from pyspark.sql import functions as F, Window

vals = [('a',0,1,0),('a',1,0,1),('a',2,0,1),('a',3,0,1),('a',4,0,1),
        ('b',0,0,0),('b',1,0,0),('b',2,0,1),('b',3,0,1)]
df = spark.createDataFrame(vals, ['vehicle','day','error_1','error_2'])
w = Window.partitionBy('vehicle').orderBy('day')
df = df.withColumn('cascading_error', F.lag(df.error_1).over(w) * df.error_2)
df = df.withColumn('cascading_error', F.when((F.lag(df.cascading_error).over(w) == 1) & (df.error_2 == 1), F.lit(1)).otherwise(df.cascading_error))
df.show()
This is my result
| vehicle | day | error_1 | error_2 | cascading_error |
| ------- | --- | ------- | ------- | --------------- |
| a | 0 | 1 | 0 | null |
| a | 1 | 0 | 1 | 1 |
| a | 2 | 0 | 1 | 1 |
| a | 3 | 0 | 1 | 0 |
| a | 4 | 0 | 1 | 0 |
| b | 0 | 0 | 0 | null |
| b | 1 | 0 | 0 | 0 |
| b | 2 | 0 | 1 | 0 |
| b | 3 | 0 | 1 | 0 |
The code is generating the correct cascading_error value on days 1 and 2 for vehicle a, but not on days 3 and 4, which should also be 1. It seems that the logic of combining cascading_error with error_2 to update cascading_error only works for a single row, not sequential ones.
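One way around that (a sketch of a possible approach, not the original code; the cascade_start and streak_id columns are names made up here) is to mark the row right after an error_1 as the start of a cascade, give each consecutive run of error_2 rows its own streak id, and propagate the start flag through the streak with a running maximum:
from pyspark.sql import functions as F, Window

w = Window.partitionBy('vehicle').orderBy('day')

# 1 on the row immediately after an error_1 (potential start of a cascade)
df = df.withColumn('cascade_start', F.coalesce(F.lag('error_1').over(w), F.lit(0)))

# every error_2 == 0 row breaks the streak, so a running count of breaks
# gives each consecutive run of error_2 rows the same streak id
df = df.withColumn('streak_id', F.sum((df.error_2 == 0).cast('int')).over(w))

# inside one streak, the flag stays 1 once a cascade has started
w_streak = (Window.partitionBy('vehicle', 'streak_id').orderBy('day')
            .rowsBetween(Window.unboundedPreceding, Window.currentRow))
df = df.withColumn('cascading_error', F.max('cascade_start').over(w_streak) * df.error_2)
df.show()
On the sample data this yields 1 for vehicle a on days 1 through 4 and 0 everywhere else.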

Compare consecutive rows and extract words(excluding the subsets) in spark

I am working on a Spark dataframe. The input dataframe looks like below (Table 1). I need to write logic to get the keywords with the maximum length for each session id. There can be multiple keywords in the output for each session id. The expected output looks like Table 2.
Input dataframe:
(Table 1)
|-----------+------------+-----------------------------------|
| session_id| value | Timestamp |
|-----------+------------+-----------------------------------|
| 1 | cat | 2021-01-11T13:48:54.2514887-05:00 |
| 1 | catc | 2021-01-11T13:48:54.3514887-05:00 |
| 1 | catch | 2021-01-11T13:48:54.4514887-05:00 |
| 1 | par | 2021-01-11T13:48:55.2514887-05:00 |
| 1 | part | 2021-01-11T13:48:56.5514887-05:00 |
| 1 | party | 2021-01-11T13:48:57.7514887-05:00 |
| 1 | partyy | 2021-01-11T13:48:58.7514887-05:00 |
| 2 | fal | 2021-01-11T13:49:54.2514887-05:00 |
| 2 | fall | 2021-01-11T13:49:54.3514887-05:00 |
| 2 | falle | 2021-01-11T13:49:54.4514887-05:00 |
| 2 | fallen | 2021-01-11T13:49:54.8514887-05:00 |
| 2 | Tem | 2021-01-11T13:49:56.5514887-05:00 |
| 2 | Temp | 2021-01-11T13:49:56.7514887-05:00 |
|-----------+------------+-----------------------------------|
Expected Output:
(Table 2)
|-----------+------------+
| session_id| value |
|-----------+------------+
| 1 | catch |
| 1 | partyy |
| 2 | fallen |
| 2 | Temp |
|-----------+------------|
Solution I tried:
I added another column called col_length which captures the length of each word in the value column. Later on I tried to compare each row with its subsequent row to see if it is of maximum length. But this solution only works partially.
val df = spark.read.parquet("/project/project_name/abc")
val dfM = df.select($"session_id", $"value", $"Timestamp").withColumn("col_length", length($"value"))
val ts = Window
  .orderBy("session_id")
  .rangeBetween(Window.unboundedPreceding, Window.currentRow)
val result = dfM
  .withColumn("running_max", max("col_length") over ts)
  .where($"running_max" === $"col_length")
  .select("session_id", "value", "Timestamp")
Current Output:
|-----------+------------+
| session_id| value |
|-----------+------------+
| 1 | catch |
| 2 | fallen |
|-----------+------------|
Multiple columns do not work inside an orderBy clause with a window function, so I didn't get the desired output; I got one output per session id. Any suggestions would be highly appreciated. Thanks in advance.
You can solve it by using the lead function: within each session, keep a row when the next value (by timestamp) is shorter than the current one, or when there is no next value.
val windowSpec = Window.partitionBy("session_id").orderBy("Timestamp")
dfM
  .withColumn("lead", lead("value", 1).over(windowSpec))
  .filter((functions.length(col("lead")) < functions.length(col("value"))) || col("lead").isNull)
  .drop("lead")
  .show

JPA : how to generate a common id?

I need to allocate the same unique id (batchid) to each row inserted in a DB during a batch execution, as illustrated below.
| id | batchid |
| -- | ------- |
| 1 | 1 |
| 2 | 1 |
| 3 | 2 |
| 4 | 2 |
| 5 | 2 |
| 6 | 3 |
I was wondering if there is an automated way to do it with a JPA annotation, like with a sequence?
For now I did it this way:
@Repository
public interface SeqRepository extends JpaRepository<CsvEntity, Long> {

    @Query(value = "SELECT nextval('batch_id_seq')", nativeQuery = true)
    Integer getNextBatchId();
}
schema.sql
CREATE SEQUENCE IF NOT EXISTS batch_id_seq
INCREMENT 1
START 1;
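To stamp a whole batch with that value, one option is to call getNextBatchId() once per batch and set it on every entity before saving. A minimal sketch (the batchId field on CsvEntity and the surrounding service method are assumptions here):
@Transactional
public void saveBatch(List<CsvEntity> entities) {
    // one nextval per batch execution
    Integer batchId = seqRepository.getNextBatchId();
    for (CsvEntity entity : entities) {
        entity.setBatchId(batchId); // assumes CsvEntity maps the batchid column
    }
    seqRepository.saveAll(entities);
}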

Multiple Joins with Filters in DAX

I'm trying to recreate a simple SQL query in DAX. The output query needs to work in Power BI Report Builder, and I have been trying all day, reading all sorts of Power BI / DAX online resources, to rewrite this.
A little bit about the data:
The data is structured in three tables, CustomCar, Engine and Chassis.
Basically "CarId" is the key that connects all three tables.
Let's assume all tables have more than 20 columns. so only a few of the columns are needed in the final output.
All three tables (CustomCar, Chassis and Engine) have an IsActive property. The relationship between Engine/Chassis and CustomCar is many-to-one: an engine might blow up and get replaced, so we want to track which engine is on the car today and which engine was on it last year; however, at any time there is only one active engine for each car. The same goes for Chassis.
Both Engine and Chassis have 'Manufacturer' and 'Model' columns, so in the output query they need to be distinguished from each other.
I am not trying to sum any sort of sales number, just a list of cars with their current configuration.
Any help is appreciated.
Select
    CC.Name, CC.Model as 'CustomCarModel', CC.MaxSpeed,
    Ch.Manufacturer as 'ChassisManufacturer', Ch.Model as 'ChassisModel', Ch.ManufacturedDate as 'ChassisManfDate',
    E.Manufacturer as 'EngineManufacturer', E.Model as 'EngineModel', E.Power, E.CylCount, E.ManufacturedDate
From CustomCars CC
Join Chassis Ch on Ch.CarID = CC.CarId
Join Engine E on E.CarID = CC.CarID
Where
    CC.IsActive = 1 and CC.FirstTestDriveYear < 1980 and
    Ch.IsActive = 1 and
    E.IsActive = 1
More info, here are my tables.
Classic Car:
CarId (Primary Key) | Model | MaxSpeed | NumOfPax | TankCapacity | IsActive | FirstTestDriveYear |....
1 | SuperChev | 220 | 2 | 60 | 1 | 1985 |
2 | CustomBranco | 185 | 2 | 90 | 1 | 1979 |
3 | RebuiltToyo | 251 | 4 | 20 | 0 | 1990 |
Chassis:
ChassisId (Primary Key) | CarId (Foreign Key)| IsActive | Manufacturer | Model | ManufacturedDate | ...
1 | 1 | 0 | ACME Chassis | M1 | '04-Jan-1985' | ...
2 | 1 | 1 | SuperChassis | T5 | '03-Feb-1987' | ...
3 | 2 | 0 | Ford | S2 | '25-Mar-1965' | ...
4 | 2 | 0 | Ford | S2 | '25-Mar-1968' | ...
5 | 3 | 0 | JapanChass | X123 | '25-Feb-1988' | ...
6 | 2 | 1 | Ford | S8 | '08-Jul-1978' | ...
7 | 2 | 0 | Ford | S2 | '25-Mar-1968' | ...
8 | 3 | 1 | JapanChass | Y765 | '25-Feb-1992' | ...
Engine:
EngineId (Primary Key) | CarId (Foreign Key)| IsActive | Manufacturer | Model | ManufacturedDate | Power | CylCount | ...
1 | 1 | 0 | GM | AB1 | '04-Jan-1985' | 320 | 8 | ...
2 | 1 | 1 | Bently | ZY2 | '03-Feb-1987' | 285 | 8 | ...
3 | 2 | 0 | Ford | S2 | '25-Mar-1965' | 290 | 6 | ...
4 | 2 | 0 | Ford | S2 | '25-Mar-1968' | 292 | 6 | ...
5 | 3 | 0 | Toyota | X123 | '25-Feb-1988' | 180 | 4 | ...
6 | 2 | 1 | Ford | S8 | '08-Jul-1978' | 222 | 8 | ...
7 | 2 | 0 | Ford | S2 | '25-Mar-1968' | 320 | 8 | ...
8 | 3 | 1 | Toyota | Y765 | '25-Feb-1992' | 211 | 6 | ...
I have found a workaround for this: I added the SQL query when setting up the data pipeline in the Power BI dashboard and will use the values from the query as is.

Count occurrences of value in field for a particular ID using Redshift

I want to count the occurrences of particular values in a certain field for an ID. So what I have is this:
| Location ID | Group |
|:----------- |:---------|
| 1 | Group A |
| 2 | Group B |
| 3 | Group C |
| 4 | Group A |
| 4 | Group B |
| 4 | Group C |
| 3 | Group A |
| 2 | Group B |
| 1 | Group C |
| 2 | Group A |
And what I would hope to yield through some computer magic is this:
| Location ID | Group A Count | Group B Count | Group C count|
|:----------- |:--------------|:--------------|:-------------|
| 1 | 1 | 0 | 1 |
| 2 | 1 | 2 | 0 |
| 3 | 1 | 0 | 1 |
| 4 | 1 | 1 | 1 |
Is there some sort of pivoting function I can use in Redshift to achieve this?
This will require the CASE expression and the GROUP BY clause, as in the example below.
SELECT l_id,
       SUM(CASE WHEN l_group = 'Group A' THEN 1 ELSE 0 END) AS a,
       SUM(CASE WHEN l_group = 'Group B' THEN 1 ELSE 0 END) AS b -- and so on
FROM location
GROUP BY l_id;
This should give you a result like this:
| l_id | a | b |
|------|---|---|
| 4 | 1 | 1 |
| 1 | 1 | 0 |
| 3 | 1 | 0 |
| 2 | 1 | 2 |