applying keyed state on top of stream from co group stream

applying keyed state on top of stream from co group stream - apache-kafka

I have two kafka sources
I am trying to perform world count and merge the counts from two streams
I have created window of 1 min for both data streams and applying coGroupBykey , from DoFn , i am emitting <Key,Value> (word,count)
On top of this coGroupByKey function , I am applying stateful ParDo
Let say if i get (Test,2) from stream 1, (Test,3) from stream 2 in same window time then in CogroupByKey function , i ll merge as (Test,5), but if they are not falling in same window , i will emit (Test,2) and (Test,3)
Now i will apply state for merging these elements
So finally as result i should get (Test,5), but i am not getting the expected result , All elements form stream 1 are going to one partition and
elements from stream 2 to another partition , thats why i am getting result
(Test,2)
(Test,3)
// word count stream from kafka topic 1
PCollection<KV<String,Long>> stream1 = ...
// word count stream from kafka topic 2
PCollection<KV<String,Long>> stream2 = ...
PCollection<KV<String,Long>> windowed1 =
stream1.apply(
Window
.<KV<String,Long>>into(FixedWindows.of(Duration.millis(60000)))
.triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
.withAllowedLateness(Duration.millis(1000))
.discardingFiredPanes());
PCollection<KV<String,Long>> windowed2 =
stream2.apply(
Window
.<KV<String,Long>>into(FixedWindows.of(Duration.millis(60000)))
.triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
.withAllowedLateness(Duration.millis(1000))
.discardingFiredPanes());
final TupleTag<Long> count1 = new TupleTag<Long>();
final TupleTag<Long> count2 = new TupleTag<Long>();
// Merge collection values into a CoGbkResult collection.
PCollection<KV<String, CoGbkResult>> joinedStream =
KeyedPCollectionTuple.of(count1, windowed1).and(count2, windowed2)
.apply(CoGroupByKey.<String>create());
// applying state operation after coGroupKey fun
PCollection<KV<String,Long>> finalCountStream =
joinedStream.apply(ParDo.of(
new DoFn<KV<String, CoGbkResult>, KV<String,Long>>() {
#StateId(stateId)
private final StateSpec<MapState<String, Long>> mapState =
StateSpecs.map();
#ProcessElement
public void processElement(
ProcessContext processContext,
#StateId(stateId) MapState<String, Long> state) {
KV<String, CoGbkResult> element = processContext.element();
Iterable<Long> count1 = element.getValue().getAll(web);
Iterable<Long> count2 = element.getValue().getAll(assist);
Long sumAmount =
StreamSupport
.stream(
Iterables.concat(count1, count2).spliterator(), false)
.collect(Collectors.summingLong(n -> n));
System.out.println(element.getKey()+"::"+sumAmount);
// processContext.output(element.getKey()+"::"+sumAmount);
Long currCount =
state.get(element.getKey()).read() == null
? 0L
: state.get(element.getKey()).read();
Long newCount = currCount+sumAmount;
state.put(element.getKey(),newCount);
processContext.output(KV.of(element.getKey(),newCount));
}
}));
finalCountStream
.apply("finalState", ParDo.of(new DoFn<KV<String,Long>, String>() {
#StateId(myState)
private final StateSpec<MapState<String, Long>> mapState =
StateSpecs.map();
#ProcessElement
public void processElement(
ProcessContext c,
#StateId(myState) MapState<String, Long> state) {
KV<String,Long> e = c.element();
Long currCount = state.get(e.getKey()).read()==null
? 0L
: state.get(e.getKey()).read();
Long newCount = currCount+e.getValue();
state.put(e.getKey(),newCount);
c.output(e.getKey()+":"+newCount);
}
}))
.apply(KafkaIO.<Void, String>write()
.withBootstrapServers("localhost:9092")
.withTopic("test")
.withValueSerializer(StringSerializer.class)
.values());

Alternatively, you can use a Flatten + Combine approach, which should be give you simpler code:
PCollection<KV<String, Long>> pc1 = ...;
PCollection<KV<String, Long>> pc2 = ...;
PCollectionList<KV<String, Long>> pcs = PCollectionList.of(pc1).and(pc2);
PCollection<KV<String, Long>> merged = pcs.apply(Flatten.<KV<String, Long>>pCollections());
merged.apply(windiw...).apply(Combine.perKey(Sum.ofLongs()))

You have set up both streams with the trigger Repeatedly.forever(AfterPane.elementCountAtLeast(1)) and discardingFiredPanes(). This will cause the CoGroupByKey to output as soon as possible after each input element and then reset its state each time. So it is normal behavior that it basically passes each input straight through.
Let me explain more: CoGroupByKey is executed like this:
All elements from stream1 and stream2 are tagged as you specified. So every (key, value1) from stream1 effectively becomes (key, (count1, value1)). And every (key, value2) from stream2 becomes `(key, (count2, value2))
These tagged collects are flattened together. So now there is one collection with elements like (key, (count1, value1)) and (key, (count2, value2)).
The combined collection goes through a normal GroupByKey. This is where triggers happen. So with the default trigger, you get (key, [(count1, value1), (count2, value2), ...]) with all the values for a key getting grouped. But with your trigger, you will often get separate (key, [(count1, value1)]) and (key, [(count2, value2)]) because each grouping fires right away.
The output of the GroupByKey is wrapped in just an API that is CoGbkResult. In many runners this is just a filtered view of the grouped iterable.
Of course, triggers are nondeterministic and runners are also allowed to have different implementations of CoGroupByKey. But the behavior you are seeing is expected. You probably don't want to use trigger like that or discarding mode, or else you need to do more grouping downstream.
Generally, doing a join with CoGBK is going to require some work downstream, until Beam supports retractions.

PipelineOptions options = PipelineOptionsFactory.create();
options.as(FlinkPipelineOptions.class)
.setRunner(FlinkRunner.class);
Pipeline p = Pipeline.create(options);
PCollection<KV<String,Long>> stream1 = new KafkaWordCount("localhost:9092","test1")
.build(p);
PCollection<KV<String,Long>> stream2 = new KafkaWordCount("localhost:9092","test2")
.build(p);
PCollectionList<KV<String, Long>> pcs = PCollectionList.of(stream1).and(stream2);
PCollection<KV<String, Long>> merged = pcs.apply(Flatten.<KV<String, Long>>pCollections());
merged.apply("finalState", ParDo.of(new DoFn<KV<String,Long>, String>() {
#StateId(myState)
private final StateSpec<MapState<String, Long>> mapState = StateSpecs.map();
#ProcessElement
public void processElement(ProcessContext c, #StateId(myState) MapState<String, Long> state){
KV<String,Long> e = c.element();
System.out.println("Thread ID :"+ Thread.currentThread().getId());
Long currCount = state.get(e.getKey()).read()==null? 0L:state.get(e.getKey()).read();
Long newCount = currCount+e.getValue();
state.put(e.getKey(),newCount);
c.output(e.getKey()+":"+newCount);
}
})).apply(KafkaIO.<Void, String>write()
.withBootstrapServers("localhost:9092")
.withTopic("test")
.withValueSerializer(StringSerializer.class)
.values()
);
p.run().waitUntilFinish();

Related

Kafka streams join duplicates

Please don't mark this question as a duplicate of kafka-streams join produce duplicates. I think my scenario is different. I'm also already using kafka EOS via processing.guarantee=exactly_once
I have an input topic transactions_topic with json data that looks like
{
"timestamp": "2022-10-08T13:04:30Z",
"transactionId": "842d38ea-1d3d-41a4-b724-bcc7e81aec9a",
"accountId": "account123",
"amount": 1.0
}
It's represented as a simple class using lombok #Data
#Data
class Transaction {
String transactionId;
String timestamp;
String accountId;
Double amount;
}
I want to compute the total amount spent by accountId for the past 1 hour, past 1 day and past 30 days. These computations are the features represented by the the following class
#Data
public class Features {
double totalAmount1Hour;
double totalAmount1Day;
double totalAmount30Day;
}
I'm using kafka-streams and springboot to achieve this.
First I subscribe to the input topic and select the accountId as key
KStream<String, Transaction> kStream = builder.stream(inputTopic,
Consumed.with(Serdes.String(), new JsonSerde<>(Transaction.class)).
withTimestampExtractor(new TransactionTimestampExtractor())).
selectKey((k,v)-> v.getAccountId());
TransactionTimestampExtractor is implemented as follows
public class TransactionTimestampExtractor implements TimestampExtractor {
#Override
public long extract(ConsumerRecord<Object, Object> consumerRecord, long l) {
Transaction value = (Transaction) consumerRecord.value();
long epoch = Instant.parse(value.getTimestamp()).toEpochMilli();
return epoch;
}
}
Now in order to compute the total amount for the past 1 hour, past 1 day and past 30 days, I created a function that will aggregate the amount based on a sliding window
private <T> KStream<String, T> windowAggregate(KStream<String, Transaction> kStream,
SlidingWindows window,
Initializer<T> initializer,
Aggregator<String, Transaction, T> aggregator,
Class<T> t) {
return kStream.
groupByKey(Grouped.with(Serdes.String(), new JsonSerde<>(Transaction.class))).
windowedBy(window).
aggregate(initializer,
aggregator,
Materialized.with(Serdes.String(), Serdes.serdeFrom(t))).
suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded())).
toStream().
map((k, v) -> KeyValue.pair(k.key(), v));
}
Now we can use it like
Aggregator<String, Transaction, Double> amountAggregator = (k, v, aggregate) -> aggregate + v.getAmount();
KStream<String, Double> totalAmount1Hour = windowAggregate(kStream, SlidingWindows.ofTimeDifferenceWithNoGrace(Duration.ofHours(1)), () -> 0.0, amountAggregator, Double.class);
KStream<String, Double> totalAmount1Day = windowAggregate(kStream, SlidingWindows.ofTimeDifferenceWithNoGrace(Duration.ofDays(1)), () -> 0.0, amountAggregator, Double.class);
KStream<String, Double> totalAmount30Day = windowAggregate(kStream, SlidingWindows.ofTimeDifferenceWithNoGrace(Duration.ofDays(30)), () -> 0.0, amountAggregator, Double.class);
Now all I need to do is to join these streams and return a new stream with Features as values
private KStream<String, Features> joinAmounts(KStream<String, Double> totalAmount1Hour, KStream<String, Double> totalAmount1Day, KStream<String, Double> totalAmount30Day) {
JoinWindows joinWindows = JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofSeconds(0));
KStream<String, Features> totalAmount1HourAnd1Day = totalAmount1Hour.join(totalAmount1Day,
(amount1Hour, amount1Day) -> {
Features features = new Features();
features.setTotalAmount1Hour(amount1Hour);
features.setTotalAmount1Day(amount1Day);
return features;
},
joinWindows,
StreamJoined.with(Serdes.String(), Serdes.Double(), Serdes.Double()));
KStream<String, Features> featuresKStream = totalAmount1HourAnd1Day.join(totalAmount30Day,
(features, amount30Day) -> {
features.setTotalAmount30Day(amount30Day);
return features;
},
joinWindows,
StreamJoined.with(Serdes.String(), new JsonSerde<>(Features.class), Serdes.Double()));
return featuresKStream;
}
I print the features stream for debugging purposes
KStream<String, Features> features = joinAmounts(totalAmount1Hour, totalAmount1Day, totalAmount30Day);
features.print(Printed.<String, Features>toSysOut().withLabel("features"));
This works and prints the correct values for the features however when I process the same payload more than once, the features stream produces duplicates. For example processing the following payload twice produces the following output.
{
"timestamp":"2022-10-08T01:09:32Z",
"accountId":"account1",
"transactionId":"33694a6e-8c15-4cc2-964a-b8b0ecce2682",
"amount":1.0
}
Output
[features]: account1, Features(totalAmount1Hour=2.0, totalAmount1Day=1.0, totalAmount30Day=1.0)
[features]: account1, Features(totalAmount1Hour=1.0, totalAmount1Day=2.0, totalAmount30Day=1.0)
[features]: account1, Features(totalAmount1Hour=2.0, totalAmount1Day=2.0, totalAmount30Day=1.0)
[features]: account1, Features(totalAmount1Hour=1.0, totalAmount1Day=1.0, totalAmount30Day=2.0)
[features]: account1, Features(totalAmount1Hour=2.0, totalAmount1Day=1.0, totalAmount30Day=2.0)
[features]: account1, Features(totalAmount1Hour=1.0, totalAmount1Day=2.0, totalAmount30Day=2.0)
[features]: account1, Features(totalAmount1Hour=2.0, totalAmount1Day=2.0, totalAmount30Day=2.0)
My expected output would be just the last one
[features]: account1, Features(totalAmount1Hour=2.0, totalAmount1Day=2.0, totalAmount30Day=2.0)
How can I achive this and get rid of the duplicates in the features stream? Is kafka-streams join() doing a cartesian product because I have the same timestamp and key?

Yes, the toStream will convert from a KTable back to a KStream, giving you full changelogs for the tables. Then, for every single change of the each of the 3 tables, you will also get a join result.
Maybe a better idea to achieve what you want is to chain your aggregations. So that you generate the KTable for 1 hour changes, and from this table you derive the 1 day changes, and from the resulting table you finally generate the 30 day changes. See this Wiki page for an example: https://cwiki.apache.org/confluence/display/KAFKA/Windowed+aggregations+over+successively+increasing+timed+windows

Limit the size of a kafka streams session window

I have some code that aggregates all events occuring (read: uploaded to kafka) close to each other in time, this sounds like the perfect case for "SessionWindow" to me. However, the session window tend to get very big, like 2-10Mbs, I tried reconfiguring the topic and producer to allow this.
But in my case the sessions may grow even bigger in the future. So my question is: Can I cap the size of a session window, so that a real big window becomes split before put on the changelog?
private val log: Logger = logger()
private val sessionWindowForBatchUploads = SessionWindows.ofInactivityGapAndGrace(appConfig.batchSessionInactivityGap, appConfig.batchSessionGracePeriod)
private val stateLogConfig: Map<String, String> = mapOf("max.message.bytes" to appConfig.batchSessionStateLogMaxMessageBytes)
private val maxHeapUseageSupress = heapCheckService.getMaxHeapMemory() / 2
fun start(builder: StreamsBuilder): StreamsBuilder {
val result = builder.stream<String, GenericRecord>(appConfig.metadataTopic)
.mapValues { _, v -> Avro.default.fromRecord(deserializer = IngestionMetadataEvent.serializer(), record = v) }
.filter { _, v -> v.serialNumber.isNotEmpty() || v.sampleRecorded.isNotEmpty() }
.mapValues { _, v -> IngestionMetadataEventsWindow(v) }
.groupByKey()
.windowedBy(sessionWindowForBatchUploads)
// note: timestamp of ingestion is used for batching, not sampleRecorded time.
.reduce(
{ agg, v -> agg.append(v) },
Materialized.`as`<String?, IngestionMetadataEventsWindow?, SessionStore<Bytes, ByteArray>?>(appConfig.batchSessionStoreName)
.withKeySerde(Serdes.String())
.withLoggingEnabled(stateLogConfig)
.withRetention(appConfig.batchSessionRetention)
.withValueSerde(IngestionMetadataEventsWindow.internalSerde())
)
.suppress(
Suppressed
.untilWindowCloses(unbounded().withMaxBytes(maxHeapUseageSupress))
)
// output topic for testing:
.mapValues { key, metadataSet ->
log.debug("batch for ${key.key()} of size: ${metadataSet.events.size}, size_class: ${bigBatchLogger(metadataSet.events.size)}")
storersInvokeable.invokeAll(metadataSet.events)
metadataSet
}
}

Group by and then ungroup causes loss of all events on beam TestPipeline

We found this line(mainly the reshuffle to be INTERMITTENTLY dropping events(causing flaky tests) on apache beam org.apache.beam.sdk.testing.TestPipeline ->
PCollection<OrderlyBeamDto<RosterRecord>> records = PCollectionList.of(csv.get(TupleTags.OUTPUT_RECORD)).and(excel.get(TupleTags.OUTPUT_RECORD))
.apply(Flatten.pCollections())
.apply("Reshuffle Records", Reshuffle.viaRandomKey());
soooo, I modified it to this and now we lose ALL events. In fact, the UngroupFn() is never called
PCollection<OrderlyBeamDto<RosterRecord>> records = PCollectionList.of(csv.get(TupleTags.OUTPUT_RECORD)).and(excel.get(TupleTags.OUTPUT_RECORD))
.apply(Flatten.pCollections())
.apply(ParDo.of(new CreateKeyFn()))
.apply(GroupByKey.<String, OrderlyBeamDto<RosterRecord>>create())
.apply(ParDo.of(new UngroupFn()));
The code for Ungroup is simply this
#ProcessElement
public void processElement(final ProcessContext c) {
log.info("running ungroup");
KV<String, Iterable<OrderlyBeamDto<RosterRecord>>> keyValue = c.element();
Iterable<OrderlyBeamDto<RosterRecord>> list = keyValue.getValue();
for(OrderlyBeamDto<RosterRecord> record : list) {
log.info("ungroup="+record.getValue().getRowNumber());
c.output(record);
}
}
We never see the 'running ungroup' log :(

kafka streams groupBy aggregate produces unexpected values

my question is about Kafka streams Ktable.groupBy.aggregate. and the resulting aggregated values.
situation
I am trying to aggregate minute events per day.
I have a minute event generator (not shown here) that generates events for a few houses. Sometimes the event value is wrong and the minute event must be republished.
Minute events are published in the topic "minutes".
I am doing an aggregation of these events per day and house using kafka Streams groupBy and aggregate.
problem
Normally as there are 1440 minutes in a day, there should never have an aggregation with more than 1440 values.
Also there should never have an aggregation with a negative amount of events.
... But it happens anyways and we do not understand what is wrong in our code.
sample code
Here is a sample simplified code to illustrate the problem. The IllegalStateException is thrown sometimes.
StreamsBuilder builder = new StreamsBuilder();
KTable<String, MinuteEvent> minuteEvents = builder.table(
"minutes",
Consumed.with(Serdes.String(), minuteEventSerdes),
Materialized.<String, MinuteEvent, KeyValueStore<Bytes, byte[]>>with(Serdes.String(), minuteEventSerdes)
.withCachingDisabled());
// preform daily aggregation
KStream<String, MinuteAggregate> dayEvents = minuteEvents
// group by house and day
.filter((key, minuteEvent) -> minuteEvent != null && StringUtils.isNotBlank(minuteEvent.house))
.groupBy((key, minuteEvent) -> KeyValue.pair(
minuteEvent.house + "##" + minuteEvent.instant.atZone(ZoneId.of("Europe/Paris")).truncatedTo(ChronoUnit.DAYS), minuteEvent),
Grouped.<String, MinuteEvent>as("minuteEventsPerHouse")
.withKeySerde(Serdes.String())
.withValueSerde(minuteEventSerdes))
.aggregate(
MinuteAggregate::new,
(String key, MinuteEvent value, MinuteAggregate aggregate) -> aggregate.addLine(key, value),
(String key, MinuteEvent value, MinuteAggregate aggregate) -> aggregate.removeLine(key, value),
Materialized
.<String, MinuteAggregate, KeyValueStore<Bytes, byte[]>>as(BILLLINEMINUTEAGG_STORE)
.withKeySerde(Serdes.String())
.withValueSerde(minuteAggSerdes)
.withLoggingEnabled(new HashMap<>())) // keep this aggregate state forever
.toStream();
// check daily aggregation
dayEvents.filter((key, value) -> {
if (value.nbValues < 0) {
throw new IllegalStateException("got an aggregate with a negative number of values " + value.nbValues);
}
if (value.nbValues > 1440) {
throw new IllegalStateException("got an aggregate with too many values " + value.nbValues);
}
return true;
}).to("days", minuteAggSerdes);
and here are the sample class used in this code snippet :
public class MinuteEvent {
public final String house;
public final double sensorValue;
public final Instant instant;
public MinuteEvent(String house,double sensorValue, Instant instant) {
this.house = house;
this.sensorValue = sensorValue;
this.instant = instant;
}
}
public class MinuteAggregate {
public int nbValues = 0;
public double totalSensorValue = 0.;
public String house = "";
public MinuteAggregate addLine(String key, MinuteEvent value) {
this.nbValues = this.nbValues + 1;
this.totalSensorValue = this.totalSensorValue + value.sensorValue;
this.house = value.house;
return this;
}
public MinuteAggregate removeLine(String key, MinuteEvent value) {
this.nbValues = this.nbValues -1;
this.totalSensorValue = this.totalSensorValue - value.sensorValue;
return this;
}
public MinuteAggregate() {
}
}
If someone could tell us what we are doing wrong here and why we have these unexpected values that would be great.
additional notes
we configure our stream job to run with 4 threads properties.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);
we are forced to use a Ktable.groupBy().aggregate() because minute values can be
republished with different sensorValue for an already published Instant. And daily aggregation modified accordingly.
Stream.groupBy().aggregate() does not have an adder AND a substractor.

I think, it is actually possible that the count become negative temporary.
The reason is, that each update in your first KTable sends two messaged downstream -- the old value to be subtracted in the downstream aggregation and the new value to be added to the downstream aggregation. Both message will be processed independently in the downstream aggregation.
If the current count is zero, and a subtractions is processed before an addition, the count would become negative temporarily.

What is the equivalent of Flink's tumbling/session window join in Kafka Streams?

According to documentation of Kafka Streams, we can apply KStream-to-KStream joins only by defining a lower and upper bound:
KStream<String, String> joined = left.join(right,
(leftValue, rightValue) -> ... , /* ValueJoiner */
JoinWindows.of(Duration.ofMinutes(5)),
Joined.with(
Serdes.String(), /* key */
Serdes.Long(), /* left value */
Serdes.Double()) /* right value */
);
This only joins elements that are near to each other defined by length of join window. However, I'm looking for joining two tumbling windows which can be done in Flink:
DataStream<Integer> left = ...
DataStream<Integer> right = ...
left.join(right)
.where(<LeftKeySelector>)
.equalTo(<RightKeySelector>)
.window(TumblingEventTimeWindows.of(Time.milliseconds(2)))
.apply (new JoinFunction<Integer, Integer, String> (){
#Override
public String join(Integer first, Integer second) {
return first + "," + second;
}
});

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

applying keyed state on top of stream from co group stream - apache-kafka

Related

Kafka streams join duplicates

Limit the size of a kafka streams session window

Group by and then ungroup causes loss of all events on beam TestPipeline

kafka streams groupBy aggregate produces unexpected values

What is the equivalent of Flink's tumbling/session window join in Kafka Streams?

Categories

Resources