Kafka Streams - set a different time window depending on the message group

I wonder whether, given a KStream, it is possible to set a different time window depending on the message group, for example 5 seconds for group "A", 10 seconds for group "B", and so on.
KStream<String, Msg> stream = builder.stream(stringSerde, msgSerde, input);
stream.groupBy((key, msg) -> msg.getPool())
      .aggregate(init, agg, TimeWindows.of(wndLength).advanceBy(wndLength), msgSerde)
      ...

The simplest way that comes to mind is to .filter() or .branch() before you .groupBy()/.aggregate(), like:
KStream<String, Msg> stream = builder.stream(stringSerde, msgSerde, input);
stream.filter((key, msg) -> msg.getPool().equals("A"))
      .groupBy((key, msg) -> msg.getPool())
      .aggregate(init, agg, TimeWindows.of(wndLength).advanceBy(wndLength), msgSerde)
      ...
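For completeness, a minimal sketch of the branch() variant with a different window length per pool. This is written against a newer Streams API than the question; the pool names, the window lengths, and the assumption that init/agg produce a Msg are illustrative:

// Split the stream by pool, then window each branch with its own length.
@SuppressWarnings("unchecked")
KStream<String, Msg>[] branches = stream.branch(
        (key, msg) -> msg.getPool().equals("A"),
        (key, msg) -> msg.getPool().equals("B"));

// 5-second windows for pool "A"
KTable<Windowed<String>, Msg> aggA = branches[0]
        .groupBy((key, msg) -> msg.getPool(), Grouped.with(Serdes.String(), msgSerde))
        .windowedBy(TimeWindows.of(Duration.ofSeconds(5)))
        .aggregate(init, agg, Materialized.with(Serdes.String(), msgSerde));

// 10-second windows for pool "B"
KTable<Windowed<String>, Msg> aggB = branches[1]
        .groupBy((key, msg) -> msg.getPool(), Grouped.with(Serdes.String(), msgSerde))
        .windowedBy(TimeWindows.of(Duration.ofSeconds(10)))
        .aggregate(init, agg, Materialized.with(Serdes.String(), msgSerde));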

Related

Intermittent incorrect count from TimeWindowKStream in kafka streams

My intention with this topology is to window incoming messages, count them, and send the count to another topic.
When I test this with a single key and one or more values on the input topic, I get inconsistent results. Sometimes the count is correct. Sometimes I send in a single message and see it at the first peek, but instead of getting a count of 1, I get some other value at the second peek and in the output topic. When I send in multiple messages, the count is usually right, but sometimes off. I'm careful to send the messages inside the time window, so I don't think they're getting split into two windows.
Is there a flaw in my topology?
public static final String INPUT_TOPIC = "test-topic";
public static final String OUTPUT_TOPIC = "test-output-topic";

public static void buildTopo(StreamsBuilder builder) {
    WindowBytesStoreSupplier store = Stores.persistentTimestampedWindowStore(
            "my-state-store",
            Duration.ofDays(1),
            Duration.ofMinutes(1),
            false);
    Materialized<String, Long, WindowStore<Bytes, byte[]>> materialized = Materialized
            .<String, Long>as(store)
            .withKeySerde(Serdes.String());
    Suppressed<Windowed> suppression = Suppressed
            .untilWindowCloses(Suppressed.BufferConfig.unbounded());
    TimeWindows window = TimeWindows
            .of(Duration.ofMinutes(1))
            .grace(Duration.ofSeconds(0));

    // windowedKey has a string plus the kafka time window
    builder.stream(INPUT_TOPIC, Consumed.with(Serdes.String(), Serdes.String()))
            .peek((key, value) -> System.out.println("****key = " + key + " value= " + value))
            .groupByKey()
            .windowedBy(window)
            .count(materialized)
            .suppress(suppression)
            .toStream()
            .peek((key, value) -> System.out.println("key = " + key + " value= " + value))
            .map((key, value) -> new KeyValue<>(key.key(), value))
            .to(OUTPUT_TOPIC, Produced.with(Serdes.String(), Serdes.Long()));
}

merge records in a kafka stream

Is it possible to merge records in Kafka and publish the output to a different stream?
For example, there is a stream of events produced to a Kafka topic like the one below:
{txnId:1,startTime:0900},{txnId:1,endTime:0905},{txnId:2,endTime:0912},{txnId:3,endTime:0930},{txnId:2,startTime:0912},{txnId:3,startTime:0925}......
I want to merge these events by txnId and create merged output like the one below:
{txnId:1,startTime:0900,endTime:0905},{txnId:2,startTime:0912,endTime:0912},{txnId:3,startTime:0925,endTime:0930}
Please note that order is not maintained in the incoming events. So if the endTime event is received for a txnId before the startTime event, then we need to wait until the startTime event is received for that txnId before initiating the merge.
I went through the word count example that comes with the Kafka Streams examples, but it's not clear how to wait for events and then merge them while doing the transformation.
Any thoughts are highly appreciated.
You could try solving this by splitting the start and end events into two separate streams, with txnId as the key, and then joining the two streams.
KStream<String, String> eventSource = new StreamsBuilder().stream("INPUT-TOPIC");

KStream<String, JsonNode>[] splitEvents = eventSource
        .map((key, eventString) -> {
            try {
                JsonNode event = new ObjectMapper().readTree(eventString);
                String txnId = event.path("txnId").asText();
                return KeyValue.pair(txnId, event);
            } catch (JsonProcessingException e) {
                throw new RuntimeException(e);
            }
        })
        .branch((key, event) -> event.findValue("startTime") != null,
                (key, event) -> event.findValue("endTime") != null);

KStream<String, JsonNode> startEvents = splitEvents[0];
KStream<String, JsonNode> endEvents = splitEvents[1];
A join between the two streams as shown will produce a join result once matching events have arrived on both sides of the join, so the order in which the two events arrive won't matter (you will have to make sure you set an appropriate window period for the join).
Serde<JsonNode> jsonSerde = Serdes.serdeFrom(new JsonSerializer(), new JsonDeserializer());

KStream<String, String> completeEvents = startEvents.join(endEvents,
        (startEvent, endEvent) -> {
            // Add logic to merge startEvent and endEvent as seen fit
            ObjectNode completeEvent = JsonNodeFactory.instance.objectNode();
            completeEvent.put("startTime", startEvent.path("startTime").asText());
            completeEvent.put("endTime", endEvent.path("endTime").asText());
            return completeEvent.toString();
        },
        JoinWindows.of(Duration.ofMinutes(15)),
        Joined.with(
                Serdes.String(), // key
                jsonSerde,       // left value
                jsonSerde        // right value
        )
);
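To publish the merged output to a different stream, as asked in the question, the joined stream can then be written to an output topic. A minimal sketch; the topic name here is an illustrative assumption:

// Write the merged events to a separate output topic (topic name is an assumption)
completeEvents.to("MERGED-OUTPUT-TOPIC", Produced.with(Serdes.String(), Serdes.String()));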

Kafka Streams windowedBy not returning after the specified time

My requirement is to group the messages coming into my Kafka topic that have the same key, and this grouping has to happen within a window of 5 seconds, so my application should return the grouped elements every 5 seconds. Below is the application I have written. The problem with the code below is that it does not return the group of events after 5 seconds; it returns them much later, like 15 seconds or 30 seconds or more, seemingly at random.
KStream<String, String> source = builder
        .stream(sourceTopic, Consumed.with(Serdes.String(), Serdes.String()))
        .filter((key, value) -> Objects.nonNull(value));

final KTable<Windowed<String>, List<String>> aggTable = source
        .groupByKey(Serialized.with(Serdes.String(), new JsonSerde<>(String.class, objectMapper)))
        .windowedBy(TimeWindows.of(TimeUnit.SECONDS.toMillis(5)))
        .aggregate(ArrayList::new,
                (key, value, aggregator) -> {
                    aggregator.add(value);
                    return aggregator;
                },
                Materialized.<String, List<String>, WindowStore<Bytes, byte[]>>as("stateStore")
                        .withValueSerde(newStatusEventHolderJsonSerde()));
Can you please let us know whether we need to do any extra coding to make the stream return results immediately after the specified window time?

Kafka Stream producing custom list of messages based on certain conditions

We have the following stream processing requirement.
Source Stream ->
transform(condition check - If (true) then generate MULTIPLE ADDITIONAL messages else just transform the incoming message) ->
output kafka topic
Example:
If the condition is true for message B (D, E, F are the additional messages produced):
A,B,C -> A,D,E,F,C -> Sink Kafka Topic
If the condition is false:
A,B,C -> A,B,C -> Sink Kafka Topic
Is there a way we can achieve this in Kafka streams?
You can use the flatMap() or flatMapValues() methods. These methods take one record and produce zero, one, or more records.
flatMap() can modify the key, the value, and their data types, while flatMapValues() retains the original key and changes only the value and the value's data type.
Here is example pseudocode, assuming the new messages "C", "D", and "E" each get a new key.
KStream<String, String> inputStream = builder.stream("inputTopic");

KStream<String, String> outStream = inputStream.flatMap(
        (key, value) -> {
            List<KeyValue<String, String>> result = new LinkedList<>();
            // If the message value is "B"; otherwise, place your condition based on your data
            if (value.equals("B")) {
                result.add(KeyValue.pair("<new key for message C>", "C"));
                result.add(KeyValue.pair("<new key for message D>", "D"));
                result.add(KeyValue.pair("<new key for message E>", "E"));
            } else {
                result.add(KeyValue.pair(key, value));
            }
            return result;
        });

outStream.to("sinkTopic");
You can read more about this here:
https://docs.confluent.io/current/streams/developer-guide/dsl-api.html#streams-developer-guide-dsl-transformations-stateless
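For comparison, here is a minimal sketch of the flatMapValues() variant mentioned above, which keeps the incoming key for every produced record; the topic name is the same illustrative one used above:

// flatMapValues() keeps the original key and only expands or replaces the value
KStream<String, String> valuesOnlyOut = inputStream.flatMapValues(
        value -> value.equals("B")
                ? Arrays.asList("C", "D", "E")           // expand "B" into multiple values under the same key
                : Collections.singletonList(value));     // pass everything else through unchanged

valuesOnlyOut.to("sinkTopic");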

How to send final kafka-streams aggregation result of a time windowed KTable?

What I'd like to do is this:
Consume records from a numbers topic (Longs)
Aggregate (count) the values for each 5 sec window
Send the FINAL aggregation result to another topic
My code looks like this:
KStream<String, Long> longs = builder.stream(
        Serdes.String(), Serdes.Long(), "longs");

// In one KTable, count by key, on a five second tumbling window.
KTable<Windowed<String>, Long> longCounts =
        longs.countByKey(TimeWindows.of("longCounts", 5000L));

// Finally, sink to the long-counts topic.
longCounts.toStream((wk, v) -> wk.key())
        .to("long-counts");
It looks like everything works as expected, but the aggregations are sent to the destination topic for each incoming record. My question is how can I send only the final aggregation result of each window?
In Kafka Streams there is no such thing as a "final aggregation". Windows are kept open all the time to handle out-of-order records that arrive after the window end time has passed. However, windows are not kept forever. They get discarded once their retention time expires, and no special action is triggered when a window gets discarded.
See Confluent documentation for more details: http://docs.confluent.io/current/streams/
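As a side note on retention: a minimal sketch of how the retention time can be configured on the windowed aggregation, written against a newer Streams API than the question; the store name, window size, and retention period are illustrative assumptions:

// Retention controls how long a window is kept around for late records before it is discarded.
KTable<Windowed<String>, Long> longCounts = longs
        .groupByKey(Grouped.with(Serdes.String(), Serdes.Long()))
        .windowedBy(TimeWindows.of(Duration.ofSeconds(5)))
        .count(Materialized.<String, Long, WindowStore<Bytes, byte[]>>as("long-counts-store")
                .withRetention(Duration.ofMinutes(10)));   // keep windows for ~10 minutes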
Thus, for each update to an aggregation, a result record is produced (because Kafka Streams also updates the aggregation result on out-of-order records). Your "final result" would be the latest result record (before a window gets discarded). Depending on your use case, manual de-duplication would be a way to resolve the issue (using the lower-level API, i.e. transform() or process()).
This blog post might help, too: https://timothyrenner.github.io/engineering/2016/08/11/kafka-streams-not-looking-at-facebook.html
Another blog post addressing this issue without using punctuations: http://blog.inovatrend.com/2018/03/making-of-message-gateway-with-kafka.html
Update
With KIP-328, a KTable#suppress() operator was added that allows suppressing consecutive updates in a strict manner and emitting a single result record per window; the tradeoff is increased latency.
From Kafka Streams version 2.1, you can achieve this using suppress.
There is an example in the Apache Kafka Streams documentation that sends an alert when a user has fewer than three events in an hour:
KGroupedStream<UserId, Event> grouped = ...;

grouped
        .windowedBy(TimeWindows.of(Duration.ofHours(1)).grace(ofMinutes(10)))
        .count()
        .suppress(Suppressed.untilWindowCloses(unbounded()))
        .filter((windowedUserId, count) -> count < 3)
        .toStream()
        .foreach((windowedUserId, count) -> sendAlert(windowedUserId.window(), windowedUserId.key(), count));
As mentioned in the update of this answer, you should be aware of the tradeoff. Moreover, note that suppress() is based on event-time.
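Applied to the question's 5-second count, a minimal sketch that combines the original topics with the suppress approach; the grace period and serdes are assumptions:

// Count per key in 5-second windows and emit a single result per window once it closes
builder.stream("longs", Consumed.with(Serdes.String(), Serdes.Long()))
        .groupByKey()
        .windowedBy(TimeWindows.of(Duration.ofSeconds(5)).grace(Duration.ZERO))
        .count()
        .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
        .toStream()
        .map((windowedKey, count) -> KeyValue.pair(windowedKey.key(), count))
        .to("long-counts", Produced.with(Serdes.String(), Serdes.Long()));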
I faced the same issue and solved it by adding grace(0) after the fixed window and using the Suppressed API:
public void process(KStream<SensorKeyDTO, SensorDataDTO> stream) {
    buildAggregateMetricsBySensor(stream)
            .to(outputTopic, Produced.with(String(), new SensorAggregateMetricsSerde()));
}

private KStream<String, SensorAggregateMetricsDTO> buildAggregateMetricsBySensor(KStream<SensorKeyDTO, SensorDataDTO> stream) {
    return stream
            .map((key, val) -> new KeyValue<>(val.getId(), val))
            .groupByKey(Grouped.with(String(), new SensorDataSerde()))
            .windowedBy(TimeWindows.of(Duration.ofMinutes(WINDOW_SIZE_IN_MINUTES)).grace(Duration.ofMillis(0)))
            .aggregate(SensorAggregateMetricsDTO::new,
                    (String k, SensorDataDTO v, SensorAggregateMetricsDTO va) -> aggregateData(v, va),
                    buildWindowPersistentStore())
            .suppress(Suppressed.untilWindowCloses(unbounded()))
            .toStream()
            .map((key, value) -> KeyValue.pair(key.key(), value));
}

private Materialized<String, SensorAggregateMetricsDTO, WindowStore<Bytes, byte[]>> buildWindowPersistentStore() {
    return Materialized
            .<String, SensorAggregateMetricsDTO, WindowStore<Bytes, byte[]>>as(WINDOW_STORE_NAME)
            .withKeySerde(String())
            .withValueSerde(new SensorAggregateMetricsSerde());
}