merge records in a kafka stream - apache-kafka

Is it possible to merge records in Kafka and publish the output to a different stream?
For example, there is a stream of events produced to a Kafka topic like the one below:
{txnId:1,startTime:0900},{txnId:1,endTime:0905},{txnId:2,endTime:0912},{txnId:3,endTime:0930},{txnId:2,startTime:0912},{txnId:3,startTime:0925}......
I want to merge these events by txnId and create the merged output like below
{txnId:1,startTime:0900,endTime:0905},{txnId:2,startTime:0910,endTime:0912},{txnId:3,startTime:0925,endTime:0930}
Please note that order is not maintained in the incoming events. So if the endTime event is received for a txnId before the startTime event, we need to wait until the startTime event arrives for that txnId before initiating the merge.
I went through the word count example that ships with Kafka Streams, but it's not clear how to wait for events and then merge them while doing the transformation.
Any thoughts are highly appreciated.

You could try solving this by splitting the start and end events into two separate streams keyed by txnId and then joining the two streams.
StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> eventSource = builder.stream("INPUT-TOPIC");
ObjectMapper mapper = new ObjectMapper();
KStream<String, JsonNode>[] splitEvents =
    eventSource.map((key, eventString) -> {
        // Re-key each event by txnId and parse the JSON payload
        JsonNode event;
        try {
            event = mapper.readTree(eventString);
        } catch (IOException e) {
            throw new RuntimeException("Invalid event payload: " + eventString, e);
        }
        String txnId = event.path("txnId").asText();
        return KeyValue.pair(txnId, event);
    })
    .branch((key, event) -> event.findValue("startTime") != null,
            (key, event) -> event.findValue("endTime") != null);
KStream<String, JsonNode> startEvents = splitEvents[0];
KStream<String, JsonNode> endEvents = splitEvents[1];
A join between the two streams as shown will produce a join result once matching events have arrived on both sides within the join window, so the order in which the start and end events arrive won't matter (you will have to ensure that you set an appropriate window period for the join).
Serde<JsonNode> jsonSerde = Serdes.serdeFrom(new JsonSerializer(), new JsonDeserializer());
KStream<String, String> completeEvents = startEvents.join(endEvents,
    (startEvent, endEvent) -> {
        // Add logic to merge startEvent and endEvent as seen fit
        ObjectNode completeEvent = JsonNodeFactory.instance.objectNode();
        completeEvent.put("txnId", startEvent.path("txnId").asText());
        completeEvent.put("startTime", startEvent.path("startTime").asText());
        completeEvent.put("endTime", endEvent.path("endTime").asText());
        return completeEvent.toString();
    },
    JoinWindows.of(Duration.ofMinutes(15)),
    Joined.with(
        Serdes.String(), // key
        jsonSerde,       // left value (start events)
        jsonSerde        // right value (end events)
    )
);
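For completeness, here is a minimal sketch of writing the joined stream out and starting the application. The output topic name, application id, and broker address are assumptions for illustration, not part of the original answer.
completeEvents.to("MERGED-TOPIC", Produced.with(Serdes.String(), Serdes.String())); // hypothetical output topic

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "txn-merge-app");      // assumed application id
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker address

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));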

Related

Stale ktable records when joining kstream with ktable created by kstream aggregation

I'm trying to implement the event sourcing pattern with Kafka Streams in the following way.
I'm working on a Security service and handle two use cases:
Register User, handling RegisterUserCommand should produce UserRegisteredEvent.
Change User Name, handling ChangeUserNameCommand should produce UserNameChangedEvent.
I have two topics:
Command Topic, 'security-command'. Every command is keyed, and the key is the user's email. For example:
foo@bar.com:{"type": "RegisterUserCommand", "command": {"name":"Alex","email":"foo@bar.com"}}
foo@bar.com:{"type": "ChangeUserNameCommand", "command": {"email":"foo@bar.com","newName":"Alex1"}}
Event Topic, 'security-event'. Every record is keyed by the user's email:
foo@bar.com:{"type":"UserRegisteredEvent","event":{"email":"foo@bar.com","name":"Alex", "version":0}}
foo@bar.com:{"type":"UserNameChangedEvent","event":{"email":"foo@bar.com","name":"Alex1","version":1}}
Kafka Streams version 2.8.0
Kafka version 2.8
The implementation idea can be expressed in the following topology:
commandStream = builder.stream("security-command");
eventStream = builder.stream("security-event",
    Consumed.with(
        ...,
        new ZeroTimestampExtractor()
        /* always returns 0 to get the latest version of the snapshot */));

// build the snapshot to get the current state of the user.
userSnapshots = eventStream.groupByKey()
    .aggregate(() -> new UserSnapshot(),
               (key /* email */, event, currentSnapshot) -> currentSnapshot.apply(event));

// join commands with the latest snapshot at the time of the join
commandWithSnapshotStream =
    commandStream.leftJoin(
        userSnapshots,
        (command, snapshot) -> new CommandWithUserSnapshot(command, snapshot),
        joinParams
    );

// handle the command given the current snapshot
resultingEventStream = commandWithSnapshotStream.flatMap((key /* email */, commandWithSnapshot) -> {
    var newEvents = commandHandler(commandWithSnapshot.command(), commandWithSnapshot.snapshot());
    return Arrays.stream(newEvents)
        .map(e -> new KeyValue<String, DomainEvent>(e.email(), e))
        .toList();
});

// append events to the events topic
resultingEventStream.to("security-event");
For this topology, I'm using EOS exactly_once_beta.
A more explicit version of this topology:
KStream<String, Command<DomainEvent[]>> commandStream =
    builder.stream(
        commandTopic,
        Consumed.with(Serdes.String(), new SecurityCommandSerde()));

KStream<String, DomainEvent> eventStream =
    builder.stream(
        eventTopic,
        Consumed.with(
            Serdes.String(),
            new DomainEventSerde(),
            new LatestRecordTimestampExtractor() /* always returns 0 to get the latest version of the snapshot. */));

// build the snapshots ktable by aggregating all the current events for a given user.
KTable<String, UserSnapshot> userSnapshots =
    eventStream.groupByKey()
        .aggregate(
            () -> new UserSnapshot(),
            (email, event, currentSnapshot) -> currentSnapshot.apply(event),
            Materialized.with(
                Serdes.String(),
                new UserSnapshotSerde()));

// join command stream and snapshot table to get the stream of pairs <Command, UserSnapshot>
Joined<String, Command<DomainEvent[]>, UserSnapshot> commandWithSnapshotJoinParams =
    Joined.with(
        Serdes.String(),
        new SecurityCommandSerde(),
        new UserSnapshotSerde()
    );
KStream<String, CommandWithUserSnapshot> commandWithSnapshotStream =
    commandStream.leftJoin(
        userSnapshots,
        (command, snapshot) -> new CommandWithUserSnapshot(command, snapshot),
        commandWithSnapshotJoinParams
    );

var resultingEventStream = commandWithSnapshotStream.flatMap((key /* email */, commandWithSnapshot) -> {
    var command = commandWithSnapshot.command();
    if (command instanceof RegisterUserCommand registerUserCommand) {
        var handler = new RegisterUserCommandHandler();
        var events = handler.handle(registerUserCommand);
        // multiple events might be produced when a command is handled.
        return Arrays.stream(events)
            .map(e -> new KeyValue<String, DomainEvent>(e.email(), e))
            .toList();
    }
    if (command instanceof ChangeUserNameCommand changeUserNameCommand) {
        var handler = new ChangeUserNameCommandHandler();
        var events = handler.handle(changeUserNameCommand, commandWithSnapshot.userSnapshot());
        return Arrays.stream(events)
            .map(e -> new KeyValue<String, DomainEvent>(e.email(), e))
            .toList();
    }
    throw new IllegalArgumentException("...");
});

resultingEventStream.to(eventTopic, Produced.with(Serdes.String(), new DomainEventSerde()));
Problems I'm getting:
Launching the stream app on a command topic with existing records:
foo#bar.com:{"type": "RegisterUserCommand", "command": {"name":"Alex","email":"foo#bar.com"}}
foo#bar.com:{"type": "ChangeUserNameCommand", "command": {"email":"foo#bar.com","newName":"Alex1"}}
Outcome:
1. The stream application fails when processing the ChangeUserNameCommand, because the snapshot is null.
2. The events topic has a record for the successful registration, but nothing for the name change:
/*OK*/ foo@bar.com:{"type":"UserRegisteredEvent","event":{"email":"foo@bar.com","name":"Alex", "version":0}}
Thoughts:
When processing the ChangeUserNameCommand, the snapshot is missing from the aggregated KTable, userSnapshots. Restarting the application successfully produces the following record:
foo@bar.com: {"type":"UserNameChangedEvent","event":{"email":"foo@bar.com","name":"Alex1","version":1}}
Tried increasing the max.task.idle.ms to 4 seconds - no effect.
Launching the stream app and producing a set of ChangeUserNameCommand commands in quick succession.
Producing:
// Produce to command topic
foo@bar.com:{"type": "RegisterUserCommand", "command": {"name":"Alex","email":"foo@bar.com"}}
// event topic outcome
/*OK*/ foo@bar.com:{"type":"UserRegisteredEvent","event":{"email":"foo@bar.com","name":"Alex", "version":0}}
// Produce at once to command topic
foo@bar.com:{"type": "ChangeUserNameCommand", "command": {"email":"foo@bar.com","newName":"Alex1"}}
foo@bar.com:{"type": "ChangeUserNameCommand", "command": {"email":"foo@bar.com","newName":"Alex2"}}
foo@bar.com:{"type": "ChangeUserNameCommand", "command": {"email":"foo@bar.com","newName":"Alex3"}}
// event topic outcome
/*OK*/ foo@bar.com: {"type":"UserNameChangedEvent","event":{"email":"foo@bar.com","name":"Alex1","version":1}}
/*NOK*/ foo@bar.com: {"type":"UserNameChangedEvent","event":{"email":"foo@bar.com","name":"Alex2","version":1}}
/*NOK*/ foo@bar.com: {"type":"UserNameChangedEvent","event":{"email":"foo@bar.com","name":"Alex3","version":1}}
Thoughts:
The 'ChangeUserNameCommand' commands are joined with a stale version of the snapshot (note the version attribute).
The expected outcome would be:
foo#bar.com: {"type":"UserNameChangedEvent","event":{"email":"foo#bar.com","name":"Alex1","version":1}}
foo#bar.com: {"type":"UserNameChangedEvent","event":{"email":"foo#bar.com","name":"Alex2","version":2}}
foo#bar.com: {"type":"UserNameChangedEvent","event":{"email":"foo#bar.com","name":"Alex3","version":3}}
Tried increasing max.task.idle.ms to 4 seconds - no effect. Setting cache.max.bytes.buffering to 0 also has no effect.
What am I missing in building such a topology? I expect every command to be processed against the latest version of the snapshot. If I produce the commands with a few seconds of delay between them, everything works as expected.
I think you missed the changelog recovery part for the tables. Read the following to understand what happens during changelog recovery.
For tables, it is more complex because they must maintain additional information - their state - to allow for stateful processing such as joins and aggregations like COUNT() or SUM(). To achieve this while also ensuring high processing performance, tables (through their state stores) are materialized on local disk within a Kafka Streams application instance or a ksqlDB server. But machines and containers can be lost, along with any locally stored data. How can we make tables fault tolerant, too?
The answer is that any data stored in a table is also stored remotely in Kafka. Every table has its own change stream for this purpose - a built-in change data capture (CDC) setup, we could say. So if we have a table of account balances by customer, every time an account balance is updated, a corresponding change event will be recorded into the change stream of that table.
Also keep in mind that restarting a Kafka Streams application should not reprocess previously processed events. For that, the offset of a message needs to be committed after it has been processed.
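For reference, in Kafka Streams offset commits are handled by the library and are controlled through configuration rather than committed manually; a minimal sketch of the relevant properties (values are illustrative, not taken from the question):
Properties props = new Properties();
// With exactly-once, offsets are committed atomically together with the produced records.
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_BETA); // the guarantee the question uses (Kafka 2.8)
// How often offsets and state are committed (defaults to 100 ms under exactly-once).
props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 100);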
Found the root cause. Not sure whether it is by design or a bug, but a stream task will wait for data in other partitions only once per processing cycle.
So if two records from the command topic were read first, the stream task will wait max.task.idle.ms, allowing the poll() phase to happen, while processing the first command record. After that record is processed, while processing the second one, the stream task will not allow another poll to pick up the newly generated events that resulted from processing the first command.
In Kafka 2.8, the code responsible for this behavior is in StreamTask.java. isProcessable() is invoked at the beginning of the processing phase. If it returns false, the polling phase is repeated.
public boolean isProcessable(final long wallClockTime) {
    if (state() == State.CLOSED) {
        return false;
    }

    if (hasPendingTxCommit) {
        return false;
    }

    if (partitionGroup.allPartitionsBuffered()) {
        idleStartTimeMs = RecordQueue.UNKNOWN;
        return true;
    } else if (partitionGroup.numBuffered() > 0) {
        if (idleStartTimeMs == RecordQueue.UNKNOWN) {
            idleStartTimeMs = wallClockTime;
        }

        if (wallClockTime - idleStartTimeMs >= maxTaskIdleMs) {
            return true;
            // idleStartTimeMs is not reset to its default value, RecordQueue.UNKNOWN;
            // therefore the next time the check for all buffered partitions is done, `true` is returned,
            // meaning that the task is ready to be processed.
        } else {
            return false;
        }
    } else {
        // there's no data in any of the topics; we should reset the enforced
        // processing timer
        idleStartTimeMs = RecordQueue.UNKNOWN;
        return false;
    }
}
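For context, max.task.idle.ms mentioned above is a regular Streams configuration property; a minimal configuration sketch (the application id and broker address are assumptions):
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "security-service");   // assumed application id
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker address
// Maximum time a task waits for data in all of its input partitions before processing what it
// already has; as the code above shows, the idle timer is applied only once per processing cycle.
props.put(StreamsConfig.MAX_TASK_IDLE_MS_CONFIG, 4000L);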

Get the last records of KStream

I'm very new to Kafka Stream API.
I have a KStream like this:
KStream<Long, String> joinStream = builder.stream("output");
The record values in the KStream include fields such as profit and spotPrice.
The stream is updated every second.
I need to build a REST API that performs a calculation based on the profit and spotPrice values.
But I've struggled to get the value of the last record.
I am assuming that by "the last value" you mean the maximum value of the stream, since values are continuously arriving. In that case you can use the reduce transformation to keep updating the output stream with the maximum value.
final StreamsBuilder builder = new StreamsBuilder();
KStream<Long, String> stream = builder.stream("INPUT_TOPIC", Consumed.with(Serdes.Long(), Serdes.String()));
stream
    .mapValues(value -> Long.valueOf(value))
    // the value type changes to Long, so pass explicit serdes for the grouping and the output
    .groupByKey(Grouped.with(Serdes.Long(), Serdes.Long()))
    .reduce(new Reducer<Long>() {
        @Override
        public Long apply(Long currentMax, Long v) {
            return (currentMax > v) ? currentMax : v;
        }
    })
    .toStream()
    .to("OUTPUT_TOPIC", Produced.with(Serdes.Long(), Serdes.Long()));
return builder.build();
And in case you want to retrieve it in a REST API, I suggest taking a look at Spring Cloud Stream with the Kafka Streams binder (https://cloud.spring.io/spring-cloud-stream-binder-kafka/spring-cloud-stream-binder-kafka.html), which lets you exchange messages with Spring Web.
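Alternatively, plain Kafka Streams interactive queries can serve the latest aggregated value to a REST layer. A minimal sketch, assuming the reduce() above is materialized under a hypothetical store name such as "max-store" (e.g. via Materialized.as("max-store")) and that streams is the running KafkaStreams instance:
// Query the local state store that backs the reduce() above (store name is an assumption).
ReadOnlyKeyValueStore<Long, Long> store = streams.store(
    StoreQueryParameters.fromNameAndType("max-store", QueryableStoreTypes.keyValueStore()));
// Latest (maximum) value for a given key, to be returned by the REST endpoint.
Long latestMax = store.get(someKey);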

KTable-KStream LeftJoin impacting the performance. Is there any alternative to it

I have a use case in which I am receiving tweets on one topic and user details on another topic. I need to look up the username from the user details and set it on the tweet.
Using the following code I am able to get the expected outcome.
KStream<String, Tweet> tweetStream = builder
    .stream("tweet-topic",
            Consumed.with(Serdes.String(),
                          serdeProvider.getTweetSerde()));

KTable<String, User> userTable = builder.table("user-topic",
    Consumed.with(Serdes.String(),
                  serdeProvider.getUserSerde()));

KStream<String, Tweet> finalStream = tweetStream.leftJoin(userTable, (tweetDetail, userDetail) -> {
    if (userDetail != null) {
        return tweetDetail.setUserName(userDetail.getName());
    }
    return tweetDetail;
}, Joined.with(Serdes.String(), serdeProvider.getTweetSerde(),
               serdeProvider.getUserSerde()));
However, with 1,000 records in the KTable topic, processing 1 million stream records takes more than 2 hours. Earlier it was taking 2 to 3 minutes.
Earlier, when the user details were kept in a local hash map, it used to take approximately 10 minutes to process all the data.
Is there any other way to avoid the leftJoin or improve its performance?

Kafka Stream producing custom list of messages based on certain conditions

We have the following stream processing requirement.
Source Stream ->
transform(condition check - If (true) then generate MULTIPLE ADDITIONAL messages else just transform the incoming message) ->
output kafka topic
Example:
If the condition is true for message B (D, E, F are the additional messages produced):
A,B,C -> A,D,E,F,C -> Sink Kafka Topic
If the condition is false:
A,B,C -> A,B,C -> Sink Kafka Topic
Is there a way we can achieve this in Kafka Streams?
You can use the flatMap() or flatMapValues() methods. These methods take one record and produce zero, one, or more records.
flatMap() can modify the key, the value, and their data types, while flatMapValues() retains the original key and changes only the value and its data type (a flatMapValues() variant is sketched after the example below).
Here is example pseudocode, assuming the new messages "D", "E", "F" each get a new key.
KStream<byte[], String> inputStream = builder.stream("inputTopic");
KStream<byte[], String> outStream = inputStream.flatMap(
    (key, value) -> {
        List<KeyValue<byte[], String>> result = new LinkedList<>();
        // If the message value is "B"; otherwise place your condition based on your data
        if (value.equals("B")) {
            // placeholder keys; the stream is keyed by byte[], so the new keys are converted accordingly
            result.add(KeyValue.pair("<new key for message D>".getBytes(), "D"));
            result.add(KeyValue.pair("<new key for message E>".getBytes(), "E"));
            result.add(KeyValue.pair("<new key for message F>".getBytes(), "F"));
        } else {
            result.add(KeyValue.pair(key, value));
        }
        return result;
    });
outStream.to("sinkTopic");
You can read more about this here:
https://docs.confluent.io/current/streams/developer-guide/dsl-api.html#streams-developer-guide-dsl-transformations-stateless

How to aggregate data hourly?

Whenever a user favorites some content on our site, we collect the events. The plan is to commit the aggregated favorites of a piece of content every hour and update the total favorite count in the DB.
We are evaluating Kafka Streams and followed the word count example. Our topology is simple: produce to a topic A, read and commit aggregated data to another topic B, then consume events from topic B every hour and commit them to the DB.
@Bean(name = KafkaStreamsDefaultConfiguration.DEFAULT_STREAMS_CONFIG_BEAN_NAME)
public StreamsConfig kStreamsConfigs() {
    Map<String, Object> props = new HashMap<>();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "favorite-streams");
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
    props.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG, WallclockTimestampExtractor.class.getName());
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, brokerAddress);
    return new StreamsConfig(props);
}

@Bean
public KStream<String, String> kStream(StreamsBuilder kStreamBuilder) {
    StreamsBuilder builder = streamBuilder();
    KStream<String, String> source = builder.stream(topic);
    source.flatMapValues(value -> Arrays.asList(value.toLowerCase(Locale.getDefault()).split("\\W+")))
        .groupBy((key, value) -> value)
        .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("counts-store")).toStream()
        .to(topic + "-grouped", Produced.with(Serdes.String(), Serdes.Long()));

    Topology topology = builder.build();
    KafkaStreams streams = new KafkaStreams(topology, kStreamsConfigs());
    streams.start();
    return source;
}

@Bean
public StreamsBuilder streamBuilder() {
    return new StreamsBuilder();
}
However, when I consume topic B it gives me the aggregated data from the beginning. My question is: can we have some provision wherein I consume the previous hour's grouped data, commit it to the DB, and then Kafka forgets about the previous hour's data and gives new data each hour rather than a cumulative sum? Is the topology design correct, or can we do something better?
If you want to get one aggregation result per hour, you can use a windowed aggregation with a window size of 1 hour.
stream.groupBy(...)
      .windowedBy(TimeWindows.of(Duration.ofHours(1)))
      .count(...)
Check the docs for more details: https://docs.confluent.io/current/streams/developer-guide/dsl-api.html#windowing
The output key type is Windowed<String> (not String). You need to provide a Windowed<String> serde, or convert the key type. Consult the SessionWindowsExample for details.
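To illustrate the two options, here is a hedged sketch; the output topic name and the counts variable (the KTable<Windowed<String>, Long> produced by the windowed count above) are assumptions:
// Option 1: write the windowed key using the built-in time-windowed serde.
counts.toStream()
      .to("favorites-hourly",
          Produced.with(WindowedSerdes.timeWindowedSerdeFrom(String.class), Serdes.Long()));

// Option 2: convert the windowed key back to a plain String before writing.
counts.toStream()
      .map((windowedKey, count) ->
          KeyValue.pair(windowedKey.key() + "@" + windowedKey.window().start(), count))
      .to("favorites-hourly", Produced.with(Serdes.String(), Serdes.Long()));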