Kafka KTable changelog (using toStream()) is missing some KTable updates when several messages with the same key arrive at the same time - apache-kafka

I have an input stream that I use to create a KTable. Then I create an output stream from the KTable changelog, using the toStream() method. The problem is that the stream created by toStream() does not contain all the messages from the input stream that have updated my KTable. Here is my code:
final KTable<String, event> KTable = inputStream.groupByKey().aggregate(() -> null,
        aggregateKtableMethod,
        storageConf);
KStream<String, event> outputStream = KTable.toStream();
I would like to get one message in outputStream for each message in inputStream. For most messages it works well, but I am losing some events in one particular case: if I receive 2 messages with the same key within a small interval of time (less than 5 seconds), I only receive the second event in outputStream.
I think it is because the KTable updates are made by some batch operation, but I can't find any configuration or documentation related to it. Is that the reason for these missing events, and do you know how to change the configuration so that I don't lose any messages?

I found the solution. The issue was in the "storageConf" I had used to create my KTable: caching was enabled. I just had to disable it, with the function:
storageConf.withCachingDisabled();
final KTable<String, event> KTable = inputStream.groupByKey().aggregate(() -> null,
        aggregateKtableMethod,
        storageConf);
Now I have all my events in the output stream.
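As a hedged aside, record caching can also be disabled globally (instead of per store) by setting the cache size to zero in the Streams configuration; the application id and broker address below are placeholders:
// Sketch only: a zero-byte record cache means every KTable update is forwarded
// downstream instead of being coalesced with later updates for the same key.
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");            // placeholder
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);        // disables record caching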

Related

Stream-KTable join - stale data if same topic is used as source and sink

I'm trying to implement the event sourcing pattern with Kafka Streams. I have two topics: commands and events.
Commands are modelled as a KStream. To get the current state for a given entity, I create a KTable by aggregating (replaying) all the events.
Then I join the commands KStream with the snapshots KTable. As a result of this join, I execute the command on the latest snapshot, which in turn produces new events to the events topic.
Here is the code:
KStream<String, SecurityCommand> securityCommands =
    builder.stream(
        "security-command",
        Consumed.with(Serdes.String(), JsonSerdes.securityCommand()));

var userAccountSnapshots =
    builder.stream(
        "security-event",
        Consumed.with(Serdes.String(), JsonSerdes.userAccountEvent())
            .withTimestampExtractor(new LatestRecordTimestampExtractor()))
    .groupByKey()
    .aggregate(
        () -> UserAccountSnapshot.initial(),
        (email, event, snapshot) ->
            (new UserAccount(snapshot, new UserAccountEvent[] { event })).snapshot(),
        Materialized.with(
            Serdes.String(),
            JsonSerdes.userAccountSnapshot()));

Joined<String, SecurityCommand, UserAccountSnapshot> joinParams =
    Joined.with(Serdes.String(),
        JsonSerdes.securityCommand(),
        JsonSerdes.userAccountSnapshot());

ValueJoiner<SecurityCommand, UserAccountSnapshot, CommandWithUserAccountSnapshot> commandWithSnapshot =
    (command, snapshot) -> new CommandWithUserAccountSnapshot(command, snapshot);

securityCommands.leftJoin(userAccountSnapshots, commandWithSnapshot, joinParams)
    .flatMapValues((email, cmd) -> processCommand(cmd))
    .to("security-event",
        Produced.with(
            Serdes.String(),
            JsonSerdes.userAccountEvent()));
The problem I'm facing is that when I have multiple unprocessed commands (CmdA, CmdB) in the commands topic, the event produced by processing the first one doesn't get fetched from the events topic. As a result, the next command is processed against an obsolete snapshot.
The reason for this is the KIP-695 implementation.
After CmdA is processed, the consumer lag for the events topic doesn't get updated, because no poll is done. The cached lag is 0. Therefore, CmdB is processed without fetching data from the events topic. This is a blocker.
Basically, it means that if I have a topology where I'm joining two streams and one of the streams is sourced from a topic that also serves as a sink topic, I will get inconsistent behaviour due to stale data.
Is there a workaround to my problem?

Apache Kafka Streams: Out-of-Order messages

I have an Apache Kafka 2.6 Producer which writes to topic-A (TA).
I also have a Kafka streams application which consumes from TA and writes to topic-B (TB).
In the streams application, I have a custom timestamp extractor which extracts the timestamp from the message payload.
For one of my failure handling test cases, I shut down the Kafka cluster while my applications are running.
When the producer application tries to write messages to TA, it cannot because the cluster is down and hence (I assume) buffers the messages.
Let's say it receives 4 messages m1,m2,m3,m4 in increasing time order. (i.e. m1 is first and m4 is last).
When I bring the Kafka cluster back online, the producer sends the buffered messages to the topic, but they are not in order. I receive for example, m2 then m3 then m1 and then m4.
Why is that? Is it because the buffering in the producer is multi-threaded, with each thread producing to the topic at the same time?
I assumed that the custom timestamp extractor would help in ordering messages when consuming them. But it does not. Or maybe my understanding of the timestamp extractor is wrong.
I got one solution from SO here: just stream all events from TA to another intermediate topic (say TA') using the timestamp extractor. But I am not sure if this will cause the events to be reordered based on the extracted timestamp.
My code for the Producer is as shown below (I am using Spring Cloud for creating the Producer):
Producer.java
@Service
public class Producer {

    private String topicName = "input-topic";
    private ApplicationProperties appProps;

    @Autowired
    private KafkaTemplate<String, MyEvent> kafkaTemplate;

    public Producer() {
        super();
    }

    @Autowired
    public void setAppProps(ApplicationProperties appProps) {
        this.appProps = appProps;
        this.topicName = appProps.getInput().getTopicName();
    }

    public void sendMessage(String key, MyEvent ce) {
        ListenableFuture<SendResult<String, MyEvent>> future = this.kafkaTemplate.send(this.topicName, key, ce);
    }
}
Why is that? Is it because the buffering in the producer is multi-threaded, with each thread producing to the topic at the same time?
By default, the producer allows for up to 5 parallel in-flight requests to a broker, and thus if some requests fail and are retried, the request order might change.
To avoid this re-ordering issue, you can either set max.in.flight.requests.per.connection = 1 (which may have a performance hit) or set enable.idempotence = true.
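As a hedged illustration (the broker address and serializers are placeholders), the relevant producer settings look like this:
// Sketch only: producer settings that preserve send order under retries.
Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true); // keeps per-partition ordering even with retries
// or, alternatively (may reduce throughput):
// props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 1);
With Spring Kafka, the same keys can typically be supplied through the producer properties of the application configuration.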
Btw: you did not say whether your topic has a single partition or multiple partitions, or whether your messages have a key. If your topic has more than one partition and your messages are sent to different partitions, there is no ordering guarantee on read anyway, because offset ordering is only guaranteed within a partition.
I assumed that the custom timestamp extractor would help in ordering messages when consuming them. But it does not. Or maybe my understanding of the timestamp extractor is wrong.
The timestamp extractor only extracts a timestamp. Kafka Streams does not re-order any messages; it always processes messages in offset order.
If not, then what are the specific uses of the timestamp extractor? Just to associate a timestamp with an event?
Correct.
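As a hedged illustration of that point, an extractor does nothing more than return a timestamp for each record; MyEvent.getEventTime() below is a hypothetical payload accessor:
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

// Sketch only: the extractor decides which timestamp a record carries,
// but it has no influence on the (offset-based) processing order.
public class PayloadTimestampExtractor implements TimestampExtractor {
    @Override
    public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
        if (record.value() instanceof MyEvent) {
            return ((MyEvent) record.value()).getEventTime(); // hypothetical payload timestamp
        }
        return partitionTime; // fall back to the current partition time estimate
    }
}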
I got one solution from SO here: just stream all events from TA to another intermediate topic (say TA') using the timestamp extractor. But I am not sure if this will cause the events to be reordered based on the extracted timestamp.
No, it won't do any reordering. The other SO question is just about changing the timestamp, but if you read messages in order a,b,c the result would be written in order a,b,c (just with different timestamps; offset order is preserved).
This talk explains some more details: https://www.confluent.io/kafka-summit-san-francisco-2019/whats-the-time-and-why/

What happens to the consumer offset when an error occurs within a custom class in a KStream topology?

I'm aware that you can define a stream-processing Kafka application in the form of a topology that implicitly knows which records have gone through successfully, and therefore can correctly commit the consumer offsets, so that when the microservice has to be restarted it will continue reading the input topic without missing messages.
But what happens when I introduce my own processing classes into the stream? For instance, perhaps I need to submit information from the input records to a web service with a big startup time. So I write my own processor class that accumulates, say, 1000 messages and then submits a batch request to the external service, like this.
KStream<String, Prediction> stream = new StreamsBuilder()
    .stream(inputTopic, Consumed.with(Serdes.String(), new MessageSerde()))
    // talk to web service
    .map((k, v) -> new KeyValue<>("", wrapper.consume(v.getPayload())))
    .flatMapValues((ValueMapper<List<Prediction>, Iterable<Prediction>>) value -> value);
// send downstream
stream.peek((k, v) -> metrics.countOutgoingMessage())
    .to(outputTopic, Produced.with(Serdes.String(), new PredictionSerde()));
Assume that the external service can issue zero, one or more predictions of some kind for every input, and that my wrapper submits inputs in batches to increase throughput. It seems to me that KStream cannot possibly keep track of which input record corresponds to which output record, and therefore, no matter how it is implemented, it cannot guarantee that the correct consumer offset for the input topic is committed.
So in this paradigm, how can I give the library hints about which messages have been successfully processed? Or failing that, how can I get access to the consumer offset for the topic and perform commits explicitly so that no data loss can occur?
I think you might have a problem if you are using map. Combining remote calls with a DSL operator is not recommended. You might want to look into using the Processor API (docs). With ProcessorContext you can forward or commit, which could give you the flexibility you need.
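A rough sketch of what that could look like, assuming a hypothetical PredictionService wrapper with a consumeBatch() method and the Message/Prediction types from the question. This is only an outline: records buffered in memory would be lost on a crash, and commit() is merely a request that Streams honours at the next opportunity.
import java.util.ArrayList;
import java.util.List;
import org.apache.kafka.streams.processor.AbstractProcessor;

// Sketch only: buffer records, call the external service in batches,
// forward the results, and request an offset commit once the batch succeeded.
public class BatchingProcessor extends AbstractProcessor<String, Message> {

    private static final int BATCH_SIZE = 1000;
    private final List<Message> buffer = new ArrayList<>();

    @Override
    public void process(String key, Message value) {
        buffer.add(value);
        if (buffer.size() >= BATCH_SIZE) {
            for (Prediction p : PredictionService.consumeBatch(buffer)) { // hypothetical wrapper
                context().forward(key, p); // emit each prediction downstream
            }
            buffer.clear();
            context().commit(); // ask Streams to commit offsets after the batch is done
        }
    }
}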

What's a viable solution to deferred processing of the events?

I have a system that consumes an event stream from Kafka in order to analyze some records stored in a database.
In some cases, an event matches a condition that means the corresponding record should be analyzed later, in the future.
Perhaps the simplest way to implement this logic is to write the timestamp of the future processing to the database and periodically perform some kind of select to find the records due for re-processing.
Maybe there is a more convenient and scalable way to do it? It looks like another timestamped event stream that could be processed once the current time becomes greater than or equal to the event's timestamp. What are the options to implement such behaviour?
In my opinion, depending on how long you need to store it, you can just create a stream that filters for these events and pushes them into a new topic that can be processed later. If it is more for historical purposes, it might be better to push it into a DBMS.
You can try a state store in Kafka Streams, which can be used by stream processing applications to store and query data later.
Kafka Streams automatically creates and manages such state stores when you call stateful operators such as count() or aggregate(), or when you window a stream. The store is kept in memory; however, you can use persistent storage, e.g. Portworx, to handle fault scenarios.
Below is how you initialize a StateStore:
StoreBuilder<KeyValueStore<String, String>> statStore = Stores
        .keyValueStoreBuilder(Stores.persistentKeyValueStore("uniqueName"), Serdes.String(),
                Serdes.String())
        .withLoggingDisabled(); // disable backing up the store to a change log topic
Below is how you add the state store to a Kafka Streams topology:
Topology builder = new Topology();
builder.addSource("Source", topic)
        .addProcessor("SourceProcessName", () -> new ProcessorClass(), "Source")
        .addStateStore(statStore, "SourceProcessName")
        .addSink("SinkProcessName", sinkTopic, "SourceProcessName");
In the process method, you can store the Kafka topic messages as key-value pairs:
// The store must be retrieved under the name it was registered with ("uniqueName" above).
KeyValueStore<String, String> dsStore = (KeyValueStore<String, String>) context.getStateStore("uniqueName");
KeyValueIterator<String, String> iter = dsStore.all();
while (iter.hasNext()) {
    KeyValue<String, String> entry = iter.next();
    // process each stored key-value pair here
}
iter.close(); // KeyValueIterators must be closed to release resources
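Building on that, here is a hedged sketch of how the store could drive deferred processing with a wall-clock punctuation. The "dueAt|payload" value encoding, the 30-second scan interval, and the store name "uniqueName" (matching the builder above) are assumptions for illustration.
import java.time.Duration;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.processor.AbstractProcessor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

// Sketch only: park events in the state store and periodically re-emit the ones
// whose scheduled time has passed.
public class DeferredProcessor extends AbstractProcessor<String, String> {

    private KeyValueStore<String, String> store;

    @Override
    public void init(ProcessorContext context) {
        super.init(context);
        store = (KeyValueStore<String, String>) context.getStateStore("uniqueName");
        context.schedule(Duration.ofSeconds(30), PunctuationType.WALL_CLOCK_TIME, now -> {
            try (KeyValueIterator<String, String> iter = store.all()) {
                while (iter.hasNext()) {
                    KeyValue<String, String> entry = iter.next();
                    long dueAt = Long.parseLong(entry.value.split("\\|")[0]); // assumed "dueAt|payload" layout
                    if (dueAt <= now) {
                        context.forward(entry.key, entry.value); // ready: send downstream
                        store.delete(entry.key);                 // remove once handled
                    }
                }
            }
        });
    }

    @Override
    public void process(String key, String value) {
        store.put(key, value); // each value is expected to carry its own due timestamp
    }
}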

How to store only the latest key values in a Kafka topic

I have a topic that has a stream of data coming to it. What I need is to create a separate topic from this topic that only has the latest set of values given the keys.
I thought a KTable's whole purpose was to store the latest value for a given key rather than the whole stream of events. However, I can't seem to get this to work. Running the code below produces the state store, but that store (maintopiclatest) has a stream of events in it (not just the latest values). So if I send a request with 1000 records into the topic twice, rather than seeing 1000 records, I see 2000 records.
var serializer = new KafkaSpecificRecordSerializer();
var deserializer = new KafkaSpecificRecordDeserializer();

var stream = kStreamBuilder.stream("maintopic",
        Consumed.with(Serdes.String(), Serdes.serdeFrom(serializer, deserializer)));

var table = stream
        .groupByKey()
        .reduce((aggV, newV) -> newV, Materialized.as("maintopiclatest"));
The other problem is that if I want to store the KTable in a new topic, I'm not sure how to do that. It seems I have to turn it back into a stream so that I can call ".to" on it, but then that stream contains the whole sequence of events, not just the latest values.
This is not how a KTable works.
A KTable itself has an internal state store and stores exactly one record per key. However, a KTable is constantly updated and subject to the so-called stream-table duality. Each update to the KTable is sent downstream as a changelog record: https://docs.confluent.io/current/streams/concepts.html#duality-of-streams-and-tables. Thus, each input record results in an output record.
Because it's stream processing, there is no "last value per key".
I have a topic that has a stream of data coming to it. What I need is to create a separate topic from this topic that only has the latest set of values given the keys.
At which point in time do you want a KTable to emit an update? There is no answer to this question because the input stream is conceptually infinite.
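That said, if the goal is just that the downstream topic ends up holding only the latest value per key, one commonly used approach (a hedged sketch, not the only option) is to write the table's changelog stream to a sink topic configured with log compaction, so that Kafka itself eventually retains only the most recent record per key. The topic name and commands below are illustrative.
// Sketch only: emit every table update and let a compacted sink topic
// eventually retain just the latest record per key.
table.toStream().to("maintopiclatest-topic",
        Produced.with(Serdes.String(), Serdes.serdeFrom(serializer, deserializer)));

// The sink topic would be created with cleanup.policy=compact, e.g.:
//   kafka-topics --bootstrap-server localhost:9092 --create --topic maintopiclatest-topic \
//       --partitions 1 --replication-factor 1 --config cleanup.policy=compact
Note that compaction is eventual; a consumer reading the full topic may still see intermediate updates.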