Looking up specific value in KTable - apache-kafka

I have a Kafka topic.
I have a stream whose key is a stock symbol and whose value is a Hi/Low POJO. I also have a KTable that captures the current state of the stream.
I want to process every record in the stream one by one. For each record, I want to look up the current value for that symbol in the KTable. Then, depending on whether the Hi/Low changes, I want to update the KTable and write the message to the stream.
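The lookup-then-update flow described above can be modelled in plain Java to make the intended semantics concrete. This is only a sketch: the `HashMap` stands in for the KTable, the list stands in for the output stream, and `HiLowTracker`, `HiLow`, and all method names are hypothetical. It does not use the Kafka Streams API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HiLowTracker {
    /** Hypothetical stand-in for the Hi/Low POJO. */
    public static final class HiLow {
        public final double hi;
        public final double low;
        public HiLow(double hi, double low) { this.hi = hi; this.low = low; }
    }

    private final Map<String, HiLow> table = new HashMap<>(); // stands in for the KTable
    private final List<String> emitted = new ArrayList<>();   // stands in for the output stream

    /** Looks up the current value for the symbol and updates/emits only if Hi or Low changed. */
    public void process(String symbol, HiLow incoming) {
        HiLow current = table.get(symbol);
        HiLow merged = (current == null)
            ? incoming
            : new HiLow(Math.max(current.hi, incoming.hi), Math.min(current.low, incoming.low));
        boolean changed = current == null
            || merged.hi != current.hi
            || merged.low != current.low;
        if (changed) {
            table.put(symbol, merged);
            emitted.add(symbol);
        }
    }

    public HiLow current(String symbol) { return table.get(symbol); }

    public List<String> emitted() { return emitted; }
}
```

In a real topology the same comparison would live wherever the state is updated, for example in an aggregation or a Processor API processor with a state store.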

Related

Maintain separate KTable

I have a topic that contains user connection and disconnection events for each session. I would like to use Kafka Streams to process this topic and update a KTable based on some condition. Not every record can update the KTable on its own, so I need to process multiple records to know whether the KTable has to be updated.
For example, process the stream and aggregate by user and then by session ID. If at least one session ID of that user has only a Connected event, the KTable must be updated to mark the user online, if it is not already.
If all session IDs of the user have a Disconnected event, the KTable must be updated to mark the user offline, if it is not already.
How can I implement such logic?
Can we have this KTable in all application instances so that each instance has this data available locally?
Sounds like a rather complex scenario.
Maybe it's best to use the Processor API for this case? A KTable is basically just a KV store, and the Processor API allows you to apply complex processing to decide whether you want to update the state store or not. A KTable itself does not allow you to apply complex logic; it applies every update it receives.
Thus, using the DSL, you would need to do some pre-processing and send an update record only when you want to update the KTable. Something like this:
KStream stream = builder.stream("input-topic");
// apply your processing and write an update record into `updates` when necessary
KStream updates = stream...
KTable table = updates.toTable();
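The "decide first, then send an update record only when needed" idea can be sketched in plain Java. This is a model of the logic only, not Kafka Streams code: `PresenceTracker` and its methods are hypothetical names, the maps stand in for state stores, and the sketch assumes a session's Connected event is seen before its Disconnected event.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class PresenceTracker {
    // Sessions per user that have a Connected event but no Disconnected event yet.
    private final Map<String, Set<String>> openSessions = new HashMap<>();
    // Stands in for the KTable: user -> online flag (absent means offline).
    private final Map<String, Boolean> online = new HashMap<>();
    // Stands in for the `updates` stream: one entry per status change.
    private final List<String> updates = new ArrayList<>();

    public void onEvent(String user, String sessionId, boolean connected) {
        Set<String> sessions = openSessions.computeIfAbsent(user, u -> new HashSet<>());
        if (connected) {
            sessions.add(sessionId);
        } else {
            sessions.remove(sessionId);
        }
        boolean nowOnline = !sessions.isEmpty();
        boolean wasOnline = online.getOrDefault(user, Boolean.FALSE);
        // Write an update record only when the status actually changes.
        if (nowOnline != wasOnline) {
            online.put(user, nowOnline);
            updates.add(user + (nowOnline ? ":online" : ":offline"));
        }
    }

    public boolean isOnline(String user) { return online.getOrDefault(user, Boolean.FALSE); }

    public List<String> updates() { return updates; }
}
```

With the Processor API, the two maps would be state stores and the `updates` list would be records forwarded downstream.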

How to store only latest key values in a kafka topic

I have a topic that has a stream of data coming to it. What I need is to create a separate topic from this topic that only has the latest set of values given the keys.
I thought a KTable's whole purpose was that it will store the latest value given a key rather than storing the whole stream of events. However I can't seem to get this to work. Running the code below produces the keystore but that keystore (maintopiclatest) has a stream of events in it (not just the latest values). So if I send a request with 1000 records in the topic twice, rather than seeing 1000 records, I see 2000 records.
var serializer = new KafkaSpecificRecordSerializer();
var deserializer = new KafkaSpecificRecordDeserializer();
var stream = kStreamBuilder.stream("maintopic",
    Consumed.with(Serdes.String(), Serdes.serdeFrom(serializer, deserializer)));
var table = stream
    .groupByKey()
    .reduce((aggV, newV) -> newV, Materialized.as("maintopiclatest"));
The other problem is if I want to store the KTable in a new topic I'm not sure how to do that. In order to do that it seems that I have to turn it back into a Stream so that I can call ".to" on it. But then that has the whole stream of events in it not just the latest values.
This is not how a KTable works.
A KTable itself has an internal state store and stores exactly one record per key. However, a KTable is constantly updated and subject to the so-called stream-table duality. Each update to the KTable is sent downstream as a changelog record: https://docs.confluent.io/current/streams/concepts.html#duality-of-streams-and-tables. Thus, each input record results in an output record.
Because it's stream processing, there is no "last value per key".
"I have a topic that has a stream of data coming to it. What I need is to create a separate topic from this topic that only has the latest set of values given the keys."
At which point in time do you want a KTable to emit an update? There is no answer to this question because the input stream is conceptually infinite.
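The difference between what the table stores and what it emits can be illustrated with a small plain-Java model. It is a sketch under stated assumptions, not the Kafka Streams implementation: `LatestValueTable` is a hypothetical class, the map plays the role of the state store, and the list plays the role of the downstream changelog.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LatestValueTable {
    private final Map<String, String> store = new HashMap<>(); // one record per key
    private final List<String> changelog = new ArrayList<>();  // one record per update

    /** Overwrites the stored value and emits the update downstream. */
    public void update(String key, String value) {
        store.put(key, value);
        changelog.add(key + "=" + value);
    }

    public int storeSize() { return store.size(); }

    public int changelogSize() { return changelog.size(); }

    public static void main(String[] args) {
        LatestValueTable table = new LatestValueTable();
        // Send the same 1000 keys twice, as in the question.
        for (int round = 0; round < 2; round++) {
            for (int i = 0; i < 1000; i++) {
                table.update("key-" + i, "round-" + round);
            }
        }
        System.out.println("store=" + table.storeSize() + " changelog=" + table.changelogSize());
    }
}
```

The store holds 1000 entries while the changelog holds 2000, which matches what the question observed in the output.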

Kafka Streams TimestampExtractor

Hi everybody, I have a question about TimestampExtractor and Kafka Streams.
In our application we may receive out-of-order events, so I would like to order the events by a business date inside the payload instead of by the point in time at which they were placed in the topic.
For this purpose I wrote a custom TimestampExtractor to pull the timestamp from the payload. Everything up to this point worked perfectly, but when I built a KTable on this topic, I noticed that an event received out of order (from a business point of view it is not the last event, but it arrived last) was displayed as the last state of the object, even though the ConsumerRecord had the timestamp from the payload.
Maybe it was my mistake to assume that Kafka Streams would fix this out-of-order problem with the TimestampExtractor.
Then, during debugging, I saw that if the TimestampExtractor returns -1, Kafka Streams ignores the message, and that the TimestampExtractor is also given the timestamp of the last accepted event. So I built logic that performs the following check: if (payloadTimestamp < previousTimestamp), return -1. This achieves the behavior I want, but I am not sure whether I am sailing in dangerous waters.
Am I allowed to use logic like this, or what other ways exist to deal with out-of-order events in Kafka Streams?
Thanks for any answers.
Currently (Kafka 2.0), KTables don't consider timestamps when they are updated, because the assumption is that there is no out-of-order data in the input topic. The reason for this assumption is the "single writer principle": it's assumed that for a compacted KTable input topic there is only one producer per key, and thus there won't be any out-of-order data with regard to single keys.
It's a known issue: https://issues.apache.org/jira/browse/KAFKA-6521
As for your fix: it's not 100% correct or safe to do this "hack":
First, assume you have two different messages with two different keys, <key1, value1, 5> and <key2, value2, 3>. The second record, with timestamp 3, arrives later but carries a smaller timestamp than the first record, with timestamp 5. However, the two have different keys, so you actually want to put the second record into the KTable. Only if you have two records with the same key do you want to drop late-arriving data, IMHO.
Second, if you have two records with the same key, the second one is out of order, and you crash before processing the second one, the TimestampExtractor loses the timestamp of the first record. Thus, on restart, it would not discard the out-of-order record.
To get this right, you will need to filter "manually" in your application logic instead of in the stateless and key-agnostic TimestampExtractor. Instead of reading the data via builder#table(), you can read it as a stream and apply a .groupByKey().reduce() to build the KTable. In your Reducer logic, you compare the timestamps of the new and old records and return the record with the larger timestamp.
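The per-key comparison described above can be sketched in plain Java. This models only the reducer logic, not the Kafka Streams API: `LatestByTimestamp` and `Event` are hypothetical names, the map stands in for the KTable, and keeping the newer record on equal timestamps is an assumption of this sketch.

```java
import java.util.HashMap;
import java.util.Map;

public class LatestByTimestamp {
    /** Hypothetical record value carrying its business timestamp in the payload. */
    public static final class Event {
        public final String value;
        public final long timestamp;
        public Event(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    /** Reducer logic: keep whichever record has the larger timestamp (newer wins ties). */
    public static Event reduce(Event oldEvent, Event newEvent) {
        return newEvent.timestamp >= oldEvent.timestamp ? newEvent : oldEvent;
    }

    private final Map<String, Event> table = new HashMap<>(); // stands in for the KTable

    /** The comparison is per key, so a small timestamp on another key is not dropped. */
    public void process(String key, Event event) {
        table.merge(key, event, LatestByTimestamp::reduce);
    }

    public Event get(String key) { return table.get(key); }
}
```

Because the previous record lives in the table state rather than in the extractor, the comparison is per key and would survive a restart when backed by a real state store.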

Does the changelog of a Kafka Table emit a new event to its stream when the real value of a key remains unchanged after update?

Let's say I have KTable that's keeping track of location data (e.g.: {'Susan': 'Paris'}) and have materialized a changelog stream using the toStream method of that table.
I know that if I were to update the table with a change to key Susan (e.g., Berlin), then the changelog stream would emit a new event {'Susan': 'Berlin'}. But what if the updated value is the same as the last one? For example, during an aggregate operation we set Susan to Berlin a second time.
Does the changelog emit a second {'Susan': 'Berlin'} event or are new events only added to the changelog stream when there's a diff between the old and new values?
Updates are always emitted. There is no check that compares the old and new values; the assumption is that the new value differs from the old one, so it is emitted every time.

Kafka Streams reduceByKey vs. leftJoin

At first glance it seems to me that with KStream#reduceByKey one can achieve the same functionality as with a KStream-to-KTable leftJoin, i.e., combining records with the same key. What is the difference between the two, also in terms of performance?
Short answer: (What is the difference between the two?)
reduceByKey is applied to a single input stream while leftJoin combines two streams/tables.
Long answer:
If I understand your question correctly, it seems that your incoming KTable changelog stream would be empty, and you want to compute a new join result (i.e., update the result KTable) for each incoming KStream record? The result KTable of a join is not available as a materialized view; only its changelog is sent downstream. Thus, your input KTable would always be empty, and each input KStream record would always join with "nothing" (because of the left join), which would not really update the result KTable. You could just as well do a KStream#map(): there is no state you can exploit if your input KTable does not provide any.
In contrast, if you use reduceByKey, the result KTable is available as a materialized view, and thus for each KStream input record, the previous result value is available to be updated.
Thus, both operations are fundamentally different. If you have a single input KStream, using a join (which requires two inputs) would be quite odd, as there is no KTable...
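The contrast can be sketched in plain Java. This is a model of the semantics only, not the Kafka Streams API: `ReduceVsLeftJoin` and its methods are hypothetical, and summing integers is just an example reducer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReduceVsLeftJoin {
    /** Reduce: a single input; the previous result for the key is available as state. */
    public static Map<String, Integer> reduceByKey(List<Map.Entry<String, Integer>> stream) {
        Map<String, Integer> result = new HashMap<>();
        for (Map.Entry<String, Integer> record : stream) {
            result.merge(record.getKey(), record.getValue(), Integer::sum);
        }
        return result;
    }

    /** Left join: each stream record is looked up in a second input; a miss yields null. */
    public static List<String> leftJoin(List<Map.Entry<String, Integer>> stream,
                                        Map<String, String> table) {
        List<String> joined = new ArrayList<>();
        for (Map.Entry<String, Integer> record : stream) {
            String tableValue = table.get(record.getKey()); // null if the table has no entry
            joined.add(record.getKey() + ":" + record.getValue() + "," + tableValue);
        }
        return joined;
    }
}
```

With an empty table, every joined record carries `null` on the table side and nothing accumulates across records, whereas the reduce builds on its own previous result.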
KStream represents a record stream in which each record is self-contained. For example, if we are summarizing word occurrences, each record would hold the count during a certain frame (e.g., a time window or a paragraph).
KTable represents a sort of state, and each incoming record would normally hold the total occurrence count.
Therefore, the use cases for the two methods are quite different. While KStream#reduceByKey would reduce all records with the same key and summarize the counts for each key, KTable#leftJoin would normally be used when the total count needs to be adjusted according to additional information coming in, or to combine more data into the record.
The example given in the Kafka Streams documentation is log compaction. While no record can be discarded in a KStream, records that are no longer relevant would be removed in a KTable.