I have a KStream with key-value pairs that are grouped by key. Every key should be unique, and the only reason it might not be is that the same key is streamed again with a null value.
In my Streams application I need to filter out all records for a key if the value of one of its records is null (a tombstone). How do I get started?
KStream<Key, Value> table = builder.stream(kafkaProperties.getTopicName());
// If key exists multiple times, check for null value and if found
// remove / ignore record
So if a key never receives a null value, it needs to stay, but if it does, the complete key with all its values needs to be thrown away.
This is quite tricky to achieve. Data is processed linearly, so you would need to buffer all key-value pairs in a state store, e.g., using a transform(). You would insert each input key-value pair into the key-value store; if you receive a null value, you delete the corresponding key from the store.
The difficult part is deciding/knowing that there won't be any null value for a key in the future. How to determine this depends on your overall setup, and there is no generic answer. If at some point you can decide that a key-value pair in the store cannot receive a future tombstone, you can send it downstream and delete it from the store.
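A minimal sketch of such a buffering transform(), assuming String keys/values and a store named "buffer-store" (both assumptions); when and how buffered records are released downstream is the application-specific part described above:

StoreBuilder<KeyValueStore<String, String>> bufferStoreBuilder =
    Stores.keyValueStoreBuilder(Stores.persistentKeyValueStore("buffer-store"), Serdes.String(), Serdes.String());
builder.addStateStore(bufferStoreBuilder);

KStream<String, String> input = builder.stream(kafkaProperties.getTopicName());
input.transform(() -> new Transformer<String, String, KeyValue<String, String>>() {
    private KeyValueStore<String, String> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(final ProcessorContext context) {
        store = (KeyValueStore<String, String>) context.getStateStore("buffer-store");
    }

    @Override
    public KeyValue<String, String> transform(final String key, final String value) {
        if (value == null) {
            store.delete(key);      // tombstone: drop whatever is buffered for this key
        } else {
            store.put(key, value);  // buffer until you know no tombstone can follow
        }
        return null;                // emit nothing here; release buffered records when your condition holds
    }

    @Override
    public void close() { }
}, "buffer-store");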
Related
It is my understanding that state stores are local per app instance (i.e., per partition), per the docs:
Because Kafka Streams partitions the data for processing it, an application’s entire state is spread across the local state stores of the application’s running instances.
I have a use case where I need at most one key associated with a specific value (let's call it value123). If a keyB/value123 message is received and value123 was previously associated with a different key (keyD), I need to delete the old keyD/value123.
Here is the problem: I only receive new key/value associations. I don't receive tombstone messages for old keys, so I have to infer the tombstone from the fact that a new key just arrived on the topic with the same value. There is no way to access (delete) the key/value if it lives in another app instance's state store, because state is local per instance. I need to evict the old data. How can I achieve this?
To look at it another way:
If a message with key A comes into a transformer whose job is to clean up the state so that no other key holds that value... say key A's value is currently 'associated' with key B. I need to delete key B from the KTable/state store so that key A is the only thing associated with the value. I can't guarantee that key B is assigned to the same partition as key A. How do I delete key B from a partition other than the one key A is on?
I was able to solve my problem by switching my key to the other data point and using the new 2.5.0 feature to join two KTables by foreign key. This controls the output because once a new record came in with the same key (but a different foreign key), my other KTable no longer joined, since the foreign key had changed (see the sketch after the links below).
I used these two as resources:
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Join+Semantics#KafkaStreamsJoinSemantics-KTable-KTableForeign-KeyJoin
https://kafka-tutorials.confluent.io/foreign-key-joins/kstreams.html
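A minimal sketch of the approach, assuming String-keyed tables whose left value carries the foreign key after a "|" separator (topic names, serdes, and the value format are all assumptions, not the poster's actual code):

KTable<String, String> left = builder.table("left-topic", Consumed.with(Serdes.String(), Serdes.String()));
KTable<String, String> right = builder.table("right-topic", Consumed.with(Serdes.String(), Serdes.String()));

// When an update arrives for a left key with a changed foreign key, the previous
// join result is retracted, so the old association disappears from the output.
KTable<String, String> joined = left.join(
        right,
        leftValue -> leftValue.split("\\|")[1],                  // foreign-key extractor
        (leftValue, rightValue) -> leftValue + "/" + rightValue);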
Instances of Kafka Streams applications can communicate using RPC - https://kafka.apache.org/10/documentation/streams/developer-guide/interactive-queries.html#querying-remote-state-stores-for-the-entire-app.
You could query the other instances by creating a custom RPC endpoint and building logic to delete the value from the remote state store, if found.
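A hedged sketch of locating the owning instance, assuming a store named "my-store" and an application-defined delete endpoint; interactive queries themselves are read-only, so the actual removal has to be your own logic (metadataForKey() is the call in the /10/ docs linked above; newer versions use queryMetadataForKey()):

StreamsMetadata metadata = kafkaStreams.metadataForKey(
        "my-store", oldKey, Serdes.String().serializer());
HostInfo owner = metadata.hostInfo();
// e.g. issue an HTTP DELETE to http://<owner.host()>:<owner.port()>/store/<oldKey>,
// where your own handler on that instance removes the entry (or writes a tombstone).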
I have the following use-case:
There is a stream of records on a Kafka topic. I also have a set of unique IDs. For each record in the stream, I need to check whether the record's ID is present in my set of unique IDs. Basically, this should serve as a filter for my Kafka Streams app, i.e., only records from the Kafka topic whose IDs match my set of unique IDs should be written to another topic.
Our current application is based on Kafka Streams. I looked at KStreams and KTables; they seem good for enrichment, but I don't need any enrichment of the data. As for state stores, I'm not sure how well they scale.
I would like to do something like this:
kStream.filter((k, v) -> {
    String valueToCheckInKTable = v.get(FIELD_NAME);
    return kTable.containsKey(valueToCheckInKTable); // keep the record only if the ID is known, drop it otherwise
});
The lookup data can be pretty huge. Can someone suggest the best way to do this?
You can read the reference IDs into a table via builder.table("id-topic") with the ID as primary key (note that the value must be non-null, otherwise it would be interpreted as a delete; if there is no actual value, just put any non-null dummy value for each record when you write the IDs into the id-topic). To load the full table on startup, you might want to provide a custom timestamp extractor that always returns 0 via the Consumed parameter of the table() operator (records are processed in timestamp order, and returning 0 ensures that the records from the id-topic are processed first to load the table).
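A minimal sketch of loading the ID table this way, assuming String IDs and a non-null dummy value written to "id-topic":

KTable<String, String> idTable = builder.table(
        "id-topic",
        Consumed.with(Serdes.String(), Serdes.String())
                // timestamp 0 makes the IDs "older" than any stream record,
                // so the table is fully loaded before stream records are joined
                .withTimestampExtractor((record, previousTimestamp) -> 0L));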
To do the filtering you do a stream-table join:
KStream stream = builder.stream(...);
// the key in the stream must be the ID; if this is not the case, you can use `selectKey()` to set a new key
KStream filteredStream = stream.join(table,...);
As you don't want to do any enrichment, the provided joiner function can just return the left (stream-side) value unmodified and ignore the right-hand (table-side) value.
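For illustration, such a pass-through joiner could look like this (String types assumed):

KStream<String, String> filteredStream = stream.join(
        idTable,
        (streamValue, dummyTableValue) -> streamValue);   // forward the stream value, ignore the table-side dummy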
Is there a way to identify any record in any table without using RDB$DB_KEY or a table's key?
Unfortunately, RDB$DB_KEY can only be guaranteed for the current transaction and might be different outside of it, and without another key in the table one would not be able to uniquely identify a record if it is an exact duplicate of another.
Other than RDB$DB_KEY, a primary key, or a unique key, there is nothing else to uniquely identify a row.
It is possible to extend the lifetime of RDB$DB_KEY to the lifetime of a connection using DPB property isc_dpb_dbkey_scope. However, using that is a bad idea: it will start an internal transaction for the lifetime of your connection which will prevent garbage collection of old row versions. This can seriously affect the performance of your application.
I am doing a PoC on Kafka Streams and KTables. I was wondering if there is any way to store data (key-value pairs or key-object pairs) in Kafka, either through streams, KTables, or state stores, so that I can retrieve data based on both keys and values.
I created a KStream based on a topic, pushed some messages onto it, and, using the word-count algorithm, populated values in a KTable created on top of that KStream. Something like this:
StoreBuilder<KeyValueStore<String, Customer>> customerStateStore =
    Stores.keyValueStoreBuilder(Stores.persistentKeyValueStore("customer-store"), Serdes.String(), customerSerde)
          .withLoggingEnabled(new HashMap<>());

streamsBuilder.stream("customer", Consumed.with(Serdes.String(), customerSerde))
              .to("customer-to-ktable-topic", Produced.with(Serdes.String(), customerSerde));

KTable<String, Customer> customerKTable = streamsBuilder.table("customer-to-ktable-topic",
        Consumed.with(Serdes.String(), customerSerde), Materialized.as(customerStateStore.name()));
I am not able to fetch records based on values.
https://kafka.apache.org/10/javadoc/org/apache/kafka/streams/kstream/KTable.html
Only a get(key) lookup is available according to the Kafka documentation. However, I am exploring whether this can be achieved some other way.
Your customerStateStore is a key-value store and as you stated, you can only query based on keys.
One proposal would be to work on the inbound (IN) flow, in order to use the value (or part of the value) as a key in the store. You can do that with the map() method (or flatMap(), since you need two output records per input). The idea would be to achieve something like:
Original IN msg: key1 - value1
Would generate 2 entries in the store:
key1 - value1
value1 - key1 (or whatever depending on your usecase)
Doing this, you will be able to query the store on value1, because it is now a key. (Be careful if the IN topic has the same value for different keys.)
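A minimal sketch of that re-keying on the way in, using flatMap() to emit both directions (topic names and String serdes are assumptions):

KStream<String, String> in = streamsBuilder.stream("customer", Consumed.with(Serdes.String(), Serdes.String()));

in.flatMap((key, value) -> Arrays.asList(
            KeyValue.pair(key, value),      // original direction: key1 -> value1
            KeyValue.pair(value, key)))     // reverse index:      value1 -> key1
  .to("customer-to-ktable-topic", Produced.with(Serdes.String(), Serdes.String()));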
Or, as an alternative to @NishuTayal's suggestion, you can loop over all the entries of the store and evaluate the values with the all() method (see the sketch below): https://kafka.apache.org/10/javadoc/org/apache/kafka/streams/state/ReadOnlyKeyValueStore.html#all--
Obviously this degrades performance, but depending on the size of your (in-memory) store and your use case (get all the entries for a given value? only one entry? ...), it might not add too much delay to the processing.
But you have to be careful with the partitioning of your input topic: a given value may be present in several partitions of your topic, and therefore in different instances of your KS app.
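A minimal sketch of that all() scan, assuming the "customer-store" from the question and a getCountry() accessor on Customer (the accessor is an assumption):

ReadOnlyKeyValueStore<String, Customer> store =
        kafkaStreams.store("customer-store", QueryableStoreTypes.keyValueStore());

try (KeyValueIterator<String, Customer> iterator = store.all()) {
    while (iterator.hasNext()) {
        KeyValue<String, Customer> entry = iterator.next();
        if ("USA".equals(entry.value.getCountry())) {   // value-based lookup, O(n) over the store
            // collect entry.key / entry.value
        }
    }
}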
You can use the filter operation to perform key- or value-based lookups:
customerKTable.filter((key, value) -> !"USA".equals(value.get("country")))
When you are materializing a store from a SessionWindowedKStream, it forces you to do it as a SessionStore, because the operator takes a
Materialized<K, VR, SessionStore<org.apache.kafka.common.utils.Bytes, byte[]>> materialized
parameter.
So what you get is a SessionStore<org.apache.kafka.common.utils.Bytes,byte[]>.
In this type of store, you can fetch by key, but not by key and time as in a WindowStore, even though the key type is Windowed<K>. So you would have to iterate through it to find the time-related entries, which should be less efficient than querying by time.
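A sketch of that iteration, assuming a store named "session-store" with Long aggregates and explicit time bounds (all assumptions):

ReadOnlySessionStore<String, Long> sessionStore =
        kafkaStreams.store("session-store", QueryableStoreTypes.sessionStore());

try (KeyValueIterator<Windowed<String>, Long> iterator = sessionStore.fetch("some-key")) {
    while (iterator.hasNext()) {
        KeyValue<Windowed<String>, Long> session = iterator.next();
        long start = session.key.window().start();
        long end = session.key.window().end();
        if (end >= earliestSessionEnd && start <= latestSessionStart) {  // time filter done client-side
            // process session.value
        }
    }
}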
How can you use the aggregated session store of Windowed<K> in order to query the store with (key, time)?
Or, put differently, why are there no findSessions-like methods (i.e., time-bound access) in ReadOnlySessionStore, while there are in SessionStore?