Unable to retrieve metadata for a state store key in Kafka Streams - apache-kafka

I'm trying to use Kafka Streams with a state store distributed among two instances. Here's how the store and the associated KTable are defined:
KTable<String, Double> userBalancesTable = kStreamBuilder.table(
    "balances-table",
    Consumed.with(String(), Double()),
    Materialized.<String, Double, KeyValueStore<Bytes, byte[]>>as(BALANCES_STORE)
        .withKeySerde(String())
        .withValueSerde(Double())
);
Next, I have some stream processing logic that aggregates some data into this balances-table KTable:
transactionsStream
    .leftJoin(...)
    ...
    .aggregate(...)
    .to("balances-table", Produced.with(String(), Double()));
And at some point, from a REST handler, I query the state store:
ReadOnlyKeyValueStore<String, Double> balances = streams.store(BALANCES_STORE, QueryableStoreTypes.<String, Double>keyValueStore());
return Optional.ofNullable(balances.get(userId)).orElse(0.0);
Which works perfectly - as long as I have a single stream processing instance.
Now, I'm adding a second instance (note: my topics all have 2 partitions). As explained in the docs, the state store BALANCES_STORE is distributed among the instances based on the key of each record (in my case, the key is a user ID). Therefore, an instance must:
Make a call to KafkaStreams#metadataForKey to discover which instance is handling the part of the state store containing the key it wants to retrieve
Make an RPC (e.g. REST) call to that instance to retrieve it
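For reference, a simplified sketch of what I'm attempting (localHostInfo is just a HostInfo built from my application.server value; the actual RPC forwarding is omitted):
StreamsMetadata metadata = streams.metadataForKey(BALANCES_STORE, userId, Serdes.String().serializer());
if (metadata == null) {
    // this is what I keep hitting: no metadata returned for the key
} else if (metadata.hostInfo().equals(localHostInfo)) {
    // query the local store, as shown above
} else {
    // forward the request to metadata.host():metadata.port() over REST
}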
My problem is that the call to KafkaStreams#metadataForKey always returns a null metadata object. However, KafkaStreams#allMetadataForStore() returns metadata for both instances. So it behaves as if it doesn't know about the key I'm querying, even though looking the key up directly in the state store works.
Additional note: my application.server property is correctly set.
Thank you!

Related

Is it possible to extract the Schema ID when using KStream processing?

I am processing messages from a sourceTopic to a targetTopic using KStream (using the map method). In the map method, I am generating a new schema (since I need to extract explicit fields) for the targetTopic from the incoming messages. But since the KStream operation is per message, I wish to avoid regenerating the schema for every message; instead, I would like to cache the schema ID of the incoming messages (for both key and value) and generate a new target schema only if the source schema changes.
Is there a way to do this via the KStream object, or from the key/value objects used in the map method?
Update:
I was not able to get the schema ID for my above use case. As a workaround, I cached the schema in a local variable, checked on each iteration whether it had changed, and processed further as required.
You will only have access to the ID if you use Serdes.Bytes(); after the records are deserialized, you'll only have access to the Schema.
The Avro Serdes from Confluent already cache the IDs, though.
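A rough sketch of that local-cache workaround, assuming the values are Avro GenericRecords (the buildTargetSchema and buildTargetRecord helpers are hypothetical, not part of any library):
private Schema lastSourceSchema;      // schema seen on the previous record
private Schema cachedTargetSchema;    // target schema derived from it

stream.map((key, value) -> {
    Schema sourceSchema = value.getSchema();
    if (!sourceSchema.equals(lastSourceSchema)) {               // regenerate only when the source schema changes
        cachedTargetSchema = buildTargetSchema(sourceSchema);   // hypothetical helper
        lastSourceSchema = sourceSchema;
    }
    return KeyValue.pair(key, buildTargetRecord(value, cachedTargetSchema));  // hypothetical helper
});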

Is it advisable to change kstream message key or message after it is read by consumer for storing internal state

I am doing some heavy processing inside my kstream filter and the same processing would be required by downstream transformer and mapper. As filter is stateless operation, is it a good idea to add some data in the key or value passed in filter and use it in downstream mapper/transformer?
If you need the data for joins, then altering the key can be useful, but be mindful that the data will be repartitioned into downstream/internal topics.
If you simply want metadata on the record, the recommendation would be to use headers; otherwise, modify the value itself.
The filter DSL operator cannot modify the data, however. Use map or the Processor API to forward a new record.
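As a rough illustration (not the poster's code), the precomputed result could be attached as a header via transformValues so downstream operators can read it without recomputing; heavyProcessing is a hypothetical helper:
stream.transformValues(() -> new ValueTransformerWithKey<String, String, String>() {
    private ProcessorContext context;

    @Override
    public void init(ProcessorContext context) {
        this.context = context;
    }

    @Override
    public String transform(String key, String value) {
        byte[] result = heavyProcessing(key, value);      // expensive work done once
        context.headers().add("precomputed", result);     // downstream operators can read the header
        return value;                                      // value itself left unchanged
    }

    @Override
    public void close() { }
});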

How can a key/value stored in app instance A's state store be deleted using app instance B

It is my understanding that state stores are local per app instance (i.e., per partition), per the docs:
Because Kafka Streams partitions the data for processing it, an application’s entire state is spread across the local state stores of the application’s running instances.
I have a use case where, at any time, only one key may be associated with a specific value (let's call it value123). If a new keyB/value123 message is received and value123 was previously associated with a different key (keyD), I need to delete the old keyD/value123 entry.
Here is the problem: I only receive new key/value associations; I don't receive "tombstone" messages for old keys, so I have to infer the tombstone from the fact that a new key just arrived on the topic with the same value. There's no way to access (or delete) the key/value if it lives in another app instance's state store, because state is local per instance. I need to evict the old data. How can I achieve this?
To look at it another way:
If a message with key A comes into a transformer, and that transformer's job is to clean up the state to make sure no other keys have that value... let's say key A's value is currently 'associated' with key B. I need to delete key B from the KTable/state store so that key A is the only thing associated with the value. I can't guarantee that key B is assigned to the same partition as key A. How do I delete key B from a partition that key A isn't assigned to?
I was able to solve my problem by switching my key to the other data point and using the new 2.5.0 feature to join two KTables by foreign key. This controls the output because, once a new record comes in with the same key but a different foreign key, my other KTable no longer joins, since the foreign key has changed.
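A rough sketch of that foreign-key join (the topic names, types, and extractor below are placeholders, not the actual code):
KTable<String, Association> associations = builder.table("associations-topic");
KTable<String, ValueRecord> values = builder.table("values-topic");

// Each association row points at a values row via its foreign key; when the foreign key
// changes, the previous join result is retracted automatically.
KTable<String, Enriched> joined = associations.join(
    values,
    Association::getValueKey,                           // foreign-key extractor
    (association, value) -> new Enriched(association, value)
);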
I used these two as resources:
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Join+Semantics#KafkaStreamsJoinSemantics-KTable-KTableForeign-KeyJoin
https://kafka-tutorials.confluent.io/foreign-key-joins/kstreams.html
Instances of Kafka Streams applications can communicate using RPC - https://kafka.apache.org/10/documentation/streams/developer-guide/interactive-queries.html#querying-remote-state-stores-for-the-entire-app.
You could query the other instances by creating a custom RPC endpoint and building logic to delete the value from the remote state store, if found.
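For example, a rough sketch of fanning a cleanup request out to the other instances (the endpoint and HTTP client are assumptions; the remote handler has to perform the actual deletion itself, e.g. by producing a tombstone for that key, since interactive queries are read-only):
for (StreamsMetadata metadata : streams.allMetadataForStore("my-store")) {
    if (!metadata.hostInfo().equals(localHostInfo)) {       // skip the local instance
        String url = String.format("http://%s:%d/store/%s", metadata.host(), metadata.port(), staleKey);
        httpClient.delete(url);                              // hypothetical REST client call
    }
}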

Ideal way to perform lookup on a stream of Kafka topic

I have the following use-case:
There is a stream of records on a Kafka topic, and I have a separate set of unique IDs. For each record in the stream, I need to check whether the record's ID is present in that set of unique IDs. Basically, this should serve as a filter for my Kafka Streams app, i.e., only records of the Kafka topic that match my set of unique IDs should be written to another topic.
Our current application is based on Kafka Streams. I looked at KStreams and KTables. Looks like they're good for enrichments. Now, I don't need any enrichments to the data. As for using state stores, I'm not sure how good they are as a scalable solution.
I would like to do something like this:
kStream.filter((k, v) -> {
    valueToCheckInKTable = v.get(FIELD_NAME);
    // keep the record only if the ID exists in the lookup data (pseudocode)
    return kTable.containsKey(valueToCheckInKTable);
});
The lookup data can be pretty huge. Can someone suggest the best way to do this?
You can read the reference IDs into a table via builder.table("id-topic") with the ID as the primary key (note that the value must be non-null -- otherwise it would be interpreted as a delete -- so if there is no actual value, just put any non-null dummy value on each record when you write the IDs into the id-topic). To load the full table on startup, you might want to provide a custom timestamp extractor that always returns 0 via the Consumed parameter on the table() operator (records are processed in timestamp order, and returning 0 ensures that the records from the id-topic are processed first, so the table is loaded before stream processing starts).
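For example, a sketch of loading the id-topic that way (the serdes are placeholders):
KTable<String, String> idTable = builder.table(
    "id-topic",
    Consumed.with(Serdes.String(), Serdes.String())
            .withTimestampExtractor((record, partitionTime) -> 0L)   // id records always sort first
);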
To do the filtering you do a stream-table join:
KStream stream = builder.stream(...);
// the key in the stream must be the ID; if this is not the case, you can use `selectKey()` to set a new key
KStream filteredStream = stream.join(table,...);
As you don't want to do any enrichment, the provided Joiner function can just return the left (stream-side) value unmodified (and can ignore the right-hand-side table value).
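For example, the joiner could simply be something like:
KStream filteredStream = stream.join(table, (streamValue, tableValue) -> streamValue);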

Kafka Shared State Store Between Kafka DSL KStream Transformers

I have a topology where I use a Transformer for aggregation of my objects. Later down in my topology, I'm trying to read from the state store that was used in the first Transformer, but it doesn't seem possible to access the data. Is it because state stores are on different partitions?
My topology looks something like this:
streamsBuilder.stream("input")
.transform(new TransformerSupplier1(), "my-store")
.leftJoin(someKTable, myValueJoiner())
.flatTransform(new TransformerSupplier2(), "my-store")
In my TransformerSupplier1's Transformer, my state store is of type <String, Map<String, Object>>
In my TransformerSupplier2's Transformer, I'm trying to read from the state store using the key I stored in the first transformer, but I get null, and when I do .all().peekNextKey(), nothing is found.
Let me know if I need to add more information about my Transformers and I'll try to obfuscate the real logic. Thanks