I have a materialized view that is updated from many streams. Each one enriches it partially. Order doesn't matter, and updates arrive at unspecified times. Is the following algorithm a good approach:
An update comes in and I check what is stored in the materialized view via get(); this is the initial one, so I enrich and save it.
A second one comes in and get() shows that a partial update exists, so I add the next piece of information.
... and I continue in the same style.
If there is a query/join, the stored object has a method, isValid(), that indicates whether the update is complete and that could be used in KStream#filter().
Could you please tell me whether this is a good plan? Is there any pattern in the Kafka Streams world that handles this case?
Please advise.
Your plan looks good; you have the general idea, but you'll have to use the lower-level Kafka Streams API: the Processor API.
There is a transform() operator that allows you to access a KeyValue state store; inside that implementation you are free to decide whether your current aggregated value is valid or not,
and therefore to send it downstream, or to return null and wait for more information.
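For illustration, a minimal sketch of that approach (the PartialUpdate type, its merge() and isValid() methods, the topic and store names, and partialUpdateSerde are all placeholders for your own model; the usual org.apache.kafka.streams imports are assumed):
StoreBuilder<KeyValueStore<String, PartialUpdate>> storeBuilder =
        Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("partial-updates-store"),
                Serdes.String(), partialUpdateSerde);
builder.addStateStore(storeBuilder);

builder.stream("updates-topic", Consumed.with(Serdes.String(), partialUpdateSerde))
        .transform(() -> new Transformer<String, PartialUpdate, KeyValue<String, PartialUpdate>>() {
            private KeyValueStore<String, PartialUpdate> store;

            @SuppressWarnings("unchecked")
            @Override
            public void init(ProcessorContext context) {
                store = (KeyValueStore<String, PartialUpdate>) context.getStateStore("partial-updates-store");
            }

            @Override
            public KeyValue<String, PartialUpdate> transform(String key, PartialUpdate update) {
                PartialUpdate current = store.get(key);                      // what we have so far
                PartialUpdate merged = (current == null) ? update : current.merge(update);
                store.put(key, merged);                                      // save the enriched state
                // emit only once the aggregate is complete; returning null drops the record
                return merged.isValid() ? KeyValue.pair(key, merged) : null;
            }

            @Override
            public void close() { }
        }, "partial-updates-store")
        .to("complete-updates-topic");
With this, downstream consumers only ever see complete objects and no separate filter() step is needed.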
I am processing messages from a sourceTopic to a targetTopic using a KStream (via the map method). In the map method, I am generating a new schema (since I need to extract explicit fields) for the target topic from the incoming messages. But since the KStream operation is per message, I wish to avoid regenerating the schema for every message and would instead like to cache the schema ID of the incoming messages (for both key and value) and generate a new target schema only if the source schema changes.
Is there a way to do this via the KStream object or from the key/value objects used in the map method?
Update:
I was not able to get the schema ID for my use case above. As a workaround, I cached the schema in a local variable, checked on each iteration whether it had changed, and processed further as required.
You will only have access to the ID if you use Serdes.Bytes() (or ByteArray); after the records are deserialized, you only have access to the Schema.
The AvroSerdes from Confluent already cache the schema IDs, though.
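If you really do need the ID itself, a rough sketch (assuming the records were produced with Confluent's KafkaAvroSerializer, whose wire format is one magic byte followed by a 4-byte big-endian schema ID; the topic name and the "regenerate" step are placeholders):
AtomicInteger lastSeenSchemaId = new AtomicInteger(-1);

builder.stream("sourceTopic", Consumed.with(Serdes.ByteArray(), Serdes.ByteArray()))
        .peek((key, value) -> {
            // Confluent wire format: [magic byte 0][4-byte big-endian schema id][Avro payload]
            int schemaId = ByteBuffer.wrap(value, 1, 4).getInt();
            if (lastSeenSchemaId.getAndSet(schemaId) != schemaId) {
                // the schema changed: regenerate the target schema here
            }
        });
The trade-off is that you then have to do the Avro deserialization yourself downstream, which is why caching the schema (as in your workaround) is usually simpler.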
I have the following use-case:
There is a stream of records on a Kafka topic, and I have a separate set of unique IDs. For each record in the stream, I need to check whether the record's ID is present in that set of unique IDs. Basically, this should serve as a filter for my Kafka Streams app, i.e. only records of the Kafka topic that match the set of unique IDs should be written to another topic.
Our current application is based on Kafka Streams. I looked at KStreams and KTables; they look good for enrichment, but I don't need any enrichment of the data. As for using state stores, I'm not sure how good they are as a scalable solution.
I would like to do something like this:
kStream.filter((k, v) -> {
    valueToCheckInKTable = v.get(FIELD_NAME);
    if (kTable.containsKey(valueToCheckInKTable)) return record
    else ignore
});
The lookup data can be pretty huge. Can someone suggest the best way to do this?
You can read the reference IDs into a table via builder.table("id-topic"), with the ID as the primary key (note that the value must be non-null, otherwise it would be interpreted as a delete; if there is no actual value, just put any non-null dummy value into each record when you write the IDs into the id-topic). To load the full table on startup, you might want to provide a custom timestamp extractor that always returns 0 via the Consumed parameter of the table() operator (records are processed in timestamp order, and returning 0 ensures that the records from the id-topic are processed first, to load the table).
To do the filtering you do a stream-table join:
KStream stream = builder.stream(...);
// the key in the stream must be the ID; if this is not the case, you can use `selectKey()` to set a new key
KStream filteredStream = stream.join(table,...);
As you don't want to do any enrichment, the provided ValueJoiner can just return the left-hand (stream) value unmodified (and can ignore the right-hand table value).
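Putting that together, a sketch might look like this (topic names and the ValueType class are placeholders, FIELD_NAME is assumed to hold the ID as a String, and serdes are omitted where defaults apply):
// load the reference IDs as a table; the 0-timestamp extractor makes sure the
// id-topic records are processed before the stream records on startup
KTable<String, String> idTable = builder.table("id-topic",
        Consumed.with(Serdes.String(), Serdes.String())
                .withTimestampExtractor((record, partitionTime) -> 0L));

KStream<String, ValueType> stream = builder.stream("input-topic");

KStream<String, ValueType> filtered = stream
        .selectKey((key, value) -> value.get(FIELD_NAME))   // re-key by the ID field
        .join(idTable, (value, dummy) -> value);            // keep the stream value, ignore the dummy

filtered.to("filtered-topic");
Note that selectKey() triggers a repartition, which is what makes the join (and thus the lookup) scale with the number of partitions instead of requiring the full ID set on every instance.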
I'm trying to use Kafka Streams with a state store distributed among two instances. Here's how the store and the associated KTable are defined:
KTable<String, Double> userBalancesTable = kStreamBuilder.table(
"balances-table",
Consumed.with(String(), Double()),
Materialized.<String, Double, KeyValueStore<Bytes, byte[]>>as(BALANCES_STORE).withKeySerde(String()).withValueSerde(Double())
);
Next, I have some stream processing logic which is aggregating some data into this balances-table KTable:
transactionsStream
    .leftJoin(...)
    ...
    .aggregate(...)
    .to("balances-table", Produced.with(String(), Double()));
And at some point I am, from a REST handler, querying the state store.
ReadOnlyKeyValueStore<String, Double> balances = streams.store(BALANCES_STORE, QueryableStoreTypes.<String, Double>keyValueStore());
return Optional.ofNullable(balances.get(userId)).orElse(0.0);
Which works perfectly - as long as I have a single stream processing instance.
Now, I'm adding a second instance (note: my topics all have 2 partitions). As explained in the docs, the state store BALANCES_STORE is distributed among the instances based on the key of each record (in my case, the key is a user ID). Therefore, an instance must:
Make a call to KafkaStreams#metadataForKey to discover which instance is handling the part of the state store containing the key it wants to retrieve
Make an RPC (e.g. REST) call to this instance to retrieve it (see the sketch below)
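Roughly, that two-step lookup is wired like this (a sketch only; thisHost/thisPort and restClient are placeholders for your own instance configuration and HTTP client, the rest is the Kafka Streams API):
StreamsMetadata metadata = streams.metadataForKey(
        BALANCES_STORE, userId, Serdes.String().serializer());

if (metadata == null || StreamsMetadata.NOT_AVAILABLE.equals(metadata)) {
    throw new IllegalStateException("Store metadata not available yet");
}

if (thisHost.equals(metadata.host()) && thisPort == metadata.port()) {
    // the key lives in the local part of the state store
    ReadOnlyKeyValueStore<String, Double> balances =
            streams.store(BALANCES_STORE, QueryableStoreTypes.<String, Double>keyValueStore());
    return Optional.ofNullable(balances.get(userId)).orElse(0.0);
} else {
    // the key lives on the other instance: forward the request over REST
    return restClient.getBalance(metadata.host(), metadata.port(), userId);
}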
My problem is that the call to KafkaStreams#metadataForKey is always returning a null metadata object. However, KafkaStreams#allMetadataForStore() is returning a metadata object containing the two instances. So it behaves like it doesn't know about the key I'm querying, although looking it up in the state stores works.
Additional note: my application.server property is correctly set.
Thank you!
I have a Kafka stream, say for blogs, and a Kafka table, say for comments related to those blogs. A key from the Kafka stream can map to multiple values in the Kafka table, i.e. one blog can have multiple comments. I want to do a join of these two and create a new object with an array of comment IDs. But when I do the join, the stream contains only the last comment ID. Is there any documentation or example code which can point me in the right direction on how to achieve this? Basically, is there any documentation elaborating how to do a one-to-many relationship join using a Kafka stream and a Kafka table?
KStream<Integer, EnrichedBlog> joinedBlogComments = blogsStream.join(commentsTbl,
        (blogId, blog) -> blog.getBlogId(),
        (blog, comment) -> new EnrichedBlog(blog, comment));
So instead of comment - I need to have an array of comment ids.
I fail to find a join method with a signature matching that in your code example, but here's what I think is the problem:
KTables are interpreted as a changelog, that is to say, every next message with the same key is interpreted as an update to the record, not as a new record. That is why you are seeing only the last "comment" message for a given key (blog id): the previous values are being overwritten.
To overcome this, you'll need to change how you populate your KTable in the first place. What you can do is to add your comment topic as a KStream to your topology and then perform an aggregation that simply builds an array or a list of comments that share the same blog id. That aggregation returns a KTable which you can join your blog KStream with.
Here's a sketch of how you can construct a List-valued KTable:
builder.stream("yourCommentTopic") // where key is blog id
.groupByKey()
.aggregate(() -> new ArrayList(),
(key, value, agg) -> new KeyValue<>(key, agg.add(value)),
yourListSerde);
A list is easier to use in an aggregation than an array, so I suggest you convert it to an array downstream if needed. You will also need to provide a serde implementation for your list, "yourListSerde" in the example above.
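If you are on a recent Kafka version (2.8+), one option for "yourListSerde" is the built-in list serde, so you don't have to hand-roll one (here commentSerde stands for whatever serde you already use for a single comment value, and Comment for its type):
Serde<List<Comment>> yourListSerde = Serdes.ListSerde(ArrayList.class, commentSerde);
Otherwise a small custom Serde backed by JSON or an Avro array type works just as well.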
If you are using Avro with the Schema Registry, you should write your own aggregator class, because Kafka Streams fails to serialize an ArrayList.
val kTable = aStream
    .groupByKey()
    .aggregate(
        {
            YourAggregator() // initialize aggregator
        },
        { _, value, agg ->
            agg.add(value) // add value to a list in YourAggregator
            agg
        }
    )
And then join the kTable with your other stream (bStream).
bStream
    .join(
        kTable,
        { b, a ->
            // do your value join from a to b
            b
        }
    )
Sorry, my snippets are written in Kotlin.
As pointed out in Michal's correct answer above, a KTable keyed by blogId cannot be used to keep track of the comments in this case, since only the latest comment value per blog would be retained in such a table.
As a suggested optimization to the solution mentioned in his answer, note that keeping an ever-growing List in the .aggregate() can potentially become costly in both data size and time if there are a lot of comments per blog. This is because, under the hood, each iteration of that aggregation leads to an ever-growing instance of a List, which is OK in Java or Scala because of data re-use, but each of which is serialized separately to the underlying state store. Schematically, assuming that some key has, say, 10 comments, this expression is called 10 times:
(key, value, agg) -> { agg.add(value); return agg; }
each time producing a list of size 1, then 2, then ... then 10, each serialized independently to the underlying state store, meaning that 1+2+3+...+10 = 55 values will be serialized in total (well, maybe there's some optimization such that some of those serializations are skipped, I don't know, although I think the space and time complexity is the same).
An alternative, though more complex, approach is to use range scans in state stores, which makes the data structure look a bit like (partition_key, sort_key) in key-value stores like DynamoDB, where we store each comment with a key like (blogId, commentId). In that case you would still re-key the comments stream by blogId (e.g. with selectKey()), then .transform(...) it to pass it to the Processor API, where you can apply the range-scan idea, each time adding (i.e. serializing) one single supplementary comment to the state store instead of a new instance of the whole list.
The one-to-many relationship becomes very visible when we picture a lot of instances of (blogId, commentId) keys, all having the same blogId and a different commentId, all stored in the same state store instance on the same physical node, and this whole thing happening in parallel for many blogIds across many nodes.
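To make that more concrete, here is a rough sketch with the Processor API. Everything here is illustrative, not a standard API: the Comment type with its getBlogId()/getCommentId() accessors (getBlogId() assumed to return a String so it can be used in the composite key), commentSerde, the "|" separator, and the choice to emit the full list on every message.
StoreBuilder<KeyValueStore<String, Comment>> commentStore =
        Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("comments-by-blog"),
                Serdes.String(), commentSerde);
builder.addStateStore(commentStore);

commentsStream
        .selectKey((key, comment) -> comment.getBlogId())
        .transform(() -> new Transformer<String, Comment, KeyValue<String, List<Comment>>>() {
            private KeyValueStore<String, Comment> store;

            @SuppressWarnings("unchecked")
            @Override
            public void init(ProcessorContext context) {
                store = (KeyValueStore<String, Comment>) context.getStateStore("comments-by-blog");
            }

            @Override
            public KeyValue<String, List<Comment>> transform(String blogId, Comment comment) {
                // serialize only the new comment, under a composite key "<blogId>|<commentId>"
                store.put(blogId + "|" + comment.getCommentId(), comment);

                // range scan over all comments of this blog when the whole set is needed
                List<Comment> comments = new ArrayList<>();
                try (KeyValueIterator<String, Comment> iter =
                             store.range(blogId + "|", blogId + "|\uffff")) {
                    while (iter.hasNext()) {
                        comments.add(iter.next().value);
                    }
                }
                return KeyValue.pair(blogId, comments);
            }

            @Override
            public void close() { }
        }, "comments-by-blog");
The point is that each incoming comment writes (and serializes) only itself, while the full list is reassembled on demand via the range scan.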
I put more details about that pattern on my blog (One-to-many Kafka Streams Ktable join), and I put a full working example on GitHub.