Ideal way to perform lookup on a stream of Kafka topic - apache-kafka

I have the following use-case:
There is a stream of records on a Kafka topic, and I have a separate set of unique IDs. For each record in the stream, I need to check whether the record's ID is present in that set. Basically, this should serve as a filter for my Kafka Streams app, i.e., only records whose ID matches the set should be written to another topic.
Our current application is based on Kafka Streams. I looked at KStreams and KTables. Looks like they're good for enrichments. Now, I don't need any enrichments to the data. As for using state stores, I'm not sure how good they are as a scalable solution.
I would like to do something like this:
kStream.filter((k, v) -> {
    valueToCheckInKTable = v.get(FIELD_NAME);
    if (kTable.containsKey(valueToCheckInKTable)) return record
    else ignore
});
The lookup data can be pretty huge. Can someone suggest the best way to do this?

You can read the reference IDs into a table via builder.table("id-topic") with the ID as the primary key. Note that the value must be non-null, otherwise the record would be interpreted as a delete; if there is no actual value, just put any non-null dummy value into each record when you write the IDs into the id-topic. To load the full table on startup, you might want to provide a custom timestamp extractor that always returns 0, via the Consumed parameter on the table() operator: records are processed in timestamp order, and returning 0 ensures that the records from the id-topic are processed first to load the table.
To do the filtering you do a stream-table join:
KStream stream = builder.stream(...);
// the key in the stream must be the ID; if this is not the case, you can use `selectKey()` to set a new key
KStream filteredStream = stream.join(table,...);
As you don't want to do any enrichment, the provided Joiner function can just return the left-hand stream value unmodified (and can ignore the right-hand table value).
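To illustrate why an inner join works as a filter here, the following is a minimal plain-Java model, not the Kafka Streams API. The class and method names (JoinAsFilter, innerJoinKeepLeft) are hypothetical, and the table is stood in for by an ordinary Map. Records whose key has no entry in the table are dropped, and the "joiner" simply passes the stream value through.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java model of an inner stream-table join used as a filter.
// Hypothetical names; this is not Kafka Streams code.
final class JoinAsFilter {

    // Keeps only the stream records whose key has a non-null value in the table.
    static <K, V, T> List<Map.Entry<K, V>> innerJoinKeepLeft(
            List<Map.Entry<K, V>> stream, Map<K, T> table) {
        List<Map.Entry<K, V>> out = new ArrayList<>();
        for (Map.Entry<K, V> record : stream) {
            T tableValue = table.get(record.getKey());
            if (tableValue != null) {
                // "Joiner": return the left (stream) value, ignore tableValue.
                out.add(new SimpleEntry<>(record.getKey(), record.getValue()));
            }
        }
        return out;
    }
}
```

In the real topology the same effect would come from a joiner such as (streamValue, tableValue) -> streamValue passed to stream.join(table, ...).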

Related

Using ksqlDB to implement CDC using multiple event types in a single topic?

I have the following situation where I have an Apache Kafka topic containing numerous record types.
For example:
UserCreated
UserUpdated
UserDeleted
AnotherRecordType
...
I wish to implement CDC on the three listed User* record types such that at the end, I have an up-to-date KTable with all user information.
How can I do this in ksqlDB? Since, as far as I know, Debezium and other CDC connectors also source their data from a single topic, I at least know it should be possible.
I've been reading through the Confluent docs for a while now, but I can't seem to find anything quite pertinent to my use case (CDC using existing topic). If there is anything I've overlooked, I would greatly appreciate a link to the relevant documentation as well.
I assume that, at the very least, the records must have the same key for ksqlDB to be able to match them. So my questions boil down to:
How would I teach ksqlDB which is an insert, an update and a delete?
Is the key matching a hard requirement, or are there other join/match predicates that we can use?
One possibility that I can think of is basically how CDC already does it: treat each incoming record as a new entry so that I can have something like a slowly changing dimension in the KTable, grouping on the key and selecting entries with e.g. the latest timestamp.
So, is something like the following:
CREATE TABLE users AS
    SELECT user.user_id,
           latest_by_offset(user.name) AS name,
           latest_by_offset(user.email),
           CASE WHEN record.key = UserDeleted THEN true ELSE FALSE END,
           user.timestamp,
           ...
    FROM users
    GROUP BY user.user_id
    EMIT CHANGES;
possible (using e.g. ROWKEY for record.key)? If not, how does e.g. Debezium do it?
The general pattern is to not have different schema types; just User. Then, the first record for any unique key (userid, for example) is an insert. Afterwards, any non-null values for the same key are updates (generally requiring all fields to be part of the value, effectively doing a "replace" operation in the table). Deletes are caused by sending null values for the key (tombstone events).
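As a rough illustration of those semantics, here is a minimal plain-Java model of a table fed by a changelog. It is not ksqlDB or Kafka code; the class name ChangelogTable and the returned operation labels are made up for this sketch. The point is only that the operation type falls out of the key's prior state and whether the value is null.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java model of changelog (upsert/tombstone) semantics.
// Hypothetical names; this is not ksqlDB or Kafka Streams code.
final class ChangelogTable {
    private final Map<String, String> rows = new HashMap<>();

    // Applies one record and reports which CDC-style operation it amounted to.
    String apply(String key, String value) {
        if (value == null) {
            // Tombstone: a null value removes the row for this key.
            return rows.remove(key) != null ? "DELETE" : "NOOP";
        }
        // put() returns the previous value, or null if the key was absent.
        return rows.put(key, value) == null ? "INSERT" : "UPDATE";
    }

    String get(String key) {
        return rows.get(key);
    }

    int size() {
        return rows.size();
    }
}
```

Note that an update replaces the whole value, which is why each record generally needs to carry all fields.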
If you have multiple schemas, it might be better to create a new stream that nulls out the values of any delete events, unifies the creates and updates into a common schema containing the information you want, and filters out the event types that you want to ignore.
how does e.g. Debezium do it?
For consuming data coming from Debezium topics, you can use a transform to "extract the new record state". It doesn't create any tables for you.

Apache Nifi: Is there a way to publish messages to kafka with a message key as combination of multiple attributes?

I have a requirement where I need to read a CSV and publish to Kafka topic in Avro format. During the publish, I need to set the message key as the combination of two attributes. Let's say I have an attribute called id and an attribute called group. I need my message key to be id+"-"+group. Is there a way I can achieve this in Apache nifi flow? Setting the message key to a single attribute works fine for me.
Yes, in the PublishKafka_2_0 (or whatever version you're using), set the Kafka Key property to construct your message key using NiFi Expression Language. For your example, the expression ${id}-${group} will form it (e.g. id=myId & group=myGroup -> myId-myGroup).
If you don't populate this property explicitly, the processor looks for the attribute kafka.key, so if you had previously set that value, it would be passed through.
Additional information after comment 2020-06-15 16:49
Ah, so the PublishKafkaRecord will publish multiple messages to Kafka, each correlating with a record in the single NiFi flowfile. In this case, the property is asking for a field (a record term meaning some element of the record schema) to use to populate that message key. I would suggest using UpdateRecord before this processor to add a field called messageKey (or whatever you like) to each record using Expression Language, then reference this field in the publishing processor property.
Notice the (?) icon on each property, which indicates what is or isn't allowed.
When a field doesn't accept Expression Language, use an UpdateAttribute processor to set the combined value you need. Then use the combined value downstream.
Thank you for your inputs. I had to change my initial design of producing with a key combination to actually partitioning the file based on a specific field using PartitionRecord processor. I have a date field in my CSV file and there can be multiple records per date. I partition based on this date field and produce to the kafka topics using the id field as key per partition. The kafka topic name is dynamic and is suffixed with the date value. Since I plan to use Kafka streams to read data from these topics, this is a much better design than the initial one.

Is there any function in Kafka table(Ktable) to retrieve keys based on values? or Is there any way to retrieve data based on both keys and values

I am doing a PoC on Kafka Streams and KTables. I was wondering if there is any way to store data (key-value pairs or key-object pairs) in Kafka, whether through streams, KTables, or state stores, so that I can retrieve data based on both keys and values.
I created a KStream on a topic, pushed some messages to it, and, following the word-count example, populated a KTable built on top of that KStream. Something like this:
StoreBuilder customerStateStore = Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore("customer-store"), Serdes.String(), customerSerde)
    .withLoggingEnabled(new HashMap<>());
streamsBuilder.stream("customer", Consumed.with(Serdes.String(), customerSerde))
    .to("customer-to-ktable-topic", Produced.with(Serdes.String(), customerSerde));
KTable<String, Customer> customerKTable = streamsBuilder.table(
    "customer-to-ktable-topic",
    Consumed.with(Serdes.String(), customerSerde),
    Materialized.as(customerStateStore.name()));
I am not able to fetch records based on values.
https://kafka.apache.org/10/javadoc/org/apache/kafka/streams/kstream/KTable.html
The documentation only lists a get(String key) function. Is there some other way to achieve this?
Your customerStateStore is a key-value store and as you stated, you can only query based on keys.
One proposal would be to work on the IN flow, in order to use the value (or part of the value) as a key in the store. You can do that with the map() method. The idea could be to achieve something like:
Original IN msg: key1 - value1
Would generate 2 entries in the store:
key1 - value1
value1 - key1 (or whatever depending on your usecase)
Doing this, you will be able to query the store on the value1, because it is a key. (Be careful if in the IN topic you have the same value for different keys.)
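The idea can be sketched in plain Java with two maps standing in for the store. This is only a model of the double-entry approach, not state-store code, and the names (ReverseIndex, keysFor) are hypothetical. Because several keys may share one value, the reverse side maps a value to a set of keys rather than to a single key.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Plain-Java model of keeping a reverse (value -> keys) index next to
// the normal (key -> value) entries. Hypothetical names.
final class ReverseIndex {
    private final Map<String, String> byKey = new HashMap<>();
    private final Map<String, Set<String>> keysByValue = new HashMap<>();

    void put(String key, String value) {
        String old = byKey.put(key, value);
        if (old != null) {
            // The key moved away from its old value: clean up the reverse entry.
            Set<String> keys = keysByValue.get(old);
            keys.remove(key);
            if (keys.isEmpty()) {
                keysByValue.remove(old);
            }
        }
        keysByValue.computeIfAbsent(value, v -> new HashSet<>()).add(key);
    }

    String valueFor(String key) {
        return byKey.get(key);
    }

    // All keys currently holding this value (empty if none).
    Set<String> keysFor(String value) {
        return keysByValue.getOrDefault(value, Collections.<String>emptySet());
    }
}
```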
Or, as an alternative to @NishuTayal's suggestion, you can loop over all the entries of the store and evaluate the values, with the method all(): https://kafka.apache.org/10/javadoc/org/apache/kafka/streams/state/ReadOnlyKeyValueStore.html#all--
Obviously this will degrade the performance, but depending on the size of your (in memory) store and your usecase (get all the entries for a given value? only one entry? ...), it might not add too much delay to the processing.
But you have to be careful with the partitioning of your input topic: one given value may then be present in several partitions of your topic, and then be present in different instances of your KS app.
You can use the filter operation to make key- or value-based lookups. Note that strings must be compared with equals(), not !=:
customerKTable.filter((key, value) -> !"USA".equals(value.get("country")))

Kafka Streams mapping KTable value into separate values

I have a Kafka Stream App that reads a single topic as a KTable, transforms each element into 0-n elements and writes all of them to another topic. The flow looks like this (simplified by quite a bit):
('a', '123') -> ('a', '1,2,3') -> ('a1', '1'), ('a2', '2'), ('a3', '3')
Is this doable using the Kafka Streams DSL? All topics in use are compacted, therefore I want to simulate a table and never get rid of old values.
tl;dr; how to transform a message into multiple messages?
Not sure if you can express this with the DSL. But it's rather simple to do this with the Processor API:
builder.stream("input-topic").transform(...).to("output-topic");
You attach a key-value store to your transform() and do the following for each input record:
check if there is a corresponding key-value pair in the store
if yes (ie, store.get()!=null), take the old value from the store and split it; replace the value for each "split record" with null and emit all those records
if input record value!=null, put the input record as-is to the store and split the input record and emit the individual output records
if input record value==null, delete key from the store
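The steps above can be modelled in plain Java as follows. This is a sketch of the logic only, not a Transformer implementation: the store is an ordinary HashMap, and the split rule (one output record per character, with the key suffixed by the 1-based position) is an assumption based on the example in the question. All names are hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of the per-record logic; a HashMap stands in for the
// key-value store attached to transform(). Hypothetical names.
final class SplittingProcessor {
    private final Map<String, String> store = new HashMap<>();

    // Assumed split rule: ("a", "123") -> ("a1","1"), ("a2","2"), ("a3","3").
    private static List<Map.Entry<String, String>> split(
            String key, String value, boolean asTombstones) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (int i = 0; i < value.length(); i++) {
            String part = String.valueOf(value.charAt(i));
            out.add(new SimpleEntry<>(key + (i + 1), asTombstones ? null : part));
        }
        return out;
    }

    // Returns the records to emit for one input record.
    List<Map.Entry<String, String>> process(String key, String value) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        String old = store.get(key);
        if (old != null) {
            // Retract everything previously emitted for this key.
            out.addAll(split(key, old, true));
        }
        if (value != null) {
            store.put(key, value);
            out.addAll(split(key, value, false));
        } else {
            store.remove(key);
        }
        return out;
    }
}
```

Emitting the tombstones first matters on compacted output topics: otherwise split records from an earlier, longer value would linger after an update.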
Check out the docs for more details: https://docs.confluent.io/current/streams/developer-guide/dsl-api.html#applying-processors-and-transformers-processor-api-integration

Kafka Stream and KTable One-to-Many Relationship Join

I have a Kafka stream, say for blogs, and a Kafka table, say for comments related to those blogs. A key from the stream can map to multiple values in the table, i.e. one blog can have multiple comments. I want to join the two and create a new object with an array of comment ids. But when I do the join, the stream contains only the last comment id. Is there any documentation or example code which can point me in the right direction? Basically, is there any documentation elaborating how to do a one-to-many relationship join using a Kafka stream and a Kafka table?
KStream<Integer, EnrichedBlog> joinedBlogComments = blogsStream.join(commentsTbl,
    (blogId, blog) -> blog.getBlogId(),
    (blog, comment) -> new EnrichedBlog(blog, comment));
So instead of comment - I need to have an array of comment ids.
I fail to find a join method with a signature matching that in your code example, but here's what I think is the problem:
KTables are interpreted as a changelog, that is to say, every subsequent message with the same key is interpreted as an update to the record, not as a new record. That is why you are seeing only the last "comment" message for a given key (blog id): the previous values are being overwritten.
To overcome this, you'll need to change how you populate your KTable in the first place. What you can do is to add your comment topic as a KStream to your topology and then perform an aggregation that simply builds an array or a list of comments that share the same blog id. That aggregation returns a KTable which you can join your blog KStream with.
Here's a sketch of how to construct a List-valued KTable:
builder.stream("yourCommentTopic") // where key is blog id
    .groupByKey()
    .aggregate(
        () -> new ArrayList(),
        // the aggregator must return the updated aggregate itself;
        // List.add() returns a boolean, so add first and then return the list
        (key, value, agg) -> { agg.add(value); return agg; },
        yourListSerde);
A list is easier to use in an aggregation than an array, so I suggest you convert it to an array downstream if needed. You will also need to provide a serde implementation for your list, "yourListSerde" in the example above.
If you are using avro with schema registry, you should write your own aggregator because kafka stream fails to serialize ArrayList.
val kTable = aStream
    .groupByKey()
    .aggregate(
        {
            YourAggregator() // initialize aggregator
        },
        { _, value, agg ->
            agg.add(value) // add value to a list in YourAggregator
            agg
        }
    )
And then join the kTable with your other stream (bStream).
bStream
    .join(
        kTable,
        { b, a ->
            // do your value join from a to b
            b
        }
    )
Sorry my snippets are written in Kotlin.
As pointed out in the correct answer of Michal above, a KTable keyed by blogId cannot be used to keep track of the blogs in this case since only the latest blog value is retained in such table.
As a suggested optimization to the solution mentioned in his answer, note that keeping an ever growing List in the .aggregate() can potentially become costly in both data size and time if there are a lot of comments per blog. This is because under the hood, each iteration of that aggregation leads to ever-growing instances of a List, which is ok in java or scala because of data re-use, but which are each serialized separately to the underlying state-store. Schematically, assuming that some key has say 10 comments, then this expression is called 10 times:
(key, value, agg) -> { agg.add(value); return agg; }
each time producing a list of size 1, then 2, then ... then 10, each serialized independently to the underlying state store, meaning that 1+2+3+...+10=55 values will be serialized in total (well, maybe there's some optimization such that some of those serializations are skipped, I don't know, although the space and time complexity is the same I think).
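The arithmetic can be checked with a small plain-Java model. It only counts how many list elements would be written if the whole aggregate were re-serialized after every append, versus one write per comment; it ignores any caching Kafka Streams may apply, and the names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of the write amplification described above.
// Hypothetical names; ignores any record caching in front of the store.
final class AggregationCost {

    // Elements written if the full list is serialized after each append.
    static long elementsWrittenWithGrowingList(int comments) {
        List<Integer> agg = new ArrayList<>();
        long written = 0;
        for (int i = 0; i < comments; i++) {
            agg.add(i);
            written += agg.size(); // the whole list goes to the store again
        }
        return written;
    }

    // Elements written if each comment is stored under its own key.
    static long elementsWrittenPerComment(int comments) {
        return comments;
    }
}
```

For 10 comments the first method returns 55 and the second returns 10; in general the growing list costs n(n+1)/2 element writes against n.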
An alternative, though more complex, approach is to use range scans in state stores, which makes the data structure look a bit like (partition_key, sort_key) in key-value stores like DynamoDB: each comment is stored under a key like (blogId, commentId). In that case you would still key the comments stream by blogId (for example with selectKey()), then .transform(...) it to hand it to the Processor API, where you can apply the range scan idea, each time adding (i.e. serializing) one single additional comment to the state store instead of a new instance of the whole list.
The one-to-many relationship becomes very visible when we picture a lot of instances of (blogId, commentId) keys, all having the same blogId and a different commentId, all stored in the same state store instance on the same physical node, and this whole thing happening in parallel for a lot of blogIds on a lot of nodes.
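A minimal plain-Java sketch of the composite-key idea, with a TreeMap standing in for a sorted state store that supports range scans. This is not the Processor API; the class name, the key encoding (zero-padded "blogId:commentId" strings), and the assumption that ids are non-negative ints are all choices made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Plain-Java model of (blogId, commentId) composite keys with a range scan.
// A TreeMap stands in for a sorted state store. Hypothetical names.
final class CommentStore {
    private final TreeMap<String, String> store = new TreeMap<>();

    // Zero-padding makes lexicographic order match numeric order
    // (assumes non-negative ids).
    private static String compositeKey(int blogId, int commentId) {
        return String.format("%010d:%010d", blogId, commentId);
    }

    // Writes exactly one entry per comment; nothing is re-serialized.
    void addComment(int blogId, int commentId, String text) {
        store.put(compositeKey(blogId, commentId), text);
    }

    // Range scan over every key that starts with this blogId.
    List<String> commentsFor(int blogId) {
        String from = compositeKey(blogId, 0);
        String to = compositeKey(blogId, Integer.MAX_VALUE);
        return new ArrayList<>(store.subMap(from, true, to, true).values());
    }
}
```

Each new comment costs one small write, and the full list for a blog is only assembled when it is actually needed.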
I put more details about that pattern on my blog: One-to-many Kafka Streams Ktable join, and I put a full working example in github