I have a Kafka Streams app that reads a single topic as a KTable, transforms each element into 0-n elements, and writes all of them to another topic. The flow looks like this (simplified quite a bit):
('a', '123') -> ('a', '1,2,3') -> ('a1', '1'), ('a2', '2'), ('a3', '3')
Is this doable using the Kafka Streams DSL? All topics in use are compacted, therefore I want to simulate a table and never get rid of old values.
tl;dr: how do I transform one message into multiple messages?
Not sure if you can express this with the DSL, but it's rather simple to do with the Processor API:
builder.stream("input-topic").transform(...).to("output-topic");
You attach a key-value store to your transform() and do the following for each input record (a sketch follows after these steps):
check if there is a corresponding key-value pair in the store
if yes (i.e., store.get() != null), take the old value from the store and split it; replace the value of each "split record" with null and emit all those records (as tombstones for the previously emitted records)
if the input record's value != null, put the input record as-is into the store, then split it and emit the individual output records
if the input record's value == null, delete the key from the store
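A minimal sketch of that logic, assuming a store named "split-store", String keys/values, and the comma-separated format from the simplified example above (not a definitive implementation):

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;

class SplitTransformer implements Transformer<String, String, KeyValue<String, String>> {
    private ProcessorContext context;
    private KeyValueStore<String, String> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        this.store = (KeyValueStore<String, String>) context.getStateStore("split-store");
    }

    @Override
    public KeyValue<String, String> transform(String key, String value) {
        String oldValue = store.get(key);
        if (oldValue != null) {
            // emit tombstones for the records that were split out of the old value
            String[] oldParts = oldValue.split(",");
            for (int i = 0; i < oldParts.length; i++) {
                context.forward(key + (i + 1), null);
            }
        }
        if (value != null) {
            store.put(key, value); // keep the input record as-is
            String[] parts = value.split(",");
            for (int i = 0; i < parts.length; i++) {
                context.forward(key + (i + 1), parts[i]); // ('a', '1,2,3') -> ('a1', '1'), ('a2', '2'), ('a3', '3')
            }
        } else {
            store.delete(key); // input tombstone: remove the key from the store
        }
        return null; // everything is emitted via forward()
    }

    @Override
    public void close() {}
}

The store has to be registered on the builder, e.g. builder.addStateStore(Stores.keyValueStoreBuilder(Stores.persistentKeyValueStore("split-store"), Serdes.String(), Serdes.String())), and its name passed in via transform(SplitTransformer::new, "split-store").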
Check out the docs for more details: https://docs.confluent.io/current/streams/developer-guide/dsl-api.html#applying-processors-and-transformers-processor-api-integration
I have a requirement where I need to read a CSV and publish to a Kafka topic in Avro format. During the publish, I need to set the message key as the combination of two attributes. Let's say I have an attribute called id and an attribute called group. I need my message key to be id+"-"+group. Is there a way I can achieve this in an Apache NiFi flow? Setting the message key to a single attribute works fine for me.
Yes, in the PublishKafka_2_0 (or whatever version you're using), set the Kafka Key property to construct your message key using NiFi Expression Language. For your example, the expression ${id}-${group} will form it (e.g. id=myId & group=MyGroup -> myId-myGroup).
If you don't populate this property explicitly, the processor looks for the attribute kafka.key, so if you had previously set that value, it would be passed through.
Additional information after comment 2020-06-15 16:49
Ah, so the PublishKafkaRecord will publish multiple messages to Kafka, each correlating with a record in the single NiFi flowfile. In this case, the property is asking for a field (a record term meaning some element of the record schema) to use to populate that message key. I would suggest using UpdateRecord before this processor to add a field called messageKey (or whatever you like) to each record using Expression Language, then reference this field in the publishing processor property.
Notice the (?) on each property, which indicates what is or isn't allowed.
When a field doesn't accept Expression Language, use an UpdateAttribute processor to set the combined value you need. Then you use the combined value downstream.
Thank you for your inputs. I had to change my initial design of producing with a key combination to actually partitioning the file based on a specific field using the PartitionRecord processor. I have a date field in my CSV file and there can be multiple records per date. I partition based on this date field and produce to the Kafka topics using the id field as the key per partition. The Kafka topic name is dynamic and is suffixed with the date value. Since I plan to use Kafka Streams to read data from these topics, this is a much better design than the initial one.
I have the following use-case:
There is a stream of records on a Kafka topic. I have another set of unique IDs. For each record in the stream, I need to check if the record's ID is present in the set of unique IDs I have. Basically, this should serve as a filter for my Kafka Streams app, i.e. only records of the Kafka topic that match my set of unique IDs should be written to another topic.
Our current application is based on Kafka Streams. I looked at KStreams and KTables. Looks like they're good for enrichments. Now, I don't need any enrichments to the data. As for using state stores, I'm not sure how good they are as a scalable solution.
I would like to do something like this:
kStream.filter((k, v) -> {
    valueToCheckInKTable = v.get(FIELD_NAME);
    if (kTable.containsKey(valueToCheckInKTable)) return record
    else ignore
});
The lookup data can be pretty huge. Can someone suggest the best way to do this?
You can read the reference IDs into a table via builder.table("id-topic") with the ID as the primary key (note that the value must be non-null, otherwise it would be interpreted as a delete; if there is no actual value, just put any non-null dummy value for each record when you write the IDs into the id-topic). To load the full table on startup, you might want to provide a custom timestamp extractor that always returns 0 via the Consumed parameter on the table() operator (records are processed in timestamp order, and returning 0 ensures that the records from the id-topic are processed first, so the table is loaded before any stream records).
To do the filtering you do a stream-table join:
KStream stream = builder.stream(...);
// the key in the stream must be the ID; if this is not the case, you can use `selectKey()` to set a new key
KStream filteredStream = stream.join(table,...);
As you don't want to do any enrichment, the provided ValueJoiner can just return the stream-side (left) value unmodified and ignore the table-side (right) value.
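A minimal sketch of that setup, assuming topic names "id-topic", "data-topic", and "filtered-topic" plus String serdes (adapt to your own types):

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.processor.TimestampExtractor;

public class IdFilterTopology {

    // Always returns 0 so the id-topic records sort first and the table is loaded
    // before any stream records are processed.
    public static class ZeroTimestampExtractor implements TimestampExtractor {
        @Override
        public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
            return 0L;
        }
    }

    public static void buildTopology(StreamsBuilder builder) {
        KTable<String, String> idTable = builder.table(
                "id-topic",
                Consumed.with(Serdes.String(), Serdes.String())
                        .withTimestampExtractor(new ZeroTimestampExtractor()));

        KStream<String, String> stream = builder.stream(
                "data-topic", Consumed.with(Serdes.String(), Serdes.String()));

        // Inner join drops records whose key is not present in the table;
        // the joiner keeps the stream-side value and ignores the table-side value.
        KStream<String, String> filtered =
                stream.join(idTable, (streamValue, tableValue) -> streamValue);

        filtered.to("filtered-topic");
    }
}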
I am doing a PoC on Kafka Streams and KTables. I was wondering if there is any way to store data (key-value pairs or key-object pairs) in Kafka, either through streams, KTables, or state stores, so that I can retrieve data based on both keys and values.
I created a KStream based on a topic onto which I pushed some messages, and using a word-count-style algorithm I populated values in a KTable created on top of that KStream. Something like this:
StoreBuilder customerStateStore = Stores.keyValueStoreBuilder(Stores.persistentKeyValueStore("customer-store"), Serdes.String(), customerSerde)
        .withLoggingEnabled(new HashMap<>());

streamsBuilder.stream("customer", Consumed.with(Serdes.String(), customerSerde))
        .to("customer-to-ktable-topic", Produced.with(Serdes.String(), customerSerde));

KTable<String, Customer> customerKTable = streamsBuilder.table("customer-to-ktable-topic",
        Consumed.with(Serdes.String(), customerSerde), Materialized.as(customerStateStore.name()));
I am not able to fetch records based on values.
https://kafka.apache.org/10/javadoc/org/apache/kafka/streams/kstream/KTable.html
Only a get(String key) lookup is available according to the Kafka documentation. However, I am exploring whether this can be achieved in some other way.
Your customerStateStore is a key-value store and as you stated, you can only query based on keys.
One proposal would be to work on the IN flow, in order to use the value (or part of the value) as a key in the store. You can do that with the map() method. The idea could be to achieve something like:
Original IN msg: key1 - value1
Would generate 2 entries in the store:
key1 - value1
value1 - key1 (or whatever depending on your usecase)
Doing this, you will be able to query the store on the value1, because it is a key. (Be careful if in the IN topic you have the same value for different keys.)
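A sketch of that idea (the answer mentions map(); flatMap() is used here since one input record yields two output records; the topic names come from the question, but the String types are assumptions):

import java.util.Arrays;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

public class InvertedIndexTopology {
    public static void build(StreamsBuilder builder) {
        builder.stream("customer", Consumed.with(Serdes.String(), Serdes.String()))
               // forward the original pair plus the inverted (value -> key) pair
               .flatMap((key, value) -> Arrays.asList(
                       KeyValue.pair(key, value),
                       KeyValue.pair(value, key)))
               .to("customer-to-ktable-topic", Produced.with(Serdes.String(), Serdes.String()));

        // both kinds of entries end up in one queryable store
        builder.table("customer-to-ktable-topic",
                Consumed.with(Serdes.String(), Serdes.String()),
                Materialized.as("customer-store"));
    }
}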
Or, as an alternative to #NishuTayal's suggestion, you can loop on all the entries of the store and evaluate the values, with the method all(): https://kafka.apache.org/10/javadoc/org/apache/kafka/streams/state/ReadOnlyKeyValueStore.html#all--
Obviously this will degrade the performance, but depending on the size of your (in memory) store and your usecase (get all the entries for a given value? only one entry? ...), it might not add too much delay to the processing.
But you have to be careful with the partitioning of your input topic: one given value may then be present in several partitions of your topic, and then be present in different instances of your KS app.
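A sketch of the all()-based scan through the interactive-query API (Kafka Streams 2.5+); the store name comes from the question, while the generic value type and the predicate are assumptions:

import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class ValueScan {
    // Scan every entry in the store and keep those whose value matches the predicate.
    public static <V> List<KeyValue<String, V>> findByValue(KafkaStreams streams,
                                                            String storeName,
                                                            Predicate<V> valueMatches) {
        ReadOnlyKeyValueStore<String, V> store = streams.store(
                StoreQueryParameters.fromNameAndType(storeName,
                        QueryableStoreTypes.<String, V>keyValueStore()));

        List<KeyValue<String, V>> matches = new ArrayList<>();
        try (KeyValueIterator<String, V> iter = store.all()) {
            while (iter.hasNext()) {
                KeyValue<String, V> entry = iter.next();
                if (valueMatches.test(entry.value)) {
                    matches.add(entry);
                }
            }
        }
        return matches;
    }
}

For example, findByValue(streams, "customer-store", (Customer c) -> "USA".equals(c.getCountry())) would work if your Customer class exposes such a getter (an assumption).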
You can use the filter operation to do key- or value-based lookups:
customerKTable.filter((key, value) -> !"USA".equals(value.get("country")))
Is the ReplaceField transform used only to replace or mask field names, or can I also change the value of a field using some expression or static values?
My need is to concatenate value of two fields before publishing to kafka topic.
org.apache.kafka.connect.transforms.InsertField is used to add static values or topic metadata (topic name, partition, timestamp, offset, etc), but not concatenate, or use expressions.
org.apache.kafka.connect.transforms.ReplaceField is used to rename/filter existing fields, not add new ones.
That being said, you're going to have to create your own Transformation subclass that can merge a list of fields.
Or publish the existing "raw" data then use Kafka Streams or KSQL to create the "enriched" topic.
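A rough, hypothetical sketch of such a custom Transformation that concatenates two Struct fields into a new field (the field names "id", "group", and "id_group" are assumptions, and configuration/error handling are omitted):

import java.util.Map;
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.ConnectRecord;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaBuilder;
import org.apache.kafka.connect.data.Struct;
import org.apache.kafka.connect.transforms.Transformation;

public class ConcatFields<R extends ConnectRecord<R>> implements Transformation<R> {

    @Override
    public R apply(R record) {
        Struct value = (Struct) record.value();
        Schema oldSchema = value.schema();

        // rebuild the value schema with one extra string field
        SchemaBuilder schemaBuilder = SchemaBuilder.struct().name(oldSchema.name());
        oldSchema.fields().forEach(f -> schemaBuilder.field(f.name(), f.schema()));
        schemaBuilder.field("id_group", Schema.STRING_SCHEMA);
        Schema newSchema = schemaBuilder.build();

        // copy the existing fields and add the concatenated one
        Struct newValue = new Struct(newSchema);
        oldSchema.fields().forEach(f -> newValue.put(f.name(), value.get(f)));
        newValue.put("id_group", value.get("id") + "-" + value.get("group"));

        return record.newRecord(record.topic(), record.kafkaPartition(),
                record.keySchema(), record.key(), newSchema, newValue, record.timestamp());
    }

    @Override
    public ConfigDef config() {
        return new ConfigDef();
    }

    @Override
    public void configure(Map<String, ?> configs) {
    }

    @Override
    public void close() {
    }
}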
I have a Kafka stream - say for blogs - and a Kafka table - say for comments related to those blogs. A key from the Kafka stream can map to multiple values in the Kafka table, i.e. one blog can have multiple comments. I want to do a join of these two and create a new object with an array of comment ids. But when I do the join, the stream contains only the last comment id. Is there any documentation or example code which can point me in the right direction? Basically, is there any documentation elaborating how to do a one-to-many relationship join using a Kafka stream and a Kafka table?
KStream<Integer, EnrichedBlog> joinedBlogComments = blogsStream.join(commentsTbl,
    (blogId, blog) -> blog.getBlogId(),
    (blog, comment) -> new EnrichedBlog(blog, comment));
So instead of comment - I need to have an array of comment ids.
I fail to find a join method with a signature matching that in your code example, but here's what I think is the problem:
KTables are interpreted as a changelog, that is to say, every subsequent message with the same key is interpreted as an update to the record, not as a new record. That is why you are seeing only the last "comment" message for a given key (blog id): the previous values are being overwritten.
To overcome this, you'll need to change how you populate your KTable in the first place. What you can do is to add your comment topic as a KStream to your topology and then perform an aggregation that simply builds an array or a list of comments that share the same blog id. That aggregation returns a KTable which you can join your blog KStream with.
Here's a sketch of how you can construct a List-valued KTable:
builder.stream("yourCommentTopic") // where the key is the blog id
    .groupByKey()
    .aggregate(
        () -> new ArrayList<>(),
        (blogId, comment, commentList) -> { commentList.add(comment); return commentList; },
        Materialized.with(yourKeySerde, yourListSerde));
A list is easier to use in an aggregation than an array, so I suggest you convert it to an array downstream if needed. You will also need to provide a serde implementation for your list, "yourListSerde" in the example above.
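For what it's worth, if your values are simple types, Kafka 2.8+ ships a built-in list serde (KIP-466) that could serve as "yourListSerde"; shown here for a list of String comments, so swap in your own element serde as needed:

import java.util.ArrayList;
import java.util.List;
import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;

public class ListSerdeExample {
    // built-in list serde: a concrete list class plus a serde for the element type
    public static Serde<List<String>> commentListSerde() {
        return Serdes.ListSerde(ArrayList.class, Serdes.String());
    }
}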
If you are using Avro with the Schema Registry, you should write your own aggregator because Kafka Streams fails to serialize an ArrayList.
val kTable = aStream
    .groupByKey()
    .aggregate(
        {
            YourAggregator() // initialize aggregator
        },
        { _, value, agg ->
            agg.add(value) // add value to a list in YourAggregator
            agg
        }
    )
And then join the kTable with your other stream (bStream).
bStream
    .join(
        kTable,
        { b, a ->
            // do your value join from a to b
            b
        }
    )
Sorry my snippets are written in Kotlin.
As pointed out in Michal's correct answer above, a KTable keyed by blogId cannot be used to keep track of the comments in this case, since only the latest value per key is retained in such a table.
As a suggested optimization to the solution mentioned in his answer, note that keeping an ever-growing List in the .aggregate() can potentially become costly in both data size and time if there are a lot of comments per blog. This is because, under the hood, each iteration of that aggregation leads to an ever-growing instance of a List, which is okay in Java or Scala because of data re-use, but each of which is serialized separately to the underlying state store. Schematically, assuming that some key has, say, 10 comments, then this expression is called 10 times:
(blogId, comment, commentList) -> { commentList.add(comment); return commentList; }
each time producing a list of size 1, then 2, then ... then 10, each serialized independently to the underlying state store, meaning that 1+2+3+...+10 = 55 values will be serialized in total (well, maybe there's some optimization such that some of those serializations are skipped, I don't know, although the space and time complexity is the same, I think).
An alternative, though more complex, approach is to use range scans in state stores, which makes the data structure look a bit like (partition_key, sort_key) in key-value stores like DynamoDB: we store each comment with a key like (blogId, commentId). In that case you would still re-key the comments stream by blogId (e.g. with selectKey()), then .transform(...) it to drop into the Processor API, where you can apply the range-scan idea, each time adding (i.e. serializing) one single supplementary comment to the state store instead of a new instance of the whole list.
The one-to-many relationship becomes very visible when we picture a lot of instances of (blogId, commentId) keys, all having the same blogId and a different commentId, all stored in the same state store instance on the same physical node, with this whole thing happening in parallel for a lot of blogIds on a lot of nodes.
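A rough sketch of that composite-key idea (the store name, the "#" key layout, the String types, and the commentId extraction are assumptions for illustration, not the blog post's actual code):

import java.util.ArrayList;
import java.util.List;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

class CommentRangeTransformer implements Transformer<String, String, KeyValue<String, List<String>>> {
    private KeyValueStore<String, String> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.store = (KeyValueStore<String, String>) context.getStateStore("comments-by-blog");
    }

    @Override
    public KeyValue<String, List<String>> transform(String blogId, String comment) {
        // store each comment under a composite key: "<blogId>#<commentId>"
        String commentId = extractCommentId(comment); // hypothetical helper
        store.put(blogId + "#" + commentId, comment);

        // range-scan all keys sharing the blogId prefix to assemble the comment list;
        // note that the range is evaluated over the serialized key bytes
        List<String> comments = new ArrayList<>();
        try (KeyValueIterator<String, String> iter = store.range(blogId + "#", blogId + "#\uffff")) {
            while (iter.hasNext()) {
                comments.add(iter.next().value);
            }
        }
        return KeyValue.pair(blogId, comments);
    }

    private String extractCommentId(String comment) {
        // placeholder: derive the comment id from the comment payload
        return Integer.toHexString(comment.hashCode());
    }

    @Override
    public void close() {}
}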
I put more details about that pattern on my blog: One-to-many Kafka Streams Ktable join, and I put a full working example on GitHub.