Apache Nifi: Is there a way to publish messages to kafka with a message key as combination of multiple attributes? - apache-kafka

I have a requirement where I need to read a CSV and publish to Kafka topic in Avro format. During the publish, I need to set the message key as the combination of two attributes. Let's say I have an attribute called id and an attribute called group. I need my message key to be id+"-"+group. Is there a way I can achieve this in Apache nifi flow? Setting the message key to a single attribute works fine for me.

Yes, in PublishKafka_2_0 (or whatever version you're using), set the Kafka Key property to construct your message key using NiFi Expression Language. For your example, the expression ${id}-${group} will form it (e.g. id=myId & group=myGroup -> myId-myGroup).
If you don't populate this property explicitly, the processor looks for the attribute kafka.key, so if you had previously set that value, it would be passed through.
Additional information after comment 2020-06-15 16:49
Ah, so the PublishKafkaRecord will publish multiple messages to Kafka, each correlating with a record in the single NiFi flowfile. In this case, the property is asking for a field (a record term meaning some element of the record schema) to use to populate that message key. I would suggest using UpdateRecord before this processor to add a field called messageKey (or whatever you like) to each record using Expression Language, then reference this field in the publishing processor property.

Notice the (?)s on each property, which indicate what is or isn't allowed.
When a property doesn't accept Expression Language, use an UpdateAttribute processor to set the combined value you need. Then use that combined value downstream.

Thank you for your inputs. I had to change my initial design of producing with a key combination to actually partitioning the file based on a specific field using PartitionRecord processor. I have a date field in my CSV file and there can be multiple records per date. I partition based on this date field and produce to the kafka topics using the id field as key per partition. The kafka topic name is dynamic and is suffixed with the date value. Since I plan to use Kafka streams to read data from these topics, this is a much better design than the initial one.

Related

Using ksqlDB to implement CDC using multiple event types in a single topic?

I have the following situation where I have an Apache Kafka topic containing numerous record types.
For example:
UserCreated
UserUpdated
UserDeleted
AnotherRecordType
...
I wish to implement CDC on the three listed User* record types such that at the end, I have an up-to-date KTable with all user information.
How can I do this in ksqlDB? Since, as far as I know, Debezium and other CDC connectors also source their data from a single topic, I at least know it should be possible.
I've been reading through the Confluent docs for a while now, but I can't seem to find anything quite pertinent to my use case (CDC using existing topic). If there is anything I've overlooked, I would greatly appreciate a link to the relevant documentation as well.
I assume that, at the very least, the records must have the same key for ksqlDB to be able to match them. So my questions boil down to:
How would I teach ksqlDB which is an insert, an update and a delete?
Is the key matching a hard requirement, or are there other join/match predicates that we can use?
One possibility that I can think of is basically how CDC already does it: treat each incoming record as a new entry so that I can have something like a slowly changing dimension in the KTable, grouping on the key and selecting entries with e.g. the latest timestamp.
So, is something like the following:
CREATE TABLE users AS
SELECT user.user_id,
latest_by_offset(user.name) AS name,
latest_by_offset(user.email),
CASE WHEN record.key = UserDeleted THEN true ELSE FALSE END,
user.timestamp,
...
FROM users
GROUP BY user.user_id
EMIT CHANGES;
possible (using e.g. ROWKEY for record.key)? If not, how does e.g. Debezium do it?
The general pattern is to not have different schema types; just User. Then, the first record of any unique key (userid, for example) is an insert. Afterwards, any non-null values for the same key are updates (generally requiring all fields to be part of the value, effectively doing a "replace" operation in the table). Deletes are caused by sending null values for the key (tombstone events).
If you have multiple schemas, it might be better to create a new stream that nulls out the values of the delete events, unifies the creates and updates into a common schema that you want information for, and filters out the event types that you want to ignore.
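To make the insert/update/tombstone rules above concrete, here is a minimal plain-Java sketch that replays a keyed changelog into a map. It is only a model of the table semantics, not ksqlDB or Kafka Streams code, and the class, method and field names (UserTableSketch, materialize, Event) are made up for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class UserTableSketch {

    /** One event from the topic: a key plus a value, where a null value is a tombstone. */
    static final class Event {
        final String userId;
        final String value;

        Event(String userId, String value) {
            this.userId = userId;
            this.value = value;
        }
    }

    /** Replays the events in order and returns the resulting table state. */
    static Map<String, String> materialize(List<Event> events) {
        Map<String, String> table = new LinkedHashMap<>();
        for (Event e : events) {
            if (e.value == null) {
                table.remove(e.userId);        // tombstone: delete the row
            } else {
                table.put(e.userId, e.value);  // first put is an insert, later puts replace the row
            }
        }
        return table;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
            new Event("u1", "alice@old.example"),  // insert u1
            new Event("u2", "bob@example"),        // insert u2
            new Event("u1", "alice@new.example"),  // update u1 (full replace)
            new Event("u2", null));                // delete u2
        System.out.println(materialize(events));   // prints {u1=alice@new.example}
    }
}
```

With several event types on one topic, the unifying stream would map UserCreated and UserUpdated to a non-null value and UserDeleted to a null value before the table is built.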
how does e.g. Debezium do it?
For consuming data coming from Debezium topics, you can use a transform to "extract the new record state". It doesn't create any tables for you.

How to rename the id header of a debezium mongodb connector outbox message

I am trying to use the outbox event router of debezium for mongodb. The consumer is a spring cloud stream application. I cannot deserialize the message because spring cloud expects the message id header to be UUID, but it receives byte[]. I have tried different deserializers to no avail. I am thinking of renaming the id header in order to skip this spring cloud check, or remove it altogether. I have tried the ReplaceField SMT but it does not seem to modify the header fields.
Also is there a way to overcome this in spring?
The solution to the initial question is to use the DropHeaders SMT (https://docs.confluent.io/platform/current/connect/transforms/dropheaders.html).
This will remove the id header that is populated by debezium.
But as Oleg Zhurakousky mentioned, moving to a newer version of spring-cloud-stream without @StreamListener solves the underlying problem.
Apparently @StreamListener checks whether a message has an id header and demands that it be of type UUID. With the new functional way of working with spring-cloud-stream, the id header is actually overwritten with a newly generated value. This means that the value populated by debezium (the id column from the outbox table) is ignored. I guess if you need to check for duplicate delivery, it may be better to create your own header instead of using the id. I do not know if spring-cloud-stream generates the same id for the same message if it is redelivered.
Also keep in mind that even in the newer versions of spring-cloud-stream, if you use the deprecated @StreamListener, you will have the same problem.

Ideal way to perform lookup on a stream of Kafka topic

I have the following use-case:
There is a stream of records on a Kafka topic. I have another set of unique IDs. I need to, for each record in the stream, check if the stream's ID is present in the set of unique IDs I have. Basically, this should serve as a filter for my Kafka Streams app. i.e., only to write records of Kafka topic that match the set of Unique IDs I have to another topic.
Our current application is based on Kafka Streams. I looked at KStreams and KTables. Looks like they're good for enrichments. Now, I don't need any enrichments to the data. As for using state stores, I'm not sure how good they are as a scalable solution.
I would like to do something like this:
kStream.filter((k, v) -> {
valueToCheckInKTable = v.get(FIELD_NAME);
if (kTable.containsKey(valueToCheckInKTable)) return record
else ignore
});
The lookup data can be pretty huge. Can someone suggest the best way to do this?
You can read the reference IDs into a table via builder.table("id-topic") with the ID as primary key (note that the value must be non-null, otherwise it would be interpreted as a delete; if there is no actual value, just put any non-null dummy value for each record when you write the IDs into the id-topic). To load the full table on startup, you might want to provide a custom timestamp extractor that always returns 0 via the Consumed parameter on the table() operator (records are processed in timestamp order, and returning 0 ensures that the records from the id-topic are processed first to load the table).
To do the filtering you do a stream-table join:
KStream stream = builder.stream(...);
// the key in the stream must be ID, if this is not the case, you can use `selectKey()` to set a new key
KStream filteredStream = stream.join(table,...);
As you don't want to do any enrichment, the provided Joiner function can just return the left-hand stream side value unmodified (and can ignore the right-hand side table value).
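Why an inner join acts as a filter can be shown with a small plain-Java model. This is a sketch of the join semantics only, not Kafka Streams code: the map stands in for the table built from id-topic, the list stands in for the stream, and all names (JoinAsFilterSketch, KeyValue, innerJoin) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

public class JoinAsFilterSketch {

    /** A keyed record, standing in for one message on the stream. */
    static final class KeyValue {
        final String key;
        final String value;

        KeyValue(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Inner join: a stream record is emitted only if the table holds a non-null value for its key. */
    static List<KeyValue> innerJoin(List<KeyValue> stream,
                                    Map<String, String> table,
                                    BiFunction<String, String, String> joiner) {
        List<KeyValue> out = new ArrayList<>();
        for (KeyValue r : stream) {
            String tableValue = table.get(r.key);
            if (tableValue != null) {
                out.add(new KeyValue(r.key, joiner.apply(r.value, tableValue)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> ids = new HashMap<>();
        ids.put("id-1", "dummy");  // non-null dummy value, as suggested for the id-topic

        List<KeyValue> stream = Arrays.asList(
            new KeyValue("id-1", "kept"),
            new KeyValue("id-2", "dropped"));

        // The joiner returns the left (stream) value and ignores the table value.
        for (KeyValue r : innerJoin(stream, ids, (left, right) -> left)) {
            System.out.println(r.key + " -> " + r.value);  // prints id-1 -> kept
        }
    }
}
```

Because the lookup is by key, the stream has to be keyed by the ID before the join, which is what the selectKey() remark in the snippet above is about.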

Conditional routing in Nifi

I am fetching some data from GetFile processor and currently sending it to mongoDB and rabbitMQ both using PutmongoRecord and PublishAMQP processors respectively.
I want to make this conditional, like say reading an attribute from some file that has the value "mongo" then I should be able to only push it to mongoDB and not rabbitMQ.
I read about RouteOnAttribute processor but not sure how it will work in my case. Please help.
If you are trying to read an attribute (column) from the flowfile and, based on its value, route to PutMongoRecord or PublishAMQP, there are two ways to do it.
Method-1:(preferred)
You can try the PartitionRecord processor, which adds the partition field value to each outgoing flowfile as an attribute.
Then, by using the RouteOnAttribute processor, you can dynamically pass the flowfile to the Mongo and AMQP processors.
Flow:
1.GetFile
2.PartitionRecord //define record reader and writer controller services
3.RouteOnAttribute //add new properties to route mongo and amqp then fork
4.1 PutMongoRecord 4.2 PublishAMQP
(or)
Method-2:
Another way would be to use RouteOnContent and add new properties that identify the records that go to Mongo and to AMQP.
Use the relationships created by those properties to connect to the PutMongoRecord and PublishAMQP processors.
You can add dynamic properties to the RouteOnAttribute processor that compare the incoming flowfile attributes with the expected values (using literal equality or string operations), and drag each outgoing relationship to the desired follow-on processor. This page has an example.

Concatenate value of two fields before publishing to kafka topic using Connect SMT

Is the ReplaceField transform used only to replace or mask the field name, or can I also change the value of the field using some expression or static values?
My need is to concatenate value of two fields before publishing to kafka topic.
org.apache.kafka.connect.transforms.InsertField is used to add static values or topic metadata (topic name, partition, timestamp, offset, etc), but not concatenate, or use expressions.
org.apache.kafka.connect.transforms.ReplaceField is used to rename/filter existing fields, not add new ones.
That being said, you're going to have to create your own Transformation subclass that can merge a list of fields.
Or publish the existing "raw" data then use Kafka Streams or KSQL to create the "enriched" topic.
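For the custom Transformation route, the part you would have to write yourself is the merge of the field values. Below is a minimal plain-Java sketch of that logic, assuming a schemaless record value represented as a Map. It is not a Kafka Connect Transformation implementation: the class name, method name and parameters are hypothetical, and a real SMT would call something like this from its apply() method and would also have to handle records with a schema.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ConcatFieldsSketch {

    /**
     * Returns a copy of the record value with targetField set to the source
     * field values joined by delimiter. The input map is not modified.
     * A missing source field is rendered as the string "null".
     */
    static Map<String, Object> concatFields(Map<String, Object> value,
                                            List<String> sourceFields,
                                            String delimiter,
                                            String targetField) {
        String merged = sourceFields.stream()
            .map(field -> String.valueOf(value.get(field)))
            .collect(Collectors.joining(delimiter));

        Map<String, Object> updated = new LinkedHashMap<>(value);
        updated.put(targetField, merged);
        return updated;
    }

    public static void main(String[] args) {
        Map<String, Object> value = new LinkedHashMap<>();
        value.put("first", "Jane");
        value.put("last", "Doe");

        System.out.println(
            concatFields(value, Arrays.asList("first", "last"), " ", "fullName"));
        // prints {first=Jane, last=Doe, fullName=Jane Doe}
    }
}
```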