How to rename the id header of a Debezium MongoDB connector outbox message

I am trying to use the outbox event router of Debezium for MongoDB. The consumer is a Spring Cloud Stream application. I cannot deserialize the message because Spring Cloud expects the message id header to be a UUID, but it receives a byte[]. I have tried different deserializers to no avail. I am thinking of renaming the id header in order to skip this Spring Cloud check, or removing it altogether. I have tried the ReplaceField SMT, but it does not seem to modify the header fields.
Also, is there a way to overcome this in Spring?

The solution to the initial question is to use the DropHeaders SMT (https://docs.confluent.io/platform/current/connect/transforms/dropheaders.html).
This will remove the id header that is populated by Debezium.
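For reference, a minimal sketch of the relevant connector configuration fragment, assuming the SMT is registered under the name dropHeaders and that the header written by the outbox router is literally called id (append dropHeaders to your existing transforms list if you already use other SMTs):

"transforms": "dropHeaders",
"transforms.dropHeaders.type": "org.apache.kafka.connect.transforms.DropHeaders",
"transforms.dropHeaders.headers": "id"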
But as Oleg Zhurakousky mentioned, moving to a newer version of spring-cloud-stream and dropping @StreamListener solves the underlying problem.
Apparently @StreamListener checks whether a message has an id header and demands that it be of type UUID. With the newer functional way of working with spring-cloud-stream, the id header is actually overwritten with a newly generated value. This means that the value populated by Debezium (the id column from the outbox table) is ignored. If you need to check for duplicate delivery, it is probably better to create your own header instead of relying on id; I do not know whether spring-cloud-stream generates the same id for the same message if it is redelivered.
Also keep in mind that even in newer versions of spring-cloud-stream, if you use the deprecated @StreamListener, you will have the same problem.
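For illustration, a minimal sketch of a functional-style consumer, assuming a String payload and a hypothetical custom header named outbox_id that you would populate yourself for duplicate detection:

import java.util.function.Consumer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.messaging.Message;

@Configuration
public class OutboxConsumerConfig {

    // Functional binding (no @StreamListener): the framework assigns its own
    // message id, so the byte[] id header written by Debezium no longer trips
    // the UUID check.
    @Bean
    public Consumer<Message<String>> outboxEvents() {
        return message -> {
            // "outbox_id" is a hypothetical header used here instead of the standard id header.
            Object outboxId = message.getHeaders().get("outbox_id");
            System.out.println("Received " + message.getPayload() + " (outbox_id=" + outboxId + ")");
        };
    }
}

The bean name determines the binding name, so the topic is configured via spring.cloud.stream.bindings.outboxEvents-in-0.destination.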

Related

Can I update Apache Atlas metadata by adding a message directly into the Kafka topic?

I am trying to add a message to Entities_Topic to update the kafka_topic type metadata in Apache Atlas. I wrote the data according to the JSON format of the message, but it didn't work.
application.log shows the following:
graph rollback due to exception AtlasBaseException: Instance kafka_topic with unique attribute {qualifiedName=atlas_test00#primary # clusterName to use in qualified name of entities. Default: primary} does not exist (GraphTransactionInterceptor:202)
graph rollback due to exception AtlasBaseException: Instance __AtlasUserProfile with unique attribute {name=admin} does not exist (GraphTransactionInterceptor:202)
And here is the message I passed into Kafka Topic earlier:
{"version":{"version":"1.0.0","versionParts":[1]},"msgCompressionKind":"NONE","msgSplitIdx":1,"msgSplitCount":1,"msgSourceIP":"192.168.1.110","msgCreatedBy":"","msgCreationTime":1664440029000,"spooled":false,"message":{"type":"ENTITY_NOTIFICATION_V2","entity":{"typeName":"kafka_topic","attributes":{"qualifiedName":"atlas_test_k1#primary # clusterName to use in qualifiedName of entities. Default: primary","name":"atlas_test01","description":"atlas_test_k1"},"displayText":"atlas_test_k1","isIncomplete":false},"operationType":"ENTITY_CREATE","eventTime":1664440028000}}
It is worth noting that there is no GUID in the message, and I do not know how to create one manually. Also, I changed the time according to the timestamp of the current time. The JSON is passed into the topic through the Kafka tool Offset Explorer.
My team leader wants to update the metadata by sending messages directly into Kafka, and I'm just trying to see if that's possible.
How can I implement this idea, or what am I doing wrong?

Is there any way to check for a duplicate message control id (MSH:10) in the MSH segment using Mirth Connect?

Is there any way to check for a duplicate message control id (MSH:10) in the MSH segment using Mirth Connect?
MSH|^~&|sss|xxx|INSTANCE2|KKLIU 0063/2021|20190905162034||ADT^A28^ADT_A05|Zx20190905162034|P|2.4|||NE|NE|||||
Whenever a message enters, it needs to be validated whether a message with the control id Zx20190905162034 has already been processed.
Mirth will not do this for you, but you can write your own JavaScript transformer to check a database or your own set of previously encountered control ids.
Your JavaScript can make use of any appropriate Java classes.
The database check (which you can implement using a code template) is the easier way out. You might want to designate the column storing MSH:10 values as a primary key or define an index on it, so that lookups against unique entries are faster. Other alternatives include periodically redeploying the channel while reading all MSH:10 values already in the database into a global map variable, or maintaining them in an API that you can query with a GET request while processing every message. Which option is best depends on the number of records involved.
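The transformer itself is written in Mirth's JavaScript, but since it can call Java classes directly, the core of the database check can be illustrated in plain Java. This is a sketch assuming a hypothetical processed_messages table with a unique index on its control_id column; the JDBC URL and credentials are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLIntegrityConstraintViolationException;

public class ControlIdDedup {

    // Returns true if this control id has been seen before. Relies on the
    // unique index on processed_messages.control_id rejecting duplicates.
    public static boolean isDuplicate(Connection conn, String controlId) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO processed_messages (control_id) VALUES (?)")) {
            ps.setString(1, controlId);
            ps.executeUpdate();
            return false;                 // insert succeeded: first time we see this id
        } catch (SQLIntegrityConstraintViolationException e) {
            return true;                  // unique constraint hit: already processed
        }
    }

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/mirthdb", "user", "password")) {
            // Control id taken from the sample MSH segment above.
            System.out.println(isDuplicate(conn, "Zx20190905162034"));
        }
    }
}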

Customize Debezium pubsub message

I am trying to use Debezium Server to stream "some" changes in a PostgreSQL table. Namely, the table being tracked has a json-type column named "payload". I would like the message streamed to pubsub by Debezium to contain only the contents of the payload column. Is that possible?
I've explored the custom transformations provided by Debezium, but from what I could tell they only allow me to enrich the published message with extra fields, not to publish only certain fields, which is what I want to do.
Edit:
The closest I got to what I wanted was to use the outbox transform but that published the following message:
{
  "schema": {
    ...
  },
  "payload": {
    "key": "value"
  }
}
Whereas what I would like the message to be is:
{"key":"value"}
I've tried adding an ExtractNewRecordState transform but still got the same results. My application.properties file looks like:
debezium.transforms=outbox,unwrap
debezium.transforms.outbox.type=io.debezium.transforms.outbox.EventRouter
debezium.transforms.outbox.table.field.event.key=grouping_key
debezium.transforms.outbox.table.field.event.payload.id=id
debezium.transforms.outbox.route.by.field=target
debezium.transforms.outbox.table.expand.json.payload=true
debezium.transforms.unwrap.type=io.debezium.transforms.ExtractNewRecordState
debezium.transforms.unwrap.add.fields=payload
Many thanks,
Daniel

How can Confluent SchemaRegistry help ensure safe read (projection) Avro schema evolution?

SchemaRegistry helps with sharing the write Avro schema, which is used to encode a message, with the consumers that need the write schema to decode the received message.
Another important feature is assisting with schema evolution.
Let's say a producer P defines a write Avro schema v1 that is stored under the logical schema S, a consumer C1 defines a read (projection) schema v1,
and another consumer C2 defines its own read (projection) schema. The read schemas are not shared, as they are used locally by Avro to translate messages from the writer schema into the reader schema.
Imagine a schema evolution without any breaking changes:
The consumer C1 requests a new property by adding a new optional field to its read schema. This is a backward-compatible change.
Messages encoded without this field will still be translated into the read schema.
Now we've got v2 of C1's read schema.
The producer P satisfies consumer C1's need by adding the new field to its write schema. The field can be required or optional; either way, adding a field to the write schema is a forward-compatible change.
The consumer C1 will access the data encoded in the newly added field. The consumer C2 will simply ignore it, as it is a tolerant reader.
Now we've got v2 of P's write schema.
Consumers need to know the exact schema with which the messages were written, so the new version is stored under the logical schema S.
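To make the walkthrough concrete, here is a sketch of what the two v2 schemas could look like (the record and field names are made up for illustration):

Write schema v2 (producer P, stored under subject S):
{"type": "record", "name": "Event", "fields": [
  {"name": "id", "type": "string"},
  {"name": "note", "type": "string"}
]}

Read schema v2 (consumer C1, kept locally):
{"type": "record", "name": "Event", "fields": [
  {"name": "id", "type": "string"},
  {"name": "note", "type": ["null", "string"], "default": null}
]}

The reader declares a default for the new field, so v1 messages that lack it still resolve; the writer can add the field without a default because, for the write schema, adding a field only needs to be forward-compatible.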
Now imagine some breaking schema changes:
The producer P decides to delete a non-optional field. One of the consumers might use this field. This is not a forward-compatible change.
Assuming the subject S is configured with FORWARD_TRANSITIVE compatibility type, the attempt to store the new write schema will fail. We are safe.
The consumer C2 requests a new property by adding a new required field (one without a default) to its read schema. Since the field is not written by the producer, this is not a backward-compatible change.
The question is: how can the SchemaRegistry come in handy to prevent breaking changes on the consumer side?
Note that the compatibility check of the read schema has to be done against all versions of the write schema.
There is an endpoint that allows checking the compatibility against the versions in the subject.
The issue is that it uses the compatibility type that is set on the subject.
The subject that contains the versions of the write schema cannot be used, because it is configured with the FORWARD_TRANSITIVE compatibility type, while the read schema has to be backward compatible.
Creating another subject with the compatibility type BACKWARD_TRANSITIVE will not work, because a new version of the write schema with a forward-compatible change (e.g. adding a non-optional field) would fail to be stored in this subject.
One option that came to mind is to write some unit tests using the CompatibilityChecker. It's an ugly solution, because each consumer must hold all versions of the write schema locally.
It's going to be a pain to keep all the consumers in sync when the producer's schema changes.
Schema Registry lets us keep track of schemas that are currently in use, both by producers and consumers.
Creating another subject with the compatibility type BACKWARD_TRANSITIVE will not work, because a new version of the write schema with a forward-compatible change (e.g. adding a non-optional field) would fail to be stored in this subject.
You were very close. Indeed, adding a non-optional field to the write schema is forward-compatible but not backward-compatible, because data may already have been produced without a value for this field. But we don't apply the same changes to both the write and read schemas. That only works when the change is both forward and backward compatible (aka full compatibility), e.g. adding or removing optional fields. In our case, we'd have to add the new field as optional to the read schema.
You can push the write schema to this new subject initially, but from that point on it is a separate read schema, and it would have to evolve separately from the write schema.
You can apply whatever approach you're currently using for checking the write schema changes. For example, make each consumer push the schema it's about to use to a subject with a BACKWARD_TRANSITIVE compatibility type before being allowed to use it.
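A minimal sketch of such a gate using the Confluent schema-registry Java client; the registry URL, the consumer-owned subject name (events-c1-read), and the candidate read schema are placeholders, and the subject is assumed to already be configured as BACKWARD_TRANSITIVE:

import io.confluent.kafka.schemaregistry.avro.AvroSchema;
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
import io.confluent.kafka.schemaregistry.client.SchemaRegistryClient;

public class ReadSchemaGate {

    public static void main(String[] args) throws Exception {
        SchemaRegistryClient client =
                new CachedSchemaRegistryClient("http://localhost:8081", 100);

        // Consumer-owned subject that only holds C1's read schema versions.
        String readSubject = "events-c1-read";

        AvroSchema candidateReadSchema = new AvroSchema(
                "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
              + "{\"name\":\"id\",\"type\":\"string\"},"
              + "{\"name\":\"note\",\"type\":[\"null\",\"string\"],\"default\":null}]}");

        // Because the subject is BACKWARD_TRANSITIVE, this checks the candidate
        // against every previously registered read schema version.
        boolean compatible = client.testCompatibility(readSubject, candidateReadSchema);
        if (!compatible) {
            throw new IllegalStateException("New read schema would break existing versions");
        }

        // Gate passed: register the accepted read schema so future changes are checked against it too.
        client.register(readSubject, candidateReadSchema);
    }
}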
There's also Schema Registry Maven Plugin for use in a CI/CD environment.
An alternative would be to use a single subject with FULL_TRANSITIVE compatibility.

Apache Nifi: Is there a way to publish messages to kafka with a message key as combination of multiple attributes?

I have a requirement where I need to read a CSV and publish to a Kafka topic in Avro format. During the publish, I need to set the message key as the combination of two attributes. Let's say I have an attribute called id and an attribute called group. I need my message key to be id+"-"+group. Is there a way I can achieve this in an Apache NiFi flow? Setting the message key to a single attribute works fine for me.
Yes, in the PublishKafka_2_0 (or whatever version you're using), set the Kafka Key property to construct your message key using NiFi Expression Language. For your example, the expression ${id}-${group} will form it (e.g. id=myId & group=MyGroup -> myId-myGroup).
If you don't populate this property explicitly, the processor looks for the attribute kafka.key, so if you had previously set that value, it would be passed through.
Additional information after comment 2020-06-15 16:49
Ah, so the PublishKafkaRecord will publish multiple messages to Kafka, each correlating with a record in the single NiFi flowfile. In this case, the property is asking for a field (a record term meaning some element of the record schema) to use to populate that message key. I would suggest using UpdateRecord before this processor to add a field called messageKey (or whatever you like) to each record using Expression Language, then reference this field in the publishing processor property.
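For example, a sketch of the relevant processor properties, assuming id and group are fields of each record and the added field is called messageKey (both names are placeholders); if id and group are flowfile attributes instead, the Literal Value strategy with ${id}-${group} would do:

UpdateRecord:
  Replacement Value Strategy: Record Path Value
  /messageKey: concat(/id, '-', /group)

PublishKafkaRecord_2_0:
  Message Key Field: messageKey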
Notice the (?) on each property, which indicates what is or isn't allowed.
When a field doesn't accept Expression Language, use an UpdateAttribute processor to set the combined value you need, then use the combined value downstream.
Thank you for your inputs. I had to change my initial design of producing with a key combination to actually partitioning the file based on a specific field using the PartitionRecord processor. I have a date field in my CSV file, and there can be multiple records per date. I partition based on this date field and produce to the Kafka topics using the id field as the key per partition. The Kafka topic name is dynamic and is suffixed with the date value. Since I plan to use Kafka Streams to read data from these topics, this is a much better design than the initial one.