Customize Debezium pubsub message - debezium

I am trying to use Debezium Server to stream "some" changes in a PostgreSQL table. The table being tracked has a JSON-type column named "payload", and I would like the message streamed to Pub/Sub by Debezium to contain only the contents of that payload column. Is that possible?
I've explored the custom transformations provided by Debezium, but from what I can tell they only allow me to enrich the published message with extra fields, not to publish only certain fields, which is what I want to do.
Edit:
The closest I got to what I wanted was to use the outbox transform but that published the following message:
{
  "schema": {
    ...
  },
  "payload": {
    "key": "value"
  }
}
Whereas what I would like the message to be is:
{"key":"value"}
I've tried adding an ExtractNewRecordState transform, but I still got the same results. My application.properties file looks like:
debezium.transforms=outbox,unwrap
debezium.transforms.outbox.type=io.debezium.transforms.outbox.EventRouter
debezium.transforms.outbox.table.field.event.key=grouping_key
debezium.transforms.outbox.table.field.event.payload.id=id
debezium.transforms.outbox.route.by.field=target
debezium.transforms.outbox.table.expand.json.payload=true
debezium.transforms.unwrap.type=io.debezium.transforms.ExtractNewRecordState
debezium.transforms.unwrap.add.fields=payload
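One thing I still have to verify is whether Debezium Server's JSON format options could drop the schema part of the envelope altogether, something along these lines (untested on my side):
debezium.format.value=json
# passed through to the JSON converter; should remove the "schema" block from the envelope
debezium.format.value.schemas.enable=false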
Many thanks,
Daniel

Related

How to rename the id header of a debezium mongodb connector outbox message

I am trying to use the outbox event router of Debezium for MongoDB. The consumer is a Spring Cloud Stream application. I cannot deserialize the message because Spring Cloud expects the message id header to be a UUID, but it receives byte[]. I have tried different deserializers to no avail. I am thinking of renaming the id header in order to skip this Spring Cloud check, or removing it altogether. I have tried the ReplaceField SMT, but it does not seem to modify the header fields.
Also, is there a way to overcome this in Spring?
The solution to the initial question is to use the DropHeaders SMT (https://docs.confluent.io/platform/current/connect/transforms/dropheaders.html).
This will remove the id header that is populated by Debezium.
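A minimal sketch of the connector configuration, assuming the transform is added alongside whatever transforms are already defined (the alias "dropIdHeader" is just illustrative):
transforms=dropIdHeader
transforms.dropIdHeader.type=org.apache.kafka.connect.transforms.DropHeaders
# name of the header(s) to remove
transforms.dropIdHeader.headers=id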
But as Oleg Zhurakousky mentioned, moving to a newer version of spring-cloud-stream without @StreamListener solves the underlying problem.
Apparently @StreamListener checks whether a message has an id header and demands it to be of type UUID. With the new functional way of working with spring-cloud-stream, the id header is actually overwritten with a newly generated value. This means that the value populated by Debezium (the id column from the outbox table) is ignored. I guess if you need to check for duplicate delivery, it may be better to create your own header instead of using the id. I do not know whether spring-cloud-stream generates the same id for the same message if it is redelivered.
Also keep in mind that even in the newer versions of spring-cloud-stream, if you use the deprecated @StreamListener, you will have the same problem.
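For reference, a minimal sketch of the functional style in Java (class, bean, and binding names here are illustrative, not taken from the original setup):
import java.util.function.Consumer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.messaging.Message;

@Configuration
public class OutboxConsumerConfig {

    // Bound to the "handleOutboxEvent-in-0" binding by Spring Cloud Stream's naming convention
    @Bean
    public Consumer<Message<String>> handleOutboxEvent() {
        return message -> {
            // the id header seen here is the one regenerated by Spring, not the outbox id column
            System.out.println(message.getHeaders().getId());
            System.out.println(message.getPayload());
        };
    }
}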

How to make the Kafka Connect BigQuery Sink Connector create one table per event type and not per topic?

I'm using confluentinc/kafka-connect-bigquery on our Kafka (Avro) events. On some topics, we have more than one event type, e.g., UserRegistered and UserDeleted are on the topic domain.user.
The subjects in our Schema Registry look as follows.
curl --silent -X GET http://avro-schema-registry.core-kafka.svc.cluster.local:8081/subjects | jq .
[...]
"domain.user-com.acme.message_schema.domain.user.UserDeleted",
"domain.user-com.acme.message_schema.domain.user.UserRegistered",
"domain.user-com.acme.message_schema.type.domain.key.DefaultKey",
[...]
My properties/connector.properties (I'm using the quickstart folder) looks as follows:
[...]
topics.regex=domain.*
sanitizeTopics=true
autoCreateTables=true
[...]
In BigQuery a table called domain_user is created. However, I would like to have two tables, e.g., domain_user_userregistered and domain_user_userdeleted or similar, because the schemas of these two event types are quite different. How can I achieve this?
I think you can use the SchemaNameToTopic Single Message Transform to do this. It sets the topic name to the schema name, and that then propagates through to the name given to the created BigQuery table.
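For example, the wiring could look roughly like this; the class name below is an assumption based on the community kafka-connect-transform-common plugin, so check it against the transform you actually install:
transforms=schemaNameToTopic
# assumed class name from the kafka-connect-transform-common plugin; verify before use
transforms.schemaNameToTopic.type=com.github.jcustenborder.kafka.connect.transform.common.SchemaNameToTopic$Value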

How can I deserialize geometry fields from Kafka messages streamed via Debezium Connect?

I have a PostGIS + Debezium/Kafka + Debezium/Connect setup that is streaming changes from one database to another. I have been watching the messages via Kowl and everything is moving accordingly.
My problem arises when I'm reading the message from my Kafka topic, the geometry (wkb) column in particular.
This is my Kafka message:
{
  "schema": {
    "type": "struct",
    "fields": [...],
    "optional": false,
    "name": "ecotx_geometry_kafka.ecotx_geometry_impo..."
  },
  "payload": {
    "before": null,
    "after": {
      "id": "d6ad5eb9-d1cb-4f91-949c-7cfb59fb07e2",
      "type": "MultiPolygon",
      "layer_id": "244458fa-e6e0-4c6c-a7e1-5bf0afce2fb8",
      "geometry": {
        "wkb": "AQYAACBqCAAAAQAAAAEDAAAAAQAAAAUAAABwQfUo...",
        "srid": 2154
      },
      "custom_style": null,
      "style_id": "default_layer_style"
    },
    "source": {...},
    "op": "c",
    "ts_ms": 1618854994546,
    "transaction": null
  }
}
As can be seen, the WKB information is something like "AQAAAAA...", despite the information inserted in my database being "01060000208A7A000000000000" or "LINESTRING(0 0,1 0)".
I don't know how to parse/transform it into a byte array or a Geometry in my consumer app (Kotlin/Java) for further use in GeoTools.
I don't know if I'm missing an import that is able to translate this information.
I have found only a few questions from people posting their JSON messages, and in every message that has a geometry field (streamed with Debezium) the value got changed to something like "AAAQQQAAAA".
Having said that, how can I parse/decode/translate it into something that can be used by GeoTools?
Thanks.
#UPDATE
Additional info:
After an insert, when I analyze my slot changes (querying the database using the pg_logical_slot_get_changes function), I'm able to see my changes in WKB:
{"change":[{"kind":"insert","schema":"ecotx_geometry_import","table":"geometry_data","columnnames":["id","type","layer_id","geometry","custom_style","style_id"],"columntypes":["uuid","character varying(255)","uuid","geometry","character varying","character varying"],"columnvalues":["469f5aed-a2ea-48ca-b7d2-fe6e54b27053","MultiPolygon","244458fa-e6e0-4c6c-a7e1-5bf0afce2fb8","01060000206A08000001000000010300000001000000050000007041F528CB332C413B509BE9710A594134371E05CC332C4111F40B87720A594147E56566CD332C4198DF5D7F720A594185EF3C8ACC332C41C03BEDE1710A59417041F528CB332C413B509BE9710A5941",null,"default_layer_style"]}]}
This would be useful in the consumer app. The issue definitely lies in the Kafka message content itself; I'm just not sure what is transforming this value, Kafka or Debezium/Connect.
I think it is just a different way of representing binary columns in PostGIS and in JSON. WKB is a binary format, meaning it has bytes with arbitrary values, many of which have no corresponding printable characters. PostGIS prints it out using HEX encoding, thus it looks like '01060000208A7A...' - hex digits - but internally it is just bytes. Kafka's JSON uses BASE64 encoding instead for exactly the same binary message.
Let's test with a prefix of your string:
select to_base64(from_hex('01060000206A080000010000000103000000010000000500'))
AQYAACBqCAAAAQAAAAEDAAAAAQAAAAUA
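So on the consumer side the Base64 string just needs to be decoded back to bytes before being handed to a WKB parser. A minimal Java sketch, assuming the JTS WKBReader that GeoTools builds on (the class and method names of your own app are of course up to you):
import java.util.Base64;
import org.locationtech.jts.geom.Geometry;
import org.locationtech.jts.io.ParseException;
import org.locationtech.jts.io.WKBReader;

public class WkbDecoder {

    // Decodes the Base64-encoded "wkb" value from the Debezium message into a JTS Geometry
    public static Geometry decode(String wkbBase64) throws ParseException {
        byte[] wkbBytes = Base64.getDecoder().decode(wkbBase64);
        return new WKBReader().read(wkbBytes);
    }
}
The resulting Geometry can then be passed to GeoTools as usual.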

Possible option for PrimaryKey in Table creation with KSQL?

I've started working with KSQL and am quite loving the experience. I'm trying to work with Table and Stream joins, and the scenario is as below.
I have a sample data set like this:
"0117440512","0134217727","US","United States","VIRGINIA","Vienna","DoD Network Information Center"
"0134217728","0150994943","US","United States","MASSACHUSETTS","Woburn","Genuity"
in my Kafka topic-1. It is a static data set loaded into a Table and might get updated once a month or so.
I have one more data set like:
{"state":"AD","id":"020","city":"Andorra","port":"02","region":"Canillo"}
{"state":"GD","id":"024","city":"Arab","port":"29","region":"Ordino"}
in Kafka topic-2. It is a stream of data being loaded into a Stream.
Since a Table can't be created without specifying the key, and my data doesn't have a unique column to use as one, what exactly should my key be while loading data from topic-1 into the Table? Remember that my Table might get populated/updated once a month or so with the same data, plus new data. With new data being loaded, I could use the key to replace existing records.
I tried to find whether there is something like an auto-incrementing value, like what we call a PrimaryKey in SQL, but didn't find any.
Can someone help me correct my approach to the implementation, or suggest a query to create a PrimaryKey if one exists? Thanks.
No, KSQL doesn't have the concept of a self-incrementing key. You have to define the key when you produce the data into the topic on which the KSQL Table is defined.
--- EDIT
If you want to set the key on a message as it's ingested through Kafka Connect, you can use a Single Message Transform (SMT).
"transforms":"createKey,extractInt",
"transforms.createKey.type":"org.apache.kafka.connect.transforms.ValueToKey",
"transforms.createKey.fields":"id",
"transforms.extractInt.type":"org.apache.kafka.connect.transforms.ExtractField$Key",
"transforms.extractInt.field":"id"
See here for more details.
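Once the SMT has set the record key, the table can be declared over the topic. A rough sketch in the older KSQL syntax (table, topic, and column names are made up for illustration; newer ksqlDB versions declare a PRIMARY KEY column instead of using the KEY property):
-- illustrative names only; adjust to your topic and schema
CREATE TABLE my_table (id VARCHAR, col1 VARCHAR, col2 VARCHAR)
  WITH (KAFKA_TOPIC='my-topic', VALUE_FORMAT='JSON', KEY='id');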

What does Template:${message.encodedData} mean in mirth?

I am trying to learn a Mirth system with a channel that is pulling from a database for its source and outputting HL7 messages for its destination(s). The SQL query pulls the correct data from the source, but Mirth does not output all of the data in the right spots in the HL7 message. The destinations show that they are outputting Template:${message.encodedData}. What does that mean? Where can I see the template it is using? The destinations don't have any filters or transformers, so I am confused.
message.encodedData is the fully transformed message - after any transformation steps.
The transformer is also where you can specify the output template for how you want the data to look. Simply load up a sample template message in the output template of the transformer (the message template tab in the transformer) and then create a series of Message Builder steps. Your output message will be in the variable tmp, and your SQL results will be in the variable msg.
So, if your first column is patientID (Select patientiD as patientID ...), you would create a Message Builder step along the lines of
mapped segment: tmp['PID']['PID.3']['PID.3.2']
mapping: msg['patientID'];
I don't have exact syntax in front of me right now, but that's the basic idea.
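For reference, the JavaScript such a Message Builder step generates is roughly equivalent to the following (segment and column names follow the example above; treat it as a sketch rather than exact syntax):
// destination transformer step: copy the patientID column from the SQL result (msg)
// into PID-3.2 of the outbound HL7 template (tmp)
tmp['PID']['PID.3']['PID.3.2'] = msg['patientID'].toString();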
I think "transformed" is the status of the message right after the transformers are executed and "encoded" message is the status after the message that comes from the transformers is encoded into the specified channel outbound datatype. In some cases those messages will be the same but not in all the cases.
Also, is very difficult to find updated and comprehensive Mirth documentation.