How can I deserialize geometry fields from Kafka messages streamed via Debezium Connect? - apache-kafka

I have a PostGIS + Debezium/Kafka + Debezium/Connect setup that is streaming changes from one database to another. I have been watching the messages via Kowl and everything is flowing as expected.
My problem arises when I'm reading the message from my Kafka topic, the geometry (wkb) column in particular.
This is my Kafka message:
{
  "schema": {
    "type": "struct",
    "fields": [...],
    "optional": false,
    "name": "ecotx_geometry_kafka.ecotx_geometry_impo..."
  },
  "payload": {
    "before": null,
    "after": {
      "id": "d6ad5eb9-d1cb-4f91-949c-7cfb59fb07e2",
      "type": "MultiPolygon",
      "layer_id": "244458fa-e6e0-4c6c-a7e1-5bf0afce2fb8",
      "geometry": {
        "wkb": "AQYAACBqCAAAAQAAAAEDAAAAAQAAAAUAAABwQfUo...",
        "srid": 2154
      },
      "custom_style": null,
      "style_id": "default_layer_style"
    },
    "source": {...},
    "op": "c",
    "ts_ms": 1618854994546,
    "transaction": null
  }
}
As can be seen, the WKB information is something like "AQAAAAA...", despite the information inserted in my database being "01060000208A7A000000000000" or "LINESTRING(0 0,1 0)".
I don't know how to parse/transform it into a ByteArray or a Geometry in my consumer app (Kotlin/Java) for further use with GeoTools.
I don't know if I'm missing an import that is able to translate this information.
I have found only a few questions from people posting their JSON messages, and in every message that has a geom field (streamed with Debezium), the value got changed to this "AAAQQQAAAA" form.
Having said that, how can I parse/decode/translate it into something that can be used by GeoTools?
Thanks.
#UPDATE
Additional info:
After an insert, when I analyze my slot changes (querying the database using pg_logical_slot_get_changes function), I'm able to see my changes in WKB:
{"change":[{"kind":"insert","schema":"ecotx_geometry_import","table":"geometry_data","columnnames":["id","type","layer_id","geometry","custom_style","style_id"],"columntypes":["uuid","character varying(255)","uuid","geometry","character varying","character varying"],"columnvalues":["469f5aed-a2ea-48ca-b7d2-fe6e54b27053","MultiPolygon","244458fa-e6e0-4c6c-a7e1-5bf0afce2fb8","01060000206A08000001000000010300000001000000050000007041F528CB332C413B509BE9710A594134371E05CC332C4111F40B87720A594147E56566CD332C4198DF5D7F720A594185EF3C8ACC332C41C03BEDE1710A59417041F528CB332C413B509BE9710A5941",null,"default_layer_style"]}]}
This would be useful in the consumer app. The issue definitely lies in the Kafka message content itself; I'm just not sure what is transforming this value, Kafka or Debezium/Connect.

I think it is just a different way of representing binary columns in PostGIS and in JSON. WKB is a binary field, meaning it contains bytes with arbitrary values, many of which have no corresponding printable characters. PostGIS prints it using hex encoding, so it looks like '01060000208A7A...' (hex digits), but internally it is just bytes. Kafka's JSON converter uses Base64 encoding instead for exactly the same binary value.
Let's test with a prefix of your string (using PostgreSQL's decode/encode to go from hex to Base64):
select encode(decode('01060000206A080000010000000103000000010000000500', 'hex'), 'base64');
AQYAACBqCAAAAQAAAAEDAAAAAQAAAAUA
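On the consumer side (Kotlin/Java), the reverse direction is just a Base64 decode followed by a WKB parse. A minimal sketch in Java, assuming the wkb field arrives as the Base64 string shown in the message above, and using the JTS WKBReader that GeoTools builds on (JTS understands the PostGIS EWKB variant with the embedded SRID):

import java.util.Base64;
import org.locationtech.jts.geom.Geometry;
import org.locationtech.jts.io.ParseException;
import org.locationtech.jts.io.WKBReader;

public class WkbDecoder {
    // Turn the Base64 string from the Kafka message back into a JTS Geometry
    public static Geometry decode(String base64Wkb) throws ParseException {
        byte[] wkb = Base64.getDecoder().decode(base64Wkb); // undo the JSON converter's Base64 encoding
        return new WKBReader().read(wkb);                   // parse the (E)WKB bytes, SRID included
    }
}

// usage: Geometry geom = WkbDecoder.decode(wkbStringFromMessage);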

Related

Customize Debezium pubsub message

I am trying to use debezium server to stream "some" changes in a postgresql table. Namely this table being tracked has a json type column named "payload". I would like the message streamed to pubsub by debezium to contain only the contents of the payload column. Is that possible?
I've explored the custom transformations provided by Debezium, but from what I could tell they only allow me to enrich the published message with extra fields, not to publish only certain fields, which is what I want to do.
Edit:
The closest I got to what I wanted was to use the outbox transform but that published the following message:
{
  "schema": {
    ...
  },
  "payload": {
    "key": "value"
  }
}
Whereas what I would like the message to be is:
{"key":"value"}
I've tried adding an ExtractNewRecordState transform but still got the same results. My application.properties file looks like:
debezium.transforms=outbox,unwrap
debezium.transforms.outbox.type=io.debezium.transforms.outbox.EventRouter
debezium.transforms.outbox.table.field.event.key=grouping_key
debezium.transforms.outbox.table.field.event.payload.id=id
debezium.transforms.outbox.route.by.field=target
debezium.transforms.outbox.table.expand.json.payload=true
debezium.transforms.unwrap.type=io.debezium.transforms.ExtractNewRecordState
debezium.transforms.unwrap.add.fields=payload
Many thanks,
Daniel
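
One thing worth checking for the envelope part of the problem (a hedged sketch, not a verified fix): if the only remaining issue is the {"schema": ..., "payload": ...} wrapper rather than field selection, Debezium Server's serialization format settings may help, since format properties are forwarded to the JSON converter. Property names below are taken from the Debezium Server format options and should be verified against the version in use:

debezium.format.value=json
# schemas.enable is forwarded to the JSON converter; disabling it should drop the schema envelope
debezium.format.value.schemas.enable=false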

How to deserialize Avro schema and then abandon schema before write to ES Sink Connector using SMT

Use Case and Description
My use case is described more here, but the gist of the issue is:
I am making a custom SMT and want to make sure the Elasticsearch sink connector deserializes incoming records properly, but then after that I don't need any sort of schema at all. Each record has a dynamic amount of fields set, so I don't want to have any makeUpdatedSchema step (e.g., this code) at all. This both keeps code more simple and I would assume improves performance since I don't have to recreate schemas for each record.
What I tried
I tried doing something like the applySchemaless code as shown here even when the record has a schema by returning something like this, with null for schema:
return newRecord(record, null, updatedValue);
However, at runtime it errors out, saying I have an incompatible schema.
Key Question
I might be misunderstanding the role of the schema at this point in the process (is it needed at all once we're in the Elasticsearch sink connector?) or how it works, and if so that would be helpful to know as well. But is there some way to write a custom SMT like this?
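For reference, a minimal sketch of the schemaless variant described above, in plain Kafka Connect terms. The DropSchema class name is made up for illustration, and whether the Elasticsearch sink connector accepts a null value schema depends on its converter configuration (schemas.enable and friends), which is the likely source of the incompatibility error:

import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.ConnectRecord;
import org.apache.kafka.connect.data.Field;
import org.apache.kafka.connect.data.Struct;
import org.apache.kafka.connect.transforms.Transformation;

public class DropSchema<R extends ConnectRecord<R>> implements Transformation<R> {

    @Override
    public R apply(R record) {
        Object value = record.value();
        if (!(value instanceof Struct)) {
            return record; // already schemaless or a tombstone -- pass through
        }
        Struct struct = (Struct) value;
        Map<String, Object> updatedValue = new HashMap<>();
        for (Field field : struct.schema().fields()) {
            updatedValue.put(field.name(), struct.get(field)); // note: nested Structs are copied as-is
        }
        // null value schema => the record is emitted schemaless
        return record.newRecord(record.topic(), record.kafkaPartition(),
                record.keySchema(), record.key(),
                null, updatedValue,
                record.timestamp());
    }

    @Override
    public ConfigDef config() {
        return new ConfigDef();
    }

    @Override
    public void configure(Map<String, ?> configs) {
    }

    @Override
    public void close() {
    }
}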

Get BLOB out of Avro FlowFile

I'm retrieving some binary files (in this case, some PDFs) from a database using ExecuteSQL, which returns the result in an Avro FlowFile. I can't figure out how to get the binary result out of the Avro records.
I've tried using ConvertAvroToJSON, which gives me an object like:
{"MYBLOB": {"bytes": "%PDF-1.4\n [...] " }}
However, using EvaluateJSONPath and grabbing $.MYBLOB.bytes causes corruption, because the binary bytes get converted to UTF-8.
None of the record writer options with ConvertRecord seem appropriate for binary data.
The best solution I can think of is to base64 encode the binary before it leaves the database, then I'm dealing with only character data and can decode it in NiFi. But that's extra steps and I'd prefer not to do that.
You may need a scripted solution in this case (as a workaround) to get the field and decode it using your own encoding. In any case, please feel free to file a Jira case; ConvertAvroToJSON is deprecated, but we should support character sets for the JsonRecordSetWriter in ExecuteSQLRecord/ConvertRecord (if that also doesn't work for you).
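Outside NiFi, the decoding step itself is straightforward, which may help when prototyping the scripted approach. A sketch in Java, assuming the column is named MYBLOB as in the JSON above and that the FlowFile content is a standard Avro data file as produced by ExecuteSQL:

import java.io.File;
import java.io.FileOutputStream;
import java.nio.ByteBuffer;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class ExtractBlob {
    public static void main(String[] args) throws Exception {
        // args[0]: Avro file written by ExecuteSQL, args[1]: destination for the raw PDF bytes
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(new File(args[0]), new GenericDatumReader<GenericRecord>())) {
            if (reader.hasNext()) {
                GenericRecord record = reader.next();
                // Avro "bytes" fields come back as a ByteBuffer, so no character-set
                // conversion happens -- this is exactly what the JSON detour loses
                ByteBuffer blob = (ByteBuffer) record.get("MYBLOB");
                byte[] bytes = new byte[blob.remaining()];
                blob.get(bytes);
                try (FileOutputStream out = new FileOutputStream(args[1])) {
                    out.write(bytes);
                }
            }
        }
    }
}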

Parsing Kafka messages

My question will be short and clear. I would like to parse JSON data coming from a Kafka topic, so my application will run as a Kafka consumer. I am only interested in some part of the JSON data. Do I need to process this data using a library such as Apache Flink? After that I will send the data somewhere else.
In the beginning you say "filter data", so it looks like you need a RecordFilterStrategy injected into the AbstractKafkaListenerContainerFactory. See the documentation on this matter: https://docs.spring.io/spring-kafka/docs/current/reference/html/#filtering-messages
Then you say you are "interested in some part of the JSON". That doesn't sound like record filtering, but more like data projection. For that you can use a ProjectingMessageConverter to slice the data with some ProjectionFactory. See their JavaDocs for more info.
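As a concrete illustration of the filtering option (a sketch only; the types and the predicate on the raw JSON string are assumptions, not part of the original question):

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;

@Configuration
public class KafkaFilterConfig {

    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
            ConsumerFactory<String, String> consumerFactory) {
        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory);
        // RecordFilterStrategy: returning true discards the record before the @KafkaListener sees it
        factory.setRecordFilterStrategy(
                (ConsumerRecord<String, String> record) -> !record.value().contains("\"interestingField\""));
        return factory;
    }
}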

How to find out Avro schema from binary data that comes in via Spark Streaming?

I set up a Spark-Streaming pipeline that gets measuring data via Kafka. This data was serialized using Avro. The data can be of two types - EquidistantData and DiscreteData. I created these using an avdl file and the sbt-avrohugger plugin. I use the variant that generates Scala case classes that inherit from SpecificRecord.
In my receiving application, I can get the two schemas by querying EquidistantData.SCHEMA$ and DiscreteData.SCHEMA$.
Now, my Kafka stream gives me RDDs whose value class is Array[Byte]. So far so good.
How can I find out from the byte array which schema was used when serializing it, i.e., whether to use EquidistantData.SCHEMA$ or DiscreteData.SCHEMA$?
I thought of sending appropriate info in the message key. Currently, I don't use the message key. Would this be a feasible way, or can I get the schema somehow from the serialized byte array I received?
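The key-based idea could look roughly like this (Java sketch; it assumes the producer writes a small type tag such as "equidistant" or "discrete" into the record key, and that the two writer schemas are the SCHEMA$ values mentioned above):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;

public class KeyTaggedDecoder {

    // Pick the writer schema from the tag carried in the message key, then decode the value bytes
    public static GenericRecord decode(String keyTag, byte[] value,
                                       Schema equidistantSchema, Schema discreteSchema) throws Exception {
        Schema writerSchema = "equidistant".equals(keyTag) ? equidistantSchema : discreteSchema;
        GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(writerSchema);
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(value, null);
        return reader.read(null, decoder);
    }
}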
Followup:
Another possibility would be to use separate topics for discrete and equidistant data. Would this be feasible?