JDBC sink topic with multiple structs to postgres - postgresql

I am trying to sink a few topics top a postgres database. However the topic schema defines a array at the top level and within it multiple structs. Automapping does not work and I cannot find any reference how to handle this. I need all structs because they are dependent types, the second struct references the first struct as a field.
Currently it breaks when hitting the 2nd struct stating statusChangeEvent (struct) has no mapping to sql column type. This because it is using auto.create to make a table (probably called ProcessStatus) then when hitting the second entry there is no column of course.
[
{
"type": "record",
"name": "processStatus",
"namespace": "company.some.process",
"fields": [
{
"name": "code",
"doc": "The code of the processStatus",
"type": "string"
},
{
"name": "name",
"doc": "The name of the processStatus",
"type": "string"
},
{
"name": "description",
"type": "string"
},
{
"name": "isCompleted",
"type": "boolean"
},
{
"name": "isSuccessfullyCompleted",
"type": "boolean"
}
]
},
{
"type": "record",
"name": "StatusChangeEvent",
"namespace": "company.some.process",
"fields": [
{
"name": "contNumber",
"type": "string"
},
{
"name": "processId",
"type": "string"
},
{
"name": "processVersion",
"type": "int"
},
{
"name": "extProcessId",
"type": [
"null",
"string"
],
"default": null
},
{
"name": "fromStatus",
"type": "process.status"
},
{
"name": "toStatus",
"doc": "The new status of the process",
"type": "company.some.process.processStatus"
},
{
"name": "changeDateTime",
"type": "long",
"logicalType": "timestamp-millis"
},
{
"name": "isPublic",
"type": "boolean"
}
]
}
]
I am not using ksql atm. Which connector settings are suited for this task? If there is a ksql alternative it would be nice to know but the current requirement is to use the JDBC connector.
I tried using flatten but it does not support struct fields that have a schema. Which seems kind of weird. Aren't schema's the whole selling point of connect with kafka? Or is it more of a constraint you have to work around?

Aren't schema's the whole selling point of connect with kafka?
Yes, but Postgres (or the JDBC Sink, in general) doesn't really support nested objects within columns. For that, you're better off with a document database, such as using Mongo Sink Connector.
Which connector settings are suited for this task?
None, really, other than transforms. You could write your own if flatten doesn't work.
You could try pre-defining your table to use JSONB for the two status columns, however, that's more of a workaround.

Related

What column type do I need for this nested data in BigQuery?

I have a JSON schema for a Kafka stream that I am integrating with BigQuery but I can't get the data type correct at the BigQuery end. This is the schema:
"my_meta_data": {
"type": "object",
"properties": {
"property_1": {
"type": "array",
"items": {
"type": "number"
}
},
"property_2": {
"type": "array",
"items": {
"type": "number"
}
},
"property_3": {
"type": "array",
"items": {
"type": "number"
}
}
}
}
I tried this in the JSON file defining the BigQuery table:
{
"name": "my_meta_data",
"type": "RECORD",
"mode": "REPEATED",
"fields": [
{
"name": "property_1",
"type": "INT64",
"mode": "REPEATED"
},
{
"name": "property_2",
"type": "INT64",
"mode": "REPEATED"
},
{
"name": "property_3",
"type": "INT64",
"mode": "REPEATED"
}
]
}
I am using a hosted connector from Confluent, the Kafka provider, and the error message is
The connector is failing because it cannot write a non-array element to an array column. Please check the schemas of the data in Kafka and the BigQuery tables the connector is writing to, and ensure that all data from Kafka that will be written to an array column in BigQuery is contained in an array.
I haven't defined an array column though, I've defined a RECORD column that contains arrays. Any ideas how I can set up the BigQuery table to capture this data? Thanks in advance.

Default value for a record in AVRO?

I wanted to add a new field into an AVRO schema of type "record" that cannot be null and therefore has a default value. The topic is set to compatibility type "Full_Transitive".
The schema did not change from the last version, only the last field produktType was added:
{
"type": "record",
"name": "Finished",
"namespace": "com.domain.finishing",
"doc": "Schema to indicate the end of the ongoing saga...",
"fields": [
{
"name": "numberOfAThing",
"type": [
"null",
{
"type": "string",
"avro.java.string": "String"
}
],
"default": null
},
{
"name": "previousNumbersOfThings",
"type": {
"type": "array",
"items": {
"type": "string",
"avro.java.string": "String"
}
},
"default": []
},
{
"name": "produktType",
"type": {
"type": "record",
"name": "ProduktType",
"fields": [
{
"name": "art",
"type": "int",
"default": 1
},
{
"name": "code",
"type": "int",
"default": 10003
}
]
},
"default": { "art": 1, "code": 10003 }
}
]
}
I've checked with the schema-registry that the new version of the schema is compatible.
But when we try to read old messages that do not contain that new field with the new schema (where the defaults are) there is a EOF Exception and it does not seem to work.
The part that causes headaches is the new added field "produktType". It cannot be null so we tried adding defaults. Which is possible for primitive type fields ("int" and so on). The line "default": { "art": 1, "code": 10003 } seems to be ok with the schema-registry but does not seem to have an effect when we read messages from the topic that do not contain this field.
The schema registry also marks it as not compatible when the last "default": { "art": 1, "code": 10003 } line is missing (but also "default": true works regarding schema compatibility...).
The AVRO specification for complex types contains an example for type "record" and default {"a": 1} so that is where we got that idea from. But since its not working something is still wrong.
There are similar questions like this one claiming records can only have null as a default or this un-answered one.
Is this supposed to work? And if so how can defaults for these "type": "record" fields be defined? Or is it still true that records can only have null as default?
Thanks!
Update on the compatibility cases:
Schema V1 (old one without the new field): can read v1 and v2 records.
Schema V2 (new field added): cannot read v1 records, can read v2 records
The case where a consumer using schema v2 encountering records using v1 is the surprising one - as I thought the defaults are for that purpose.
Even weirder: when I don't set the new field values at all. The v2 record does contain some values:
I have no idea where the value for code is from. The schema uses other numbers for its defaults:
So one of them seems to work, the other does not.

How to use Schema registry for Kafka Connect AVRO

I have started exploring Kafka and Kafka connect recently and did some initial set up .
But wanted to explore more on schema registry part .
My schema registry is started now what i should do .
I have a AVRO schema stored in avro_schema.avsc.
here is the schema
{
"name": "FSP-AUDIT-EVENT",
"type": "record",
"namespace": "com.acme.avro",
"fields": [
{
"name": "ID",
"type": "string"
},
{
"name": "VERSION",
"type": "int"
},
{
"name": "ACTION_TYPE",
"type": "string"
},
{
"name": "EVENT_TYPE",
"type": "string"
},
{
"name": "CLIENT_ID",
"type": "string"
},
{
"name": "DETAILS",
"type": "string"
},
{
"name": "OBJECT_TYPE",
"type": "string"
},
{
"name": "UTC_DATE_TIME",
"type": "long"
},
{
"name": "POINT_IN_TIME_PRECISION",
"type": "string"
},
{
"name": "TIME_ZONE",
"type": "string"
},
{
"name": "TIMELINE_PRECISION",
"type": "string"
},
{
"name": "AUDIT_EVENT_TO_UTC_DT",
"type": [
"string",
"null"
]
},
{
"name": "AUDIT_EVENT_TO_DATE_PITP",
"type": "string"
},
{
"name": "AUDIT_EVENT_TO_DATE_TZ",
"type": "string"
},
{
"name": "AUDIT_EVENT_TO_DATE_TP",
"type": "string"
},
{
"name": "GROUP_ID",
"type": "string"
},
{
"name": "OBJECT_DISPLAY_NAME",
"type": "string"
},
{
"name": "OBJECT_ID",
"type": [
"string",
"null"
]
},
{
"name": "USER_DISPLAY_NAME",
"type": [
"string",
"null"
]
},
{
"name": "USER_ID",
"type": "string"
},
{
"name": "PARENT_EVENT_ID",
"type": [
"string",
"null"
]
},
{
"name": "NOTES",
"type": [
"string",
"null"
]
},
{
"name": "SUMMARY",
"type": [
"string",
"null"
]
}
]
}
Is my schema is valid .I converted it online from JSON ?
where should i keep this schema file location i am not sure about .
Please guide me with the step to follow
.
I am sending records from Lambda function and from JDBC source both .
So basically how can i enforce AVRO schema and test ?
Do i have to change anything in avro-consumer properties file ?
Or is this correct way to register schema
./bin/kafka-avro-console-producer \
--broker-list b-3.**:9092,b-**:9092,b-**:9092 --topic AVRO-AUDIT_EVENT \
--property value.schema='{"type":"record","name":"myrecord","fields":[{"name":"f1","type":"string"}]}'
curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" --data '{"schema" : "{\"type\":\"struct\",\"fields\":[{\"type\":\"string\",\"optional\":false,\"field\":\"ID\"},{\"type\":\"string\",\"optional\":true,\"field\":\"VERSION\"},{\"type\":\"string\",\"optional\":true,\"field\":\"ACTION_TYPE\"},{\"type\":\"string\",\"optional\":true,\"field\":\"EVENT_TYPE\"},{\"type\":\"string\",\"optional\":true,\"field\":\"CLIENT_ID\"},{\"type\":\"string\",\"optional\":true,\"field\":\"DETAILS\"},{\"type\":\"string\",\"optional\":true,\"field\":\"OBJECT_TYPE\"},{\"type\":\"string\",\"optional\":true,\"field\":\"UTC_DATE_TIME\"},{\"type\":\"string\",\"optional\":true,\"field\":\"POINT_IN_TIME_PRECISION\"},{\"type\":\"string\",\"optional\":true,\"field\":\"TIME_ZONE\"},{\"type\":\"string\",\"optional\":true,\"field\":\"TIMELINE_PRECISION\"},{\"type\":\"string\",\"optional\":true,\"field\":\"GROUP_ID\"},{\"type\":\"string\",\"optional\":true,\"field\":\"OBJECT_DISPLAY_NAME\"},{\"type\":\"string\",\"optional\":true,\"field\":\"OBJECT_ID\"},{\"type\":\"string\",\"optional\":true,\"field\":\"USER_DISPLAY_NAME\"},{\"type\":\"string\",\"optional\":true,\"field\":\"USER_ID\"},{\"type\":\"string\",\"optional\":true,\"field\":\"PARENT_EVENT_ID\"},{\"type\":\"string\",\"optional\":true,\"field\":\"NOTES\"},{\"type\":\"string\",\"optional\":true,\"field\":\"SUMMARY\"},{\"type\":\"string\",\"optional\":true,\"field\":\"AUDIT_EVENT_TO_UTC_DT\"},{\"type\":\"string\",\"optional\":true,\"field\":\"AUDIT_EVENT_TO_DATE_PITP\"},{\"type\":\"string\",\"optional\":true,\"field\":\"AUDIT_EVENT_TO_DATE_TZ\"},{\"type\":\"string\",\"optional\":true,\"field\":\"AUDIT_EVENT_TO_DATE_TP\"}],\"optional\":false,\"name\":\"test\"}"}' http://localhost:8081/subjects/view/versions
what next i have to do
But when i try to see my schema i get only below
curl --silent -X GET http://localhost:8081/subjects/AVRO-AUDIT-EVENT/versions/latest
this is the result
{"subject":"AVRO-AUDIT-EVENT","version":1,"id":161,"schema":"{\"type\":\"string\",\"optional\":false}"}
Why i do not see my full registered schema
Also when i try to delete schema
i get below error
{"error_code":405,"message":"HTTP 405 Method Not Allowed"
i am not sure if my schema is registered correctly .
Please help me.
Thanks in Advance
is my schema valid
You can use the REST API of the Registry to try and submit it and see...
where should i keep this schema file location i am not sure about
It's not clear how you're sending messages...
If you actually wrote Kafka producer code, you store it within your code (as a string) or as a resource file.. If using Java, you can instead use the SchemaBuilder class to create the Schema object
You need to rewrite your producer to use Avro Schema and Serializers if you've not already
If we create AVRO schema will it work for Json as well .
Avro is a Binary format, but there is a JSONDecoder for it.
what should be the URL for our AVRO schema properties file ?
It needs to be the IP of your Schema Registry once you figure out how to start it. (with schema-registry-start)
Do i have to change anything in avro-consumer properties file ?
You need to use the Avro Deserializer
is this correct way to register schema
.> /bin/kafka-avro-console-producer \
Not quite. That's how you produce a message with a schema (and you need to use the correct schema). You also must provide --property schema.registry.url
You use the REST API of the Registry to register and verify schemas

Transforming Map field in avro schema to string using kafka SMT in jdbc-sink-connector?

I have an avro schema defined as follows:
[
{
"namespace": "com.fun.message",
"type": "record",
"name": "FileData",
"doc": "Avro Schema for FileData",
"fields": [
{"name": "id", "type": "string", "doc": "Unique file id" },
{"name": "absolutePath", "type": "string", "doc": "Absolute path of file" },
{"name": "fileName", "type": "string", "doc": "File name" },
{"name": "source", "type": "string", "doc": "unique identification of source" },
{"name": "metaData", "type": {"type": "map", "values": "string"}}
]
}
]
I want to push this data to postgres using jdbc-sink-connector so that I can transform the "metaData" field (which is map type) in my schema to a string. How do I do this?
You need to use SMTs and AFAIK there is currently no SMT that perfectly meets your requirement (ExtractField is a Map.get operation and therefore nested fields cannot be extracted in one pass). You could take a look at Debezium's io.debezium.transforms.UnwrapFromEnvelope SMT which you can modify in order to extract nested fields.
UnwrapFromEnvelope is being used for CDC Event Flattening in order to extract fields from more complex structures like the data formed by Debezium (which I believe is similar to your structure).

json schema issue on required property

I need to write the JSON Schema based on the specification defined by http://json-schema.org/. But I'm struggling for the required/mandatory property validation. Below is the JSON schema that I have written where all the 3 properties are mandatory but In my case either one should be mandatory. How to do this?.
{
"id": "http://example.com/searchShops-schema#",
"$schema": "http://json-schema.org/draft-04/schema#",
"title": "searchShops Service",
"description": "",
"type": "object",
"properties": {
"city":{
"type": "string"
},
"address":{
"type": "string"
},
"zipCode":{
"type": "integer"
}
},
"required": ["city", "address", "zipCode"]
}
If your goal is to tell that "I want at least one member to exist" then use minProperties:
{
"type": "object",
"etc": "etc",
"minProperties": 1
}
Note also that you can use "dependencies" to great effect if you also want additional constraints to exist when this or that member is present.
{
...
"anyOf": [
{ "required": ["city"] },
{ "required": ["address"] },
{ "required": ["zipcode"] },
]
}
Or use "oneOf" if exactly one property should be present