Aerospike Kafka Outbound Connector - using map data type with Kafka Avro format - apache-kafka

I am trying to set up the Aerospike Kafka Outbound Connector (Source Connector) for Change Data Capture (CDC). When using the Kafka Avro format for messages, I get the following error from the connector:
aerospike-kafka_connector-1 | 2022-12-22 20:27:17.607 GMT INFO metrics-ticker - requests-total: rate(per second) mean=8.05313731102431, m1=9.48698171570335, m5=2.480116641993411, m15=0.8667674157832074
aerospike-kafka_connector-1 | 2022-12-22 20:27:17.613 GMT INFO metrics-ticker - requests-total: duration(ms) min=1.441459, max=3101.488585, mean=15.582822432504553, stddev=149.48409869767494, median=4.713083, p75=7.851875, p95=17.496458, p98=28.421125, p99=85.418959, p999=3090.952252
aerospike-kafka_connector-1 | 2022-12-22 20:27:17.624 GMT ERROR metrics-ticker - java.lang.Exception - Map type not allowed, has to be record type: count=184
My data contains a map field. The Avro schema for the data is as follows:
"name": "mydata",
"type": "record",
"fields": [
"name": "metadata",
"type": {
"name": "com.aerospike.metadata",
"type": "record",
"fields": [
"name": "namespace",
"type": "string"
"name": "set",
"type": [
"default": null
"name": "userKey",
"type": [
"default": null
"name": "digest",
"type": "bytes"
"name": "msg",
"type": "string"
"name": "durable",
"type": [
"default": null
"name": "gen",
"type": [
"default": null
"name": "exp",
"type": [
"default": null
"name": "lut",
"type": [
"default": null
"name": "test",
"type": "string"
"name": "testmap",
"type": {
"type": "map",
"values": "string",
"default": {}
Has anyone tried and got this working?
It looks like the Aerospike connector doesn't support the map type. I enabled verbose logging:
aerospike-kafka_connector-1 | 2023-01-06 18:41:31.098 GMT ERROR ErrorRegistry - Error stack trace
aerospike-kafka_connector-1 | java.lang.Exception: Map type not allowed, has to be record type
aerospike-kafka_connector-1 | at com.aerospike.connect.kafka.outbound.parser.KafkaAvroOutboundRecordGenerator$Companion.assertOnlyValidTypes(KafkaAvroOutboundRecordGenerator.kt:155)
aerospike-kafka_connector-1 | at com.aerospike.connect.kafka.outbound.parser.KafkaAvroOutboundRecordGenerator$Companion.assertOnlyValidTypes(KafkaAvroOutboundRecordGenerator.kt:164)
aerospike-kafka_connector-1 | at com.aerospike.connect.kafka.outbound.parser.KafkaAvroOutboundRecordGenerator$Companion.assertSchemaValid(KafkaAvroOutboundRecordGenerator.kt:101)
aerospike-kafka_connector-1 | at com.aerospike.connect.kafka.outbound.parser.KafkaAvroOutboundRecordGenerator.<init>(KafkaAvroOutboundRecordGenerator.kt:180)
aerospike-kafka_connector-1 | at com.aerospike.connect.kafka.outbound.inject.KafkaOutboundGuiceModule.getKafkaAvroStreamingRecordParser(KafkaOutboundGuiceModule.kt:59)
aerospike-kafka_connector-1 | at com.aerospike.connect.kafka.outbound.inject.KafkaOutboundGuiceModule.access$getKafkaAvroStreamingRecordParser(KafkaOutboundGuiceModule.kt:29)
aerospike-kafka_connector-1 | at com.aerospike.connect.kafka.outbound.inject.KafkaOutboundGuiceModule$bindKafkaAvroParserFactory$1.get(KafkaOutboundGuiceModule.kt:48)
aerospike-kafka_connector-1 | at com.aerospike.connect.outbound.converter.XdrExchangeConverter.getInbuiltRecordFormatter(XdrExchangeConverter.kt:422)
aerospike-kafka_connector-1 | at com.aerospike.connect.outbound.converter.XdrExchangeConverter.access$getInbuiltRecordFormatter(XdrExchangeConverter.kt:75)
aerospike-kafka_connector-1 | at com.aerospike.connect.outbound.converter.XdrExchangeConverter$RouterAndInbuiltFormatter.<init>(XdrExchangeConverter.kt:285)
aerospike-kafka_connector-1 | at com.aerospike.connect.outbound.converter.XdrExchangeConverter.processXdrRecord(XdrExchangeConverter.kt:192)
aerospike-kafka_connector-1 | at com.aerospike.connect.outbound.converter.XdrExchangeConverter.parse(XdrExchangeConverter.kt:134)
aerospike-kafka_connector-1 | at com.aerospike.connect.outbound.OutboundBridge$processAsync$1.invokeSuspend(OutboundBridge.kt:182)
aerospike-kafka_connector-1 | at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
aerospike-kafka_connector-1 | at
aerospike-kafka_connector-1 | at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(
aerospike-kafka_connector-1 | at java.base/java.util.concurrent.ThreadPoolExecutor$
aerospike-kafka_connector-1 | at java.base/
However, it's not mentioned in the documentation.
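Going by the stack trace, the failure happens while the connector validates the schema (assertSchemaValid / assertOnlyValidTypes), before any record is converted. To find out which fields would trip such a check, the schema can be walked ahead of time. The sketch below is not the connector's code: find_map_fields is a hypothetical helper, and the trimmed schema is an assumption modelled on the one in the question.

```python
def find_map_fields(schema, path=""):
    """Yield the dotted path of every field whose Avro type is, or contains, a map."""
    if isinstance(schema, list):
        # A union: inspect every branch under the same path.
        for branch in schema:
            yield from find_map_fields(branch, path)
    elif isinstance(schema, dict):
        avro_type = schema.get("type")
        if avro_type == "map":
            yield path
        elif avro_type == "record":
            for field in schema.get("fields", []):
                child = f"{path}.{field['name']}" if path else field["name"]
                yield from find_map_fields(field["type"], child)
        elif avro_type == "array":
            yield from find_map_fields(schema["items"], path)
        elif isinstance(avro_type, (dict, list)):
            # A nested type definition wrapped in another {"type": ...} object.
            yield from find_map_fields(avro_type, path)
    # Plain strings such as "string" or "null" are primitives: nothing to report.


# Trimmed, assumed version of the schema from the question.
SCHEMA = {
    "name": "mydata",
    "type": "record",
    "fields": [
        {
            "name": "metadata",
            "type": {
                "name": "com.aerospike.metadata",
                "type": "record",
                "fields": [{"name": "namespace", "type": "string"}],
            },
        },
        {"name": "test", "type": "string"},
        {"name": "testmap", "type": {"type": "map", "values": "string", "default": {}}},
    ],
}

print(list(find_map_fields(SCHEMA)))
```

Any path it reports is a candidate for remodelling, for example as a nested record with fixed field names, if the map keys are known in advance.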


Sink Connector auto create tables with proper data type

I have the Debezium source connector for PostgreSQL with Avro as the value converter, and it uses the schema registry.
Source DDL:
Table "public.tbl1"
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description
id | integer | | not null | nextval('tbl1_id_seq'::regclass) | plain | | |
name | character varying(100) | | | | extended | | |
col4 | numeric | | | | main | | |
col5 | bigint | | | | plain | | |
col6 | timestamp without time zone | | | | plain | | |
col7 | timestamp with time zone | | | | plain | | |
col8 | boolean | | | | plain | | |
"tbl1_pkey" PRIMARY KEY, btree (id)
Access method: heap
In the schema registry:
"type": "record",
"name": "Value",
"namespace": "test.public.tbl1",
"fields": [
"name": "id",
"type": {
"type": "int",
"connect.parameters": {
"__debezium.source.column.type": "SERIAL",
"__debezium.source.column.length": "10",
"__debezium.source.column.scale": "0"
"connect.default": 0
"default": 0
"name": "name",
"type": [
"type": "string",
"connect.parameters": {
"__debezium.source.column.type": "VARCHAR",
"__debezium.source.column.length": "100",
"__debezium.source.column.scale": "0"
"default": null
"name": "col4",
"type": [
"type": "double",
"connect.parameters": {
"__debezium.source.column.type": "NUMERIC",
"__debezium.source.column.length": "0"
"default": null
"name": "col5",
"type": [
"type": "long",
"connect.parameters": {
"__debezium.source.column.type": "INT8",
"__debezium.source.column.length": "19",
"__debezium.source.column.scale": "0"
"default": null
"name": "col6",
"type": [
"type": "long",
"connect.version": 1,
"connect.parameters": {
"__debezium.source.column.type": "TIMESTAMP",
"__debezium.source.column.length": "29",
"__debezium.source.column.scale": "6"
"": "io.debezium.time.MicroTimestamp"
"default": null
"name": "col7",
"type": [
"type": "string",
"connect.version": 1,
"connect.parameters": {
"__debezium.source.column.type": "TIMESTAMPTZ",
"__debezium.source.column.length": "35",
"__debezium.source.column.scale": "6"
"": "io.debezium.time.ZonedTimestamp"
"default": null
"name": "col8",
"type": [
"type": "boolean",
"connect.parameters": {
"__debezium.source.column.type": "BOOL",
"__debezium.source.column.length": "1",
"__debezium.source.column.scale": "0"
"default": null
"": "test.public.tbl1.Value"
But in the target PostgreSQL database, the data types are completely mismatched for the ID and timestamp columns, and sometimes for decimal columns as well (that's due to this):
Table "public.tbl1"
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description
id | text | | not null | | extended | | |
name | text | | | | extended | | |
col4 | double precision | | | | plain | | |
col5 | bigint | | | | plain | | |
col6 | bigint | | | | plain | | |
col7 | text | | | | extended | | |
col8 | boolean | | | | plain | | |
"tbl1_pkey" PRIMARY KEY, btree (id)
Access method: heap
I'm trying to understand why, even with the schema registry, it's not creating the target tables with the proper data types.
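The col6 mismatch can be read off the registered schema: the field is an Avro long tagged io.debezium.time.MicroTimestamp, so anything that looks only at the primitive type sees a 64-bit integer, which matches the bigint column in the target. A minimal sketch of what that long encodes, assuming it counts microseconds since the Unix epoch as the Debezium type name suggests; the function name and sample value are hypothetical:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def micro_timestamp_to_datetime(micros: int) -> datetime:
    """Interpret a MicroTimestamp-style long as microseconds since the epoch."""
    # timedelta keeps the arithmetic in integers, so no float rounding occurs.
    return EPOCH + timedelta(microseconds=micros)


# Hypothetical sample value: 1.5 seconds after the epoch.
print(micro_timestamp_to_datetime(1_500_000).isoformat())
```

Getting a real timestamp column in the target therefore needs a conversion step somewhere between the topic and the table; the sink cannot infer it from the long alone.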
Sink config:
"name": "t1-sink",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"tasks.max": "1",
"topics": "test.public.tbl1",
"connection.url": "jdbc:postgresql://172.31.85.***:5432/test",
"connection.user": "postgres",
"connection.password": "***",
"": "PostgreSqlDatabaseDialect",
"auto.create": "true",
"insert.mode": "upsert",
"delete.enabled": "true",
"pk.fields": "id",
"pk.mode": "record_key",
"": "tbl1",
"key.converter": "",
"key.converter.schemas.enable": "false",
"internal.key.converter": "",
"internal.key.converter.schemas.enable": "true",
"internal.value.converter": "",
"internal.value.converter.schemas.enable": "true",
"value.converter": "",
"value.converter.schemas.enable": "true",
"value.converter.region": "us-east-1",
"key.converter.region": "us-east-1",
"key.converter.schemaAutoRegistrationEnabled": "true",
"value.converter.schemaAutoRegistrationEnabled": "true",
"key.converter.avroRecordType": "GENERIC_RECORD",
"value.converter.avroRecordType": "GENERIC_RECORD",
"": "bhuvi-debezium",
"": "bhuvi-debezium",
"value.converter.column.propagate.source.type": ".*",
"value.converter.datatype.propagate.source.type": ".*"

Kafka jdbc sink connector to Postgres failing with "cross-database references are not implemented"

My setup is based on Docker containers: one Oracle DB, Kafka, Kafka Connect and Postgres. I first use the Oracle CDC connector to feed Kafka, which works fine. Then I try to read that topic and feed it into Postgres.
When I start the connector I am getting:
"trace": "org.apache.kafka.connect.errors.ConnectException: Exiting WorkerSinkTask due to unrecoverable exception.\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.poll(\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.execute(\n\tat org.apache.kafka.connect.runtime.WorkerTask.doRun(\n\tat\n\tat java.base/java.util.concurrent.Executors$\n\tat java.base/\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$\n\tat java.base/\nCaused by: org.apache.kafka.connect.errors.ConnectException: java.sql.SQLException: Exception chain:\norg.postgresql.util.PSQLException: ERROR: cross-database references are not implemented: "ORCLCDB.C__MYUSER.EMP"\n Position: 14\n\n\tat io.confluent.connect.jdbc.sink.JdbcSinkTask.put(\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(\n\t... 10 more\nCaused by: java.sql.SQLException: Exception chain:\norg.postgresql.util.PSQLException: ERROR: cross-database references are not implemented: "ORCLCDB.C__MYUSER.EMP"\n Position: 14\n\n\tat io.confluent.connect.jdbc.sink.JdbcSinkTask.getAllMessagesException(\n\tat io.confluent.connect.jdbc.sink.JdbcSinkTask.put(\n\t... 11 more\n"
My config JSON looks like this:
"name": "SimplePostgresSink",
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"name": "SimplePostgresSink",
"topics": "ORCLCDB.C__MYUSER.EMP",
"key.converter": "io.confluent.connect.avro.AvroConverter",
"key.converter.schema.registry.url": "http://schema-registry:8081",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url": "http://schema-registry:8081",
"connection.url": "jdbc:postgresql://postgres:5432/postgres",
"connection.user": "postgres",
"connection.password": "postgres",
"insert.mode": "upsert",
"pk.mode": "record_value",
"pk.fields": "I",
"auto.create": "true",
"auto.evolve": "true"
And the topic schema is:
"type": "record",
"name": "ConnectDefault",
"namespace": "io.confluent.connect.avro",
"fields": [
"name": "I",
"type": {
"type": "bytes",
"scale": 0,
"precision": 64,
"connect.version": 1,
"connect.parameters": {
"scale": "0"
"": "",
"logicalType": "decimal"
"name": "NAME",
"type": [
"default": null
"name": "table",
"type": [
"default": null
"name": "scn",
"type": [
"default": null
"name": "op_type",
"type": [
"default": null
"name": "op_ts",
"type": [
"default": null
"name": "current_ts",
"type": [
"default": null
"name": "row_id",
"type": [
"default": null
"name": "username",
"type": [
"default": null
I am interested in the columns I and Name
This can also be caused by the db name in your connection.url not matching the db name in your I recently experienced this and thought it might help others.
Looking at the log, I saw that the connector has a problem creating a table with that name:
[2022-01-07 23:56:11,737] INFO JdbcDbWriter Connected (io.confluent.connect.jdbc.sink.JdbcDbWriter)
[2022-01-07 23:56:11,759] INFO Checking PostgreSql dialect for existence of TABLE "ORCLCDB"."C__MYUSER"."EMP" (io.confluent.connect.jdbc.dialect.GenericDatabaseDialect)
[2022-01-07 23:56:11,764] INFO Using PostgreSql dialect TABLE "ORCLCDB"."C__MYUSER"."EMP" absent (io.confluent.connect.jdbc.dialect.GenericDatabaseDialect)
[2022-01-07 23:56:11,764] INFO Creating table with sql: CREATE TABLE "ORCLCDB"."C__MYUSER"."EMP" (
PRIMARY KEY("I","NAME")) (io.confluent.connect.jdbc.sink.DbStructure)
[2022-01-07 23:56:11,765] WARN Create failed, will attempt amend if table already exists (io.confluent.connect.jdbc.sink.DbStructure)
org.postgresql.util.PSQLException: ERROR: cross-database references are not implemented: "ORCLCDB.C__MYUSER.EMP"
Extending on Steven's comment
This can also occur if your topic name contains full stops
If your it defaults to ${topicName}
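The log above shows the dotted name being split into three quoted identifiers, "ORCLCDB"."C__MYUSER"."EMP". PostgreSQL reads a three-part name as database.schema.table and rejects it when the first part is not the current database, which is the error reported. A minimal sketch of that split, plus one possible workaround of flattening the name before it is used as a table name. Both helper names are hypothetical, and this is not the connector's own logic:

```python
def identifier_parts(name: str) -> list[str]:
    """Split a dotted name into the parts a SQL parser would see."""
    return name.split(".")


def flatten_topic(topic: str, separator: str = "_") -> str:
    """Replace dots so the whole topic name becomes a single identifier."""
    return topic.replace(".", separator)


topic = "ORCLCDB.C__MYUSER.EMP"

parts = identifier_parts(topic)
print(len(parts), parts)     # three parts: read as database.schema.table
print(flatten_topic(topic))  # one part: a plain table name
```

Whichever mechanism sets the target table name, the aim is the same: the name handed to PostgreSQL should have at most two dot-separated parts (schema and table).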

selecting and filtering cloudflare pagerules with jq

I was trying to work out the best way to filter out some pagerules data from Cloudflare, and while I've got a solution I'm looking at how ugly it is and thinking "there has to be a simpler way to do this."
I'm specifically asking about a better way to achieve the following goal using jq. I understand there are programming libraries I could use to accomplish the same, but the point of this question is to get a better understanding of how jq is intended to work.
Say I've got a long list of Cloudflare pagerules records; here are a few entries as a minimal example:
"": [
"id": "341",
"targets": [
"target": "url",
"constraint": {
"operator": "matches",
"value": "*"
"actions": [
"id": "always_use_https"
"priority": 12,
"status": "active",
"created_on": "2017-11-29T18:07:36.000000Z",
"modified_on": "2020-09-02T16:09:03.000000Z"
"id": "406",
"targets": [
"target": "url",
"constraint": {
"operator": "matches",
"value": "*"
"actions": [
"id": "always_use_https"
"priority": 9,
"status": "active",
"created_on": "2017-11-29T18:07:55.000000Z",
"modified_on": "2020-09-02T16:09:03.000000Z"
"id": "427",
"targets": [
"target": "url",
"constraint": {
"operator": "matches",
"value": "*"
"actions": [
"id": "ssl",
"value": "flexible"
"priority": 8,
"status": "active",
"created_on": "2017-11-29T18:08:00.000000Z",
"modified_on": "2020-09-02T16:09:03.000000Z"
What I want to do is extract the URLs nested in the constraint.value fields for the always_use_https actions. The goal is to extract the values and return them as a JSON array. What I came up with is this:
jq '[
.[] | .[] | select(.actions[].id | contains("always_use_https"))
] | .[].targets[] | select(.target | contains("url"))
] | .[] | .constraint | select(.operator | contains("matches"))
] | .[].value
Against our example this produces:
Is there a more succinct way to achieve this in jq?
This produces the expected output in accordance with the criteria as I understand them:
jq '.[""]
| map(select( any(.actions[]; .id == "always_use_https"))
| .targets[]
| select(.target == "url")
| .constraint.value )
' cloudfare.json
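For comparison, the same selection can be written as a nested comprehension. This is only a sketch in Python, not jq: the top-level key is blank in the example above, so "result" is an assumed placeholder, and the URL patterns are made up for illustration.

```python
def https_urls(rules):
    """Return constraint values of url targets for rules that force HTTPS."""
    return [
        target["constraint"]["value"]
        for rule in rules
        # Same test as any(.actions[]; .id == "always_use_https") in jq.
        if any(action["id"] == "always_use_https" for action in rule["actions"])
        for target in rule["targets"]
        if target["target"] == "url"
    ]


def make_rule(rule_id, url_pattern, action):
    """Build a rule shaped like the entries in the question."""
    return {
        "id": rule_id,
        "targets": [
            {"target": "url", "constraint": {"operator": "matches", "value": url_pattern}}
        ],
        "actions": [action],
    }


# Assumed sample data; the patterns are hypothetical.
data = {
    "result": [
        make_rule("341", "*example.com/a*", {"id": "always_use_https"}),
        make_rule("406", "*example.com/b*", {"id": "always_use_https"}),
        make_rule("427", "*example.com/c*", {"id": "ssl", "value": "flexible"}),
    ]
}

print(https_urls(data["result"]))
```

As in the jq answer, each rule is tested once with any(), so a rule with several matching actions does not produce duplicate URLs.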

Create table from topic and got some serialization exception about Size of data received by LongDeserializer is not 8

Environments (Docker):
# 5.5.1
image: confluentinc/cp-zookeeper:latest
# 2.13-2.6.0
image: wurstmeister/kafka:latest
# 5.5.1
image: confluentinc/cp-schema-registry:latest
# 5.5.1
image: confluentinc/cp-kafka-connect:latest
# 0.11.0
image: confluentinc/ksqldb-server:latest
The Kafka topic content comes from Kafka Connect (using Debezium).
When I run the query (select * from user emit changes),
most of the content is shown, but some of it is missing.
I looked at the logs from ksqldb-server and found some error messages:
ksqldb-server | [2020-08-29 12:44:23,008] ERROR {"type":0,"deserializationError":{"errorMessage":"Error deserializing DELIMITED message from topic: pa.new_pa.user","recordB64":null,"cause":["Size of data received by LongDeserializer is not 8"],"topic":"pa.new_pa.user"},"recordProcessingError":null,"productionError":null} (processing.CTAS_USER2_0.KsqlTopic.Source.deserializer:44)
ksqldb-server | [2020-08-29 12:44:23,008] WARN Exception caught during Deserialization, taskId: 0_0, topic: pa.new_pa.user, partition: 0, offset: 23095 (org.apache.kafka.streams.processor.internals.StreamThread:36)
ksqldb-server | org.apache.kafka.common.errors.SerializationException: Error deserializing DELIMITED message from topic: pa.new_pa.user
ksqldb-server | Caused by: org.apache.kafka.common.errors.SerializationException: Size of data received by LongDeserializer is not 8
ksqldb-server | [2020-08-29 12:44:23,009] WARN stream-thread [_confluent-ksql-default_query_CTAS_USER2_0-6637e2a8-c417-49fa-bb65-d0d1a5205af1-StreamThread-1] task [0_0] Skipping record due to deserialization error. topic=[pa.new_pa.user] partition=[0] offset=[23095] (org.apache.kafka.streams.processor.internals.RecordDeserializer:88)
ksqldb-server | org.apache.kafka.common.errors.SerializationException: Error deserializing DELIMITED message from topic: pa.new_pa.user
ksqldb-server | Caused by: org.apache.kafka.common.errors.SerializationException: Size of data received by LongDeserializer is not 8
I tried consuming the message at offset 23095, and it looks fine.
[2020-08-29 13:24:12,021] INFO [Consumer clientId=consumer-console-consumer-37294-1, groupId=console-consumer-37294] Subscribed to partition(s): pa.new_pa.user-0 (org.apache.kafka.clients.consumer.KafkaConsumer)
[2020-08-29 13:24:12,026] INFO [Consumer clientId=consumer-console-consumer-37294-1, groupId=console-consumer-37294] Seeking to offset 23095 for partition pa.new_pa.user-0 (org.apache.kafka.clients.consumer.KafkaConsumer)
[2020-08-29 13:24:12,570] INFO [Consumer clientId=consumer-console-consumer-37294-1, groupId=console-consumer-37294] Cluster ID: rdsgvpoESzer6IAxQDlLUA (org.apache.kafka.clients.Metadata)
Here is my source connector config and the rest of my setup:
"connector.class" = 'io.debezium.connector.mysql.MySqlConnector',
"tasks.max" = '1',
"database.hostname" = '',
"database.port" = '3306',
"database.user" = 'root',
"database.password" = 'xxxxxxx',
"" = '10001',
"" = 'pa',
"database.whitelist" = 'new_pa',
"table.whitelist" = 'new_pa.user, new_pa.user_created,',
"database.history.kafka.bootstrap.servers" = 'kafka:9092',
"database.history.kafka.topic" = '',
"transforms" = 'unwrap',
"transforms.unwrap.type" = 'io.debezium.transforms.ExtractNewRecordState',
"transforms.unwrap.delete.handling.mode" = 'drop',
"transforms.unwrap.drop.tombstones" = 'true',
"key.converter" = 'io.confluent.connect.avro.AvroConverter',
"value.converter" = 'io.confluent.connect.avro.AvroConverter',
"key.converter.schema.registry.url" = 'http://schema-registry:8081',
"value.converter.schema.registry.url" = 'http://schema-registry:8081',
"key.converter.schemas.enable" = 'true',
"value.converter.schemas.enable" = 'true'
KAFKA_TOPIC = 'pa.new_pa.user',
Schema of topic (auto generated):
"": "pa.new_pa.user.Key",
"fields": [
"name": "id",
"type": "long"
"name": "Key",
"namespace": "pa.new_pa.user",
"type": "record"
"": "pa.new_pa.user.Value",
"fields": [
"name": "id",
"type": "long"
"default": null,
"name": "parent_id",
"type": [
"default": 0,
"name": "upper_id",
"type": {
"connect.default": 0,
"type": "long"
"name": "username",
"type": "string"
"name": "domain",
"type": "int"
"name": "role",
"type": {
"connect.type": "int16",
"type": "int"
"name": "modified_at",
"type": {
"": "io.debezium.time.Timestamp",
"connect.version": 1,
"type": "long"
"default": null,
"name": "blacklist_modified_at",
"type": [
"": "io.debezium.time.Timestamp",
"connect.version": 1,
"type": "long"
"default": null,
"name": "tied_at",
"type": [
"": "io.debezium.time.Timestamp",
"connect.version": 1,
"type": "long"
"name": "name",
"type": "string"
"name": "enable",
"type": {
"connect.type": "int16",
"type": "int"
"name": "is_default",
"type": {
"connect.type": "int16",
"type": "int"
"name": "bankrupt",
"type": {
"connect.type": "int16",
"type": "int"
"name": "locked",
"type": {
"connect.type": "int16",
"type": "int"
"name": "tied",
"type": {
"connect.type": "int16",
"type": "int"
"name": "checked",
"type": {
"connect.type": "int16",
"type": "int"
"name": "failed",
"type": {
"connect.type": "int16",
"type": "int"
"default": null,
"name": "last_login",
"type": [
"": "io.debezium.time.Timestamp",
"connect.version": 1,
"type": "long"
"default": null,
"name": "last_online",
"type": [
"": "io.debezium.time.Timestamp",
"connect.version": 1,
"type": "long"
"default": null,
"name": "last_ip",
"type": [
"default": null,
"name": "last_country",
"type": [
"name": "last_city_id",
"type": "long"
"name": "Value",
"namespace": "pa.new_pa.user",
"type": "record"
The issue here is that your keys are in Avro, and ksqlDB currently only supports KAFKA formatted keys (as of version 0.12).
Avro keys are actively being worked on: #4461 adds support for Avro primitives and #4997 extends this to support single key columns within an Avro record (as you have here).
You're setting the Avro key format with the config:
"key.converter" = 'io.confluent.connect.avro.AvroConverter',
Your SQL:
KAFKA_TOPIC = 'pa.new_pa.user',
sets the VALUE_FORMAT to AVRO, but the key format is currently KAFKA. Hence you may be able to use:
"key.converter" = 'org.apache.kafka.connect.converters.IntegerConverter',
to convert the key into the right format. More info on the correct converters to use for the KAFKA format is available on the ksqlDB microsite.
The id field type is BIGINT.
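To see why a BIGINT key written by the AvroConverter triggers "Size of data received by LongDeserializer is not 8", here is a minimal sketch. It assumes the usual Confluent wire format (one magic byte, a 4-byte schema ID, then the Avro-encoded payload) and a key record holding a single long field. The schema ID 1 and the helper names are hypothetical.

```python
import struct


def zigzag_varint(n: int) -> bytes:
    """Encode a signed 64-bit integer the way Avro encodes int and long."""
    z = (n << 1) ^ (n >> 63)  # zigzag: small magnitudes become small codes
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def confluent_avro_key(schema_id: int, id_value: int) -> bytes:
    """Magic byte + big-endian schema ID + Avro body of a one-field record."""
    return b"\x00" + struct.pack(">I", schema_id) + zigzag_varint(id_value)


def kafka_long_key(id_value: int) -> bytes:
    """What an 8-byte long deserializer expects: a big-endian signed long."""
    return struct.pack(">q", id_value)


avro_key = confluent_avro_key(schema_id=1, id_value=1)
print(len(avro_key), avro_key.hex())
print(len(kafka_long_key(1)), kafka_long_key(1).hex())
```

Under these assumptions the Avro key for id = 1 is 6 bytes long, and its length varies with the value, so a deserializer that insists on exactly 8 bytes rejects it.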
I tried modifying the config:
key.converter = (ref: ksqlDB microsite), and got the error: LongConverter could not be found.
Set key.converter = org.apache.kafka.connect.converters.LongConverter (ref), and got the error:
kafka-connect | [2020-09-03 06:48:25,194] ERROR WorkerSourceTask{id=pa_source_avro2-0} Task threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask)
kafka-connect | org.apache.kafka.connect.errors.ConnectException: Tolerance exceeded in error handler
kafka-connect | at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(
kafka-connect | at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execute(
kafka-connect | at org.apache.kafka.connect.runtime.WorkerSourceTask.convertTransformedRecord(
kafka-connect | at org.apache.kafka.connect.runtime.WorkerSourceTask.sendRecords(
kafka-connect | at org.apache.kafka.connect.runtime.WorkerSourceTask.execute(
kafka-connect | at org.apache.kafka.connect.runtime.WorkerTask.doRun(
kafka-connect | at
kafka-connect | at java.util.concurrent.Executors$
kafka-connect | at
kafka-connect | at java.util.concurrent.ThreadPoolExecutor.runWorker(
kafka-connect | at java.util.concurrent.ThreadPoolExecutor$
kafka-connect | at
I used the JSON (key) and Avro (value) converters, and created the tables:
"connector.class" = 'io.debezium.connector.mysql.MySqlConnector',
"tasks.max" = '1',
"database.hostname" = '',
"database.port" = '3306',
"database.user" = 'root',
"database.password" = '6881703',
"" = '10001',
"" = 'pa3',
"database.whitelist" = 'new_pa',
"table.whitelist" = 'new_pa.user, new_pa.user_created,',
"database.history.kafka.bootstrap.servers" = 'kafka:9092',
"database.history.kafka.topic" = 'schema-changes.pa3',
"transforms" = 'unwrap',
"transforms.unwrap.type" = 'io.debezium.transforms.ExtractNewRecordState',
"key.converter" = 'org.apache.kafka.connect.json.JsonConverter',
"value.converter" = 'io.confluent.connect.avro.AvroConverter',
"key.converter.schema.registry.url" = 'http://schema-registry:8081',
"value.converter.schema.registry.url" = 'http://schema-registry:8081',
"key.converter.schemas.enable" = 'false',
"value.converter.schemas.enable" = 'true',
"include.schema.changes" = 'true'
CREATE TABLE user_with_string_key (`row_id` STRING PRIMARY KEY) WITH (
KAFKA_TOPIC = 'pa3.new_pa.user',
CREATE TABLE user_with_bigint_key (`row_id` BIGINT PRIMARY KEY) WITH (
KAFKA_TOPIC = 'pa3.new_pa.user',
I can get all the data from MySQL, but the content of row_id differs between the two tables.
The reason may be the KAFKA format.
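The two row_id values in the output are consistent with the same key bytes being read in two different ways. A minimal sketch, assuming the JSON converter wrote the key as the UTF-8 text {"id":1} and that a BIGINT key in the KAFKA format is decoded as an 8-byte big-endian signed long:

```python
import struct

# Assumed raw key bytes, as written by the JSON converter without schemas.
json_key = b'{"id":1}'

# The text happens to be exactly 8 bytes long, so a long decoder accepts it.
assert len(json_key) == 8

# Decode the same bytes as a big-endian signed 64-bit integer.
(as_bigint,) = struct.unpack(">q", json_key)

print(json_key.decode("utf-8"))  # what a STRING key column would show
print(as_bigint)                 # 8872770094665183613
```

The BIGINT value displayed in the output, 8872770094665183000, looks like this same number after rounding to double precision in a JSON client. Keys whose JSON text is not exactly 8 bytes long, such as {"id":10}, would presumably fail with the LongDeserializer size error instead.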
"row_id": "{\"id\":1}",
"ID": 1,
"PARENT_ID": null,
"UPPER_ID": 0,
"USERNAME": "bmw999",
"DOMAIN": 1,
"ROLE": 3,
"MODIFIED_AT": 1532017653000,
"TIED_AT": null,
"NAME": "bmw999",
"ENABLE": 1,
"LOCKED": 1,
"TIED": 0,
"FAILED": 0,
"LAST_LOGIN": 1539806130000,
"LAST_ONLINE": 1508259330000,
"row_id": 8872770094665183000,
"ID": 1,
"PARENT_ID": null,
"UPPER_ID": 0,
"USERNAME": "bmw999",
"DOMAIN": 1,
"ROLE": 3,
"MODIFIED_AT": 1532017653000,
"TIED_AT": null,
"NAME": "bmw999",
"ENABLE": 1,
"LOCKED": 1,
"TIED": 0,
"FAILED": 0,
"LAST_LOGIN": 1539806130000,
"LAST_ONLINE": 1508259330000,

How to validate my data with jsonSchema scala

I have a DataFrame that looks like this:
| id | migration|number|string|
|[5e5db036e0403b1a. |mig | 1| str |
and I have a jsonSchema:
"title": "Section",
"type": "object",
"additionalProperties": false,
"required": ["migration", "id"],
"properties": {
"migration": {
"type": "string",
"additionalProperties": false
"string": {
"type": "string"
"number": {
"type": "number",
"min": 0
I would like to validate the schema of my dataframe with my jsonSchema.
Thank you
Please find the explanation in the inline code comments:
val newSchema : StructType = DataType.fromJson("""{
| "type": "struct",
| "fields": [
| {
| "name": "id",
| "type": "string",
| "nullable": true,
| "metadata": {}
| },
| {
| "name": "migration",
| "type": "string",
| "nullable": true,
| "metadata": {}
| },
| {
| "name": "number",
| "type": "integer",
| "nullable": false,
| "metadata": {}
| },
| {
| "name": "string",
| "type": "string",
| "nullable": true,
| "metadata": {}
| }
| ]
|}""".stripMargin).asInstanceOf[StructType] // Load your schema from a JSON string
// println(newSchema)
val spark = Constant.getSparkSess // Create SparkSession object
//Correct data
val correctData: RDD[Row] = spark.sparkContext.parallelize(Seq(Row("5e5db036e0403b1a.","mig",1,"str")))
val dfNew = spark.createDataFrame(correctData, newSchema) // validating the data
//InCorrect data
val inCorrectData: RDD[Row] = spark.sparkContext.parallelize(Seq(Row("5e5db036e0403b1a.",1,1,"str")))
val dfInvalid = spark.createDataFrame(inCorrectData, newSchema) // validating the data which will throw RuntimeException: java.lang.Integer is not a valid external type for schema of string
val res = spark.sql("") // Load the SQL dataframe
val diffColumn : Seq[StructField] = res.schema.diff(newSchema) // compare SQL dataframe with JSON schema
diffColumn.foreach( // Print the Diff columns