Aerospike Kafka Outbound Connector - using map data type with Kafka Avro format - apache-kafka

I am trying to set up the Aerospike Kafka Outbound Connector (source connector) for Change Data Capture (CDC). When using the Kafka Avro format for messages (as explained in https://docs.aerospike.com/connect/kafka/from-asdb/formats/kafka-avro-serialization-format), I get the following error from the connector:
aerospike-kafka_connector-1 | 2022-12-22 20:27:17.607 GMT INFO metrics-ticker - requests-total: rate(per second) mean=8.05313731102431, m1=9.48698171570335, m5=2.480116641993411, m15=0.8667674157832074
aerospike-kafka_connector-1 | 2022-12-22 20:27:17.613 GMT INFO metrics-ticker - requests-total: duration(ms) min=1.441459, max=3101.488585, mean=15.582822432504553, stddev=149.48409869767494, median=4.713083, p75=7.851875, p95=17.496458, p98=28.421125, p99=85.418959, p999=3090.952252
aerospike-kafka_connector-1 | 2022-12-22 20:27:17.624 GMT ERROR metrics-ticker - java.lang.Exception - Map type not allowed, has to be record type: count=184
My data contains a map field. The Avro schema for the data is as follows:
{
  "name": "mydata",
  "type": "record",
  "fields": [
    {
      "name": "metadata",
      "type": {
        "name": "com.aerospike.metadata",
        "type": "record",
        "fields": [
          {
            "name": "namespace",
            "type": "string"
          },
          {
            "name": "set",
            "type": [
              "null",
              "string"
            ],
            "default": null
          },
          {
            "name": "userKey",
            "type": [
              "null",
              "long",
              "double",
              "bytes",
              "string"
            ],
            "default": null
          },
          {
            "name": "digest",
            "type": "bytes"
          },
          {
            "name": "msg",
            "type": "string"
          },
          {
            "name": "durable",
            "type": [
              "null",
              "boolean"
            ],
            "default": null
          },
          {
            "name": "gen",
            "type": [
              "null",
              "int"
            ],
            "default": null
          },
          {
            "name": "exp",
            "type": [
              "null",
              "int"
            ],
            "default": null
          },
          {
            "name": "lut",
            "type": [
              "null",
              "long"
            ],
            "default": null
          }
        ]
      }
    },
    {
      "name": "test",
      "type": "string"
    },
    {
      "name": "testmap",
      "type": {
        "type": "map",
        "values": "string",
        "default": {}
      }
    }
  ]
}
Has anyone tried this and gotten it working?
Edit:
It looks like the Aerospike connector doesn't support the map type. I enabled verbose logging:
aerospike-kafka_connector-1 | 2023-01-06 18:41:31.098 GMT ERROR ErrorRegistry - Error stack trace
aerospike-kafka_connector-1 | java.lang.Exception: Map type not allowed, has to be record type
aerospike-kafka_connector-1 | at com.aerospike.connect.kafka.outbound.parser.KafkaAvroOutboundRecordGenerator$Companion.assertOnlyValidTypes(KafkaAvroOutboundRecordGenerator.kt:155)
aerospike-kafka_connector-1 | at com.aerospike.connect.kafka.outbound.parser.KafkaAvroOutboundRecordGenerator$Companion.assertOnlyValidTypes(KafkaAvroOutboundRecordGenerator.kt:164)
aerospike-kafka_connector-1 | at com.aerospike.connect.kafka.outbound.parser.KafkaAvroOutboundRecordGenerator$Companion.assertSchemaValid(KafkaAvroOutboundRecordGenerator.kt:101)
aerospike-kafka_connector-1 | at com.aerospike.connect.kafka.outbound.parser.KafkaAvroOutboundRecordGenerator.<init>(KafkaAvroOutboundRecordGenerator.kt:180)
aerospike-kafka_connector-1 | at com.aerospike.connect.kafka.outbound.inject.KafkaOutboundGuiceModule.getKafkaAvroStreamingRecordParser(KafkaOutboundGuiceModule.kt:59)
aerospike-kafka_connector-1 | at com.aerospike.connect.kafka.outbound.inject.KafkaOutboundGuiceModule.access$getKafkaAvroStreamingRecordParser(KafkaOutboundGuiceModule.kt:29)
aerospike-kafka_connector-1 | at com.aerospike.connect.kafka.outbound.inject.KafkaOutboundGuiceModule$bindKafkaAvroParserFactory$1.get(KafkaOutboundGuiceModule.kt:48)
aerospike-kafka_connector-1 | at com.aerospike.connect.outbound.converter.XdrExchangeConverter.getInbuiltRecordFormatter(XdrExchangeConverter.kt:422)
aerospike-kafka_connector-1 | at com.aerospike.connect.outbound.converter.XdrExchangeConverter.access$getInbuiltRecordFormatter(XdrExchangeConverter.kt:75)
aerospike-kafka_connector-1 | at com.aerospike.connect.outbound.converter.XdrExchangeConverter$RouterAndInbuiltFormatter.<init>(XdrExchangeConverter.kt:285)
aerospike-kafka_connector-1 | at com.aerospike.connect.outbound.converter.XdrExchangeConverter.processXdrRecord(XdrExchangeConverter.kt:192)
aerospike-kafka_connector-1 | at com.aerospike.connect.outbound.converter.XdrExchangeConverter.parse(XdrExchangeConverter.kt:134)
aerospike-kafka_connector-1 | at com.aerospike.connect.outbound.OutboundBridge$processAsync$1.invokeSuspend(OutboundBridge.kt:182)
aerospike-kafka_connector-1 | at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
aerospike-kafka_connector-1 | at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
aerospike-kafka_connector-1 | at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
aerospike-kafka_connector-1 | at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
aerospike-kafka_connector-1 | at java.base/java.lang.Thread.run(Thread.java:829)
However, this limitation is not mentioned in the documentation.
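A possible workaround (an assumption on my part, not something from the Aerospike docs) is to declare the bin as a nested record whose fields are the map keys, which only works if the keys are known up front. The record name and the key1/key2 fields below are placeholders:
{
  "name": "testmap",
  "type": {
    "name": "com.example.testmap",
    "type": "record",
    "fields": [
      { "name": "key1", "type": [ "null", "string" ], "default": null },
      { "name": "key2", "type": [ "null", "string" ], "default": null }
    ]
  }
}
Whether the connector will then map an Aerospike map bin onto such a record is something I have not verified.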

Related

Sink Connector auto create tables with proper data type

I have a Debezium source connector for PostgreSQL with Avro as the value converter, and it uses the schema registry.
Source DDL:
Table "public.tbl1"
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description
--------+-----------------------------+-----------+----------+----------------------------------+----------+-------------+--------------+-------------
id | integer | | not null | nextval('tbl1_id_seq'::regclass) | plain | | |
name | character varying(100) | | | | extended | | |
col4 | numeric | | | | main | | |
col5 | bigint | | | | plain | | |
col6 | timestamp without time zone | | | | plain | | |
col7 | timestamp with time zone | | | | plain | | |
col8 | boolean | | | | plain | | |
Indexes:
"tbl1_pkey" PRIMARY KEY, btree (id)
Publications:
"dbz_publication"
Access method: heap
In the schema registry:
{
"type": "record",
"name": "Value",
"namespace": "test.public.tbl1",
"fields": [
{
"name": "id",
"type": {
"type": "int",
"connect.parameters": {
"__debezium.source.column.type": "SERIAL",
"__debezium.source.column.length": "10",
"__debezium.source.column.scale": "0"
},
"connect.default": 0
},
"default": 0
},
{
"name": "name",
"type": [
"null",
{
"type": "string",
"connect.parameters": {
"__debezium.source.column.type": "VARCHAR",
"__debezium.source.column.length": "100",
"__debezium.source.column.scale": "0"
}
}
],
"default": null
},
{
"name": "col4",
"type": [
"null",
{
"type": "double",
"connect.parameters": {
"__debezium.source.column.type": "NUMERIC",
"__debezium.source.column.length": "0"
}
}
],
"default": null
},
{
"name": "col5",
"type": [
"null",
{
"type": "long",
"connect.parameters": {
"__debezium.source.column.type": "INT8",
"__debezium.source.column.length": "19",
"__debezium.source.column.scale": "0"
}
}
],
"default": null
},
{
"name": "col6",
"type": [
"null",
{
"type": "long",
"connect.version": 1,
"connect.parameters": {
"__debezium.source.column.type": "TIMESTAMP",
"__debezium.source.column.length": "29",
"__debezium.source.column.scale": "6"
},
"connect.name": "io.debezium.time.MicroTimestamp"
}
],
"default": null
},
{
"name": "col7",
"type": [
"null",
{
"type": "string",
"connect.version": 1,
"connect.parameters": {
"__debezium.source.column.type": "TIMESTAMPTZ",
"__debezium.source.column.length": "35",
"__debezium.source.column.scale": "6"
},
"connect.name": "io.debezium.time.ZonedTimestamp"
}
],
"default": null
},
{
"name": "col8",
"type": [
"null",
{
"type": "boolean",
"connect.parameters": {
"__debezium.source.column.type": "BOOL",
"__debezium.source.column.length": "1",
"__debezium.source.column.scale": "0"
}
}
],
"default": null
}
],
"connect.name": "test.public.tbl1.Value"
}
But in the target PostgreSQL database the data types are completely mismatched for the ID and timestamp columns, and sometimes the decimal columns as well (that's due to this).
Target:
Table "public.tbl1"
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description
--------+------------------+-----------+----------+---------+----------+-------------+--------------+-------------
id | text | | not null | | extended | | |
name | text | | | | extended | | |
col4 | double precision | | | | plain | | |
col5 | bigint | | | | plain | | |
col6 | bigint | | | | plain | | |
col7 | text | | | | extended | | |
col8 | boolean | | | | plain | | |
Indexes:
"tbl1_pkey" PRIMARY KEY, btree (id)
Access method: heap
I'm trying to understand why, even with the schema registry, it's not creating the target tables with the proper data types.
Sink config:
{
"name": "t1-sink",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"tasks.max": "1",
"topics": "test.public.tbl1",
"connection.url": "jdbc:postgresql://172.31.85.***:5432/test",
"connection.user": "postgres",
"connection.password": "***",
"dialect.name": "PostgreSqlDatabaseDialect",
"auto.create": "true",
"insert.mode": "upsert",
"delete.enabled": "true",
"pk.fields": "id",
"pk.mode": "record_key",
"table.name.format": "tbl1",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"key.converter.schemas.enable": "false",
"internal.key.converter": "com.amazonaws.services.schemaregistry.kafkaconnect.AWSKafkaAvroConverter",
"internal.key.converter.schemas.enable": "true",
"internal.value.converter": "com.amazonaws.services.schemaregistry.kafkaconnect.AWSKafkaAvroConverter",
"internal.value.converter.schemas.enable": "true",
"value.converter": "com.amazonaws.services.schemaregistry.kafkaconnect.AWSKafkaAvroConverter",
"value.converter.schemas.enable": "true",
"value.converter.region": "us-east-1",
"key.converter.region": "us-east-1",
"key.converter.schemaAutoRegistrationEnabled": "true",
"value.converter.schemaAutoRegistrationEnabled": "true",
"key.converter.avroRecordType": "GENERIC_RECORD",
"value.converter.avroRecordType": "GENERIC_RECORD",
"key.converter.registry.name": "bhuvi-debezium",
"value.converter.registry.name": "bhuvi-debezium",
"value.converter.column.propagate.source.type": ".*",
"value.converter.datatype.propagate.source.type": ".*"
}
}
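A note on why this can happen (my reading of the registered schema above, not something stated in the original post): with auto.create enabled, the JDBC sink derives column types from the Connect schema it receives. Debezium's io.debezium.time.MicroTimestamp is carried as a plain Avro long and io.debezium.time.ZonedTimestamp as a string, so they become bigint and text, and the id column is created as text because pk.mode is record_key and the key arrives through the StringConverter. One option worth trying for the plain timestamp column col6 (a sketch, not verified against this setup) is to have Debezium emit Kafka Connect's built-in timestamp logical type, which the sink can map to a real timestamp column, by adding this to the source connector config:
{
  "time.precision.mode": "connect"
}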

Kafka jdbc sink connector to Postgres failing with "cross-database references are not implemented"

My setup is based on Docker containers: one Oracle DB, Kafka, Kafka Connect, and Postgres. I first use the Oracle CDC connector to feed Kafka, which works fine. Then I try to read that topic and feed it into Postgres.
When I start the connector I am getting:
"trace": "org.apache.kafka.connect.errors.ConnectException: Exiting WorkerSinkTask due to unrecoverable exception.\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:614)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:329)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:232)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:201)\n\tat org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:185)\n\tat org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:234)\n\tat java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)\n\tat java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\n\tat java.base/java.lang.Thread.run(Thread.java:834)\nCaused by: org.apache.kafka.connect.errors.ConnectException: java.sql.SQLException: Exception chain:\norg.postgresql.util.PSQLException: ERROR: cross-database references are not implemented: "ORCLCDB.C__MYUSER.EMP"\n Position: 14\n\n\tat io.confluent.connect.jdbc.sink.JdbcSinkTask.put(JdbcSinkTask.java:122)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:586)\n\t... 10 more\nCaused by: java.sql.SQLException: Exception chain:\norg.postgresql.util.PSQLException: ERROR: cross-database references are not implemented: "ORCLCDB.C__MYUSER.EMP"\n Position: 14\n\n\tat io.confluent.connect.jdbc.sink.JdbcSinkTask.getAllMessagesException(JdbcSinkTask.java:150)\n\tat io.confluent.connect.jdbc.sink.JdbcSinkTask.put(JdbcSinkTask.java:102)\n\t... 11 more\n"
My config JSON looks like:
{
"name": "SimplePostgresSink",
"config":{
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"name": "SimplePostgresSink",
"tasks.max":1,
"topics": "ORCLCDB.C__MYUSER.EMP",
"key.converter": "io.confluent.connect.avro.AvroConverter",
"key.converter.schema.registry.url": "http://schema-registry:8081",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url": "http://schema-registry:8081",
"confluent.topic.bootstrap.servers":"kafka:29092",
"connection.url": "jdbc:postgresql://postgres:5432/postgres",
"connection.user": "postgres",
"connection.password": "postgres",
"insert.mode": "upsert",
"pk.mode": "record_value",
"pk.fields": "I",
"auto.create": "true",
"auto.evolve": "true"
}
}
And the topic schema is:
{
"type": "record",
"name": "ConnectDefault",
"namespace": "io.confluent.connect.avro",
"fields": [
{
"name": "I",
"type": {
"type": "bytes",
"scale": 0,
"precision": 64,
"connect.version": 1,
"connect.parameters": {
"scale": "0"
},
"connect.name": "org.apache.kafka.connect.data.Decimal",
"logicalType": "decimal"
}
},
{
"name": "NAME",
"type": [
"null",
"string"
],
"default": null
},
{
"name": "table",
"type": [
"null",
"string"
],
"default": null
},
{
"name": "scn",
"type": [
"null",
"string"
],
"default": null
},
{
"name": "op_type",
"type": [
"null",
"string"
],
"default": null
},
{
"name": "op_ts",
"type": [
"null",
"string"
],
"default": null
},
{
"name": "current_ts",
"type": [
"null",
"string"
],
"default": null
},
{
"name": "row_id",
"type": [
"null",
"string"
],
"default": null
},
{
"name": "username",
"type": [
"null",
"string"
],
"default": null
}
]
}
I am interested in the columns I and NAME.
This can also be caused by the db name in your connection.url not matching the db name in your table.name.format. I recently experienced this and thought it might help others.
Looking at the log, I saw the connector had a problem creating a table with this name:
[2022-01-07 23:56:11,737] INFO JdbcDbWriter Connected (io.confluent.connect.jdbc.sink.JdbcDbWriter)
[2022-01-07 23:56:11,759] INFO Checking PostgreSql dialect for existence of TABLE "ORCLCDB"."C__MYUSER"."EMP" (io.confluent.connect.jdbc.dialect.GenericDatabaseDialect)
[2022-01-07 23:56:11,764] INFO Using PostgreSql dialect TABLE "ORCLCDB"."C__MYUSER"."EMP" absent (io.confluent.connect.jdbc.dialect.GenericDatabaseDialect)
[2022-01-07 23:56:11,764] INFO Creating table with sql: CREATE TABLE "ORCLCDB"."C__MYUSER"."EMP" (
"I" DECIMAL NOT NULL,
"NAME" TEXT NOT NULL,
PRIMARY KEY("I","NAME")) (io.confluent.connect.jdbc.sink.DbStructure)
[2022-01-07 23:56:11,765] WARN Create failed, will attempt amend if table already exists (io.confluent.connect.jdbc.sink.DbStructure)
org.postgresql.util.PSQLException: ERROR: cross-database references are not implemented: "ORCLCDB.C__MYUSER.EMP"
Extending on Steven's comment: this can also occur if your topic name contains full stops.
If you don't set table.name.format, it defaults to ${topicName}.
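In other words, one way to sidestep the dotted name (a sketch; EMP is just the illustrative table name taken from the topic above) is to point the sink at a plain table name instead of letting it default to the topic name, for example:
{
  "table.name.format": "EMP"
}
Alternatively, an SMT such as org.apache.kafka.connect.transforms.RegexRouter can rename the topic to something without dots before the sink sees it.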

selecting and filtering cloudflare pagerules with jq

I was trying to work out the best way to filter out some pagerules data from Cloudflare, and while I've got a solution I'm looking at how ugly it is and thinking "there has to be a simpler way to do this."
I'm specifically asking about a better way to achieve the following goal using jq. I understand there are programming libraries I could use to accomplish the same, but the point of this question is to get a better understanding of how jq is intended to work.
Say I've got a long list of Cloudflare pagerules records; here are a few entries as a minimal example:
{
"example.org": [
{
"id": "341",
"targets": [
{
"target": "url",
"constraint": {
"operator": "matches",
"value": "http://ng.example.org/*"
}
}
],
"actions": [
{
"id": "always_use_https"
}
],
"priority": 12,
"status": "active",
"created_on": "2017-11-29T18:07:36.000000Z",
"modified_on": "2020-09-02T16:09:03.000000Z"
},
{
"id": "406",
"targets": [
{
"target": "url",
"constraint": {
"operator": "matches",
"value": "http://nz.example.org/*"
}
}
],
"actions": [
{
"id": "always_use_https"
}
],
"priority": 9,
"status": "active",
"created_on": "2017-11-29T18:07:55.000000Z",
"modified_on": "2020-09-02T16:09:03.000000Z"
},
{
"id": "427",
"targets": [
{
"target": "url",
"constraint": {
"operator": "matches",
"value": "nz.example.org/*"
}
}
],
"actions": [
{
"id": "ssl",
"value": "flexible"
}
],
"priority": 8,
"status": "active",
"created_on": "2017-11-29T18:08:00.000000Z",
"modified_on": "2020-09-02T16:09:03.000000Z"
}
]
}
What I want to do is extract the URLs nested in the constraint.value fields for the always_use_https actions. The goal is to extract the values and return them as a JSON array. What I came up with is this:
jq '[
[
[
[
.[] | .[] | select(.actions[].id | contains("always_use_https"))
] | .[].targets[] | select(.target | contains("url"))
] | .[] | .constraint | select(.operator | contains("matches"))
] | .[].value
]'
Against our example this produces:
[
"http://ng.example.org/*",
"http://nz.example.org/*"
]
Is there a more succinct way to achieve this in jq?
This produces the expected output in accordance with the criteria as I understand them:
jq '.["example.org"]
| map(select( any(.actions[]; .id == "always_use_https"))
| .targets[]
| select(.target == "url")
| .constraint.value )
' cloudfare.json

Create table from topic and got some serialization exception about Size of data received by LongDeserializer is not 8

Environments (Docker):
# 5.5.1
image: confluentinc/cp-zookeeper:latest
# 2.13-2.6.0
image: wurstmeister/kafka:latest
# 5.5.1
image: confluentinc/cp-schema-registry:latest
# 5.5.1
image: confluentinc/cp-kafka-connect:latest
# 0.11.0
image: confluentinc/ksqldb-server:latest
The Kafka topic content comes from Kafka Connect (using Debezium).
When I run the query (select * from user emit changes),
most of the content is shown, but some of it is lost.
I looked at the logs from ksqldb-server and found some error messages:
ksqldb-server | [2020-08-29 12:44:23,008] ERROR {"type":0,"deserializationError":{"errorMessage":"Error deserializing DELIMITED message from topic: pa.new_pa.user","recordB64":null,"cause":["Size of data received by LongDeserializer is not 8"],"topic":"pa.new_pa.user"},"recordProcessingError":null,"productionError":null} (processing.CTAS_USER2_0.KsqlTopic.Source.deserializer:44)
ksqldb-server | [2020-08-29 12:44:23,008] WARN Exception caught during Deserialization, taskId: 0_0, topic: pa.new_pa.user, partition: 0, offset: 23095 (org.apache.kafka.streams.processor.internals.StreamThread:36)
ksqldb-server | org.apache.kafka.common.errors.SerializationException: Error deserializing DELIMITED message from topic: pa.new_pa.user
ksqldb-server | Caused by: org.apache.kafka.common.errors.SerializationException: Size of data received by LongDeserializer is not 8
ksqldb-server | [2020-08-29 12:44:23,009] WARN stream-thread [_confluent-ksql-default_query_CTAS_USER2_0-6637e2a8-c417-49fa-bb65-d0d1a5205af1-StreamThread-1] task [0_0] Skipping record due to deserialization error. topic=[pa.new_pa.user] partition=[0] offset=[23095] (org.apache.kafka.streams.processor.internals.RecordDeserializer:88)
ksqldb-server | org.apache.kafka.common.errors.SerializationException: Error deserializing DELIMITED message from topic: pa.new_pa.user
ksqldb-server | Caused by: org.apache.kafka.common.errors.SerializationException: Size of data received by LongDeserializer is not 8
I tried consuming the message at offset 23095, and it looks fine:
[2020-08-29 13:24:12,021] INFO [Consumer clientId=consumer-console-consumer-37294-1, groupId=console-consumer-37294] Subscribed to partition(s): pa.new_pa.user-0 (org.apache.kafka.clients.consumer.KafkaConsumer)
[2020-08-29 13:24:12,026] INFO [Consumer clientId=consumer-console-consumer-37294-1, groupId=console-consumer-37294] Seeking to offset 23095 for partition pa.new_pa.user-0 (org.apache.kafka.clients.consumer.KafkaConsumer)
[2020-08-29 13:24:12,570] INFO [Consumer clientId=consumer-console-consumer-37294-1, groupId=console-consumer-37294] Cluster ID: rdsgvpoESzer6IAxQDlLUA (org.apache.kafka.clients.Metadata)
{"id":8191,"parent_id":{"long":8184},"upper_id":0,"username":"app0623c","domain":43,"role":1,"modified_at":1598733553000,"blacklist_modified_at":{"long":1598733768000},"tied_at":{"long":1598733771000},"name":"test","enable":1,"is_default":0,"bankrupt":0,"locked":0,"tied":0,"checked":0,"failed":0,"last_login":{"long":1598733526000},"last_online":{"long":1598733532000},"last_ip":{"bytes":"ÿÿ\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"},"last_country":{"string":"MY"},"last_city_id":0}
Here are my source connector config and related statements:
CREATE SOURCE CONNECTOR `pa_source_unwrap` WITH(
"connector.class" = 'io.debezium.connector.mysql.MySqlConnector',
"tasks.max" = '1',
"database.hostname" = 'docker.for.mac.host.internal',
"database.port" = '3306',
"database.user" = 'root',
"database.password" = 'xxxxxxx',
"database.service.id" = '10001',
"database.server.name" = 'pa',
"database.whitelist" = 'new_pa',
"table.whitelist" = 'new_pa.user, new_pa.user_created, new_pa.cash',
"database.history.kafka.bootstrap.servers" = 'kafka:9092',
"database.history.kafka.topic" = 'schema-changes.pa',
"transforms" = 'unwrap',
"transforms.unwrap.type" = 'io.debezium.transforms.ExtractNewRecordState',
"transforms.unwrap.delete.handling.mode" = 'drop',
"transforms.unwrap.drop.tombstones" = 'true',
"key.converter" = 'io.confluent.connect.avro.AvroConverter',
"value.converter" = 'io.confluent.connect.avro.AvroConverter',
"key.converter.schema.registry.url" = 'http://schema-registry:8081',
"value.converter.schema.registry.url" = 'http://schema-registry:8081',
"key.converter.schemas.enable" = 'true',
"value.converter.schemas.enable" = 'true'
);
CREATE TABLE user (`id` BIGINT PRIMARY KEY) WITH (
KAFKA_TOPIC = 'pa.new_pa.user',
VALUE_FORMAT = 'AVRO'
);
Schema of topic (auto generated):
Key:
{
"connect.name": "pa.new_pa.user.Key",
"fields": [
{
"name": "id",
"type": "long"
}
],
"name": "Key",
"namespace": "pa.new_pa.user",
"type": "record"
}
Value:
{
"connect.name": "pa.new_pa.user.Value",
"fields": [
{
"name": "id",
"type": "long"
},
{
"default": null,
"name": "parent_id",
"type": [
"null",
"long"
]
},
{
"default": 0,
"name": "upper_id",
"type": {
"connect.default": 0,
"type": "long"
}
},
{
"name": "username",
"type": "string"
},
{
"name": "domain",
"type": "int"
},
{
"name": "role",
"type": {
"connect.type": "int16",
"type": "int"
}
},
{
"name": "modified_at",
"type": {
"connect.name": "io.debezium.time.Timestamp",
"connect.version": 1,
"type": "long"
}
},
{
"default": null,
"name": "blacklist_modified_at",
"type": [
"null",
{
"connect.name": "io.debezium.time.Timestamp",
"connect.version": 1,
"type": "long"
}
]
},
{
"default": null,
"name": "tied_at",
"type": [
"null",
{
"connect.name": "io.debezium.time.Timestamp",
"connect.version": 1,
"type": "long"
}
]
},
{
"name": "name",
"type": "string"
},
{
"name": "enable",
"type": {
"connect.type": "int16",
"type": "int"
}
},
{
"name": "is_default",
"type": {
"connect.type": "int16",
"type": "int"
}
},
{
"name": "bankrupt",
"type": {
"connect.type": "int16",
"type": "int"
}
},
{
"name": "locked",
"type": {
"connect.type": "int16",
"type": "int"
}
},
{
"name": "tied",
"type": {
"connect.type": "int16",
"type": "int"
}
},
{
"name": "checked",
"type": {
"connect.type": "int16",
"type": "int"
}
},
{
"name": "failed",
"type": {
"connect.type": "int16",
"type": "int"
}
},
{
"default": null,
"name": "last_login",
"type": [
"null",
{
"connect.name": "io.debezium.time.Timestamp",
"connect.version": 1,
"type": "long"
}
]
},
{
"default": null,
"name": "last_online",
"type": [
"null",
{
"connect.name": "io.debezium.time.Timestamp",
"connect.version": 1,
"type": "long"
}
]
},
{
"default": null,
"name": "last_ip",
"type": [
"null",
"bytes"
]
},
{
"default": null,
"name": "last_country",
"type": [
"null",
"string"
]
},
{
"name": "last_city_id",
"type": "long"
}
],
"name": "Value",
"namespace": "pa.new_pa.user",
"type": "record"
}
The issue here is that your keys are in Avro, and ksqlDB currently only supports KAFKA-formatted keys (as of version 0.12).
Avro keys are actively being worked on: #4461 adds support for Avro primitives and #4997 extends this to support single key columns within an Avro record (as you have here).
You're setting the Avro key format with the config:
"key.converter" = 'io.confluent.connect.avro.AvroConverter',
Your SQL:
CREATE TABLE user (`id` BIGINT PRIMARY KEY) WITH (
KAFKA_TOPIC = 'pa.new_pa.user',
VALUE_FORMAT = 'AVRO'
);
is setting the VALUE_FORMAT to AVRO, but the key format is currently KAFKA. Hence you may be able to use:
"key.converter" = 'org.apache.kafka.connect.converters.IntegerConverter',
...to convert the key into the right format. More info on the correct converters to use for the KAFKA format is on the ksqlDB microsite.
The id field type is BIGINT. I tried to modify the config:
Setting key.converter = org.apache.kafka.connect.storage.LongConverter (ref: ksqlDB microsite) gives the error that LongConverter could not be found.
Setting key.converter = org.apache.kafka.connect.converters.LongConverter (ref) gives this error:
kafka-connect | [2020-09-03 06:48:25,194] ERROR WorkerSourceTask{id=pa_source_avro2-0} Task threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask)
kafka-connect | org.apache.kafka.connect.errors.ConnectException: Tolerance exceeded in error handler
kafka-connect | at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:178)
kafka-connect | at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execute(RetryWithToleranceOperator.java:104)
kafka-connect | at org.apache.kafka.connect.runtime.WorkerSourceTask.convertTransformedRecord(WorkerSourceTask.java:294)
kafka-connect | at org.apache.kafka.connect.runtime.WorkerSourceTask.sendRecords(WorkerSourceTask.java:323)
kafka-connect | at org.apache.kafka.connect.runtime.WorkerSourceTask.execute(WorkerSourceTask.java:247)
kafka-connect | at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:184)
kafka-connect | at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:234)
kafka-connect | at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
kafka-connect | at java.util.concurrent.FutureTask.run(FutureTask.java:266)
kafka-connect | at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
kafka-connect | at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
kafka-connect | at java.lang.Thread.run(Thread.java:748)
I then used the JSON converter for keys and the Avro converter for values, and created the tables:
CREATE SOURCE CONNECTOR `pa_source_avro3` WITH (
"connector.class" = 'io.debezium.connector.mysql.MySqlConnector',
"tasks.max" = '1',
"database.hostname" = 'docker.for.mac.host.internal',
"database.port" = '3306',
"database.user" = 'root',
"database.password" = '6881703',
"database.service.id" = '10001',
"database.server.name" = 'pa3',
"database.whitelist" = 'new_pa',
"table.whitelist" = 'new_pa.user, new_pa.user_created, new_pa.cash',
"database.history.kafka.bootstrap.servers" = 'kafka:9092',
"database.history.kafka.topic" = 'schema-changes.pa3',
"transforms" = 'unwrap',
"transforms.unwrap.type" = 'io.debezium.transforms.ExtractNewRecordState',
"key.converter" = 'org.apache.kafka.connect.json.JsonConverter',
"value.converter" = 'io.confluent.connect.avro.AvroConverter',
"key.converter.schema.registry.url" = 'http://schema-registry:8081',
"value.converter.schema.registry.url" = 'http://schema-registry:8081',
"key.converter.schemas.enable" = 'false',
"value.converter.schemas.enable" = 'true',
"include.schema.changes" = 'true'
);
CREATE TABLE user_with_string_key (`row_id` STRING PRIMARY KEY) WITH (
KAFKA_TOPIC = 'pa3.new_pa.user',
VALUE_FORMAT = 'AVRO'
);
CREATE TABLE user_with_bigint_key (`row_id` BIGINT PRIMARY KEY) WITH (
KAFKA_TOPIC = 'pa3.new_pa.user',
VALUE_FORMAT = 'AVRO'
);
I can get all the data from MySQL, but the content of row_id differs between the two tables.
The reason may be the KAFKA key format.
String:
{
"row_id": "{\"id\":1}",
"ID": 1,
"PARENT_ID": null,
"UPPER_ID": 0,
"USERNAME": "bmw999",
"DOMAIN": 1,
"ROLE": 3,
"MODIFIED_AT": 1532017653000,
"BLACKLIST_MODIFIED_AT": null,
"TIED_AT": null,
"NAME": "bmw999",
"ENABLE": 1,
"IS_DEFAULT": 0,
"BANKRUPT": 1,
"LOCKED": 1,
"TIED": 0,
"CHECKED": 0,
"FAILED": 0,
"LAST_LOGIN": 1539806130000,
"LAST_ONLINE": 1508259330000,
"LAST_COUNTRY": null,
"LAST_CITY_ID": 0
}
BigInt:
{
"row_id": 8872770094665183000,
"ID": 1,
"PARENT_ID": null,
"UPPER_ID": 0,
"USERNAME": "bmw999",
"DOMAIN": 1,
"ROLE": 3,
"MODIFIED_AT": 1532017653000,
"BLACKLIST_MODIFIED_AT": null,
"TIED_AT": null,
"NAME": "bmw999",
"ENABLE": 1,
"IS_DEFAULT": 0,
"BANKRUPT": 1,
"LOCKED": 1,
"TIED": 0,
"CHECKED": 0,
"FAILED": 0,
"LAST_LOGIN": 1539806130000,
"LAST_ONLINE": 1508259330000,
"LAST_COUNTRY": null,
"LAST_CITY_ID": 0
}

How to validate my data with jsonSchema scala

I have a dataframe which looks like this:
+--------------------+----------------+------+------+
| id | migration|number|string|
+--------------------+----------------+------+------+
|[5e5db036e0403b1a. |mig | 1| str |
+--------------------+----------------+------+------+
and I have a jsonSchema:
{
"title": "Section",
"type": "object",
"additionalProperties": false,
"required": ["migration", "id"],
"properties": {
"migration": {
"type": "string",
"additionalProperties": false
},
"string": {
"type": "string"
},
"number": {
"type": "number",
"min": 0
}
}
}
I would like to validate the schema of my dataframe with my jsonSchema.
Thank you
Please find inline code comments for the explanation
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DataType, StructField, StructType}

val newSchema : StructType = DataType.fromJson("""{
| "type": "struct",
| "fields": [
| {
| "name": "id",
| "type": "string",
| "nullable": true,
| "metadata": {}
| },
| {
| "name": "migration",
| "type": "string",
| "nullable": true,
| "metadata": {}
| },
| {
| "name": "number",
| "type": "integer",
| "nullable": false,
| "metadata": {}
| },
| {
| "name": "string",
| "type": "string",
| "nullable": true,
| "metadata": {}
| }
| ]
|}""".stripMargin).asInstanceOf[StructType] // Load you schema from JSON string
// println(newSchema)
val spark = Constant.getSparkSess // Create SparkSession object
//Correct data
val correctData: RDD[Row] = spark.sparkContext.parallelize(Seq(Row("5e5db036e0403b1a.","mig",1,"str")))
val dfNew = spark.createDataFrame(correctData, newSchema) // validating the data
dfNew.show()
//InCorrect data
val inCorrectData: RDD[Row] = spark.sparkContext.parallelize(Seq(Row("5e5db036e0403b1a.",1,1,"str")))
val dfInvalid = spark.createDataFrame(inCorrectData, newSchema) // validating the data which will throw RuntimeException: java.lang.Integer is not a valid external type for schema of string
dfInvalid.show()
val res = spark.sql("") // Load the SQL dataframe
val diffColumn : Seq[StructField] = res.schema.diff(newSchema) // compare SQL dataframe with JSON schema
diffColumn.foreach(field => println(field.name)) // Print the diff columns