NPE when encrypting fields for SQL Server databases using Debezium

I am using Debezium to do CDC from SQL Server to Kafka, and per business requirements some of the columns must be encrypted.
As for the environment, I have 2 Kafka Connect instances running on K8s, with around 50 connectors in total streaming data from SQL Server to Kafka.
Here is a snippet of the connector JSON file:
{
"name": "live.sql.users",
...
"transforms.unwrap.delete.handling.mode": "drop",
"transforms": "unwrap,cipher",
"predicates.isTombstone.type": "org.apache.kafka.connect.transforms.predicates.RecordIsTombstone",
"transforms.unwrap.drop.tombstones": "false",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"transforms.cipher.predicate": "isTombstone",
"transforms.cipher.negate": "true",
"transforms.cipher.cipher_data_keys": "[ { \"identifier\": \"my-key\", \"material\": { \"primaryKeyId\": 1000000001, \"key\": [ { \"keyData\": { \"typeUrl\": \"type.googleapis.com/google.crypto.tink.AesGcmKey\", \"value\": \"GhDLeulEJRDC8/19NMUXqw2jK\", \"keyMaterialType\": \"SYMMETRIC\" }, \"status\": \"ENABLED\", \"keyId\": 2000000002, \"outputPrefixType\": \"TINK\" } ] } } ]",
"transforms.cipher.type": "com.github.hpgrahsl.kafka.connect.transforms.kryptonite.CipherField$Value",
"transforms.cipher.cipher_mode": "ENCRYPT",
"predicates": "isTombstone",
"transforms.cipher.field_config": "[{\"name\":\"Password\"},{\"name\":\"MobNumber\"}, {\"name\":\"UserName\"}]",
"transforms.cipher.cipher_data_key_identifier": "my-key"
...
}
When I applied it, after a few seconds I got the error below from the /connectors/<connector_name>/status API:
org.apache.kafka.connect.errors.ConnectException: Tolerance exceeded in error handler
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:206)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execute(RetryWithToleranceOperator.java:132)
at org.apache.kafka.connect.runtime.TransformationChain.apply(TransformationChain.java:50)
at org.apache.kafka.connect.runtime.WorkerSourceTask.sendRecords(WorkerSourceTask.java:346)
at org.apache.kafka.connect.runtime.WorkerSourceTask.execute(WorkerSourceTask.java:261)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:191)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:240)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.kafka.connect.errors.DataException: error: ENCRYPT of field path 'UserName' having data 'deleted605' failed unexpectedly
at com.github.hpgrahsl.kafka.connect.transforms.kryptonite.RecordHandler.processField(RecordHandler.java:90)
at com.github.hpgrahsl.kafka.connect.transforms.kryptonite.SchemaawareRecordHandler.lambda$matchFields$0(SchemaawareRecordHandler.java:73)
at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
at java.base/java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1085)
at com.github.hpgrahsl.kafka.connect.transforms.kryptonite.SchemaawareRecordHandler.matchFields(SchemaawareRecordHandler.java:50)
at com.github.hpgrahsl.kafka.connect.transforms.kryptonite.CipherField.processWithSchema(CipherField.java:163)
at com.github.hpgrahsl.kafka.connect.transforms.kryptonite.CipherField.apply(CipherField.java:140)
at org.apache.kafka.connect.runtime.PredicatedTransformation.apply(PredicatedTransformation.java:56)
at org.apache.kafka.connect.runtime.TransformationChain.lambda$apply$0(TransformationChain.java:50)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndRetry(RetryWithToleranceOperator.java:156)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:190)
... 11 more
Caused by: java.lang.NullPointerException
at com.esotericsoftware.kryo.util.DefaultGenerics.nextGenericTypes(DefaultGenerics.java:77)
at com.esotericsoftware.kryo.serializers.FieldSerializer.pushTypeVariables(FieldSerializer.java:144)
at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:102)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:627)
at com.github.hpgrahsl.kafka.connect.transforms.kryptonite.RecordHandler.processField(RecordHandler.java:75)
... 21 more
Note that the same configuration works with other connectors without any problems.

After further debugging and looking into the Kryo library, it turns out that the Kryo class is not thread-safe. As the Kryo documentation puts it:
Kryo is not thread safe. Each thread should have its own Kryo, Input, and Output instances.
I opened a thread on the kryptonite repo, and the main committer confirmed that it doesn't support multiple threads and that the only ways around this are a separate connector instance or pooling (the full thread). The former is not feasible for me, as I have more than 50 connectors running at the same time.
Regarding the option of pooling Kryo instances, here is a guide on how to do it, though I haven't tried it out.
Hope this helps anyone who has the same problem or runs into it in the future.
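The rule quoted from the Kryo documentation (one instance per thread) can be illustrated without Kryo itself. The following is a minimal sketch that uses ThreadLocal rather than a pool; UnsafeCodec is a hypothetical stand-in for any non-thread-safe object such as a Kryo instance, and nothing here is taken from the kryptonite code base.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PerThreadSerializer {
    // Hypothetical stand-in for a non-thread-safe object (e.g. a Kryo instance).
    static final class UnsafeCodec {
        // Mutable scratch state: sharing this across threads would corrupt results.
        private final StringBuilder scratch = new StringBuilder();

        String encode(String input) {
            scratch.setLength(0);
            scratch.append(input).reverse();
            return scratch.toString();
        }
    }

    // Each thread lazily creates its own codec, so no instance is ever shared.
    private static final ThreadLocal<UnsafeCodec> CODEC =
            ThreadLocal.withInitial(UnsafeCodec::new);

    static String encode(String input) {
        return CODEC.get().encode(input);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<String>> results = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            final String input = "value" + i;
            Callable<String> task = () -> encode(input);
            results.add(pool.submit(task));
        }
        for (Future<String> result : results) {
            System.out.println(result.get());
        }
        pool.shutdown();
    }
}
```

Whether this can be applied depends on where the shared instance lives: if it is created inside a third-party transform, the change has to be made in that library rather than in the connector configuration.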

Related

How to reprocess messages in Kafka Connect?

I am working with a Kafka sink connector which reads from a Kafka topic and puts the data into a target database (in my case a Neo4j instance). The messages need to be processed strictly sequentially since they are not idempotent. My question is: if an exception occurs for some reason, e.g. 1. the database goes down, 2. connectivity to the DB is lost, 3. schema parsing fails, how can we reprocess the message?
I understand we can run with the errors.tolerance=none configuration and redirect failed messages to a dead letter queue. But is there any way we can process a selected message again? Also, is there any audit mechanism to track how many messages have been processed, or to seek from a given offset (without a manual offset reset)?
Below is my connector configuration. Please also suggest whether there are better data integration technologies than Kafka connectors for sinking the data into a target database.
{
"topics": "mytopic",
"connector.class": "streams.kafka.connect.sink.Neo4jSinkConnector",
"tasks.max":"1",
"key.converter.schemas.enable":"true",
"values.converter.schemas.enable":"true",
"errors.retry.timeout": "-1",
"errors.retry.delay.max.ms": "1000",
"errors.tolerance": "none",
"errors.deadletterqueue.topic.name": "deadletter-topic",
"errors.deadletterqueue.topic.replication.factor":1,
"errors.deadletterqueue.context.headers.enable":true,
"key.converter":"org.apache.kafka.connect.storage.StringConverter",
"key.converter.enhanced.avro.schema.support":true,
"value.converter.enhanced.avro.schema.support":true,
"value.converter":"io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url":"https://schema-url/",
"value.converter.basic.auth.credentials.source":"USER_INFO",
"value.converter.basic.auth.user.info":"user:pass",
"errors.log.enable": true,
"schema.ignore":"false",
"errors.log.include.messages": true,
"neo4j.server.uri": "neo4j://my-ip:7687/neo4j",
"neo4j.authentication.basic.username": "neo4j",
"neo4j.authentication.basic.password": "neo4j",
"neo4j.encryption.enabled": false,
"neo4j.topic.cypher.mytopic": "MERGE (p:Loc_Con{name: event.geography.name})"
}
For non-fatal exceptions, the connector will write to a dead letter topic. Note that Kafka Connect only routes failed records there when errors.tolerance is set to all; with none, the task fails on the first error and nothing is written to the dead letter topic.
You'd need another connector or some other consumer to read that topic and process the data. Since it's a topic, there's no straightforward way to "process a selected message".
JMX metrics or Neo4j database metrics should both be able to tell you approximately how many messages have been processed over time.
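With errors.deadletterqueue.context.headers.enable set to true, as in the configuration above, Kafka Connect attaches headers such as __connect.errors.topic, __connect.errors.partition and __connect.errors.offset to each dead letter record. A consumer of the dead letter topic could use them to pick out one particular failed message. The sketch below shows only that selection step; DlqRecord is a hypothetical stand-in for a consumed record whose header values have already been decoded to strings, and no Kafka client is involved.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

class DlqLookup {
    // Context headers added by Kafka Connect to dead letter records.
    static final String ORIG_TOPIC = "__connect.errors.topic";
    static final String ORIG_PARTITION = "__connect.errors.partition";
    static final String ORIG_OFFSET = "__connect.errors.offset";

    // Hypothetical stand-in for a record read from the dead letter topic.
    static final class DlqRecord {
        final Map<String, String> headers;
        final String value;

        DlqRecord(Map<String, String> headers, String value) {
            this.headers = headers;
            this.value = value;
        }
    }

    // Helper that builds a sample record carrying the three context headers.
    static DlqRecord record(String topic, int partition, long offset, String value) {
        Map<String, String> headers = new HashMap<>();
        headers.put(ORIG_TOPIC, topic);
        headers.put(ORIG_PARTITION, Integer.toString(partition));
        headers.put(ORIG_OFFSET, Long.toString(offset));
        return new DlqRecord(headers, value);
    }

    // Finds the dead letter record that originated at topic/partition/offset.
    static Optional<DlqRecord> find(List<DlqRecord> records,
                                    String topic, int partition, long offset) {
        for (DlqRecord r : records) {
            boolean same = topic.equals(r.headers.get(ORIG_TOPIC))
                    && Integer.toString(partition).equals(r.headers.get(ORIG_PARTITION))
                    && Long.toString(offset).equals(r.headers.get(ORIG_OFFSET));
            if (same) {
                return Optional.of(r);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<DlqRecord> consumed = new ArrayList<>();
        consumed.add(record("mytopic", 0, 41L, "first failed payload"));
        consumed.add(record("mytopic", 0, 42L, "second failed payload"));
        find(consumed, "mytopic", 0, 42L)
                .ifPresent(r -> System.out.println("reprocess: " + r.value));
    }
}
```

Since the messages here are not idempotent and must stay in order, replaying a single record this way would still need care on the database side.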

FlinkKafkaConsumer/Producer & Confluent Avro schema registry: Validation failed & Compatibility mode writes invalid schema

Hello all, I'm struggling with (de-)serializing a simple Avro schema together with the schema registry.
The setup:
2 Flink jobs written in Java (one consumer, one producer)
1 Confluent schema registry for schema validation
1 Kafka cluster for messaging
The target:
The producer should send a message serialized with ConfluentRegistryAvroSerializationSchema, which includes updating and validating the schema.
The consumer should then deserialize the message into an object with the received schema, using ConfluentRegistryAvroDeserializationSchema.
So far so good:
If I configure my subject on the schema registry to be FORWARD-compatible, the producer writes the correct Avro schema to the registry, but it ends with the following error (even if I completely and permanently delete the subject first):
Failed to send data to Kafka: Schema being registered is incompatible with an earlier schema for subject "my.awesome.MyExampleEntity-value"
The schema was successfully written:
{
"subject": "my.awesome.MyExampleEntity-value",
"version": 1,
"id": 100028,
"schema": "{\"type\":\"record\",\"name\":\"input\",\"namespace\":\"my.awesome.MyExampleEntity\",\"fields\":[{\"name\":\"serialNumber\",\"type\":\"string\"},{\"name\":\"editingDate\",\"type\":\"int\",\"logicalType\":\"date\"}]}"
}
Following this, I could try to set the compatibility to NONE.
If I do so, I can produce my data to Kafka, but
the schema registry then has a new version of my schema looking like this:
{
"subject": "my.awesome.MyExampleEntity-value",
"version": 2,
"id": 100031,
"schema": "\"bytes\""
}
Now I can produce data, but the consumer is not able to deserialize it with this schema and emits the following error:
Caused by: org.apache.avro.AvroTypeException: Found bytes, expecting my.awesome.MyExampleEntity
...
I'm currently not sure where exactly the problem is.
Even if I completely and permanently delete the subject (including schemas), my producer should work fine from scratch, registering a whole "new" subject with schema.
On the other hand, if I set the compatibility to "NONE", the schema exchange should work anyway by registering a schema which can be read by the consumer.
Can anybody help me out here?
According to the latest Confluent docs, with NONE schema compatibility checks are disabled.
The whole problem with serialisation came down to using the following flags in the Kafka config:
"schema.registry.url"
"key.serializer"
"key.deserializer"
"value.serializer"
"value.deserializer"
Setting these flags in Flink, even if they are logically correct, leads to undebuggable schema validation and serialisation chaos.
I omitted all of these flags and it works fine.
The registry url needs to be set in ConfluentRegistryAvro(De)serializationSchema only.

Debezium can not capture change from MongoDB

I am using the Debezium MongoDB connector 1.4.2 on Kafka Connect 2.2, and it seems the 'collection.include.list' configuration is preventing Debezium from capturing the collection data changes. If I delete the collection.include.list config, capturing starts working, but it then applies to all collections, which I don't want.
Can anyone send me an example of how collection.include.list can be configured? I tried '<db_name>[.]<collection_name>', but I keep getting this warning and no data is captured.
[2021-04-03 07:58:21,971] WARN After applying the include/exclude list filters, no changes will be captured. Please check your configuration! (io.debezium.connector.mongodb.MongoDbSchema:96)
My config is like below:
{
"name": "pipeline-mongo-connector",
"config": {
"connector.class": "io.debezium.connector.mongodb.MongoDbConnector",
"mongodb.hosts": "xxxx_host:3717",
"mongodb.name": "pipeline_mongo",
"mongodb.user": "xxxxxxx",
"mongodb.password":"xxxxxx",
"collection.include.list": "prod-datapipeline[.]*"
}
}
Thanks!
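One thing that can be checked in isolation is the regular expression itself. The sketch below assumes that each entry of collection.include.list has to match the whole "<db_name>.<collection_name>" namespace; the collection name "users" is hypothetical. Under that assumption, "prod-datapipeline[.]*" only matches the database name followed by zero or more literal dots, whereas "prod-datapipeline[.].*" matches every collection in that database.

```java
import java.util.regex.Pattern;

class IncludeListCheck {
    // Assumption: the entire "<db>.<collection>" namespace must match the expression.
    static boolean included(String regex, String namespace) {
        return Pattern.compile(regex).matcher(namespace).matches();
    }

    public static void main(String[] args) {
        String namespace = "prod-datapipeline.users"; // hypothetical collection

        // "[.]*" means "zero or more dots", so the collection name is not covered.
        System.out.println(included("prod-datapipeline[.]*", namespace));

        // "[.].*" means "one dot, then anything", so all collections are covered.
        System.out.println(included("prod-datapipeline[.].*", namespace));
    }
}
```

If this is indeed the cause, the warning about no changes being captured would be consistent with the filter matching no collection at all.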

Exception while Deserialize avro data using ConfluentSchemaRegistry?

I am new to Flink and Kafka. I am trying to deserialize Avro data using the Confluent Schema Registry. I have already installed Flink and Kafka on an EC2 machine. Also, the "test" topic was created before running the code.
Code Path: https://gist.github.com/mandar2174/5dc13350b296abf127b92d0697c320f2
The code does the following:
1) Create a Flink DataStream object from a list of User elements (User is an Avro-generated class).
2) Write the DataStream to Kafka using AvroSerializationSchema.
3) Read the data from Kafka using ConfluentRegistryAvroDeserializationSchema, which reads the schema from the Confluent Schema Registry.
Command to run the Flink executable jar:
./bin/flink run -c com.streaming.example.ConfluentSchemaRegistryExample /opt/flink-1.7.2/kafka-flink-stream-processing-assembly-0.1.jar
Exception while running code:
java.io.IOException: Unknown data format. Magic number does not match
at org.apache.flink.formats.avro.registry.confluent.ConfluentSchemaRegistryCoder.readSchema(ConfluentSchemaRegistryCoder.java:55)
at org.apache.flink.formats.avro.RegistryAvroDeserializationSchema.deserialize(RegistryAvroDeserializationSchema.java:66)
at org.apache.flink.streaming.util.serialization.KeyedDeserializationSchemaWrapper.deserialize(KeyedDeserializationSchemaWrapper.java:44)
at org.apache.flink.streaming.connectors.kafka.internal.KafkaFetcher.runFetchLoop(KafkaFetcher.java:140)
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.run(FlinkKafkaConsumerBase.java:665)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:94)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:58)
at org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:99)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
at java.lang.Thread.run(Thread.java:748)
The Avro schema I am using for the User class is as below:
{
"type": "record",
"name": "User",
"namespace": "com.streaming.example",
"fields": [
{
"name": "name",
"type": "string"
},
{
"name": "favorite_number",
"type": [
"int",
"null"
]
},
{
"name": "favorite_color",
"type": [
"string",
"null"
]
}
]
}
Can someone point out what steps I am missing as part of deserializing avro data using confluent Kafka schema registry?
The Avro data needs to be written using the Registry as well in order for the deserializer that depends on it to work.
But adding a ConfluentRegistryAvroSerializationSchema class is still an open PR in Flink.
The workaround, I believe, would be to use AvroDeserializationSchema, which does not depend on the Registry.
If you did want to use the Registry in the producer code, then you'd have to do so outside of Flink until that PR is merged.
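To see why the reader complains about the magic number: a registry-aware deserializer expects each message to start with a 5-byte header (a zero magic byte followed by a 4-byte big-endian schema ID) before the Avro body, while plain AvroSerializationSchema writes only the Avro body. The following is a minimal sketch of that check; the class and method names are illustrative and are not Flink or Confluent APIs.

```java
import java.nio.ByteBuffer;

class ConfluentWireFormat {
    static final byte MAGIC_BYTE = 0x0;
    static final int HEADER_SIZE = 5; // 1 magic byte + 4-byte schema ID

    // Returns the schema ID from the header, or -1 if the payload is not framed.
    static int schemaIdOrMinusOne(byte[] payload) {
        if (payload == null || payload.length < HEADER_SIZE || payload[0] != MAGIC_BYTE) {
            return -1;
        }
        // ByteBuffer reads big-endian by default, matching the header layout.
        return ByteBuffer.wrap(payload, 1, 4).getInt();
    }

    // Prepends the header to an already Avro-encoded body.
    static byte[] frame(int schemaId, byte[] avroBody) {
        return ByteBuffer.allocate(HEADER_SIZE + avroBody.length)
                .put(MAGIC_BYTE)
                .putInt(schemaId)
                .put(avroBody)
                .array();
    }

    public static void main(String[] args) {
        // Plain Avro encoding of a record whose first field is the string "Alice":
        // the first byte is the zigzag-encoded length (5 -> 0x0A), not the magic byte.
        byte[] plainAvro = {0x0A, 'A', 'l', 'i', 'c', 'e'};
        System.out.println(schemaIdOrMinusOne(plainAvro));

        byte[] framed = frame(42, plainAvro); // 42 is an arbitrary example ID
        System.out.println(schemaIdOrMinusOne(framed));
    }
}
```

The check is only a heuristic: a plain Avro payload can also happen to begin with a zero byte, for example when the first field is an empty string.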

Using a custom converter with Kafka Connect?

I'm trying to use a custom converter with Kafka Connect and I cannot seem to get it right. I'm hoping someone has experience with this and could help me figure it out!
Initial situation
My custom converter's fully qualified class name is custom.CustomStringConverter.
To avoid any mistakes, my custom converter is currently just a copy/paste of the pre-existing StringConverter (of course, this will change once I get it to work).
https://github.com/apache/kafka/blob/trunk/connect/api/src/main/java/org/apache/kafka/connect/storage/StringConverter.java
I have a Kafka Connect cluster of 3 nodes. The nodes are running Confluent's official Docker images (confluentinc/cp-kafka-connect:3.3.0).
Each node is configured to load a jar with my converter in it (using a Docker volume).
What happens?
When the workers start, they correctly load the jars and find the custom converter. Indeed, this is what I see in the logs:
[2017-10-10 13:06:46,274] INFO Registered loader: PluginClassLoader{pluginLocation=file:/opt/custom-connectors/custom-converter-1.0-SNAPSHOT.jar} (org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader:199)
[2017-10-10 13:06:46,274] INFO Added plugin 'custom.CustomStringConverter' (org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader:132)
[...]
[2017-10-10 13:07:43,454] INFO Added aliases 'CustomStringConverter' and 'CustomString' to plugin 'custom.CustomStringConverter' (org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader:293)
I then POST a JSON config to one of the Connect nodes to create my connector:
{
"name": "hdfsSinkCustom",
"config": {
"topics": "yellow",
"tasks.max": "1",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"value.converter": "custom.CustomStringConverter",
"connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
"hdfs.url": "hdfs://hdfs-namenode:8020/hdfs-sink",
"topics.dir": "yellow_storage",
"flush.size": "1",
"rotate.interval.ms": "1000"
}
}
And I receive the following reply:
{
"error_code": 400,
"message": "Connector configuration is invalid and contains the following 1 error(s):\nInvalid value custom.CustomStringConverter for configuration value.converter: Class custom.CustomStringConverter could not be found.\nYou can also find the above list of errors at the endpoint `/{connectorType}/config/validate`"
}
If I try running Kafka Connect standalone, the error message is the same.
Has anybody faced this already? What am I missing?
OK, I found the solution thanks to Philip Schmitt on the Kafka Users mailing list.
He mentioned this issue: https://issues.apache.org/jira/projects/KAFKA/issues/KAFKA-6007 , which is indeed the problem I am facing.
To quote him:
To test this, I simply copied my SMT jar to the folder of the connector I was using and adjusted the plugin.path property.
Indeed, I got rid of this error by putting the converter in the connector's folder.
I also tried something else: create a custom connector and use that custom connector with the custom converter, both loaded as plugins. It also works.
Summary: converters are loaded by the connector. If your connector is a plugin, your converter should be as well. If your connector is not a plugin (i.e. it is bundled with your Kafka Connect distribution), your converter should not be either.