We have a data replication solution using Kafka Connect. Data is read from SQL Server using the Debezium SqlServerConnector into multiple topics and then written to PostgreSQL using the JdbcSinkConnector. Each topic/table has a dedicated sink connector.
Now we are getting an error on one of the sinks: org.postgresql.util.PSQLException: ERROR: invalid byte sequence for encoding "UTF8": 0x00
One of the replicated records has some non-printable characters in a text column. The message is placed in the Kafka topic successfully, but the sink connector then fails when writing it into the target table.
Target tables are created automatically; we are using "auto.create": "true".
Source field (SQL Server): NVARCHAR(400)
Target field (PostgreSQL): TEXT
Now we need to fix it. I see two options:
(preferred) find a way to write data as is, including non-printable characters
remove non-printable characters, either in the source connector or in the sink.
I was looking for some Kafka Connect transformation that we could use, but no luck so far.
Any suggestions?
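For what it's worth, PostgreSQL rejects the 0x00 byte in TEXT/VARCHAR values outright, so option 1 is probably a dead end on the Postgres side, and as far as I know none of the built-in Kafka Connect SMTs rewrites string field values. One route is a small custom SMT on the sink; below is a hedged sketch of how it would be wired in, where com.example.connect.StripNullChars is a hypothetical Transformation class you would implement yourself (replacing \u0000 in string fields) and place on the Connect plugin path:

transforms=stripNulls
# hypothetical custom SMT, not shipped with Kafka Connect
transforms.stripNulls.type=com.example.connect.StripNullChars

Doing this on the sink side keeps the topic data intact, so any future target that can store the character is unaffected.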
Related
I have created a very simple ksql stream on a Kafka topic.
CREATE TABLE users_original (EVENT STRUCT<HEADER STRING , BODY STRUCT<gender VARCHAR, region VARCHAR>>) WITH
(kafka_topic='users', value_format='JSON');
The stream is created and it is working fine. The data that is coming in is as below:
{"event":{"header": "v1","body":{"gender":"M", "region":"London"}}}
From this stream I have created another stream in which I am processing some data.
Now from the source there is a special character (Â) coming in the region value, due to which the ksql stream stops and throws an error:
{"event":{"header": "v1","body":{"gender":"M", "region":"Âmerica"}}}
The error is:
Caused by: com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 start byte
And now the source stream is not running and I am unable to process the data.
Can someone suggest how this can be solved?
One approach that I have tried: instead of reading the data in JSON format, I read it in string format, then created one more stream in which I remove the special character using a REGEX, and then read the event again in JSON format (see the sketch below).
This works fine, but too many streams get created, which will impact performance.
Any other option/approach to handle this issue would be appreciated.
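For reference, a rough sketch of the string-then-regex workaround described above, assuming a ksqlDB version that has the KAFKA value format and REGEXP_REPLACE; the stream and topic names other than users are made up, and the character class [^ -~] drops everything outside printable ASCII, which may be more aggressive than you want:

CREATE STREAM users_raw (payload VARCHAR)
  WITH (kafka_topic='users', value_format='KAFKA');

-- strip everything outside printable ASCII and write the cleaned payload to a new topic
CREATE STREAM users_cleaned
  WITH (kafka_topic='users_cleaned', value_format='KAFKA') AS
  SELECT REGEXP_REPLACE(payload, '[^ -~]', '') AS payload
  FROM users_raw;

-- re-read the cleaned topic as JSON with the original schema
CREATE STREAM users_json (EVENT STRUCT<HEADER STRING, BODY STRUCT<gender VARCHAR, region VARCHAR>>)
  WITH (kafka_topic='users_cleaned', value_format='JSON');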
Hello, I am currently setting up the Kafka sink connector with a database associated with it. However, my Avro file contains the field count as an integer and timestamp as a timestamp, but in PostgreSQL the column names count and timestamp are reserved words, and I cannot change the Avro file for count and timestamp because it would break my application.
I don't get any data sunk to the database; instead I get a bunch of warnings from the worker log, for example:
WARN The configuration 'config.providers' was supplied but isn't a known config (org.apache.kafka.clients.consumer.ConsumerConfig:355)
I know that those can be ignored, but usually the warnings do not appear in the logs if I have set things up correctly. Am I missing any fields needed to get the data sunk to the DB when the topic name (topic-name) and the DB schema name (topic_name) are different? I read that there is a transform with a regex, but that works on the topic name by dropping part of it, and I still need the prefix: for topic-name, for example, I cannot just keep the word name, I need topic-name. Is there a way to change the topic to topic_name so it matches the DB schema, or will dropping still leave topic_name in the topic? I also read that changeTopicCase works, but I don't see documentation on it in the Confluent Kafka sink documentation. I hope this makes sense; if clarification is needed I can provide it.
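For what it's worth, a hedged sketch of the sink-side settings that usually come up here, assuming the Confluent JDBC sink connector; the topic and table names are placeholders for yours:

connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
topics=topic-name
# quote identifiers in the generated SQL so reserved column names like count and timestamp are accepted
quote.sql.identifiers=always
# write to an explicitly named table instead of deriving the name from the topic
table.name.format=topic_name

An alternative to table.name.format is the org.apache.kafka.connect.transforms.RegexRouter SMT, which rewrites the topic name itself (e.g. regex=topic-name, replacement=topic_name) before the sink derives the table name from it.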
I'm facing an issue with a JDBC Source Connector in (strictly) incrementing mode. In the dataset the connector is pulling from, there are consecutive ids starting from id=1.
On the initial deployment of the connector, it ingests only about 1/3-1/2 of the expected records in the dataset. Note that I am sure that in the "initial deployment of the connector" a brand new consumer group is created. The ids of the records getting skipped over are seemingly random. I'm determining the "expected records" by running the source SQL directly against the db containing the dataset. I've encountered the skipping in datasets where ids go up to 10k, in datasets where ids go over 130k, and a few in between. What's weirder is that when I force re-ingestion of the data (i.e. same exact data/dataset) via sending a message to the consumer_offsets topic and re-deploying the connector, I do not encounter the skipping; all expected records make it to the topic.
How can that be?
Other notes:
- There is no ordering or CTE in my SQL
I want to use Confluent's JDBC source connector to retrieve data from a SQL Server table into Kafka.
I want to use the incrementing mode to start retrieving data from the table only from the moment the connector starts running:
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
mode=incrementing
incrementing.column.name=id_column_name
When I run this connector, it starts retrieving all the rows from the table, not the ones that are going to be inserted after that point in time. I've been checking the connector configuration properties but I can't seem to find a configuration element for this situation.
The table doesn't contain any Timestamp values, so I can't use the timestamp.initial and timestamp.column.name properties. It does include a Datetime column, but I think this is not useful in this case.
How can I do this?
Any help would be greatly appreciated.
You can try to use query-based ingest or manually seed the offsets topic with the appropriate value.
Source: Kafka Connect Deep Dive – JDBC Source Connector by Robin Moffatt
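For the offset-seeding route, the general idea is to produce a record to the Kafka Connect offsets topic before the connector first starts, with the incrementing offset set to the table's current maximum id. The layout below is only a hedged example: the connector name, table, and id are placeholders, and the exact key format can vary by version, so it is safest to copy it from an existing record in your own offsets topic:

key:   ["my-jdbc-source",{"protocol":"1","table":"dbo.my_table"}]
value: {"incrementing":123456}

The connector then treats id 123456 as already ingested and only pulls rows with a higher id.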
I'm fairly new to NiFi and Kafka and I have been struggling with this problem for a few days. I have a NiFi data flow that ends with JSON records being published to a Kafka topic using the PublishKafkaRecord_2_0 processor configured with a JSONRecordSetWriter service as the writer. Everything seems to work great: messages are published to Kafka, and the records in the flow file after publishing look like well-formed JSON. However, when consuming the messages on the command line I see that they are prepended with a single letter. Trying to read the messages with ConsumeKafkaRecord_2_0 configured with a JSONTreeReader of course produces the error seen here.
As I've tried different things the letter has changed: it started with an "h", then "f" (when configuring a JSONRecordSetWriter farther upstream and before being published to Kafka), and currently a "y".
I can't figure out where it is coming from. I suspect it is caused by the JSONRecordSetWriter, but I'm not sure. My configuration for the writer is here and nothing looks unusual to me.
I've tried debugging by creating different flows. I thought the issue might be with my Avro schema and tried replacing that. I'm out of things to try, does anyone have any ideas?
Since you have the "Schema Write Strategy" set to "Confluent Schema Reference", this tells the writer to write the Confluent schema id reference at the beginning of the message content, so what you are seeing is most likely the bytes of that reference.
If you are using the Confluent Schema Registry, then this is correct behavior, and those values need to be there for the consuming side to determine what schema to use.
If you are not using the Confluent Schema Registry when consuming these messages, just choose one of the other Schema Write Strategies.