Stream both schema and data changes from MySQL to MySQL using Kafka Connect

How can we stream schema and data changes, along with some kind of transformation, into another MySQL instance using a Kafka Connect source connector?
Also, is there a way to propagate schema changes if I use Kafka's Python library (confluent_kafka) to consume and transform messages before loading them into the target DB?

You can use Debezium to stream MySQL binlogs into Kafka. Debezium is built on the Kafka Connect framework.
From there, you can use whatever client you want, including Python, to consume and transform the data.
If you want to write to MySQL, you can use the Kafka Connect JDBC sink connector.
Here is an older post on this topic: https://debezium.io/blog/2017/09/25/streaming-to-another-database/
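For the Python side of the question, here is a minimal consumer sketch using confluent_kafka, assuming a local broker, a Debezium topic named dbserver1.inventory.customers, and JSON-serialized change events with the Connect schema envelope enabled; all of these names and settings are illustrative, not fixed values.

import json
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "mysql-transformer",  # hypothetical consumer group
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["dbserver1.inventory.customers"])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            raise RuntimeError(msg.error())
        if msg.value() is None:
            # Debezium emits a tombstone (null value) after a row is deleted
            continue
        event = json.loads(msg.value())
        # with JsonConverter and schemas enabled, the change event sits under "payload"
        row = event["payload"]["after"]  # row state after the change; None for deletes
        # ... apply your transformation here, then write to the target MySQL ...
        print(row)
finally:
    consumer.close()

As for propagating schema changes: Debezium also records DDL statements on a separate schema-change topic, so a consumer like this would have to read and apply those events itself if the target tables must evolve in lockstep.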

Related

Kafka Connect or Kafka Streams?

I have a requirement to read messages from a topic, enrich them based on provided configuration (the data required for enrichment is sourced from external systems), and publish the enriched messages to an output topic. Messages on both the source and output topics should be in Avro format.
Is this a good use case for a custom Kafka Connector or should I use Kafka Streams?
Why I am considering Kafka Connect:
Lightweight in terms of code and deployment
Configuration driven
Connection and error handling
Scalability
I like the plugin-based approach in Connect. If there is a new type of message that needs to be handled, I just deploy a new connector without having to deploy a full-scale Java app.
Why I am not sure this is a good candidate for Kafka Connect:
Calls to external system
Can Kafka be both source and sink for a connector?
Can we use Avro schemas in connectors?
Performance under load
Cannot do stateful processing (currently there is no requirement)
I have experience with Kafka Streams but not with Connect
Use both?
Use Kafka Connect to source external database into a topic.
Use Kafka Streams to build that topic into a stream/table that can then be manipulated.
Use Kafka Connect to sink back into a database, or some system other than Kafka, as necessary.
Kafka Streams can also be config-driven and use plugins (i.e. reflection); it is just as scalable, and it connects to Kafka in the same way. Performance should be similar. Error handling is really the only complex part. ksqlDB is entirely "config driven" via SQL statements, and can connect to external Connect clusters or embed its own.
Avro works for both, yes.
Some connectors are temporarily stateful because they build in-memory batches, such as the S3 or JDBC sink connectors.
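As a concrete illustration of the "use both" route, here is a minimal ksqlDB sketch of the enrichment as a stream-table join, assuming the enrichment data has already been sourced into a Kafka topic by Connect; all topic, column, and stream names are illustrative assumptions, and the syntax is that of recent ksqlDB releases.

CREATE TABLE enrichment_data (id VARCHAR PRIMARY KEY, region VARCHAR)
  WITH (KAFKA_TOPIC='enrichment_topic', VALUE_FORMAT='AVRO');

CREATE STREAM input_events (id VARCHAR KEY, amount DOUBLE)
  WITH (KAFKA_TOPIC='input_topic', VALUE_FORMAT='AVRO');

CREATE STREAM enriched_events WITH (KAFKA_TOPIC='output_topic', VALUE_FORMAT='AVRO') AS
  SELECT i.id, i.amount, e.region
  FROM input_events i
  LEFT JOIN enrichment_data e ON i.id = e.id;

Note that this only covers lookups against data that is already in Kafka; calls out to external systems at processing time would still point you towards Streams (or a custom connector).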

Sending Avro messages to Kafka

I have an app that periodically produces an array of messages in raw JSON. I was able to convert that to Avro using avro-tools. I did that because I needed the messages to include a schema, due to the limitations of the Kafka Connect JDBC sink. I can open this file in Notepad++ and see that it includes the schema and a few lines of data.
Now I would like to send this to my central Kafka broker and then use the Kafka Connect JDBC sink to put the data in a database. I am having a hard time understanding how I should send these Avro files to my Kafka broker. Do I need a Schema Registry for my purposes? I believe kafkacat does not support Avro, so I suppose I will have to stick with the kafka-console-producer.sh that comes with the Kafka installation (please correct me if I am wrong).
The question is: can someone please share the steps to produce my Avro file to a Kafka broker without getting Confluent involved?
Thanks,
To use the Kafka Connect JDBC Sink, your data needs an explicit schema. The converter that you specify in your connector configuration determines where the schema is held. It can either be embedded within the JSON message (org.apache.kafka.connect.json.JsonConverter with schemas.enable=true) or held in the Schema Registry (one of io.confluent.connect.avro.AvroConverter, io.confluent.connect.protobuf.ProtobufConverter, or io.confluent.connect.json.JsonSchemaConverter).
To learn more about this see https://www.confluent.io/blog/kafka-connect-deep-dive-converters-serialization-explained
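For the embedded-schema route, the relevant sink-side settings would look something like this minimal sketch (the property names are the standard Kafka Connect ones; whether you set them on the worker or per connector is up to you):

value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable=true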
To write an Avro message to Kafka you should serialise it as Avro and store the schema in the Schema Registry. There is a Go client library you can use, with examples.
"without getting Confluent involved"
It's not entirely clear what you mean by this. The Kafka Connect JDBC Sink is written by Confluent. The best way to manage schemas is with the Schema Registry. If you don't want to use the Schema Registry, you can embed the schema in your JSON message, but it's a suboptimal way of doing things.
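To make the embedded-schema option concrete, here is a minimal sketch that produces such a message with the confluent_kafka Python client; the topic name and field layout are illustrative assumptions.

import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

message = {
    # Connect-style schema describing the payload below (hypothetical fields)
    "schema": {
        "type": "struct",
        "optional": False,
        "fields": [
            {"field": "id", "type": "int32", "optional": False},
            {"field": "name", "type": "string", "optional": True},
        ],
    },
    # the actual row values
    "payload": {"id": 1, "name": "example"},
}

producer.produce("my_topic", value=json.dumps(message).encode("utf-8"))
producer.flush()

With value.converter set to JsonConverter and schemas.enable=true as shown earlier, the JDBC sink can derive the table structure from each message, and no Schema Registry is required.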

Generating timestamp-based documentIds in Kafka Connect

I am sending data from Kafka to Couchbase using the Couchbase Kafka sink connector (https://github.com/apache/kafka & https://github.com/couchbase/kafka-connect-couchbase).
I am using CB v5.1.0 and Kafka 2.12.
I have not enabled any kind of documentId generation in Kafka Connect (in the file quickstart-couchbase-sink.properties), so the connector is using the whole document as the key. I want to generate the key as topic-partition-offset-randomString-timestamp.
How can this be achieved? I found something here (https://docs.confluent.io/current/connect/kafka-connect-elasticsearch/configuration_options.html), but I don't see the key.ignore option anywhere in the Kafka or kafka-connect-couchbase code.

How to fetch Kafka source connector schema based on connector name

I am using the Confluent JDBC Kafka connector to publish messages into a topic. The source connector sends data to the topic along with the schema on each poll. I want to retrieve this schema.
Is this possible? How? Can anyone advise me?
My intention is to create a KSQL stream or table based on the schema built by the Kafka connector on each poll.
The best way to do this is to use Avro, in which the schema is stored separately and automatically used by Kafka Connect and KSQL.
You can use Avro by configuring Kafka Connect to use the AvroConverter. In your Kafka Connect worker configuration, set:
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://schema-registry:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://schema-registry:8081
(Update schema-registry to the hostname of where your Schema Registry is running.)
From there, in KSQL you just use:
CREATE STREAM my_stream WITH (KAFKA_TOPIC='source_topic', VALUE_FORMAT='AVRO');
You don't need to specify the schema itself here, because KSQL fetches it from the Schema Registry.
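And to retrieve the schema itself once the stream is registered, you can ask KSQL to show what it fetched from the Schema Registry:

DESCRIBE my_stream;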
You can read more about converters and serialisers here: https://www.confluent.io/blog/kafka-connect-deep-dive-converters-serialization-explained
Disclaimer: I work for Confluent, and wrote the referenced blog post.

How do I read a table in PostgreSQL using Flink

I want to do some analytics on the data in PostgreSQL using Flink. How and where should I provide the address, port, username, and password? I was trying with the table source as mentioned in this link: https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/table/common.html#register-tables-in-the-catalog
final static ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
final static TableSource csvSource = new CsvTableSource("localhost", port);
I am unable to get started, actually. I went through all the documentation but could not find a detailed explanation of this.
The tables and catalog referred to in the link you've shared are part of Flink's SQL support, wherein you can use SQL to express computations (queries) to be performed on data ingested into Flink. This is not about connecting Flink to a database; rather, it's about having Flink behave somewhat like a database.
To the best of my knowledge, there is no Postgres source connector for Flink. There is a JDBC table sink, but it only supports append mode (via INSERTs).
The CsvTableSource is for reading data from CSV files, which can then be processed by Flink.
If you want to operate on your data in batches, one approach you could take would be to export the data from Postgres to CSV, and then use a CSVTableSource to load it into Flink. On the other hand, if you wish to establish a streaming connection, you could connect Postgres to Kafka and then use one of Flink's Kafka connectors.
Reading a Postgres instance directly isn't supported as far as I know. However, you can get real-time streaming of Postgres changes by using a Kafka server and a Debezium instance that replicates from Postgres to Kafka.
Debezium connects using the native Postgres replication mechanism on the DB side and emits all record inserts, updates, or deletes as messages on the Kafka side. You can then use the Kafka topic(s) as your input in Flink.
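If you take the Debezium route, the Postgres source side is just connector configuration; here is a minimal sketch in the style of Debezium releases from around this time (hostname, credentials, database, and the logical server name are all illustrative assumptions):

name=postgres-source
connector.class=io.debezium.connector.postgresql.PostgresConnector
database.hostname=localhost
database.port=5432
database.user=flink_reader
database.password=secret
database.dbname=mydb
database.server.name=pgserver1

Change events for each table then land on topics prefixed with the logical server name (e.g. pgserver1.public.my_table), which Flink's Kafka consumer can subscribe to.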