Confluent Kafka Connect: Run multiple sink connectors in a synchronous way - apache-kafka

We are using the Kafka Connect S3 sink connector, which consumes from Kafka and loads data into S3 buckets. Now I want to load that data from the S3 buckets into AWS Redshift using the COPY command, and for that I'm creating my own custom connector. The use case is that the data written to S3 should be loaded into Redshift synchronously; the next time around, the S3 connector should replace the existing file and our custom connector should again load the data into Redshift.
How can I do this using Confluent Kafka Connect, or is there a better approach for the same task?
Thanks in advance!

If you want data to Redshift, you should probably just use the JDBC Sink Connector and download the Redshift JDBC Driver into the kafka-connect-jdbc directory.
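As a rough illustration of the JDBC sink route, the sketch below builds the JSON body one could send to the Kafka Connect REST API (`POST /connectors`). The property names are standard JDBC sink settings; the connector name, host, database, topic, and credentials are placeholders, and how well the connector's SQL dialect handling works against Redshift is an assumption that has not been verified here.

```python
import json

# All values are placeholders (hypothetical cluster host, database, topic, credentials).
jdbc_sink_config = {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "tasks.max": "1",
    "topics": "my_topic",
    "connection.url": "jdbc:redshift://example-cluster.example.us-east-1.redshift.amazonaws.com:5439/dev",
    "connection.user": "connect_user",
    "connection.password": "change-me",
    "insert.mode": "insert",
}


def connector_request(name, config):
    """Build the JSON body accepted by POST /connectors on the Connect REST API."""
    return json.dumps({"name": name, "config": config}, indent=2)


print(connector_request("redshift-jdbc-sink", jdbc_sink_config))
```

The Redshift JDBC driver JAR still has to be on the connector's classpath, as described above, before this configuration can load.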
Otherwise, rather than writing a connector, you could use an S3 event notification to trigger a Lambda function that performs some type of Redshift upload.
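A minimal sketch of the Lambda route, assuming the function is subscribed to the bucket's object-created notifications. It only turns each notification record into a Redshift COPY statement; the table name, IAM role ARN, and the `copy_statements` helper are hypothetical, and actually running the statements against Redshift is left out.

```python
from urllib.parse import unquote_plus

TARGET_TABLE = "public.events"  # hypothetical target table
IAM_ROLE_ARN = "arn:aws:iam::123456789012:role/example-redshift-copy"  # hypothetical role


def copy_statements(event):
    """Build one COPY statement per object listed in an S3 event notification."""
    statements = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        statements.append(
            f"COPY {TARGET_TABLE} FROM 's3://{bucket}/{key}' "
            f"IAM_ROLE '{IAM_ROLE_ARN}' FORMAT AS JSON 'auto';"
        )
    return statements


def handler(event, context):
    """Lambda entry point: returns the statements instead of executing them."""
    return {"statements": copy_statements(event)}
```

Because the S3 sink and the Lambda function are decoupled, the load into Redshift happens shortly after each file lands rather than in lockstep with the connector.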
Alternatively, if you are simply looking to query the S3 data, you could use Athena instead, without dealing with any database.
But basically, Sink Connectors don't communicate between one another. They are independent tasks that are designed to initially consume from a topic and write to a destination, not necessarily trigger external, downstream systems.

If you want to achieve synchronous behaviour from Kafka to Redshift, then the S3 sink connector is not the right option.
If you use the S3 sink connector, you first put the data into S3 and then run the COPY command externally to push it to Redshift. (The COPY command is extra overhead.)
No custom code or validation can run before the data is pushed to Redshift.
The Redshift sink connector comes with a native JDBC library, which is comparably fast to the S3 COPY command.

Related

Dump Kafka to GCS

Can someone help me with the best possible way to dump data from a Kafka topic to Google Cloud Storage?
I would like to build a near-real-time pipeline that is capable of creating multiple files based on time or size cuts.
The data in the Kafka topic is in JSON format.
Confluent offers a GCS Kafka Connect sink, but you could also try using Google Dataflow / Apache Beam to do the same.
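For the time or size cuts, a sketch of a GCS sink configuration might look like the following. It assumes the Confluent GCS sink with its time-based partitioner; the topic, bucket, and thresholds are placeholders, and credentials and licensing settings are omitted. Note that `flush.size` counts records rather than bytes, so it only approximates a size-based cut.

```python
import json

MINUTE_MS = 60 * 1000

# Placeholder topic, bucket, and thresholds; credentials and license settings omitted.
gcs_sink_config = {
    "connector.class": "io.confluent.connect.gcs.GcsSinkConnector",
    "tasks.max": "1",
    "topics": "gcs_topic",
    "gcs.bucket.name": "example-bucket",
    "storage.class": "io.confluent.connect.gcs.storage.GcsStorage",
    # Write JSON objects and read schemaless JSON from the topic.
    "format.class": "io.confluent.connect.gcs.format.json.JsonFormat",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": "false",
    # Close a file after this many records ...
    "flush.size": "10000",
    # ... or once the records in it span this much time, whichever comes first.
    "rotate.interval.ms": str(10 * MINUTE_MS),
    # Lay objects out in hourly directories based on each record's timestamp.
    "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
    "partition.duration.ms": str(60 * MINUTE_MS),
    "path.format": "'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH",
    "locale": "en-US",
    "timezone": "UTC",
    "timestamp.extractor": "Record",
}

print(json.dumps(gcs_sink_config, indent=2))
```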

JSON message is converted when I apply JsonConverter at the sink connector

I have a message in Kafka as JSON, like
{"name":"abc"}
When I apply the JSON converter to the FileStream sink connector, I get messages as
{name=abc}
which is not valid JSON. I tried the simple string connector, but it made no difference.
Can someone please help me with this?
I want the message written to the file as it is.
FileStreamSink always writes Connect Struct toString output, and is not meant to be used in production use cases.
It does not support a setting such as format.class=JSONFormat, unlike the S3 or HDFS sinks.
As a workaround, you could run Minio as an S3 replacement, or you could use a different sink connector altogether, depending on what you actually want to do with that data. For example, you could use the Mongo or JDBC sinks; those databases offer their own export tooling and can search and analyze your data faster than flat files.

How to stream data from Snowflake to Kafka

I am trying to build a data pipeline from Snowflake to Kafka.
Preferably, I want to use AWS MSK.
I could find multiple docs for streaming data into Snowflake from Kafka, but I am looking for the other way around.

How to use Kafka connect to output to dynamic directory in GCS?

I am fetching JSON data from a Kafka topic. I need to dump this data onto GCS (Google Cloud Storage) into a directory, wherein the directory name will be fetched from the value of "ID" in the JSON data.
I googled and did not find any similar use case wherein Kafka Connect can be used to interpret the JSON data and create directories dynamically based on the value from the JSON data.
Can this be achieved using Kafka Connect?
You can use the Kafka Connect GCS sink connector, which is provided by Confluent.
The Google Cloud Storage (GCS) connector, currently available as a
sink, allows you to export data from Kafka topics to GCS objects in
various formats. In addition, for certain data layouts, GCS connector
exports data by guaranteeing exactly-once delivery semantics to
consumers of the GCS objects it produces.
Here's an example configuration for the connector:
name=gcs-sink
connector.class=io.confluent.connect.gcs.GcsSinkConnector
tasks.max=1
topics=gcs_topic
gcs.bucket.name=#bucket-name
gcs.part.size=5242880
flush.size=3
gcs.credentials.path=#/path/to/credentials/keys.json
storage.class=io.confluent.connect.gcs.storage.GcsStorage
format.class=io.confluent.connect.gcs.format.avro.AvroFormat
partitioner.class=io.confluent.connect.storage.partitioner.DefaultPartitioner
schema.compatibility=BACKWARD
confluent.topic.bootstrap.servers=localhost:9092
confluent.topic.replication.factor=1
# Uncomment and insert license for production use
# confluent.license=
You can find more details for installation and configuration in the link I've provided above.
This isn't really possible out-of-the-box using most connectors. Instead, you can implement your own Kafka Connect sink task that processes Kafka records and then writes them to the correct GCS directories based on your JSON.
Here's the method you'd override in the connector.
Here's a link to the source code for the AWS S3 sink connector.
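The routing such a custom sink task would need can be sketched independently of Kafka Connect. The helpers below (`object_path` and `group_by_directory`, both hypothetical names) parse each JSON value, take the "ID" field as the directory, and group records so that each group could be written under its own prefix. The object naming scheme is an assumption, and a real implementation would also need to sanitize IDs that contain characters such as "/".

```python
import json
from collections import defaultdict


def object_path(topic, partition, offset, raw_value, field="ID"):
    """Return an object name whose leading directory is the record's ID value."""
    value = json.loads(raw_value)
    directory = str(value[field])
    return f"{directory}/{topic}+{partition}+{offset:010d}.json"


def group_by_directory(records, field="ID"):
    """Group (topic, partition, offset, raw_value) tuples by target directory."""
    groups = defaultdict(list)
    for topic, partition, offset, raw_value in records:
        path = object_path(topic, partition, offset, raw_value, field)
        directory = path.split("/", 1)[0]
        groups[directory].append(raw_value)
    return dict(groups)


records = [
    ("gcs_topic", 0, 0, '{"ID": "a1", "v": 1}'),
    ("gcs_topic", 0, 1, '{"ID": "b2", "v": 2}'),
    ("gcs_topic", 0, 2, '{"ID": "a1", "v": 3}'),
]
for directory, values in group_by_directory(records).items():
    print(directory, len(values))
```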

How do I read a Table In Postgresql Using Flink

I want to do some analytics using Flink on data in PostgreSQL. How and where should I give the port, username, and password? I was trying with the table source as mentioned in this link: https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/table/common.html#register-tables-in-the-catalog.
final static ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
final static TableSource csvSource = new CsvTableSource("localhost", port);
I am actually unable to get started. I went through all the documentation but could not find a detailed description of this.
The tables and catalog referred to in the link you've shared are part of Flink's SQL support, wherein you can use SQL to express computations (queries) to be performed on data ingested into Flink. This is not about connecting Flink to a database, but rather it's about having Flink behave somewhat like a database.
To the best of my knowledge, there is no Postgres source connector for Flink. There is a JDBC table sink, but it only supports append mode (via INSERTs).
The CsvTableSource is for reading data from CSV files, which can then be processed by Flink.
If you want to operate on your data in batches, one approach you could take would be to export the data from Postgres to CSV, and then use a CsvTableSource to load it into Flink. On the other hand, if you wish to establish a streaming connection, you could connect Postgres to Kafka and then use one of Flink's Kafka connectors.
Reading a Postgres instance directly isn't supported as far as I know. However, you can get realtime streaming of Postgres changes by using a Kafka server and a Debezium instance that replicates from Postgres to Kafka.
Debezium connects using the native Postgres replication mechanism on the DB side and emits all record inserts, updates or deletes as a message on the Kafka side. You can then use the Kafka topic(s) as your input in Flink.
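In this setup, the Postgres host, port, username, and password go into the Debezium connector configuration rather than into Flink. A sketch of such a configuration follows, with placeholder connection values and a hypothetical `change_topic` helper. The property names assume a recent Debezium release; older releases used `database.server.name` where `topic.prefix` appears here.

```python
import json

# Placeholder connection details for the Postgres instance.
debezium_postgres_config = {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "plugin.name": "pgoutput",
    "database.hostname": "localhost",
    "database.port": "5432",
    "database.user": "flink_reader",
    "database.password": "change-me",
    "database.dbname": "analytics",
    "topic.prefix": "pg",
    "table.include.list": "public.orders",
}


def change_topic(config, schema, table):
    """Topic that receives a table's change events: <prefix>.<schema>.<table>."""
    return f"{config['topic.prefix']}.{schema}.{table}"


print(json.dumps(debezium_postgres_config, indent=2))
print(change_topic(debezium_postgres_config, "public", "orders"))
```

The topic name returned by `change_topic` is what a Flink Kafka source would then subscribe to.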