How to stream data from Snowflake to Kafka

I am trying to build a data pipeline from Snowflake to Kafka.
Preferably, I want to use AWS MSK.
I could find multiple docs for streaming data into Snowflake from Kafka, but I am looking for the other way around.

Related

Dump Kafka to GCS

Can someone help me with the best possible way to dump data from a Kafka topic to Google Cloud Storage?
I would like to build a near-real-time pipeline that is capable of creating multiple files based on time or size cuts.
The data in the Kafka topic is in JSON format.
Confluent offers a GCS Kafka Connect sink connector, but you could also try using Google Dataflow / Apache Beam to do the same.
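For the Dataflow / Beam route, a minimal sketch in Java could look like the following. It reads the JSON values from Kafka, cuts the output into five-minute windows, and writes one file per window to GCS; the broker address, topic name, bucket path, and window size are placeholder assumptions you would adapt.
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Values;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.joda.time.Duration;

public class KafkaToGcs {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(KafkaIO.<String, String>read()
            .withBootstrapServers("broker:9092")            // assumption: your Kafka brokers
            .withTopic("json-topic")                        // assumption: your topic name
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withoutMetadata())
        .apply(Values.<String>create())                     // keep only the JSON value
        .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(5)))) // time-based cut
        .apply(TextIO.write()
            .to("gs://my-bucket/kafka-dump/part")           // assumption: your GCS prefix
            .withWindowedWrites()
            .withNumShards(1));                             // one file per window

    p.run().waitUntilFinish();
  }
}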

Kafka use case to replace Control-M jobs

I have a few Control-M jobs running in production:
First job - load CSV file records and insert them into a database staging table.
Second job - perform data enrichment for each record in the staging table.
Third job - read data from the staging table and insert it into another table.
Currently we use Apache Camel to do this.
We have bought a Confluent Kafka license, so we want to use Kafka.
Design proposal
Create a CSV Kafka source connector to read data from the CSV file and write it to a Kafka input topic.
Create a Spring Cloud Stream Kafka binder application to read data from the input topic, enrich the data, and push it to an output topic.
Use a Kafka sink connector to push data from the output topic to the database.
The problem: in step two we need a database connection to enrich the data, and a YouTube video I watched said a Spring Cloud Stream Kafka binder application should not have a database connection. So how should I design my flow? Which Spring technology should I use?
There's nothing preventing you from having a database connection, but if you read the database table into a Kafka stream/table, then you can join and enrich the data using Kafka Streams joins rather than remote database calls.
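As an illustration of that approach, here is a minimal Kafka Streams sketch (plain Kafka Streams rather than the Spring Cloud Stream binder, but the topology is the same). It assumes the staging records and the reference data are keyed by the same id and that the reference table is itself streamed into a topic, for example via a JDBC or CDC source connector; the topic names and join logic are placeholders.
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class EnrichmentTopology {
  public static void main(String[] args) {
    StreamsBuilder builder = new StreamsBuilder();

    // records written by the CSV source connector (assumption: keyed by the id used for enrichment)
    KStream<String, String> input = builder.stream("input-topic");

    // reference data streamed from the database into its own topic
    KTable<String, String> reference = builder.table("reference-topic");

    // enrich each record with a local join instead of a remote database call
    KStream<String, String> enriched =
        input.join(reference, (record, ref) -> record + "," + ref);

    enriched.to("output-topic");

    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "enrichment-app");
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    new KafkaStreams(builder.build(), props).start();
  }
}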

Kafka to BigQuery, best way to consume messages

I need to get messages into my BigQuery tables, and I want to know the best way to consume those messages.
My Kafka servers, which are on AWS, produce Avro messages, and from what I saw, Dataflow needs to receive JSON-format messages. So I googled and found an article explaining how to push messages to Pub/Sub, but in every example of that architecture I have seen, a Kafka VM is created on GCP to produce the messages.
What I need to know is:
Is it possible to receive Avro messages on Pub/Sub from external Kafka servers, deserialize them using my schema, send them through Dataflow, and finally write them to BigQuery tables?
Or do I need to create a Kafka VM and use it to consume messages from the external servers?
This might seem a bit confusing, but it is where I am right now. The main goal here is to get messages from Kafka (Avro format) on AWS into BigQuery tables. Any suggestions are very welcome.
Thanks a lot in advance
The Kafka Connect BigQuery Connector may be exactly what you need. It is a Kafka sink connector that allows you to export messages from Kafka directly to BigQuery. The README provides detailed configuration instructions, including how to point the connector at your Kafka topics and how to enter the information for the destination BigQuery table. This connector should be able to retrieve the Avro schema automatically from your Kafka setup.
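A minimal sketch of such a sink configuration might look like the following; the project, dataset, key file, topic, and Schema Registry URL are placeholders, and the exact property names can vary between connector versions.
name=bigquery-sink
connector.class=com.wepay.kafka.connect.bigquery.BigQuerySinkConnector
tasks.max=1
topics=my_avro_topic
project=my-gcp-project
defaultDataset=my_dataset
keyfile=/path/to/gcp-service-account.json
autoCreateTables=true
# Avro values are deserialized with the schema held in Schema Registry
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081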

How to use Kafka Connect to output to a dynamic directory in GCS?

I am fetching JSON data from a Kafka topic. I need to dump this data onto GCS (Google Cloud Storage) into a directory whose name is taken from the value of "ID" in the JSON data.
I googled and did not find any similar use case where Kafka Connect is used to interpret the JSON data and create directories dynamically based on a value from it.
Can this be achieved using Kafka Connect?
You can use the Kafka Connect GCS sink connector, which is provided by Confluent.
The Google Cloud Storage (GCS) connector, currently available as a sink, allows you to export data from Kafka topics to GCS objects in various formats. In addition, for certain data layouts, GCS connector exports data by guaranteeing exactly-once delivery semantics to consumers of the GCS objects it produces.
Here's an example configuration for the connector:
name=gcs-sink
connector.class=io.confluent.connect.gcs.GcsSinkConnector
tasks.max=1
topics=gcs_topic
gcs.bucket.name=#bucket-name
gcs.part.size=5242880
flush.size=3
gcs.credentials.path=#/path/to/credentials/keys.json
storage.class=io.confluent.connect.gcs.storage.GcsStorage
format.class=io.confluent.connect.gcs.format.avro.AvroFormat
partitioner.class=io.confluent.connect.storage.partitioner.DefaultPartitioner
schema.compatibility=BACKWARD
confluent.topic.bootstrap.servers=localhost:9092
confluent.topic.replication.factor=1
# Uncomment and insert license for production use
# confluent.license=
You can find more details on installation and configuration in Confluent's documentation for the connector.
This isn't really possible out-of-the-box using most connectors. Instead, you can implement your own Kafka Connect sink task that processes Kafka records and then writes them to the correct GCS directories based on your JSON.
The method you'd override in your connector's SinkTask is put(Collection<SinkRecord>). The source code of the Confluent AWS S3 sink connector is a useful reference for how the official storage connectors do this.
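A bare-bones sketch of such a task, assuming the records were deserialized with the JsonConverter (so each value arrives as a Map) and leaving the actual GCS client calls out, could look like this; the class and object names are illustrative only.
import java.util.Collection;
import java.util.Map;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

public class DynamicGcsSinkTask extends SinkTask {

  @Override
  public void start(Map<String, String> props) {
    // initialize the GCS client from the connector configuration (omitted)
  }

  @Override
  public void put(Collection<SinkRecord> records) {
    for (SinkRecord record : records) {
      // assumption: JsonConverter with schemas disabled delivers the value as a Map
      Map<?, ?> value = (Map<?, ?>) record.value();
      String id = String.valueOf(value.get("ID"));

      // build the object name from the "ID" field so each id gets its own directory
      String objectName = id + "/" + record.topic() + "-"
          + record.kafkaPartition() + "-" + record.kafkaOffset();

      // write the record value to gs://<bucket>/<objectName> with the GCS client (omitted)
    }
  }

  @Override
  public void stop() {
    // release the GCS client (omitted)
  }

  @Override
  public String version() {
    return "0.1";
  }
}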

Confluent Kafka Connect: Run multiple sink connectors in a synchronous way

We are using the Kafka Connect S3 sink connector, which connects to Kafka and loads data into S3 buckets. Now I want to load data from the S3 buckets into AWS Redshift using the COPY command, and for that I'm creating my own custom connector. The use case is that I want to load the data created on S3 into Redshift synchronously, and then the next time around the S3 connector should replace the existing file and our custom connector should load the new data from S3 again.
How can I do this using Confluent Kafka Connect, or is there a better approach for the same task?
Thanks in advance!
If you want to get data into Redshift, you should probably just use the JDBC sink connector and download the Redshift JDBC driver into the kafka-connect-jdbc directory.
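A minimal sketch of such a configuration, with a placeholder Redshift endpoint, database, and credentials, might look like this:
name=redshift-jdbc-sink
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
tasks.max=1
topics=my_topic
# assumption: your Redshift cluster endpoint, database, and credentials
connection.url=jdbc:redshift://my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev
connection.user=awsuser
connection.password=********
insert.mode=insert
auto.create=true
pk.mode=none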
Otherwise, rather than writing a connector, you could use an S3 event notification to trigger a Lambda function that performs the Redshift upload.
Alternatively, if you are simply looking to query the S3 data, you could use Athena instead, without dealing with any databases.
But basically, sink connectors don't communicate with one another. They are independent tasks that are designed to consume from a topic and write to a destination, not necessarily to trigger external, downstream systems.
If you want to achieve synchronous behaviour from Kafka to Redshift, then the S3 sink connector is not the right option.
With the S3 sink connector you first put the data into S3 and then externally run the COPY command to push it into Redshift (the COPY command is extra overhead).
No custom code or validation can happen before the data is pushed to Redshift.
A Redshift sink via the JDBC connector uses the native JDBC library, which is comparably fast to the S3 COPY command.