Kafka Sink to Data Lake Storage without Confluent

I am trying to find options for open-source Kafka writing directly to Azure Data Lake Storage Gen2. It seems I have few options, and they mostly circle around Confluent, like the ones below:
Use Confluent Cloud with Apache Kafka - requires subscribing with Confluent and paying charges (Confluent Cloud with ADLS)
Use an Azure VM with Confluent Hub and install Confluent Platform
At present I am not willing to pay for Confluent licensing, and I do not want to test with the Confluent package (more and more wrappers and hoops around it).
Is there any option to use open-source Kafka to write data directly to ADLS Gen2? If yes, how can we achieve this? Any useful information to share?

Firstly, Kafka Connect is an Apache 2.0-licensed product and an open platform consisting of plugins; Confluent Platform/Cloud is not a requirement to use it. You can download the Azure connector as a ZIP file and install it like any other plugin.
However, it is at Confluent's (or any developer's) discretion to attach a paid license agreement to their software and any support, and there might otherwise be a limited trial period during which you can use the plugin.
That being said, you do not "need" Confluent Platform, and there are no "hoops" to using it if you did, because it only adds extras on top of Apache Kafka and ZooKeeper; it is not its own thing (you can use your existing Kafka installation with the other Confluent products).
Regarding other open-source options: Stack Overflow is not the place for software recommendations or for seeking tools/libraries. That said, you could use Spark/Flink/NiFi to reimplement a pipeline similar to Kafka Connect, or you could write your own Kafka connector based on the open-source kafka-connect-storage-cloud project, which is used as the base for the S3, GCS, and Azure connectors, AFAIK.
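To make the "no Confluent required" point concrete, here is a minimal sketch of running a downloaded connector ZIP with the standalone Connect worker that ships with plain Apache Kafka. The worker settings are standard Apache Kafka Connect properties; the connector class is a placeholder, since it depends on which ADLS Gen2 plugin you download:

```
# connect-standalone.properties (standard Apache Kafka Connect worker settings)
bootstrap.servers=localhost:9092
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
offset.storage.file.filename=/tmp/connect.offsets
# Directory where you extracted the downloaded connector ZIP
plugin.path=/opt/kafka/plugins

# adls-sink.properties (connector.class comes from the plugin's own docs)
name=adls-gen2-sink
connector.class=<class name documented by the ADLS connector you install>
topics=my-topic
tasks.max=1
```

Then start the worker from the plain Apache Kafka distribution:

```
bin/connect-standalone.sh connect-standalone.properties adls-sink.properties
```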

There are the Apache Camel connectors, which include an Azure Data Lake connector for both sending and receiving data (sink and source). Check this out: https://camel.apache.org/camel-kafka-connector/latest/connectors/camel-azure-storage-datalake-kafka-sink-connector.html
This is a free solution that doesn't require any Confluent licenses or technologies.
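As an illustrative sketch only: the class name below is assumed from camel-kafka-connector's generated Camel<Name>SinkConnector naming pattern, and the option names follow its camel.sink.path.* convention mirroring the Camel URI azure-storage-datalake:accountName/fileSystemName. Verify both against the page linked above; all values are placeholders.

```
name=adls-gen2-camel-sink
# Class name assumed from Camel's naming pattern; confirm it in the linked docs
connector.class=org.apache.camel.kafkaconnector.azurestoragedatalake.CamelAzurestoragedatalakeSinkConnector
topics=my-topic
tasks.max=1
# Placeholder account/filesystem; auth options are in the linked connector page
camel.sink.path.accountName=mystorageaccount
camel.sink.path.fileSystemName=myfilesystem
```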

Related

Is the Confluent Platform 7.1 based on Kafka free and open source for production use?

I have a use case to start using Kafka and was looking for an open-source Kafka that is free for production.
Checking the Confluent 7.1 platform, it looks suitable, as it has ZooKeeper / Kafka / Schema Registry / a Kafka UI bundled together.
Before deciding to go ahead with it, I just want to check: is the Confluent Platform 7.1 free and open source? Am I required to purchase licensing or paid support?
The Confluent Community License covers several components of Confluent Platform, including ksqlDB, the Schema Registry, REST Proxy, and various Kafka Connect plugins. Confluent Control Center (what you call the Kafka UI) is only available on a trial basis, outside of which it requires an Enterprise license payment.
The majority of Confluent Platform's individual components are "source-available" and free with limitations. Many of the platform features, like RBAC, Tiered Storage, Cluster Linking, and server-side Kafka record schema validation, require payment. That is the Enterprise license, which also includes Control Center, on-call support, and several other connectors.
Apache Kafka, its clients, and ZooKeeper are Apache 2.0 licensed.
If you want a completely Apache 2.0 stack, you can replace Confluent Schema Registry with Apicurio, and replace Control Center with one of the various Kafka GUI projects that exist on GitHub, such as AKHQ or CMAK.
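As an illustration of that swap, a producer pointed at an Apicurio registry instead of the Confluent one might be configured like the sketch below. This assumes Apicurio Registry 2.x with its Avro serde module on the classpath; the class and property names are from Apicurio's serde docs as I recall them, so verify them for your version:

```
# producer.properties (assumes Apicurio Registry 2.x avro serdes on the classpath)
bootstrap.servers=localhost:9092
key.serializer=org.apache.kafka.common.serialization.StringSerializer
# Apicurio's Avro serializer in place of Confluent's KafkaAvroSerializer
value.serializer=io.apicurio.registry.serde.avro.AvroKafkaSerializer
# Apicurio-specific settings: registry URL and schema auto-registration
apicurio.registry.url=http://localhost:8080/apis/registry/v2
apicurio.registry.auto-register=true
```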

Confluent Platform vs Debezium

I'm trying to use the Debezium platform to do CDC into Kafka, but I am confused.
What is really the difference between Confluent Platform and Debezium?
Confluent (https://www.confluent.io/) is a platform which mainly integrates Apache Kafka (https://kafka.apache.org/) and its ecosystem. So, say the basic Confluent Platform has ZooKeeper, Apache Kafka, ksqlDB, and their Control Center.
Debezium is another platform, focused on database streaming.
So think of Confluent as general streaming, while Debezium provides connectors (https://debezium.io/documentation/reference/stable/connectors/index.html) that can be integrated with Confluent, as in https://www.confluent.io/hub/debezium/debezium-connector-postgresql
At the time of writing, Confluent Platform does not have any CDC connectors of its own, and you don't really need it anyway. Apache Kafka Connect, which is bundled as part of Confluent Platform, is all that's needed, and it can be downloaded directly from the Apache Kafka site instead.
Debezium is built on the Kafka Connect API and provided as a plugin.
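For example, registering the Debezium Postgres connector against a plain Kafka Connect distributed worker is one REST call, sketched below. Hostnames and credentials are placeholders, and note that topic.prefix is the Debezium 2.x name for what older versions called database.server.name:

```
curl -X POST -H "Content-Type: application/json" http://localhost:8083/connectors -d '{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres.example.com",
    "database.port": "5432",
    "database.user": "cdc_user",
    "database.password": "********",
    "database.dbname": "inventory",
    "topic.prefix": "dbserver1"
  }
}'
```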

Kafka - Confluent Hub - Exploit only part of it

I already saw a similar question on SO, but it did not clearly answer my doubts.
We have different Kafka clusters and a lot of operational habits around them. We have our own way to start/stop the clusters, lots of ops scripts that help maintain them, etc.
Now we would like to use Kafka Connect connectors for new needs, but from what I saw, Kafka Connect is extremely coupled to Confluent Hub.
It's as if I can't even use the connectors without having to install a full operational Confluent Hub.
This makes it very difficult for us to use Kafka Connect connectors. I understand that Confluent Hub might be a framework that helps run those connectors, but it's as if we can't even use a dissociated Kafka cluster (one not operated through Confluent Hub).
But maybe I am missing something...
Do you know if there is any way to properly use Kafka connectors on an already existing Kafka cluster, completely independent of Confluent Hub?
EDIT:
This is more a question about the seemingly tight coupling between Confluent Hub and Kafka Connect. All the features that come with Kafka Connect (distributed workers to handle different failover scenarios, etc.) appear unusable without Confluent Hub, hence a "need" to run the Kafka cluster exclusively via Confluent Hub, which is not an easy task when you already have an existing big Kafka cluster with lots of ops habits around it.
Kafka Connect is part of Apache Kafka. It's a pluggable framework for streaming integration between Kafka and other systems.
To use Kafka Connect, you need connectors for the specific technology with which you want to integrate - for example, an S3 sink, an Elasticsearch sink, a JDBC source or sink, and so on.
The connector API is part of Apache Kafka and available to anyone who wants to develop a connector.
Connectors are written by various people and organisations, and are made available in various different ways. How you obtain a connector depends on which connector you want, how it's licensed, and how the author has made it available for distribution. It could be that you go to GitHub, clone the repo, and build the JAR. It could be that you can download the JAR directly.
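For example (the paths here are illustrative), a fully manual install on your existing cluster touches nothing from Confluent:

```
# Drop the built or downloaded connector JAR(s) onto the worker's plugin path
mkdir -p /opt/kafka/plugins/my-connector
cp my-connector/target/*.jar /opt/kafka/plugins/my-connector/

# Point the worker at it in connect-distributed.properties:
#   plugin.path=/opt/kafka/plugins

# Start the Connect worker that ships with plain Apache Kafka
bin/connect-distributed.sh config/connect-distributed.properties
```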
All that Confluent Hub does is make lots of these connectors available in one place, easily searchable, with an optional CLI tool that will install them for you.
Do you have to use Confluent Hub? No, not at all. Might it make your life easier in locating connectors that you want to use, and make it easier to install them? Hopefully :)
Disclaimer: I work for Confluent.

What are the main differences between the HDF schema registry and the Confluent one?

I was wondering about the differences between the Kafka embedded in the HDF suite and the Confluent one, specifically the schema registry tool.
https://registry-project.readthedocs.io/en/latest/competition.html
The Hortonworks schema registry depends on a MySQL or Postgres database (supposedly this is pluggable, so you could write your own storage layer) to store its schemas, while the Confluent one stores schemas directly in Kafka. There is therefore more infrastructure to manage with the Hortonworks implementation.
The Hortonworks one supposedly has a plugin mechanism so that it will support the Confluent serialization format, but I've not seen it used in practice. It also supports pluggable schema formats, but I've not seen anything except Avro used with it.
The Hortonworks one has its own web UI and rich editor; with the Confluent one, you're limited to third-party tools or purchasing a license for Confluent Control Center.
Hortonworks aims to provide integrations with Spark, NiFi, SMM, Storm, Atlas, possibly Ranger, and other components of their HDF stack. Confluent Schema Registry support in those tools is all community-driven.
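For context, the Confluent registry stores its schemas in a Kafka topic (_schemas by default) and exposes them over a small REST API; a quick sketch, with host and subject names as placeholders:

```
# List registered subjects
curl http://localhost:8081/subjects

# Register a new schema version under a subject
curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  --data '{"schema": "{\"type\": \"string\"}"}' \
  http://localhost:8081/subjects/my-topic-value/versions
```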

Kafka ingestion with Cloudera and IBM MQ

Is it possible to capture IBM MQ data with Kafka-Cloudera?
Confluent offers connectors to capture IBM MQ data, but I'm not sure if I can do the same with Kafka-Cloudera.
Yes.
Kafka Connect is not a framework specific to Confluent or Cloudera. It is built into all Apache Kafka offerings.
Whether Confluent Platform includes a specific connector as part of its OSS offering, such that you can individually download and use that connector, is a separate issue.
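As a sketch, IBM publishes an open-source kafka-connect-mq-source connector that runs on any Kafka Connect worker, Cloudera's included. The class and option names below follow that project's README as I recall it, so verify them against the version you deploy; every value is a placeholder for your own queue manager:

```
name=mq-source
connector.class=com.ibm.eventstreams.connect.mqsource.MQSourceConnector
tasks.max=1
# Kafka topic to produce to, and the MQ details to read from (all placeholders)
topic=mq-events
mq.queue.manager=QM1
mq.connection.name.list=mq.example.com(1414)
mq.channel.name=DEV.APP.SVRCONN
mq.queue=DEV.QUEUE.1
mq.record.builder=com.ibm.eventstreams.connect.mqsource.builders.DefaultRecordBuilder
```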