How AWS MSK and Confluent Schema Registry and Confluent Kafka connect recommended to use together? - apache-kafka

We are planning to use AWS MSK service for Managed Kafka and Schema Registry and Kafka Connect services from Confluent together to run our connectors (Elasticsearch Sink Connector). We have planned to run Schema Registry and Connectors in EC2.
As per the Confluent team, They could not officially support Confluent Schema Registry and Kafka Connect if we use MSK for Kafka.
So, Anyone can share their experience? like
if Anybuddy has used a combination of MSK and Confluent services together in the production environment?
Is there any risk in using this kind of combination?
Is it recommended or not to use this combination?
How is Confluent community support if we will face any issue with Connectors?
Any other suggestions, comments, or alternatives?
We already have a Confluent Corporate Platform license but We want to have managed Kafka service that's why we have chosen AWS MKS as it's very cost-effective than Confluent Cloud as per our analysis?
Kindly please share your thoughts and Thanks in advance.
Thanks

Objectively answering your question this is something doable but it depends where is your major pain.
From the licensing perspective there is nothing that forces you to have a Confluent subscription just to use Kafka Connect or Schema Registry, as they are based on the Apache License 2.0 and Confluent Community License respectively.
From the technical perspective you can run both Kafka Connect and Schema Registry on EC2 and; as long they are running in the same VPC that the MSK cluster they will work flawlessly.
From the cost perspective you will have to evaluate how much it costs to have Kafka Connect and Schema Registry being managed by you and/or your team. Think not only about the install and setup phase but the manage and evolve phase as well. The software might not have any cost but the effort to operate these components can be translated into cost.
How is Confluent community support if we will face any issue with Connectors?
The Kafka community is usually very helpful whether if you ask for help in the Apache Kafka users group or the community that Confluent owns in Slack. Of course, it is all about best effort and you can't rely on them to get support. It may take several days until some good Samaritan decide to help you. Which also translates to cost: how much costs being down and/or waiting for a resolution?
I am no longer a Confluent employee and therefore I won't even try to convince you to buy from them. But you should evaluate this component of cost and check if using Confluent Cloud wouldn't provide you a more cost effective solution since it includes a managed version of Kafka, Kafka Connect, and Schema Registry. In my experience, the managed Kafka on Confluent Cloud is not that costly and the managed Schema Registry is "free", but using a managed connector can be very costly and it can be worse depending of the number of tasks that you configure in the managed connector. This is the only gotcha that you ought to watch out.

AWS MSK now supports fully managed free schema registry service that easily integrates with Kafka and other AWS services like Kinesis, Glue etc. It's much easier to get started with it.

Related

Does kafka support schema registries out of the box, or is it a confluent platform feature?

I came across the following article on how to use the schema registry available in the confluent platform.
https://docs.confluent.io/platform/current/schema-registry/schema-validation.html
According to that article, we can specify confluent.schema.registry.url in server.properties to point Kafka to the schema registry.
My question is, is it possible to point a Kafka cluster which is not a part of confluent platform deployment, to a schema registry using confluent.schema.registry.url?
Server-side schema validation is part of Confluent Server, not Apache Kafka.
I will make sure that that docs page gets updated to be more clear - thanks for raising it.

Kafka Connect with Amazon MSK

How do I use Kafka Connect adapters with Amazon MSK?
As per the AWS documentation, it supports Kafka connect but not documented about how to setup adapters and use it.
Edit Oct 2021: MSK Connect has been launched, see https://aws.amazon.com/blogs/aws/introducing-amazon-msk-connect-stream-data-to-and-from-your-apache-kafka-clusters-using-managed-connectors/
AFAIK Amazon MSK does not provide managed connectors, so you have to run them yourself. This is done by running the Kafka Connect worker process (a JVM) and then providing it one or more connector configurations to run.
From the point of view of a Kafka Connect worker it just needs a Kafka cluster to connect to; it shouldn't matter whether it's MSK or on-premises, since it's ultimately 'just' a consumer/producer underneath.
You can see more, including a live demo, here: https://rmoff.dev/bbuzz19-kafka-connect
For an example of configuring Kafka Connect to use a cloud-hosted Kafka platform (in this case, Confluent Cloud), see this article.
If you are interested in managed connectors in the Cloud, check out the connectors that are provided in Confluent Cloud.
Disclaimer: I work for Confluent :)
AWS now supports MSK Connect, a new feature of MSK service based on Kafka Connect allowing you to deploy managed Kafka connectors built for Kafka connect
Check the announcement here: https://aws.amazon.com/blogs/aws/introducing-amazon-msk-connect-stream-data-to-and-from-your-apache-kafka-clusters-using-managed-connectors/
There are two aspects to this
Kafka Connect is a framework which should be deployed separately from kafka brokers. MSK only provides kafka brokers. If you want to use Kafka Connect with MSK you would need to use EC2 instances and deploy the kafka binaries.Kafka Connect framework is bundled along with kafka
Coming to connectors if you donot have a confluent subscription or similar - I am afraid your choices get very limited. But having said you can always write your own connectors. Writing new connectors is not that difficult rather you can apply your business specific logic and be on your way quite quickly.

Using confluent cp-schema-registry, does it have to talk to the same Kafka you are using for producers/consumers?

We already have Kafka running in production. And unfortunately it's an older version, 0.10.2. I want to start using cp-schema-registry, from the community edition of Confluent Platform. That would mean installing the older 3.2.2 image of schema registry for compatibility with our old kafka.
From what I've read in the documentation, it seems that Confluent Schema Registry uses Kafka as it's backend for storing it's state. But the clients that are producing to/reading from Kafka topics talk to Schema Registry independently of Kafka.
So I am wondering if it would be easier to manage in production, running Schema Registry/Kafka/Zookeeper in one container all together, independent of our main Kafka cluster. Then I can use the latest version of everything. The other benefit is that standing up this new service component up could not cause any unexpected negative consequence to the existing Kafka cluster.
I find the documentation doesn't really explain well what the pros/cons of each deployment strategy are. Can someone offer guidance on how they have deployed schema registry in an environment with an existing Kafka? What is the main advantage of connecting schema registry to your main Kafka cluster?
Newer Kafka clients are backwards compatible with Kafka 0.10, so there's no reason you couldn't use a newer Schema Registry than 3.2
In the docs
Schema Registry that is included in Confluent Platform 3.2 and later is compatible with any Kafka broker that is included in Confluent Platform 3.0 and later
I would certainly avoid putting everything in one container... That's not how they're meant to be used and there's no reason you would need another Zookeeper server
Having a secondary Kafka cluster only to hold one topic of schemas seems unnecessary when you could store the same information on your existing cluster
the clients that are producing to/reading from Kafka topics talk to Schema Registry independently of Kafka
Clients talk to both. Only Avro schemas are sent over HTTP before your regular client code reaches the topic. No, schemas and client data do not have to be part of the same Kafka cluster
Anytime anyone deploys Schema Registry, it's being added to "an existing Kafka", just the difference is yours might have more data in it

Kafka and IIDR CDC

I am trying to build a CDC pipeline using : DB2--IBM CDC --Kafka
and I am trying to figure out the right way to setup this .
I tried below things -
1.Setup a 3 node kafka cluster on linux on prem
2.Installed IIDR CDC software on linux on prem using - setup-iidr-11.4.0.1-5085-linux-x86.bin file . The CDC instance is up and running .
The various online documentation suggest to install 'IIDR management console ' to configure the source datastore and CDC server configuration and also Kafka subscription configuration to build the pipeline .
Currently I do not have the management console installed .
Few questions on this -
1.Is there any alternative to IBM CDC management console for setting up the kafka-CDC pipeline ?
2.How can I get the IIDR management console ? and if we install it on our local windows dekstop and try to connect to CDC/Kafka which are on remote linux servers, will it work ?
3.Any other method to setup the data ingestion IIDR CDC to Kafka ?
I am fairly new to CDC/ IIDR , please help !
I own the development of the IIDR Kafka target for our CDC Replication product.
Management Console is the best way to setup the subscription initially. You can install it on a windows box.
Technically I believe you can use our scripting language called CHCCLP to setup a subscription as well. But I recommend using the GUI.
Here are links to our resources on our IIDR (CDC) Kafka Target. Search for the "Kafka" section.
"https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W8d78486eafb9_4a06_a482_7e7962f5ac59/page/IIDR%20Wiki"
An example of setting up a subscription and replicating is this video
https://ibm.box.com/s/ur8jokg6tclsx5fcav5g86a3n57mqtd5
Management console and access server can be obtained from IBM fix central.
I have installed MC/Access server on my VM and on my personal windows box to use it against my linux VMs. You will need connectivity of course.
You can definitely follow up with our Support and they'll be able to sort you out. Plus we have docs in our knowledge centre on MC starting here.... https://www.ibm.com/support/knowledgecenter/en/SSTRGZ_11.4.0/com.ibm.cdcdoc.mcadminguide.doc/concepts/overview_of_cdc.html
You'll find our Kafka target is very flexible it comes with five different formats to write data into Kafka, and you can choose to capture data in an audit format, or the Kafka compaction compatible key, null for a delete method.
Additionally you can even use the product to write several records to several different topics in several formats from a single insert operation. This is useful if some of your consumer apps want JSON and others Avro binary. Additionally you can use this to put all the data to more secure topics, and write out just some of the data to topics that more people have access to.
We even have customers who encrypt columns in flight when replicating.
Finally the product's transformations can be parallelized even if you choose to only use one producer to write out data.
Actually one more finally, we additionally provide the option to use a special consumer which produces database ACID semantics for data written into Kafka and shred across topics and partitions. It re-orders it. we call it the transactionally consistent consumer. It provides operation order, bookmarks for restarting applications, and allows parallelism in performance but ordered, exactly once, deduplicated consumption of data.
From my talk at the Kafka Summit...
https://www.confluent.io/kafka-summit-sf18/a-solution-for-leveraging-kafka-to-provide-end-to-end-acid-transactions

Why is Kafka connect light weight?

I have been working with kafka connect, Spark streaming , Nifi with kafka for streaming data.
I am aware that unlike other technologies kafka connect is not a separate application and it is a tool of kafka.
In case of distributed mode all technologies implement the parallelism by the underlying tasks or threads. What makes kafka connect to be efficient when dealing with kafka and why is it called light weight?
It's efficient and lightweight because it uses the built-in Kafka protocols and doesn't require an external system such as YARN. While it is arguably better/easier to deploy Connect in Mesos/Kubernetes/Docker, it is not required
The connect API is also maintained by the core Kafka developers rather than people that just want a simple integration into another tool. For example, last time I checked, NiFi cannot access the Kafka message timestamps. And dealing with the Avro Schema Registry seems to be an after thought in the other tools as compared to using Confluent Certified Connectors