confluent_kafka, how to specify cluster for producer config - apache-kafka

I have two environments in a dev Confluent Cloud account, each with a single cluster. I can see that both clusters have the same bootstrap server, and this is documented:
In the Confluent Cloud Console, you may see the same bootstrap server
for different clusters. This is working as designed; it occurs because
Confluent Cloud clusters are multi-tenant.
My problem is that when I attempt to produce to a topic, the producer appears to be connected to the wrong cluster. I get:
cimpl.KafkaException: KafkaError{code=_UNKNOWN_TOPIC,val=-188,str="Unable to produce message: Local: Unknown topic"}
And producer.list_topics() shows the topics from the other cluster, not the one I'm working on.
So how do I specify the exact cluster, which will have the right topics? I expected to be able to provide cluster.id in my configuration, but that returns:
KafkaError{code=_INVALID_ARG,val=-186,str="No such configuration property: "cluster.id""}
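No answer is recorded for this question, so the following is only a sketch of one thing to check. It rests on an assumption the source does not confirm: that on a shared bootstrap endpoint the cluster is selected by the API key and secret the client authenticates with, not by a client-side cluster ID property. The configuration keys are standard librdkafka properties; the helper names and all values are placeholders.

```python
def build_producer_config(bootstrap_servers, api_key, api_secret):
    """Return a confluent_kafka producer config for a SASL/PLAIN cluster.

    Assumption: the API key/secret pair was created for the cluster you
    want, and that is what selects the cluster on a shared endpoint.
    """
    return {
        "bootstrap.servers": bootstrap_servers,
        "security.protocol": "SASL_SSL",
        "sasl.mechanisms": "PLAIN",
        "sasl.username": api_key,     # placeholder: cluster-scoped API key
        "sasl.password": api_secret,  # placeholder: matching API secret
    }


def visible_topics(config, timeout=10.0):
    """List the topic names the authenticated client can see."""
    # Imported lazily so the config helper works without the client installed.
    from confluent_kafka import Producer

    producer = Producer(config)
    return sorted(producer.list_topics(timeout=timeout).topics)
```

If visible_topics() returns the other cluster's topics, the credentials in the config are the first thing worth comparing against the cluster you intended.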

Related

Exposing Kafka brokers using Google click to deploy environment

I have used the Kafka cluster (with replication) click to deploy container from Google on Kubernetes. How do I expose the brokers so I can consume from an external consumer? I'm very new to Kubernetes.
I have tried exposing the broker nodes with a load balancer but the external ip given
I opened the port with a firewall rule
But when connecting from my consumer it throws an error about being disconnected
Any help would be great, I can provide more info if asked.
You cannot use a load balancer. Kafka clients must talk directly to the brokers.
The firewall rule is a good step, but you need to ensure each of the brokers is exposed properly via advertised.listeners, both within and outside of the VPC. Refer to the blog post.
Alternatively, within GKE, you can run the Strimzi Operator, which handles the Kubernetes resources for the Kafka cluster for you.
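To make the advertised.listeners point concrete, here is a minimal sketch that renders per-broker listener settings with one internal and one external listener. The property keys are real broker settings; the listener names INTERNAL and EXTERNAL, the ports, the hostnames, and the PLAINTEXT protocol are assumptions chosen for illustration only.

```python
def broker_listener_properties(internal_host, external_host,
                               internal_port=9092, external_port=9094):
    """Render listener settings for one broker as .properties lines.

    INTERNAL/EXTERNAL are arbitrary listener names chosen for this sketch.
    """
    return "\n".join([
        # Sockets the broker binds to, one per listener name.
        f"listeners=INTERNAL://0.0.0.0:{internal_port},"
        f"EXTERNAL://0.0.0.0:{external_port}",
        # Addresses handed back to clients in metadata responses.
        f"advertised.listeners=INTERNAL://{internal_host}:{internal_port},"
        f"EXTERNAL://{external_host}:{external_port}",
        # Placeholder protocol; a real deployment would likely use TLS/SASL.
        "listener.security.protocol.map=INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT",
        "inter.broker.listener.name=INTERNAL",
    ])
```

Each broker needs its own external address, because a client connects to whichever broker the metadata response names for a given partition.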

Same consumer group (s3 sink connector) across two different kafka connect cluster

I'm migrating Kafka connectors from an ECS cluster to a new cluster running on Kubernetes. I successfully migrated the Postgres source connectors over by deleting them and recreating them on the exact replication slots. They keep writing to the same topics in the same Kafka cluster. And the S3 connector in the old cluster continues to read from those and write records into S3. Everything works as usual.
But now, to move the AWS S3 sink connectors, I first created a non-critical S3 connector in the new cluster with the same name as the one in the old cluster. I was going to wait a few minutes before deleting the old one to avoid missing data. To my surprise, it looks like (based on the UI provided by akhq.io) the one worker for that new S3 connector joined the same existing consumer group. I was fully expecting to have duplicated data. Based on the Confluent doc,
All Workers in the cluster use the same three internal topics to share
connector configurations, offset data, and status updates. For this
reason all distributed worker configurations in the same Connect
cluster must have matching config.storage.topic, offset.storage.topic,
and status.storage.topic properties.
So from this "same Connect cluster", I thought having the same consumer group id only works within the same connect cluster. But from my observation, it seems like you could have multiple consumers in different clusters belonging to the same consumer group?
Based on this article __consumer_offsets is used by consumers, and unlike other hidden "offset" related topics, it doesn't have any cluster name designation.
Does that mean I could simply create S3 sink connectors in the new Kubernetes cluster and then delete the ones in the ECS cluster without duplicating or missing data then (as long as they have the same name -> same consumer group)? I'm not sure if this is the right pattern people usually use.
I'm not familiar with using a Kafka Connect Cluster but I understand that it is a cluster of connectors that is independent of the Kafka cluster.
In that case, since the connectors are using the same Kafka cluster and you are just moving them from ECS to k8s, it should work as you describe. The consumer offsets and the internal Kafka Connect offsets are stored in the Kafka cluster, so it doesn't really matter where the connectors run as long as they connect to the same Kafka cluster. They should restart from the same position or behave as additional replicas of the same connector regardless of where they are running.
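The observed behaviour follows from how a sink connector's consumer group is named: by default it is "connect-" plus the connector name, with no Connect cluster identifier in it. A small sketch of that reasoning follows; the function names are hypothetical, and comparing bootstrap server strings is a simplification standing in for "same Kafka cluster".

```python
def sink_connector_group_id(connector_name):
    """Default consumer group ID used by a Kafka Connect sink connector."""
    return f"connect-{connector_name}"


def share_consumer_group(old, new):
    """True if two (bootstrap_servers, connector_name) pairs resolve to the
    same consumer group on the same Kafka cluster.

    Simplification: clusters are compared by their bootstrap server sets.
    """
    old_servers, old_name = old
    new_servers, new_name = new
    same_kafka = set(old_servers.split(",")) == set(new_servers.split(","))
    same_group = (sink_connector_group_id(old_name)
                  == sink_connector_group_id(new_name))
    return same_kafka and same_group
```

For example, share_consumer_group(("b1:9092", "s3-sink"), ("b1:9092", "s3-sink")) is true regardless of whether the two workers run on ECS or Kubernetes.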

Using Kafka Connect in distributed mode, where are internal topics supposed to exist

As a follow up to my previous question here Attempting to run Kafka Connect in distributed mode locally, problem with internal topics, I have started to figure out what might really be going on (I'm learning Kafka as I go).
Kafka Connect, one way or another, requires three internal topics: config, offset, and status. Are these topics supposed to exist in the Kafka cluster where I am consuming data from? For context, someone else has a Kafka cluster set up with topics (messages?) for me to consume. I spin up a Kafka Connect cluster on my local machine (to test), and this local instance (we'll call it that going forward) then connects to the remote Kafka cluster (we'll call it the remote cluster) using the bootstrap servers, some callback handler classes, and a connect.jaas file that I supply.
Do these three topics need to already exist on the remote cluster? Here I have been trying to create them on my own broker on my local instance, but through continued research, I'm seeing maybe these three internal topics need to be on the remote cluster (where I'm getting my data from). Does the owner of the remote Kafka cluster need to create these three topics for me? Where would they create them exactly? What if their cluster is not a Kafka Connect cluster specifically?
The topics need to be created on the cluster defined by bootstrap.servers in the Connect worker properties. This can be local or remote, depending on what data you actually want the connector tasks to send/receive. Individual Connect tasks cannot override which brokers are being used (it is not possible to use a source connector to write to multiple Kafka clusters, for example).
Recent versions of Kafka Connect will automatically create those internal topics if the worker is authorized to do so. Otherwise, yes, they'll need to be created using kafka-topics --create with appropriate partition counts and replication factors.
If your data exists in a remote Kafka cluster, the only reason to run a local instance is if you want to use MirrorMaker, for example.
What if their cluster is not a Kafka Connect cluster specifically?
Unclear what this means. Kafka Connect is a client just like a Kafka Streams app or normal producer or consumer. It doesn't store topics itself.
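For the manual route mentioned above, this sketch builds one kafka-topics --create command per internal topic. The topic names and the replication factor are hypothetical; the partition counts mirror Connect's usual defaults (the config topic must have exactly one partition), and the topics are created log-compacted.

```python
import shlex

# Topic names are hypothetical; they must match config.storage.topic,
# offset.storage.topic and status.storage.topic in the worker properties.
INTERNAL_TOPICS = {
    "connect-configs": 1,   # config topic: exactly one partition
    "connect-offsets": 25,
    "connect-status": 5,
}


def create_topic_commands(bootstrap_server, replication_factor=3):
    """Build one `kafka-topics --create` command per Connect internal topic."""
    commands = []
    for topic, partitions in INTERNAL_TOPICS.items():
        args = [
            "kafka-topics", "--create",
            "--bootstrap-server", bootstrap_server,
            "--topic", topic,
            "--partitions", str(partitions),
            "--replication-factor", str(replication_factor),
            # Internal topics are expected to be log-compacted.
            "--config", "cleanup.policy=compact",
        ]
        commands.append(shlex.join(args))
    return commands
```

The bootstrap server passed in should be the same one the worker uses. A secured cluster would also need client credentials passed to the tool, which this sketch leaves out.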

How to expand confluent cloud kafka cluster?

I have set up a confluent cloud multizone cluster and it got created with just one bootstrap server. There was no setting for choosing number of servers while creating the cluster. Even after creation, I can’t edit the number of bootstrap servers.
I want to know how to increase the number of servers in confluent cloud kafka cluster.
Under the hood, the Confluent Cloud cluster is already running multiple brokers. Depending on your cluster configuration (specifically, whether you're running Standard or Dedicated, and what region and cloud you're in), the cluster will have between six and several dozen brokers.
The way a Kafka client's bootstrap server config works is that the client reaches out to the bootstrap server and requests a list of all brokers, and then uses those broker endpoints to actually produce to and consume from Kafka (reference: https://jaceklaskowski.gitbooks.io/apache-kafka/content/kafka-properties-bootstrap-servers.html).
In Confluent Cloud, the provided bootstrap server is actually a load balancer in front of all of the brokers; when the client connects to the bootstrap server it'll receive the actual endpoints for all of the actual brokers, and then use that for subsequent connections.
So TL;DR, in your client, you only need to specify the one bootstrap server; under the hood, the Kafka client will connect to the (many) brokers running in Confluent Cloud, and it should all just work.
Source: I work at Confluent.
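The bootstrap step described above can be observed from a client. This is a minimal sketch using confluent_kafka's AdminClient.list_topics(), whose metadata result carries a brokers mapping; the helper names are hypothetical and the config passed in is assumed to already contain working connection settings.

```python
def broker_endpoints(cluster_metadata):
    """Return the sorted host:port endpoints advertised in cluster metadata.

    Expects an object shaped like confluent_kafka's ClusterMetadata, whose
    `brokers` attribute maps broker id -> object with `host` and `port`.
    """
    return sorted(f"{b.host}:{b.port}"
                  for b in cluster_metadata.brokers.values())


def discover_brokers(config, timeout=10.0):
    """Connect via the bootstrap server and list the brokers it reports."""
    # Imported lazily so broker_endpoints can be used without the client.
    from confluent_kafka.admin import AdminClient

    metadata = AdminClient(config).list_topics(timeout=timeout)
    return broker_endpoints(metadata)
```

Calling discover_brokers() with a single bootstrap server in the config should return more than one endpoint if the cluster really runs multiple brokers.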

Confluent Schema Registry Master

For a cross-network Confluent Platform deployment, we have one Kafka cluster on-premise and another on AWS, with data replicated from on-prem to AWS using MirrorMaker. Both clusters are independent, with their own schema-registry, rest proxy and connect. Both clusters have different sets of producers and consumers, and selected topics are mirrored between clusters.
What is the best practice for deploying schema-registry? Should we have one master (say on-premise) and the others as non-eligible masters on on-prem and AWS?
We suspect schema-registry can have issues with respect to schema ids when topics are replicated between clusters and we have 2 masters (aws and onprem).
Thanks!
If you use two different master registries, I think that would be difficult to manage. (See mistake #2 for self-managed registries.) The purpose of master.eligibility=false on a second instance/cluster is that all ID registration events have a single source of truth. As the docs say, "The Schema Registry nodes in both datacenters link to the primary Kafka cluster in DC A", so you would need to establish a valid network link between AWS and on-prem anyway.
Otherwise, with multiple masters, you will need to mirror the schemas topic if you want exact same subjects and schema ids between environments. However, this is primarily meant to be used as a backup, and you would eventually run into conflicting schema IDs for any producer in the destination region pushing schemas to the other master. Hence why the first diagram shows only consumers in the remote datacenter.
If you do not do this, then let's say you mirrored a topic from cluster A to cluster B and the consumer used registry B in its settings: it would attempt to look up, in registry B, an ID issued by registry A (which is embedded in the message), and that ID either would not exist or would refer to the wrong schema for the topic being read.
I wrote a Kafka Connect plugin to work around that issue by registering a new ID in a remote master registry - https://github.com/cricket007/schema-registry-transfer-smt , though you said you're using MirrorMaker, so you would need to take the logic there and apply it to the MessageHandler interface in MirrorMaker.
I've really only worked with one master, on-prem, and in AWS, the registry settings have Zookeeper connection pointing to the on-prem cluster settings.
And we don't mirror everything as the docs suggest, only specific topics. The purpose of using Replicator rather than MirrorMaker is that consumer failover is better supported; rather than simply getting data "over the wire", your clients are also less dependent on where they are running.
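The ID conflict described above comes from the schema ID being embedded in every message. A minimal sketch of reading it, assuming the standard Confluent serializer framing (one magic byte of 0, then a 4-byte big-endian schema ID, then the payload); the function name is hypothetical.

```python
import struct

MAGIC_BYTE = 0  # first byte of the Confluent serializer framing


def embedded_schema_id(message_value):
    """Extract the schema ID from a Schema Registry framed message value.

    Layout: 1 magic byte (0), 4-byte big-endian schema ID, then the payload.
    """
    if len(message_value) < 5:
        raise ValueError("value too short to carry a schema ID")
    magic, schema_id = struct.unpack(">bI", message_value[:5])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte: {magic}")
    return schema_id
```

Because mirroring copies the bytes unchanged, a consumer in the destination cluster resolves this same ID against whichever registry it is configured with, which is why both registries need to agree on IDs.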