How to set up Confluent Kafka Schema Registry in cluster mode - apache-kafka

Setup: We have 3 Schema Registry instances behind an AWS ELB. How should we change the schema-registry.properties file to set up Schema Registry in cluster mode?
We are calling the Schema Registry through the ELB endpoint.

The cluster of Schema Registry instances is formed by each instance contacting the same ZooKeeper cluster, so essentially you want every instance to have the same configuration. A single master is elected using the strategy described in the docs, and any follower that receives a write request simply forwards it to the master. If for some reason you only want certain instances to be master-eligible, you can set master.eligibility=false in their properties file. If you want to get fancy and set non-default advertised listeners for your instances, those have to be unique per instance (they are host:port combinations, so this should be expected).
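As a rough sketch (hostnames, ports, and the ZooKeeper addresses below are placeholders), each of the three instances could use a schema-registry.properties along these lines, identical except for host.name, while clients keep talking to the ELB endpoint:

# schema-registry.properties (same on every instance except host.name; placeholder addresses)
listeners=http://0.0.0.0:8081
# Each instance advertises its own hostname to its peers; clients go through the ELB.
host.name=schema-registry-1.internal.example.com
# All instances must point at the same ZooKeeper cluster so they join one group.
kafkastore.connection.url=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181
# All instances must share the same schemas topic.
kafkastore.topic=_schemas
# Leave this at the default (true) so any instance can be elected master.
master.eligibility=true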

Related

Same consumer group (S3 sink connector) across two different Kafka Connect clusters

I'm migrating Kafka connectors from an ECS cluster to a new cluster running on Kubernetes. I successfully migrated the Postgres source connectors by deleting them and recreating them against the same replication slots. They keep writing to the same topics in the same Kafka cluster, and the S3 connector in the old cluster continues to read from those and write records into S3. Everything works as usual.
But now, to move the AWS S3 sink connectors, I first created a non-critical S3 connector in the new cluster with the same name as the one in the old cluster. I was going to wait a few minutes before deleting the old one to avoid missing data. To my surprise, it looks like (based on the UI provided by akhq.io) the worker running that new S3 connector joins the same existing consumer group. I was fully expecting duplicated data. Based on the Confluent doc:
All Workers in the cluster use the same three internal topics to share connector configurations, offset data, and status updates. For this reason all distributed worker configurations in the same Connect cluster must have matching config.storage.topic, offset.storage.topic, and status.storage.topic properties.
So from this "same Connect cluster", I thought having the same consumer group id only works within the same connect cluster. But from my observation, it seems like you could have multiple consumers in different clusters belonging to the same consumer group?
Based on this article, __consumer_offsets is used by consumers, and unlike the other hidden "offset"-related topics, it doesn't have any cluster name designation.
Does that mean I could simply create S3 sink connectors in the new Kubernetes cluster and then delete the ones in the ECS cluster without duplicating or missing data then (as long as they have the same name -> same consumer group)? I'm not sure if this is the right pattern people usually use.
I'm not familiar with running a Kafka Connect cluster, but I understand that it is a cluster of workers that run connectors and is independent of the Kafka cluster itself.
In that case, since the connectors are using the same Kafka cluster and you are just moving them from ECS to k8s, it should work as you describe. The consumer offsets and the internal Kafka Connect offsets are stored in the Kafka cluster, so it doesn't really matter where the connectors run as long as they connect to the same Kafka cluster. They should restart from the same position, or behave as additional replicas of the same connector, regardless of where they are running.
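To illustrate the distinction (topic names, group id, and connector name below are made up), what defines a Connect cluster is the worker configuration, while a sink connector's consumer group is derived purely from the connector name, so two Connect clusters running a connector with the same name end up sharing a consumer group:

# connect-distributed.properties (sketch; names are placeholders)
bootstrap.servers=kafka-1:9092,kafka-2:9092
# Workers sharing this group.id and these three internal topics form one Connect cluster.
group.id=connect-cluster-ecs
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets
status.storage.topic=connect-status

# A sink connector named "s3-sink-events" consumes with the consumer group
# "connect-s3-sink-events" no matter which Connect cluster it is deployed to,
# because that group name depends only on the connector name, not on group.id above.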

For Kafka, what IP values need to be set in listeners & advertised.listeners?

I have created a multi-node Azure Databricks cluster inside one VNET and a multi-node Kafka HDInsight cluster inside a different VNET. I have peered these two VNETs. After peering, my two machines are able to ping each other.
I am trying to dump messages to Kafka topic from Databricks cluster using Spark Structured Streaming & I am getting socket timeout error.
Upon research, I found that in Kafka we need to set up listeners & advertised.listeners in the server.properties file.
In my scenario, what values should I put for listeners & advertised.listeners? It would be very helpful if anyone could suggest what changes I need to make in the server.properties file.
You need to create a listener for the host/IP on which your client machine (where Spark is running) can connect to your broker.
See https://rmoff.net/2018/08/02/kafka-listeners-explained/
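As a rough sketch (the IP below is a placeholder for your broker's private VNET address), the broker side of server.properties would look something like this; the advertised address must be one the Databricks VNET can actually reach over the peering:

# server.properties (sketch; addresses are placeholders)
# Bind on all interfaces (or the broker's private IP) on port 9092.
listeners=PLAINTEXT://0.0.0.0:9092
# Advertise an address that is reachable from the peered Databricks VNET,
# i.e. the broker's private IP or a hostname that resolves to it from there.
advertised.listeners=PLAINTEXT://10.1.0.5:9092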

Confluent Schema Registry Master

For a cross-network Confluent Platform, we have one Kafka cluster on-premise and another on AWS, with data replicated from on-prem to AWS using MirrorMaker. Both clusters are independent, with their own Schema Registry, REST Proxy and Connect. Both clusters have different sets of producers and consumers, and selected topics are being mirrored between clusters.
What is the best practice for deploying Schema Registry? Should we have one master (say on-premise) and the other instances, on-prem and AWS, as non-eligible masters?
We suspect Schema Registry can have issues with schema IDs when topics are replicated between clusters and we have two masters (AWS and on-prem).
Thanks!
If you use two different master registries, I find that would be difficult to manage. (See mistake #2 for self-managed registries.) The purpose of master.eligibility=false on a second instance/cluster is that all ID registration events have a single source of truth. As the docs say, "The Schema Registry nodes in both datacenters link to the primary Kafka cluster in DC A", so you would need to establish a valid network link between AWS and on-prem anyway.
Otherwise, with multiple masters, you will need to mirror the schemas topic if you want the exact same subjects and schema IDs between environments. However, that is primarily meant to be used as a backup, and you would eventually run into conflicting schema IDs as soon as any producer in the destination region pushes schemas to the other master. That is why the first diagram shows only consumers in the remote datacenter.
If you do not do this, then say you mirrored a topic from cluster A to cluster B and the consumer is configured to use registry B: it would attempt to look up an ID that was registered in registry A (the ID is embedded in the message), and that ID either would not exist in registry B or would point at the wrong schema for the topic being read.
I wrote a Kafka Connect plugin to work around that issue by registering a new ID in a remote master registry - https://github.com/cricket007/schema-registry-transfer-smt - though you said you're using MirrorMaker, so you would need to take the logic there and apply it to the MessageHandler interface in MirrorMaker.
I've really only worked with one master, on-prem; in AWS, the registry settings point their ZooKeeper connection at the on-prem cluster.
And we don't mirror everything as the docs suggest, only specific topics. The advantage of using Replicator rather than MirrorMaker is that consumer failover is better supported: rather than simply getting the data "over the wire", your clients become less dependent on where they are running as well.
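As a sketch of that single-master layout (addresses are placeholders), the AWS Schema Registry instances would point at the on-prem primary cluster and be made non-eligible for election:

# schema-registry.properties on the AWS (secondary) instances - sketch, placeholder addresses
listeners=http://0.0.0.0:8081
# Point at the primary (on-prem) cluster that backs the _schemas topic.
kafkastore.connection.url=onprem-zk1:2181,onprem-zk2:2181,onprem-zk3:2181
kafkastore.topic=_schemas
# These instances serve reads and forward writes, but can never become master.
master.eligibility=false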

Why do we need to add all zookeeper nodes in Kafka Consumer Configuration

It looks like we need to add the IP addresses of all ZooKeeper nodes in the property "zookeeper.connect" when configuring a consumer.
Now, my understanding is that the ZooKeeper cluster has a leader which is managed in a fail-safe way.
So why can't we just provide a bootstrap list of ZooKeeper nodes, like it's done in the producer configuration (with the bootstrap broker list), and have them provide metadata about the entire ZooKeeper cluster?
You can specify a subset of the nodes. The nodes in that list are only used to get an initial connection to the cluster of nodes and the client goes through the list until a connection is made. Usually the first node is up and available so the client doesn't have to go too far into the list. So you only need to add extra nodes to the list depending on how pessimistic you are.
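For example (hostnames are placeholders), an old ZooKeeper-based consumer configuration can list just a subset of the ensemble; the client only needs one reachable address from that list to establish its connection:

# consumer.properties for the old ZooKeeper-based consumer (sketch; placeholder hosts)
# Listing two of, say, five ensemble members is enough to connect;
# extra entries are just fallbacks if the earlier ones are unreachable.
zookeeper.connect=zk1.example.com:2181,zk2.example.com:2181
group.id=my-consumer-group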

Roles of ZooKeeper in SolrCloud

I am new to SolrCloud (4.x). Can anybody explain in detail the roles and responsibilities of ZooKeeper in SolrCloud?
Also, how does ZooKeeper work with regard to search/add requests to Solr?
ZooKeeper is the central repository for SolrCloud configuration. You can think of it as a distributed filesystem that can be accessed by all Solr nodes in the cluster. So if you change any config file, you only need to upload it to ZooKeeper, not to every node in the cluster.
One more important responsibility of ZooKeeper is to keep an eye on the state of all Solr nodes in the cluster. If any node goes down and a search request comes in for that node, ZooKeeper routes it to an alternative replica node.
When you update any document in SolrCloud, it is ZooKeeper that delegates your update request to the appropriate node in the cloud holding that document.
For in-depth details, you should read this:
https://cwiki.apache.org/confluence/display/ZOOKEEPER/ProjectDescription