Schema Registry URL for IIDR CDC Kafka subscription - apache-kafka

I have created an Amazon MSK cluster. I have also created an EC2 instance and installed Kafka on it to create a topic in Amazon MSK. I am able to produce/consume messages on the topic using the Kafka scripts.
I have also installed the IIDR Replication agent on an EC2 instance. The plan is to migrate DB2 table data into the Amazon MSK topic.
In the IIDR Management Console, I am able to add the IIDR replication server as the target.
Now, when creating the subscription, it asks for a ZooKeeper URL and a Schema Registry URL. I can get the ZooKeeper endpoints from Amazon MSK.
What value should I provide for the Schema Registry URL, given that there is no schema registry created?
Thanks for your help.

If you do not need to specify a schema registry (for example, because you are using a KCOP that generates JSON), just put in a dummy value. Similarly, if you specify a list of Kafka brokers in the kafkaconsumer.properties and kafkaproducer.properties files in the CDC instance's conf directory, you can put dummy values in the ZooKeeper fields.
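For example (a sketch only; the broker hostnames below are placeholders for your MSK bootstrap brokers), the broker list goes into the standard Kafka client properties files, and the console fields can then hold anything syntactically valid:

# kafkaproducer.properties and kafkaconsumer.properties in the instance's conf directory
bootstrap.servers=b-1.mycluster.kafka.us-east-1.amazonaws.com:9092,b-2.mycluster.kafka.us-east-1.amazonaws.com:9092

In the subscription dialog you could then enter something like dummy:2181 for the ZooKeeper URL and http://dummy:8081 for the Schema Registry URL.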
Hope this helps
Robert

Related

Retrieve secrets from AWS Secrets Manager in Confluent ksqlDB

I am trying to create a MongoDB sink connector in Confluent Cloud (Kafka) with ksqlDB. The problem is that I have the data source and credentials in AWS Secrets Manager.
Is there a way to obtain the secrets with ksqlDB to set the connector properties?
Kafka Connect supports externalized configs for secrets. Whether such an implementation exists for AWS, I am not sure, but if not, you'll need to write your own ConfigProvider for it.
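As a sketch of how externalized configs look (assuming a self-managed Connect worker; the file path and key name here are made up for illustration), the worker registers a ConfigProvider and the connector config references it instead of embedding the secret:

# worker properties: enable the built-in FileConfigProvider
config.providers=file
config.providers.file.class=org.apache.kafka.common.config.provider.FileConfigProvider

# connector config: pull the MongoDB URI from a local secrets file at runtime
connection.uri=${file:/opt/secrets/mongo.properties:connection.uri}

A ConfigProvider for AWS Secrets Manager would plug in the same way, just with a different provider class and reference syntax.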
Alternatively, there may be other solutions, such as running ksqlDB (or just Connect itself) in MSK Connect, ECS, EC2, or EKS, where you write processes that expose Secrets Manager data as files or environment variables. Those can then be used by Connect's default config providers; you can then set up ksqlDB externally to point at those Connect instances, or simply process the topics they output.

Confluent Cloud Kafka - Audit Log Cluster : Sink Connector

For a Kafka cluster hosted in Confluent Cloud, there is an Audit Log cluster that gets created. It seems to be possible to hook a sink connector to this cluster and drain the events out from the "confluent-audit-log-events" topic.
However, I am running into the below error when I run the connector to do the same.
org.apache.kafka.common.errors.TopicAuthorizationException: Not authorized to access topics: [connect-offsets]
In my connect-distributed.properties file, I have the following settings:
offset.storage.topic=connect-offsets
offset.storage.replication.factor=3
offset.storage.partitions=3
What extra permissions need to be granted so that the connector can create the required topics in the cluster? The key/secret being used in the connect-distributed.properties file is a valid key/secret associated with the service account for this cluster.
Also, when I run the console consumer using the same key (as above), I am able to read the audit log events just fine.
It's confirmed that this feature (hooking up a connector to the Audit Log cluster) is not supported at the moment in Confluent Cloud. This feature may be available later this year at some point.

Is there a way to dump Amazon MSK Topic to S3 directly?

I have planned to use Amazon MSK and I want to dump consumer logs to S3, but I don't see any options. Do I need to write my own consumer, or is there a way to send Amazon MSK topic data to S3 directly?
Kafka Connect is generally the best (easiest, most scalable, portable, and resilient) way to get data between Kafka and downstream (and upstream) systems such as S3. You can learn more about Kafka Connect in the official documentation and in conference talks on the topic.
MSK Connect can run Kafka Connect workloads for your MSK on AWS.
Another option you have is to run your own Kafka Connect worker (which connects to MSK) and use the S3 sink connector (tutorial).
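As a rough sketch (the topic, bucket, and region below are placeholders), a self-managed worker would run the Confluent S3 sink connector with a config along these lines:

name=s3-sink
connector.class=io.confluent.connect.s3.S3SinkConnector
tasks.max=1
topics=my-msk-topic
s3.bucket.name=my-bucket
s3.region=us-east-1
storage.class=io.confluent.connect.s3.storage.S3Storage
format.class=io.confluent.connect.s3.format.json.JsonFormat
flush.size=1000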
There is not a direct way to do it from MSK. You can use an external consumer to do it, or preferably use Kafka Connect on an EC2 instance within the same VPC as MSK.
Either way, you need to consider high availability and data transfer costs. For HA, run consumers in different AZs. For costs, use MSK with Kafka 2.4.1 or later, which allows consumers to fetch data from the closest replica.
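As a sketch of that closest-replica (KIP-392) setup, the cluster needs a rack-aware replica selector and each consumer declares which AZ it runs in (the AZ ID below is a placeholder; MSK normally sets broker.rack for you):

# broker / MSK cluster configuration
replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector

# consumer properties
client.rack=use1-az1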

Confluent Schema Registry Master

For a cross-network Confluent Platform deployment, we have one Kafka cluster on-premise and another on AWS, with data replicated from on-prem to AWS using MirrorMaker. Both clusters are independent, with their own Schema Registry, REST Proxy, and Connect. Both clusters have different sets of producers and consumers, and selective topics are being mirrored between the clusters.
What is the best practice for deploying Schema Registry? Should we have one master (say, on-premise) and the others as non-master-eligible instances on on-prem and AWS?
We suspect Schema Registry can have issues with schema IDs when topics are replicated between clusters and we have two masters (AWS and on-prem).
Thanks!
If you use two different master registries, I find that would be difficult to manage (see mistake #2 for self-managed registries). The purpose of master.eligibility=false on a second instance/cluster is that all ID registration events have a single source of truth. As the docs say, "The Schema Registry nodes in both datacenters link to the primary Kafka cluster in DC A", so you would need to establish a valid network link between AWS and on-prem anyway.
Otherwise, with multiple masters, you will need to mirror the schemas topic if you want exactly the same subjects and schema IDs between environments. However, this is primarily meant to be used as a backup, and you would eventually run into conflicting schema IDs for any producer in the destination region pushing schemas to the other master. That is why the first diagram shows only consumers in the remote datacenter.
If you do not do this, then say you mirrored a topic from cluster A to cluster B and the consumer used registry B in its settings: it would attempt to look up the ID embedded in the message, which was registered in registry A, and that ID either would not exist in registry B or would point at the wrong schema for the topic being read.
I wrote a Kafka Connect plugin to work around that issue by registering a new ID in a remote master registry: https://github.com/cricket007/schema-registry-transfer-smt. However, you said you're using MirrorMaker, so you would need to take the logic there and apply it to the MessageHandler interface in MirrorMaker.
I've really only worked with one master (on-prem), and in AWS the registry settings have the ZooKeeper connection pointing to the on-prem cluster.
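As a sketch, the AWS-side registry config in that kind of setup looks roughly like this (hostnames are placeholders; newer Confluent versions rename these settings to leader.eligibility and kafkastore.bootstrap.servers):

# schema-registry.properties on the AWS instance
listeners=http://0.0.0.0:8081
kafkastore.connection.url=onprem-zk1:2181,onprem-zk2:2181,onprem-zk3:2181
kafkastore.topic=_schemas
master.eligibility=false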
And we don't mirror everything as the docs suggest, only specific topics. The purpose of using Replicator rather than MirrorMaker is that consumer failover is better supported; rather than simply getting data "over the wire", your clients are less dependent on where they are running as well.

Configuring Kafka connect Postgress Debezium CDC plugin

I am trying to use Kafka Connect to read changes in a Postgres DB.
I have Kafka running on my local system and I want to use the Kafka Connect API in standalone mode to read the Postgres server's DB changes.
connect-standalone.sh connect-standalone.properties dbezium.properties
I would appreciate it if someone could help me with setting up the configuration properties for the CDC Postgres Debezium connector:
https://www.confluent.io/connector/debezium-postgresql-cdc-connector/
I am following the documentation below to construct the properties:
https://debezium.io/docs/connectors/postgresql/#how-the-postgresql-connector-works
The name of the Kafka topics takes by default the form serverName.schemaName.tableName, where serverName is the logical name of the connector as specified with the database.server.name configuration property.
and here is what I have come up with for dbezium.properties:
name=cdc_demo
connector.class=io.debezium.connector.postgresql.PostgresConnector
tasks.max=1
plugin.name=wal2json
slot.name=debezium
slot.drop_on_stop=false
database.hostname=localhost
database.port=5432
database.user=postgress
database.password=postgress
database.dbname=test
# database.server.name is the serverName prefix used in topic names (test.demo.suppliers)
database.server.name=test
time.precision.mode=adaptive
database.sslmode=disable
Let's say I create a PG schema named demo and a table named suppliers.
So I need to create a topic named test.demo.suppliers for this plugin to push the data to?
Also, can someone suggest a Docker image which has the Postgres server together with a suitable logical decoding plugin such as wal2json? I am having a hard time configuring Postgres and the CDC plugin myself.
Check out the tutorial with associated Docker Compose and sample config.
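For the Docker image part of the question: the Debezium project publishes Postgres images with the logical decoding plugins (wal2json and decoderbufs) already installed, so something like the following should be enough to experiment with (image tag and credentials are just examples):

docker run -d --name postgres -p 5432:5432 \
  -e POSTGRES_USER=postgres -e POSTGRES_PASSWORD=postgres \
  debezium/postgres:11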
The topic you've come up with sounds correct, but if you have your Kafka broker configured to auto-create topics (which is the default behaviour IIRC) then it will get created for you and you don't need to pre-create it.