Use Kafka for warehousing without Zookeeper - apache-kafka

I want to ingest data from a data warehouse into Kafka and then store the Avro records in a MySQL RDBMS. I want to eliminate the Zookeeper dependency. Is it possible to do this without using Zookeeper?

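You are looking for Kafka KRaft mode. It was not considered production ready when it was first released, but it has been marked production ready for new clusters since Kafka 3.3.
The bin/test-kraft-server-start.sh script will start the broker in this mode...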
Docs - https://github.com/apache/kafka/tree/trunk/raft
Reference - https://cwiki.apache.org/confluence/display/KAFKA/KIP-500%3A+Replace+ZooKeeper+with+a+Self-Managed+Metadata+Quorum
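As a rough illustration of what a Zookeeper-free broker configuration looks like, here is a minimal sketch that renders a single-node KRaft server.properties. The property names are standard Kafka settings, but the node id, host, ports, and log directory are illustrative assumptions, and the helper kraft_properties is hypothetical rather than part of Kafka.

```python
# Minimal sketch: render a single-node KRaft (no Zookeeper) server.properties.
# All values (node id, host, ports, log directory) are illustrative assumptions.

def kraft_properties(node_id: int = 1,
                     host: str = "localhost",
                     broker_port: int = 9092,
                     controller_port: int = 9093,
                     log_dir: str = "/tmp/kraft-combined-logs") -> str:
    settings = {
        # One process acts as both broker and controller ("combined" mode).
        "process.roles": "broker,controller",
        "node.id": str(node_id),
        # The controller quorum takes over the role of the Zookeeper ensemble.
        "controller.quorum.voters": f"{node_id}@{host}:{controller_port}",
        "listeners": f"PLAINTEXT://:{broker_port},CONTROLLER://:{controller_port}",
        "advertised.listeners": f"PLAINTEXT://{host}:{broker_port}",
        "controller.listener.names": "CONTROLLER",
        "inter.broker.listener.name": "PLAINTEXT",
        "listener.security.protocol.map": "CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT",
        "log.dirs": log_dir,
    }
    # Note that there is no zookeeper.connect entry at all.
    return "\n".join(f"{key}={value}" for key, value in settings.items()) + "\n"


if __name__ == "__main__":
    print(kraft_properties(), end="")
```

In recent Kafka releases the log directory has to be formatted once with kafka-storage.sh (using a generated cluster id) before the broker is started with such a file.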

Related

Kafka connect - completely removing a connector

My question has two parts. I've read "Kafka Connect - Delete Connector with configs?". I'd like to completely remove a connector, offsets and all, so I can recreate it with the same name later. Is this possible? To my understanding, a tombstone message will kill this connector indefinitely.
The second part: is there a way to have the kafka-connect container automatically delete all connectors it created when bringing it down?
Thanks
There is no single command that completely cleans up connector state. For a sink connector, you can use kafka-consumer-groups to reset its offsets. For a source connector, it's not as straightforward, as you'll need to manually produce data into the Connect-managed offsets topic.
The config and status topics also persist historical data, but shouldn't prevent you from recreating the connector with the same name/details.
The Connect containers published by Confluent and Debezium always use distributed mode. To avoid persisting connector metadata in Kafka topics, you'll need to override the container's entrypoint to use standalone mode (this won't be fault tolerant, but it'll be fine for testing).
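To make the two offset-reset paths more concrete, here is a hedged sketch. It assumes the sink connector uses the default consumer group name "connect-<connector name>" and that the offsets topic keys are compact JSON arrays of the form ["connector", {source partition}]. The connector names and the source partition map are hypothetical; in practice, copy the exact key from an existing record in the offsets topic.

```python
import json
from typing import Dict, List


def sink_offset_reset_command(connector: str,
                              bootstrap: str = "localhost:9092") -> List[str]:
    """Build the kafka-consumer-groups call that rewinds a sink connector."""
    return [
        "kafka-consumer-groups",
        "--bootstrap-server", bootstrap,
        # Assumption: the default group naming, "connect-<connector name>".
        "--group", f"connect-{connector}",
        "--reset-offsets", "--to-earliest", "--all-topics",
        "--execute",
    ]


def source_offset_key(connector: str, source_partition: Dict[str, str]) -> bytes:
    """Key of the offsets-topic record for one source partition.

    Producing this key with a null value (a tombstone) clears the stored
    offset. The partition map is connector specific, so treat it as a
    placeholder and copy the real key from the topic.
    """
    key = json.dumps([connector, source_partition], separators=(",", ":"))
    return key.encode("utf-8")


if __name__ == "__main__":
    print(" ".join(sink_offset_reset_command("jdbc-sink")))
    print(source_offset_key("pg-source", {"server": "dbserver1"}))
```

The consumer group can only be reset while it has no active members, so the sink connector should be deleted or stopped before running the generated command.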

How to run ad-hoc snapshots on Debezium

I have Debezium in a container, capturing all changes to PostgreSQL database records.
How do I delete all the Kafka topics that have already been created and initiate an ad-hoc snapshot from the beginning for all configured tables?
You can use kafka-topics --delete, just like for any other topic. The Debezium ones typically match your database schema/table name. You'll also need to find the internal offsets topic created by the Kafka Connect framework.
For Docker, though, if you restart Kafka and Zookeeper and they don't have volumes attached, then they'll lose everything, which would be easier for ad-hoc development.
Also, you don't need Zookeeper anymore as of Kafka 3.3.1, since KRaft mode is production ready.
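The topic cleanup can be sketched roughly as follows. This assumes the Debezium topics share a common prefix followed by schema and table names, and that the Connect offsets topic is called connect-offsets. Both names are hypothetical and depend on your connector and worker configuration.

```python
from typing import Iterable, List


def topics_to_delete(all_topics: Iterable[str], prefix: str,
                     internal: Iterable[str] = ()) -> List[str]:
    """Select the Debezium topics for one prefix plus named internal topics."""
    topics = list(all_topics)
    internal_names = set(internal)
    selected = {
        t for t in topics
        # Match "<prefix>" itself and "<prefix>.<schema>.<table>", but not
        # unrelated topics that merely start with the same characters.
        if t == prefix or t.startswith(prefix + ".") or t in internal_names
    }
    return sorted(selected)


def delete_commands(topics: Iterable[str],
                    bootstrap: str = "localhost:9092") -> List[List[str]]:
    """Build one kafka-topics --delete invocation per topic."""
    return [
        ["kafka-topics", "--bootstrap-server", bootstrap, "--delete", "--topic", t]
        for t in topics
    ]


if __name__ == "__main__":
    existing = ["dbserver1.public.orders", "dbserver1.public.customers",
                "connect-offsets", "unrelated-topic"]
    for cmd in delete_commands(topics_to_delete(existing, "dbserver1",
                                                internal=["connect-offsets"])):
        print(" ".join(cmd))
```

Deleting the whole offsets topic resets every connector on that Connect cluster, so this only suits a setup where Debezium is the sole connector.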

Best practices and methods for Kafka parameters and monitoring

I am going to implement a Snowflake Kafka connector for continuous ingestion of data into a target Snowflake database.
What are the best practices for:
Kafka for its clusters
Kafka and its related parameters
Monitoring resources
Kafka for its clusters
Run at least 3 brokers
Kafka and its related parameters
That's too broad and has nothing to do with running a Connect cluster or implementing one. The defaults are mostly fine. You can find the production recommendations in the Kafka documentation.
Monitoring resources
Use JMX. https://docs.confluent.io/platform/current/kafka/monitoring.html
going to implement a Snowflake Kafka connector
Snowflake already has a connector... I'd start by forking it rather than making your own.

Kafka design questions - Kafka Connect vs. own consumer/producer

I need to understand when to use Kafka Connect vs. our own consumer/producer written by a developer. We are getting Confluent Platform. Also, to achieve a fault-tolerant design, do we have to run the consumer/producer code (jar file) from all the brokers?
Kafka Connect is typically used to connect external systems to Kafka, i.e. to move data from an external source into Kafka or from Kafka into an external sink.
Anything that you can do with a connector can also be done with a Producer + Consumer.
Readily available connectors simply make it easier to connect external systems to Kafka, without requiring the developer to write the low-level code.
Some points to remember:
If the source and sink are both the same Kafka cluster, Connector doesn't make sense
If you are doing changed-data-capture (CDC) from a database and push them to Kafka, you can use a Database source connector.
Resource constraints: Kafka connect is a separate process. So double check what you can trade-off between resources and ease of development.
Writing your own connector is fine, unless someone has already written one. If you are using third-party connectors, you need to check how well they are maintained and/or whether support is available.
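To illustrate the "configuration instead of low-level code" point, here is a minimal sketch that builds (but does not send) a request to the Kafka Connect REST API to create a connector. It assumes a Connect worker at localhost:8083 and uses Confluent's JDBC sink connector as the example. The connector name, topic, and JDBC URL are hypothetical.

```python
import json
import urllib.request


def jdbc_sink_request(name: str, topic: str, jdbc_url: str,
                      connect_url: str = "http://localhost:8083") -> urllib.request.Request:
    """Build a POST /connectors request for a JDBC sink connector."""
    payload = {
        "name": name,
        "config": {
            # Assumption: the Confluent JDBC connector is installed on the worker.
            "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
            "tasks.max": "1",
            "topics": topic,
            "connection.url": jdbc_url,
            "auto.create": "true",
        },
    }
    return urllib.request.Request(
        url=f"{connect_url}/connectors",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = jdbc_sink_request("orders-sink", "orders",
                            "jdbc:mysql://localhost:3306/warehouse")
    # Only inspect the request here; urllib.request.urlopen(req) would submit it.
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

The equivalent hand-written consumer would need its own poll loop, error handling, offset management, and database writes, which is the trade-off the points above describe.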
do we have to run the consumer/producer code (jar file) from all the brokers?
Don't run client code on the brokers. Let all memory and disk access be reserved for the broker process.
when to use Kafka Connect vs. own consumer/producer
In my experience, these factors should be taken into consideration:
You're planning on deploying and monitoring Kafka Connect anyway, and have the available resources to do so. Again, these don't run on the broker machines
You don't plan on changing the connector code very often, because doing so means restarting the whole Connect worker JVM, which may also be running other connectors that don't need to be restarted
You aren't able to integrate your own producer/consumer code into your existing applications or simply would rather have a simpler produce/consume loop
Having structured data that is not tied to a particular binary format is preferred
The connector, whether you write your own or use a community one, is well tested and configurable for your use cases
Connect has limited options for fault tolerance compared to the raw producer/consumer APIs; the drawback of those APIs is more code, depending on the other libraries being used
Note: Confluent Platform is still the same Apache Kafka
Kafka Connect:
Kafka Connect is an open-source framework with two types of connectors: source and sink. A source connector reads data from an external system, such as a database, into Kafka, and a sink connector writes data from Kafka to an external system. In this way Kafka Connect helps integrate various other systems with Kafka. It also helps in tracking changes from databases into Kafka (Change Data Capture (CDC), as mentioned in one of the other answers). The framework maintains offsets so that it can resume reading or writing from the last recorded position.
For more details, you can refer to https://docs.confluent.io/current/connect/index.html
The Producer/Consumer:
The Producer and Consumer are client APIs that end applications use to write records to and read records from Kafka topics. They are used where we want to distribute data across the consumers in a consumer group. The offsets, and therefore the lag, of each consumer group are also tracked.
No, you don't need to run any producer/consumer while running Kafka Connect. If you want to check that there is no data loss, you can run a consumer while running source connectors. In the case of sink connectors, the data already produced can be verified in your database by running the appropriate SELECT queries.

Migrating topics, ACLs and messages from Apache Kafka to Confluent Platform

We are migrating our application from Apache Kafka to Confluent Platform.
Apache Kafka version: 1.1.0
Confluent Platform version: 4.1.0
We tried these options:
Manually copying the Zookeeper logs and Kafka logs. This is not an optimal way because of volume and data correctness.
MirrorMaker. This will replicate newly created topics and ACLs, but it will not migrate the old details in Apache Kafka.
Please suggest better approaches.
You can keep your existing Kafka and Zookeeper installation.
Confluent does not change any way these run or manage data.
You can configure the REST Proxy, Schema Registry, Control Center, KSQL, etc. to use your existing bootstrap servers or Zookeeper connection; nothing should need to be migrated, since you're only adding extra consumer/producer services which just happen to be provided by Confluent.
If you later plan on upgrading your brokers, then you can start up new ones from the Confluent package, migrate the partitions, then shut down the old ones. Similarly for Zookeeper, but make sure that you have at least 2 up during this process, and always have an odd number of them available after your transition.
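The "at least 2 up" and "odd number" advice follows from majority quorums: a Zookeeper ensemble stays available only while more than half of its members are running. A small sketch of that arithmetic (the function names are hypothetical):

```python
def quorum_size(ensemble: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return ensemble // 2 + 1


def tolerated_failures(ensemble: int) -> int:
    """How many members can be down while a majority is still up."""
    return ensemble - quorum_size(ensemble)


if __name__ == "__main__":
    for n in range(1, 7):
        print(f"ensemble={n} quorum={quorum_size(n)} "
              f"tolerated_failures={tolerated_failures(n)}")
```

A 3-node ensemble needs 2 members up, and going from 3 to 4 nodes raises the quorum to 3 without tolerating any additional failure, which is why an odd ensemble size is the usual recommendation.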