How to scale single node Kafka to multiple node cluster? - apache-kafka

I am going to install Kafka for company messaging. The plan is to install Kafka on a single large machine first and scale out to a cluster of 4-5 machines later if needed.
I have little experience with Kafka. Is it possible to scale out just by changing parameters in the broker configuration and installing ZooKeeper on each newly joined machine?
If not, what is roughly the easiest way to do this? More specifically, I am using Cloudera Kafka in CDH.
Thanks

To scale Kafka, add more partitions to your topics, if needed, using kafka-topics.sh, and then reassign partitions to your new brokers using kafka-reassign-partitions.sh.
The reassignment tool replicates and redistributes your data automatically. You can run it for a whole topic or for a selected set of partitions.
The complete documentation is here. Just take a look at section 6.
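As a rough illustration of the reassignment step, here is a minimal Python sketch that builds a reassignment plan in the JSON layout kafka-reassign-partitions.sh reads. The topic name "events", the broker ids and the round-robin placement are assumptions made up for the example, not something from the question.

```python
import json


def build_reassignment(topic, num_partitions, broker_ids, replication_factor):
    """Spread the replicas of each partition round-robin over broker_ids."""
    if replication_factor > len(broker_ids):
        raise ValueError("replication factor cannot exceed the number of brokers")
    partitions = []
    for p in range(num_partitions):
        replicas = [
            broker_ids[(p + r) % len(broker_ids)]
            for r in range(replication_factor)
        ]
        partitions.append({"topic": topic, "partition": p, "replicas": replicas})
    return {"version": 1, "partitions": partitions}


# Hypothetical example: topic "events", 6 partitions, brokers 1-4, 2 replicas each.
plan = build_reassignment("events", num_partitions=6,
                          broker_ids=[1, 2, 3, 4], replication_factor=2)
print(json.dumps(plan, indent=2))
```

Saved to a file, a plan like this can be passed to kafka-reassign-partitions.sh with --reassignment-json-file and --execute. The tool can also propose a plan itself with --generate, which is usually the safer starting point.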

Related

Kafka cluster migration over clouds, how to ensure consumers consume from right offsets when offsets are managed by us?

We are migrating Kafka clusters from AWS to Azure, and the challenge is that we use our own custom offset management for consumers. If I replicate the ZK nodes with the offsets, Kafka MirrorMaker will change those offsets. Is there any way to ensure the offsets stay the same so that the migration can be smooth?
I think the problem might be your custom management. Without more details on this, it's hard to give suggestions.
The problem I see with trying to copy offsets at all is this: say you consume from cluster A, topic T, at offset 1000. You mirror the topic to a brand-new cluster B, where topic T starts at offset 0. Consumers starting at offset 1000 will simply fail if fewer than 1000 messages have been mirrored, and if at least 1000 messages were mirrored, you're effectively skipping that data.
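To make the offset mismatch concrete, here is a toy Python sketch with no Kafka involved. It assumes a single partition that is mirrored in order and without gaps, and the retention window (offsets 900 to 1099) is made up for the example. Real replication tools handle translation in their own ways.

```python
def mirror(source_records):
    """Copy (offset, value) records into a new, empty log; offsets restart at 0."""
    return [(new_offset, value)
            for new_offset, (_, value) in enumerate(source_records)]


# Hypothetical cluster A: retention has already removed offsets below 900.
cluster_a = [(offset, f"msg-{offset}") for offset in range(900, 1100)]
cluster_b = mirror(cluster_a)

committed_on_a = 1000  # next offset the consumer would read on cluster A

# Translate by position in the mirrored stream, not by copying the number.
translated = next(
    new for (old, _), (new, _) in zip(cluster_a, cluster_b)
    if old == committed_on_a
)

print(cluster_b[-1][0])  # 199: offset 1000 does not exist on cluster B
print(translated)        # 100: the offset on B that holds msg-1000
```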
With newer versions of Kafka (post 0.10), MirrorMaker uses the __consumer_offsets topic rather than ZooKeeper, since it's built on the newer Java clients.
As for replication tools, uber/uReplicator uses ZooKeeper for offsets.
There are other tools that manage offsets differently, such as Comcast/MirrorTool or salesforce/mirus via the Kafka Connect Framework.
And the enterprise supported tool would be Confluent Replicator, which has unique ways of handling cluster failover and migrations.

Is scaling Kafka Connect the same as scaling a Kafka consumer?

We need to pull data from Kafka and write it to AWS S3. Kafka is managed by a separate department, and we only have access to a specific topic.
Based on the Kafka documentation, Kafka Connect looks like an easy solution for me because I don't have any custom message processing logic.
Normally, when we run a Kafka consumer, we can run multiple JVMs with the same consumer group for scalability. The consumer JVMs can run on the same physical server or on different ones. What would be the case when I want to use Kafka Connect?
Let's say I have 20 partitions of the topic.
How can I run Kafka Connect with 20 instances?
Can I have multiple instances of Kafka Connect running on the same physical instance?
Kafka Connect handles balancing the load across all its workers. In your example of 20 partitions, you could have (for example):
1 Kafka Connect worker, processing 20 partitions
5 Kafka Connect workers, each processing 4 partitions
20 Kafka Connect workers, each processing 1 partition
It depends on your volumes and required throughput.
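The arithmetic behind those three layouts can be sketched in a few lines of Python. This is only an illustration: the round-robin rule here is an assumption, and in a real deployment the consumer group's assignor decides which task reads which partition.

```python
def spread_partitions(num_partitions, num_consumers):
    """Assign partition numbers to consumers round-robin."""
    assignment = {c: [] for c in range(num_consumers)}
    for p in range(num_partitions):
        assignment[p % num_consumers].append(p)
    return assignment


# The three layouts from the answer: 1, 5 and 20 workers for 20 partitions.
for workers in (1, 5, 20):
    sizes = [len(parts) for parts in spread_partitions(20, workers).values()]
    print(workers, "worker(s), partitions per worker:", sizes)
```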
To run Kafka Connect in Distributed mode across multiple nodes, follow the instructions here and make sure you give all the workers the same group.id, which identifies them as members of the same cluster (and thus eligible to share the workload of tasks among them). More config details for distributed mode here.
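As a sketch of what "the same group.id on every worker" looks like, the Python below renders a minimal distributed worker properties file. The host names and the group id "s3-sink-cluster" are hypothetical; the storage topic names follow the sample file shipped with Kafka, and a real worker config will likely need more settings than this.

```python
def worker_properties(bootstrap_servers, group_id):
    """Render a minimal Kafka Connect distributed worker config as text."""
    settings = {
        "bootstrap.servers": bootstrap_servers,
        # Every worker that should join the same Connect cluster uses this value.
        "group.id": group_id,
        "key.converter": "org.apache.kafka.connect.json.JsonConverter",
        "value.converter": "org.apache.kafka.connect.json.JsonConverter",
        "config.storage.topic": "connect-configs",
        "offset.storage.topic": "connect-offsets",
        "status.storage.topic": "connect-status",
    }
    return "\n".join(f"{key}={value}" for key, value in settings.items()) + "\n"


# Hypothetical broker addresses and group id; the same text goes on every worker node.
config_text = worker_properties("kafka1:9092,kafka2:9092", "s3-sink-cluster")
print(config_text)
```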
Even if you're running Kafka Connect on a single node, I would personally recommend running it in Distributed mode, as it makes scale-out simpler (you just add additional nodes, but the execution & config remain the same).
I don't see a benefit in running multiple Kafka Connect workers on a single node. Each Kafka Connect worker can run multiple tasks, and connectors, as required.
My understanding is that if you only have a single machine, you should launch only one Kafka Connect instance and set the tasks.max property to the amount of parallelism you'd like to achieve (in your example, 20 might be good). This should allow Kafka Connect to read from your partitions in parallel; see the docs for this here.
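For illustration, this is roughly what a connector definition with tasks.max looks like when built in Python. It assumes the Confluent S3 sink connector is the one in use; the connector name and topic are hypothetical, and the S3-specific settings (bucket, region, format and so on) are deliberately left out.

```python
import json


def s3_sink_request(name, topic, tasks_max):
    """Build the JSON body for creating a sink connector with a task limit."""
    return {
        "name": name,
        "config": {
            # Assumption: the Confluent S3 sink connector is installed on the workers.
            "connector.class": "io.confluent.connect.s3.S3SinkConnector",
            "topics": topic,
            # Upper bound on parallel tasks; connector config values are strings.
            "tasks.max": str(tasks_max),
        },
    }


# Hypothetical connector name and topic; 20 tasks for 20 partitions.
body = s3_sink_request("s3-sink", "my-topic", tasks_max=20)
print(json.dumps(body, indent=2))
```

In distributed mode a body like this would be sent with a POST to /connectors on a worker's REST interface (port 8083 by default). For a sink connector, tasks beyond the number of partitions just sit idle, so 20 is the useful ceiling here.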
You could launch multiple instances on the same machine in theory. It makes sense to do this if you need each instance to consume data from different topics. But if you want the instances to consume data from the same topic, I don't think doing this would benefit you. Using separate threads within the same process with tasks.max will give you the same if not better performance.
If you want Kafka Connect to run on multiple machines and read data from the same topic, you can run it in distributed mode.

How to recover Kafka from complete zookeeper loss and new start?

I have a simple Kafka cluster of 3 brokers and 3 zk nodes.
If I wipe out 2/3 zk nodes and bring them back (even new "clean" ones), everything recovers as zk re-syncs.
If I wipe out all 3 zk nodes and restart them "clean" (think docker containers or AWS auto-scaling group instances), the brokers are confused. All of the data structures in zk (basic paths, brokers, topics, etc.) are gone, since I have a blank zk.
How can I recover from this scenario? I am (potentially) willing to live with lost topics (since we automate topic creation), but the brokers (unlike with startup) do not "know" that zk is blank and so do not reinitialize (set up structures, register brokers, etc.). Conversely, I could back up zk and restore it, as long as I know what to backup/restore.
The key element is fully automated, though. In cloud-native, I cannot rely on a human doing the restore or checking.
I'm not sure that managing Zookeeper nodes (or Kafka brokers for that matter) with autoscaling is such a good idea.
For one, ZooKeeper maintains the topic information (and if you are not using the latest Kafka builds, or are still using the old consumer API, it also maintains the consumer offsets).
Second, topic partitions are statically assigned to brokers, so if you bring down the current Kafka brokers and spawn new nodes, you have to be very careful to start the new brokers with the same broker.id and data; otherwise Kafka might get confused.
Third, regarding ZooKeeper, be careful not to create a cluster with an even number of nodes: it tolerates no more failures than the next smaller odd-sized cluster, and if it splits into two equal halves, neither half has the majority needed to elect a leader.
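The majority arithmetic behind the advice on ensemble size can be sketched as follows (plain Python, no ZooKeeper involved; the function names are made up for the example):

```python
def quorum_size(ensemble_size):
    """Smallest number of nodes that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many nodes can be lost while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


for n in range(1, 7):
    print(f"{n} nodes: quorum {quorum_size(n)}, "
          f"tolerates {tolerated_failures(n)} failure(s)")
```

Going from 3 to 4 nodes raises the quorum from 2 to 3 but still only survives one failure, which is why odd sizes are the usual choice.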
Having said all that, I think that backing up and restoring one of the ZooKeeper nodes should work. It would be even easier if you set things up so that at least one of the nodes cannot be turned off (or, alternatively, you use persistent storage for that one).
This way you ensure that one of the Zookeeper nodes will always have the latest data and it will take care of replicating it to the other nodes.

Kafka- How to automatically use the second cluster when the first cluster is down?

I am trying to replicate data from one Kafka cluster to another using MirrorMaker. If the master cluster goes down, is it possible to automatically send the Kafka messages to the second cluster? And is it possible to synchronise cluster 1 with cluster 2 automatically, with little manual intervention, when cluster 1 is up again?
Any help is highly appreciated.
I think you meant to ask how to maintain copies between Kafka brokers that together form a Kafka cluster.
If that's the case, it's pretty simple: all you have to do is configure a Kafka cluster and create a topic whose replication factor equals the number of nodes in the cluster.
For example:
Let's say that we want to have 3 brokers in our Kafka cluster. You'll need to prepare a different configuration file for each broker, start them up as a cluster, and then create a topic with a replication factor of 3.
Kafka will be responsible for maintaining the fault tolerance.
For further info on actually doing the configuration, watch these videos on YouTube:
https://www.youtube.com/channel/UCDLPjuuYHxPbHdN8RXxrGdw

Kafka in distributed system

I am new to Kafka and am running it on a single machine as of now. I want to run Kafka in a distributed environment on multiple machines. I could not find proper documentation for this. Any documentation or suggestions would be really helpful.
Adding on to the previous answer by user2720864
Let us assume that a Kafka system with the configuration below is needed.
7 Kafka nodes
3 ZooKeeper nodes
To achieve this, install 7 Kafka instances on 7 different servers/VMs, and set a different broker.id on each of these instances. This lets ZooKeeper identify the different Kafka nodes for bookkeeping and maintenance.
broker.id=X (/config/server.properties)
To run the ZooKeeper nodes, you can use 3 of the previous Kafka instances or use new servers. Once you have decided which servers ZooKeeper runs on, change /config/server.properties to point to them.
zookeeper.connect=hostname1:port1,hostname2:port2
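A small Python sketch of the idea: every broker gets its own broker.id, while all of them share the same zookeeper.connect string. The host names zk1 to zk3 and port 2181 are hypothetical, and a real server.properties needs further settings (log.dirs, listeners and so on) that are left out here.

```python
def server_properties(broker_id, zookeeper_hosts):
    """Render the two cluster-related lines of a broker's server.properties."""
    return (
        f"broker.id={broker_id}\n"
        f"zookeeper.connect={','.join(zookeeper_hosts)}\n"
    )


# Hypothetical ZooKeeper hosts; the same list goes into every broker's config.
zookeeper_hosts = ["zk1:2181", "zk2:2181", "zk3:2181"]

# One config per broker, ids 1 through 7 as in the 7-node example.
configs = {broker_id: server_properties(broker_id, zookeeper_hosts)
           for broker_id in range(1, 8)}

print(configs[1])
print(configs[7])
```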
In a distributed environment it's nice to have 3 ZooKeeper nodes. One of them acts as the leader and the other 2 are followers. When the leader fails, the remaining two elect a new leader between them.
I found this link to be very useful, it helped me clarify a lot of things about kafka architecture.
This is a good reference for all the configurations on the property files in kafka.
Hope this helps!
Basically you need to do the following:
1) Set up kafka on all the machines
2) Configure the config/server1.properties file to specify a unique id for each machine. You can do that by setting the broker.id property in the config file, e.g. broker.id=1, broker.id=2. This id must be unique for every broker, since it is how each node is identified in a Kafka cluster.
3) Start kafka in all nodes
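Since step 2 depends on every broker.id being unique, a quick sanity check before step 3 can be sketched in Python like this. The host names and config contents below are made up for the example; in practice you would read each machine's properties file instead.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def duplicate_broker_ids(configs_by_host):
    """Return the broker.id values that appear on more than one host, sorted."""
    hosts_by_id = {}
    for host, text in configs_by_host.items():
        broker_id = parse_properties(text).get("broker.id")
        hosts_by_id.setdefault(broker_id, []).append(host)
    return sorted(bid for bid, hosts in hosts_by_id.items() if len(hosts) > 1)


# Hypothetical configs: host-c accidentally reuses broker.id=2.
configs_by_host = {
    "host-a": "# broker 1\nbroker.id=1\n",
    "host-b": "broker.id=2\n",
    "host-c": "broker.id=2\n",
}
print(duplicate_broker_ids(configs_by_host))
```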
You can refer to Step 6: Setting up a multi-broker cluster on their official quick start page.
Also, here is a nice article worth taking a look at.