Kafka Offsets.storage - apache-kafka

What is the "offsets.storage" for kafka 0.10.1.1?
As per the documentation it shows up under Old Consumer Configs as "zookeeper".
offsets.storage (default: zookeeper): Select where offsets should be stored (zookeeper or kafka).
My consumer is a spring-boot 1.5.13.RELEASE app which uses kafka-clients 0.10.1.1 internally. According to the source code in ConsumerConfig.scala, offsetStorage is "zookeeper", but when I run the consumer, I see "__consumer_offsets" directories being created under the /tmp/kafka-logs directory, which is defined in server.properties (i.e. on the broker).
Moreover, nothing shows up under the zookeeper nodes when I check with zookeeper-shell.sh:
ls /consumers
[]
If offsets.storage is zookeeper, then why does __consumer_offsets show up under /tmp/kafka-logs and not in the zookeeper nodes?

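Spring Kafka uses the "new" consumer (Java), not the old Scala consumer. The offsets.storage setting only applies to the old consumer; the new consumer always commits its offsets to the __consumer_offsets topic on the brokers, which is why nothing appears under /consumers in Zookeeper.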

Related

Doubts Regarding Kafka Cluster Setup

I want to set up a Kafka cluster. Initially I have 1 Kafka broker (A) and 1 Zookeeper node. These are my questions:
On adding a new Kafka broker (B) to the cluster, will all data present on broker A be distributed automatically? If not, what do I need to do to distribute the data?
Now let's suppose that case 1 is somehow solved and my data is distributed across both brokers. Due to some maintenance issue, I want to take down server B.
How do I transfer the data of broker B to the already existing broker A or to a new broker C?
How can I increase the replication factor of my topics at runtime?
How can I change the Zookeeper IPs present in the Kafka broker config at runtime without restarting Kafka?
How can I dynamically change the Kafka configuration at runtime?
Regarding the Kafka client:
Do I need to specify all Kafka broker IPs to the Kafka client for the connection?
And each time a broker is added or removed, do I need to add or remove its IP in the Kafka client connection string? That would always require restarting my producers and consumers.
Note:
Kafka Version: 2.0.0
Zookeeper: 3.4.9
Broker Size : (2 core, 8 GB RAM) [4GB for Kafka and 4 GB for OS]
To run a topic from a single kafka broker you will have to set a replication factor of 1 when creating that topic (explicitly, or implicitly via default.replication.factor). This means that the topic's partitions will be on a single broker, even after increasing the number of brokers.
You will have to increase the number of replicas as described in the kafka documentation. You will also have to pay attention that the internal __consumer_offsets topic has enough replicas. This will start the replication process and eventually the original broker will be the leader of every topic partition, and the other broker will be the follower and fully caught up. You can use kafka-topics.sh --describe to check that every partition has both brokers in the ISR (in-sync replicas).
Once that is done you should be able to take the original broker offline and kafka will elect the new broker as the leader of every topic partition. Don't forget to update the clients so they are aware of the new broker as well, in case a client needs to restart when the original broker is down (otherwise it won't find the cluster).
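The ISR check described above can be automated. The following is a minimal sketch that scans the text printed by kafka-topics.sh --describe and reports partitions whose ISR does not yet contain every assigned replica. The sample output, the topic name and the helper name under_replicated are illustrative assumptions; the exact layout of the tool's output may differ between Kafka versions, so treat the regular expression as a starting point.

```python
import re

# Illustrative sample only: the layout mimics what kafka-topics.sh --describe
# prints; the topic name and broker ids are made up for this sketch.
SAMPLE_DESCRIBE_OUTPUT = """\
Topic:signals\tPartitionCount:2\tReplicationFactor:2\tConfigs:
\tTopic: signals\tPartition: 0\tLeader: 0\tReplicas: 0,1\tIsr: 0,1
\tTopic: signals\tPartition: 1\tLeader: 0\tReplicas: 0,1\tIsr: 0
"""

# Matches the per-partition lines and skips the per-topic summary line.
PARTITION_LINE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>\S+)\s+Replicas:\s*(?P<replicas>[\d,]+)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def under_replicated(describe_output):
    """Return (topic, partition, missing_broker_ids) for every partition
    whose ISR does not contain all of its assigned replicas."""
    problems = []
    for line in describe_output.splitlines():
        match = PARTITION_LINE.search(line)
        if match is None:
            continue  # summary line or unrelated output
        replicas = {int(b) for b in match.group("replicas").split(",") if b}
        isr = {int(b) for b in match.group("isr").split(",") if b}
        missing = sorted(replicas - isr)
        if missing:
            problems.append(
                (match.group("topic"), int(match.group("partition")), missing)
            )
    return problems


if __name__ == "__main__":
    for topic, partition, missing in under_replicated(SAMPLE_DESCRIBE_OUTPUT):
        print(f"{topic}-{partition}: brokers {missing} not yet in sync")
```

An empty result would suggest that every partition has caught up on all of its replicas, which is the condition to wait for before taking the original broker offline.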
Here are the answers in brief:
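No, the data already present on broker A is not moved to broker B automatically. Existing partitions stay where they are until you reassign them; only newly created topics and partitions can be placed on the new broker.
You can set up three brokers A, B and C, so that if A fails then B and C will take over, and if B fails as well then C will take over, and so on.
You can increase the replication factor of your topics.
You could create increase-replication-factor.json and put this content in it: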
{"version":1,
"partitions":[
{"topic":"signals","partition":0,"replicas":[0,1,2]},
{"topic":"signals","partition":1,"replicas":[0,1,2]},
{"topic":"signals","partition":2,"replicas":[0,1,2]}
]}
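To increase the number of replicas for a given topic, you have to:
Specify the extra replicas in a reassignment JSON file such as the one above, then pass that file to the kafka-reassign-partitions.sh tool with the --execute option:
bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 --reassignment-json-file increase-replication-factor.json --execute
Note that the following command does something different: it increases the number of partitions of the topic (let us say from 2 to 3), not the number of replicas:
bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic topic-to-increase --partitions 3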
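For topics with many partitions, writing the reassignment JSON by hand gets tedious. Here is a minimal sketch that generates a file in the same format as increase-replication-factor.json above. The function name build_reassignment is hypothetical, and the rotation of the replica list is my own design choice (it differs from the hand-written file, which lists [0,1,2] for every partition): the first replica in each list is the preferred leader, so rotating spreads leadership across brokers.

```python
import json


def build_reassignment(topic, partition_count, replica_brokers):
    """Build a plan that places every partition of `topic` on all brokers in
    `replica_brokers`, in the JSON layout shown above."""
    brokers = list(replica_brokers)
    if not brokers:
        raise ValueError("at least one broker id is required")
    if len(set(brokers)) != len(brokers):
        raise ValueError("broker ids must be unique")

    partitions = []
    for partition in range(partition_count):
        # Rotate the list so that a different broker comes first
        # (and is therefore the preferred leader) for each partition.
        shift = partition % len(brokers)
        partitions.append({
            "topic": topic,
            "partition": partition,
            "replicas": brokers[shift:] + brokers[:shift],
        })
    return {"version": 1, "partitions": partitions}


if __name__ == "__main__":
    plan = build_reassignment("signals", 3, [0, 1, 2])
    with open("increase-replication-factor.json", "w") as handle:
        json.dump(plan, handle, indent=2)
    print(json.dumps(plan, indent=2))
```

The same helper can be used to drain a broker before maintenance: pass a broker list that leaves out the broker you want to take down.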
ZooKeeper itself is configured through its zoo.cfg file, where you can set the IPs and other ZooKeeper-related configuration.

Purpose of Zookeeper in Kafka

With the latest consumer versions of Kafka, the consumers aren't dependent on ZooKeeper. But "https://kafka.apache.org/" says Kafka requires Zookeeper, so you have to start a Zookeeper server. Why is that? Once a topic has been created, it keeps working even if I terminate Zookeeper. So is the purpose of Zookeeper only to create topics? If so, why not make topic creation independent of Zookeeper as well?
Kafka topics (still) require Zookeeper for electing a leader, communicating server failure, and storing the list of topics, plus some extra metadata such as replica location and topic configurations.
Kafka Wiki - How does Kafka depend on Zookeeper
Confluent and the Kafka community are trying to move away from the Zookeeper dependency. For example, the Confluent Schema Registry can now use Kafka for leader election. Related blog from Confluent - https://www.confluent.io/blog/how-to-prepare-for-kip-500-kafka-zookeeper-removal-guide/
And in Confluent Cloud, Amazon MSK, and other hosted Kafka offerings, you generally have no access to Zookeeper at all.
The consumers are not dependent on Zookeeper as they are client-side. Likewise with the producers.
Zookeeper is required for the Kafka brokers themselves. Kafka brokers use Zookeeper to co-ordinate and synchronise themselves.

Checking Offset of Kafka topic for a storm consumer

I am using storm-kafka-client 1.2.1 and creating my spout config for KafkaTridentSpoutOpaque as below
kafkaSpoutConfig = KafkaSpoutConfig.builder(brokerURL, kafkaTopic)
    .setProp(ConsumerConfig.GROUP_ID_CONFIG, "storm-kafka-group")
    .setProcessingGuarantee(ProcessingGuarantee.AT_MOST_ONCE)
    .setProp(ConsumerConfig.CLIENT_ID_CONFIG, InetAddress.getLocalHost().getHostName())
    .build();
I am unable to find either my group id or the offset in Kafka or in Zookeeper. In Zookeeper I used zkCli.sh and tried ls /consumers, but there was nothing there, I think because Kafka itself now maintains offsets rather than Zookeeper.
I tried with Kafka too with the command below
bin/kafka-run-class.sh kafka.admin.ConsumerGroupCommand --list --bootstrap-server localhost:9092
Note: This will not show information about old Zookeeper-based consumers.
console-consumer-20130
console-consumer-82696
console-consumer-6106
console-consumer-67393
console-consumer-14333
console-consumer-21174
console-consumer-64550
Can someone help me find my offset? And will my events in Kafka be replayed if I restart the topology?
Trident doesn't store offsets in Kafka, but in Storm's Zookeeper. If you're running with default settings for Storm's Zookeeper config the path in Storm's Zookeeper will be something like /coordinator/<your-topology-id>/meta.
The objects below that path will contain the first and last offset, as well as topic partition for each batch. So e.g. /coordinator/<your-topology-id>/meta/15 would contain the first and last offset emitted in batch number 15.
Whether the spout replays offsets after restart is controlled by the FirstPollOffsetStrategy you set in the KafkaSpoutConfig. The default is UNCOMMITTED_EARLIEST, which does not start over on restart. See the Javadoc at https://github.com/apache/storm/blob/v1.2.1/external/storm-kafka-client/src/main/java/org/apache/storm/kafka/spout/KafkaSpoutConfig.java#L126.

Why do we need to mention Zookeeper details even though Apache Kafka configuration file already has it?

I am using Apache Kafka in (Plain Vanilla) Hadoop Cluster for the past few months and out of curiosity I am asking this question. Just to gain additional knowledge about it.
Kafka server.properties file already has the below parameter :
zookeeper.connect=localhost:2181
And I am starting Kafka Server/Broker with the following command :
bin/kafka-server-start.sh config/server.properties
So I assume that Kafka automatically infers the Zookeeper details by the time we start the Kafka server itself. If that's the case, then why do we need to explicitly mention the zookeeper properties while we create Kafka topics the syntax for which is given below for your reference :
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
As per the Kafka documentation, we need to start Zookeeper before starting the Kafka server. So I don't think Kafka can be started with the Zookeeper details commented out in Kafka's server.properties file.
But can we at least use Kafka to create topics and to start a Kafka producer/consumer without explicitly mentioning Zookeeper in their respective commands?
The zookeeper.connect parameter in the Kafka properties file is needed for having each Kafka broker in the cluster connecting to the Zookeeper ensemble.
Zookeeper keeps information about connected brokers and handles the controller election. Beyond that, it keeps information about topics, quotas and ACLs, for example.
When you use the kafka-topics.sh tool, the topic creation happens at the Zookeeper level first; the information is then propagated to the Kafka brokers, and the topic partitions are created and assigned to them by the elected controller. This connection to Zookeeper will not be needed in the future thanks to the new Admin Client API, which provides some admin operations executed against the Kafka brokers directly. For example, there is an open JIRA (https://issues.apache.org/jira/browse/KAFKA-5561), which I'm working on, to have the tool use this API for topic admin operations.
Regarding producer and consumer: the producer doesn't need to connect to Zookeeper, while only the "old" consumer (before version 0.9.0) needs a Zookeeper connection because it saves topic offsets there. From version 0.9.0, the "new" consumer saves topic offsets in a regular Kafka topic (__consumer_offsets). To use it you have to pass the bootstrap-server option on the command line instead of the zookeeper one.
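To make the __consumer_offsets storage more concrete, here is a small sketch of how a consumer group ends up in one particular partition of that topic. It assumes, to the best of my understanding of the broker code, that the partition is the non-negative Java hashCode of the group id modulo offsets.topic.num.partitions (default 50). The function names are hypothetical, and the result is only meaningful if your broker uses that formula and that partition count.

```python
def java_string_hashcode(text):
    """Replicate java.lang.String.hashCode(): a 32-bit signed hash computed
    over the UTF-16 code units of the string."""
    value = 0
    data = text.encode("utf-16-be")
    for index in range(0, len(data), 2):
        code_unit = (data[index] << 8) | data[index + 1]
        value = (31 * value + code_unit) & 0xFFFFFFFF
    # Reinterpret the unsigned 32-bit value as a signed Java int.
    return value - 0x100000000 if value >= 0x80000000 else value


def offsets_partition_for(group_id, num_partitions=50):
    """Return the __consumer_offsets partition assumed to hold the offsets
    of `group_id` (num_partitions mirrors offsets.topic.num.partitions)."""
    hashed = java_string_hashcode(group_id)
    # Assumption: like Kafka's Utils.abs, map Integer.MIN_VALUE to 0
    # instead of letting the absolute value overflow.
    non_negative = 0 if hashed == -0x80000000 else abs(hashed)
    return non_negative % num_partitions


if __name__ == "__main__":
    group = "storm-kafka-group"  # example group id
    print(f"{group} -> __consumer_offsets-{offsets_partition_for(group)}")
```

The partition number tells you which __consumer_offsets-N directory under the broker's log directory would hold the committed offsets for that group.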

Kafka consumers path in zookeeper is empty?

I use zkCli.sh to list the kafka paths in zookeeper.
Comparing against the "Kafka data structures in Zookeeper" document, I find that all paths match the document except the consumers path.
The command ls /consumers responds with [], but Yahoo's Kafka Manager can still get consumer info such as LogSize, Consumer Offset and so on.
That's the new consumer, which does not depend on Zookeeper anymore. The Zookeeper node '/consumers' is only used by old consumers. The reason you can find consumer info in Kafka Manager might be that it already supports the new consumer.
Kafka ships with a command kafka-consumer-groups.sh which can be used to check status for both old consumer and new consumer.