Kafka MirrorMaker's consumer not fetching all messages from topics - scala

I am trying to set up a Kafka mirroring mechanism, but it seems that the Kafka MirrorMaker consumer only reads data that arrives in the source cluster's topics after the MirrorMaker process is started, i.e. it does not read the data that was already stored in the topics.
I am using the Kafka MirrorMaker class for that, as follows:
/bin/kafka-run-class.sh kafka.tools.MirrorMaker --consumer.config consumer.config --num.streams 2 --producer.config producer.config --whitelist=".*"
The consumer.config used to read from the source Kafka cluster:
zookeeper.connect=127.0.0.1:2181
zookeeper.connection.timeout.ms=6000
group.id=kafka-mirror
and the producer.config settings used to produce to the new mirrored Kafka cluster:
metadata.broker.list=localhost:9093
producer.type=sync
compression.codec=none
serializer.class=kafka.serializer.DefaultEncoder
Is there a way to configure the Kafka MirrorMaker consumer to read from the beginning of the topics in my source Kafka cluster? This is a bit strange, because I have defined a new consumer group (kafka-mirror) in consumer.config, so the consumer should simply read from offset 0, i.e. from the beginning of the topics.
Many thanks in advance!

In the consumer properties, add
auto.offset.reset=earliest
(earliest applies to the new consumer; for the old Zookeeper-based consumer configured in the question, the equivalent value is smallest). This should work.

Look at the auto.offset.reset parameter in the Kafka consumer configuration.
From Kafka documentation:
auto.offset.reset largest
What to do when there is no initial offset in Zookeeper or if an
offset is out of range:
* smallest: automatically reset the offset to the smallest offset
* largest: automatically reset the offset to the largest offset
* anything else: throw exception to the consumer.
If this is set to largest, the consumer may lose some messages when the number of partitions, for the topics it subscribes to, changes on the broker. To prevent data loss during partition addition, set auto.offset.reset to smallest.
So, using smallest for auto.offset.reset should fix your problem.
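A minimal sketch of the question's consumer.config with the property added (smallest applies to the old Zookeeper-based consumer that MirrorMaker uses here; the equivalent value for the new consumer is earliest):
zookeeper.connect=127.0.0.1:2181
zookeeper.connection.timeout.ms=6000
group.id=kafka-mirror
auto.offset.reset=smallest
Note that this only takes effect when the group has no committed offsets yet; if the kafka-mirror group has already committed offsets from earlier runs, you may also need to switch to a fresh group.id (or reset the stored offsets) to re-read from the beginning.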

Very late answer, but this might be helpful to someone who is still looking for a solution.
As of now, Kafka MirrorMaker doesn't support this. There is an open defect: KafkaMirror

Related

Kafka consumer offset mechanism: How is the topic-based storage better than the earlier ZooKeeper-based storage?

Earlier, Kafka used to store consumer offsets in ZooKeeper, but since around Kafka 0.9/0.10 it has stored consumer offsets in an internal topic.
As stated in this post -
Kafka brokers use an internal topic named __consumer_offsets that
keeps track of what messages a given consumer group last successfully
processed. As we know, each message in a Kafka topic has a partition
ID and an offset ID attached to it.
But a topic is not like a DB table, which can be queried for data based on some input. So I am wondering how this is efficient at all, and how exactly Kafka retrieves the offsets for a particular partition for a particular consumer group.
Kafka Streams or an in-memory hashtable can make a compacted topic behave very much like a key-value database store.
The consumer offsets topic is a compacted topic, keyed by group name. When you give a group.id in the client, the broker acting as group coordinator can easily look up that name in the topic, by key, and return all currently committed offsets for all partitions of the group. The consumer then looks up the offsets for its assigned partitions from the returned map.
It's not a question of "better". Removing the dependency on ZooKeeper was always the goal, and it is finally production-ready as of Kafka 3.3.1.
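As an illustration of that lookup (a client-side sketch, not the broker's internal code path), the Java AdminClient can ask the group coordinator for the currently committed offsets of a group; the broker address and the group name my-group below are placeholders:
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class CommittedOffsets {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        try (AdminClient admin = AdminClient.create(props)) {
            // The group coordinator answers this from its materialized view of __consumer_offsets,
            // keyed by the group name, rather than by scanning the topic.
            Map<TopicPartition, OffsetAndMetadata> committed =
                admin.listConsumerGroupOffsets("my-group")
                     .partitionsToOffsetAndMetadata()
                     .get();
            committed.forEach((tp, om) -> System.out.println(tp + " -> " + om.offset()));
        }
    }
}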

Kafka: Who maintains up to which offset number a message has been read by a consumer group?

I know that every message in a Kafka topic partition has its own offset number, and that this takes care of the ordering of messages.
But if I have a Kafka consumer group (or a single Kafka consumer) that is reading a particular Kafka topic partition, how does it keep track of up to which offset the messages have been read, and who maintains this offset counter?
If the consumer goes down, how will a new consumer start reading from the next unread (or unacknowledged) offset?
The information about consumer groups is all stored in the internal Kafka topic __consumer_offsets. Whenever a new group tries to read data from a topic, it checks its offset position in that internal topic, which has its cleanup policy set to compact. The compaction keeps this topic small.
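To make the division of labour concrete: the consumer decides when a message counts as processed and commits its offset, and the broker stores that commit in __consumer_offsets. A minimal sketch with the Java consumer (broker address, group and topic names are placeholders), committing manually after processing:
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ManualCommit {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.put("group.id", "my-group");                   // placeholder group
        props.put("enable.auto.commit", "false");            // commit only after processing
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic")); // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(1000); // older poll(long) variant
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println(record.offset() + ": " + record.value()); // "process" the record
                }
                // Persist the group's position in __consumer_offsets; a consumer that later
                // takes over these partitions resumes right after the committed offsets.
                consumer.commitSync();
            }
        }
    }
}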
Kafka comes with a command line tool kafka-consumer-groups.sh that helps you understand which information is stored for each consumer group.
More information is given in the Kafka Documentation on offset tracking.
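For example (assuming a broker at localhost:9092 and a hypothetical group called my-group):
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group my-group
This prints, per partition, the group's current committed offset, the log end offset and the resulting lag.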

Kafka Streams: Internal topic partitions

Kafka version: 1.0.0
Let's say the streams application uses the low-level Processor API, maintains state, and reads from a topic with 10 partitions. Please clarify whether the internal topic is expected to be created with the same number of partitions, or whether it follows the broker default. If it's the latter, is there any option if we need to increase the partitions of the internal topic?
Kafka Streams will create the topic for you. And yes, it will create it with the same number of partitions as your input topic. During startup, Kafka Streams also checks if the topic has the expected number of partitions and fails if not.
The internal topic is basically a regular topic like any other, and you can change the number of partitions via the command line tools just as for any other topic. However, this should never be required. Also note that dropping/adding partitions will mess up your state.
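As a sketch with the command line tools (the names my-app and my-store are placeholders; Kafka Streams changelog topics follow the <application.id>-<store name>-changelog naming convention, and --zookeeper matches the Kafka 1.0-era tooling):
bin/kafka-topics.sh --zookeeper localhost:2181 --describe --topic my-app-my-store-changelog
bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic my-app-my-store-changelog --partitions 20
The first command lets you verify that the changelog was created with the same partition count as the input topic; the second is exactly the partition change the answer warns about, since it breaks the existing key-to-partition mapping of your state.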

read kafka message starting from a specific offset using high level API

I hope I am not making a mistake, but I remember reading in the Kafka documentation that with the high-level API you can't start reading messages from a specific offset, though it was mentioned that this would change.
Is it now possible, using the high-level API, to read messages from a specific partition and a specific offset? Could you please give me an example of how to do it?
I am using Kafka 0.8.1.1.
Thanks in advance.
You can do that with Kafka 0.9:
http://kafka.apache.org/090/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html
public void seek(TopicPartition partition, long offset)
Overrides the fetch offsets that the consumer will use on the next poll(timeout). If this API is invoked for the
same partition more than once, the latest offset will be used on the
next poll(). Note that you may lose data if this API is arbitrarily
used in the middle of consumption, to reset the fetch offsets
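A minimal sketch with the 0.9+ Java consumer (broker address, group and deserializers are placeholders); assign() is used instead of subscribe() so that seek() can be called directly on the chosen partition:
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class SeekExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.put("group.id", "my-group");                   // placeholder group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("topicname", 0);
            consumer.assign(Collections.singletonList(tp)); // manual assignment, no rebalance needed
            consumer.seek(tp, 10L);                         // start fetching at offset 10
            ConsumerRecords<String, String> records = consumer.poll(1000); // 0.9-era poll(long)
            for (ConsumerRecord<String, String> record : records) {
                System.out.println(record.offset() + ": " + record.value());
            }
        }
    }
}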
Kafka 0.8.1.1 can use ZooKeeper to store offsets for each consumer group. If you configure your consumer to commit offsets to ZooKeeper, then you just need to manually set the starting offset for the topic and partition under ZooKeeper for your consumer group.
You need to connect to ZooKeeper and use the set command:
set /consumers/[groupId]/offsets/[topic]/[partitionId] -> long (offset)
E.g. setting offset 10 for partition 0 of topicname for the spark-app consumer group:
set /consumers/spark-app/offsets/topicname/0 10
When a consumer starts to consume messages from Kafka, it always starts from the last committed offset. If this last committed offset is not valid for any reason, the consumer applies the logic configured by the auto.offset.reset property.
Hope this helps.