Kafka: KafkaConsumer not able to pull all records - apache-kafka

I'm pretty new at Kafka.
For the purpose of stress testing my cluster and building operational experience, I created two simple Java applications: one that repeatedly publishes messages to a topic (a sequence of integers), and another that loads the entire topic (all records) and verifies that the sequence is complete. The expectation is that no messages get lost due to operations on the cluster (restarting a node, replacing a node, reconfiguring topic partitions, etc.).
The topic "sequence" has two partitions and replication factor 3. The cluster is made of 3 virtual nodes (it's for testing purposes, hence they are running on the same machine). The topic is configured to retain all messages (retention.ms set to -1).
I currently have two issues that I'm having difficulty figuring out:
If I use bin/kafka-console-consumer.sh --bootstrap-server kafka-test-server:9090,kafka-test-server:9091,kafka-test-server:9092 --topic sequence --from-beginning, I see ALL messages (even though not ordered, which is expected) loaded on the console. On the other hand, if I use the consumer application that I wrote, I see different results being loaded at each cycle: https://i.stack.imgur.com/tMK10.png - In the console output, the first line after the divider is a call to records.partitions(), so records are only sometimes pulled from both partitions. Why is that, and why is the Java app not behaving like bin/kafka-console-consumer.sh?
When the topic gets too big, bin/kafka-console-consumer.sh is still able to show all messages, while the application is able to load only about 18'000 messages. I have tried playing around with the various consumer-side configurations, with no progress. Again, the question is: why is there a difference?
Thank you in advance for any hint!
For reference, here are the two apps discussed:
package ch.demo.toys;
import java.io.FileInputStream;
import java.util.Properties;
import java.util.concurrent.Future;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
public class SequenceProducer {
    public static void main(String[] args) throws Exception {
        Properties properties = new Properties();
        properties.load(new FileInputStream("toy.properties"));
        properties.put("key.serializer", "org.apache.kafka.common.serialization.IntegerSerializer");
        properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        properties.put("acks", "1");
        properties.put("retries", "3");
        properties.put("compression.type", "snappy");
        properties.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, 1);

        for (Integer sequence_i = 0; true; sequence_i++) {
            try (Producer<Integer, String> producer = new KafkaProducer<>(properties)) {
                ProducerRecord<Integer, String> record = new ProducerRecord<>("sequence", sequence_i, "Sequence number: " + String.valueOf(sequence_i));
                Future<RecordMetadata> sendFuture = producer.send(record, (metadata, exception) -> {
                    System.out.println("Adding " + record.key() + " to partition " + metadata.partition());
                    if (exception != null) {
                        exception.printStackTrace();
                    }
                });
            }
            Thread.sleep(200);
        }
    }
}
package ch.demo.toys;
import java.io.FileInputStream;
import java.time.Duration;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
public class CarthusianConsumer {

    private static Properties getProperties() throws Exception {
        Properties properties = new Properties();
        properties.load(new FileInputStream("toy.properties"));
        properties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, org.apache.kafka.common.serialization.IntegerDeserializer.class);
        properties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, org.apache.kafka.common.serialization.StringDeserializer.class);
        properties.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, Integer.MAX_VALUE);
        properties.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 60 * 1000);
        properties.put(ConsumerConfig.GROUP_ID_CONFIG, "carthusian-consumer");
        properties.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 60 * 1000);
        properties.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
        properties.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        properties.put(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, 1024 * 1024 * 1024);
        return properties;
    }

    private static boolean checkConsistency(List<Integer> sequence) {
        Collections.sort(sequence);
        Iterator<Integer> iterator = sequence.iterator();
        int control = 0;
        while (iterator.hasNext()) {
            int value = iterator.next();
            if (value != control) {
                System.out.println("");
                System.out.println("Gap found:");
                System.out.println("\tSequence: " + value);
                System.out.println("\tControl: " + control);
                return false;
            }
            control++;
        }
        System.out.print(".");
        return true;
    }

    public static void main(String[] args) throws Exception {
        // Step 1: create a base consumer object
        Consumer<Integer, String> consumer = new KafkaConsumer<>(getProperties());
        // Step 2: load topic configuration and build list of TopicPartitions
        List<TopicPartition> topicPartitions = consumer
            .partitionsFor("sequence")
            .stream()
            .parallel()
            .map(partitionInfo -> new TopicPartition(partitionInfo.topic(), partitionInfo.partition()))
            .collect(Collectors.toList());

        while (true) {
            List<Integer> sequence = new ArrayList<>();
            for (TopicPartition topicPartition : topicPartitions) {
                // Step 3. specify the topic-partition to "read" from
                // System.out.println("Partition specified: " + topicPartition);
                consumer.assign(Arrays.asList(topicPartition));
                // Step 4. set offset at the beginning
                consumer.seekToBeginning(Arrays.asList(topicPartition));
                // Step 5. get all records from topic-partition
                ConsumerRecords<Integer, String> records = consumer.poll(Duration.ofMillis(Long.MAX_VALUE));
                // System.out.println("\tCount: " + records.count());
                // System.out.println("\tPartitions: " + records.partitions());
                records.forEach(record -> { sequence.add(record.key()); });
            }
            System.out.println(sequence.size());
            checkConsistency(sequence);
            Thread.sleep(2500);
        }
    }
}

Thank you Mickael-Maison, here is my answer:
On the producer: thank you for the comment. I admit I took the example from the book and modified it directly without performance considerations.
On the consumer: as mentioned in the comments above, subscription was the first approach attempted, which unfortunately yielded the same result described in my question: results from individual partitions, and only rarely from both partitions in the same call. I'd also love to understand the reasons for this apparently random behavior!
More on consumer: I rewind to the beginning of the topic at every cycle, because the purpose is to continuously verify that the sequence did not break (hence no messages lost). At every cycle I load all the messages and check them.
Because the single call based on topic subscription yielded apparently random behavior (it's unclear when the full content of the topic is returned), I had to read from each individual partition and join the lists of records manually before checking them - which is not what I wanted to do initially!
Are my approaches wrong?

There are a few things you should change in your clients' logic.
Producer:
You are creating a new producer for each record you're sending. This is terrible in terms of performance, as each producer has to first bootstrap before sending a record. Also, as a single record is sent by each producer, no batching happens. Finally, compression on a single record is also nonexistent.
You should first create a Producer and use it to send all records, i.e. move the creation out of the loop, something like:
try (Producer<Integer, String> producer = new KafkaProducer<>(properties)) {
    for (int sequence_i = 18310; true; sequence_i++) {
        ProducerRecord<Integer, String> record = new ProducerRecord<>("sequence", sequence_i, "Sequence number: " + String.valueOf(sequence_i));
        producer.send(record, (metadata, exception) -> {
            System.out.println("Adding " + record.key() + " to partition " + metadata.partition());
            if (exception != null) {
                exception.printStackTrace();
            }
        });
        Thread.sleep(200L);
    }
}
Consumer:
At every iteration of the for loop, you change the assignment and seek back to the beginning of the partition, so at best you will reconsume the same messages every time!
To begin, you should probably use the subscribe() API (like kafka-console-consumer.sh), so you don't have to fiddle with partitions. For example:
try (Consumer<Integer, String> consumer = new KafkaConsumer<>(properties)) {
    consumer.subscribe(Collections.singletonList("topic"));
    while (true) {
        List<Integer> sequence = new ArrayList<>();
        ConsumerRecords<Integer, String> records = consumer.poll(Duration.ofSeconds(1L));
        records.forEach(record -> {
            sequence.add(record.key());
        });
        System.out.println(sequence.size());
        checkConsistency(sequence);
        Thread.sleep(2500L);
    }
}
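One more note, not part of the original answer: a single poll() call only ever returns one batch of records, so it is not guaranteed to return everything up to the end of the log, which may explain the ~18'000-record ceiling you observed. If the goal really is to re-read the entire topic on every cycle, a hedged sketch (reusing topicPartitions and checkConsistency() from the question's code) is to keep polling until every partition has reached its end offset:
// Illustration only: drain the topic completely by polling until every
// partition's position has caught up with its end offset.
consumer.assign(topicPartitions);
consumer.seekToBeginning(topicPartitions);
Map<TopicPartition, Long> endOffsets = consumer.endOffsets(topicPartitions);

List<Integer> sequence = new ArrayList<>();
boolean caughtUp = false;
while (!caughtUp) {
    ConsumerRecords<Integer, String> records = consumer.poll(Duration.ofSeconds(1));
    records.forEach(record -> sequence.add(record.key()));
    caughtUp = topicPartitions.stream()
            .allMatch(tp -> consumer.position(tp) >= endOffsets.get(tp));
}
checkConsistency(sequence);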

Related

Writing <String,Long> key,value pair from kafkaTable to Kafkatopic causes corrupt values on topic

I am trying to run a Kafka Streams example that counts favourite colours produced to a topic. But when writing from the KTable to the final topic, I cannot figure out why the Long value is not written to the output topic. Below is my code; please help me figure out what I am doing wrong.
package org.example;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.TopicCollection;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.*;
import org.apache.kafka.streams.kstream.*;
import java.util.Arrays;
import java.util.Collections;
import java.util.Properties;
public class FavouriteColorStreamsExample {
    public static void main(String[] args) {
        Properties config = new Properties();
        config.setProperty(StreamsConfig.APPLICATION_ID_CONFIG, "streams-fav-color-app");
        config.setProperty(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        config.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        config.setProperty(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
        config.setProperty(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
        // we disable the cache to demonstrate all the "steps" involved in the transformation - not recommended in prod
        config.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, "0");

        StreamsBuilder streamsBuilder = new StreamsBuilder();

        // stream from kafka topic
        /*
        Produce below messages to fav-color-input topic
        stephane,blue
        john,green
        stephane,red
        alice,red
        */
        KStream<String, String> textLines = streamsBuilder.stream("fav-color-input");

        // let's keep only blue, green, red colour inputs
        KStream<String, String> userAndColours = textLines
            // keep only values that contain a comma (user,colour)
            .filter((key, val) -> val.contains(","))
            // we get the key as stephane, john, alice
            .selectKey((key, val) -> val.split(",")[0].toLowerCase())
            .mapValues(val -> val.split(",")[1].toLowerCase())
            // filter with colors
            .filter((user, val) -> (Arrays.asList("blue", "red", "green").contains(val)));

        userAndColours.to("fav-color-output-temp");

        // read from the intermediary topic "fav-color-output-temp" into a KTable
        KTable<String, String> usersAndColoursTable = streamsBuilder.table("fav-color-output-temp");

        KTable<String, Long> favouriteColour = usersAndColoursTable
            .groupBy((user, color) -> new KeyValue<>(color, color))
            .count(Named.as("CountsByColours"));

        favouriteColour.toStream()
            .peek((key, value) -> System.out.println("Key : " + key + " => value : " + value));

        favouriteColour.toStream()
            .to("fav-color-output", Produced.with(Serdes.String(), Serdes.Long()));

        streamsBuilder.stream("fav-color-output")
            .peek((key, value) -> System.out.println("Key : " + key + " => value : " + value));

        Topology topology = streamsBuilder.build(config);
        KafkaStreams kafkaStreams = new KafkaStreams(topology, config);

        // only on dev - not in prod
        kafkaStreams.cleanUp();
        kafkaStreams.start();

        // shutdown hook to correctly close the streams application
        Runtime.getRuntime().addShutdownHook(new Thread(kafkaStreams::close));
    }
}
This line, where we write from the KTable back to a Kafka topic, is the one that I guess does not work for some reason:
to("fav-color-output", Produced.with(Serdes.String(),Serdes.Long()));
My consumer's output for the topic is just blank lines, and my IntelliJ shows these corrupt values. Screenshots below:
[1]: https://i.stack.imgur.com/c8sMp.png
[2]: https://i.stack.imgur.com/WbE8t.png
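Side note (an assumption on my part, not something stated in the post): the last streamsBuilder.stream("fav-color-output") call reads the Long-valued output topic with the default String serde configured above, and an external consumer using a String deserializer would show similarly garbled values. A minimal sketch of reading that topic with explicit serdes, assuming the counts really are written as longs:
// Hypothetical check: consume the output topic with the serdes it was written with
streamsBuilder.stream("fav-color-output", Consumed.with(Serdes.String(), Serdes.Long()))
        .peek((key, value) -> System.out.println("Key : " + key + " => value : " + value));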

Kafka Consumer Poll runs indefinitely and doesn't return anything

I am facing difficulty with KafkaConsumer.poll(Duration timeout): it runs indefinitely and never comes out of the method. I understand that this could be related to the connection, and I have seen it be a bit inconsistent at times. How do I handle this, should poll() stop responding? Given below is the snippet from KafkaConsumer.poll():
public ConsumerRecords<K, V> poll(final Duration timeout) {
    return poll(time.timer(timeout), true);
}
and I am calling the above from here :
Duration timeout = Duration.ofSeconds(30);
while (true) {
    final ConsumerRecords<recordID, topicName> records = consumer.poll(timeout);
    System.out.println("record count is" + records.count());
}
I am getting the below error:
org.apache.kafka.common.errors.SerializationException: Error
deserializing key/value for partition at offset 2. If
needed, please seek past the record to continue consumption.
I stumbled upon some useful information while trying to fix the problem I was facing above. I will provide the piece of code which should be able to handle this, but before that it is important to know what causes this.
While producing or consuming messages or data to Apache Kafka, we need a schema structure for that message or data; in my case, an Avro schema. If a message produced to Kafka conflicts with that schema, it will have an effect on consumption.
Add the code below to your consumer, in the method where it consumes records.
Remember to import the packages below:
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.errors.SerializationException;
try {
    while (true) {
        ConsumerRecords<String, GenericRecord> records = null;
        try {
            records = consumer.poll(10000);
        } catch (SerializationException e) {
            // Parse the partition and offset out of the exception message
            String s = e.getMessage().split("Error deserializing key/value for partition ")[1]
                    .split(". If needed, please seek past the record to continue consumption.")[0];
            String topics = s.split("-")[0];
            int offset = Integer.valueOf(s.split("offset ")[1]);
            int partition = Integer.valueOf(s.split("-")[1].split(" at")[0]);
            TopicPartition topicPartition = new TopicPartition(topics, partition);
            // log.info("Skipping " + topic + "-" + partition + " offset " + offset);
            consumer.seek(topicPartition, offset + 1);
            continue; // skip the bad record and poll again
        }
        for (ConsumerRecord<String, GenericRecord> record : records) {
            System.out.printf("value = %s \n", record.value());
        }
    }
} finally {
    consumer.close();
}
I ran into this while setting up a test environment.
Running the following command on the broker printed out the stored records as one would expect:
bin/kafka-console-consumer.sh --bootstrap-server="localhost:9092" --topic="foo" --from-beginning
It turned out that the Kafka server was misconfigured. To connect from an external IP address, the listeners setting must have a valid value in kafka/config/server.properties, e.g.
# The address the socket server listens on. It will get the value returned from
# java.net.InetAddress.getCanonicalHostName() if not configured.
# FORMAT:
# listeners = listener_name://host_name:port
# EXAMPLE:
# listeners = PLAINTEXT://your.host.name:9092
listeners=PLAINTEXT://:9092
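Depending on the setup, the advertised address can matter as well; a hedged sketch only (the host name is a placeholder):
# Hedged example: bind on all interfaces and advertise a host name reachable by clients
listeners=PLAINTEXT://0.0.0.0:9092
advertised.listeners=PLAINTEXT://your.external.host.name:9092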

Python librdkafka producer performance against the native Apache Kafka producer

I am testing the Apache Kafka producer's native Java implementation against Python's confluent-kafka to see which has the maximum throughput.
I am deploying a Kafka cluster with 3 Kafka brokers and 3 ZooKeeper instances using docker-compose. My docker-compose file: https://paste.fedoraproject.org/paste/bn7rr2~YRuIihZ06O3Q6vw/raw
It's very simple code with mostly default options for Python confluent-kafka, and some config changes in the Java producer to match those of confluent-kafka.
Python Code:
from confluent_kafka import Producer
producer = Producer({'bootstrap.servers': 'kafka-1:19092,kafka-2:29092,kafka-3:39092', 'linger.ms': 300, "max.in.flight.requests.per.connection": 1000000, "queue.buffering.max.kbytes": 1048576, "message.max.bytes": 1000000,
'default.topic.config': {'acks': "all"}})
ss = '0123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357'
def f():
    import time
    start = time.time()
    for i in xrange(1000000):
        try:
            producer.produce('test-topic', ss)
        except Exception:
            producer.poll(1)
            try:
                producer.produce('test-topic', ss)
            except Exception:
                producer.flush(30)
                producer.produce('test-topic', ss)
        producer.poll(0)
    producer.flush(30)
    print(time.time() - start)

if __name__ == '__main__':
    f()
Java implementation, with the configuration matching librdkafka's. I changed linger.ms and the callback as suggested by Edenhill.
package com.amit.kafka;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.serialization.LongSerializer;
import org.apache.kafka.common.serialization.StringSerializer;
import java.nio.charset.Charset;
import java.util.Properties;
import java.util.concurrent.TimeUnit;
public class KafkaProducerExampleAsync {
    private final static String TOPIC = "test-topic";
    private final static String BOOTSTRAP_SERVERS = "kafka-1:19092,kafka-2:29092,kafka-3:39092";

    private static Producer<String, String> createProducer() {
        int bufferMemory = 67108864;
        int batchSizeBytes = 1000000;
        String acks = "all";
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, BOOTSTRAP_SERVERS);
        props.put(ProducerConfig.CLIENT_ID_CONFIG, "KafkaExampleProducer");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, LongSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, batchSizeBytes);
        props.put(ProducerConfig.LINGER_MS_CONFIG, 100);
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, bufferMemory);
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 1000000);
        props.put(ProducerConfig.ACKS_CONFIG, acks);
        return new KafkaProducer<>(props);
    }

    static void runProducer(final int sendMessageCount) throws InterruptedException {
        final Producer<String, String> producer = createProducer();
final String msg = "0123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357";
        final ProducerRecord<String, String> record = new ProducerRecord<>(TOPIC, msg);
        final long[] new_time = new long[1];
        try {
            for (long index = 0; index < sendMessageCount; index++) {
                producer.send(record, new Callback() {
                    public void onCompletion(RecordMetadata metadata, Exception e) {
                        // This if-else is to only start timing this when first message reach kafka
                        if (e != null) {
                            e.printStackTrace();
                        } else {
                            if (new_time[0] == 0) {
                                new_time[0] = System.currentTimeMillis();
                            }
                        }
                    }
                });
            }
        } finally {
            // producer.flush();
            producer.close();
            System.out.printf("Total time %d ms\n", System.currentTimeMillis() - new_time[0]);
        }
    }

    public static void main(String... args) throws Exception {
        if (args.length == 0) {
            runProducer(1000000);
        } else {
            runProducer(Integer.parseInt(args[0]));
        }
    }
}
Benchmark results (edited after making the changes recommended by Edenhill):
Acks = 0, Messages: 1000000
Java: 12.066
Python: 9.608 seconds
Acks: all, Messages: 1000000
Java: 45.763 11.917 seconds
Python: 14.3029 seconds
The Java implementation is performing the same as the Python implementation, even after making all the changes I could think of plus the ones suggested by Edenhill in the comments below.
There are various benchmarks about the performance of Kafka in Python but I couldn't find any comparing librdkafka or python Kafka against Apache Kafka.
I have two questions:
Is this test enough to conclude that, with default configs and messages of size 1 KB, librdkafka is faster?
Does anyone have experience or a source (blog, doc, etc.) benchmarking librdkafka against confluent-kafka?
The Python client uses librdkafka, which overrides some of Kafka's default configuration.
Parameter = Kafka default
batch.size = 16384
max.in.flight.requests.per.connection = 5 (librdkafka's default is 1000000)
message.max.bytes in librdkafka may be equivalent to max.request.size.
I think there is no equivalent of librdkafka's queue.buffering.max.messages in Kafka's producer API. If you find something, then correct me.
Also, remove the buffer.memory parameter from the Java program.
https://kafka.apache.org/documentation/#producerconfigs
https://github.com/edenhill/librdkafka/blob/master/CONFIGURATION.md
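To make the comparison closer, the Java producer's batching settings could be aligned with the parameters listed above; this is only a sketch, mirroring librdkafka's defaults and the values used in the Python code:
// Sketch: approximate the librdkafka / Python configuration on the Java side
props.put(ProducerConfig.BATCH_SIZE_CONFIG, 16384);                        // Kafka default, kept explicit
props.put(ProducerConfig.LINGER_MS_CONFIG, 300);                           // linger.ms used in the Python code
props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 1000000);  // librdkafka default
props.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, 1000000);                // roughly librdkafka's message.max.bytes
// and drop ProducerConfig.BUFFER_MEMORY_CONFIG as suggested above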
The next thing is that Java takes some time to load classes, so you need to increase the number of messages your producer produces. It would be great if it took at least 20-30 minutes to produce all the messages. Then you can compare the Java client with the Python client.
I like the idea of a comparison between the Python and Java clients. Keep posting your results on Stack Overflow.

Using Kafka through Observable(RxJava)

I have a producer (using Kafka), and more than one consumer. So I publish a message in a topic and then my consumers receive and process the message.
I need to receive a response in the producer from at least one consumer (ideally the first). I'm trying to use RxJava (observables) to do it.
Is it possible to do in that way? Anyone have an example?
Here is how I am using rxjava '2.2.6' without any additional dependencies to process Kafka events:
import io.reactivex.Observable;
import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;
...
// Load consumer props
Properties props = new Properties();
props.load(KafkaUtils.class.getClassLoader().getResourceAsStream("kafka-client.properties"));
// Create a consumer
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
// Subscribe to topics
consumer.subscribe(Arrays.asList(props.getProperty("kafkaTopics").split("\\s*,\\s*")));
// Create an Observable for topic events
Observable<ConsumerRecords<String, String>> observable = Observable.fromCallable(() -> {
    // fromCallable performs one poll when subscribed
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(10));
    return records;
});
// Process Observable events
observable.subscribe(records -> {
    if ((records != null) && (!records.isEmpty())) {
        for (ConsumerRecord<String, String> record : records) {
            System.out.println(record.offset() + ": " + record.value());
        }
    }
});
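Note that Observable.fromCallable only emits the result of a single poll(). For a continuous stream, one option is to re-subscribe with repeat(); this is a sketch only, and since KafkaConsumer is not thread-safe it assumes a single-threaded subscription:
// Sketch: poll repeatedly by re-subscribing to the callable after each completion
Observable<ConsumerRecords<String, String>> stream =
        Observable.fromCallable(() -> consumer.poll(Duration.ofSeconds(10)))
                .repeat();
stream.subscribe(records ->
        records.forEach(record -> System.out.println(record.offset() + ": " + record.value())));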
You can use the kafka-rx library (linked below) as follows:
val consumer = new RxConsumer("zookeeper:2181", "consumer-group")
consumer.getRecordStream("cool-topic-(x|y|z)")
.map(deserialize)
.take(42 seconds)
.foreach(println)
consumer.shutdown()
For more information see:
https://github.com/cjdev/kafka-rx
It would be better if you shared your solution first of all...
Since Spring Cloud Stream is a streaming solution, not request/reply, there is no example to share with you.
You can consider making your consumers producers as well, and in the original producer have a consumer that reads from the reply topic. Finally, you will have to correlate the reply data with the request data.
RxJava or any other implementation details aren't really relevant here.
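As an illustration of that correlation idea (all topic names and variables here are hypothetical, not from the original question):
// Sketch: the requester tags the request with a correlation id (used as the record key)
// and waits for the first record on a reply topic carrying the same id.
String correlationId = UUID.randomUUID().toString();
producer.send(new ProducerRecord<>("requests", correlationId, "some payload"));

replyConsumer.subscribe(Collections.singletonList("replies"));
boolean answered = false;
while (!answered) {
    ConsumerRecords<String, String> replies = replyConsumer.poll(Duration.ofMillis(500));
    for (ConsumerRecord<String, String> reply : replies) {
        if (correlationId.equals(reply.key())) {
            System.out.println("First reply wins: " + reply.value());
            answered = true;
            break;
        }
    }
}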

Why doesn't a new consumer group id start from the beginning

I have a kafka 0.10 cluster with several topics that have messages produced to them.
When I subscribe to the topics with a KafkaConsumer and a new group Id I get no records returned, but if I subscribe to the topics with a ConsumerRebalanceListener that seeks to the beginning with the same group Id, then I get the records in the topic.
#Grab('org.apache.kafka:kafka-clients:0.10.0.0')
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.clients.consumer.ConsumerRecords
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.PartitionInfo
Properties props = new Properties()
props.with {
    put("bootstrap.servers","***********:9091")
    put("group.id","script-test-noseek")
    put("enable.auto.commit","true")
    put("key.deserializer","org.apache.kafka.common.serialization.StringDeserializer")
    put("value.deserializer","org.apache.kafka.common.serialization.StringDeserializer")
    put("session.timeout.ms",30000)
}
KafkaConsumer consumer = new KafkaConsumer(props)

def topicMap = [:]
consumer.listTopics().each { topic, partitioninfo ->
    topicMap[topic] = 0
}

topicMap.each { topic, count ->
    def stopTime = new Date().time + 30_000
    def stop = false
    println "Starting topic: $topic"
    consumer.subscribe([topic])
    //consumer.subscribe([topic], new CRListener(consumer:consumer))
    while (!stop) {
        ConsumerRecords<String, String> records = consumer.poll(5_000)
        topicMap[topic] += records.size()
        consumer.commitAsync()
        if (new Date().time > stopTime || records.size() == 0) {
            stop = true
        }
    }
    consumer.unsubscribe()
}

def total = 0
println "------------------- Results -----------------------"
topicMap.each { k, v ->
    if (v > 0) {
        println "Topic: ${k.padRight(64,' ')} Records: ${v}"
    }
    total += v
}
println "==================================================="
println "Total: ${total}"
def dummy = "Process End"

class CRListener implements ConsumerRebalanceListener {
    KafkaConsumer consumer
    void onPartitionsAssigned(java.util.Collection partitions) {
        consumer.seekToBeginning(partitions)
    }
    void onPartitionsRevoked(java.util.Collection partitions) {
        consumer.commitSync()
    }
}
The code is Groovy 2.4.x, and I've masked the bootstrap server.
If I uncomment the consumer subscribe line with the listener, it does what I expect it to do. But as it is I get no results.
Assume that I change the group Id for each run, just so as to not be picking up where another execution leaves off.
I can't see what I'm doing wrong. Any help would be appreciated.
If you use a new consumer group id and want to read the whole topic from the beginning, you need to specify the parameter "auto.offset.reset=earliest" in your properties (the default value is "latest").
Properties props = new Properties()
props.with {
    // all other values...
    put("auto.offset.reset","earliest")
}
On consumer start-up the following happens:
look for a (valid) committed offset for the configured group.id
if (valid) offset is found, resume from there
if no (valid) offset is found, set offset according to auto.offset.reset
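To see which of these cases applies, the committed offset for the group can be inspected before polling. This is only a sketch using the Java client API (the topic name is hypothetical); committed() returns null when the group has no committed offset yet:
// Sketch: check whether this group.id already has a committed offset for a partition
TopicPartition partition = new TopicPartition("my-topic", 0);
consumer.assign(Collections.singletonList(partition));
OffsetAndMetadata committed = consumer.committed(partition);
if (committed == null) {
    System.out.println("No committed offset for this group.id -> auto.offset.reset decides where to start");
} else {
    System.out.println("Will resume from committed offset " + committed.offset());
}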