Kafka Consumer subscription vs. assigned partition - apache-kafka

Kafka is confusing me. I am running it local with standard values.
only auto create topic turned on. 1 partition, 1 node, everything local and simple.
If it write
consumer.subscribe("test_topic");
consumer.poll(10);
It simply won't work and never finds any data.
If I instead assign a partition like
consumer.assign(new TopicPartition("test_topic",0));
and check the position I sit at 995. and now can poll and receive all the data my producer put in.
What is it that I don't understand about subscriptions? I don't need multiple consumers each handling only a part of the data. My consumer needs to get all the data of a certain topic. Why does the standard subscription approach not work for me that is shown in all the tutorials?
I do understand that partitions are for load balancing consumers. I don't understand what I do wrong with the subscription.
consumer config properties
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "postproc-" + EnvUtils.getAppInst()); // jeder ist eine eigene gruppe -> kriegt alles
props.put("enable.auto.commit", "true");
props.put("auto.commit.interval.ms", "1000");
props.put("session.timeout.ms", "30000");
props.put("key.deserializer", "org.apache.kafka.common.serialization.LongDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
KafkaConsumer<Long, byte[]> consumer = new KafkaConsumer<Long, byte[]>(props);
producer config
props.put("bootstrap.servers", "localhost:9092");
props.put("acks", "all");
props.put("retries", 2);
props.put("batch.size", 16384);
props.put("linger.ms", 5000);
props.put("buffer.memory", 1024 * 1024 * 10); // 10mb
props.put("key.serializer", "org.apache.kafka.common.serialization.LongSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
return new KafkaProducer(props);
producer execution
try (ByteArrayOutputStream out = new ByteArrayOutputStream()){
event.writeDelimitedTo(out);
for (long a = 10; a<20;a++){
long rand=new Random(a).nextLong();
producer.send(new ProducerRecord<>("test_topic",rand ,out.toByteArray()));
}
producer.flush();
}catch (IOException e){
consumer execution
consumer.subscribe(Arrays.asList("test_topic"));
ConsumerRecords<Long,byte[]> records = consumer.poll(10);
for (ConsumerRecord<Long,byte[]> r :records){ ...

I managed to solve the issue. The problem were timeouts. When piling I didn't give it enough time to complete. I assume assigning a partition just is a lot faster and therfore completed timely. The standard subscription poll takes longer. Never actually finished and did not commit.
At least I think that was the problem. With longer timeouts it works.

You are missing this property I think
auto.offset.reset=earliest
What to do when there is no initial offset in Kafka or if the current
offset does not exist any more on the server (e.g. because that data
has been deleted):
earliest: automatically reset the offset to the earliest offset
latest: automatically reset the offset to the latest offset
none: throw exception to the consumer if no previous offset is found for the consumer's group
anything else: throw exception to the consumer.
Reference: http://kafka.apache.org/documentation.html#highlevelconsumerapi

Related

How to achieve high performance of Spring Kafka Consumer

How do I increase the performance of Kafka consumer ?I have(and need) Atleast Once Kafka Consumer semantics
I have the below configuration.The processInDB() takes 2 minutes to complete .So just to process 10 messages(all in single partition) its taking 20 minutes(assuming 2 minutes per message). I can call processInDB in different thread but I can lose messages !.How can I process all 10 messages between 2 to 4 minutes window ?
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "grpid-mytopic120112141");
props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 10);
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
props.put(JsonDeserializer.TRUSTED_PACKAGES, "*");
ConcurrentKafkaListenerContainerFactory<String, ValidatedConsumerClass> factory = new ConcurrentKafkaListenerContainerFactory<>();
factory.setConsumerFactory(consumerFactory());
factory.getContainerProperties().setAckMode(AckMode.RECORD);
factory.setErrorHandler(errorHandler());
Below is my Kafka Consumer code.
#KafkaListener(id = "foo", topics = "mytopic-3", concurrency = "6", groupId = "mytopic-1-groupid")
public void consumeFromTopic1(#Payload #Valid ValidatedConsumerClass message, ConsumerRecordMetadata c) {
dbservice.processInDB(message);
}
Using a batch listener would help - you just need to hold up the consumer thread in the listener until all the individual records have completed processing.
In the next release (2.8.0-M1 milestone released today) there is support for out-of-order manual acknowledgments where the framework defers the commits until the "gaps are filled" https://docs.spring.io/spring-kafka/docs/2.8.0-M1/reference/html/#x28-ooo-commits
Another suggestion not purely related to spring kafka, as you stated in your tags that your also exploring the consumer api and not only spring kafka, so I am allowing to myself to suggest it here, you might want to test out this api
https://www.confluent.io/blog/introducing-confluent-parallel-message-processing-client/
https://github.com/confluentinc/parallel-consumer
its in alpha stage, so not recommended for production, but may keep an eye on that as well
But as stated in my earlier comments , you might just want to make more partitions

How to get the latest value from a kafka Stream

I am fairly new to Kafka and streaming.I have a requirement like every time I run the kafka producer and consumer I should get the only message produced by producer.
Below is the basic code for Producer and consumer
Producer
val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
val producer = new KafkaProducer[String, String](props)
val record = new ProducerRecord[String, String]("test", "key", jsonstring)
producer.send(record)
producer.close()
Consumer
val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("auto.offset.reset", "earliest")
props.put("group.id", "13")
val consumer: KafkaConsumer[String, Map[String,Any]] = new KafkaConsumer[String, Map[String,Any]](props)
consumer.subscribe(util.Arrays.asList("test"))
while (true) {
val record = consumer.poll(1000).asScala
for (data <- record.iterator){
println(data.value())
}
The Input Json I am using is the below
{
"id":1,
"Name":"foo"
}
Now the Problem I am facing is each time I run the program I am getting the duplicated values.For example If I run the code twice the consumer output looks like this
{
"id":1,
"Name":"foo"
}
{
"id":1,
"Name":"foo"
}
I want the output like if I run the program the only message that is processed by producer should be consume and should be printed.
I hv tried few things like changing the consumer properties for offset to latest
props.put("auto.offset.reset", "latest")
I also tried things mentioned like below but it didnot work for me
How can I get the LATEST offset of a kafka topic?
Can you please suggest any alternatives??
Consumer read messages from a topic partition on sequential order.
If you call to poll(), it returns records written to Kafka that consumers in our group have not read yet. Kafka tracks their consumption offset on each partition to know where to start to consume in case of restart.
Consumers maintain their partition offset in topic __consumer_offsets by using commit.
Commit is the action of updating the current position in
__consumer_offsets.
If a consumer restarted, In order to know where to start to consume, the consumer will read the latest committed offset of each partition and continue from there.
You can control the commit by two ways either set auto-commit true with commit interval
1.By enable.auto.commit true
props.put("enable.auto.commit", "true");
props.put("auto.commit.interval.ms", "1000");
2.Manual commit
consumer.commitAsync();//asyn commit
or
consumer.commitSync();//sync commit
If you fail to commit it will restart from the last committed position as shown on below pics
auto.offset.reset:
Once the consumer restarted the first time it uses auto.offset.reset to determine the initial position for each assigned partition. Please note when the group first created with a unique group id, before any messages have been consumed, the position is set according to a configurable offset reset policy (auto.offset.reset). After that, it will continue consuming message incrementally and use commit (as explained above) to track the latest consume message
Note: If the consumer crashes before any offset has been committed,
then the consumer which takes over its partitions will use the reset
policy.
So in your case
Either use manual offset commit or enable.auto.commit true for auto-commit.
Always use the same group id if you change group if it will treat different consumers and use auto.offset.reset to assign offset.
Reference: https://www.confluent.io/resources/kafka-the-definitive-guide/

Kafka producer behavior

I am trying latest kafka version 1.1.0.
I have point which bothering me about producer behavior.
Below is a small piece of code
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("acks", "all");
props.put("retries", 3);
props.put("max.in.flight.requests.per.connection", 1);
Producer<String, String> producer = new KafkaProducer<>(props);
for (int i = 0; i < 100; i++)
producer.send(new ProducerRecord<String, String>("my-topic",
Integer.toString(i), Integer.toString(i)), new CallBack());
assumptions
each message is sent to the same partition of a topic.
Size of each message is large enough that, it is submitted to the broker (not held in the buffer)
Now when the index is 0 and send method fails but for subsequent send call didn't fail, then, in that case, will the message's reach to broker in out of sequence. That is index 0 message will not be the first to reach to broker even if a retry code is added.
Will it be the same behavior if I add below configuration property
enable.idempotence=true
Is there any elegant approach to handle this situation? that is to maintain the order of messages
Thanks in advance

How to obtain Offset of Kafka consumer?

Working with Kafka(v2.11-0.10.1.0)-spark-streaming(v-2.0.1-bin-hadoop2.7).
I have Kafka Producer and Spark-streaming consumer to produce and consume. All works fine till I stop consumer(for approx 2-min) and start again. The consumer starts and reads data, absolutely perfect. But, I'm lost with the 2-min data, where consumer was off.
Kafka consumer/server.properties are unchanged.
Kafka producer with properties:
Properties properties = new Properties();
properties.put("bootstrap.servers", AppCoding.KAFKA_HOST);
properties.put("auto.create.topics.enable", true);
properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
properties.put("retries", 1);
logger.info("Initializing Kafka Producer.");
Producer<String, String> producer = new KafkaProducer<>(properties);
producer.send(new ProducerRecord<String, String>(AppCoding.KAFKA_TOPIC, "", documentAsString));
Consuming using Spark-streaming api as:
SparkConf sparkConf = new SparkConf().setMaster(args[4]).setAppName("Streaming");
// Create the context with 60 seconds batch size
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(60000 * 5));
//input arguments:localhost:2181 sparkS incoming 10 local[*]
Set<String> topicsSet = new HashSet<>(Arrays.asList(args[2].split(";")));
Map<String, String> kafkaParams = new HashMap<>();
kafkaParams.put("metadata.broker.list", args[0]);
//input arguments: localhost:9092 "" incoming 10 local[*]
JavaPairInputDStream<String, String> kafkaStream =
KafkaUtils.createDirectStream(jssc,
String.class,
String.class,
StringDecoder.class,
StringDecoder.class,
kafkaParams,
topicsSet);
On the other end i have been using ActiveMQ. While ActiveMQ Consumer could fetch me the data while its off.
Help me out if there's a confuguration problem.
In Kafka, consumers actually have no direct relationship with producers. Each consumer has an offset which tracks what has been consumed in the partitions. If a consumer has no offset tracked, Kafka will automatically reset its offset to the largest one because of the default value of config 'auto.offset.reset'. In your case, when the brand-new consumer is started, due to the default policy, it does not see the messages produced previously. You could set 'auto.offset.reset' to earliest (for new consumer) or smallest (for old consumer).
Kafka maintains offset per partition per record basis. While consumer was off for 2 minute duration, offset value would be stored in topic metadata for new-consumer, and again when the consumer is started back after 2minutes, it would read last offset which was stored in kafka topic.
I think what you need to check is kafka broker data retention policy if it is less than 2 minutes , data would be lost , if data corresponding to offset is not present , it would start reading from latest as by default value is set to latest auto.offset.reset=latest for new data arriving.
I would suggest to check and change kafka data retention policy accordingly if it is less than 2 minutes

kafka 0.90 consumer persist group between runs

I have built the following kafka consumer:
Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:6667");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "TEST1");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,"org.apache.kafka.common.serialization.StringDeserializer");
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,"org.apache.kafka.common.serialization.StringDeserializer");
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "10000");
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
props.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG,"1000");
this.kconsumer = new KafkaConsumer(props);
I want to the consumer to start with the earliest for this group when it is initiated. So the first time I run it, it works perfectly as expected. As long as the subscription exists and the connection is not closed it continues to increase the offset.
When I log in to kafka and run the following:
./kafka-consumer-groups.sh --bootstrap-server localhost:6667 --new-consumer --group TEST1 --describe
I see exactly what is expected, an increase in offset, etc. When the connection is closed however running the same command results in "Consumer group TEST1 does not exist or is rebalancing." Only it is not rebalancing, it is gone.
How do I persist the existence of the group when the consumer is not running? Am I missing a config in the consumer or in kafka?
As another note, when I alter the OFFSET parameter to "latest" I get no records at all unless new ones are loaded even though the records are not expired.
So bottom line, what I want to be able to do is spin up a new consumer with a given name, be able to pull from the earliest available record, shut down that consumer and if I start a consumer with that name again pull from where I left off. Any ideas of what I am missing? Or am I just misunderstanding how the high level consumer is meant to work at all?
In case someone comes across this and wants to know what I did. I was able to set the offset after determining if the group existed first. Doing it this way means if the group exists use "latest". If not, use "earliest".
private void buildConsumer(String offset)
{
Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:6667");
props.put(ConsumerConfig.GROUP_ID_CONFIG, this.groupId);
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,"org.apache.kafka.common.serialization.StringDeserializer");
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,"org.apache.kafka.common.serialization.StringDeserializer");
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "10000");
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, offset);
props.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG,"1000");
this.kconsumer = new KafkaConsumer(props);
}
/*
Check if the group exists before polling.
If it does, leave with default offset.
If it does not exists, set the offset to earliest to ensure you are getting all the records
*/
private void groupExists(String topic)
{
TopicPartition toc = new TopicPartition(topic, 0);
OffsetAndMetadata oam = kconsumer.committed(toc);
if(oam != null){
//do nothing, all is well, start from last commit
} else {
/*
when a new group is started the AUTO_OFFSET_RESET_CONFIG
needs to be set to earliest to ensure all records are picked up
Since that property can only be set at instantiation the consumer
must be rebuilt and resubscribed
*/
buildConsumer("earliest");
this.kconsumer.subscribe(Arrays.asList(topic));
}
}