How do I increase the performance of my Kafka consumer? I have (and need) at-least-once Kafka consumer semantics.
I have the configuration below. processInDB() takes 2 minutes to complete, so processing just 10 messages (all in a single partition) takes 20 minutes (assuming 2 minutes per message). I could call processInDB() on a different thread, but then I could lose messages! How can I process all 10 messages within a 2 to 4 minute window?
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "grpid-mytopic120112141");
props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 10);
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
props.put(JsonDeserializer.TRUSTED_PACKAGES, "*");
ConcurrentKafkaListenerContainerFactory<String, ValidatedConsumerClass> factory = new ConcurrentKafkaListenerContainerFactory<>();
factory.setConsumerFactory(consumerFactory());
factory.getContainerProperties().setAckMode(AckMode.RECORD);
factory.setErrorHandler(errorHandler());
Below is my Kafka Consumer code.
@KafkaListener(id = "foo", topics = "mytopic-3", concurrency = "6", groupId = "mytopic-1-groupid")
public void consumeFromTopic1(@Payload @Valid ValidatedConsumerClass message, ConsumerRecordMetadata c) {
dbservice.processInDB(message);
}
Using a batch listener would help - you just need to hold up the consumer thread in the listener until all the individual records have completed processing.
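For example, here is a rough sketch of that pattern (not your exact code: the pool size, the "batchFactory" bean name and the DbService type are assumptions based on your snippets, and the container factory additionally needs factory.setBatchListener(true)):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@Component
public class BatchDbListener {

    // size the pool for the expected batch size (max.poll.records is 10 in the question)
    private final ExecutorService pool = Executors.newFixedThreadPool(10);

    private final DbService dbservice; // assumed type of the existing dbservice bean

    public BatchDbListener(DbService dbservice) {
        this.dbservice = dbservice;
    }

    @KafkaListener(id = "foo", topics = "mytopic-3", containerFactory = "batchFactory")
    public void consumeBatch(List<ValidatedConsumerClass> messages) throws Exception {
        List<Future<?>> futures = new ArrayList<>();
        for (ValidatedConsumerClass message : messages) {
            futures.add(pool.submit(() -> dbservice.processInDB(message)));
        }
        for (Future<?> future : futures) {
            future.get(); // hold up the consumer thread until every record is processed
        }
    }
}

With max.poll.records=10 and a pool of 10 threads, a batch finishes in roughly the 2 minutes a single record takes, well within the default max.poll.interval.ms of 5 minutes, and offsets are only committed after the listener returns, so at-least-once semantics are preserved.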
In the next release (2.8.0-M1 milestone released today) there is support for out-of-order manual acknowledgments where the framework defers the commits until the "gaps are filled" https://docs.spring.io/spring-kafka/docs/2.8.0-M1/reference/html/#x28-ooo-commits
Another suggestion, not purely related to Spring Kafka: since you stated in your tags that you are also exploring the consumer API and not only Spring Kafka, I will allow myself to suggest it here. You might want to test out this API:
https://www.confluent.io/blog/introducing-confluent-parallel-message-processing-client/
https://github.com/confluentinc/parallel-consumer
It is in alpha stage, so it is not recommended for production, but you may want to keep an eye on it as well.
But as stated in my earlier comments, you might just want to create more partitions.
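If you do go that route, here is a rough sketch using the Admin client (the topic name and target partition count are assumptions based on your listener; the same can be done with kafka-topics.sh --alter --topic mytopic-3 --partitions 6):

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitions;

public class AddPartitions {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // raise the partition count so that concurrency = "6" can actually run 6 consumers in parallel
            admin.createPartitions(
                    Collections.singletonMap("mytopic-3", NewPartitions.increaseTo(6)))
                 .all().get();
        }
    }
}

Note that existing messages stay in their old partitions, and key-based ordering changes for newly produced messages once the partition count grows.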
I am using Kafka with Spring Boot. I read many articles on Kafka tuning and made some configuration changes on both the producer and consumer sides.
But Kafka still takes 1.10 seconds for one message to get from producer to consumer. Kafka is supposed to be very good at low latency, but I am not able to achieve it.
ProducerConfig:
@Bean
public Map<String, Object> producerConfigs() {
Map<String, Object> props = new HashMap<>();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.BATCH_SIZE_CONFIG, 20000);
props.put(ProducerConfig.LINGER_MS_CONFIG, 1000);
return props;
}
ConsumerConfig:
@Bean
public Map<String, Object> consumerConfigs() {
Map<String, Object> props = new HashMap<>();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
props.put(ConsumerConfig.RECEIVE_BUFFER_CONFIG, 502400);
return props;
}
Some points I want to describe:
Producer: for low latency, batch.size and linger.ms are very crucial in the producer configuration. I tried setting them as high as possible.
Consumer: for high throughput, the receive buffer size needs to be increased.
I made these two changes but still cannot achieve what I expected from Kafka.
How can I achieve low latency with Kafka? Is there any specific configuration on the producer and the consumer?
Any suggestions will help me.
There is a nice white paper on Optimizing your Kafka Deployment by Confluent which explains in full detail how to optimize the configuration for
latency
throughput
durability
availability
You are interested in reducing latency. Your current configuration can lead to high latency because linger.ms is set to 1000 (1 second). In addition, batch.size (20000) is quite high, so batches will rarely fill up before linger.ms expires, which means the producer will typically wait the full second before sending.
To reduce your latency, you can force your producer to send messages without any delay and irrespective of their size by setting linger.ms=0.
Overall, the white paper gives the following recommendations, which also match my experience in practice and which I can highly support (a sketch applying them to your config beans follows the list):
Producer:
linger.ms=0
compression.type=none
acks=1
Consumer:
fetch.min.bytes=1
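Applied to the beans from your question, that could look roughly like this (a sketch only; bootstrapServers is the field from your question):

@Bean
public Map<String, Object> producerConfigs() {
    Map<String, Object> props = new HashMap<>();
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
    props.put(ProducerConfig.LINGER_MS_CONFIG, 0);             // send immediately, do not wait to fill batches
    props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "none"); // no compression delay
    props.put(ProducerConfig.ACKS_CONFIG, "1");                // wait for the leader only
    return props;
}

@Bean
public Map<String, Object> consumerConfigs() {
    Map<String, Object> props = new HashMap<>();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
    props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 1);       // return fetches as soon as any data is available
    return props;
}

Keep in mind that acks=1 trades some durability for latency; if you need stronger delivery guarantees, the durability section of the white paper applies instead.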
I am trying to implement a very simple Kafka (0.9.0.1) consumer in Scala (code below).
To my understanding, Kafka (or rather ZooKeeper) stores, for each groupId, the offset of the last consumed message for a given topic. So, given the following scenario:
A consumer with groupId1 yesterday consumed the only 5 messages in a topic, so the last consumed message has offset 4 (counting the first message as offset 0).
During the night, 2 new messages arrive on the topic.
Today I restart the consumer with the same groupId1; there will be two options:
Option 1: The consumer will read the 2 new messages which arrived during the night, if I set the following property to "latest":
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest")
Option 2: The consumer will read all 7 messages in the topic, if I set the following property to "earliest":
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
Problem: if I change the groupId of the consumer to groupId2, which is a new groupId for the given topic (it has never consumed any messages before, so its latest offset should be 0), I was expecting that by setting
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest")
the consumer would, during its first execution, read all the messages stored in the topic (the equivalent of having earliest), and then for following executions consume just the new ones. However, this is not what happens.
If I set a new groupId and keep AUTO_OFFSET_RESET_CONFIG as latest, the consumer is not able to read any messages. What I need to do then is, for the first run, set AUTO_OFFSET_RESET_CONFIG to earliest, and once there is already an offset different from 0 for the groupId, I can move to latest.
Is this how my consumer should be working? Is there a better solution than switching the AUTO_OFFSET_RESET_CONFIG after the first time I run the consumer?
Below is the code I am using as a simple consumer:
import java.util.{Collections, Properties}
import java.util.concurrent.Executors

import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}

import scala.collection.JavaConversions._

class KafkaTestings {

  val brokers = "listOfBrokers"
  val groupId = "anyGroupId"
  val topic = "anyTopic"

  val props = createConsumerConfig(brokers, groupId)
  // the consumer instance was missing from the original snippet but is used in run()
  val consumer = new KafkaConsumer[String, String](props)

  def createConsumerConfig(brokers: String, groupId: String): Properties = {
    val props = new Properties()
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers)
    props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId)
    props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true")
    props.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, "1000")
    props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "30000")
    props.put(ConsumerConfig.CLIENT_ID_CONFIG, "12321")
    props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest")
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer")
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer")
    props
  }

  def run() = {
    consumer.subscribe(Collections.singletonList(this.topic))

    Executors.newSingleThreadExecutor.execute(new Runnable {
      override def run(): Unit = {
        while (true) {
          val records = consumer.poll(1000)
          for (record <- records) {
            println("Record: " + record.value)
          }
        }
      }
    })
  }
}

object ScalaConsumer extends App {
  val testConsumer = new KafkaTestings()
  testConsumer.run()
}
This was used as a reference to write this simple consumer
This is working as documented.
If you start a new consumer group (i.e. one for which there are no existing offsets stored in Kafka), you have to choose whether the consumer should start from the EARLIEST possible messages (the oldest message still available in the topic) or from the LATEST (only messages that are produced from now on).
Is there a better solution than switching the AUTO_OFFSET_RESET_CONFIG after the first time I run the consumer?
You can keep it at EARLIEST, because the second time you run the consumer, it will already have stored offsets and just pick up there. The reset policy is only used when a new consumer group is created.
Today I restart the consumer with the same groupId1; there will be two options:
Not really. Since the consumer group was running the day before, it will find its committed offsets and just pick up where it left off. So no matter what you set the reset policy to, it will get these two new messages.
Be aware though that Kafka does not store these offsets forever; I believe the default is just a week. So if you shut down your consumers for more than that, the offsets may be aged out, and you could run into an accidental reset to EARLIEST (which may be expensive for large topics). Given that, it is probably prudent to change it to LATEST anyway.
You can keep it at EARLIEST, because the second time you run the consumer, it will already have stored offsets and just pick up there. The reset policy is only used when a new consumer group is created.
In my testing, I often want to read from the earliest offset, but as noted, once you've read messages with a given groupId, your offset remains at that pointer.
I do this:
properties.put(ConsumerConfig.GROUP_ID_CONFIG, UUID.randomUUID().toString());
I am using Flink v1.4.0. I am consuming data from a Kafka topic using a Flink Kafka consumer, as per the code below:
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
// only required for Kafka 0.8
properties.setProperty("zookeeper.connect", "localhost:2181");
properties.setProperty("group.id", "test");
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

DataStream<String> stream = env
        .addSource(new FlinkKafkaConsumer08<>("topic", new SimpleStringSchema(), properties));
FlinkKafkaConsumer08<String> myConsumer = new FlinkKafkaConsumer08<>(...);
myConsumer.setStartFromEarliest(); // start from the earliest record possible
myConsumer.setStartFromLatest(); // start from the latest record
myConsumer.setStartFromGroupOffsets(); // the default behaviour
DataStream<String> stream = env.addSource(myConsumer);
...
Is there a way of knowing whether I have consumed the whole of the Topic? How can I monitor the offset? (Is that an adequate way of confirming that I have consumed all the data from within the Kafka Topic?)
Since Kafka is typically used with continuous streams of data, consuming "all" of a topic may or may not be a meaningful concept. I suggest you look at the documentation on how Flink exposes Kafka's metrics, which includes this explanation:
The difference between the committed offset and the most recent offset in
each partition is called the consumer lag. If the Flink topology is consuming
the data slower from the topic than new data is added, the lag will increase
and the consumer will fall behind. For large production deployments we
recommend monitoring that metric to avoid increasing latency.
So, if the consumer lag is zero, you're caught up. That said, you might wish to be able to compare the offsets yourself, but I don't know of an easy way to do that.
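If you do want to compare offsets yourself, one option is a small side tool using the plain Kafka consumer API. This is only a rough sketch, not Flink code: the topic and group names are placeholders, and it assumes the group actually commits its offsets to Kafka (the 0.8 connector commits to ZooKeeper instead, so this applies to the newer connectors).

import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class LagCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "test"); // the group.id used by the Flink job
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            List<TopicPartition> partitions = consumer.partitionsFor("topic").stream()
                    .map(p -> new TopicPartition(p.topic(), p.partition()))
                    .collect(Collectors.toList());
            Map<TopicPartition, Long> endOffsets = consumer.endOffsets(partitions);
            for (TopicPartition tp : partitions) {
                OffsetAndMetadata committed = consumer.committed(tp);
                long committedOffset = committed == null ? 0L : committed.offset();
                // lag == 0 means the group has consumed everything currently in the partition
                System.out.println(tp + " lag=" + (endOffsets.get(tp) - committedOffset));
            }
        }
    }
}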
Kafka is used as a streaming source, and a stream does not have an end.
If I'm not wrong, Flink's Kafka connector pulls data from a topic every X milliseconds, because all Kafka consumers are active consumers; Kafka does not notify you if there is new data inside a topic.
So, in your case, just set a timeout; if you don't read any data in that time, you have read all of the data inside your topic.
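Outside of Flink, that idea could look roughly like this with the plain consumer API (a sketch only: consumer is assumed to be an already configured KafkaConsumer<String, String>, and the topic name and timeout are arbitrary):

// poll until no records have arrived for idleTimeoutMs, then assume the topic is drained for now
long idleTimeoutMs = 10_000L;
long lastRecordSeenAt = System.currentTimeMillis();
consumer.subscribe(Collections.singletonList("topic"));
while (System.currentTimeMillis() - lastRecordSeenAt < idleTimeoutMs) {
    ConsumerRecords<String, String> records = consumer.poll(1000);
    if (!records.isEmpty()) {
        lastRecordSeenAt = System.currentTimeMillis();
        records.forEach(r -> System.out.println(r.value()));
    }
}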
Anyway, if you need to read a batch of finite data, you can use some of Flink's windows or introduce some kind of markers inside your Kafka topic to delimit the start and the end of the batch.
Previously I have been using the 0.8 API. When you pass a topics list to it, it returns a map of streams (one entry per topic). This allows me to spawn a separate thread and assign each topic's stream to it. Since there is a lot of data in each topic, spawning a separate thread helps with multitasking.
//0.8 code sample
Map<String, List<KafkaStream<byte[], byte[]>>> consumerMap =
consumer.createMessageStreams(topicCountMap);
I want to upgrade to 0.10. I checked the KafkaStreams and KafkaConsumer classes. The KafkaConsumer object takes config properties and provides a subscribe method that takes a topics list, but its return type is void. I cannot find a way to get a handle to each topic.
KafkaConsumer consumer = new KafkaConsumer(props);
consumer.subscribe(topicsList);
consumer.poll(long ms)
KafkaStreams on the other hand seems to have the same problem.
KStreamBuilder builder = new KStreamBuilder();
String [] topics = new String[] {"topic1", "topic2"};
KStream<byte[], byte[]> source = builder.stream(stringSerde, stringSerde, topics);
KafkaStreams streams = new KafkaStreams(builder, props);
streams.start();
There is a source.foreach() method available, but it is a stream of all topics. Anyone, any ideas?
First, using a multi threaded consumer is tricky, thus the pattern you employed in 0.8 is hopefully well designed :)
Best practice is to use a single-threaded consumer, and thus there is "no need" to separate different topics if a single consumer subscribes to a list of topics at once. Nevertheless, while consuming a record, the record object provides information about which topic it originates from (it carries this metadata). Thus, you could theoretically dispatch a record to a different thread for the actual processing according to its topic (even if this is not recommended!).
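To illustrate the single-consumer approach with per-topic handling on the polling thread, here is a rough sketch (topic names, group id and the handler methods are placeholders, not your code):

import java.util.Arrays;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class PerTopicDispatch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class);

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Arrays.asList("topic1", "topic2"));
            while (true) {
                ConsumerRecords<byte[], byte[]> records = consumer.poll(100);
                for (ConsumerRecord<byte[], byte[]> record : records) {
                    // the record itself tells you which topic it came from
                    if ("topic1".equals(record.topic())) {
                        handleTopic1(record);
                    } else {
                        handleTopic2(record);
                    }
                }
            }
        }
    }

    private static void handleTopic1(ConsumerRecord<byte[], byte[]> record) { /* placeholder */ }
    private static void handleTopic2(ConsumerRecord<byte[], byte[]> record) { /* placeholder */ }
}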
Kafka scales out via partitions, thus, if a single-threaded consumer is not able to handle the load, you should start multiple consumers (as a consumer group) to scale out your consumer processing capacity.
A more general question: if you want to process data per topic, why not use multiple consumers, each subscribing to a single topic?
Last but not least, in Apache Kafka 0.10+ the Kafka Streams API is a newly introduced stream processing library; it must not be confused with the 0.8 KafkaStream class (hint: there is no "s"). The two are completely unrelated to each other.
I am testing the Kafka High Level Consumer using the ConsumerGroupExample code from the Kafka site. I would like to retrieve all the existing messages on the topic called "test" that I have in the Kafka server config. Looking at other blogs, auto.offset.reset should be set to "smallest" to be able to get all messages:
private static ConsumerConfig createConsumerConfig(String a_zookeeper, String a_groupId) {
Properties props = new Properties();
props.put("zookeeper.connect", a_zookeeper);
props.put("group.id", a_groupId);
props.put("auto.offset.reset", "smallest");
props.put("zookeeper.session.timeout.ms", "10000");
return new ConsumerConfig(props);
}
The question I really have is this: what is the equivalent Java api call for the High Level Consumer that is the equivalent of:
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
Basically, every time a new consumer tries to consume a topic, it'll read messages from the beginning. If you're just consuming from the beginning each time for testing purposes, then every time you initialise your consumer with a new groupId, it'll read the messages from the beginning. Here's how I did it:
properties.put("group.id", UUID.randomUUID().toString());
and read messages from the beginning each time!
Looks like you need to use the "low level SimpleConsumer API"
For most applications, the high level consumer Api is good enough.
Some applications want features not exposed to the high level consumer
yet (e.g., set initial offset when restarting the consumer). They can
instead use our low level SimpleConsumer Api. The logic will be a bit
more complicated and you can follow the example in here.
This example worked for getting all messages from a topic with the following arguments (note that the port is the Kafka port, not the ZooKeeper port; the topics were set up as in this example):
10 my-replicated-topic 0 localhost 9092
Specifically, there is a method to get readOffset which takes kafka.api.OffsetRequest.EarliestTime():
long readOffset = getLastOffset(consumer,a_topic, a_partition, kafka.api.OffsetRequest.EarliestTime(), clientName);
Here is another post that may provide some alternate ideas on how to sort this out: How to get data from old offset point in Kafka?
To fetch messages from the beginning, you can do this:
import kafka.utils.ZkUtils;
ZkUtils.maybeDeletePath("zkhost:zkport", "/consumers/group.id");
then just follow the routine work...
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("auto.offset.reset", "earliest");
props.put("group.id", UUID.randomUUID().toString());
These properties will help you.