I am using Kafka with Spring Boot. I have read many articles on Kafka tuning and made some configuration changes on both the producer and consumer sides.
But it still takes about 1.10 seconds for one message to get from the producer to the consumer. Kafka is supposed to be well suited for low-latency delivery, but I am not able to achieve that.
Producer config:
@Bean
public Map<String, Object> producerConfigs() {
    Map<String, Object> props = new HashMap<>();
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
    props.put(ProducerConfig.BATCH_SIZE_CONFIG, 20000);
    props.put(ProducerConfig.LINGER_MS_CONFIG, 1000);
    return props;
}
Consumer config:
@Bean
public Map<String, Object> consumerConfigs() {
    Map<String, Object> props = new HashMap<>();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
    props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
    props.put(ConsumerConfig.RECEIVE_BUFFER_CONFIG, 502400);
    return props;
}
Some points I want to describe:
Producer: for low latency, batch.size and linger.ms are said to be very crucial in the producer configuration, so I tried setting them as high as possible.
Consumer: for high throughput, the receive buffer size needs to be increased.
I made these two changes but still cannot achieve what I expected from Kafka.
How can I achieve low latency with Kafka? Is there any specific configuration on the producer and consumer side?
Any suggestions would help.
There is a nice white paper, Optimizing Your Kafka Deployment, by Confluent, which explains in full detail how to optimize the configuration for
latency
throughput
durability
availability
You are interested in reducing latency. Your current configuration can lead to high latency because linger.ms is set to 1000 (1 second). In addition, batch.size is quite high, which together with linger.ms probably results in a delay of at least 1 second.
To reduce your latency, you can force your producer to send messages without any delay and irrespective of their size by setting linger.ms=0.
Overall, the white paper gives the following recommendations, which also match what I have experienced in practice and can fully support:
Producer:
linger.ms=0
compression.type=none
acks=1
Consumer:
fetch.min.bytes=1
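As a rough illustration, here is how those recommendations could be applied to the Spring producer/consumer beans from the question. This is only a sketch under the assumption that your KafkaTemplate and listener container are built from these maps, not a definitive tuning recipe:
@Bean
public Map<String, Object> producerConfigs() {
    Map<String, Object> props = new HashMap<>();
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
    // Send immediately instead of waiting up to 1 second for a batch to fill.
    props.put(ProducerConfig.LINGER_MS_CONFIG, 0);
    // Skip compression to avoid the extra CPU time on both ends.
    props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "none");
    // Wait only for the leader's acknowledgment, not for all replicas.
    props.put(ProducerConfig.ACKS_CONFIG, "1");
    return props;
}

@Bean
public Map<String, Object> consumerConfigs() {
    Map<String, Object> props = new HashMap<>();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
    // Answer fetch requests as soon as a single byte is available (the default).
    props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 1);
    return props;
}
Note that acks=1 trades a little durability for latency; keep acks=all if you cannot afford to lose messages on a leader failure.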
Related
How do I increase the performance of a Kafka consumer? I have (and need) at-least-once Kafka consumer semantics.
I have the configuration below. processInDB() takes 2 minutes to complete, so just to process 10 messages (all in a single partition) it takes 20 minutes (assuming 2 minutes per message). I could call processInDB on a different thread, but then I can lose messages! How can I process all 10 messages within a 2 to 4 minute window?
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "grpid-mytopic120112141");
props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 10);
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
props.put(JsonDeserializer.TRUSTED_PACKAGES, "*");
ConcurrentKafkaListenerContainerFactory<String, ValidatedConsumerClass> factory = new ConcurrentKafkaListenerContainerFactory<>();
factory.setConsumerFactory(consumerFactory());
factory.getContainerProperties().setAckMode(AckMode.RECORD);
factory.setErrorHandler(errorHandler());
Below is my Kafka consumer code.
@KafkaListener(id = "foo", topics = "mytopic-3", concurrency = "6", groupId = "mytopic-1-groupid")
public void consumeFromTopic1(@Payload @Valid ValidatedConsumerClass message, ConsumerRecordMetadata c) {
    dbservice.processInDB(message);
}
Using a batch listener would help - you just need to hold up the consumer thread in the listener until all the individual records have completed processing.
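As a rough sketch of that idea (the factory bean name, the ValidatedConsumerClass payload type and dbservice are carried over from the question; the thread-pool size is an arbitrary example), a batch listener can fan the records out to an executor and block until every record is done before returning, so offsets are only committed once the whole batch has been processed:
@Bean
public ConcurrentKafkaListenerContainerFactory<String, ValidatedConsumerClass> batchFactory(
        ConsumerFactory<String, ValidatedConsumerClass> consumerFactory) {
    ConcurrentKafkaListenerContainerFactory<String, ValidatedConsumerClass> factory =
            new ConcurrentKafkaListenerContainerFactory<>();
    factory.setConsumerFactory(consumerFactory);
    factory.setBatchListener(true); // deliver the whole poll result as a List
    return factory;
}

private final ExecutorService executor = Executors.newFixedThreadPool(10);

@KafkaListener(id = "foo", topics = "mytopic-3", containerFactory = "batchFactory")
public void consumeBatch(List<ValidatedConsumerClass> messages) throws InterruptedException {
    // Submit each record for parallel processing...
    List<Callable<Void>> tasks = new ArrayList<>();
    for (ValidatedConsumerClass message : messages) {
        tasks.add(() -> { dbservice.processInDB(message); return null; });
    }
    // ...and hold up the consumer thread until all of them have completed,
    // so nothing is acknowledged before it has actually been processed.
    executor.invokeAll(tasks);
}
Keep max.poll.interval.ms large enough to cover the time a whole batch can take, otherwise the consumer will be kicked out of the group mid-batch.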
In the next release (2.8.0-M1 milestone released today) there is support for out-of-order manual acknowledgments where the framework defers the commits until the "gaps are filled" https://docs.spring.io/spring-kafka/docs/2.8.0-M1/reference/html/#x28-ooo-commits
Another suggestion is not purely related to Spring Kafka: since you stated in your tags that you are also exploring the plain consumer API and not only Spring Kafka, I am allowing myself to suggest it here. You might want to test out this API:
https://www.confluent.io/blog/introducing-confluent-parallel-message-processing-client/
https://github.com/confluentinc/parallel-consumer
It is in alpha stage, so it is not recommended for production, but you may want to keep an eye on it as well.
But as stated in my earlier comments, you might simply want to create more partitions.
My configuration:
props.put("sasl.mechanism", "PLAIN");
props.put("enable.auto.commit", "false");
props.put("auto.commit.interval.ms", "1000");
props.put("auto.offset.reset", "latest");
props.put("request.timeout.ms", 16000);
props.put("max.partition.fetch.bytes", "4194304");
props.put("max.poll.records","3000");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
ConsumerRecords consumerRecords = consumer.poll(100);
You can increase the consumer configuration fetch.min.bytes, which defaults to 1, to ensure more data is fetched at once. However, if there is simply no new data arriving at the topic within 16 seconds (as set in your request.timeout.ms configuration), you will still see polls that fetch zero messages.
Here is the full description of the configuration fetch.min.bytes from the Kafka documentation on consumer configs:
fetch.min.bytes: The minimum amount of data the server should return for a fetch request. If insufficient data is available the request will wait for that much data to accumulate before answering the request. The default setting of 1 byte means that fetch requests are answered as soon as a single byte of data is available or the fetch request times out waiting for data to arrive. Setting this to something greater than 1 will cause the server to wait for larger amounts of data to accumulate which can improve server throughput a bit at the cost of some additional latency.
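As a minimal sketch (the broker address, group ID and topic name are placeholders), fetch.min.bytes is usually tuned together with fetch.max.wait.ms, which bounds how long the broker will wait for that minimum amount of data to accumulate:
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
props.put("group.id", "my-consumer-group");         // placeholder group
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
// Ask the broker to hold the fetch until at least ~64 KB are available...
props.put("fetch.min.bytes", "65536");
// ...but never longer than 500 ms, so latency stays bounded on a quiet topic.
props.put("fetch.max.wait.ms", "500");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("my-topic"));  // placeholder topic
ConsumerRecords<String, String> records = consumer.poll(100);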
I am working with Kafka (v2.11-0.10.1.0) and Spark Streaming (v2.0.1-bin-hadoop2.7).
I have a Kafka producer and a Spark Streaming consumer to produce and consume. Everything works fine until I stop the consumer (for approximately 2 minutes) and start it again. The consumer starts and reads data absolutely perfectly, but I have lost the 2 minutes of data produced while the consumer was off.
The Kafka consumer/server.properties are unchanged.
Kafka producer with properties:
Properties properties = new Properties();
properties.put("bootstrap.servers", AppCoding.KAFKA_HOST);
properties.put("auto.create.topics.enable", true);
properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
properties.put("retries", 1);
logger.info("Initializing Kafka Producer.");
Producer<String, String> producer = new KafkaProducer<>(properties);
producer.send(new ProducerRecord<String, String>(AppCoding.KAFKA_TOPIC, "", documentAsString));
Consuming using Spark-streaming api as:
SparkConf sparkConf = new SparkConf().setMaster(args[4]).setAppName("Streaming");
// Create the context with 60 seconds batch size
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(60000 * 5));
//input arguments:localhost:2181 sparkS incoming 10 local[*]
Set<String> topicsSet = new HashSet<>(Arrays.asList(args[2].split(";")));
Map<String, String> kafkaParams = new HashMap<>();
kafkaParams.put("metadata.broker.list", args[0]);
//input arguments: localhost:9092 "" incoming 10 local[*]
JavaPairInputDStream<String, String> kafkaStream =
KafkaUtils.createDirectStream(jssc,
String.class,
String.class,
StringDecoder.class,
StringDecoder.class,
kafkaParams,
topicsSet);
On the other end I have been using ActiveMQ, and an ActiveMQ consumer can fetch the data produced while it was off.
Help me out if there is a configuration problem.
In Kafka, consumers actually have no direct relationship with producers. Each consumer has an offset which tracks what has been consumed in the partitions. If a consumer has no offset tracked, Kafka will automatically reset its offset to the largest one because of the default value of config 'auto.offset.reset'. In your case, when the brand-new consumer is started, due to the default policy, it does not see the messages produced previously. You could set 'auto.offset.reset' to earliest (for new consumer) or smallest (for old consumer).
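In the Spark Streaming code from the question, which uses the old (0.8-style) direct stream API, that means adding the old consumer property name to the kafkaParams map, roughly like this sketch (whether offsets actually survive a restart also depends on checkpointing or your own offset management, which is outside this snippet):
Map<String, String> kafkaParams = new HashMap<>();
kafkaParams.put("metadata.broker.list", args[0]);
// Old-consumer property name: start from the smallest (earliest) available offset
// when no previously tracked offset exists for this stream.
kafkaParams.put("auto.offset.reset", "smallest");

JavaPairInputDStream<String, String> kafkaStream =
        KafkaUtils.createDirectStream(jssc, String.class, String.class,
                StringDecoder.class, StringDecoder.class, kafkaParams, topicsSet);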
Kafka maintains an offset per partition per consumer group. While the consumer is off for the 2 minute duration, the offset value is stored in the offsets topic (for the new consumer), and when the consumer is started again after 2 minutes, it reads from the last offset that was stored in Kafka.
What you need to check is the Kafka broker data retention policy: if it is less than 2 minutes, the data would be lost. If the data corresponding to the stored offset is no longer present, the consumer starts reading from the latest offset, since the default is auto.offset.reset=latest, i.e. it only picks up newly arriving data.
I would suggest checking the Kafka data retention policy and changing it accordingly if it is less than 2 minutes.
Kafka is confusing me. I am running it locally with standard values:
only auto-create topics turned on, 1 partition, 1 node, everything local and simple.
If I write
consumer.subscribe(Arrays.asList("test_topic"));
consumer.poll(10);
it simply won't work and never finds any data.
If I instead assign a partition like
consumer.assign(new TopicPartition("test_topic",0));
and check the position, I sit at 995, and can now poll and receive all the data my producer put in.
What is it that I don't understand about subscriptions? I don't need multiple consumers each handling only a part of the data; my consumer needs to get all the data of a certain topic. Why does the standard subscription approach shown in all the tutorials not work for me?
I do understand that partitions are for load balancing consumers. I don't understand what I am doing wrong with the subscription.
consumer config properties
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "postproc-" + EnvUtils.getAppInst()); // jeder ist eine eigene gruppe -> kriegt alles
props.put("enable.auto.commit", "true");
props.put("auto.commit.interval.ms", "1000");
props.put("session.timeout.ms", "30000");
props.put("key.deserializer", "org.apache.kafka.common.serialization.LongDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
KafkaConsumer<Long, byte[]> consumer = new KafkaConsumer<Long, byte[]>(props);
producer config
props.put("bootstrap.servers", "localhost:9092");
props.put("acks", "all");
props.put("retries", 2);
props.put("batch.size", 16384);
props.put("linger.ms", 5000);
props.put("buffer.memory", 1024 * 1024 * 10); // 10mb
props.put("key.serializer", "org.apache.kafka.common.serialization.LongSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
return new KafkaProducer(props);
producer execution
try (ByteArrayOutputStream out = new ByteArrayOutputStream()) {
    event.writeDelimitedTo(out);
    for (long a = 10; a < 20; a++) {
        long rand = new Random(a).nextLong();
        producer.send(new ProducerRecord<>("test_topic", rand, out.toByteArray()));
    }
    producer.flush();
} catch (IOException e) {
    // ...
}
consumer execution
consumer.subscribe(Arrays.asList("test_topic"));
ConsumerRecords<Long,byte[]> records = consumer.poll(10);
for (ConsumerRecord<Long,byte[]> r :records){ ...
I managed to solve the issue. The problem was timeouts. When polling, I didn't give it enough time to complete. I assume assigning a partition is simply a lot faster and therefore completed in time. The standard subscription poll takes longer; it never actually finished and did not commit.
At least I think that was the problem. With longer timeouts it works.
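For illustration, a sketch of what that fix amounts to: keep the subscribe() call, but poll with a timeout long enough for the initial group join and partition assignment to complete (the timeout value here is just an example):
consumer.subscribe(Arrays.asList("test_topic"));

// The first poll also performs the group rebalance and partition assignment,
// so a very short timeout (e.g. 10 ms) can return before any records arrive.
ConsumerRecords<Long, byte[]> records = consumer.poll(5000);
for (ConsumerRecord<Long, byte[]> r : records) {
    // process r.key() / r.value()
}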
I think you are missing this property:
auto.offset.reset=earliest
What to do when there is no initial offset in Kafka or if the current
offset does not exist any more on the server (e.g. because that data
has been deleted):
earliest: automatically reset the offset to the earliest offset
latest: automatically reset the offset to the latest offset
none: throw exception to the consumer if no previous offset is found for the consumer's group
anything else: throw exception to the consumer.
Reference: http://kafka.apache.org/documentation.html#highlevelconsumerapi
I am testing the Kafka High Level Consumer using the ConsumerGroupExample code from the Kafka site. I would like to retrieve all the existing messages on the topic called "test" that I have in the Kafka server config. Looking at other blogs, auto.offset.reset should be set to "smallest" to be able to get all messages:
private static ConsumerConfig createConsumerConfig(String a_zookeeper, String a_groupId) {
    Properties props = new Properties();
    props.put("zookeeper.connect", a_zookeeper);
    props.put("group.id", a_groupId);
    props.put("auto.offset.reset", "smallest");
    props.put("zookeeper.session.timeout.ms", "10000");
    return new ConsumerConfig(props);
}
The question I really have is this: what is the equivalent Java api call for the High Level Consumer that is the equivalent of:
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
Basically, every time a consumer with a new group ID starts consuming a topic, it will read the messages from the beginning. If you are consuming from the beginning each time just for testing purposes, then initialise your consumer with a new group ID every time. Here's how I did it:
properties.put("group.id", UUID.randomUUID().toString());
and it reads messages from the beginning each time!
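Put together with the createConsumerConfig method from the question, that amounts to something like this sketch (note that a fresh random group ID only re-reads everything because auto.offset.reset is set to smallest):
private static ConsumerConfig createConsumerConfig(String a_zookeeper) {
    Properties props = new Properties();
    props.put("zookeeper.connect", a_zookeeper);
    // A never-before-seen group has no committed offsets...
    props.put("group.id", UUID.randomUUID().toString());
    // ...so with "smallest" it starts from the earliest available message.
    props.put("auto.offset.reset", "smallest");
    props.put("zookeeper.session.timeout.ms", "10000");
    return new ConsumerConfig(props);
}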
It looks like you need to use the low-level SimpleConsumer API:
For most applications, the high level consumer Api is good enough. Some applications want features not exposed to the high level consumer yet (e.g., set initial offset when restarting the consumer). They can instead use our low level SimpleConsumer Api. The logic will be a bit more complicated and you can follow the example in here.
This example worked for getting all messages from a topic with the following arguments (note that the port is the Kafka port, not the ZooKeeper port; the topics were set up as in this example):
10 my-replicated-topic 0 localhost 9092
Specifically, there is a method to get readOffset which takes kafka.api.OffsetRequest.EarliestTime():
long readOffset = getLastOffset(consumer,a_topic, a_partition, kafka.api.OffsetRequest.EarliestTime(), clientName);
Here is another post may provide some alternate ideas on how to sort this out: How to get data from old offset point in Kafka?
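For reference, here is a sketch of what that getLastOffset helper looks like in the classic SimpleConsumer example (it follows the well-known example code, so treat the exact class and package names as approximate for your Kafka version):
public static long getLastOffset(SimpleConsumer consumer, String topic, int partition,
                                 long whichTime, String clientName) {
    TopicAndPartition topicAndPartition = new TopicAndPartition(topic, partition);
    Map<TopicAndPartition, PartitionOffsetRequestInfo> requestInfo = new HashMap<>();
    // Ask for a single offset at the given time; EarliestTime() yields the first
    // offset still available in the log, i.e. "read everything from the beginning".
    requestInfo.put(topicAndPartition, new PartitionOffsetRequestInfo(whichTime, 1));
    kafka.javaapi.OffsetRequest request = new kafka.javaapi.OffsetRequest(
            requestInfo, kafka.api.OffsetRequest.CurrentVersion(), clientName);
    OffsetResponse response = consumer.getOffsetsBefore(request);
    if (response.hasError()) {
        System.out.println("Error fetching offset: " + response.errorCode(topic, partition));
        return 0;
    }
    long[] offsets = response.offsets(topic, partition);
    return offsets[0];
}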
To fetch messages from the beginning, you can do this:
import kafka.utils.ZkUtils;
ZkUtils.maybeDeletePath("zkhost:zkport", "/consumers/group.id");
then just follow the routine work...
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("auto.offset.reset", "earliest");
props.put("group.id", UUID.randomUUID().toString());
These properties will help you.