I am trying the latest Kafka version, 1.1.0. There is a point about producer behavior that is bothering me.
Below is a small piece of code:
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("acks", "all");
props.put("retries", 3);
props.put("max.in.flight.requests.per.connection", 1);
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
Producer<String, String> producer = new KafkaProducer<>(props);
for (int i = 0; i < 100; i++)
    producer.send(new ProducerRecord<String, String>("my-topic",
            Integer.toString(i), Integer.toString(i)),
            (metadata, exception) -> { /* callback: log the metadata or the error */ });
Assumptions:
Each message is sent to the same partition of the topic.
Each message is large enough that it is sent to the broker immediately (not held in the buffer).
Now suppose the send for index 0 fails but the subsequent sends succeed. In that case, will the messages reach the broker out of sequence? That is, will the index 0 message not be the first to reach the broker, even though retries are configured?
Will the behavior be the same if I add the configuration property below?
enable.idempotence=true
Is there an elegant approach to handle this situation, that is, to maintain the order of the messages?
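For reference, here is a rough sketch of the producer configuration I have in mind for the idempotent case (the serializer classes and the exact values are just illustrative assumptions):
Properties idemProps = new Properties();
idemProps.put("bootstrap.servers", "localhost:9092");
// enable.idempotence requires acks=all, a non-zero retries value and a small number of in-flight requests
idemProps.put("enable.idempotence", true);
idemProps.put("acks", "all");
idemProps.put("retries", 3);
// keeping this at 1 is the strictest way to preserve ordering across retries
idemProps.put("max.in.flight.requests.per.connection", 1);
idemProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
idemProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
Producer<String, String> idemProducer = new KafkaProducer<>(idemProps);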
Thanks in advance
I am trying to implement a very simple Kafka (0.9.0.1) consumer in Scala (code below).
From my understanding, Kafka (or rather Zookeeper) stores, for each groupId, the offset of the last consumed message for a given topic. So given the following scenario:
A consumer with groupId1 yesterday consumed the only 5 messages in a topic, so the last consumed message has offset 4 (counting the first message as offset 0).
During the night, 2 new messages arrived in the topic.
Today I restart the consumer, with the same groupId1, there will
be two options:
Option 1: The consumer will read the last 2 new messages which arrived during the night if I set the following property as "latest":
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest")
Option 2: The consumer will read all the 7 messages in the topic if I set the following property as "earliest":
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
Problem: Now suppose I change the consumer's groupId to groupId2, a new groupId for the given topic, so it has never consumed any message before and its offset should start at 0. I was expecting that by setting
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest")
the consumer would, on its first execution, read all the messages stored in the topic (the equivalent of having earliest), and then on following executions consume just the new ones. However, this is not what happens.
If I set a new groupId and keep AUTO_OFFSET_RESET_CONFIG as latest, the consumer is not able to read any messages. What I need to do instead is set AUTO_OFFSET_RESET_CONFIG to earliest for the first run, and once there is a committed offset for the groupId, I can switch to latest.
Is this how my consumer should be working? Is there a better solution than switching the AUTO_OFFSET_RESET_CONFIG after the first time I run the consumer?
Below is the code I am using as a simple consumer:
import java.util.{Collections, Properties}
import java.util.concurrent.Executors
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import scala.collection.JavaConversions._

class KafkaTestings {
  val brokers = "listOfBrokers"
  val groupId = "anyGroupId"
  val topic = "anyTopic"
  val props = createConsumerConfig(brokers, groupId)
  // consumer instance used by run()
  val consumer = new KafkaConsumer[String, String](props)

  def createConsumerConfig(brokers: String, groupId: String): Properties = {
    val props = new Properties()
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers)
    props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId)
    props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true")
    props.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, "1000")
    props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "30000")
    props.put(ConsumerConfig.CLIENT_ID_CONFIG, "12321")
    props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest")
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer")
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer")
    props
  }

  def run() = {
    consumer.subscribe(Collections.singletonList(this.topic))
    Executors.newSingleThreadExecutor.execute(new Runnable {
      override def run(): Unit = {
        while (true) {
          val records = consumer.poll(1000)
          for (record <- records) {
            println("Record: " + record.value)
          }
        }
      }
    })
  }
}

object ScalaConsumer extends App {
  val testConsumer = new KafkaTestings()
  testConsumer.run()
}
This was used as a reference to write this simple consumer
This is working as documented.
If you start a new consumer group (i.e. one for which there are no existing offsets stored in Kafka), you have to choose whether the consumer should start from the EARLIEST possible messages (the oldest messages still available in the topic) or from the LATEST (only messages produced from now on).
Is there a better solution than switching the AUTO_OFFSET_RESET_CONFIG after the first time I run the consumer?
You can keep it at EARLIEST, because the second time you run the consumer, it will already have stored offsets and just pick up there. The reset policy is only used when a new consumer group is created.
Today I restart the consumer, with the same groupId1, there will be two options:
Not really. Since the consumer group was running the day before, it will find its committed offsets and just pick up where it left off. So no matter what you set the reset policy to, it will get these two new messages.
Be aware though that Kafka does not store these offsets forever; I believe the default is just a week. So if you shut down your consumers for more than that, the offsets may be aged out, and you could run into an accidental reset to EARLIEST (which may be expensive for large topics). Given that, it is probably prudent to change it to LATEST anyway.
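For reference, the broker setting that controls how long committed offsets are kept is offsets.retention.minutes; the value below (roughly one week) is only an illustration:
# server.properties (broker side); value is an example, not the default
offsets.retention.minutes=10080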
You can keep it at EARLIEST, because the second time you run the consumer, it will already have stored offsets and just pick up there. The reset policy is only used when a new consumer group is created.
In my testing, I often want to read from the earliest offset, but as noted, once you've read messages with a given groupId, then your offset remains at that pointer.
I do this:
properties.put(ConsumerConfig.GROUP_ID_CONFIG, UUID.randomUUID().toString()); // a random group id means there are never committed offsets, so the reset policy always applies
I am new to the Kafka ecosystem, and in my case I am using a Java producer but have no need for sending a key along with the record value, which is serialized Avro. Is there a way to build a Java producer that will not send keys, or are keys a requirement when sending messages in Kafka?
Like @gasparms said, there are built-in ways to produce without sending a key. Most people use Kafka this way, since they just want to send a stream of messages with no key. Using keys is only really required if you need log compaction.
Here's a really good explanation -
https://stackoverflow.com/a/29515696/236528
ProducerRecord has several constructors; one of them does not take a key, so you don't have to provide one.
Example:
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("acks", "all");
props.put("retries", 0);
props.put("batch.size", 16384);
props.put("linger.ms", 1);
props.put("buffer.memory", 33554432);
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
Producer<String, String> producer = new KafkaProducer<>(props);
for(int i = 0; i < 100; i++)
producer.send(new ProducerRecord<>("my-topic", "myValue"));
producer.close();
Working with Kafka (v2.11-0.10.1.0) and Spark Streaming (v2.0.1-bin-hadoop2.7).
I have a Kafka producer and a Spark Streaming consumer to produce and consume. Everything works fine until I stop the consumer (for approximately 2 minutes) and start it again. The consumer starts and reads data, absolutely perfectly, but I have lost the 2 minutes of data from while the consumer was off.
Kafka consumer/server.properties are unchanged.
Kafka producer with properties:
Properties properties = new Properties();
properties.put("bootstrap.servers", AppCoding.KAFKA_HOST);
properties.put("auto.create.topics.enable", true);
properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
properties.put("retries", 1);
logger.info("Initializing Kafka Producer.");
Producer<String, String> producer = new KafkaProducer<>(properties);
producer.send(new ProducerRecord<String, String>(AppCoding.KAFKA_TOPIC, "", documentAsString));
Consuming using Spark-streaming api as:
SparkConf sparkConf = new SparkConf().setMaster(args[4]).setAppName("Streaming");
// Create the context with 60 seconds batch size
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(60000 * 5));
//input arguments:localhost:2181 sparkS incoming 10 local[*]
Set<String> topicsSet = new HashSet<>(Arrays.asList(args[2].split(";")));
Map<String, String> kafkaParams = new HashMap<>();
kafkaParams.put("metadata.broker.list", args[0]);
//input arguments: localhost:9092 "" incoming 10 local[*]
JavaPairInputDStream<String, String> kafkaStream =
KafkaUtils.createDirectStream(jssc,
String.class,
String.class,
StringDecoder.class,
StringDecoder.class,
kafkaParams,
topicsSet);
On the other hand, I have been using ActiveMQ, and the ActiveMQ consumer could fetch the data produced while it was off.
Help me out if there's a configuration problem.
In Kafka, consumers actually have no direct relationship with producers. Each consumer has an offset which tracks what has been consumed in the partitions. If a consumer has no offset tracked, Kafka will automatically reset its offset to the largest one because of the default value of config 'auto.offset.reset'. In your case, when the brand-new consumer is started, due to the default policy, it does not see the messages produced previously. You could set 'auto.offset.reset' to earliest (for new consumer) or smallest (for old consumer).
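For example, with the direct stream shown in the question (which takes the old consumer property names), that could look roughly like this; everything except the extra auto.offset.reset entry is unchanged from the question:
Map<String, String> kafkaParams = new HashMap<>();
kafkaParams.put("metadata.broker.list", args[0]);
// old consumer property name and value; the new consumer uses auto.offset.reset=earliest
kafkaParams.put("auto.offset.reset", "smallest");
JavaPairInputDStream<String, String> kafkaStream =
    KafkaUtils.createDirectStream(jssc,
        String.class, String.class,
        StringDecoder.class, StringDecoder.class,
        kafkaParams, topicsSet);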
Kafka maintains offsets per partition. While the consumer was off for the 2-minute duration, the committed offset would be stored in Kafka (in the __consumer_offsets topic for the new consumer), and when the consumer is started again after 2 minutes, it would resume from the last stored offset.
I think what you need to check is the Kafka broker data retention policy. If it is less than 2 minutes, the data would be lost; and if the data corresponding to the stored offset is no longer present, the consumer would start reading from the latest offset, since the default for newly arriving data is auto.offset.reset=latest.
I would suggest checking the Kafka data retention policy and changing it accordingly if it is less than 2 minutes, as sketched below.
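For example, something along these lines can be used to inspect and raise the topic-level retention (the topic name and the one-week value are placeholders):
./kafka-configs.sh --zookeeper localhost:2181 --entity-type topics --entity-name my-topic --describe
./kafka-configs.sh --zookeeper localhost:2181 --entity-type topics --entity-name my-topic --alter --add-config retention.ms=604800000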
I have built the following kafka consumer:
Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:6667");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "TEST1");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,"org.apache.kafka.common.serialization.StringDeserializer");
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,"org.apache.kafka.common.serialization.StringDeserializer");
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "10000");
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
props.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG,"1000");
this.kconsumer = new KafkaConsumer(props);
I want the consumer to start from the earliest offset for this group when it is initiated. The first time I run it, it works perfectly as expected. As long as the subscription exists and the connection is not closed, it continues to advance the offset.
When I log in to kafka and run the following:
./kafka-consumer-groups.sh --bootstrap-server localhost:6667 --new-consumer --group TEST1 --describe
I see exactly what is expected: an increase in offset, etc. When the connection is closed, however, running the same command results in "Consumer group TEST1 does not exist or is rebalancing." Only it is not rebalancing; it is gone.
How do I persist the existence of the group when the consumer is not running? Am I missing a config in the consumer or in kafka?
As another note, when I change the offset reset parameter to "latest", I get no records at all unless new ones arrive, even though the existing records are not expired.
So, bottom line: I want to be able to spin up a new consumer with a given name and pull from the earliest available record, shut that consumer down, and, if I start a consumer with that name again, pull from where I left off. Any idea what I am missing? Or am I just misunderstanding how the high-level consumer is meant to work?
In case someone comes across this and wants to know what I did: I was able to set the offset policy after first determining whether the group existed. Doing it this way means that if the group exists, I use "latest"; if not, I use "earliest".
private void buildConsumer(String offset)
{
Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:6667");
props.put(ConsumerConfig.GROUP_ID_CONFIG, this.groupId);
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,"org.apache.kafka.common.serialization.StringDeserializer");
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,"org.apache.kafka.common.serialization.StringDeserializer");
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "10000");
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, offset);
props.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG,"1000");
this.kconsumer = new KafkaConsumer(props);
}
/*
Check if the group exists before polling.
If it does, leave with default offset.
If it does not exist, set the offset to earliest to ensure you get all the records
*/
private void groupExists(String topic)
{
TopicPartition toc = new TopicPartition(topic, 0);
OffsetAndMetadata oam = kconsumer.committed(toc);
if(oam != null){
//do nothing, all is well, start from last commit
} else {
/*
when a new group is started the AUTO_OFFSET_RESET_CONFIG
needs to be set to earliest to ensure all records are picked up
Since that property can only be set at instantiation the consumer
must be rebuilt and resubscribed
*/
buildConsumer("earliest");
this.kconsumer.subscribe(Arrays.asList(topic));
}
}
Kafka is confusing me. I am running it locally with standard values.
Only automatic topic creation is turned on: 1 partition, 1 node, everything local and simple.
If I write
consumer.subscribe(Arrays.asList("test_topic"));
consumer.poll(10);
It simply won't work and never finds any data.
If I instead assign a partition like
consumer.assign(new TopicPartition("test_topic",0));
and check the position, I am at offset 995, and can now poll and receive all the data my producer put in.
What is it that I don't understand about subscriptions? I don't need multiple consumers each handling only a part of the data. My consumer needs to get all the data of a certain topic. Why does the standard subscription approach shown in all the tutorials not work for me?
I do understand that partitions are for load balancing between consumers. I don't understand what I am doing wrong with the subscription.
consumer config properties
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "postproc-" + EnvUtils.getAppInst()); // jeder ist eine eigene gruppe -> kriegt alles
props.put("enable.auto.commit", "true");
props.put("auto.commit.interval.ms", "1000");
props.put("session.timeout.ms", "30000");
props.put("key.deserializer", "org.apache.kafka.common.serialization.LongDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
KafkaConsumer<Long, byte[]> consumer = new KafkaConsumer<Long, byte[]>(props);
producer config
props.put("bootstrap.servers", "localhost:9092");
props.put("acks", "all");
props.put("retries", 2);
props.put("batch.size", 16384);
props.put("linger.ms", 5000);
props.put("buffer.memory", 1024 * 1024 * 10); // 10mb
props.put("key.serializer", "org.apache.kafka.common.serialization.LongSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
return new KafkaProducer(props);
producer execution
try (ByteArrayOutputStream out = new ByteArrayOutputStream()){
event.writeDelimitedTo(out);
for (long a = 10; a<20;a++){
long rand=new Random(a).nextLong();
producer.send(new ProducerRecord<>("test_topic",rand ,out.toByteArray()));
}
producer.flush();
} catch (IOException e) {
// handle/log the exception
}
consumer execution
consumer.subscribe(Arrays.asList("test_topic"));
ConsumerRecords<Long,byte[]> records = consumer.poll(10);
for (ConsumerRecord<Long,byte[]> r :records){ ...
I managed to solve the issue. The problem was the timeouts. When polling, I didn't give it enough time to complete. I assume assigning a partition is just a lot faster and therefore completed in time. The standard subscription poll takes longer; it never actually finished and did not commit.
At least I think that was the problem. With longer timeouts it works.
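For illustration, the consumer loop with a longer poll timeout looks roughly like this (the 5000 ms value is just an example that worked for me):
consumer.subscribe(Arrays.asList("test_topic"));
while (true) {
    // a longer timeout gives the consumer time to join the group and receive its partition assignment
    ConsumerRecords<Long, byte[]> records = consumer.poll(5000);
    for (ConsumerRecord<Long, byte[]> r : records) {
        // process r ...
    }
}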
I think you are missing this property (see the sketch after the quoted documentation below):
auto.offset.reset=earliest
What to do when there is no initial offset in Kafka or if the current
offset does not exist any more on the server (e.g. because that data
has been deleted):
earliest: automatically reset the offset to the earliest offset
latest: automatically reset the offset to the latest offset
none: throw exception to the consumer if no previous offset is found for the consumer's group
anything else: throw exception to the consumer.
Reference: http://kafka.apache.org/documentation.html#highlevelconsumerapi
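Applied to the consumer properties in the question, that is a single extra line (sketch; everything else stays as posted):
// without this, a brand-new group starts at the latest offset and never sees the records already in the topic
props.put("auto.offset.reset", "earliest");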