Setting isolation.level for kafka with spring cloud stream - apache-kafka

On a project I'm consuming messages from a partner's Kafka server. The producer is using transactions, and since the default isolation.level of a Kafka consumer is read_uncommitted, they suggest setting the configuration explicitly.
I have some questions:
Is the default value always isolation.level=read_uncommitted, or has it changed over versions?
If I set isolation.level=read_committed as the default for the Kafka binder, is there any cost for consumers reading from a non-transactional producer?
kafka:
  default:
    consumer:
      configuration:
        isolation.level: read_committed
What is the best practice for setting this value if we have multiple partners/binders and we don't really know whether they are using transactions or not?
Thanks in advance.

As far as I know, isolation.level has always had read_uncommitted as its default value, which is not the case for enable.idempotence.
So what you did is good practice: it makes sure you won't lose your configuration if the default value ever changes.
For the second question :
If you are using the read_committed level on a fully non-transactional topic, there is no problem.
If your topic is mixed or fully transactional, the broker will restrict you to reading only messages up to the LSO (last stable offset), which is not always equal to the high watermark.
The LSO depends on the state of the oldest pending transactional message: if there is a pending transaction on a message in partition P1 at offset X, and other messages arrive in P1 in the meantime (X+1, ..., X+n), those messages will not become readable until the transaction at offset X is committed or aborted. That's the only thing to be aware of when using the read_committed level.
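For reference, here is a minimal sketch of what the YAML above boils down to on a plain Java consumer; the broker address, group id and topic name are placeholders:
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReadCommittedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "partner-group");           // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // The property the binder configuration ultimately sets on the underlying client:
        props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("partner-topic"));   // placeholder topic
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                // Only records up to the LSO (committed or non-transactional) show up here.
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
            }
        }
    }
}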
Don't hesitate to ask if anything isn't totally clear.

Related

Difference between kafka idempotent and transactional producer setup?

When setting up a Kafka producer to use idempotent and transactional behaviour:
I understand that for idempotency we set:
enable.idempotence=true
and that by changing this one flag on our producer, we are guaranteed exactly-once event delivery?
and for transactions, we must go further and set transactional.id=<some value>
but by setting this value, it also sets idempotence to true?
Also, by setting one or both of the above to true, the producer will also set acks=all.
With the above, should I be able to get 'exactly-once delivery' by simply changing the enable.idempotence setting? If I wanted to go further and enable transactional support, on the consumer side I would only need to change their setting to isolation.level=read_committed? Does this image reflect how to set up the producer in terms of EOS?
Yes you understood the main concepts.
By enabling idempotence, the producer automatically sets acks to all and guarantees message delivery for the lifetime of the Producer instance.
By enabling transactions, the producer automatically enables idempotence (and acks=all). Transactions allow you to group produce requests and offset commits, and ensure that either all of them or none of them are committed to Kafka.
When using transactions, you can configure whether consumers should only see records from committed transactions by setting isolation.level to read_committed; otherwise, by default, they see all records, including those from aborted transactions.
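To make the producer side concrete, here is a minimal, hedged sketch of a transactional producer; the broker address, transactional.id and topic name are made-up placeholders:
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TransactionalProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Setting transactional.id implicitly turns on enable.idempotence and acks=all.
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "my-tx-id");         // placeholder

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            producer.beginTransaction();
            try {
                producer.send(new ProducerRecord<>("output-topic", "key", "value")); // placeholder topic
                // All-or-nothing: either every send above becomes visible, or none of them do.
                producer.commitTransaction();
            } catch (Exception e) {
                // Aborted records are skipped by consumers using isolation.level=read_committed.
                producer.abortTransaction();
                throw e;
            }
        }
    }
}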
Actually, idempotency by itself does not always guarantee exactly-once event delivery. Say you have a consumer that consumes an event, processes it, and produces a new event. Somewhere in this process the offset that the consumer uses must be incremented and persisted. Without a transactional producer, if that happens before the producer sends the message, the message might never be sent, and you get at-most-once delivery. If you persist the offset after the message is sent, persisting it might fail, the consumer would then read the same message again, and the producer would send a duplicate: at-least-once delivery. The all-or-nothing mechanism of a transactional producer prevents this scenario, provided you store your offsets in Kafka: producing the new message and advancing the consumer's offset become a single atomic action.
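As an illustration of that consume-process-produce scenario, here is a rough sketch (not a drop-in implementation, and it assumes a recent Kafka clients version) where the produced records and the consumer's offset advance are committed in one transaction; topic names and the trivial "processing" step are placeholders, and the consumer is assumed to have enable.auto.commit=false:
import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;

public class ExactlyOnceLoop {
    // consumer: enable.auto.commit=false (and ideally isolation.level=read_committed);
    // producer: transactional.id set, as in the sketch above.
    static void run(KafkaConsumer<String, String> consumer, KafkaProducer<String, String> producer) {
        consumer.subscribe(Collections.singletonList("input-topic")); // placeholder topic
        producer.initTransactions();
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            if (records.isEmpty()) continue;
            producer.beginTransaction();
            try {
                Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                for (ConsumerRecord<String, String> record : records) {
                    // Trivial stand-in for your own processing logic.
                    producer.send(new ProducerRecord<>("output-topic", record.value().toUpperCase()));
                    offsets.put(new TopicPartition(record.topic(), record.partition()),
                                new OffsetAndMetadata(record.offset() + 1));
                }
                // Commit the consumed offsets in the same transaction as the produced records.
                producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                producer.commitTransaction();
            } catch (Exception e) {
                producer.abortTransaction(); // neither the output nor the offset advance becomes visible
            }
        }
    }
}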

Kafka: isolation level implications

I have a use case where I need 100% reliability, idempotency (no duplicate messages) as well as order-preservation in my Kafka partitions. I'm trying to set up a proof of concept using the transactional API to achieve this. There is a setting called 'isolation.level' that I'm struggling to understand.
In this article, they talk about the difference between the two options
There are now two new isolation levels in the Kafka consumer:
read_committed: Read both kinds of messages: those that are not part of a transaction, and those that are, once the transaction is committed. A read_committed consumer uses the end offset of a partition, instead of client-side buffering. This offset is the first message in the partition belonging to an open transaction. It is also known as the "Last Stable Offset" (LSO). A read_committed consumer will only read up to the LSO and filter out any transactional messages which have been aborted.
read_uncommitted: Read all messages in offset order without waiting for transactions to be committed. This option is similar to the current semantics of a Kafka consumer.
The performance implication here is obvious but I'm honestly struggling to read between the lines and understand the functional implications/risk of each choice. It seems like read_committed is 'safer' but I want to understand why.
First, the isolation.level setting only has an impact on the consumer if the topics it's consuming from contain records written by a transactional producer.
If so, if it's set to read_uncommitted, the consumer will simply read everything including aborted transactions. That is the default.
When set to read_committed, the consumer will only be able to read records from committed transactions (in addition to records that are not part of any transaction). It also means that in order to preserve ordering, if a transaction is in flight the consumer will not be able to consume records that are part of that transaction. Basically the broker will only allow the consumer to read up to the Last Stable Offset (LSO). When the transaction is committed (or aborted), the broker will update the LSO and the consumer will receive the new records.
If you don't tolerate duplicates or records from aborted transactions, then you should use read_committed. As you hinted, this creates a small delay in consuming, as records only become visible once transactions are committed. The impact mostly depends on the size of your transactions, i.e. how often you commit.
If you are not using transactions in your producer, the isolation level does not matter. If you are, then you must use read_committed if you want the consumers to honor the transactional nature. Here are some additional references:
https://www.confluent.io/blog/transactions-apache-kafka/
https://docs.google.com/document/d/11Jqy_GjUGtdXJK94XGsEIK7CP1SnQGdp2eF0wSw9ra8/edit
if so, if it's set to read_uncommitted, the consumer will simply read everything including aborted transactions. That is the default.
To clarify things a bit for readers: this is the default only in the Java Kafka client. It was done to not change the semantics when transactions were introduced back in the day.
It's the opposite in librdkafka which sets the isolation.level configuration to read_committed by default. As a result, all libraries built on top of librdkafka will consume only committed messages by default: confluent-kafka-python, confluent-kafka-dotnet, rdkafka-ruby.
KafkaJS consumers also use read_committed by default (readUncommitted is set to false).

Commit issue with Kafka

I am working on a module where the requirement is that there is a producer, we are using Kafka as the queue, and the produced data is fed to a consumer.
Now, in the consumer, I am trying to implement an at-least-once messaging scenario.
For this I have to poll the messages from Kafka and then consume them. After consuming, I call consumer.commitAsync(offset, callback).
I want to know what will happen in the following cases:
Case 1: when the commitAsync() API is never called (suppose there was an exception just before calling it). I assumed the message would be delivered to the consumer again, but that is not happening; the consumer never gets that data again.
Case 2: if the consumer reboots.
Below is the code snippet of the properties set on the consumer:
private Properties getConsumerProperties() {
    final Properties props = new Properties();
    props.put(BOOTSTRAP_SERVERS_CONFIG, "server");
    props.put(GROUP_ID_CONFIG, "groupName");
    // auto commit is disabled; offsets are only committed via commitAsync() in the processing tasks
    props.put(ENABLE_AUTO_COMMIT_CONFIG, false);
    props.put(HEARTBEAT_INTERVAL_MS_CONFIG, heartBeatinterval);
    props.put(METADATA_MAX_AGE_CONFIG, metaDataMaxAge);
    props.put(SESSION_TIMEOUT_MS_CONFIG, sessionTimeout);
    props.put(AUTO_OFFSET_RESET_CONFIG, autoOffsetReset);
    props.put(KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
    props.put(VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
    return props;
}
Now, in the consumer, based on some property that is set, I have 3 topics and create 3 consumers for each topic (as there are 3 partitions and 3 Kafka brokers).
For consumption of data, I identify the packet on the basis of some property when it is received from Kafka and pass it to the relevant topic (I have separate thread pools for the different topics; I create tasks based on the property in the packet and submit them to the thread pool). In the tasks, after processing, I call consumer.commitAsync(offset, callback).
I was expecting the same message to be pulled again from Kafka when commitAsync is not called for some packet, but to my surprise it is not coming back. Am I missing something? Is there any setting we need to apply on the Kafka side as well for at-least-once delivery?
Please suggest.
There are a couple of things to be addressed in your question.
Before I get to the suggestions on how to achieve at-least-once behavior, I'll try and address the 2 cases:
Case 1: when the commitAsync() API is never called (suppose there was an exception just before calling it). I assumed the message would be delivered to the consumer again, but that is not happening; the consumer never gets that data again.
The reason why your consumer does not get the old data could be the enable.auto.commit property: it is set to true by default and commits the offsets regularly in the background. Because of this, the consumer on subsequent runs finds an offset to work with and just waits for new data/messages to arrive.
Case 2: if the consumer reboots.
This is similar: if the consumer, after rebooting, finds a committed offset to work with, it will start consuming from that offset, regardless of whether the offset was committed automatically (enable.auto.commit set to true) or explicitly via commitAsync()/commitSync().
Now, moving on to how to achieve at-least-once behavior, I can think of the following 2 ways:
If you want to take control of committing offsets, set the enable.auto.commit property to false and then invoke commitSync() or commitAsync(), with retries handled in the callback function (see the sketch after these two options).
Note: The choice of Synchronous vs Asynchronous commit will depend on your latency budget and any other requirements. So, not going too much into those details here.
The other option is to utilise the automatic offset commit feature i.e. setting enable.auto.commit to true and auto.commit.interval.ms to an acceptable number (again, based on your requirements on how often would you like to commit offsets).
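Here is a rough sketch of the first option (manual commits, committing only after processing succeeds); the broker address, topic name and the handle() method are placeholders loosely based on your snippet:
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "server");   // as in the question
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "groupName");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");   // take control of commits
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));  // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    handle(record);                                     // your processing; may throw
                }
                // Commit only after the whole batch was processed.
                consumer.commitAsync((offsets, exception) -> {
                    if (exception != null) {
                        // Retry / alert here; a common fallback is a final commitSync() on shutdown.
                        System.err.println("Commit failed for " + offsets + ": " + exception);
                    }
                });
            }
        }
    }

    private static void handle(ConsumerRecord<String, String> record) {
        System.out.println(record.value()); // placeholder for the real processing
    }
}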
I think the default behaviour of Kafka is centered around at-least-once semantics, so it should be fairly straightforward.
I hope this helps!

Preventing message loss with Kafka High Level Consumer 0.8.x

A typical kafka consumer looks like the following:
kafka-broker ---> kafka-consumer ----> downstream-consumer like Elastic-Search
And according to the documentation for Kafka High Level Consumer:
The ‘auto.commit.interval.ms’ setting is how often updates to the consumed offsets are written to ZooKeeper
It seems that there can be message loss if the following two things happen:
Offsets are committed just after some messages are retrieved from kafka brokers.
Downstream consumers (say Elastic-Search) fail to process the most recent batch of messages OR the consumer process itself is killed.
It would perhaps be most ideal if the offsets are not committed automatically based on a time interval but they are committed by an API. This would make sure that the kafka-consumer can signal the committing of offsets only after it receives an acknowledgement from the downstream-consumer that they have successfully consumed the messages. There could be some replay of messages (if kafka-consumer dies before committing offsets) but there would at least be no message loss.
Please let me know if such an API exists in the High Level Consumer.
Note: I am aware of the Low Level Consumer API in 0.8.x version of Kafka but I do not want to manage everything myself when all I need is just one simple API in High Level Consumer.
Ref:
AutoCommitTask.run(), look for commitOffsetsAsync
SubscriptionState.allConsumed()
There is a commitOffsets() API in the High Level Consumer API that can be used to solve this.
Also set the option "auto.commit.enable" to "false" so that offsets are never committed automatically by the Kafka consumer.
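Roughly, with the 0.8.x high-level consumer that would look something like the sketch below; the ZooKeeper address, group id, topic name and the downstream call are placeholders:
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;
import kafka.message.MessageAndMetadata;

public class ManualCommitHighLevelConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "localhost:2181"); // placeholder
        props.put("group.id", "my-group");                // placeholder
        props.put("auto.commit.enable", "false");         // never commit automatically

        ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
        Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                connector.createMessageStreams(Collections.singletonMap("my-topic", 1));

        ConsumerIterator<byte[], byte[]> it = streams.get("my-topic").get(0).iterator();
        while (it.hasNext()) {
            MessageAndMetadata<byte[], byte[]> msg = it.next();
            sendToElasticsearch(msg.message()); // hypothetical downstream call; block until acknowledged
            connector.commitOffsets();          // commit only after the downstream ack
        }
    }

    private static void sendToElasticsearch(byte[] payload) {
        // placeholder for the downstream consumer (e.g. Elasticsearch indexing)
    }
}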

Apache Kafka Consumer group and Simple Consumer

I am new to Kafka. What I've understood so far regarding the consumer is that there are basically two types of implementation:
1) The High level consumer/consumer group
2) Simple Consumer
The most important point about the high-level abstraction is that it is used when you don't want to handle offsets yourself (Kafka takes care of them), while the simple consumer gives you much better control over offset management. What confuses me is: what if I want to run the consumer in a multithreaded environment and also want to have control over the offsets? If I use a consumer group, does that mean I must read from the last offset stored in ZooKeeper? Is that the only option I have?
For the most part, the high-level consumer API does not let you control the offset directly.
When the consumer group is first created, you can tell it whether to start with the oldest or newest message that kafka has stored using the auto.offset.reset property.
You can also control when the high-level consumer commits new offsets to zookeeper by setting auto.commit.enable to false.
Since the high-level consumer stores the offsets in zookeeper, your app could access zookeeper directly and manipulate the offsets - but it would be outside of the high-level consumer API.
Your question was a little confusing but you can use the simple consumer in a multi-threaded environment. That's what the high-level consumer does.
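For illustration only, a sketch of that "outside the high-level consumer API" approach, assuming the classic /consumers/<group>/offsets/<topic>/<partition> layout and the plain ZooKeeper client; do this only while the consumer group is shut down, and treat the address, path and value as placeholders:
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.ZooKeeper;

public class RewindOffsetInZk {
    public static void main(String[] args) throws Exception {
        // Connect to the same ZooKeeper ensemble the high-level consumer uses (placeholder address).
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10_000, event -> { });

        // Offset path used by the old high-level consumer: /consumers/<group>/offsets/<topic>/<partition>
        String path = "/consumers/my-group/offsets/my-topic/0";   // placeholders
        byte[] newOffset = "42".getBytes(StandardCharsets.UTF_8); // offsets are stored as plain strings

        zk.setData(path, newOffset, -1); // -1 = ignore the znode version
        zk.close();
    }
}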
In Apache Kafka 0.9 and 0.10, consumer group management is handled entirely within Kafka itself, by a broker (for coordination) and an internal topic (for state storage).
When a consumer group first subscribes to a topic the setting of auto.offset.reset determines where consumers begin to consume messages (http://kafka.apache.org/documentation.html#newconsumerconfigs)
You can register a ConsumerRebalanceListener to receive a notification when a particular consumer is assigned topics/partitions.
Once the consumer is running, you can use seek, seekToBeginning and seekToEnd to get messages from a specific offset. seek affects the next poll for that consumer, and the position is stored on the next commit (e.g. commitSync, commitAsync, or when auto.commit.interval.ms elapses, if auto commit is enabled).
The consumer javadocs mention more specific situations: http://kafka.apache.org/0100/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html
You can combine the group management provided by Kafka with manual management of offsets via seek(..) once partitions are assigned.
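A rough sketch of that combination with the 0.9/0.10 consumer, where Kafka assigns the partitions but the starting offsets come from your own store; the connection details, topic name and lookUpOffset() helper are placeholders:
import java.util.Collection;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupManagedManualOffsets {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "my-group");                // placeholder
        props.put("enable.auto.commit", "false");         // we decide when progress is recorded
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        final KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("my-topic"), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                consumer.commitSync(); // record progress before the partitions are taken away
            }

            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                // Kafka decided which partitions we own; we decide where to start reading.
                for (TopicPartition tp : partitions) {
                    consumer.seek(tp, lookUpOffset(tp)); // lookUpOffset() is a hypothetical external store
                }
            }
        });

        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(1000);
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("%s-%d@%d: %s%n",
                        record.topic(), record.partition(), record.offset(), record.value());
            }
            consumer.commitSync(); // also keep the group's offsets in Kafka, if desired
        }
    }

    private static long lookUpOffset(TopicPartition tp) {
        return 0L; // placeholder
    }
}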