Spring Kafka Consumer Configs - default values and at least once semantics - apache-kafka

I am writing a Kafka consumer using spring-kafka.
When I instantiate consumers, spring-kafka takes parameters like the following:
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
props.put(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, fetchMaxBytes);
props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, maxPartitionFetchBytes);
I read the documentation and it looks like there are lots of other parameters that can be passed as consumer configs too. Interestingly, each of these parameters has a default value.
My questions are:
On what basis were these default values arrived at?
Is there a real need to change these values, and if so, which ones? (IMHO this is decided on a case-by-case basis, but I would still like to hear it from experts.)
The delivery semantic we have is at-least-once. For this semantic, can the defaults be left untouched and still process a high volume of data?
Any pointers or answers would be of great help in clarifying my doubts.

The default values are an attempt to serve most of the use cases around Kafka. However, it would be an illusion to assume that one fixed set of values for so many different configurations could serve all use cases.
A good starting point for understanding the default values is the plain Kafka consumer configuration documentation and, for Spring, its own documentation. In the Confluent docs you will also find the "Importance" of each configuration. If this importance is set to high, it is recommended to really think about the value. I have given some more background on the importance here.
at-least-once
For at-least-once semantics you want to control the commits of the consumed messages. For this, enable.auto.commit needs to be set to false (which is the default value since spring-kafka version 2.3). In addition, the AckMode is set to BATCH by default, which is the basis for at-least-once semantics.
So, depending on your Spring version it looks like you can leave the default configuration to achieve at-least-once semantics.
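To make the "process first, commit afterwards" idea concrete, here is a minimal plain-Java simulation. It uses no Kafka or Spring API; the class name, the method and the injected failure are all hypothetical. It only illustrates why committing after a batch yields at-least-once behaviour, and is not how the listener container is implemented.

```java
import java.util.ArrayList;
import java.util.List;

final class AtLeastOnceSketch {
    /**
     * Reads the log in batches and commits the offset only after a whole
     * batch has been processed. One failure is injected at failAtOffset
     * (use -1 for no failure). Returns every record the handler saw.
     */
    static List<String> consume(List<String> log, int batchSize, int failAtOffset) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        List<String> processed = new ArrayList<>();
        int committed = 0;          // offset the consumer resumes from
        boolean failedOnce = false; // inject exactly one failure
        while (committed < log.size()) {
            int end = Math.min(committed + batchSize, log.size());
            try {
                for (int offset = committed; offset < end; offset++) {
                    if (offset == failAtOffset && !failedOnce) {
                        failedOnce = true;
                        throw new IllegalStateException("simulated crash at offset " + offset);
                    }
                    processed.add(log.get(offset));
                }
                committed = end;    // commit after processing, never before
            } catch (IllegalStateException e) {
                // Nothing was committed, so the next iteration re-reads the same batch.
            }
        }
        return processed;
    }
}
```

With a failure in the middle of a batch, the records processed before the failure are seen twice but none is lost, which is why handlers under at-least-once should be idempotent.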

Related

How to define max.poll.records (SCS with Kafka) over containers

I'm trying to figure out the poll records mechanism for Kafka over SCS in a K8s environment.
What is the recommended way to control max.poll.records?
How can I poll the defined value?
Is it possible to define it once for all channels and then override for a specific channel?
(referring to this comment from the documentation):
To avoid repetition, Spring Cloud Stream supports setting values for
all channels, in the format of
spring.cloud.stream.kafka.default.consumer.=. The
following properties are available for Kafka consumers only and must
be prefixed with
spring.cloud.stream.kafka.bindings..consumer..")
Is this path supported: spring.cloud.stream.binding.<channel name>.consumer.configuration?
Is this: spring.cloud.stream.**kafka**.binding.<channel name>.consumer.configuration?
How are conflicts being resolved? Let's say in a case where both spring.cloud.stream.binding... and spring.cloud.stream.**kafka**.binding... are set?
I've tried all the mentioned configurations, but couldn't see in the log what the actual max.poll.records value is, and frankly the documentation is not entirely clear on the subject.
These are the configurations:
spring.cloud.stream.kafka.default.consumer.configuration.max.poll.records - default if nothing else specified for given channel
spring.cloud.stream.kafka.bindings..consumer.configuration.max.poll.records
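The rule stated above (the default value applies unless the binding sets its own) amounts to a simple map overlay. The following plain-Java sketch only illustrates that rule; it assumes nothing about how the binder merges properties internally, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class ConsumerConfigResolution {
    /** Starts from the defaults and lets binding-specific entries win. */
    static Map<String, String> resolve(Map<String, String> defaults,
                                       Map<String, String> bindingSpecific) {
        Map<String, String> effective = new HashMap<>(defaults);
        effective.putAll(bindingSpecific); // overrides any default with the same key
        return effective;
    }
}
```

For example, resolving a default of max.poll.records=500 against a binding that sets max.poll.records=50 gives 50, while a binding that sets nothing keeps 500.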

kafka and parallel consumer: why order is important into a microservice architecture

I have started to dive into the Kafka ecosystem.
I was surprised to find out that, by default, each consumer digests only one "event" at a time, sequentially!
This follows from offset acknowledgement, the unit of parallelism being the partition, and some other details... you can find nice details here.
If I need to process received messages in parallel on a thread pool inside my application node, I have to put in some non-default development effort.
On the other hand, several technologies have their own recipes for this: quarkus/smallrye, confluentinc's parallel-consumer, spring, ...
I had hoped to find a by-default code configuration for this.
This suggests to me that perhaps some other technologies are more suitable for consuming messages straightforwardly...
Why is a parallel consumer not provided by default in the libraries?
Why is order important in a microservice architecture?
KafkaConsumer is a relatively low-level object, that's basically capable of reading records from given offset position, seeking to a particular offset, reading and saving that offset in existing Kafka store (called __consumer_offsets). Similarly, the receive API is fully synchronous with its poll(Duration) signature.
If more custom, e.g. asynchronous behaviour is desired, then you can use the wrappers like parallel-consumer or spring-kafka.
When it comes to library design, very often it is preferable to do only one thing (basically an applied single responsibility principle).
As an example, consider that if the "main" library were asynchronous, the library providers would need to define thread creation and maintenance semantics, what happens when there are no records (compare to spring-kafka's listeners), and so on. By exposing a low-level API, these concerns, which are not immediately relevant to Kafka, can be avoided.
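To give a flavour of the threading concerns a wrapper takes on, here is a minimal plain-Java sketch that processes records in parallel while keeping the order per key. It is an illustration only, not how parallel-consumer or spring-kafka are implemented; the class name, the lane count and the string records are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class KeyOrderedDispatcher {
    private final List<ExecutorService> lanes = new ArrayList<>();
    private final Map<String, List<String>> handled = new ConcurrentHashMap<>();

    KeyOrderedDispatcher(int laneCount) {
        if (laneCount < 1) {
            throw new IllegalArgumentException("laneCount must be >= 1");
        }
        for (int i = 0; i < laneCount; i++) {
            lanes.add(Executors.newSingleThreadExecutor()); // one thread per lane keeps FIFO order
        }
    }

    /** Records with the same key always land on the same single-threaded lane. */
    void dispatch(String key, String value) {
        int lane = Math.floorMod(key.hashCode(), lanes.size());
        lanes.get(lane).execute(() ->
                handled.computeIfAbsent(key, k -> new CopyOnWriteArrayList<>()).add(value));
    }

    /** Stops accepting work, waits for the lanes to finish and returns what was handled per key. */
    Map<String, List<String>> drain() {
        for (ExecutorService lane : lanes) {
            lane.shutdown();
        }
        try {
            for (ExecutorService lane : lanes) {
                lane.awaitTermination(10, TimeUnit.SECONDS);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while draining", e);
        }
        return handled;
    }
}
```

Even this small sketch has to decide on thread lifecycle, shutdown and ordering, and it still ignores offset commits and error handling, which is the kind of policy the low-level consumer leaves to its wrappers.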
Why is a parallel consumer not provided by default in the libraries?
Kafka clients are a largely pluggable ecosystem. The core developers are focused on optimizing the server code, and the built-in client libraries (and serializers) work "well-enough" (TM). Therefore, a "by default code configuration" for parallel consumption doesn't exist.
Why is order important in a microservice architecture?
That completely depends on your app, but one example is payment-processing or handling any sort of ledger system (after all, Kafka is a sort of distributed ledger). You cannot withdraw money without first depositing a balance. This is not unique to microservices.
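The ledger example can be sketched in a few lines of plain Java. The class and method names are hypothetical and the rule (a balance may never go negative) is an assumption made for illustration.

```java
final class LedgerSketch {
    /** Applies the deltas in the given order and rejects any step that would overdraw. */
    static long replay(long openingBalance, long... deltas) {
        long balance = openingBalance;
        for (long delta : deltas) {
            if (balance + delta < 0) {
                throw new IllegalStateException("withdrawal of " + -delta + " exceeds balance " + balance);
            }
            balance += delta;
        }
        return balance;
    }
}
```

Replaying a deposit of 100 followed by a withdrawal of 40 succeeds, while the same two events in the opposite order are rejected, even though the set of events is identical.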

Join a static and a dynamic Kafka source in Flink

Today, I'd like to address a conceptual topic about Flink, rather than a technical.
In our case, we do have two Kafka topics A and B, that need to be joined. The join should always include all elements from topic A, as well as all new elements from topic B. There's 2 possibilities to achieve this: always create a new consumer and start consumption of topic A from beginning, or keep all elements from topic A within a state, once consumed.
Right now, the technological approach is going via joining two DataStreams, which quickly shows us its limits for this use case, as there is no possibility to join streams without a window (fair enough). Elements from topic A are eventually lost, if the window moves on and I got the feeling regularly resetting the consumer would bypass the elaborate logic introduced by Flink.
The other approach I am looking towards right now, would be to use the Table API, it sounds like it's the best fit for this job and actually keeps all the elements in its state for an indefinite amount of time.
However my question: Before going into depths of the Table API, only to notice there is a more elegant way, I'd like to identify, if this is the optimal solution for this matter or if there's an even better fitting Flink concept I am not aware of?
Edit: I forgot to mention: We do not make use of POJOs, but rather keep it generic, which means that the incoming data is identified as Tuple2<K,V>, where K and V are each an instance of GenericRecord. The corresponding schema for serialization/deserialization is obtained from the Schema Registry at runtime. I don't know to what extent the SQL constructs can be a bottleneck in this situation.
Additionally, this remark from the documentation, "Both tables must have distinct field names", makes me doubt a little bit, as we do have the same field names, which we will have to handle somehow without huge workarounds.
If A is truly static, then it will be less expensive if you can somehow fully ingest A, either into Flink state or into memory, and then stream B past A -- thereby producing the join results without having to store B.
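The idea of fully loading A and then streaming B past it can be sketched in plain Java, with no Flink API involved. The class and method names are hypothetical, and the sketch assumes an inner join on a shared key with at most one A element per key.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

final class StaticSideJoin<K, A, B> {
    private final Map<K, A> staticSide = new HashMap<>(); // stands in for keyed state holding A

    /** Phase 1: ingest every element of the static side before any B element arrives. */
    void load(K key, A value) {
        staticSide.put(key, value);
    }

    /** Phase 2: each B element is matched against A and then forgotten; B is never stored. */
    Optional<Map.Entry<A, B>> join(K key, B value) {
        A match = staticSide.get(key);
        return match == null ? Optional.empty() : Optional.of(Map.entry(match, value));
    }
}
```

In a real job the map would be replaced by key-partitioned Flink state, so that each parallel instance only holds the part of A for its own keys.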
There are at least a couple of ways to accomplish this with Flink. One is described in this answer, and the other involves using the State Processor API.
With this second approach you would hold A in key-partitioned Flink state. By using the State Processor API you can bootstrap a savepoint that contains the state you want, so that by starting your job from this savepoint, A is already fully loaded and immediately available.
There's a simple example of bootstrapping keyed state in this gist. Once you have created the savepoint, then you need to implement a streaming job that uses it to compute the join -- which can be done with a RichFlatMapFunction.
The other alternative for implementing joins without using the Table API is to simply roll your own with a RichCoFlatMapFunction or a KeyedCoProcessFunction. You will find examples of this in the Flink training. None of those examples really match your requirements, but they give the general flavor. I don't see any advantage to this, however -- if you are going to do a fully dynamic/dynamic join, might as well use the Table API.

Why message brokers don't supply total data/messages sent metrics?

My team was recently considering different message brokers for our project. We ended up picking Apache Pulsar, but the question applies to others (Kafka) as well. Our requirement is to track the total number of messages and bytes sent to each subscriber for billing purposes.
I was reading documentation for metrics and was surprised to see that Pulsar doesn't track this, I've checked Kafka and the result was the same.
My understanding on this subject is minimal so is this some kind of anti-pattern?
I understand that counter values like this never go down and for our use case - should not be reset, leading to potential (certain) overflows. But to me this could be solved by using something like a histogram in Prometheus (metrics format used in Pulsar). I am actually thinking about implementing such functionality, but am I wrong and is there a better solution for our purpose?

Implementing sagas with Kafka

I am using Kafka for Event Sourcing and I am interested in implementing sagas using Kafka.
Any best practices on how to do this? The Commander pattern mentioned here seems close to the architecture I am trying to build but sagas are not mentioned anywhere in the presentation.
This talk from this year's DDD eXchange is the best resource I came across wrt Process Manager/Saga pattern in event-driven/CQRS systems:
https://skillsmatter.com/skillscasts/9853-long-running-processes-in-ddd
(requires registering for a free account to view)
The demo shown there lives on github: https://github.com/flowing/flowing-retail
I've given it a spin and I quite like it. I do recommend watching the video first to set the stage.
Although the approach shown is message-bus agnostic, the demo uses Kafka for the Process Manager to send commands to and listen to events from other bounded contexts. It does not use Kafka Streams but I don't see why it couldn't be plugged into a Kafka Streams topology and become part of the broader architecture like the one depicted in the Commander presentation you referenced.
I hope to investigate this further for our own needs, so please feel free to start a thread on the Kafka users mailing list, that's a good place to collaborate on such patterns.
Hope that helps :-)
I would like to add something here about sagas and Kafka.
In general
In general, Kafka is a tad different from a normal queue. It is especially good at scaling, and this can actually cause some complications.
As one of the means to accomplish scaling, Kafka partitions the data stream. Data is placed in partitions, each of which can be consumed at its own rate, independent of the other partitions of the same topic. Here is some info on it: how-choose-number-topics-partitions-kafka-cluster. I'll come back to why this is important.
The most common ways to ensure the order within Kafka are:
Use 1 partition for the topic
Use a message key to "assign" the message to a partition
In both scenarios your chronologically dependent messages need to stream through the same partition of the same topic.
Also, as @pranjal thakur points out, make sure the delivery method is set to "exactly once", which has a performance impact but ensures you will not receive the messages multiple times.
The caveat
Now, here's the caveat: when you change the number of partitions, the distribution of messages over the partitions (when using a key) changes as well.
In normal conditions this can be handled easily. But if you have a high traffic situation, the migration toward a different number of partitions can result in a moment in time in which a saga-"flow" is handled over multiple partitions and the order is not guaranteed at that point.
It's up to you whether this will be an issue in your scenario.
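Why the distribution changes can be seen from a simplified key-to-partition rule: hash the key and take the remainder modulo the number of partitions. The sketch below takes the hash as a plain number; the real client computes its own hash of the key bytes, so the class and method here are hypothetical and only illustrate the modulo effect.

```java
final class PartitionAssignment {
    /** Simplified rule: partition = hash modulo partition count. */
    static int partitionFor(int keyHash, int partitionCount) {
        if (partitionCount < 1) {
            throw new IllegalArgumentException("partitionCount must be >= 1");
        }
        return Math.floorMod(keyHash, partitionCount);
    }
}
```

A key whose hash is 10 lands on partition 1 with 3 partitions, but on partition 2 with 4 partitions. Messages for that key written before and after the change therefore sit in different partitions, and their relative order is no longer guaranteed.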
Here are some questions you can ask to determine if this applies to your system:
What will happen if you somehow need to migrate/copy data to a new system, using Kafka? (high traffic scenario)
Can you send your data to 1 topic?
What will happen after a temporary outage of your saga service? (low availability scenario/high traffic scenario)
What will happen when you need to replay a bunch of messages? (high traffic scenario)
What will happen if you need to increase the number of partitions? (high traffic scenario/outage & recovery scenario)
The alternative
If you're thinking of setting up a saga, based on steps, like a state machine, I would challenge you to rethink your design a bit.
I'll give an example:
Let's consider a booking-a-hotel-room process:
Simplified, it might consist of the following steps:
Handle room reserved (incoming event)
Handle room paid (incoming event)
Send acknowledgement of the booking (after payment and some processing)
Now, if your saga is not able to handle the payment if the reservation hasn't come in yet, then you are relying on the order of events.
In this case you should ask yourself: when will this break?
If you conclude that you want to avoid the chronological dependency, consider a system without a saga, or a saga which does not depend on the order of events, i.e. one that accepts all messages, even when it's not their turn yet in the process.
Some examples:
aggregators
Modeled as business process: parallel gateways (parallel process flows)
Do note that in such a setup it is even more crucial that every action has an implemented compensating action (rollback action).
I know this is often hard to accomplish; but, if you start small, you might start to like it :-)
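As a small illustration of a saga that does not depend on the order of events, here is a plain-Java sketch of the hotel booking example. The class, the event names and the single in-memory instance are hypothetical; a real saga would persist this state per booking and also implement the compensating actions.

```java
import java.util.EnumSet;
import java.util.Set;

final class BookingSaga {
    enum Event { ROOM_RESERVED, ROOM_PAID }

    private final Set<Event> seen = EnumSet.noneOf(Event.class);
    private boolean acknowledged = false;

    /**
     * Accepts events in any order and tolerates duplicates.
     * Returns true exactly once: when the event just handled completes the set
     * and the booking acknowledgement should be sent.
     */
    boolean handle(Event event) {
        seen.add(event);
        if (!acknowledged && seen.containsAll(EnumSet.allOf(Event.class))) {
            acknowledged = true;
            return true;
        }
        return false;
    }
}
```

Because the saga only records which events it has seen, a payment arriving before the reservation is simply remembered, and the acknowledgement goes out once both have arrived.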