How to define max.poll.records (SCS with Kafka) over containers - apache-kafka

I'm trying to figure out the poll records mechanism for Kafka over SCS in a K8s environment.
What is the recommended way to control max.poll.records?
How can I poll the defined value?
Is it possible to define it once for all channels and then override for a specific channel?
(referring to this comment from the documentation):
To avoid repetition, Spring Cloud Stream supports setting values for
all channels, in the format of
spring.cloud.stream.kafka.default.consumer.<property>=<value>. The
following properties are available for Kafka consumers only and must
be prefixed with
spring.cloud.stream.kafka.bindings.<channelName>.consumer..
Is this path supported: spring.cloud.stream.binding.<channel name>.consumer.configuration?
Is this: spring.cloud.stream.**kafka**.binding.<channel name>.consumer.configuration?
How are conflicts being resolved? Let's say in a case where both spring.cloud.stream.binding... and spring.cloud.stream.**kafka**.binding... are set?
I've tried all of the configurations mentioned, but couldn't see in the log what the actual max.poll.records value is, and frankly the documentation is not entirely clear on the subject.

These are the configurations:
spring.cloud.stream.kafka.default.consumer.configuration.max.poll.records - the default, used if nothing else is specified for a given channel
spring.cloud.stream.kafka.bindings.<channel name>.consumer.configuration.max.poll.records - the value for one specific channel
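To make the precedence concrete, here is a minimal sketch in plain Java that mimics the lookup order described above: a binding-specific value wins, otherwise the default applies. It is not Spring Cloud Stream's own resolution code; the class name, the `resolve` helper, and the channel names `orders` and `payments` are made up for illustration.

```java
import java.util.Properties;

// Illustrative model only: mimics "binding-specific value overrides the default".
// This is NOT Spring Cloud Stream's actual implementation.
class MaxPollRecordsLookup {

    static final String DEFAULT_KEY =
        "spring.cloud.stream.kafka.default.consumer.configuration.max.poll.records";

    // Builds the binding-specific key for a (hypothetical) channel name.
    static String bindingKey(String channel) {
        return "spring.cloud.stream.kafka.bindings." + channel
            + ".consumer.configuration.max.poll.records";
    }

    // Returns the binding-specific value if present, else the default, else null.
    static String resolve(Properties props, String channel) {
        String specific = props.getProperty(bindingKey(channel));
        return specific != null ? specific : props.getProperty(DEFAULT_KEY);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(DEFAULT_KEY, "200");          // applies to all channels
        props.setProperty(bindingKey("orders"), "50");  // override for one channel

        System.out.println(resolve(props, "orders"));   // 50
        System.out.println(resolve(props, "payments")); // 200
    }
}
```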

Related

Spring Kafka Consumer Configs - default values and at least once semantics

I am writing a Kafka consumer using the spring-kafka template.
When I instantiate consumers, Spring Kafka takes in parameters like the following.
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
props.put(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, fetchMaxBytes);
props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, maxPartitionFetchBytes);
I read the documentation and it looks like there are lots of other parameters that can be passed as consumer configs too. Interestingly, each of these parameters has a default value.
My questions are:
On what basis were these defaults arrived at?
Will there be a real need to change these values, and if so, which ones?
(IMHO, this is on a case-by-case basis, but I would still like to hear it from experts.)
The delivery semantic we have is at-least-once.
So, for this (at-least-once) delivery semantic, can these be left untouched while still processing a high volume of data?
Any pointers or answers would be of great help in clarifying my doubts.
The default values are an attempt to serve most of the use cases around Kafka. However, it would be an illusion to assume that defaults for so many different configurations can serve all use cases.
A good starting point for understanding the default values is the plain Kafka consumer configuration documentation and, for Spring, its own documentation. In the Confluent docs you will also find the "Importance" of each configuration. If the importance is set to high, it is worth thinking carefully about that setting. I have given some more background on the importance here.
at-least-once
For at-least-once semantics you want to control the commits of the consumed messages. For this, enable.auto.commit needs to be set to false (which is the default value since Spring Kafka version 2.3). In addition, the AckMode is set to BATCH by default, which is the basis for at-least-once semantics.
So, depending on your Spring version it looks like you can leave the default configuration to achieve at-least-once semantics.
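Why committing only after processing gives at-least-once behaviour can be shown with a small in-memory simulation. This sketch does not use the Kafka client: the list standing in for a partition, the `committed` offset, and the simulated crash are all assumptions made for illustration. Because the offset is committed only after a record has been processed, a crash between processing and committing leads to a redelivery rather than a loss.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// In-memory illustration of "process first, commit afterwards"; no Kafka involved.
class AtLeastOnceDemo {

    // Processes records starting at the committed offset and commits only AFTER processing.
    // If crashAfterProcessing >= 0, a crash is simulated right after processing that offset,
    // i.e. before its commit. Returns the committed offset at the time the run ends.
    static int run(List<String> log, int committed, int crashAfterProcessing,
                   List<String> processed) {
        for (int offset = committed; offset < log.size(); offset++) {
            processed.add(log.get(offset));      // the side effect happens first
            if (offset == crashAfterProcessing) {
                return committed;                // crash: the commit for this offset is lost
            }
            committed = offset + 1;              // commit after successful processing
        }
        return committed;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList("a", "b", "c");
        List<String> processed = new ArrayList<>();

        int committed = run(log, 0, 1, processed);      // crashes after processing "b"
        committed = run(log, committed, -1, processed); // restart resumes from last commit

        System.out.println(processed); // [a, b, b, c]
        System.out.println(committed); // 3
    }
}
```

The duplicate "b" is the price of at-least-once: nothing is lost, but downstream processing has to tolerate reprocessing.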

add record level custom latency metric in kafka streams

I am trying to add a specific metric to my kafka-streams application that will measure latency and report it to JMX.
I'm using the Streams DSL in Scala, so using the Processor API for metrics (which I know is possible) will not work for me.
The basic things I would like to understand are:
How to extract specific record properties (e.g. headers) to use as part of the metric calculation
How to add the new metric to the metrics reported to JMX
Thanks!
You will need to fall back to the Processor API to access record metadata like headers and to register custom metrics.
Note, though, that you can mix and match the DSL and the Processor API, so it's not necessary to move off the DSL. Instead, you can plug in custom Processors or Transformers via KStream.process() or KStream.transform() (note that there are multiple "siblings" of transform() that you might want to use instead).
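Once a custom Processor or Transformer gives access to the headers, the latency calculation itself is plain Java. The sketch below assumes, hypothetically, that the producer stores its send time as an 8-byte big-endian epoch-millisecond value in a header; the class and method names are illustrative, and the Kafka Streams wiring and the metric registration are deliberately left out.

```java
import java.nio.ByteBuffer;

// Sketch of the latency arithmetic only; the header convention is an assumption.
class HeaderLatency {

    // Producer side (assumed convention): encode epoch millis as 8 big-endian bytes.
    static byte[] encodeTimestamp(long epochMillis) {
        return ByteBuffer.allocate(Long.BYTES).putLong(epochMillis).array();
    }

    // Consumer side: decode the header value and compute the latency relative to "now".
    static long latencyMillis(byte[] headerValue, long nowMillis) {
        long sentAt = ByteBuffer.wrap(headerValue).getLong();
        return nowMillis - sentAt;
    }

    public static void main(String[] args) {
        byte[] header = encodeTimestamp(1_000L);
        System.out.println(latencyMillis(header, 1_250L)); // 250
    }
}
```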

Acknowledgement Kafka Producer Apache Beam

How do I get the records where an acknowledgement was received in apache beam KafkaIO?
Basically I want all the records where I didn't get any acknowledgement to go to a bigquery table so that I can retry sometime later. I used the following code snippet from the docs
.apply(KafkaIO.<Long, String>read()
.withBootstrapServers("broker_1:9092,broker_2:9092")
.withTopic("my_topic") // use withTopics(List<String>) to read from multiple topics.
.withKeyDeserializer(LongDeserializer.class)
.withValueDeserializer(StringDeserializer.class)
// Above four are required configuration. returns PCollection<KafkaRecord<Long, String>>
// Rest of the settings are optional :
// you can further customize KafkaConsumer used to read the records by adding more
// settings for ConsumerConfig. e.g :
.updateConsumerProperties(ImmutableMap.of("group.id", "my_beam_app_1"))
// set event times and watermark based on LogAppendTime. To provide a custom
// policy see withTimestampPolicyFactory(). withProcessingTime() is the default.
.withLogAppendTime()
// restrict reader to committed messages on Kafka (see method documentation).
.withReadCommitted()
// offset consumed by the pipeline can be committed back.
.commitOffsetsInFinalize()
// finally, if you don't need Kafka metadata, you can drop it.
.withoutMetadata() // PCollection<KV<Long, String>>
)
.apply(Values.<String>create()) // PCollection<String>
By default, Beam IOs are designed to keep attempting to write/read/process elements until they succeed. (Batch pipelines will fail after repeated errors.)
What you are referring to is usually called a Dead Letter Queue: the failed records are taken and added to a PCollection, Pub/Sub topic, queuing service, etc. This is often desirable because it allows a streaming pipeline to make progress (not block) when errors are encountered writing some records, while still allowing the ones that succeed to be written.
Unfortunately, unless I am mistaken, there is no dead letter queue implemented in KafkaIO. It may be possible to modify KafkaIO to support this. There was some discussion on the Beam mailing list with ideas proposed for implementing it.
I suspect it may be possible to add this to KafkaWriter, catching the records that failed and outputting them to another PCollection. If you choose to implement this, please also contact the Beam community mailing list if you would like help merging it into master; they will be able to help make sure the change covers the necessary requirements so that it can be merged and makes sense as a whole for Beam.
Your pipeline can then write those elsewhere (e.g. to a different sink). Of course, if that secondary sink simultaneously has an outage/issue, you would need another DLQ.
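The dead letter pattern itself is independent of Beam and can be sketched in a few lines of plain Java. This is only an illustration of the idea, not KafkaIO or KafkaWriter code: the `writeAll` helper, the writer that rejects records starting with "bad", and the record values are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Generic dead-letter sketch: failed writes are collected instead of blocking the rest.
class DeadLetterSketch {

    // Tries to write every record; records whose write throws are returned
    // as the "dead letter" list so they can be stored and retried later.
    static <T> List<T> writeAll(List<T> records, Consumer<? super T> writer) {
        List<T> deadLetters = new ArrayList<>();
        for (T record : records) {
            try {
                writer.accept(record);
            } catch (RuntimeException e) {
                deadLetters.add(record); // keep for a later retry
            }
        }
        return deadLetters;
    }

    public static void main(String[] args) {
        List<String> written = new ArrayList<>();
        // Hypothetical writer that fails for records starting with "bad".
        Consumer<String> writer = r -> {
            if (r.startsWith("bad")) {
                throw new IllegalStateException("no ack for " + r);
            }
            written.add(r);
        };

        List<String> failed = writeAll(Arrays.asList("ok-1", "bad-2", "ok-3"), writer);
        System.out.println(written); // [ok-1, ok-3]
        System.out.println(failed);  // [bad-2]
    }
}
```

In a pipeline, the returned list would correspond to the second output that gets written to the retry store.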

Why message brokers don't supply total data/messages sent metrics?

My team was recently considering different message brokers for our project. We ended up picking Apache Pulsar, but the question applies to others too (e.g. Kafka). Our requirement is to track the total number of messages and bytes sent to each subscriber for billing purposes.
I was reading documentation for metrics and was surprised to see that Pulsar doesn't track this, I've checked Kafka and the result was the same.
My understanding on this subject is minimal so is this some kind of anti-pattern?
I understand that counter values like this never go down and for our use case - should not be reset, leading to potential (certain) overflows. But to me this could be solved by using something like a histogram in Prometheus (metrics format used in Pulsar). I am actually thinking about implementing such functionality, but am I wrong and is there a better solution for our purpose?
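As a rough sanity check on the overflow concern, the arithmetic can be worked out for a signed 64-bit counter. The rate of one million messages per second is a made-up figure for illustration, and this says nothing about how Pulsar or Kafka actually store their metrics.

```java
// Back-of-the-envelope estimate; assumes a signed 64-bit counter and a constant rate.
class CounterOverflow {

    // Whole years until a signed 64-bit counter overflows at the given rate per second.
    static long yearsUntilOverflow(long incrementsPerSecond) {
        long secondsPerYear = 365L * 24 * 60 * 60;
        return Long.MAX_VALUE / incrementsPerSecond / secondsPerYear;
    }

    public static void main(String[] args) {
        System.out.println(yearsUntilOverflow(1_000_000L)); // 292471
    }
}
```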

Kafka Streams - accessing data from the metrics registry

I'm having a difficult time finding documentation on how to access the data within the Kafka Streams metric registry, and I think I may be trying to fit a square peg in a round hole. I was hoping to get some advice on the following:
Goal
Collect metrics being recorded in the Kafka Streams metrics registry and send these values to an arbitrary end point
Workflow
This is what I think needs to be done, and I've completed all of the steps except the last (having trouble with that one because the metrics registry is private). But I may be going about this the wrong way:
Define a class that implements the MetricsReporter interface. Build a list of the metrics that Kafka creates in the metricChange method (e.g. whenever this method is called, update a hashmap with the currently registered metrics).
Specify this class in the metric.reporters configuration property
Set up a process that polls the Kafka Streams metric registry for the current data, and ship the values to an arbitrary end point
Anyway, the last step doesn't appear to be possible in Kafka 0.10.0.1 since the metrics registry isn't exposed. Could someone please let me know if this is the correct workflow (sounds like it's not), or if I am misunderstanding the process for extracting the Kafka Streams metrics?
Although the metrics registry is not exposed, you can still get the value of a given KafkaMetric through its KafkaMetric.value() / KafkaMetric.value(timestamp) methods. For example, as you observed in the JmxReporter, it keeps the list of KafkaMetrics passed to its init() and metricChange/metricRemoval methods, and then in its MBean implementation, when getAttribute is called, it calls the corresponding KafkaMetric.value() function. So for your customized reporter you can apply a similar pattern: for example, periodically poll KafkaMetric.value() for all the metrics you have kept and then pipe the results to your end point.
The MetricsReporter interface in org.apache.kafka.common.metrics already enables you to manage all Kafka Streams metrics in the reporter, so Kafka's internal registry is not needed.
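The keep-and-poll pattern described above can be sketched without any Kafka dependency. In this stand-in, `DoubleSupplier` plays the role of a KafkaMetric and the method names only mirror the reporter callbacks; the class, the metric names, and the values are hypothetical, and a real reporter would implement the Kafka interface and be registered through metric.reporters.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.DoubleSupplier;

// Stand-in for a custom reporter: DoubleSupplier plays the role of KafkaMetric.
class PollingReporterSketch {

    private final Map<String, DoubleSupplier> metrics = new ConcurrentHashMap<>();

    // Called when a metric is added or updated: remember it by name.
    void metricChange(String name, DoubleSupplier metric) {
        metrics.put(name, metric);
    }

    // Called when a metric is removed: forget it.
    void metricRemoval(String name) {
        metrics.remove(name);
    }

    // Reads the current value of every tracked metric. A scheduler could call this
    // periodically and ship the snapshot to an arbitrary end point.
    Map<String, Double> poll() {
        Map<String, Double> snapshot = new TreeMap<>();
        metrics.forEach((name, metric) -> snapshot.put(name, metric.getAsDouble()));
        return snapshot;
    }

    public static void main(String[] args) {
        PollingReporterSketch reporter = new PollingReporterSketch();
        double[] processRate = {12.5};                 // hypothetical live value
        reporter.metricChange("process-rate", () -> processRate[0]);
        reporter.metricChange("commit-rate", () -> 1.0);
        reporter.metricRemoval("commit-rate");

        processRate[0] = 20.0;                         // value changes between polls
        System.out.println(reporter.poll());           // {process-rate=20.0}
    }
}
```

Because the reporter holds references to the metric objects rather than copies of their values, each poll sees the current reading without needing access to the registry.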