Kafka Streams - accessing data from the metrics registry - apache-kafka

I'm having a difficult time finding documentation on how to access the data within the Kafka Streams metric registry, and I think I may be trying to fit a square peg in a round hole. I was hoping to get some advice on the following:
Goal
Collect metrics being recorded in the Kafka Streams metrics registry and send these values to an arbitrary end point
Workflow
This is what I think needs to be done, and I've completed all of the steps except the last (I'm having trouble with that one because the metrics registry is private). But I may be going about this the wrong way:
Define a class that implements the MetricsReporter interface. Build a list of the metrics that Kafka creates in the metricChange method (i.e. whenever this method is called, update a hashmap with the currently registered metrics).
Specify this class in the metric.reporters configuration property
Set up a process that polls the Kafka Streams metric registry for the current data, and ship the values to an arbitrary end point
Anyways, the last step doesn't appear to be possible in Kafka 0.10.0.1 since the metrics registry isn't exposed. Could someone please let me know if this is the correct workflow (it sounds like it's not...), or whether I am misunderstanding the process for extracting the Kafka Streams metrics?

Although the metrics registry is not exposed, you can still get the value of a given KafkaMetric by calling its value() method. For example, as you observed in the JMXReporter, it keeps the list of KafkaMetrics handed to it via the init() and metricChange/metricRemoval methods, and then, in its MBean implementation, when getAttribute is called it calls the corresponding KafkaMetric.value() function. You can apply the same pattern in your customized reporter: for example, periodically poll value() on all of the KafkaMetrics you keep and then pipe the results to your end point.
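As a minimal sketch of that pattern (assuming a 0.10.x-era client where KafkaMetric.value() is still the way to read a metric; the reporter name, the 30-second poll interval and the send() destination are placeholders):

import org.apache.kafka.common.MetricName;
import org.apache.kafka.common.metrics.KafkaMetric;
import org.apache.kafka.common.metrics.MetricsReporter;

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Keeps its own map of the registered metrics and periodically polls their
// values, mirroring what the built-in JMXReporter does behind its MBeans.
public class PollingReporter implements MetricsReporter {

    private final Map<MetricName, KafkaMetric> metrics = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    @Override
    public void init(final List<KafkaMetric> initialMetrics) {
        initialMetrics.forEach(m -> metrics.put(m.metricName(), m));
        scheduler.scheduleAtFixedRate(this::publish, 30, 30, TimeUnit.SECONDS);
    }

    @Override
    public void metricChange(final KafkaMetric metric) {
        metrics.put(metric.metricName(), metric);
    }

    @Override
    public void metricRemoval(final KafkaMetric metric) {
        metrics.remove(metric.metricName());
    }

    private void publish() {
        // Poll the current value of every tracked metric and ship it to the end point.
        metrics.forEach((name, metric) ->
            send(name.group() + ":" + name.name(), metric.value()));  // value() in 0.10.x; metricValue() in newer clients
    }

    private void send(final String key, final double value) {
        // placeholder: HTTP POST, StatsD, a time-series DB, etc.
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }

    @Override
    public void configure(final Map<String, ?> configs) { }
}

You would then register the class via the metric.reporters property, e.g. props.put(StreamsConfig.METRIC_REPORTER_CLASSES_CONFIG, PollingReporter.class.getName()).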

The MetricsReporter interface in org.apache.kafka.common.metrics already enables you to manage all Kafka Streams metrics in the reporter, so the Kafka-internal registry is not needed.

Related

How to define max.poll.records (SCS with Kafka) over containers

I'm trying to figure out the poll records mechanism for Kafka over SCS in a K8s environment.
What is the recommended way to control max.poll.records?
How can I poll the defined value?
Is it possible to define it once for all channels and then override for a specific channel?
(referring to this comment from the documentation):
To avoid repetition, Spring Cloud Stream supports setting values for all channels, in the format of spring.cloud.stream.kafka.default.consumer.<property>=<value>. The following properties are available for Kafka consumers only and must be prefixed with spring.cloud.stream.kafka.bindings.<channelName>.consumer.
Is this path supported: spring.cloud.stream.binding.<channel name>.consumer.configuration?
Is this: spring.cloud.stream.**kafka**.binding.<channel name>.consumer.configuration?
How are conflicts resolved, say in a case where both spring.cloud.stream.binding... and spring.cloud.stream.**kafka**.binding... are set?
I've tried all of the mentioned configurations, but couldn't see in the log what the actual max.poll.records value is, and frankly the documentation is not entirely clear on the subject.
These are the configurations:
spring.cloud.stream.kafka.default.consumer.configuration.max.poll.records - the default if nothing else is specified for a given channel
spring.cloud.stream.kafka.bindings.<channelName>.consumer.configuration.max.poll.records - the override for a specific channel
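As a minimal application.properties sketch (the binding name "input" is just an example), where the per-binding value overrides the default:
spring.cloud.stream.kafka.default.consumer.configuration.max.poll.records=500
spring.cloud.stream.kafka.bindings.input.consumer.configuration.max.poll.records=100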

Spring Kafka Consumer Configs - default values and at least once semantics

I am writing a Kafka consumer using the spring-kafka template.
When instantiating consumers, Spring Kafka takes in parameters like the following.
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
props.put(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, fetchMaxBytes);
props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, maxPartitionFetchBytes);
I read the documentation and it looks like there are lots of other parameters that can be passed as consumer configs too. Interestingly, each of these parameters has a default value.
My questions are:
On what basis were these default values arrived at?
Will there be a real need to change these values, and if so, what would they be? (IMHO, this is on a case-by-case basis, but I would still like to hear it from experts.)
The delivery semantics we have are at-least-once. So, for this (at-least-once) delivery semantic, should these be left untouched, and would it still process a high volume of data?
Any pointers or answers would be of great help in clarifying my doubts.
The default values are an attempt to serve most of the use cases around Kafka. However, it would be an illusion to assume that this many different configurations can be set to serve all use cases.
A good starting point for understanding the default values is the plain-Kafka ConsumerConfiguration and, for Spring, its own documentation. In the Confluent docs you will also find an "Importance" for each configuration. If this importance is set to high, it is recommended to really think about that setting. I have given some more background on the importance levels here.
at-least-once
For at-least-once semantics you want to control the commits of the consumed messages. For this, enable.auto.commit needs to be set to false (which is the default value since spring-kafka version 2.3). In addition, the AckMode defaults to BATCH, which is the basis for at-least-once semantics.
So, depending on your Spring version it looks like you can leave the default configuration to achieve at-least-once semantics.
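As a minimal sketch that makes the relevant settings explicit (they match the defaults of recent spring-kafka versions, 2.3+; the broker address and group id are placeholders):

import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.DefaultKafkaConsumerFactory;
import org.springframework.kafka.listener.ContainerProperties;

@Configuration
public class AtLeastOnceConsumerConfig {

    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory() {
        Map<String, Object> props = new HashMap<>();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");                  // assumed group id
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);             // let the container commit offsets
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(new DefaultKafkaConsumerFactory<>(props));
        // BATCH is already the default: offsets are committed after the listener
        // returns normally for each batch of records, i.e. at-least-once.
        factory.getContainerProperties().setAckMode(ContainerProperties.AckMode.BATCH);
        return factory;
    }
}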

add record level custom latency metric in kafka streams

I'm trying to add a specific metric to my kafka-streams application that will measure latency and report it to JMX.
I'm using the Streams DSL in Scala, so using the Processor API for metrics (which I know is possible) will not work for me.
The basic things I would like to understand are:
how to extract specific record properties (i.e. headers) to use as part of the metric calculation
how to add the new metric to the metrics reported to JMX
Thanks!
You will need to fall back to the Processor API to access record metadata like headers and to register custom metrics.
Note though, that you can mix-and-match the DSL and the Processor API, so it's not necessary to move off the DSL. Instead, you can plug in custom Processors or Transformers via KStream.process() or KStream.transform() (note that there are multiple "siblings" of transform() that you might want to use instead).
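As a minimal sketch of such a pass-through Transformer (in Java for brevity; the "created-at" header name, its long encoding, and the sensor/metric names are assumptions):

import java.nio.ByteBuffer;
import java.util.Collections;

import org.apache.kafka.common.MetricName;
import org.apache.kafka.common.header.Header;
import org.apache.kafka.common.metrics.Sensor;
import org.apache.kafka.common.metrics.stats.Avg;
import org.apache.kafka.common.metrics.stats.Max;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;

// Reads a "created-at" header, records the observed latency on a custom
// sensor registered with the Streams metrics, and forwards the record unchanged.
// The sensor shows up in JMX alongside the built-in metrics.
public class LatencyTransformer<K, V> implements Transformer<K, V, KeyValue<K, V>> {

    private ProcessorContext context;
    private Sensor latencySensor;

    @Override
    public void init(final ProcessorContext context) {
        this.context = context;
        latencySensor = context.metrics().addSensor("record-e2e-latency", Sensor.RecordingLevel.INFO);
        latencySensor.add(new MetricName("record-e2e-latency-avg", "custom-metrics",
            "Average time from the created-at header to processing", Collections.emptyMap()), new Avg());
        latencySensor.add(new MetricName("record-e2e-latency-max", "custom-metrics",
            "Maximum time from the created-at header to processing", Collections.emptyMap()), new Max());
    }

    @Override
    public KeyValue<K, V> transform(final K key, final V value) {
        final Header header = context.headers().lastHeader("created-at");
        if (header != null) {
            final long createdAtMs = ByteBuffer.wrap(header.value()).getLong();
            latencySensor.record(System.currentTimeMillis() - createdAtMs);
        }
        return KeyValue.pair(key, value);  // pass the record through unchanged
    }

    @Override
    public void close() { }
}

You would wire it in with something like stream.transform(() -> new LatencyTransformer<>()); the Scala DSL exposes the same transform() method.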

Avro messages within Avro messages: reasonable?

I want to do something crazy with Kafka and avro. Someone talk me off the ledge:
record Bundle {
string key;
array<bytes> msgs;
}
Producers individually serialize a bunch of messages that share a key, then serialize a bundle and post to a topic.
A generic Flattener service is configured by startup parameters to listen to 1...n kafka topics containing bundles, then blindly forward the bundled messages to configured output topics one at a time. (Blindly meaning it takes the bytes from the array and puts them on the wire.)
Use case:
I have services that respond to small operations (update record, delete record, etc.). At times, I want batches of ops that are guaranteed not to be interleaved with other ops for the same key.
To accomplish this, my thought was to position a Flattener in front of each of the services in question. Normal, one-off commands get stored in 1-item bundles; true batches are bundled into bigger ones.
I don't use a specific field type for the inner messages, because I'd like to be able to re-use the Flattener all over the place.
Does this make any sense at all? Potential drawbacks?
EDIT:
Each instance of the Flattener service would only be delivering messages of types known to the ultimate consumers, with schema_ids embedded in them.
The only reason array is not an array of a specific type is that I'd like to be able to re-use Flattener unchanged in front of multiple different services (just started with different environment variables / command line parameters).
I'm going to move my comment to an answer because I think it's reasonable to "talk you off the ledge" ;)
If you set up a Producer<String, GenericRecord> (change the Avro class as you wish), you already have a String key and Avro bytes as the value. This way, you won't need to embed anything.
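A minimal sketch of that setup, assuming Confluent's KafkaAvroSerializer and Schema Registry; the schema, topic name and key are placeholders:

import java.util.Properties;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AvroProducerExample {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");

        // Placeholder schema for a single "op" message.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Op\",\"fields\":[{\"name\":\"action\",\"type\":\"string\"}]}");

        try (Producer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            GenericRecord op = new GenericData.Record(schema);
            op.put("action", "update");
            // All records sharing the same key go to the same partition, so their
            // relative order is preserved without any explicit bundling.
            producer.send(new ProducerRecord<>("ops-topic", "record-key-123", op));
        }
    }
}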

Apache Flink: changing state parameters at runtime from outside

I'm currently working on a streaming ML pipeline and need exactly-once event processing. I was interested in Flink, but I'm wondering if there is any way to alter/update the execution state from the outside.
The ML algorithm state is kept by Flink and that's OK, but considering that I'd like to change some execution parameters at runtime, I cannot find a viable solution. Basically an external webapp (in Go) is used to tune the parameters and changes should reflect in Flink for the subsequent events.
I thought about:
a shared Redis with pub/sub (as polling for each event would kill throughput)
writing a custom solution in Go :D
...
The state would be kept by key, related to the source of one of the multiple event streams coming in from Kafka.
Thanks
You could use a CoMapFunction/CoFlatMapFunction to achieve what you described. One of the inputs is the normal data input, and on the other input you receive state-changing commands. These could most easily be ingested via a dedicated Kafka topic.
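A minimal sketch of such a CoFlatMapFunction (both streams must be keyed by the same key before connect(); the Event/ParamUpdate POJOs and the default threshold are placeholders):

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.util.Collector;

public class TunableScorer extends RichCoFlatMapFunction<TunableScorer.Event, TunableScorer.ParamUpdate, String> {

    public static class Event { public String key; public double feature; }
    public static class ParamUpdate { public String key; public double threshold; }

    // Keyed state holding the current tuning parameter for this key.
    private transient ValueState<Double> threshold;

    @Override
    public void open(Configuration parameters) {
        threshold = getRuntimeContext().getState(
            new ValueStateDescriptor<>("threshold", Double.class));
    }

    // Input 1: the normal event stream.
    @Override
    public void flatMap1(Event event, Collector<String> out) throws Exception {
        Double current = threshold.value();
        double t = current != null ? current : 0.5;  // assumed default
        out.collect(event.key + (event.feature > t ? " -> anomaly" : " -> ok"));
    }

    // Input 2: parameter updates coming from the external webapp via Kafka.
    @Override
    public void flatMap2(ParamUpdate update, Collector<String> out) throws Exception {
        threshold.update(update.threshold);
    }
}

Wiring would then look like events.keyBy(e -> e.key).connect(updates.keyBy(u -> u.key)).flatMap(new TunableScorer()).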