Pause message consumption and resume after time interval - apache-kafka

We are using spring-cloud-stream to work with Kafka, and I need to add an interval between fetches of data by the consumer from a single topic.
batch-mode is already set to true, and fetch-max-wait, fetch-min-size, and max-poll-records are already tuned.
Should I do something with idleEventInterval, or is that the wrong way?

You can pause/resume the container as needed (avoiding a rebalance).
See https://docs.spring.io/spring-cloud-stream/docs/3.2.1/reference/html/spring-cloud-stream.html#binding_visualization_control
Since version 3.1 we expose org.springframework.cloud.stream.binding.BindingsLifecycleController, which is registered as a bean and, once injected, can be used to control the lifecycle of individual bindings.
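A minimal sketch of using that controller to pause and resume a binding (the binding name "myInput-in-0" and the class name are placeholders for whatever your application actually uses):

import org.springframework.cloud.stream.binding.BindingsLifecycleController;
import org.springframework.stereotype.Component;

@Component
public class ConsumptionThrottler {

    private final BindingsLifecycleController controller;

    public ConsumptionThrottler(BindingsLifecycleController controller) {
        this.controller = controller;
    }

    // Pause the listener container behind the binding; the consumer stays in the group, so no rebalance
    public void pause() {
        controller.changeState("myInput-in-0", BindingsLifecycleController.State.PAUSED);
    }

    // Resume consumption later, e.g. from a scheduled task
    public void resume() {
        controller.changeState("myInput-in-0", BindingsLifecycleController.State.RESUMED);
    }
}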
If you only want to delay for a short time, you can set the container's idleBetweenPolls property, using a ListenerContainerCustomizer bean.
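A minimal sketch of such a customizer bean for the Kafka binder (the 5-second value is only illustrative):

import org.springframework.cloud.stream.config.ListenerContainerCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.listener.AbstractMessageListenerContainer;

@Configuration
public class ContainerCustomizerConfig {

    @Bean
    public ListenerContainerCustomizer<AbstractMessageListenerContainer<?, ?>> idleBetweenPollsCustomizer() {
        // Wait 5 seconds between polls for every binding; inspect destinationName/group to target a single binding
        return (container, destinationName, group) ->
                container.getContainerProperties().setIdleBetweenPolls(5_000L);
    }
}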

Related

How to configure druid properly to fire a periodic kill task

I have been trying to get druid to fire a kill task periodically to clean up unused segments.
These are the configuration variables responsible for it:
druid.coordinator.kill.on=true
druid.coordinator.kill.period=PT45M
druid.coordinator.kill.durationToRetain=PT45M
druid.coordinator.kill.maxSegments=10
From the above configuration, my mental model is that once ingested data is marked unused, a kill task will fire and delete the segments that are older than 45 minutes while retaining 45 minutes' worth of data. period and durationToRetain are the config vars that confuse me; I'm not quite sure how to leverage them. Any help would be appreciated.
The caveat for druid.coordinator.kill.on=true is that segments are deleted only from whitelisted datasources, and the whitelist is empty by default.
To populate the whitelist with all datasources, set killAllDataSources to true. Once I did that, the kill task fired as expected and deleted the segments from S3 (COS). This was tested on Druid version 0.18.1.
Now, while the above configuration properties can be set when you build your image, killAllDataSources needs to be set through an API; it can also be set via the Druid UI.
When you click the option, a modal appears that has Kill All Data Sources. Click on True and you should see a kill task firing at the specified interval (under Ingestion ---> Tasks). It would be really nice to have this as part of runtime.properties or some other common configuration file whose value we can set when building the Druid image.
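If you prefer to automate this rather than click through the UI, here is a rough sketch of posting the coordinator dynamic configuration over HTTP. It assumes the standard coordinator dynamic-config endpoint and a coordinator at localhost:8081; adjust the host, port, and any authentication to your deployment, and note that this overwrites the whole dynamic config, so include any other fields you rely on:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SetKillAllDataSources {
    public static void main(String[] args) throws Exception {
        // Body of the coordinator dynamic configuration with killAllDataSources enabled
        String body = "{\"killAllDataSources\": true}";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8081/druid/coordinator/v1/config"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}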
Use crontab; it works quite well for us.
If you want control over segment removal outside of Druid, you must use a scheduled task that runs at your desired interval and registers kill tasks in Druid. This gives you more control over your segments, since once they are gone you cannot recover them. You can use this script as a starting point:
https://github.com/mostafatalebi/druid-kill-task

Is it possible to dynamically adjust the num.stream.threads configuration of kafka stream while the program is running?

I am running multiple Kafka Streams instances in a service. I want to dynamically adjust the num.stream.threads configuration to control the priority of each instance while the program is running.
I didn't find a related method on the KafkaStreams class.
I wonder if there is any other way?
It's not possible to update a KafkaStreams configuration at runtime once the instance has been created (this applies not only to num.stream.threads but to other properties as well).
As a workaround, you could recreate a specific KafkaStreams instance by stopping the existing one and creating and starting a new one, without stopping other streams and without restarting your application. Whether that fits your needs depends on your specific use case.
This could be achieved in several ways. One of them: store configs (like num.stream.threads) in a database per specific Kafka Streams flow, have each instance of your application fetch them from the database (e.g. every 10 minutes via a cron expression), and if any updates are found, stop the existing KafkaStreams and start a new one with the updated configs, as sketched below. If you have a single instance of the application, this could be achieved much more easily via REST.
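A rough sketch of that stop-and-recreate step (the topology, base properties, and thread count are placeholders for whatever your application actually uses):

import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;

public class StreamsRestarter {

    private KafkaStreams streams;

    // Stop the current instance and start a new one with an updated thread count
    public synchronized void restartWithThreads(Topology topology, Properties baseProps, int numThreads) {
        if (streams != null) {
            streams.close();          // blocks until the existing stream threads shut down
        }
        Properties props = new Properties();
        props.putAll(baseProps);
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, numThreads);
        streams = new KafkaStreams(topology, props);
        streams.start();
    }
}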
Update since kafka-streams 2.8.0
Since kafka-streams 2.8.0, you have the ability to add and remove stream threads at runtime, without recreating the stream (API to Start and Shut Down Stream Threads):
kafkaStreams.addStreamThread();
kafkaStreams.removeStreamThread();
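Both methods return an Optional with the name of the thread that was added or removed (empty if the operation could not be performed); a brief usage sketch:

// Adds a thread if the instance state allows it; prints the new thread's name
kafkaStreams.addStreamThread().ifPresent(name -> System.out.println("Added " + name));
// Removes one thread once it has finished its current work; prints the removed thread's name
kafkaStreams.removeStreamThread().ifPresent(name -> System.out.println("Removed " + name));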
That is currently not possible.
If you want to change the number of threads, you need to stop the program with KafkaStreams#close(), create a new KafkaStreams instance with the updated configuration, and start the new instance with KafkaStreams#start().

Kafka Streams - accessing data from the metrics registry

I'm having a difficult time finding documentation on how to access the data within the Kafka Streams metric registry, and I think I may be trying to fit a square peg in a round hole. I was hoping to get some advice on the following:
Goal
Collect metrics being recorded in the Kafka Streams metrics registry and send these values to an arbitrary end point
Workflow
This is what I think needs to be done, and I've completed all of the steps except the last (I'm having trouble with that one because the metrics registry is private). But I may be going about this the wrong way:
Define a class that implements the MetricsReporter interface. Build a list of the metrics that Kafka creates in the metricChange method (e.g. whenever this method is called, update a hashmap with the currently registered metrics).
Specify this class in the metric.reporters configuration property
Set up a process that polls the Kafka Streams metric registry for the current data, and ship the values to an arbitrary end point
Anyway, the last step doesn't appear to be possible in Kafka 0.10.0.1 since the metrics registry isn't exposed. Could someone please let me know if this is the correct workflow (it sounds like it's not...), or if I am misunderstanding the process for extracting the Kafka Streams metrics?
Although the metrics registry is not exposed, you can still get the value of a given KafkaMetric via its KafkaMetric.value() / KafkaMetric.value(timestamp) methods. For example, as you observed in the JMXReporter, it keeps the list of KafkaMetrics from its init() and metricChange/metricRemoval methods, and then in its MBean implementation, when getAttribute is called, it calls the corresponding KafkaMetric.value() function. So for your customized reporter you can apply a similar pattern: for example, periodically poll all of the kept KafkaMetric values and then pipe the results to your end point.
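As a sketch of that pattern against the 0.10.x API (where KafkaMetric.value() is still available); the class name, the 30-second period, and the System.out "endpoint" are placeholders:

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.apache.kafka.common.MetricName;
import org.apache.kafka.common.metrics.KafkaMetric;
import org.apache.kafka.common.metrics.MetricsReporter;

public class PollingMetricsReporter implements MetricsReporter {

    private final Map<MetricName, KafkaMetric> metrics = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    @Override
    public void init(List<KafkaMetric> initialMetrics) {
        initialMetrics.forEach(m -> metrics.put(m.metricName(), m));
        // Poll every 30 seconds and ship the current values to the end point
        scheduler.scheduleAtFixedRate(this::publish, 30, 30, TimeUnit.SECONDS);
    }

    @Override
    public void metricChange(KafkaMetric metric) {
        metrics.put(metric.metricName(), metric);
    }

    @Override
    public void metricRemoval(KafkaMetric metric) {
        metrics.remove(metric.metricName());
    }

    private void publish() {
        metrics.forEach((name, metric) ->
                // Placeholder: replace with a call to your actual end point
                System.out.printf("%s.%s = %f%n", name.group(), name.name(), metric.value()));
    }

    @Override
    public void close() {
        scheduler.shutdown();
    }

    @Override
    public void configure(Map<String, ?> configs) {
        // No custom configuration needed for this sketch
    }
}

You would then register the class (with whatever package you put it in) via the metric.reporters configuration property.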
The MetricsReporter interface in org.apache.kafka.common.metrics already enables you to manage all Kafka Streams metrics in the reporter, so the Kafka internal registry is not needed.

Will Kafka operator semantics change when I rename an Apache Apex application?

Am I correct in assuming that the moment we rename an application, the semantics of the Kafka operator will completely change, and it might end up reading from the "initialOffset" set by the application code?
How are the semantics maintained for a given "application name"?
Does every deploy of the application code result in a new application, or does it simply use the @ApplicationAnnotation(name="") value to define this?
You can always launch the application using -originalAppId; the operator should continue from where the original app left off. If you are using the Kafka 0.9 operator and you launch the application with the same name, you can set the initialOffset to "application_or_latest" or "application_or_earliest", so the operator will continue from the offsets processed in the last run. The difference is that with -originalAppId the offsets are restored from the checkpoint, while the other approach stores the offsets in Kafka itself.
You can launch an application from its previous state by using the -originalAppId parameter and providing the YARN application id from the previous run's checkpointed state; this applies to all operators within the DAG, including the Kafka input operator. You can also provide a new name for the application using the attribute dt.attr.APPLICATION_NAME.
eg:
launch pi-demo-3.4.0-incubating-SNAPSHOT.apa -originalAppId application_1459879799578_8727 -Ddt.attr.APPLICATION_NAME="pidemo v201"

Periodically read zookeeper node

I have a use case where I want to read the ZooKeeper nodes periodically. Is there any way I can do this in ZooKeeper asynchronously?
I read about exists(), but its callback fires only when there is a change in the node data.
If you are using Java with Spring, create a component with a method annotated with @Scheduled. Add a delay, and inside the method put the code that reads the zNode and checks whether the other application is alive.
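A minimal sketch under those assumptions (the connect string, zNode path, and 10-second delay are placeholders; scheduling must be enabled with @EnableScheduling somewhere in the application):

import org.apache.zookeeper.ZooKeeper;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class ZNodeMonitor {

    private final ZooKeeper zk;

    public ZNodeMonitor() throws Exception {
        // Placeholder connect string; the watcher is a no-op because we poll instead of watching
        this.zk = new ZooKeeper("localhost:2181", 30_000, event -> { });
    }

    // Runs every 10 seconds: reads the zNode and reacts to its current content
    @Scheduled(fixedDelay = 10_000)
    public void readNode() throws Exception {
        byte[] data = zk.getData("/my-app/heartbeat", false, null);
        System.out.println("Current node data: " + new String(data));
    }
}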