Is it possible to dynamically adjust the num.stream.threads configuration of Kafka Streams while the program is running? - apache-kafka

I am running multiple KafkaStreams instances in a service. I want to dynamically adjust the num.stream.threads configuration to control the priority of each instance while the program is running.
I didn't find a related method on the KafkaStreams class.
I wonder if there is any other way?

It's not possible to update a KafkaStreams configuration at runtime once the instance has been created (this applies not only to num.stream.threads but to the other properties as well).
As a workaround, you could recreate a specific KafkaStreams instance by stopping the existing one and creating and starting a new one, without stopping the other streams and without restarting your application. Whether that fits your needs depends on your specific use case.
This could be achieved in several ways. One of them: store configs (like num.stream.threads) per specific Kafka Streams flow in a database, have each instance of your application poll the database (e.g. every 10 minutes via a cron expression), and if any updates are found, stop the existing KafkaStreams and start a new one with the desired updated configs. If you have a single instance of the application, this is much easier to do via REST.
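A minimal sketch of that stop-and-recreate approach; the class and method names here are placeholders rather than an established API, and the topology and base properties are assumed to come from your own setup:

import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;

public class RestartableStreamFlow {

    private volatile KafkaStreams streams;
    private final Topology topology;          // topology for this specific flow
    private final Properties baseProps;       // bootstrap servers, application.id, etc.

    public RestartableStreamFlow(Topology topology, Properties baseProps) {
        this.topology = topology;
        this.baseProps = baseProps;
    }

    // Called e.g. from the cron job or a REST endpoint when the desired thread count changes.
    public synchronized void restartWithThreads(int numStreamThreads) {
        if (streams != null) {
            streams.close(Duration.ofSeconds(30));   // stop only this flow, not the whole app
        }
        Properties props = new Properties();
        props.putAll(baseProps);
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, numStreamThreads);
        streams = new KafkaStreams(topology, props);
        streams.start();
    }
}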
Update: since kafka-streams 2.8.0
Since kafka-streams 2.8.0, you have the ability to add and remove stream threads at runtime, without recreating the KafkaStreams instance (API to Start and Shut Down Stream Threads):
kafkaStreams.addStreamThread();
kafkaStreams.removeStreamThread();
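For example, both methods return an Optional with the name of the affected thread (the println calls are just illustration):

Optional<String> added = kafkaStreams.addStreamThread();
added.ifPresent(name -> System.out.println("added stream thread " + name));

Optional<String> removed = kafkaStreams.removeStreamThread();
removed.ifPresent(name -> System.out.println("removed stream thread " + name));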

That is currently not possible.
If you want to change the number of threads, you need to stop the program with KafkaStreams#close(), create a new KafkaStreams instance with the updated configuration, and start the new instance with KafkaStreams#start().

Related

Pause message consumption and resume after time interval

We are using spring-cloud-stream to work with Kafka, and I need to add an interval between fetches by the consumer from a single topic.
batch-mode is already set to true, and fetch-max-wait, fetch-min-size, and max-poll-records are already tuned.
Should I do something with idleEventInterval, or is that the wrong way?
You can pause/resume the container as needed (avoiding a rebalance).
See https://docs.spring.io/spring-cloud-stream/docs/3.2.1/reference/html/spring-cloud-stream.html#binding_visualization_control
Since version 3.1 we expose org.springframework.cloud.stream.binding.BindingsLifecycleController, which is registered as a bean and, once injected, can be used to control the lifecycle of individual bindings.
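A rough sketch of pausing and resuming a consumer binding with that controller; the binding name "process-in-0", the component name, and the interval are placeholders for your own values:

import org.springframework.cloud.stream.binding.BindingsLifecycleController;
import org.springframework.cloud.stream.binding.BindingsLifecycleController.State;
import org.springframework.stereotype.Component;

@Component
public class ConsumerThrottler {

    private final BindingsLifecycleController controller;

    public ConsumerThrottler(BindingsLifecycleController controller) {
        this.controller = controller;
    }

    // Pause the consumer binding, wait for the desired interval, then resume it.
    // "process-in-0" is a placeholder for your actual binding name.
    public void pauseFor(long millis) throws InterruptedException {
        controller.changeState("process-in-0", State.PAUSED);
        Thread.sleep(millis);
        controller.changeState("process-in-0", State.RESUMED);
    }
}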
If you only want to delay for a short time, you can set the container's idleBetweenPolls property, using a ListenerContainerCustomizer bean.
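For example, a customizer bean along these lines (the 5-second value is arbitrary):

import org.springframework.cloud.stream.config.ListenerContainerCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.kafka.listener.AbstractMessageListenerContainer;

@Bean
public ListenerContainerCustomizer<AbstractMessageListenerContainer<?, ?>> idleCustomizer() {
    // wait 5 seconds between successive consumer polls
    return (container, destinationName, group) ->
            container.getContainerProperties().setIdleBetweenPolls(5_000L);
}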

Kafka - Topology change on redundant apps

Let's say I have two applications with the same applicationId "foo-processor" and the following setup:
streamsBuilder.table(fooTopic)
.groupBy(...)
.reduce(...)
Now assume I have some cases I don't want to handle, so I add a filter like this:
streamsBuilder.table(fooTopic)
.filter(...)
.groupBy(...)
.reduce(...)
During deployment, not all instances of the app are shut down and restarted at the same time. Therefore, instance #1 of foo-processor is restarted while instance #2 is still using the previous topology. What happens is that instance #1 will hit this error:
java.lang.IllegalArgumentException: Assigned partition foo-processor-KTABLE-REDUCE-STATE-STORE-0000000006-repartition-2 for non-subscribed topic regex pattern; subscription pattern is foo-processor-KTABLE-REDUCE-STATE-STORE-0000000007-repartition|<topic>
I assume this is the expected behaviour, because the repartition topic might not contain the same events given the different topology. That being said, I am wondering how I should handle a change in topology.
Does that mean the application is different, so the applicationId should also change? If not, how should I handle topology changes when many instances of the same app are running?
Thanks!
If you want to change the topology, you need to use a new application.id -- running both in parallel with the same application.id is not supported.
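For example (versioning the id is just one naming convention, not something prescribed by Kafka Streams):

// the new topology gets its own consumer group and its own internal
// changelog/repartition topics, all prefixed with the new application.id
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "foo-processor-v2");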

How to use DataFrames within SparkListener?

I've written a CustomListener (deriving from SparkListener, etc...) and it works fine; I can intercept the metrics.
The question is about using DataFrames within the listener itself, since that assumes using the same SparkContext; however, as of 2.1.x there is only one context per JVM.
Suppose I want to write some metrics to disk as JSON. Doing it at ApplicationEnd is not possible, only at the last jobEnd (if you have several jobs, the last one).
Is that possible/feasible?
I'm trying to measure the performance of jobs/stages/tasks, record that, and then analyze it programmatically. Maybe that is not the best way? The Web UI is good, but I need to make things presentable.
I can force the creation of DataFrames upon the jobEnd event; however, a few errors are thrown (basically they refer to not being able to propagate events to the listener), and in general I would like to avoid unnecessary manipulations. I want a clean set of measurements that I can record and write to disk.
SparkListeners should be as fast as possible, as a slow SparkListener would block the others from receiving events. You could use separate threads to release the main event dispatcher thread, but you're still bound to the limitation of having a single SparkContext per JVM.
That limitation is, however, easy to overcome, since you can ask for the current SparkContext using SparkContext.getOrCreate.
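A rough sketch of that idea in Java: the listener grabs the already-running session/context and appends one metrics row as JSON. The class name, job-group id, and output path are placeholders, and the job-group bookkeeping is there only because the write itself triggers a Spark job whose own job-end event must be ignored:

import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.spark.scheduler.SparkListener;
import org.apache.spark.scheduler.SparkListenerJobEnd;
import org.apache.spark.scheduler.SparkListenerJobStart;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MetricsListener extends SparkListener {

    // job ids of the writes we trigger ourselves, so their own job-end events are skipped
    private final Set<Integer> ownJobs = ConcurrentHashMap.newKeySet();

    @Override
    public void onJobStart(SparkListenerJobStart jobStart) {
        if (jobStart.properties() != null
                && "metrics-writer".equals(jobStart.properties().getProperty("spark.jobGroup.id"))) {
            ownJobs.add(jobStart.jobId());
        }
    }

    @Override
    public void onJobEnd(SparkListenerJobEnd jobEnd) {
        if (ownJobs.remove(jobEnd.jobId())) {
            return; // this event came from one of our own metric writes
        }
        // getOrCreate() hands back the session/context already running in this JVM
        SparkSession spark = SparkSession.builder().getOrCreate();
        spark.sparkContext().setJobGroup("metrics-writer", "persist job metrics", false);
        try {
            String json = "{\"jobId\":" + jobEnd.jobId() + ",\"endTime\":" + jobEnd.time() + "}";
            Dataset<Row> metrics = spark.read().json(
                    spark.createDataset(Collections.singletonList(json), Encoders.STRING()));
            metrics.write().mode("append").json("/tmp/job-metrics");
        } finally {
            spark.sparkContext().clearJobGroup();
        }
    }
}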
I would, however, not recommend this architecture. It puts too much pressure on the driver's JVM, which should rather "focus" on the application processing (not on collecting events, which it probably already does for the web UI and/or the Spark History Server).
I'd rather use Kafka, Cassandra, or some other persistent storage to store the events, and have another processing application consume them (much like the Spark History Server works).

Apache Flink: changing state parameters at runtime from outside

I'm currently working on a streaming ML pipeline and need exactly-once event processing. I am interested in Flink, but I'm wondering if there is any way to alter/update the execution state from outside.
The ML algorithm state is kept by Flink and that's fine, but since I'd like to change some execution parameters at runtime, I cannot find a viable solution. Basically, an external web app (in Go) is used to tune the parameters, and the changes should be reflected in Flink for subsequent events.
I thought about:
a shared Redis with pub/sub (as polling for each event would kill throughput)
writing a custom solution in Go :D
...
The state would be kept by key, related to the source of one of the multiple event streams coming in from Kafka.
Thanks
You could use a CoMapFunction/CoFlatMapFunction to achieve what you described. One of the inputs is the normal data input, and on the other input you receive state-changing commands. These could most easily be ingested via a dedicated Kafka topic.
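A sketch of that pattern using a RichCoFlatMapFunction with keyed state. The string-encoded events/commands, the key names, and the threshold parameter are placeholders for your actual types; in a real job both streams would come from Kafka sources rather than in-memory elements:

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.util.Collector;

public class ParameterizedPipeline {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // events and commands encoded as "key,value" strings purely for brevity
        DataStream<String> events = env.fromElements("sensor1,0.7", "sensor1,0.9", "sensor2,0.3");
        DataStream<String> commands = env.fromElements("sensor1,0.8");   // "key,newThreshold"

        events.keyBy(line -> line.split(",")[0])
              .connect(commands.keyBy(line -> line.split(",")[0]))
              .flatMap(new ParameterizedScorer())
              .print();

        env.execute("parameterized-pipeline");
    }

    // Keeps one runtime parameter (a threshold) per key; flatMap2 updates it, flatMap1 applies it.
    static class ParameterizedScorer extends RichCoFlatMapFunction<String, String, String> {

        private transient ValueState<Double> threshold;

        @Override
        public void open(Configuration parameters) {
            threshold = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("threshold", Double.class));
        }

        @Override
        public void flatMap1(String event, Collector<String> out) throws Exception {
            double t = threshold.value() == null ? 0.5 : threshold.value();   // default parameter
            double value = Double.parseDouble(event.split(",")[1]);
            out.collect(event + (value > t ? " ABOVE" : " below") + " threshold " + t);
        }

        @Override
        public void flatMap2(String command, Collector<String> out) throws Exception {
            // control path: update the execution parameter for this key at runtime
            threshold.update(Double.parseDouble(command.split(",")[1]));
        }
    }
}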

Storm dynamic topology

Does Storm support dynamic topologies? The functionality I want is to dynamically change the topology according to user requirements while the Storm topology is running. For example, when the user wants to know the top-10 words of a stream, I use the top-10 bolt to process it; when the user wants to know something else, I use another bolt to process the stream and 'unplug' the top-10 bolt.
I know it could be done by partitioning or duplicating the stream, always running every functionality and only showing the data we want, or we could shut down the topology and deploy an updated one, but is there a 'hot plug-in' way to do that?
You can't dynamically change a Storm topology's structure, i.e. modify the wiring of its spouts and bolts. A Storm topology's wiring is always static.
However, you could implement the needed functionality in the other ways you already described. IMHO, the best, most logical way would be to run multiple topologies, in case the data processing differs greatly. But if most of the processing is similar in both cases, just duplicate the source stream and process the data in different branches of the same topology.
It was added in STORM-561, on 03/Jun/15:
https://issues.apache.org/jira/browse/STORM-561
There is no built-in way to do this (switch out one bolt for another), but what you can do is write a bolt that executes arbitrary code based on the input it receives. So long as your input and output have the same structure in Storm (the same tuples emitted), you could theoretically execute whatever you want at runtime in your bolt. This is especially easy if you build your bolt in Clojure, but it's possible in essentially every language you can use with Storm.
However, this probably doesn't make a lot of sense as most computations you'll want to do involve more than one bolt and lend themselves to passing differently structured tuples around. As schiavuzzi already said in their answer, you're probably better off running multiple topologies if there are multiple, independent computations you'd like to do to a stream.
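To illustrate the "arbitrary code in one bolt" idea, a sketch of a bolt whose emitted tuple structure stays fixed while the per-tuple computation is switched by control tuples. The field names, the "command" control stream, and the two example operations are assumptions, and the imports target the newer org.apache.storm packages:

import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class SwitchableBolt extends BaseRichBolt {

    private OutputCollector collector;
    private volatile String mode = "upper";   // current behaviour, changed by control tuples

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        if (input.contains("command")) {
            // control tuple from a separate control spout/stream: switch behaviour
            mode = input.getStringByField("command");
        } else {
            String word = input.getStringByField("word");
            String result = "upper".equals(mode) ? word.toUpperCase() : String.valueOf(word.length());
            collector.emit(new Values(word, result));   // output structure never changes
        }
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "result"));
    }
}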
For hot deployment there is a streaming platform from eBay, JetStream: https://github.com/pulsarIO/jetstream.
It has a built-in config management tool, and your config sits in MongoDB. When a user modifies a config bean, the tool publishes a notification to ZooKeeper, and the corresponding JetStream applications are notified and change their config dynamically.