How do you set global properties for child tasks in Spring Cloud Dataflow using Stream - Task-Launcher-Dataflow

I have a stream with an HTTP source, a custom processor and a Task-Launcher-Dataflow sink.
I have a composed task that gets called from the stream via task-launcher-dataflow.
I pass several properties from the stream processor to the task-launcher-dataflow and on to the child tasks, for example:
"deploymentProps":{"app.composedtask-filecopy2.prescript.scriptFile":"/source/prescript.sh"}
This works fine, but I have reached the maximum character limit and get a SQL exception on the composed-task pod stating I have exceeded the 2500-character limit. I would like to set properties for all tasks at once instead of individually to save characters, but using a wildcard character doesn't work:
"deploymentProps":{"app.composedtask-filecopy2.prescript.*":"/source/prescript.sh"}
Is there a way to set properties for all tasks instead of having to set them individually?
I tried setting these in the Spring Cloud Dataflow server ConfigMap in Kubernetes, specifically for the imagePullPolicy, but so far this hasn't worked.
Any help would be appreciated.

If you are using SCDF 2.8.x, you can try the following deployer property: deployer.*.kubernetes.image-pull-policy=Always
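Here is a minimal sketch (not the sink's official payload contract) of how the custom processor might assemble the launch request with that wildcard deployer property, so one entry covers every child task; the "name" field and the task name are assumptions based on the JSON fragments in the question.

import java.util.HashMap;
import java.util.Map;

public class LaunchRequestBuilder {

    // Builds the structure that is serialized into the JSON payload sent to
    // the task-launcher-dataflow sink.
    public Map<String, Object> buildRequest() {
        Map<String, String> deploymentProps = new HashMap<>();
        // Wildcard deployer property: a single entry applies to all child tasks
        deploymentProps.put("deployer.*.kubernetes.image-pull-policy", "Always");
        // The per-task app property from the question, unchanged
        deploymentProps.put("app.composedtask-filecopy2.prescript.scriptFile",
                "/source/prescript.sh");

        Map<String, Object> request = new HashMap<>();
        request.put("name", "composedtask-filecopy2"); // task definition name (assumed)
        request.put("deploymentProps", deploymentProps);
        return request;
    }
}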

Related

How to define max.poll.records (SCS with Kafka) over containers

I'm trying to figure out the poll records mechanism for Kafka over SCS in a K8s environment.
What is the recommended way to control max.poll.records?
How can I poll the defined value?
Is it possible to define it once for all channels and then override for a specific channel?
(referring to this excerpt from the documentation):
"To avoid repetition, Spring Cloud Stream supports setting values for all channels, in the format of spring.cloud.stream.kafka.default.consumer.<property>=<value>. The following properties are available for Kafka consumers only and must be prefixed with spring.cloud.stream.kafka.bindings.<channelName>.consumer."
Is this path supported: spring.cloud.stream.binding.<channel name>.consumer.configuration?
Or this one: spring.cloud.stream.kafka.binding.<channel name>.consumer.configuration?
How are conflicts resolved, say when both spring.cloud.stream.binding... and spring.cloud.stream.kafka.binding... are set?
I've tried all the mentioned configurations, but I couldn't see in the log what the actual max.poll.records is, and frankly the documentation is not entirely clear on the subject.
These are the configurations:
spring.cloud.stream.kafka.default.consumer.configuration.max.poll.records - the default if nothing else is specified for a given channel
spring.cloud.stream.kafka.bindings.<channel name>.consumer.configuration.max.poll.records - the override for a specific channel
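As a concrete illustration, the two levels can be set together and the binding-specific entry wins over the binder-wide default; the binding name "input" below is hypothetical. The effective value can also be checked in the startup log: the Kafka client prints its full "ConsumerConfig values", including max.poll.records, when each consumer is created.

import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.boot.builder.SpringApplicationBuilder;

@SpringBootApplication
public class PollRecordsDemo {

    public static void main(String[] args) {
        new SpringApplicationBuilder(PollRecordsDemo.class)
                .properties(
                        // binder-wide default for every Kafka consumer binding
                        "spring.cloud.stream.kafka.default.consumer.configuration.max.poll.records=250",
                        // per-binding override; "input" is a hypothetical binding name
                        "spring.cloud.stream.kafka.bindings.input.consumer.configuration.max.poll.records=50")
                .run(args);
    }
}

The same two lines would normally live in application.yml or application.properties; the programmatic form is used here only to keep the sketch self-contained.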

Pause message consumption and resume after time interval

We are using spring-cloud-stream to work with Kafka, and I need to add some interval between the consumer fetching data from a single topic.
batch-mode is already set to true, and fetch-max-wait, fetch-min-size and max-poll-records are already tuned.
Should I do something with idleEventInterval, or is that the wrong way?
You can pause/resume the container as needed (avoiding a rebalance).
See https://docs.spring.io/spring-cloud-stream/docs/3.2.1/reference/html/spring-cloud-stream.html#binding_visualization_control
Since version 3.1 we expose org.springframework.cloud.stream.binding.BindingsLifecycleController, which is registered as a bean and, once injected, can be used to control the lifecycle of individual bindings.
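A minimal sketch of that approach, based on the changeState API shown in the reference documentation; the binding name "process-in-0" is hypothetical, and the caller (for example a scheduled task) decides when to pause and resume.

import org.springframework.cloud.stream.binding.BindingsLifecycleController;
import org.springframework.cloud.stream.binding.BindingsLifecycleController.State;
import org.springframework.stereotype.Component;

@Component
public class ConsumptionPacer {

    private final BindingsLifecycleController controller;

    public ConsumptionPacer(BindingsLifecycleController controller) {
        this.controller = controller;
    }

    // Pause the consumer binding without triggering a rebalance.
    public void pause() {
        controller.changeState("process-in-0", State.PAUSED);
    }

    // Resume consumption once the desired interval has elapsed.
    public void resume() {
        controller.changeState("process-in-0", State.RESUMED);
    }
}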
If you only want to delay for a short time, you can set the container's idleBetweenPolls property, using a ListenerContainerCustomizer bean.
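And a sketch of the idleBetweenPolls route, assuming the Kafka binder; the 5-second delay is only an example value.

import org.springframework.cloud.stream.config.ListenerContainerCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.listener.AbstractMessageListenerContainer;

@Configuration
public class IdleBetweenPollsConfig {

    // Applies a fixed delay between polls to every Kafka binding's listener container.
    @Bean
    public ListenerContainerCustomizer<AbstractMessageListenerContainer<?, ?>> idleBetweenPollsCustomizer() {
        return (container, destinationName, group) ->
                container.getContainerProperties().setIdleBetweenPolls(5_000L);
    }
}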

How do you set global environment variables for child tasks in Spring Cloud Dataflow using Stream - Task-Launcher

I have a stream with an HTTP source, a custom processor and a Task-Launcher sink in Spring Cloud Dataflow.
I have a composed task that gets called from the stream via task-launcher.
I pass several properties from the stream processor to the task-launcher and on to the child tasks, for example the Kubernetes pull policy, which I want set to Always for all the child tasks:
"deploymentProps":{"deployer.*.kubernetes.imagePullPolicy":"Always"}
Passing properties for the pull policy and volumes/volume mounts works as expected.
I also want to pass an environment variable to the child tasks for Spring to pick up. I have tried the following, which does not work:
"deploymentProps":{"deployer.*.kubernetes.environment-variables":"SCDF_ACTIVE_PROFILE=prod"}
It does not seem to make it to the pod in Kubernetes. Is there another way to get environment variables to the child tasks, or is there something wrong with this approach?
Any help would be appreciated.
Try it in this format: deployer.*.kubernetes.environment-variables=SCDF_ACTIVE_PROFILE=prod.
There is a little more information about it here:
https://docs.spring.io/spring-cloud-dataflow/docs/current/reference/htmlsingle/#_environment_variables
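For completeness, a sketch of the deployment-properties map the processor could attach to the launch request; the Kubernetes deployer takes multiple variables as a single comma-delimited KEY=VALUE list, and the second variable below is purely illustrative.

import java.util.HashMap;
import java.util.Map;

public class ChildTaskDeploymentProps {

    // These entries end up under "deploymentProps" in the launch request.
    public static Map<String, String> build() {
        Map<String, String> props = new HashMap<>();
        // One wildcard entry pushes the variables to every child task's pod;
        // ANOTHER_VAR is only here to show the comma-delimited format.
        props.put("deployer.*.kubernetes.environment-variables",
                "SCDF_ACTIVE_PROFILE=prod,ANOTHER_VAR=example");
        props.put("deployer.*.kubernetes.image-pull-policy", "Always");
        return props;
    }
}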

How to configure druid properly to fire a periodic kill task

I have been trying to get Druid to fire a kill task periodically to clean up unused segments.
These are the configuration variables responsible for it:
druid.coordinator.kill.on=true
druid.coordinator.kill.period=PT45M
druid.coordinator.kill.durationToRetain=PT45M
druid.coordinator.kill.maxSegments=10
From the above configuration, my mental model is: once ingested data is marked unused, the kill task will fire and delete the segments that are older than 45 minutes, while retaining 45 minutes' worth of data. period and durationToRetain are the config vars that are confusing me; I am not quite sure how to leverage them. Any help would be appreciated.
The caveat with druid.coordinator.kill.on=true is that segments are only deleted from whitelisted datasources, and the whitelist is empty by default.
To cover all datasources, set killAllDataSources to true. Once I did that, the kill task fired as expected and deleted the segments from S3 (COS). This was tested on Druid version 0.18.1.
Now, while the configuration properties above can be set when you build your image, killAllDataSources needs to be set through an API. It can also be set via the Druid UI.
When you click the option, a modal appears that contains Kill All Data Sources. Click True and you should see a kill task (under Ingestion -> Tasks) firing at the specified interval. It would be really nice to have this as part of runtime.properties or some sort of common configuration file whose value we can set when building the Druid image.
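A sketch of setting it through the API, assuming the coordinator's dynamic-configuration endpoint on the default port 8081; the host is a placeholder, and since posting replaces the dynamic configuration you may want to GET the current config first and merge.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DruidKillAllDataSources {

    public static void main(String[] args) throws Exception {
        // Enables kill tasks for every datasource (Druid 0.18.x dynamic config).
        String body = "{\"killAllDataSources\": true}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://coordinator-host:8081/druid/coordinator/v1/config"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Coordinator responded with HTTP " + response.statusCode());
    }
}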
Use crontab; it works quite well for us.
If you want control over segment removal from outside Druid, you must use a scheduled task that runs at your desired interval and registers kill tasks in Druid. This gives you more control over your segments, since once they are gone you cannot recover them. You can use this script to help you:
https://github.com/mostafatalebi/druid-kill-task

Spring Batch Integration - Multiple files as single Message

In the sample https://github.com/ghillert/spring-batch-integration-sample, the file inbound adapter is configured to poll a directory, and once a file is placed in that directory a FileMessageToJobRequest is constructed and the Spring Batch job is launched.
So for each file, a new FileMessageToJobRequest is constructed and a new Spring Batch job instance is created.
We also want to configure a file inbound adapter to poll for files, but we want to process all the files with a single batch job instance.
For example, if we place 1000 files in the directory and set max-messages-per-poll to 1000, we want to send the names of the 1000 files as one of the parameters to the Spring Batch job instead of calling the job 1000 times.
Is there a way to send the list of files that the file inbound adapter picked up during one poll as a single message to the subsequent Spring components?
Thank You,
Regards
Suresh
Even if it is a single poll, the inbound-channel-adapter emits a separate message for each entry.
So, to collect them into a single message you need to use an <aggregator>.
With that, though, you have to come up with a ReleaseStrategy. Even if you just use 1 as the correlationKey, there is still the question of when to release the group.
You have to accept that you won't always have exactly 1000 files there to group into a single message, so a TimeoutCountSequenceSizeReleaseStrategy may be a good compromise: it emits the result after some timeout, even if there are not enough files to complete the group by size.
HTH
UPDATE
You can consider using group-timeout on the <aggregator> to allow groups to be released even if no new message arrives during that period.
In addition, there is an expire-groups-upon-completion option to ensure that your "single" group is cleared and removed after each release, allowing a new group to be formed for the next poll.
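For illustration, here is a rough Java DSL equivalent of that arrangement (the linked sample uses XML configuration); the directory, poll settings, group size of 1000 and 30-second group timeout are placeholder values, and the job-launching step is only indicated by a comment.

import java.io.File;
import java.util.List;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.dsl.IntegrationFlow;
import org.springframework.integration.dsl.IntegrationFlows;
import org.springframework.integration.dsl.Pollers;
import org.springframework.integration.file.dsl.Files;

@Configuration
public class FilesToSingleJobFlow {

    @Bean
    public IntegrationFlow filesFlow() {
        return IntegrationFlows
                .from(Files.inboundAdapter(new File("/tmp/in")),
                        e -> e.poller(Pollers.fixedDelay(5000).maxMessagesPerPoll(1000)))
                // Collect the per-file messages from one poll into a single group
                .aggregate(a -> a
                        .correlationStrategy(m -> "allFiles")      // everything shares one key
                        .releaseStrategy(g -> g.size() >= 1000)    // release when "full"...
                        .groupTimeout(30_000)                      // ...or after a timeout
                        .sendPartialResultOnExpiry(true)
                        .expireGroupsUponCompletion(true))         // start a fresh group next poll
                .handle(List.class, (files, headers) -> {
                    // 'files' is the aggregated list of java.io.File payloads from one poll.
                    // Build a single JobLaunchRequest here, passing the file names as one
                    // job parameter, and hand it to the job-launching gateway (omitted).
                    return null; // no further output from this flow
                })
                .get();
    }
}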