How do you set global environment variables for child tasks in Spring Cloud Dataflow using Stream - Task-Launcher

I have a stream in Spring Cloud Dataflow with an HTTP source, a custom processor, and a Task-Launcher.
I have a composed task that gets called from the stream's task-launcher.
I pass several properties from the stream processor to the task-launcher and on to the child tasks, for example the Kubernetes pull policy, which I want set to Always for all the child tasks:
"deploymentProps":{"deployer.*.kubernetes.imagePullPolicy":"Always"}
Passing properties for the pull policy and volumes/volume mounts works as expected.
I also want to pass an environment variable to the child tasks for Spring to pick up. I have tried the following, which does not work:
"deploymentProps":{"deployer.*.kubernetes.environment-variables":"SCDF_ACTIVE_PROFILE=prod"}
It does not seem to make it to the pod in Kubernetes. Is there another way to get environment variables to the child tasks, or is there something wrong with this approach?
Any help would be appreciated.

Try it in this format: deployer.*.kubernetes.environment-variables=SCDF_ACTIVE_PROFILE=prod.
There is a little more information about it here:
https://docs.spring.io/spring-cloud-dataflow/docs/current/reference/htmlsingle/#_environment_variables
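Putting it together, the deployment properties your processor sends could then look roughly like this (just a sketch; ANOTHER_VAR is a made-up placeholder, and on the assumption that multiple variables are passed as a comma-separated list of key=value pairs):

    "deploymentProps": {
      "deployer.*.kubernetes.imagePullPolicy": "Always",
      "deployer.*.kubernetes.environment-variables": "SCDF_ACTIVE_PROFILE=prod,ANOTHER_VAR=some-value"
    }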

Related

How do you set global properties for child tasks in Spring Cloud Dataflow using Stream - Task-Launcher-Dataflow

I have a stream with an HTTP source, a custom processor, and Task-Launcher-Dataflow.
I have a composed task that gets called from the stream's task-launcher-dataflow.
I pass several properties from the stream processor to the task-launcher-dataflow and on to the child tasks, for example:
"deploymentProps":{"app.composedtask-filecopy2.prescript.scriptFile":"/source/prescript.sh"}
This works fine, but I have reached the maximum character limit and get a SQL exception on the composed-task pod stating that I have exceeded the 2500-character limit. I would like to set properties for all tasks instead of individually to save on characters, but using a wildcard character doesn't work:
"deploymentProps":{"app.composedtask-filecopy2.prescript.*":"/source/prescript.sh"}
Is there a way to set properties for all tasks instead of having to set them individually?
I tried setting these in the Spring-Cloud-Dataflow-Server ConfigMap in Kubernetes, specifically for the imagePullPolicy, but so far this hasn't worked.
Any help would be appreciated.
If you are using SCDF 2.8.x you can try the following deployer property: deployer.*.kubernetes.image-pull-policy=Always
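If you are passing it from the stream, that wildcard deployer property can sit alongside the app properties you already send, roughly like this (a sketch based on your own example):

    "deploymentProps": {
      "deployer.*.kubernetes.image-pull-policy": "Always",
      "app.composedtask-filecopy2.prescript.scriptFile": "/source/prescript.sh"
    }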

Airflow DAG design for a user centric workflow

We are considering using Airflow to replace our current custom RQ-based workflow, but I am unsure of the best way to design it, or whether it even makes sense to use Airflow.
The use case is:
We get a data upload from a user.
Given the data types received, we optionally run zero or more jobs.
Each job runs if a certain combination of datatypes was received. It runs for that user over a time frame determined from the data received.
The job reads data from a database and writes results to a database.
Further jobs are potentially triggered as a result of these jobs.
e.g.
After a data upload we put an item on a queue:
upload:
  user: 'a'
  data:
    - type: datatype1
      start: 1
      end: 3
    - type: datatype2
      start: 2
      end: 3
And this would trigger:
job1, user 'a', start: 1, end: 3
job2, user 'a', start: 2, end: 3
and then maybe job1 would have some clean up job that runs after it.
(Also, it would be good if it were possible to restrict jobs to run only when no other jobs are running for the same user.)
Approaches I have considered:
1.
Trigger a DAG when data upload arrives on message queue.
Then this DAG determines which dependent jobs to run and passes the user and the time range as arguments (or via XCom).
2.
Trigger a DAG when data upload arrives on message queue.
Then this DAG dynamically creates DAGs for the jobs based on the datatypes, templating in the user and timeframe.
So you get dynamic DAGs for each user, job, and time-range combination.
I'm not even sure how to trigger DAGs from a message queue, and I'm finding it hard to find examples similar to this use case. Maybe that is because Airflow is not suited to it?
Any help/thoughts/advice would be greatly appreciated.
Thanks.
Airflow is built around time-based schedules. It is not built to trigger runs based on the landing of data. There are other systems designed to do this instead; I have heard of things like pachyderm.io or maybe dvs.org. Even repurposing a CI tool or customizing a Jenkins setup could trigger based on file-change events or a message queue.
However, you can try to make it work with Airflow by having an external queue listener make REST API calls to Airflow to trigger DAGs. E.g. if the queue is an AWS SNS queue, you could have an AWS Lambda listener in simple Python do this.
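The listener can be in any language; as a rough illustration, here is a minimal sketch of the trigger call in Java against the Airflow stable REST API (an assumption: /api/v1 with basic auth, as shipped in Airflow 2.x; older versions exposed a similar experimental endpoint). The host, DAG id, credentials, and conf payload are all placeholders:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.Base64;

    public class AirflowTrigger {

        // Placeholders: adjust to your Airflow host, DAG id and credentials.
        private static final String AIRFLOW_BASE = "http://localhost:8080/api/v1";
        private static final String DAG_ID = "job1";

        public static void main(String[] args) throws Exception {
            // The payload from the upload message becomes the dag_run conf,
            // so tasks can read user/start/end at runtime.
            String body = "{\"conf\": {\"user\": \"a\", \"start\": 1, \"end\": 3}}";

            String auth = Base64.getEncoder()
                    .encodeToString("airflow:airflow".getBytes());

            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(AIRFLOW_BASE + "/dags/" + DAG_ID + "/dagRuns"))
                    .header("Content-Type", "application/json")
                    .header("Authorization", "Basic " + auth)
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + " " + response.body());
        }
    }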
I would recommend one DAG per job type (or per user, whichever there are fewer of), which the trigger logic selects based on the queue message. If there are common clean-up tasks and the like, the DAG might use a TriggerDagRunOperator to start those, or you might just have a common library of those clean-up tasks that each DAG includes. I think the latter is cleaner.
DAGs can have their tasks limited to certain pools. You could make a pool per user so as to limit the runs of jobs per user. Alternatively, if you have a DAG per user, you could set the maximum concurrent DAG runs for that DAG to something reasonable.

Graceful custom activity timeout in data factory

I'm looking to impose a timeout on custom activities in Data Factory via the policy.timeout property in the activity JSON.
However, I haven't seen any documentation explaining how the timeout operates on Azure Batch. I assume that the batch task is forcibly terminated somehow.
But is the task (i.e. the custom activity) informed, so it can tidy up?
The reason I ask is that I could be mid-copy to Data Lake Store, and I neither want to let it run indefinitely nor stop it without some sort of clean-up (I can't see a way of doing transactions as such using the Data Lake Store SDK).
I'm considering putting the timeout within the custom activity itself, but it would be a shame to have timeouts defined at two different levels (I'd probably still want the overall timeout).
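For reference, the timeout I am setting sits on the activity roughly like this (a sketch against the ADF v1 activity schema; the values are placeholders):

    "policy": {
      "timeout": "02:00:00",
      "retry": 1,
      "concurrency": 1
    }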
I feel your pain.
ADF simply terminates the activity when its own timeout is reached, regardless of what state the invoked service is in.
I have the same issue with U-SQL processing calls. It takes a lot of proactive monitoring via PowerShell to ensure Data Lake or Batch jobs have enough compute to complete jobs with naturally increasing data volumes before the ADF timeout kill occurs.
I'm not aware of any graceful way for ADF to handle this because it would differ for each activity type.
Time to create another feedback article for Microsoft!

Kafka Streams - accessing data from the metrics registry

I'm having a difficult time finding documentation on how to access the data within the Kafka Streams metric registry, and I think I may be trying to fit a square peg in a round hole. I was hoping to get some advice on the following:
Goal
Collect metrics being recorded in the Kafka Streams metrics registry and send these values to an arbitrary end point
Workflow
This is what I think needs to be done, and I've completed all of the steps except the last (I'm having trouble with that one because the metrics registry is private). But I may be going about this the wrong way:
Define a class that implements the MetricsReporter interface. Build a list of the metrics that Kafka creates in the metricChange method (e.g. whenever this method is called, update a hashmap with the currently registered metrics).
Specify this class in the metric.reporters configuration property
Set up a process that polls the Kafka Streams metric registry for the current data, and ship the values to an arbitrary end point
Anyway, the last step doesn't appear to be possible in Kafka 0.10.0.1 since the metrics registry isn't exposed. Could someone please let me know if this is the correct workflow (it sounds like it's not), or whether I am misunderstanding the process for extracting the Kafka Streams metrics?
Although the metrics registry is not exposed, you can still get the value of a given KafkaMetric via its KafkaMetric.value() / KafkaMetric.value(timestamp) methods. For example, as you observed in the JMXReporter, it keeps a list of KafkaMetrics built from the init() and metricChange/metricRemoval callbacks, and then in its MBean implementation, when getAttribute is called, it calls the corresponding KafkaMetric.value() function. So for your customized reporter you can apply a similar pattern: for example, periodically poll all of the kept KafkaMetrics via value() and pipe the results to your endpoint.
The MetricsReporter interface in org.apache.kafka.common.metrics already enables you to manage all Kafka Streams metrics in the reporter, so the Kafka-internal registry is not needed.
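To make that polling pattern concrete, here is a minimal sketch of such a reporter (the 30-second interval and the println stand in for your real shipping logic; register the class via the metric.reporters property in your streams configuration):

    import org.apache.kafka.common.MetricName;
    import org.apache.kafka.common.metrics.KafkaMetric;
    import org.apache.kafka.common.metrics.MetricsReporter;

    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Keeps track of the currently registered metrics and periodically
    // ships their current values to some endpoint.
    public class ForwardingMetricsReporter implements MetricsReporter {

        private final Map<MetricName, KafkaMetric> metrics = new ConcurrentHashMap<>();
        private final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();

        @Override
        public void configure(Map<String, ?> configs) { }

        @Override
        public void init(List<KafkaMetric> initial) {
            for (KafkaMetric metric : initial) {
                metrics.put(metric.metricName(), metric);
            }
            // Poll the kept metrics every 30 seconds and forward the values.
            scheduler.scheduleAtFixedRate(this::publish, 30, 30, TimeUnit.SECONDS);
        }

        @Override
        public void metricChange(KafkaMetric metric) {
            metrics.put(metric.metricName(), metric);
        }

        @Override
        public void metricRemoval(KafkaMetric metric) {
            metrics.remove(metric.metricName());
        }

        private void publish() {
            metrics.forEach((name, metric) ->
                    // value() reads the current measurement; replace the println
                    // with a call to whatever endpoint you ship metrics to.
                    System.out.printf("%s %s = %f%n",
                            name.group(), name.name(), metric.value()));
        }

        @Override
        public void close() {
            scheduler.shutdown();
        }
    }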

Apache Flink: changing state parameters at runtime from outside

I'm currently working on a streaming ML pipeline and need exactly-once event processing. Flink looks interesting, but I'm wondering whether there is any way to alter/update the execution state from outside.
The ML algorithm state is kept by Flink and that's fine, but given that I'd like to change some execution parameters at runtime, I cannot find a viable solution. Basically, an external webapp (in Go) is used to tune the parameters, and the changes should be reflected in Flink for subsequent events.
I thought about:
a shared Redis with pub/sub (as polling for each event would kill throughput)
writing a custom solution in Go :D
...
The state would be kept by key, related to the source of one of the multiple event streams coming in from Kafka.
Thanks
You could use a CoMapFunction/CoFlatMapFunction to achieve what you described. One input is the normal data stream, and on the other input you receive the state-changing commands. These could most easily be ingested via a dedicated Kafka topic.
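A minimal sketch of that pattern (the event/control/result types and the threshold parameter are invented for illustration; the control stream would come from your dedicated Kafka topic, with both streams keyed by the same key):

    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
    import org.apache.flink.util.Collector;

    public class RuntimeParameters {

        // Hypothetical event/control/result types for the sketch.
        public static class Event { public String key; public double score; }
        public static class ParamUpdate { public String key; public double newThreshold; }
        public static class Result { public String key; public double score; }

        // Events arrive on input 1 and are processed with the latest parameter;
        // parameter updates arrive on input 2 (e.g. a control topic written by the Go webapp).
        public static class ParameterizedProcessor
                extends RichCoFlatMapFunction<Event, ParamUpdate, Result> {

            private transient ValueState<Double> threshold;

            @Override
            public void open(Configuration parameters) {
                threshold = getRuntimeContext().getState(
                        new ValueStateDescriptor<>("threshold", Double.class));
            }

            @Override
            public void flatMap1(Event event, Collector<Result> out) throws Exception {
                Double current = threshold.value();
                double t = current != null ? current : 0.5;  // default until first update
                if (event.score > t) {
                    Result r = new Result();
                    r.key = event.key;
                    r.score = event.score;
                    out.collect(r);
                }
            }

            @Override
            public void flatMap2(ParamUpdate update, Collector<Result> out) throws Exception {
                // A control message simply overwrites the keyed parameter state;
                // subsequent events for that key see the new value.
                threshold.update(update.newThreshold);
            }
        }

        // Wiring inside the job (sources omitted):
        //   events.connect(controls)
        //         .keyBy(e -> e.key, c -> c.key)
        //         .flatMap(new ParameterizedProcessor())
        //         ...
    }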