Spring Cloud Dataflow - how to pass credentials to task - spring-batch

I use Spring Cloud Data Flow, deployed to Pivotal Cloud Foundry, to run Spring Batch jobs as Spring Cloud Tasks, and the jobs require AWS credentials to access an S3 bucket.
I've tried passing the AWS credentials as task properties, but the credentials then show up in the task's log files as arguments or properties. (https://docs.spring.io/spring-cloud-dataflow/docs/current/reference/htmlsingle/#spring-cloud-dataflow-global-properties)
For now, I am manually setting the credentials as environment variables in PCF after each deployment, but I'm trying to automate this. The tasks aren't deployed until they are actually launched, so on each deployment I have to launch the task, wait for it to fail due to missing credentials, and then set the credentials as environment variables with the cf CLI. How do I provide these credentials without them showing up in the PCF app's logs?
I've also explored using Vault and Spring Cloud Config, but again, I would need to pass credentials to the task to access Spring Cloud Config.
Thanks!

Here's a Task/Batch-Job example.
This app uses spring-cloud-starter-aws, and that starter already provides the Boot auto-configuration and the ability to override the AWS credentials as Boot properties.
You'd override the properties while launching from SCDF like:
task launch --name S3LoaderJob --arguments "--cloud.aws.credentials.accessKey= --cloud.aws.credentials.secretKey= --cloud.aws.region.static= --cloud.aws.region.auto=false"
You can also control the task's log level so that it doesn't log them in plain text.
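For reference, here is a minimal sketch (not from the original answer) of how a task might use the client that spring-cloud-starter-aws auto-configures from those cloud.aws.* properties; the class, bucket and key names are placeholders:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.springframework.stereotype.Component;
import org.springframework.util.StreamUtils;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.S3Object;

@Component
public class S3Reader {

    // AmazonS3 is auto-configured by spring-cloud-starter-aws from the
    // cloud.aws.credentials.* / cloud.aws.region.* properties shown above.
    private final AmazonS3 amazonS3;

    public S3Reader(AmazonS3 amazonS3) {
        this.amazonS3 = amazonS3;
    }

    public String readAsString(String bucket, String key) throws IOException {
        try (S3Object object = amazonS3.getObject(bucket, key)) {
            return StreamUtils.copyToString(object.getObjectContent(), StandardCharsets.UTF_8);
        }
    }
}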

Secure credentials for tasks should be configured either via environment variables in your task definition or by using something like Spring Cloud Config Server to provide them (and store them encrypted at rest). Spring Cloud Task stores all command-line arguments in the database in clear text, which is why they should not be passed that way.
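As an illustration of the environment-variable route (my own sketch, not part of the answer above), the credentials can be bound through a typed properties class so they never travel as command-line arguments; the app.s3 prefix and field names are made up:

import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.stereotype.Component;

@Component
@ConfigurationProperties(prefix = "app.s3")
public class S3Credentials {

    // With Boot's relaxed binding these can arrive as the environment variables
    // APP_S3_ACCESSKEY / APP_S3_SECRETKEY, or be served (encrypted at rest) by a
    // Spring Cloud Config Server, instead of being passed as task arguments.
    private String accessKey;
    private String secretKey;

    public String getAccessKey() { return accessKey; }
    public void setAccessKey(String accessKey) { this.accessKey = accessKey; }
    public String getSecretKey() { return secretKey; }
    public void setSecretKey(String secretKey) { this.secretKey = secretKey; }
}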

After considering the approaches included in the provided answers, I continued testing and researching and concluded that the best approach is to use a Cloud Foundry "User Provided Service" to supply AWS credentials to the task.
https://docs.cloudfoundry.org/devguide/services/user-provided.html
Spring Boot auto-processes the VCAP_SERVICES environment variable included in each app's container.
http://engineering.pivotal.io/post/spring-boot-injecting-credentials/
I then used property placeholders in application-cloud.properties to map the flattened VCAP_SERVICES entries onto spring-cloud-aws properties:
cloud.aws.credentials.accessKey=${vcap.services.aws-s3.credentials.aws_access_key_id}
cloud.aws.credentials.secretKey=${vcap.services.aws-s3.credentials.aws_secret_access_key}
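If you prefer to build the client yourself rather than rely on spring-cloud-aws's property names, roughly the same mapping can be done in Java. This is only a sketch: the aws-s3 service name mirrors the placeholders above, and the region is a placeholder:

import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.regions.Regions;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class S3ClientConfiguration {

    // Boot flattens VCAP_SERVICES into vcap.services.* properties, so the
    // user-provided service's credentials can be injected directly.
    @Bean
    public AmazonS3 amazonS3(
            @Value("${vcap.services.aws-s3.credentials.aws_access_key_id}") String accessKeyId,
            @Value("${vcap.services.aws-s3.credentials.aws_secret_access_key}") String secretAccessKey) {

        return AmazonS3ClientBuilder.standard()
                .withRegion(Regions.US_EAST_1) // placeholder region
                .withCredentials(new AWSStaticCredentialsProvider(
                        new BasicAWSCredentials(accessKeyId, secretAccessKey)))
                .build();
    }
}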

Related

Conditionally launch Spring Cloud Task on a specific node of Kubernetes cluster

I am building a data pipeline for batch processing, and I find Spring Cloud Data Flow quite an attractive framework to use. Without much knowledge of SCDF and Kubernetes, I am not sure whether it is possible to conditionally launch a Spring Cloud Task on a specific machine.
Suppose I have two physical servers for running the batch process (Server A and Server B). By default, I would like my Spring Cloud Task to be launched on Server A. If Server A is shut down, the task should be deployed on Server B. Can Kubernetes / SCDF handle this kind of mechanism? I am wondering whether nodeSelector is the thing I should look into.
Yes, you can pass deployment.nodeSelector as a deployment property when launching the task.
Since deployment.nodeSelector is a Kubernetes deployment property, you need to pass it like this:
task launch mytask --properties "deployer.<taskAppName>.kubernetes.deployment.nodeSelector=foo1:bar1,foo2:bar2"
You can check the list of supported Kubernetes deployer properties here

Running a spring batch with partitions in cloud foundry

I have created a Spring Batch application (with partitioning), taking this as an example: https://github.com/mminella/S3JDBC. My app reads some files from an object store, does some processing, and writes back to the object store. With local partitioning, the app works fine on my machine.
I changed the Maven configuration to run in Cloud Foundry, made the changes for the deployer partition handler and step execution listener, and deployed it on PCF.
But while trying to push and run the app on PCF, I am getting an issue: Failing URI /v2/info. I tried to log the error and found that there is one call to my app, e.g. https://mypcf.com:443/v2/info, and after that it gives the error. I can't provide full logs because of some restrictions. So I want to know:
1. To deploy a Spring Batch app on PCF, is there any extra configuration needed besides the Maven dependency (org.springframework.cloud:spring-cloud-deployer-cloudfoundry:1.1.0.M1) and the code changes for the deployer partition handler, step execution listener and the cloud task annotation?
2. Is it mandatory to have a separate database service like MySQL for the partitioned job? Can't I use H2 (the default one, if I don't configure anything)?
3. Do I need to do any configuration in PCF to support running multiple partitions?
4. As I am running remote partitioning, can I run the app from local STS or IntelliJ (not on PCF Dev) so that it runs my app in PCF (remote) and launches the workers? (Sorry for the basic question, I am new to PCF.)
Thanks for checking out my example. To answer your questions:
1. You should be able to use the latest deployer release (instead of that rather old version).
2. Yes. Partitioned steps all need to be able to share the same job repository data store, so an in-memory database like H2 will not work for that use case.
3. Besides defining your datasource, that's all that is required to live in PCF. That being said, there are other things that need to be configured, but you can use other mechanisms to do so (Spring Cloud Config Server, application.properties/yml, etc.).
4. Yes, you should be able to run the master locally and have it deploy the workers onto PCF if you're using the CF deployer.
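To make the partition-handler wiring more concrete, here is a rough master-side sketch along the lines of the S3JDBC sample; the step name, worker resource property, worker count and application name are placeholders, and the exact constructor may differ between Spring Cloud Task versions:

import java.util.Collections;
import org.springframework.batch.core.explore.JobExplorer;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.cloud.deployer.spi.task.TaskLauncher;
import org.springframework.cloud.task.batch.partition.DeployerPartitionHandler;
import org.springframework.cloud.task.batch.partition.PassThroughCommandLineArgsProvider;
import org.springframework.cloud.task.batch.partition.SimpleEnvironmentVariablesProvider;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.env.Environment;
import org.springframework.core.io.Resource;

@Configuration
public class MasterPartitionConfiguration {

    @Bean
    public DeployerPartitionHandler partitionHandler(TaskLauncher taskLauncher,
                                                     JobExplorer jobExplorer,
                                                     Environment environment,
                                                     @Value("${app.worker-resource}") Resource workerResource) {
        // Launches each partition as a separate task instance running the "workerStep" step;
        // workerResource points at the worker app artifact (placeholder property).
        DeployerPartitionHandler handler =
                new DeployerPartitionHandler(taskLauncher, jobExplorer, workerResource, "workerStep");
        handler.setCommandLineArgsProvider(
                new PassThroughCommandLineArgsProvider(
                        Collections.singletonList("--spring.profiles.active=worker")));
        handler.setEnvironmentVariablesProvider(new SimpleEnvironmentVariablesProvider(environment));
        handler.setMaxWorkerCount(2);                      // placeholder worker count
        handler.setApplicationName("partitioned-batch-worker"); // placeholder task name
        return handler;
    }
}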

Using logstash, config server and eureka with spring cloud task and dataflow

We have an existing microservice environment with Logstash, config and Eureka servers. We are now setting up a Spring Cloud Data Flow (Kubernetes) environment (initially, primarily to run tasks/batch jobs).
Ideally we would like the tasks to use the existing logstash, config and eureka servers via the standard spring boot configuration (annotations etc) to support the following scenarios:
Logstash: When a task runs its logs are output to logstash and viewable from Kibana
Config Server: To support changing configuration properties for tasks, e.g. a periodic task's configuration can be tweaked by altering the values on the configuration server, and the next time the task runs it will use the new values.
My understanding is that config server properties will override properties in the task definition which override properties in the internal application.properties.
Eureka: Each task would register itself in Eureka. The main reason for this is that our tasks have web actuator endpoints exposed, and we can then use Spring Boot Admin (which can discover services via Eureka) to access the actuator endpoints and information while a task is running.
(Some of our tasks can take hours to run and this would enable us to monitor them, adjust logging etc)
Is this a sensible approach, or are there any potential issues to look out for here (e.g. short-lived tasks with Eureka)? I can't find any discussion of this in the existing Spring Cloud Data Flow or Spring Cloud Task documentation.
You may try logstash-logback-encoder for SCDF integration with the ELK stack. It works fine for our SCDF-on-YARN stream applications.
Config Server should work for any Spring Boot application.

Add custom logs in azure HDInsight application

I am deploying my Scala + Apache Spark 2.0 application on an Azure HDInsight cluster. We can see the application's default YARN logs through the Azure portal, but our requirement is to add our own custom logger (error and debug logs) for application-specific (business-case) logs. We have not been able to create a custom logger that is accessible outside the cluster (e.g. by storing the logs in Azure Blob storage).
We are working towards HDInsight integration with Azure Operations Management Suite, which lets you add custom logs in addition to the common Spark logs. Let us know if this is something you are interested in and we can invite you to our preview.
Thanks,
Ashish
ashishth#microsoft.com

How to add functionality to Spring Cloud Bootstrap

I want to run some lookup before the Spring context is loaded, ideally in the bootstrap phase of Spring Cloud (when it looks up the Configuration Server, cloud connectors, etc.). How can I make my code execute in that phase?
What I want to do is query Vault to get all my database secrets and API keys and set them as properties. I know I can encrypt values with Spring Cloud Config, but I like Vault's strongbox. (I can handle the Vault integration part.)
As I saw in the Spring Cloud Config code, the bootstrap configuration is auto-configured by registering classes under the org.springframework.cloud.bootstrap.BootstrapConfiguration key in the resources/META-INF/spring.factories file, the same mechanism you can use to register new auto-configuration classes for Spring Boot; as a reference you can look at that file in the project here. This makes your configuration start and register before the "normal" application context.
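For illustration, here is a hedged sketch of such a bootstrap configuration contributing properties (e.g. fetched from Vault) before the main context starts; the class name, property source name and the Vault call are placeholders:

import java.util.HashMap;
import java.util.Map;
import org.springframework.cloud.bootstrap.config.PropertySourceLocator;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.env.MapPropertySource;

// Registered in META-INF/spring.factories as:
// org.springframework.cloud.bootstrap.BootstrapConfiguration=com.example.VaultBootstrapConfiguration
@Configuration
public class VaultBootstrapConfiguration {

    @Bean
    public PropertySourceLocator vaultPropertySourceLocator() {
        // Runs during the bootstrap phase, before the main ApplicationContext is refreshed,
        // so the properties contributed here are visible to the rest of the application.
        return environment -> {
            Map<String, Object> secrets = new HashMap<>();
            // secrets.putAll(fetchSecretsFromVault()); // hypothetical Vault lookup
            return new MapPropertySource("vault", secrets);
        };
    }
}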