Variable number of input artifacts into a step - argo-workflows

I have a diamond style workflow where a single step A starts a variable number of analysis jobs B to X using withParam:. The number of jobs is based on dynamic information and unknown until the first step runs. This all works well, except that I also want a single aggregator job Y to run over the output of all of those analysis jobs:
    B
   / \
  / C \
 / / \ \
A-->D-->Y
 \  .  /
  \ . /
   \./
    X
Each of the analysis jobs B-X writes artifacts, and Y needs all of them as input. I can't figure out how to specify the input for Y. Is this possible? I've tried passing in a JSON array of the artifact keys, but the pod gets stuck in pod initialisation. I can't find any examples of how to do this.
A creates several artifacts which are consumed by B-X (one per job, as part of the withParam:), so I know my artifact repository is set up correctly.
Each of the jobs B-X requires a lot of CPU, so they will be running on different nodes; I don't think a shared volume will work (although I don't know much about sharing volumes across different nodes).

I posted the question as a GitHub issue:
https://github.com/argoproj/argo/issues/4120
The solution is to have each of the jobs B-X write its output artifact under a key prefix specific to this workflow run (i.e. they all write into the same subdirectory). You then specify that subdirectory as the input artifact key for Y, and Argo unpacks all of the previous results into it. You can use {{workflow.name}} to keep the paths unique per run.
This does mean you're restricted to a specific directory structure in your artifact repository, but for me that was a small price to pay.
For a full working solution see sarabala1979's answer on the GitHub issue.
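To illustrate the pattern, here is a rough sketch (not the exact workflow from the issue): all names and images are placeholders, and it assumes a default S3 artifact repository against which key-only artifact references resolve; older Argo versions may need an explicit bucket/endpoint in each s3: block.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: fan-in-
spec:
  entrypoint: main
  templates:
  - name: main
    steps:
    - - name: generate            # step A: emits a JSON list of work items
        template: generate
    - - name: analyse             # steps B..X: one pod per item
        template: analyse
        arguments:
          parameters:
          - name: item
            value: "{{item}}"
        withParam: "{{steps.generate.outputs.result}}"
    - - name: aggregate           # step Y: consumes everything B..X wrote
        template: aggregate

  - name: generate
    script:
      image: python:3.12
      command: [python]
      source: |
        import json
        print(json.dumps(["b", "c", "d"]))   # dynamic in the real workflow

  - name: analyse
    inputs:
      parameters:
      - name: item
    container:
      image: alpine:3.19
      command: [sh, -c]
      args: ["echo analysed-{{inputs.parameters.item}} > /tmp/result.txt"]
    outputs:
      artifacts:
      - name: result
        path: /tmp/result.txt
        archive:
          none: {}                # keep plain files so Y sees them unarchived
        s3:
          key: "{{workflow.name}}/results/{{inputs.parameters.item}}.txt"

  - name: aggregate
    inputs:
      artifacts:
      - name: results
        path: /tmp/results        # Argo downloads everything under the key prefix here
        s3:
          key: "{{workflow.name}}/results"
    container:
      image: alpine:3.19
      command: [sh, -c]
      args: ["ls /tmp/results && cat /tmp/results/*"]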

Related

Move variable groups to the code repository and reference it from YAML pipelines

We are looking for a way to move the non-secret variables from Variable groups into the code repositories.
We would like to be able:
to track changes to all of the settings in the code repository
to version the variable values together with the source code and the pipeline code
Problem:
We have over 100 variable groups defined, which are referenced by over 100 YAML pipelines.
They are injected at different pipeline/stage/job levels depending on the environment/component/stage they operate on.
Example problems:
a variable can be renamed or removed, yet the pipeline targeting the PROD environment still references it while the pipeline deploying to DEV no longer has it
a particular pipeline run used the variable values as they were at some date in the past; it is useful to know with which set of settings it was deployed back then
Possible solutions:
It should be possible to use simple YAML variable template files to mimic the variable groups and include those templates in the main YAML files, following this approach: Variable reuse.
# File: variable-group-component.yml
variables:
  myComponentVariable: 'SomeVal'

# File: variable-group-environment.yml
variables:
  myEnvVariable: 'DEV'

# File: azure-pipelines.yml
variables:
- template: variable-group-component.yml  # Template reference
- template: variable-group-environment.yml  # Template reference
# some stages/jobs/steps:
In theory, it should be easy to transform the variable groups into YAML template files and reference them from the YAML instead of referencing a classic variable group.
# Current reference we use
variables:
- group: "Current classical variable group"
However, even without implementing this approach, we hit the following limit in our pipelines: "No more than 100 separate YAML files may be included (directly or indirectly)"
YAML templates limits
Given the requirement that we want the variable groups logically granulated and separated, rather than stored in one big YAML file (so as not to hit another limit on the number of variables passed to a job agent), we cannot go this way.
The second approach would be to add a simple script (PowerShell?) which consumes a key/value metadata file with variable records (variableName/variableValue) and, for each of them, executes a job step command such as
##vso[task.setvariable variable=one]secondValue
However, this could only be done at the job level, as a first step, and it amounts to re-engineering the variable-group mechanism provided natively by Azure DevOps.
We are not sure that this approach would work everywhere the variables are currently used in the YAML pipelines; in some places they are passed as arguments to tasks, etc.
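For illustration, a rough sketch of such a step as a YAML pipeline snippet; the JSON file name and its flat key/value structure are assumptions:
# Assumed file: variables/dev.json, e.g. { "myEnvVariable": "DEV", "myComponentVariable": "SomeVal" }
steps:
- pwsh: |
    $vars = Get-Content -Raw -Path 'variables/dev.json' | ConvertFrom-Json
    foreach ($prop in $vars.PSObject.Properties) {
      # The setvariable logging command exposes the value to all subsequent steps of this job
      Write-Host "##vso[task.setvariable variable=$($prop.Name)]$($prop.Value)"
    }
  displayName: Load variables from the repository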
Move all the variables into Key Vault secrets? We abandoned this option at the beginning, as Key Vault is a place to store sensitive data, not settings that could be visible to anyone. Moreover, storing them as secrets causes the pipeline logs to print *** instead of the real configuration values, which obfuscates the pipeline run log.
Questions:
Q1. Do you have any other propositions/alternatives for how variable versioning/change tracking could be achieved in Azure DevOps YAML pipelines?
Q2. Do you see any problems with possible solution 2, or do you have better ideas?
You can consider this as an alternative:
Store your non-secret variables in a JSON file in a repository.
Create a pipeline that pushes the variables to App Configuration (instead of a Key Vault).
Then, if your app needs these settings, have it read them from App Configuration rather than running a replacement task in Azure DevOps; if the pipelines themselves need the settings, pull them from App Configuration.
Drawbacks:
the same one you mentioned for the PowerShell case: it has to be done at the job level
What you get:
change tracking in the repo
change tracking in App Configuration, plus all the other benefits of App Configuration
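As a rough sketch of the push step, assuming an Azure CLI task, a service connection and an App Configuration store (the names and file path below are placeholders):
steps:
- task: AzureCLI@2
  displayName: Import repository variables into App Configuration
  inputs:
    azureSubscription: 'my-service-connection'   # placeholder service connection
    scriptType: bash
    scriptLocation: inlineScript
    inlineScript: |
      # Imports the flat key/value JSON file into the (placeholder) store
      az appconfig kv import \
        --name my-app-config-store \
        --source file \
        --path variables/dev.json \
        --format json \
        --yes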

Azure DevOps: Run tests in parallel on agents in an agent pool with different run settings

We have set up an agent pool with 3 agents tagged to it for running tests in parallel. We would like to use various input values for the .runsettings file to override test run parameters (overrideTestrunParameters) and distribute our test runs across the agents. E.g.,
let's assume that the agent pool P1 has the associated agents A1, A2, A3.
We need agent A1 to be configured with the test run parameter executeTests = Functionality1, agent A2 with executeTests = Functionality2, etc.
Please let us know if it is possible to use executionPlan with the options Multi agent or Multi configuration to achieve this.
If I have not misunderstood, what you want is to run the tests with multiple configurations across multiple agents?
If so, I'd suggest applying a matrix strategy in the pipeline to achieve what you want.
Note: the matrix strategy is a feature that is only supported in YAML pipelines. If you want to make use of it, you have to configure your pipeline in YAML.
For how to apply a matrix in this scenario, you can refer to the simple sample below:
strategy:
  matrix:
    execTest1:
      agentname: "Agent-V1"
      executeTests: "Functionality1"
    execTest2:
      agentname: "Agent-V2"
      executeTests: "Functionality2"
    execTest3:
      agentname: "Agent-V3"
      executeTests: "Functionality3"
  maxParallel: 3
pool:
  name: '{pool name}'
  demands:
  - Agent.Name -equals $(agentname)
...
With such a YAML definition, the jobs run at the same time, each with a different configuration, and each configuration runs on its specified agent.
Note: please ensure your project has enough parallel jobs available.
For more details, see this.
I was able to find a solution for my case by doing the following:
Add a variable group to the pipeline named executeTests and assign names and values for the respective variables as Functionality1, Functionality2, etc.
Add multiple agent jobs in the same pipeline and set Override test run parameters to -(test.runsetting variable) $(Functionality1) etc. across agents A1, A2, A3.
The above does run the tests in parallel, based on the settings available at each agent job.
Using different runsettings, or even different override settings, is not supported. The test task expects them to be consistent across all the agents; it will use whichever is configured for the first agent to start the test task. For example, if you were to pass an override variable $(Agent.Name), it would use the first agent's name regardless of which agent picked up the work.
The only way we found to manage this was to handle it in our test framework code. Instead of loading from runsettings, we set environment variables on the agent in a step prior to the test task, and the test framework loads from the environment variable.
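A minimal sketch of that workaround, with illustrative variable and assembly names; a value set with the setvariable logging command is visible to later steps in the same job as an environment variable, which the test code can then read:
steps:
- script: echo "##vso[task.setvariable variable=EXECUTE_TESTS]$(executeTests)"
  displayName: Expose the test selection to the test framework
- task: VSTest@2
  inputs:
    testSelector: testAssemblies
    testAssemblyVer2: '**\*Tests.dll'    # illustrative assembly pattern
# In the framework, read it with e.g. Environment.GetEnvironmentVariable("EXECUTE_TESTS")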

How can I identify processed files in a Dataflow job

How can I identify processed files in a Dataflow job? I am using a wildcard to read files from Cloud Storage, but every time the job runs it re-reads all the files.
This is a batch job, and the following is a sample of the TextIO read that I am using.
PCollection<String> filePCollection = pipeline.apply("Read files from Cloud Storage", TextIO.read().from("gs://bucketName/TrafficData*.txt"));
To see a list of files that match your wildcard you can use gsutil, the Cloud Storage command-line utility. You'd do the following:
gsutil ls gs://bucketName/TrafficData*.txt
Now, when it comes to running a batch job multiple times, your pipeline has no way of knowing which files it has already analyzed. To avoid analyzing files more than once, you could do either of the following:
Define a streaming job and use TextIO's watchForNewFiles functionality. You would have to leave your job running for as long as you want to keep processing files.
Find a way to tell your pipeline which files have already been analyzed. For this, every time you run your pipeline you could generate a list of files to analyze, put it into a PCollection, read each one with TextIO.readAll(), and store the list of analyzed files somewhere. Later, when you run your pipeline again, you can use this list as a blacklist of files that you don't need to read again.
Let me know in the comments if you want to work out a solution around one of these two options.

Coordinating Job containers and Volumes in a CI system

I'm working on a tinker Kubernetes-based CI system, where each build gets launched as a Job. I'm running these much like Drone CI does, in that each step in the build is a separate container. In my k8s CI case, I'm running each step as a container within a Job pod. Here's the behavior I'm after:
A build volume is created. All steps will mount this. A Job is fired off with all of the steps defined as separate containers, in order of desired execution.
The git step (container) runs, mounting the shared volume and cloning the sources.
The 'run tests' step mounts the shared volume into a container spawned from an image with all of the dependencies pre-installed.
If our tests pass, we proceed to the Slack announcement step, which is another container that announces our success.
I'm currently using a single Job pod with an emptyDir Volume for the shared build space. I did this so that we don't have to wait while a volume gets shuffled around between nodes/Pods. This also seemed like a nice way to ensure that things get cleaned up automatically at build exit.
The problem is that if I fire up a multi-container Job with all of the above steps, they all execute at the same time, meaning the 'run tests' step could fire before the 'git' step.
I've thought about coming up with some kind of logic in each of these containers to sleep until a certain unlock/"I'm done!" file appears in the shared volume, signifying the dependency step(s) are done, but this seems complicated enough to ask about alternatives before proceeding.
I could see giving in and using multiple Jobs with a coordinating Job, but then I'm stuck getting into Volume Claim territory (which is a lot more complicated than emptyDir).
To sum up the question:
Is my current approach worth pursuing, and if so, how to sequence the Job's containers? I'm hoping to come up with something that will work on AWS/GCE and bare metal deployments.
I'm hesitant to touch PVCs, since the management and cleanup bit is not something I want my system to be responsible for. I'm also not wanting to require networked storage when emptyDir could work so well.
Edit: Please don't suggest using another existing CI system, as this isn't helpful. I am doing this for my own gratification and experimentation. This tinker CI system is unlikely to ever be anything but my toy.
If you want all the build steps to run in containers, GitLab CI or Concourse CI would probably be a much better fit for you. I don't have experience with fabric8.io, but Frank.Germain suggests that it will also work.
Once you start getting complex enough that you need signaling between containers to order the build steps it becomes much more productive to use something pre-rolled.
As an option, you could use a static volume (i.e. a host path) as an artifact cache and trigger the next container in the sequence from the current container, mounting the same volume between the stages. You could then just add a step at the beginning or end of the build to clean up after your pipeline has run.
To be clear: I don't think that rolling your own CI is the most effective way to handle this, as there are already systems in place that will do exactly what you are looking for.
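That said, if you do stick with a single Job pod and an emptyDir, one Kubernetes-native way to get strict ordering (not mentioned in the answer above) is to run the sequential steps as init containers: they execute one at a time, in declaration order, and only the final step needs to be a regular container. A rough sketch, with illustrative image names and commands:
apiVersion: batch/v1
kind: Job
metadata:
  name: build-1234
spec:
  backoffLimit: 0                          # fail the build on the first error
  template:
    spec:
      restartPolicy: Never
      volumes:
      - name: workspace
        emptyDir: {}
      initContainers:
      - name: git-clone                    # step 1: must finish before step 2 starts
        image: alpine/git
        args: ["clone", "https://example.com/repo.git", "/workspace/src"]
        volumeMounts:
        - { name: workspace, mountPath: /workspace }
      - name: run-tests                    # step 2: assumed image with dependencies installed
        image: my-test-image
        command: ["sh", "-c", "cd /workspace/src && make test"]
        volumeMounts:
        - { name: workspace, mountPath: /workspace }
      containers:
      - name: announce                     # final step: Slack announcement
        image: curlimages/curl
        command: ["sh", "-c", "curl -X POST -d 'build ok' \"$SLACK_WEBHOOK_URL\""]
        env:
        - name: SLACK_WEBHOOK_URL
          valueFrom:
            secretKeyRef: { name: slack-webhook, key: url }
        volumeMounts:
        - { name: workspace, mountPath: /workspace }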

How to create multiple Jenkins jobs for multiple folders in a git repository

I have 5 repositories:
2 Java, 1 R, 1 Python, and the last one for documents.
I would like to combine these into one master repository, as folders, and create Jenkins jobs for those folders in the repository.
Please let me know if this is possible and, if yes, how I should approach it.
Yes, this is possible.
You can create 5 different Jenkins jobs referring to the same git repo. What will differ is the content of the execution phase of each of them. Say, if you're using Ant, this might look like:
Java_1 job: ant -f java_1/build.xml all
Doc job: ant -f doc/build.xml all
Note: this approach implies an overhead in the number of files: each job will check out and sync the whole repo while needing only a subset of it.
Use the Job DSL Plugin (https://wiki.jenkins-ci.org/display/JENKINS/Job+DSL+Plugin) - create a seed job that generates and maintains the new jobs.