Snakemake: sub-workflow error when executing in LSF cluster - lsf

I have two Snakemake workflows that are very similar. Both of them share a sub-workflow and a couple of includes. Both of them work when doing dry runs. Both of them use the same cluster config file, and I'm running them with the same launch command. One of them fails when submitting to the LSF cluster with this error:
Executing subworkflow wf_common.
WorkflowError:
Config file __default__ not found.
I'm wondering whether it's "legal" in Snakemake for two workflows to share a sub-workflow, as in this case, and if not, whether the fact that I ran the working workflow first could have this effect.

Can you try Snakemake 3.12.0? It fixed a bug with passing the cluster config to a subworkflow. I would think that this solves your problem.
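For reference, a launch command of the usual shape, with the cluster config passed explicitly and jobs submitted to LSF via bsub, might look like the sketch below. The file names and the resource keys (queue, nproc, memory) are placeholders and have to match the entries, including __default__, in your cluster config:
snakemake --snakefile Snakefile \
    --cluster-config cluster.yaml \
    --cluster "bsub -q {cluster.queue} -n {cluster.nproc} -M {cluster.memory}" \
    --jobs 100
The __default__ named in the error is the fallback entry of that cluster config, which fits the answer above: before the fix, the cluster config apparently wasn't handed down to the subworkflow correctly.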

Related

Extracting files from a k8s pod

I am running a function in a k8s pod and the function creates a log file once it finishes. How can I copy the log file to the local machine?
I have searched similar problems and found this, but the kubectl cp command needs to be run from the master node. The problem I have is that I don't know when the function finishes, and once it has finished, the container terminates automatically.
Any ideas on how to solve this would be appreciated.
You can just create a bash script that runs
scp file_from_container user@master_node_ip:/file
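If you would rather avoid SSH/scp entirely, a kubectl-only sketch is possible, assuming the workload runs as a Kubernetes Job and the function's output can also be read from the pod's stdout logs (job/my-job, the container name worker, and local_copy.log are all hypothetical names):
# Block until the Job has finished
kubectl wait --for=condition=complete job/my-job --timeout=600s
# Find the pod the Job created and save its logs locally (this works even after the container has terminated)
POD=$(kubectl get pods --selector=job-name=my-job -o jsonpath='{.items[0].metadata.name}')
kubectl logs "$POD" -c worker > local_copy.log
Note that kubectl cp, by contrast, needs a running container in the pod, which is exactly the problem described in the question.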

How to wait for full cloud-initialization before VM is marked as running

I am currently configuring a virtual machine to work as an agent within Azure (with Ubuntu as the image), where the additional configuration runs through a cloud-init file.
Among other things, I have the below 'fix' within bootcmd and multiple steps within runcmd.
However, the machine already shows the state running within the Azure portal while still running the cloud configuration phase (cloud_config_modules). As a result, pipelines see the machine as ready for usage while not everything is installed/configured yet, and they break.
I tried a couple of things which did not result in the desired effect, after which I stumbled on the following article/bug.
The proposed solution worked; however, I switched to a RHEL image and it stopped working.
I noticed this image does not use walinuxagent as the solution states but waagent, so I tried replacing that, as in the example below, without any success.
bootcmd:
- mkdir -p /etc/systemd/system/waagent.service.d
- echo "[Unit]\nAfter=cloud-final.service" > /etc/systemd/system/waagent.service.d/override.conf
- sed "s/After=multi-user.target//g" /lib/systemd/system/cloud-final.service > /etc/systemd/system/cloud-final.service
- systemctl daemon-reload
After this, I also tried setting the runcmd steps as bootcmd steps. This resulted in a boot which took ages and eventually froze.
Since I am not that familiar with RHEL and Linux overall, I wanted to ask whether anyone has suggestions I could additionally try.
(Perhaps some other configuration to make waagent wait on cloud-final.service?)
However the machine already had the state running, while still running the cloud configuration phase (cloud_config_modules).
Could you please be more specific? Where did you read the machine state?
The reason I ask is that cloud-init status will report status: running until cloud-init is done running, at which point it will report status: done
What is the purpose of waiting until cloud-init is done? I'm not sure exactly what you are expecting to happen, but here are a couple of things that might help.
If you want to execute a script "at the end" of cloud-init initialization, you could put the script directly in runcmd. If you want to wait for cloud-init in an external script, you could run cloud-init status --wait, which will print a visual indicator and eventually return once cloud-init is complete.
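For example, a minimal sketch of the external-script variant; the final "mark the machine as ready" step is hypothetical and stands in for whatever registers the agent with the pipeline:
#!/bin/bash
# Block until cloud-init has finished all stages
cloud-init status --wait
echo "cloud-init reports: $(cloud-init status)"
# ...only now perform the step that marks the machine as usable (hypothetical)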
On all but quite old Azure Linux VM images, cloud-init rather than WALinuxAgent acts as the VM provisioner. The VM is marked provisioned by the Azure cloud-init datasource module very early during cloud-init processing (source), before any of the cloud-init modules that are configurable with user data have run. WALinuxAgent is only responsible for provisioning Azure VM extensions. It does not appear to be possible to delay sending the 'VM ready' signal to Azure without modifying the VM image and patching the source code of the cloud-init Azure datasource.

Running kubectl commands in parallel with different credentials

I'm currently running two Kubernetes clusters, one on Google Cloud and one on IBM Cloud, and I use kubectl to manage them. I've made a script that executes some commands on one of the clusters, then switches to the other and does some other work there.
This works fine as long as the script only runs in one process; however, when run in parallel, the credentials are sometimes overwritten by one process while in use by another, and this obviously causes issues.
I therefore want to know if I can supply kubectl with a credentials file for every call, instead of storing it in an environment variable with kubectl config set-credentials.
Any help/solution is much appreciated.
If I need to work with multiple clusters using kubectl, I split my terminal and set KUBECONFIG for each split:
For my first split:
export KUBECONFIG=~/.kube/cluster1
For the second split:
export KUBECONFIG=~/.kube/cluster2
It is working pretty well, but this approach has one issue:
If you are using some kind of prompt that shows the current Kubernetes context, it will give you different output per split and it might be misleading.
For scripts, I just change the value of KUBECONFIG in a for loop to iterate over each cluster.
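For example, something along these lines (the kubeconfig paths and the kubectl command are placeholders):
for cfg in ~/.kube/cluster1 ~/.kube/cluster2; do
  # Each invocation gets its own kubeconfig, so parallel runs don't overwrite each other's credentials
  KUBECONFIG="$cfg" kubectl get nodes
done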
You need to use Kubefed in order to manage multiple clusters.
It will take one cluster as the main one and execute all the same requests against the second cluster.

Is it possible to stop a job in Kubernetes without deleting it

Because Kubernetes handles situations where there's a typo in the job spec (and therefore a container image can't be found) by leaving the job in a running state forever, I've got a process that monitors job events to detect cases like this and deletes the job when one occurs.
I'd prefer to just stop the job so there's a record of it. Is there a way to stop a job?
1) According to the K8S documentation here.
Finished Jobs are usually no longer needed in the system. Keeping them around in the system will put pressure on the API server. If the Jobs are managed directly by a higher level controller, such as CronJobs, the Jobs can be cleaned up by CronJobs based on the specified capacity-based cleanup policy.
Here are the details for the failedJobsHistoryLimit property in the CronJobSpec.
This is another way of retaining the details of the failed job for a specific duration. The failedJobsHistoryLimit property can be set based on the approximate number of jobs run per day and the number of days the logs have to be retained. Agreed, the Jobs will still be there and put pressure on the API server.
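As a sketch, the limit can also be set on an existing CronJob with a patch (CRON_NAME and the value 5 are placeholders):
kubectl patch cronjob CRON_NAME -p '{"spec":{"failedJobsHistoryLimit":5}}'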
This is interesting. Once the job fails, as in the case of a typo in the image name, the pod gets deleted and the resources are not blocked or consumed anymore. Not sure exactly what a kubectl job stop would achieve in this case. But when a Job with a proper image runs to success, I can still see the pod in kubectl get pods.
2) Another approach, without using a CronJob, is to specify ttlSecondsAfterFinished as mentioned here.
Another way to clean up finished Jobs (either Complete or Failed) automatically is to use a TTL mechanism provided by a TTL controller for finished resources, by specifying the .spec.ttlSecondsAfterFinished field of the Job.
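A minimal sketch of that, assuming the TTL-after-finished mechanism is available on your cluster (JOB_NAME is a placeholder; 86400 seconds keeps the finished Job around for one day):
kubectl patch job JOB_NAME -p '{"spec":{"ttlSecondsAfterFinished":86400}}'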
Not really, no such mechanism exists in Kubernetes yet, as far as I know.
A workaround is to SSH into the machine and run the following (if you're using Docker):
# Save the logs
$ docker logs <container-id-that-is-running-your-job> > save.log 2>&1
# Stop the container
$ docker stop <main-container-id-for-your-job>
It's better to stream logs with something like Fluentd, logspout, or Filebeat and forward them to an ELK or EFK stack.
In any case, I've opened this
You can suspend cronjobs by using the suspend attribute. From the Kubernetes documentation:
https://kubernetes.io/docs/tasks/job/automated-tasks-with-cron-jobs/#suspend
Documentation says:
The .spec.suspend field is also optional. If it is set to true, all subsequent executions are suspended. This setting does not apply to already started executions. Defaults to false.
So, to pause a cron you could:
Run kubectl edit cronjob CRON_NAME (if not in the default namespace, add "-n NAMESPACE_NAME" at the end) and change "suspend" from false to true.
You could potentially create a loop using "for" or whatever you like and change them all at once (a patch-based sketch follows below).
Or you could just save the YAML file locally and then run:
kubectl create -f cron_YAML
and this would recreate the cron.
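For completeness, the loop idea mentioned above can be done non-interactively with kubectl patch instead of kubectl edit (NAMESPACE is a placeholder):
for cj in $(kubectl get cronjobs -n NAMESPACE -o name); do
  # Set .spec.suspend=true so no new executions are scheduled
  kubectl patch "$cj" -n NAMESPACE -p '{"spec":{"suspend":true}}'
done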
The other answers hint at the .spec.suspend solution for the CronJob API, which works, but since the OP asked specifically about Jobs, it is worth noting a solution that does not require a CronJob.
As of Kubernetes 1.21, there is alpha support for the .spec.suspend field in the Job API as well (see the docs here). The feature is behind the SuspendJob feature gate.
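If that feature gate is enabled on your cluster, a sketch of pausing and resuming a Job would be (JOB_NAME is a placeholder):
# Suspend: active pods are terminated and no new pods are created
kubectl patch job JOB_NAME --type=merge -p '{"spec":{"suspend":true}}'
# Resume
kubectl patch job JOB_NAME --type=merge -p '{"spec":{"suspend":false}}'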

Kubernetes command logging on Google Cloud Platform for PCI Compliance

Using Kubernetes' kubectl I can execute arbitrary commands on any pod such as kubectl exec pod-id-here -c container-id -- malicious_command --steal=creditcards
Should that ever happen, I would need to be able to pull up a log saying who executed the command and what command they executed. This includes if they decided to run something else by simply running /bin/bash and then stealing data through the tty.
How would I see which authenticated user executed the command as well as the command they executed?
Audit logging is not currently offered, but the Kubernetes community is working to get it available in the 1.4 release, which should come around the end of September.
There are 3rd-party solutions that can solve the auditing issue, and if you're looking for PCI compliance, as the title implies, solutions exist that help solve the broader problem, not just auditing.
Here is a link to such a solution by Twistlock. https://info.twistlock.com/guide-to-pci-compliance-for-containers
Disclaimer, I work for Twistlock.