I would like to print more logs from Terraform in an Azure pipeline, without exporting the logs to a file as Terraform's documentation describes.
You should be able to do this by setting the TF_LOG environment variable to one of the values below (most likely DEBUG in your case), in a separate task before you start running actual Terraform commands.
TRACE: the most verbose level; it shows every step taken by Terraform and produces enormous output, including internal logs.
DEBUG: describes what happens internally in a more concise way than TRACE.
INFO: shows general, high-level messages about the execution process.
WARN: logs warnings, which may indicate misconfiguration or mistakes but are not critical to execution.
ERROR: shows errors that prevent Terraform from continuing.
Bash:
export TF_LOG=<log_level>
PowerShell:
$env:TF_LOG = '<log_level>'
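For example, in an Azure Pipelines Bash step, a minimal sketch (the DEBUG level is just an assumption, pick whichever level you need):

# Enable verbose Terraform logging for the commands run in this step
export TF_LOG=DEBUG
terraform init
terraform plan

# Note: if Terraform runs in a later task, a plain export does not carry over
# between tasks; one way to make the variable visible to subsequent tasks in
# the job is a logging command:
echo "##vso[task.setvariable variable=TF_LOG]DEBUG"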
I have the following Celery setup in production:
Using RabbitMQ as the broker
Running multiple instances of the Celery worker on ECS Fargate
Logs are sent to CloudWatch using the default awslogs driver
Result backend: MongoDB
The issue I am facing is this: a lot of my tasks are not showing logs in CloudWatch.
I just see this log:
Task task_name[3d56f396-4530-4471-b37c-9ad2985621dd] received
But I do not see the actual logs for the execution of this task, nor do I see the log for its completion. For example, something like this is nowhere to be found in the logs:
Task task_name[3d56f396-4530-4471-b37c-9ad2985621dd] succeeded
This does not happen all the time; it happens intermittently but consistently, and I can see that a lot of tasks are printing their logs.
I can see that the result backend has the task results, so I know the task has executed, but the logs for the task are completely missing. It is not specific to a particular task_name.
On my local setup, I have not been able to isolate the issue.
I am not sure whether this is a Celery logging issue or an awslogs issue. What can I do to troubleshoot it?
** UPDATE **
Found the root cause: some code in the codebase was removing handlers from the root logger. Leaving this question on Stack Overflow in case someone else faces this issue.
Error message: the job failed with the error "The output of the notebook is too large. Cause: rpc response (of 20972488 bytes) exceeds limit of 20971520 bytes"
Details:
We are using Databricks notebooks to run the job. The job runs on a job cluster, and it is a streaming job.
The job started failing with the above-mentioned error.
We do not have any display(), show(), print(), or explain() calls in the job.
We are not using the awaitAnyTermination method in the job either.
We also tried adding "spark.databricks.driver.disableScalaOutput true" to the job, but it still did not work; the job fails with the same error.
We have followed all the steps mentioned in this document: https://learn.microsoft.com/en-us/azure/databricks/kb/jobs/job-cluster-limit-nb-output
Is there any option to resolve this issue, or to find out exactly which command's output is causing it to exceed the 20 MB limit?
See the docs regarding Structured Streaming in production.
I would recommend migrating to workflows based on JAR jobs because:
Notebook workflows are not supported with long-running jobs. Therefore we don’t recommend using notebook workflows in your streaming jobs.
Currently I have a OneBranch DevOps pipeline that fails every now and then while restoring packages. Usually it fails because of some transient error like a socket exception or timeout. Re-trying the job usually fixes the issue.
Is there a way to configure a job or task to retry?
Azure DevOps now supports the retryCountOnTaskFailure setting on a task to do just this.
See this page for further information:
https://learn.microsoft.com/en-us/azure/devops/release-notes/2021/pipelines/sprint-195-update
Update:
Automatic retries for a task have been added and should be available for use by the time you read this.
It can be used as follows:
- task: <name of task>
  retryCountOnTaskFailure: <max number of retries>
  ...
Here are a few things to note when using retries:
The failing task is retried immediately.
There is no assumption about the idempotency of the task. If the task has side-effects (for instance, if it created an external resource partially), then it may fail the second time it is run.
There is no information about the retry count made available to the task.
A warning is added to the task logs indicating that it has failed before it is retried.
All of the attempts to retry a task are shown in the UI as part of the same task node.
Original answer:
There is no way of doing that with native tasks. However, if your step runs a script, you can put such retry logic inside it.
You could, for instance, do it this way:
n=0
until [ "$n" -ge 5 ]      # try up to 5 times
do
  command && break        # substitute your command here; stop on success
  n=$((n+1))
  sleep 15                # wait 15 seconds before the next attempt
done
However, there is no native way of doing this for regular tasks.
Automatically retrying a task is on the roadmap, so this could change in the near future.
I am running a snakemake pipeline on an HPC cluster that uses Slurm. The pipeline is rather long, consisting of ~22 steps. Periodically, snakemake will encounter a problem when attempting to submit a job. This results in the error:
sbatch: error: Batch job submission failed: Socket timed out on send/recv operation
Error submitting jobscript (exit code 1):
I run the pipeline via an sbatch file with the following snakemake call:
snakemake -j 999 -p --cluster-config cluster.json --cluster 'sbatch --account {cluster.account} --job-name {cluster.job-name} --ntasks-per-node {cluster.ntasks-per-node} --cpus-per-task {threads} --mem {cluster.mem} --partition {cluster.partition} --time {cluster.time} --mail-user {cluster.mail-user} --mail-type {cluster.mail-type} --error {cluster.error} --output {cluster.output}'
This results in output not only for the snakemake sbatch job, but also for the jobs that snakemake creates. The above error appears in the slurm.out for the sbatch file.
The specific job step the error points to actually runs successfully and produces output, but the pipeline fails. The logs for that job step show that the job ID ran without a problem. I have googled this error, and it appears to happen often with Slurm, especially when the scheduler is under heavy I/O load, which suggests it will be an inevitable and regular occurrence. I was hoping someone has encountered this problem and could suggest a workaround so that the entire pipeline doesn't fail.
snakemake has the options --max-jobs-per-second and --max-status-checks-per-second, both with a default of 10. Maybe try decreasing them to reduce the strain on the scheduler? Also, maybe try reducing -j 999, for example as sketched below.
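A sketch of the adjusted call (the exact values are just a starting point to experiment with; the sbatch template stays the same as in your original command):

snakemake -j 100 -p \
    --max-jobs-per-second 1 \
    --max-status-checks-per-second 1 \
    --cluster-config cluster.json \
    --cluster 'sbatch ...'    # same sbatch options as in the original call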
I have two Snakemake workflows that are very similar. Both of them share a sub-workflow and a couple of includes. Both of them work when doing dry runs. Both of them use the same cluster config file, and I'm running them with the same launch command. One of them fails when submitting to the LSF cluster with this error:
Executing subworkflow wf_common.
WorkflowError:
Config file __default__ not found.
I'm wondering whether it's "legal" in Snakemake for two workflows to share a sub-workflow, like in this case, and if not, whether the fact that I ran the workflow that does work first could have this effect.
Can you try Snakemake 3.12.0? It fixed a bug with passing the cluster config to a subworkflow. I would think that this solves your problem.
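If helpful, a minimal sketch of the upgrade (assuming Snakemake was installed with conda from bioconda, or with pip):

# conda (bioconda channel)
conda install -c bioconda snakemake=3.12.0
# or pip
pip install --upgrade 'snakemake>=3.12.0'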