When updating a CloudFormation EC2 Container Service (ECS) stack with a new container image, is there any way to control the timeout so that if the service does not stabilize it rolls back automatically?
The UpdatePolicy attribute which is part of the Auto Scaling Group does not help since instances are not being created.
I also tried a WaitCondition but have not been able to get that to work.
The stack essentially just stays in the UPDATE_IN_PROGRESS state until it hits the default timeout (~3 hours), or you cancel the update.
Ideally we would be able to have the stack timeout after a short period of time.
This is what my CloudFormation template looks like:
https://s3.amazonaws.com/aws-rga-cw-public/ops/cfn/ecs-cluster-asg-elb-cfn.yaml
Thanks.
I've created a workaround for this problem until AWS adds an ECS UpdatePolicy and CreationPolicy that allows for resource signaling:
Use AWS::CloudFormation::WaitCondition with a Macro that will create new WaitCondition resources when the service is expected to update. Signal the wait condition with a non-essential container attached to the task.
Example: https://github.com/deuscapturus/cloudformation-macro-WaitConditionUpdate/blob/master/example-ecs-service.yaml
The Macro for the above example can be found here: https://github.com/deuscapturus/cloudformation-macro-WaitConditionUpdate
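For reference, a minimal sketch of the basic WaitCondition plumbing (resource names and the timeout are placeholders; the linked macro is what rotates the WaitCondition resources on each update) might look like:
ServiceWaitHandle:
  Type: AWS::CloudFormation::WaitConditionHandle
ServiceWaitCondition:
  Type: AWS::CloudFormation::WaitCondition
  Properties:
    Handle: !Ref ServiceWaitHandle
    Timeout: "600"   # fail the stack operation if no signal arrives within 10 minutes
    Count: 1
# A non-essential container in the task definition signals success, e.g. by POSTing
# to the pre-signed URL in !Ref ServiceWaitHandle once the new task is healthy.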
My workaround for this problem is that, before triggering a stack update, I run a script in the background:
./deployment-breaker.sh &
And for the script
#!/bin/bash
sleep 600
# pull the status out of describe-stacks (fill in the jq filter, e.g. the stack status)
deploymentStatus=$(aws cloudformation describe-stacks --stack-name STACK_NAME | jq XXX)
if [[ $deploymentStatus == YOUR_TERMINATE_CONDITION ]]; then
  aws cloudformation cancel-update-stack --stack-name STACK_NAME
fi
If your WaitCondition is in the original create you need to rename it (and the Handle). Once a WaitCondition has been signaled as complete, it will always be complete. If you rename it and do an update, the original WaitCondition and Handle will be dropped and the new ones created and signaled.
If you don't want to have to modify your template you might be able to use Lambda and custom resources to create a unique WaitCondition via the AWS CLI for each update.
It's not possible at the moment with the provided CloudFormation types. I have the same problem and I might create a custom CloudFormation resource (using AWS Lambda) to replace my AWS::ECS::Service.
The other alternative is to use nested stacks to wrap the AWS::ECS::Service resources. It won't solve the problem, but it will at least isolate the individual service, and the rest of the stack will stay in a good state. My stacks have multiple services and this would help, but the custom resource is the best option so far (I know other people that did the same thing).
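A rough sketch of such a nested-stack wrapper (the TemplateURL and the parameter are placeholders, not taken from my actual templates):
EcsServiceStack:
  Type: AWS::CloudFormation::Stack
  Properties:
    TemplateURL: https://s3.amazonaws.com/my-bucket/ecs-service.yaml   # placeholder: a template that contains only the AWS::ECS::Service
    Parameters:
      ContainerImage: !Ref ContainerImage   # placeholder parameter passed down to the service
    TimeoutInMinutes: 10   # optional; check the docs for exactly when this timeout applies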
Related
I want to create a CI/CD pipeline for deploying micro-services using AWS ECS.
Everything is OK up to the point where a new image is uploaded to ECR (a new Docker image build is triggered when new code is committed, and the image is pushed to ECR).
The next step is that I need to update the service with the new Docker image, and I have two options:
Update CloudFormation for ECS (which means designing one stack containing only the ECS infrastructure for each micro-service)
Update the ECS service directly via the update-service CLI
Which approach should I choose?
Updated:
At first, I preferred option 1; it has advantages like:
Rollback if deployment failed
Avoid dirty state (compared with updating the resource directly)
But the thing I'm concerned about is that with one stack per ECS service this will create many stacks. Does that make the stacks too hard to manage?
Thanks all!
If you are using IaC such as CDK or CFN to manage resources then it is always suggested to make updates to those resources via IaC. Making updates directly to the resources could cause your stack to drift and cause you trouble in the long term.
The best practice is to always use CloudFormation or CDK.
You can see the version history to track changes, and you can do automatic rollbacks if there are any deployment issues.
Error message:
terraform destroy
module.application.google_sql_database_instance.sql-db-xxx: Destroying... [id=db-xxx]
Error: Error, failed to delete instance because deletion_protection is set to true. Set it to false to proceed with instance deletion
The terraform solution is here:
https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/sql_database_instance
On newer versions of the provider, you must explicitly set deletion_protection=false (and run terraform apply to write the field to state) in order to destroy an instance. It is recommended to not set this field (or set it to true) until you're ready to destroy the instance and its databases.
Question:
I do NOT want to make changes to the terraform script. I would rather toggle the deletion protection flag via gcloud, then destroy as per usual.
For gcloud VMs there is a deletion protection flag I can toggle. However, I cannot find the corresponding flag for the database:
gcloud sql instances describe db-xxx
Is this deletion_protection flag metadata within Terraform itself?
If not, what is the gcloud command to toggle it?
If so, how can I override it via Terraform without modifying the code, i.e. with a command line parameter?
I have insufficient 'points' to add to the existing thread of a similar title.
To answer your questions:
From Terraform docs:
deletion_protection - Whether or not to allow Terraform to destroy the instance.
So yes, this is within Terraform itself. On GCP, a deletion protection flag is currently only available on Compute Engine instances, not Cloud SQL instances, so it can only be toggled on a Compute Engine instance.
You may consider using input variables like this:
terraform apply -var="deletion_protection=false"
terraform destroy
There are also other ways to use input variables. For more reference, here's the link.
I am running CloudFormation updates to ECS. Triggered by CodePipeline. I would like to abort the CloudFormation deployment and rollback to the previous version after a timeout.
What is the best way to accomplish this? I saw something about WaitConditions but I'm not sure that is the right mechanism.
I also found that you can configure a TimeoutInMinutes on nested stacks https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-stack.html#cfn-cloudformation-stack-timeoutinminutes - but it sounds like you cannot apply a similar property at the top level of the stack or to an arbitrary resource?
Is there another way that I can use the combination where I can abort the Codepipeline->Cloudformation->ECS deployment after a few minutes if it doesn't succeed?
This is a general gripe with the CodePipeline ECS Deploy action (ECS, not ECS B/G): if you push a bad image, you will have to wait an hour for the timeout to occur before you can retry the pipeline.
At the moment, CodePipeline doesn't support rollbacks. You can detect a failed pipeline using CloudWatch [1] and take some action. The action will probably be roll-forward to a good version.
[1] Detect and React to Changes in Pipeline State with Amazon CloudWatch Events - https://docs.aws.amazon.com/codepipeline/latest/userguide/detect-state-changes-cloudwatch-events.html
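If you build that detection in CloudFormation, a minimal sketch of such a rule could look like the following (the pipeline name and target topic are hypothetical):
PipelineFailedRule:
  Type: AWS::Events::Rule
  Properties:
    Description: React to failed pipeline executions
    EventPattern:
      source:
        - aws.codepipeline
      detail-type:
        - CodePipeline Pipeline Execution State Change
      detail:
        state:
          - FAILED
        pipeline:
          - my-ecs-pipeline   # hypothetical pipeline name
    Targets:
      - Id: notify-on-failure
        Arn: !Ref AlarmTopic   # hypothetical SNS topic used to page someone or kick off a roll-forward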
We don't use CodePipeline, we're using Sceptre. But I guess my workaround could still work.
My workaround for this problem is that, before triggering a deployment, I run a script in the background:
./deployment-breaker.sh &
And for the script
#!/bin/bash
sleep 600
# pull the status out of describe-stacks (fill in the jq filter, e.g. the stack status)
deploymentStatus=$(aws cloudformation describe-stacks --stack-name STACK_NAME | jq XXX)
if [[ $deploymentStatus == YOUR_TERMINATE_CONDITION ]]; then
  aws cloudformation cancel-update-stack --stack-name STACK_NAME
fi
I am new to Kubernetes, and I am looking to see if it's possible to hook into the container execution lifecycle events in the orchestration process, so that I can call an API with the details of the container and check whether it is allowed to execute in the given environment, location, etc.
An example check could be: the container can only be run in Europe or US data centers, so if someone tries to execute it outside those regions' data centers, it should not be allowed.
Is this possible and what is the best way to achieve this?
You can possibly set up an ImagePolicy admission controller in the clusters, where you describe from which registries it is allowed to pull images.
kube-image-bouncer is an example of an ImagePolicy admission controller:
A simple webhook endpoint server that can be used to validate the images being created inside of the Kubernetes cluster.
If you don't want to start from scratch, there is a Cloud Native Computing Foundation (incubating) project, Open Policy Agent, with support for Kubernetes that seems to offer what you want. (I am not affiliated with the project.)
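As a very rough sketch of how such a webhook gets wired into the cluster (the service name, namespace, path, and CA bundle are placeholders for whatever the bouncer or OPA deployment actually exposes):
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: image-policy                 # placeholder name
webhooks:
  - name: image-policy.example.com   # placeholder; must be a fully qualified name
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Fail              # reject pods when the webhook is unreachable
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
    clientConfig:
      service:
        namespace: image-policy      # placeholder: namespace of the webhook server
        name: image-bouncer          # placeholder: the kube-image-bouncer (or OPA) Service
        path: /image_policy          # placeholder endpoint path
      caBundle: <base64-encoded CA certificate>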
Because Kubernetes handles situations where there's a typo in the job spec, and therefore a container image can't be found, by leaving the job in a running state forever, I've got a process that monitors job events to detect cases like this and deletes the job when one occurs.
I'd prefer to just stop the job so there's a record of it. Is there a way to stop a job?
1) According to the K8S documentation here.
Finished Jobs are usually no longer needed in the system. Keeping them around in the system will put pressure on the API server. If the Jobs are managed directly by a higher level controller, such as CronJobs, the Jobs can be cleaned up by CronJobs based on the specified capacity-based cleanup policy.
Here are the details for the failedJobsHistoryLimit property in the CronJobSpec.
This is another way of retaining the details of failed jobs for a specific duration. The failedJobsHistoryLimit property can be set based on the approximate number of jobs run per day and the number of days the logs have to be retained. Agreed, the Jobs will still be there and put pressure on the API server.
This is interesting. Once the job completes with failure, as in the case of a typo in the image name, the pod gets deleted and the resources are not blocked or consumed anymore. I'm not sure exactly what a kubectl job stop would achieve in this case. But when a Job with a proper image runs to success, I can still see the pod in kubectl get pods.
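A minimal sketch of where failedJobsHistoryLimit sits in a CronJob spec (the name, schedule, and limits are arbitrary examples; use batch/v1beta1 on clusters older than 1.21):
apiVersion: batch/v1
kind: CronJob
metadata:
  name: example-cron              # arbitrary example name
spec:
  schedule: "0 * * * *"
  failedJobsHistoryLimit: 5       # keep the last 5 failed Jobs around for inspection
  successfulJobsHistoryLimit: 3   # keep the last 3 successful Jobs
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: main
              image: busybox
              command: ["sh", "-c", "echo hello"]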
2) Another approach without using the CronJob is to specify the ttlSecondsAfterFinished as mentioned here.
Another way to clean up finished Jobs (either Complete or Failed) automatically is to use a TTL mechanism provided by a TTL controller for finished resources, by specifying the .spec.ttlSecondsAfterFinished field of the Job.
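A minimal sketch of a Job using that field (the name and TTL value are arbitrary examples):
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job               # arbitrary example name
spec:
  ttlSecondsAfterFinished: 600    # delete this Job and its pods 10 minutes after it finishes (Complete or Failed)
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: main
          image: busybox
          command: ["sh", "-c", "echo done"]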
Not really, no such mechanism exists in Kubernetes yet afaik.
A workaround is to SSH into the machine and run the following (if you're using Docker):
# Save the logs
$ docker logs <container-id-that-is-running-your-job> > save.log 2>&1
# Stop the main container so the job ends
$ docker stop <main-container-id-for-your-job>
It's better to stream logs with something like Fluentd, Logspout, or Filebeat and forward them to an ELK or EFK stack.
In any case, I've opened this
You can suspend cronjobs by using the suspend attribute. From the Kubernetes documentation:
https://kubernetes.io/docs/tasks/job/automated-tasks-with-cron-jobs/#suspend
Documentation says:
The .spec.suspend field is also optional. If it is set to true, all subsequent executions are suspended. This setting does not apply to already started executions. Defaults to false.
So, to pause a cron you could:
Run kubectl edit cronjob CRON_NAME and change "suspend" from false to true (if it is not in the default namespace, add "-n NAMESPACE_NAME" at the end).
You could potentially create a loop using "for" or whatever you like, and have them all changed at once.
Or you could just save the YAML file locally, make the same change there, and then run:
kubectl create -f cron_YAML
and this would recreate the cron.
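If you go the route of editing the YAML, the relevant part of the manifest is just the one field (the name and schedule are arbitrary examples):
apiVersion: batch/v1              # batch/v1beta1 on clusters older than 1.21
kind: CronJob
metadata:
  name: example-cron              # arbitrary example name
spec:
  suspend: true                   # pauses all future scheduled runs; already-running Jobs are unaffected
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: main
              image: busybox
              command: ["date"]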
The other answers hint around the .spec.suspend solution for the CronJob API, which works, but since the OP asked specifically about Jobs it is worth noting the solution that does not require a CronJob.
As of Kubernetes 1.21, there is alpha support for the .spec.suspend field in the Job API as well (see the docs here). The feature is behind the SuspendJob feature gate.
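A minimal sketch of a suspended Job under those assumptions (the name is arbitrary, and this only works on 1.21+ with the SuspendJob feature gate enabled):
apiVersion: batch/v1
kind: Job
metadata:
  name: example-suspendable-job   # arbitrary example name
spec:
  suspend: true                   # while true, the controller will not run (and will terminate) the Job's pods
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: main
          image: busybox
          command: ["sh", "-c", "sleep 30"]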