CloudFormation: SNS creation timing issue

CloudFormation: SNS creation timing issue - aws-cloudformation

I'm trying to create a CF script that spins up an EC2 instance and creates an SNS Topic that uses a service on that instance as an endpoint. My issue seems to be one of timing: the SNS topic will fail to be created because the endpoint is unresponsive... Because the instance is likely still initializing.
I have used the DependsOn attribute, but that doesn't do the trick.
I'm looking at WaitCondition, but I'm wondering where my 'signal' should be triggered: will the instance's httpd be fully initialized and externally accessible when the 'userdata' script is executed? Or is there another 'place' that I should put the signal?
Or should I be looking at CreationPolicy? From a quick reading of the docs, there seems to be a signal involved with that as well, so the above question stands for that solution as well.
Thanks!

Your signal should be triggered at the end of userdata script. In the userdata script, ensure that the service is up and running. You can write a loop which will poll on your service's health.
Please refer to the CreationPolicy subtopic in the Sample template link. More information on CreationPolicy
Now your SNS topic can DependsOn EC2 instance. This ensures that by that time SNS creation is triggered, your service is healthy.

Related

CloudFormation: stack is stuck, CloudTrail events shows repeating DeleteNetworkInterface event

I am deploying a stack with CDK. It gets stuck in CREATE_IN_PROGRESS. CloudTrail logs show repeating events in logs:
DeleteNetworkInterface
CreateLogStream
What should I look at next to continue debugging? Is there a known reason for this to happen?

I also saw the exact same issue with the deployment of a CDK-based ECS/Fargate Deployment
In my instance, I was able to diagnose the issue by following the content from the AWS support article https://aws.amazon.com/premiumsupport/knowledge-center/cloudformation-stack-stuck-progress/
What specifically diagnosed and then resolved it for me:-
I updated my ECS service to set the desired task count of the ECS Service to 0. At that point the Cloud Formation stack did complete successfully.
From that, it became obvious that the actual issue was related to the creation of the initial task for my ECS Service. I was able to diagnose that by reviewing the output in Deployment and Events Tab of the ECS Service in the AWS Management Console. In my case, the task creation was failing because of an issue with accessing the associated ECR repository. Obviously there could be other reasons but they should show-up there.

How to create a cron job in a Kubernetes deployed app without duplicates?

I am trying to find a solution to run a cron job in a Kubernetes-deployed app without unwanted duplicates. Let me describe my scenario, to give you a little bit of context.
I want to schedule jobs that execute once at a specified date. More precisely: creating such a job can happen anytime and its execution date will be known only at that time. The job that needs to be done is always the same, but it needs parametrization.
My application is running inside a Kubernetes cluster, and I cannot assume that there always will be only one instance of it running at the any moment in time. Therefore, creating the said job will lead to multiple executions of it due to the fact that all of my application instances will spawn it. However, I want to guarantee that a job runs exactly once in the whole cluster.
I tried to find solutions for this problem and came up with the following ideas.
Create a local file and check if it is already there when starting a new job. If it is there, cancel the job.
Not possible in my case, since the duplicate jobs might run on other machines!
Utilize the Kubernetes CronJob API.
I cannot use this feature because I have to create cron jobs dynamically from inside my application. I cannot change the cluster configuration from a pod running inside that cluster. Maybe there is a way, but it seems to me there have to be a better solution than giving the application access to the cluster it is running in.
Would you please be as kind as to give me any directions at which I might find a solution?
I am using a managed Kubernetes Cluster on Digital Ocean (Client Version: v1.22.4, Server Version: v1.21.5).

After thinking about a solution for a rather long time I found it.
The solution is to take the scheduling of the jobs to a central place. It is as easy as building a job web service that exposes endpoints to create jobs. An instance of a backend creating a job at this service will also provide a callback endpoint in the request which the job web service will call at the execution date and time.
The endpoint in my case links back to the calling backend server which carries the logic to be executed. It would be rather tedious to make the job service execute the logic directly since there are a lot of dependencies involved in the job. I keep a separate database in my job service just to store information about whom to call and how. Addressing the startup after crash problem becomes trivial since there is only one instance of the job web service and it can just re-create the jobs normally after retrieving them from the database in case the service crashed.
Do not forget to take care of failing jobs. If your backends are not reachable for some reason to take the callback, there must be some reconciliation mechanism in place that will prevent this failure from staying unnoticed.
A little note I want to add: In case you also want to scale the job service horizontally you run into very similar problems again. However, if you think about what is the actual work to be done in that service, you realize that it is very lightweight. I am not sure if horizontal scaling is ever a requirement, since it is only doing requests at specified times and is not executing heavy work.

How do you create a message queue service for the scope of a specific Kubernetes job

I have a parallel Kubernetes job with 1 pod per work item (I set parallelism to a fixed number in the job YAML).
All I really need is an ID per pod to know which work item to do, but Kubernetes doesn't support this yet (if there's a workaround I want to know).
Therefore I need a message queue to coordinate between pods. I've successfully followed the example in the Kubernetes documentation here: https://kubernetes.io/docs/tasks/job/coarse-parallel-processing-work-queue/
However, the example there creates a rabbit-mq service. I typically deploy my tasks as a job. I don't know how the lifecycle of a job compares with the lifecycle of a service.
It seems like that example is creating a permanent message queue service. But I only need the message queue to be in existence for the lifecycle of the job.
It's not clear to me if I need to use a service, or if I should be creating the rabbit-mq container as part of my job (and if so how that works with parallelism).

Get Redshift cluster status in outputs of cloudformation

I am creating a redshift cluster using CF and then I need to output the cluster status (basically if its available or not). There are ways to output the endpoints and port but I could not find any possible way of outputting the status.
How can I get that, or it is not possible ?

You are correct. According to AWS::Redshift::Cluster - AWS CloudFormation, the only available outputs are Endpoint.Address and Endpoint.Port.
Status is not something that you'd normally want to output from CloudFormation because the value changes.
If you really want to wait until the cluster is available, you could create a WaitCondition and then have something monitor the status and the signal for the Wait Condition to continue. This would probably need to be an Amazon EC2 instance with some User Data. Linux instances are charged per-second, so this would be quite feasible.

Set the ECS Cloudformation Update Stack timeout?

When updating a Cloudformation EC2 Container Service (ECS) Stack with a new Container Image, is there any way to control the timeout so if the service does not stabilize it rolls back automatically?
The UpdatePolicy attribute which is part of the Auto Scaling Group does not help since instances are not being created.
I also tried a WaitCondition but have not been able to get that to work.
The stack essentially just stays in the UPDATE_IN_PROGRESS state until it hits the default timeout (~3 hours), or you trigger a Cancel the update.
Ideally we would be able to have the stack timeout after a short period of time.
This is what my Cloudformation template looks like:
https://s3.amazonaws.com/aws-rga-cw-public/ops/cfn/ecs-cluster-asg-elb-cfn.yaml
Thanks.

I've created a workaround for this problem until AWS creates a ECS UpdatePolicy and CreationPolicy that allows for resourcing signaling:
Use AWS::CloudFormation::WaitCondition with a Macro that will create new WaitCondition resources when the service is expected to update. Signal the wait condition with a non-essential container attached to the task.
Example: https://github.com/deuscapturus/cloudformation-macro-WaitConditionUpdate/blob/master/example-ecs-service.yaml
The Macro for the above example can be found here: https://github.com/deuscapturus/cloudformation-macro-WaitConditionUpdate

My workaround for this problem is that before triggering an update stack, run a script in the background
./deployment-breaker.sh &
And for the script
#!/bin/bash
sleep 600
$deploymentStatus = (aws cloudformation describe-stack --stack-name STACK_NAME | jq XXX)
if [[ $deploymentStatus == YOUR_TERMINATE_CONDITION ]]then
aws cloudformation cancel-update-stack --stack-name STACK_NAME
fi

If your WaitCondition is in the original create you need to rename it (and the Handle). Once a waitcondition has been signaled as complete, it will always be complete. If you rename it and do an update, the original WaitCondition and Handle will be dropped and the new ones created created and signaled.
If you don't want to have to modify your template you might be able to use Lamba and Custom resources to create a unique WaitCondition via the aws cli for each update.

It's not possible at the moment with the provided CloudFormation types. I have the same problem and I might create a custom CloudFormation resource (usineg AWS Lambda) to replace my AWS::ECS::Service.
The other alternative is to use nested stacks to wrap the AWS::ECS::Service resources — it won't solve the problem, but it at least will isolate the individual service and the rest of the stack will be in a good state. My stacks have multiple services and this would help, but the custom resource is the best option so far (I know other people that did the same thing).

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse