ECS + ALB - My applications only respond a few times - amazon-ecs

I've developed two Spring Boot applications as microservices and used ECS to deploy them in containers.
To do this, I followed the official pet clinic example (https://github.com/aws-samples/amazon-ecs-java-microservices/tree/master/3_ECS_Java_Spring_PetClinic_CICD).
Everything seems to work correctly, but when I make requests to the ALB I very often receive a 502 or 503 HTTP error, and only occasionally see the correct response from the applications.
Can someone help me?
Thanks in advance.

An ALB generally returns a 502 when a target closes the connection or sends a malformed response, which happens when a task crashes or is being stopped, and a 503 when the target group has no registered targets, which happens while a task is starting or restarting.
All of this means that your task was stopped and your cluster then restarted it, so you should find out what made the task fail.
It could be something in your code that makes it crash, or it could be the health check defined in your target group failing.
First, look at your task in the AWS ECS console and see what error it reports when it is stopped.
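The same information shown in the console can also be pulled programmatically. The following is a minimal sketch, assuming you pass in a boto3 ECS client (for example `boto3.client("ecs")`); the function name `stopped_task_reasons` and the cluster name in the usage note are hypothetical. It relies on the ECS `list_tasks` and `describe_tasks` calls and reads the `stoppedReason` field of each task.

```python
from typing import Any, Dict, List


def stopped_task_reasons(ecs_client: Any, cluster: str) -> List[Dict[str, Any]]:
    """Collect the stop reason and container exit details of stopped tasks.

    ecs_client is assumed to behave like boto3.client("ecs").
    Only the first page of results from list_tasks is inspected.
    """
    listed = ecs_client.list_tasks(cluster=cluster, desiredStatus="STOPPED")
    arns = listed.get("taskArns", [])
    if not arns:
        return []

    described = ecs_client.describe_tasks(cluster=cluster, tasks=arns)
    report = []
    for task in described.get("tasks", []):
        report.append({
            "taskArn": task.get("taskArn"),
            # Task-level reason, e.g. a failed load balancer health check.
            "stoppedReason": task.get("stoppedReason"),
            # Container-level details help spot crashes (non-zero exit codes).
            "containers": [
                {
                    "name": c.get("name"),
                    "exitCode": c.get("exitCode"),
                    "reason": c.get("reason"),
                }
                for c in task.get("containers", [])
            ],
        })
    return report
```

With boto3 installed and credentials configured, it could be called as `stopped_task_reasons(boto3.client("ecs"), "my-cluster")`, where `my-cluster` stands in for your own cluster name. ECS only keeps stopped tasks visible for a limited time, so run it soon after a failure.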
Since you can make requests for some time before they start failing, I'm pretty sure your problem comes from your health check. Go to the target group used by the task (in the AWS EC2 console) and make sure the configured health check path exists and returns a 200 status code.
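To check the health check path by hand, a small probe run from a machine that can reach the task may help. This is only a sketch: the function names `status_matches` and `probe` are made up for illustration, and the address and path in the usage note are assumptions to be replaced with whatever your target group is actually configured with. The matcher syntax ("200", "200,202", "200-299") mirrors the success codes field of an ALB target group.

```python
import urllib.error
import urllib.request


def status_matches(status: int, matcher: str = "200") -> bool:
    """Return True if status satisfies a matcher such as "200", "200,202" or "200-299"."""
    for part in matcher.split(","):
        part = part.strip()
        if "-" in part:
            low, high = part.split("-", 1)
            if int(low) <= status <= int(high):
                return True
        elif part and int(part) == status:
            return True
    return False


def probe(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code of a GET request to url.

    Note: urlopen follows redirects, so a 301/302 is reported as the
    status of the final response, unlike a load balancer health check.
    Connection errors and timeouts are raised to the caller.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as exceptions; report their code.
        return err.code
```

For example, `status_matches(probe("http://10.0.1.23:8080/actuator/health"), "200")` would tell you whether that path passes; both the address and the `/actuator/health` path (which exists only if Spring Boot Actuator is enabled) are placeholders. If the result is False, or the call times out, the target group will mark the task unhealthy and ECS will replace it.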

Related

AWS ECS won't start tasks: http request timed out enforced after 4999ms

I have an ECS cluster (Fargate), task, and service that I have had set up in Terraform for at least a year. I haven't touched it for a long while. My normal deployment for updating the code is to push a new container to the registry and then stop all tasks on the cluster with a script. Today, my service did not run a new task in response to the task being stopped. Its desired count is fixed, so it should.
I have gone in and tried to run the task manually, and I'm seeing this error:
Unable to run task
Http request timed out enforced after 4999ms
When I try to do this, a new stopped task is added to my stopped tasks list. When I look into that task, the stopped reason is "Deployment restart", and two of them are now showing "Task provisioning failed.", which I think might be tasks the service tried to start. These tasks do not show a started timestamp, whereas the ones I start in the console do.
My site is now down and I can't get it back up. Does anyone know of a way to debug this? Is AWS ECS experiencing problems right now? I checked the health monitors and I see no issues.
This was an AWS outage affecting Fargate in us-east-1. It's fixed now.

The process cannot access the file '/app' because it is being used by another process - swagger on kubernetes

I have a .NET Core 3.1 Web API application running on AKS.
Sometimes a pod returns the error: "The process cannot access the file '/app' because it is being used by another process"
The other pods work fine, but the affected pod never recovers, even though it keeps running and receiving requests.
I haven't been able to reproduce the error. Does anybody have any idea?
Thanks

Timeout waiting for network interface provisioning to complete

Does anyone know why an ECS Fargate task would fail with this error?
Timeout waiting for network interface provisioning to complete.
I am running an ECS Fargate task using Step Functions. The IAM role for the step function has access to the task definition, and the state machine code also looks good. The same step function worked fine before, but I ran into this error just now. Why would this happen? Is it an occasional failure?
According to AWS support, intermittent failures of this nature are to be expected (with relatively low probability).
The recommendation was to set retryAttempts > 1 to handle these situations.
This can happen if there are problems within AWS. You can view the Network Interfaces page on the EC2 console and you may see errors loading, which is an indication of API problems within EC2. You can also check status.aws.amazon.com to look for errors. Note that AWS can be slow to acknowledge problems there, so you may experience the errors before they update the status page!
AWS support has a detailed post on resolving network interface provisioning errors for ECS on Fargate. Here's an excerpt:
If the Fargate service tries to attach an elastic network interface to the underlying infrastructure that the task is meant to run on, then you can receive the following error message: "Timeout waiting for network interface provisioning to complete."
Fargate runs into intermittent API issues, usually while spinning up tasks for Step Functions and AWS Batch jobs. As recommended in another answer, you can raise MaxAttempts for the retry in the state machine definition.
"Retry": [
  {
    "MaxAttempts": 3
  }
]
Additionally, reattempts can be automated with an exponential backoff and retry logic in AWS Step Functions.
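To make the backoff behaviour concrete, here is a small sketch that builds a fuller retrier and computes the waits it implies. The helper names `build_retrier` and `retry_delays` are hypothetical, and catching `States.ALL` is an assumption; you may prefer to list narrower error names. The field names (`ErrorEquals`, `IntervalSeconds`, `MaxAttempts`, `BackoffRate`) are the ones defined by the Amazon States Language, where the wait before each retry is the interval multiplied by the backoff rate once per previous retry.

```python
import json
from typing import Any, Dict, List


def build_retrier(interval_seconds: int = 10,
                  max_attempts: int = 3,
                  backoff_rate: float = 2.0) -> Dict[str, Any]:
    """Build one entry for a state's "Retry" list.

    "States.ALL" is a catch-all chosen for illustration only.
    """
    return {
        "ErrorEquals": ["States.ALL"],
        "IntervalSeconds": interval_seconds,
        "MaxAttempts": max_attempts,
        "BackoffRate": backoff_rate,
    }


def retry_delays(retrier: Dict[str, Any]) -> List[float]:
    """Return the wait in seconds before each retry attempt."""
    interval = retrier["IntervalSeconds"]
    rate = retrier["BackoffRate"]
    return [interval * rate ** n for n in range(retrier["MaxAttempts"])]


def retry_block_json(retrier: Dict[str, Any]) -> str:
    """Render the fragment to paste into a Task state definition."""
    return json.dumps({"Retry": [retrier]}, indent=2)
```

With the defaults above, `retry_delays(build_retrier())` gives waits of 10, 20 and 40 seconds, which is usually enough for a transient provisioning failure to clear. `retry_block_json(build_retrier())` prints the corresponding "Retry" fragment.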
I was hitting the same issue until I switched over to Fargate platform version 1.4.0.
It looks like there were some changes made to the networking side of things.
https://aws.amazon.com/blogs/containers/aws-fargate-launches-platform-version-1-4/
The default version is currently still set to 1.3.0, so maybe try setting 1.4.0 explicitly and see if it fixes it for you.

Why would running a container on GCE get stuck with "Metadata request unsuccessful: Forbidden (403)"?

I'm trying to run a container in a custom VM on Google Compute Engine. This is to perform a heavy ETL process, so I need a large machine, but only for a couple of hours a month. I have two versions of my container with small startup changes. Both versions were built and pushed to the same Google Container Registry by the same computer using the same Google login. The older one works fine, but the newer one fails by getting stuck in an endless stream of the following error:
E0927 09:10:13 7f5be3fff700 api_server.cc:184 Metadata request unsuccessful: Server responded with 'Forbidden' (403): Transport endpoint is not connected
Can anyone tell me exactly what's going on here? Can anyone please explain why one of my images doesn't have this problem (well it gives a few of these messages but gets past them) and the other does have this problem (thousands of this message and taking over 24 hours before I killed it).
If I ssh in to a GCE instance then both versions of the container pull and run just fine. I'm suspecting the INTEGRITY_RULE checking from the logs but I know nothing about how that works.
MORE INFO: this is down to "restart policy: never". Even a simple Centos:7 container that says "hello world", deployed from the console, triggers this if the restart policy is never. At least in the short term I can fix this in the entrypoint script, as the instance will be destroyed when the monitor realises that the process has finished.
I suggest you try creating a 3rd container that's focused on the metadata service functionality to isolate the issue. It may be that there's a timing difference between the 2 containers that's not being overcome.
Make sure you can curl the metadata service from the VM and that the request to the metadata service is using the VM's service account.
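The same check can be scripted from inside the VM or the container. This is a minimal sketch: the function names are hypothetical, and it assumes the standard Compute Engine metadata endpoint, which requires the `Metadata-Flavor: Google` header on every request. It asks for the email of the default service account, so the answer shows which identity the request is using.

```python
import urllib.request

# Standard metadata server path for the default service account's email.
METADATA_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/email"
)


def build_metadata_request(url: str = METADATA_URL) -> urllib.request.Request:
    """Build a GET request carrying the header the metadata server requires."""
    return urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})


def fetch_service_account_email(timeout: float = 5.0) -> str:
    """Return the service account email reported by the metadata server.

    Only works from inside a GCE VM. A 403 surfaces here as
    urllib.error.HTTPError, and an unreachable server as
    urllib.error.URLError.
    """
    with urllib.request.urlopen(build_metadata_request(), timeout=timeout) as resp:
        return resp.read().decode("utf-8").strip()
```

Calling `fetch_service_account_email()` on the VM and again inside each container should return the same account. If it works on the host but raises inside the failing container, that points at the container's network or startup timing rather than at the service account itself.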

Qlik Sense Scheduler Stopped after removing and adding back proxy node

I need some advice.
After we removed the proxy node and added it back, the scheduler service stopped and is now stuck at "Starting".
Any advice on how to get it to start would be appreciated.
The event log shows error:
Try restarting the scheduler service and any other services it depends on, such as the repository service.