Concourse Worker "file not found"

We are currently running Concourse workers with docker-compose.
When you shut down the stack for a restart or upgrade, the worker ends up in a broken, undefined state.
This leads to a range of issues: "file not found" errors when running a task/job in Concourse, or a job/task triggered by a resource getting stalled in the "preparing build" state, not doing anything.
Is there a way to work around or solve this?

We had a similar problem recently. Without seeing your configuration this is just a shot in the dark, but we solved it by adding the SSH port of the web container to the TSA_HOST setting for the worker in docker-compose.yml (change in the last line below):
worker:
  build: concourse-worker
  privileged: true
  links:
    - web
  environment:
    CONCOURSE_TSA_HOST: web:2222

This sounds similar to the issue described here: essentially the ATC tries to register the worker under the same name as before the upgrade, but the worker name has changed. You can run fly workers to get the name of the stalled worker, then use concourse retire-worker to retire it (or concourse land-worker to land it gracefully) so the new worker can register.
A not-so-pretty way to solve this temporarily is to recreate the worker by running docker-compose again and waiting a bit, since the new worker needs to register itself with the ATC.
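For reference, the retire flow might look like the following from the command line. This is a sketch, not an exact recipe: the fly target name ci is a placeholder, and the exact retire-worker flags vary by Concourse version, so check concourse retire-worker --help on your install.
fly -t ci workers                               # list workers and note the stalled worker's name
concourse retire-worker --name stalled-worker   # tell the TSA/ATC to forget the old worker
docker-compose up -d worker                     # recreate the worker and let it re-register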

Related

AWS ECS won't start tasks: http request timed out enforced after 4999ms

I have an ECS cluster (Fargate), task, and service that have been set up in Terraform for at least a year. I haven't touched them in a long while. My normal deployment for updating the code is to push a new container to the registry and then stop all tasks on the cluster with a script. Today, my service did not run a new task in response to that task being stopped. Its desired count is fixed, so it should have.
I have gone in and tried to run the task manually, and I'm seeing this error:
Unable to run task
Http request timed out enforced after 4999ms
When I try to do this, a new stopped task is added to my stopped tasks list. When I look into that task, the stopped reason is "Deployment restart", and two of them are now showing "Task provisioning failed.", which I think might be the tasks the service tried to start. Those tasks do not show a started timestamp; the ones I start in the console do.
My site is now down and I can't get it back up. Does anyone know of a way to debug this? Is AWS ECS experiencing problems right now? I checked the health monitors and I see no issues.
This was an AWS outage affecting Fargate in us-east-1. It's fixed now.
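For future debugging, the stop reason can also be pulled from the CLI; a minimal sketch, assuming a cluster named my-cluster:
# list recently stopped tasks in the cluster
aws ecs list-tasks --cluster my-cluster --desired-status STOPPED
# show why a given task stopped
aws ecs describe-tasks --cluster my-cluster --tasks <task-arn> \
  --query 'tasks[].{reason:stoppedReason,code:stopCode}'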

Celery: Some tasks are not printing logs on ECS deployment

I have the following Celery setup in production:
RabbitMQ as the broker
Multiple instances of Celery running on ECS Fargate
Logs sent to CloudWatch using the default awslogs driver
MongoDB as the result backend
The issue I am facing is that a lot of my tasks are not showing logs in CloudWatch.
I just see this log:
Task task_name[3d56f396-4530-4471-b37c-9ad2985621dd] received
But I do not see the actual logs for the execution of this task, nor the logs for its completion; for example, something like this is nowhere to be found in the logs:
Task task_name[3d56f396-4530-4471-b37c-9ad2985621dd] succeeded
This does not happen all the time; it happens intermittently but consistently, and a lot of tasks do print their logs.
I can see that the result backend has the task results, so I know the tasks have executed, but their logs are completely missing. It is not specific to any one task_name.
On my local setup, I have not been able to isolate the issue.
I am not sure if this is a Celery logging issue or an awslogs issue. What can I do to troubleshoot it?
** UPDATE **
Found the root cause: some code in the codebase was removing handlers from the root logger. Leaving this question up in case someone else faces the same issue.
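For anyone hitting the same symptom, the failure mode looks roughly like the sketch below: once any code clears the root logger's handlers, INFO-level records emitted inside a task have no handler left and are silently dropped. The offending code is not shown in the question; this is an illustrative reconstruction.
import logging

# awslogs ships whatever the container writes to stdout/stderr,
# so log records only reach CloudWatch while a handler is attached
logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)

log.info("this line reaches CloudWatch")

# hypothetical offender hidden somewhere in the codebase
for handler in logging.root.handlers[:]:
    logging.root.removeHandler(handler)

log.info("this line is silently dropped - no handler is attached anymore")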

Running a pod/container in Kubernetes that applies maintenance to a DB

I have found several people asking about how to start a container running a DB, then run a different container that runs maintenance/migration on the DB which then exits. Here are all of the solutions I've examined and what I think are the problems with each:
Init Containers - This won't work because these run before the main container is up, and they block the start of the main container until they complete successfully.
Post Start Hook - If the postStart hook could start containers rather than simply exec a command inside the container, then this would work. Unfortunately, the container with the DB does not (and should not) contain the rather large maintenance application required to run it this way. That would violate the principle that each component should do one thing and do it well.
Sidecar Pattern - This WOULD work if the restartPolicy were assignable or overridable at the container level rather than the pod level. In my case the maintenance container should terminate successfully before the pod is considered Running (just as it would if the postStart hook could run a container), while the DB container should Always restart.
Separate Pod - Running the maintenance as a separate pod can work, but the DB shouldn't be considered up until the maintenance runs. That means managing the Running state has to be done completely independently of Kubernetes. Every other container/pod in the system will have to do a custom check that the maintenance has run rather than a simple check that the DB is up.
Using a Job - Unless I misunderstand how these work, this would be equivalent to the above ("Separate Pod").
OnFailure restart policy with a Sidecar - This means using a restartPolicy of OnFailure for the pod but then hacking the DB container so that it always exits with an error. This is doable but obviously just a hacked workaround. EDIT: This also causes problems with the state of the pod. While the maintenance container is still up and both containers are running, the state of the pod is Ready, but once the maintenance container exits, even with SUCCESS (exit code 0), the state of the pod goes to NotReady 1/2.
Is there an option I've overlooked or something I'm missing about the above solutions? Thanks.
One option would be to use the sidecar pattern with two slight changes to the approach you described:
After the maintenance command has executed, you keep the container running with a while : ; do sleep 86400; done loop or something similar.
You set an appropriate startupProbe in place that resolves successfully only once your maintenance command has executed successfully. You could, for example, create a file /maintenance-done and use a startupProbe like this:
startupProbe:
  exec:
    command:
      - cat
      - /maintenance-done
  initialDelaySeconds: 5
  periodSeconds: 5
With this approach you have the following outcome:
Having the same restartPolicy for both your database and sidecar containers works fine thanks to the sleep hack.
Your pod only becomes ready when both containers are ready. In the sidecar container's case this happens when the startupProbe succeeds.
Furthermore, there will be no noticeable overhead in your pod: even if the sidecar container keeps running, it will consume close to zero resources since it is only running the sleep command.
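Putting the pieces together, a pod spec following this approach might look roughly like this (the image names, the migrate.sh script, and the /maintenance-done marker file are placeholders):
apiVersion: v1
kind: Pod
metadata:
  name: db-with-maintenance
spec:
  restartPolicy: Always
  containers:
    - name: db
      image: postgres:15                          # placeholder DB image
    - name: maintenance
      image: registry.example.com/db-maintenance  # placeholder maintenance image
      # run the migration, record success, then sleep forever
      command: ["sh", "-c", "./migrate.sh && touch /maintenance-done && while :; do sleep 86400; done"]
      startupProbe:
        exec:
          command: ["cat", "/maintenance-done"]
        initialDelaySeconds: 5
        periodSeconds: 5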

How do I upgrade concourse from 3.4.0 to 3.5.0 without causing jobs to abort with state error?

When I upgraded Concourse from 3.4.0 to 3.5.0, all running jobs suddenly changed their state from running to errored, and I can now see the string 'no workers' at the start of their logs. Starting the jobs manually, or having them triggered by the next changes, worked without any problem.
The upgrade of concourse itself was successful.
I was watching what bosh did at the time and I saw this change of job states took place all at once while either the web or the db VM was upgraded (I don't know which one). I am pretty sure that the worker VMs were not touched yet by bosh.
Is there a way to avoid this behavior?
We have one db, one web VM and six workers.
With only one web VM, it's possible that it was out of service long enough for all the workers to expire. Workers heartbeat continuously, and if they miss two heartbeats (which takes 1 minute by default) they stall. They should come back after the deploy is finished, but if scheduling happened before they heartbeated again, that would cause those errors.
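One way to check for this after an upgrade is to list the workers and remove any that remain stalled; a small sketch, assuming a configured fly target named ci and a fly version that ships prune-worker:
fly -t ci workers                       # stalled workers show up with state "stalled"
fly -t ci prune-worker -w worker-name   # forget a stalled worker so a fresh one can register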

Kubernetes: active job is erroneously marked as completed after cluster upgrade

I have a working Kubernetes cluster (v1.4.6) with an active job that has a single failing pod (i.e. it is constantly restarted). This is a test; the job should never reach completion.
If I restart the same cluster (e.g. reboot the node), the job is properly re-scheduled and continues to be restarted.
If I upgrade the cluster to v1.5.3, then the job is marked as completed once the cluster is up. The upgrade is basically the same as a restart; both use the same etcd cluster.
Is this the expected behavior when going to v1.5.x? If not, what can be done to have the job continue running?
I should provide a little background on my problem: the job is ultimately to become a driver in the update process, and it is important to have it running (even in the face of cluster restarts) until it achieves a certain goal. Is this possible using a job?
In v1.5.0, extensions/v1beta1.Job was deprecated in favor of batch/v1.Job, so simply upgrading the cluster without updating the job definition is expected to cause side effects.
See the Kubernetes CHANGELOG for a complete list of changes and deprecations in v1.5.0.
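Concretely, that means moving the job manifest to the batch/v1 API group; a minimal sketch (the job name and container spec are placeholders):
apiVersion: batch/v1        # formerly extensions/v1beta1
kind: Job
metadata:
  name: update-driver
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: driver
          image: registry.example.com/update-driver   # placeholder image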