GitLab coordinator waits 5 minutes after a job is finished by the runner - how to diagnose? (Kubernetes)

We have a current (13.5.1-ee) Kubernetes-deployed GitLab which has recently developed an unusual and obstructive behaviour:
After a job finishes (either successfully or with a failure) and this is reported by the runner (as confirmed by the local log), the coordinator waits five minutes before reporting the status in the UI and starting the next job.
This behaviour does not depend on:
the executor type (it occurs with both the Docker and Kubernetes executors)
the size of artifacts (the test cases have none)
the size of logs (for the test cases they are 5 lines long)
the image being used (for the test cases it is busybox)
the script (for the tests it is empty)
network quality (for the tests I have activated feature flags relevant to this)
It feels as if the coordinator is attempting a call to another system and timing out.
Has anyone seen this before? Does anyone have a way to diagnose this?

This was a bug affecting Azure.
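For anyone who needs to diagnose something similar: a minimal first step (a sketch, assuming a Helm-deployed GitLab and a standard runner install; the namespace, label selectors and job id are placeholders) is to compare the runner's debug log with the coordinator-side logs around the moment the job finishes:

# Run the runner with debug logging (or set log_level = "debug" in config.toml)
# to confirm exactly when it reports the job as finished to the coordinator.
gitlab-runner --debug run

# On the GitLab side, tail the Sidekiq and webservice logs around that timestamp
# to see whether the final status/trace update arrives immediately or stalls.
kubectl -n <gitlab-namespace> logs -l app=sidekiq --since=15m | grep -i "<job-id>"
kubectl -n <gitlab-namespace> logs -l app=webservice --since=15m | grep -i "<job-id>"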


Scheduling jobs fails with org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException

Thank you for reading this SO question; it may seem long, but I'll try to include as much information as possible to help get to an answer.
Summary
We are currently experiencing a scheduling issue with our Flink cluster.
The symptoms are that some, most, or all of our tasks (it varies; the symptoms are not always the same) are shown as SCHEDULED but fail after a timeout, while the jobs are shown as RUNNING.
The failing exception is the following one:
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Slot request bulk is not fulfillable! Could not allocate the required slot within slot request timeout
After analysis, we assume (we cannot prove it, as there are not many logs for that part of the code) that the failure is due to a deadlock or race condition that occurs when several jobs are submitted to the Flink cluster at the same time, even though we have enough slots available in the cluster.
We actually see the error with 52 task slots available and 12 jobs that are not scheduled.
Additional information
Flink version: 1.13.1 commit a7f3192
Flink cluster in session mode
2 job managers using k8s HA mode (resource requests: 2 CPUs, 4 GB RAM; memory limit set to 4 GB)
50 task managers with 2 slots each (resource requests: 2 CPUs, 2 GB RAM; no limits set)
Our Flink cluster is shut down every night and restarted every morning. The error seems to occur when a lot of jobs need to be scheduled. The jobs are configured to restore their state, and we do not see any issues for jobs that are scheduled and run correctly; it really seems to be a scheduling issue.
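As a side note, the roughly five-minute delay before the exception matches Flink's default slot request timeout (slot.request.timeout, 300000 ms). A minimal flink-conf.yaml sketch, assuming that default is still in place in your deployment (the value below is illustrative; raising the timeout does not fix a deadlock, it only helps distinguish requests that are slow from requests that are genuinely stuck):

# flink-conf.yaml (session cluster; setup here is 50 TMs x 2 slots each)
taskmanager.numberOfTaskSlots: 2
# Default is 300000 ms (5 minutes); raise it to see whether the slot request
# bulk is eventually fulfilled or never fulfilled at all.
slot.request.timeout: 600000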
Questions
Could it be that the issue described in FLINK-23409 is actually the same one, but only occurs when there is a race condition while scheduling several jobs?
Is there any way to increase logging in the scheduler to debug this issue? (A logging configuration sketch follows below.)
Is it a known issue? If yes, is there any workaround/solution to resolve it?
P.S.: a while ago I asked more or less the same question on the mailing list, but dropped it. I'm sorry if this is considered cross-posting; that's not the intent. We are just opening a new thread because we have more information and the issue has re-occurred.
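Regarding the logging question above: Flink 1.13 ships a log4j2-style conf/log4j.properties on the JobManager, and adding DEBUG loggers for the scheduler and slot-management packages should show in much more detail why a slot request bulk is considered unfulfillable. A sketch (the package names are my best guess at the components involved; double-check them against your Flink version):

# conf/log4j.properties on the JobManager(s)
logger.scheduler.name = org.apache.flink.runtime.scheduler
logger.scheduler.level = DEBUG
logger.slotpool.name = org.apache.flink.runtime.jobmaster.slotpool
logger.slotpool.level = DEBUG
logger.slotmanager.name = org.apache.flink.runtime.resourcemanager.slotmanager
logger.slotmanager.level = DEBUG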

Does React Testing Library take up a lot of CPU/RAM, causing tests to time out & fail?

I've had some issues with tests timing out randomly, usually on CircleCI but sometimes locally. Based on Kent C. Dodds's suggestion to write fewer, longer tests, I now have more tests with multiple clicks and multiple network requests (mocking fetch too). These tests seem to time out. CircleCI recently added a Resources tab to the pipeline with some interesting metrics. When the tests time out, the 4 GB of RAM clearly sits at 100% for an extended time, and the test fails. On a passing test, the RAM stays mostly below 100%.
Failed test (4 GB): [screenshot of the Resources tab]
Passed test (4 GB): [screenshot of the Resources tab]
Updated resource_class to 8 GB
I tried a single experiment: updating my CircleCI config so that the resource_class is set to large (8 GB). The test passed, with even better CPU usage.
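For reference, the change in that experiment was just the resource_class on the job that runs the tests in .circleci/config.yml; a minimal sketch (the job name, image and commands are placeholders rather than my real config):

# .circleci/config.yml
version: 2.1
jobs:
  test:
    docker:
      - image: cimg/node:16.14
    resource_class: large   # 4 vCPUs / 8 GB RAM instead of the default medium (2 vCPUs / 4 GB)
    steps:
      - checkout
      - run: yarn install --frozen-lockfile
      - run: yarn test --maxWorkers=2   # capping Jest workers also helps keep RAM in check
workflows:
  test:
    jobs:
      - test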
So, does React Testing Library take up a lot of horsepower?
Is our default 4 GB Docker resource class OK?

AWS Fargate vs Batch vs ECS for a once-a-day batch process

I have a batch process, written in PHP and embedded in a Docker container. Basically, it loads data from several web services, does some computation on the data (for about an hour), and posts the computed data to another web service; then the container exits (with a return code of 0 if OK, 1 if something failed during the process). While it runs, some logs are written to STDOUT or STDERR. The batch must be triggered once a day.
I was wondering what the best AWS service is to schedule, execute, and monitor my batch process:
At the very beginning, I used an EC2 machine with a crontab: no high availability there, so I decided to switch to a more PaaS-like approach.
Then I used Elastic Beanstalk for Docker, with a non-functional web server (only there to answer the health check) and a crontab inside the container to wake up my batch command once a day. With an autoscaling rule of min=1, max=1, I have HA (if the container or the VM crashes, it is restarted by AWS).
Now, to be more efficient, I decided to move to ECS, with an approach where I do not need EC2 instances running 23 hours a day for nothing. So I tried Fargate:
With Fargate I defined my task (Fargate launch type, not EC2) and configured everything on it.
I created a cluster to run my task: I can run the task by hand, one time, so I know all the settings are correct.
Now, going deeper into Fargate, I want my task to be executed once a day.
It seems to work fine when I use the Scheduled Task feature of ECS: the container starts on time, the process runs, then the container stops. But CloudWatch is missing some metrics: CPUReservation and CPUUtilization are not reported. Also, there is no way to know whether the batch exited with code 0 or 1 (every execution ends with status "STOPPED"), so I can't raise a CloudWatch alarm when the container execution fails.
I also tried the "Services" feature of Fargate, but it cannot handle a batch process, because the container is restarted every time it stops. This is normal, since the container does not run any daemon, and there is no way to schedule a service. I want my container to be active only when it needs to work (once a day, for at most 1 hour). On the other hand, the metrics that are missing above are correctly reported in CloudWatch when using a service.
Here is my question: what is the most suitable AWS managed service to trigger a container once a day, let it run to do its task, and provide reporting to track the execution (CPU usage, batch duration), including an alarm (SNS) when the task fails?
We had the same issue with identifying failed jobs. I suggest you take a look at AWS Batch, where logs for FAILED jobs are available in CloudWatch Logs; take a look here.
One more thing you should consider is the total cost of ownership of whatever solution you eventually choose. Fargate, in this regard, is quite expensive.
This may be too late for your project, but I thought it could still benefit others.
Have you had a look at AWS Step Functions? It is possible to define a workflow and start tasks on ECS/Fargate (or jobs on EKS for that matter), wait for the results and raise alarms/send emails...
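To make the AWS Batch plus alerting suggestion concrete, here is a hedged sketch of the scheduling and failure-notification side using EventBridge rules (all names, the region, the account id and the ARNs are placeholders, and the Batch job definition wrapping the PHP container is assumed to exist already):

# Run the Batch job once a day at 03:00 UTC
aws events put-rule --name daily-php-batch --schedule-expression "cron(0 3 * * ? *)"
aws events put-targets --rule daily-php-batch --targets '[{
  "Id": "1",
  "Arn": "arn:aws:batch:eu-west-1:123456789012:job-queue/php-batch-queue",
  "RoleArn": "arn:aws:iam::123456789012:role/eventbridge-batch-role",
  "BatchParameters": { "JobDefinition": "php-batch-jobdef", "JobName": "daily-php-batch" }
}]'

# Notify an SNS topic whenever a Batch job ends up FAILED
# (AWS Batch publishes "Batch Job State Change" events to EventBridge)
aws events put-rule --name php-batch-failures --event-pattern '{
  "source": ["aws.batch"],
  "detail-type": ["Batch Job State Change"],
  "detail": { "status": ["FAILED"] }
}'
aws events put-targets --rule php-batch-failures --targets '[{
  "Id": "1",
  "Arn": "arn:aws:sns:eu-west-1:123456789012:php-batch-alerts"
}]'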

How do I upgrade concourse from 3.4.0 to 3.5.0 without causing jobs to abort with state error?

When I upgraded Concourse from 3.4.0 to 3.5.0, all running jobs suddenly changed their state from running to errored. I can now see the string 'no workers' at the start of their logs. Starting the jobs manually, or having them triggered by the next changes, didn't cause any problems.
The upgrade of concourse itself was successful.
I was watching what BOSH did at the time, and I saw that this change of job states took place all at once while either the web or the db VM was being upgraded (I don't know which one). I am pretty sure that the worker VMs had not been touched by BOSH yet.
Is there a way to avoid this behavior?
We have one db, one web VM and six workers.
With only one web VM, it's possible that it was out of service long enough that all the workers expired. Workers heartbeat continuously, and if they miss two heartbeats (which takes 1 minute by default) they'll stall. They should come back after the deploy is finished, but if scheduling happened before they heartbeated again, that would cause those errors.
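One way to confirm this theory during a future upgrade (a sketch; the target and worker names are placeholders, and prune-worker requires a reasonably recent fly) is to watch worker state while the web/db VMs are being updated:

# List workers and their state; look for workers going "stalled" during the deploy
fly -t my-concourse workers

# If a worker stays stalled after the deploy, remove the stale registration so it
# can re-register cleanly
fly -t my-concourse prune-worker --worker my-worker-0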

PowerShell session/environment isolation - Jobs sharing same context?

I'm testing a workflow runbook that utilizes Add-Type to add some custom C# code.
All of a sudden I started getting 'type already exists' errors on subsequent test jobs, as if a new PSSession is not being created.
In other words, it looks like new jobs are sharing the same execution context. I only get this locally if I try to run the same command twice per PS instance.
The type in question is a static class with some Extension methods. Since it also happens to be the first type declared in the source block, I don't doubt other non-static types would throw errors as well.
I've executed this a handful of times already, so I fully expect that 'eventually' this will stop happening, but I can't seem to force it, and I have no idea what I could have done to trip it into this situation either.
Seeing evidence of shared execution contexts across jobs like this (even, or especially, if only temporary) makes me wonder whether some or all of the general execution inconsistencies we've seen in the past, when making and deploying changes and then testing soon after, are related to this.
I'm tempted to think that this is simply part of the difference between a test job and a 'real' one, but that raises questions about the validity of the test jobs themselves with respect to mimicking published jobs.
Are all Azure Automation Jobs supposed to execute in Isolation? Can this be controlled/exploited by a developer?
Each automation account has its own isolated sandboxes where its jobs run. Those sandboxes are distributed among a number of worker machines. For test jobs, in order to improve job start time (since "make a code change, retest" over and over is very common), Automation reuses the sandbox used for previous test jobs of the same runbook if that sandbox has not been cleaned up yet, so that a new sandbox does not have to be spun up for each test job (sandbox creation is one reason job start time can be longer than desired). Because of this behaviour, if you execute test jobs of the same runbook within a short period of time, you will get the behaviour you're seeing above.
However, even production jobs of the same automation account (across runbooks) can share sandboxes. We randomly distribute jobs across our worker machines, so it's possible that job A is queued for execution and placed on worker W, and then 5 minutes later job B is queued for execution and placed on worker W as well. If job A and job B belong to the same automation account and have the same "demands" in terms of modules / module versions, they will be placed in the same sandbox, if job A's sandbox is still around. "Module / module version demands" does not mean the modules used by the runbook, but the modules / latest module versions that existed in the automation account at the time the job was started / the runbook was scheduled (for jobs started via schedule) / the runbook was assigned to a webhook (for jobs started via webhook).
In terms of resolving your specific problem, you could surround Add-Type with a try/catch statement, or maybe use Add-Type -IgnoreWarnings.
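A minimal sketch of that workaround (the type name and the $source variable are placeholders for whatever your runbook actually declares):

# Only compile the helper types if they are not already loaded in this sandbox.
# "MyExtensions" stands in for the first type declared in $source.
if (-not ("MyExtensions" -as [type])) {
    try {
        Add-Type -TypeDefinition $source -Language CSharp
    }
    catch {
        # A reused sandbox can still race us here; treat "type already exists" as non-fatal.
        Write-Verbose "Add-Type failed (types may already be loaded): $_"
    }
}

# Or, as suggested, let Add-Type tolerate existing types:
# Add-Type -TypeDefinition $source -IgnoreWarnings -WarningAction SilentlyContinue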