gitlab job pod exits unexpectedly - kubernetes

I currently have gitlab runner deployed in my kubernetes cluster with 2 replicas.
When I run a job in gitlab, the runners are successful in spawning pods that run the pipeline. But in some cases, after the pipeline runs the job, I suddenly get the error
Running after_script
00:00
Uploading artifacts for failed job
00:00
Cleaning up project directory and file based variables
00:00
ERROR: Job failed (system failure): pods "runner-hzfiusrx-project-37057717-concurrent-21gs8vm" not found
When I have a look at the runner logs, all I see is
gitlab-runners-exchange-587cdbf898-pkgt2 | grep "runner-hzfiusrx-project-37057717-concurrent-21gs8vm"
WARNING: Error streaming logs exchange/runner-hzfiusrx-project-37057717-concurrent-21gs8vm/helper:/logs-37057717-2986450184/output.log: command terminated with exit code 137. Retrying... job=2986450184 project=37057717 runner=hzFiusRx
WARNING: Error streaming logs exchange/runner-hzfiusrx-project-37057717-concurrent-21gs8vm/helper:/logs-37057717-2986450184/output.log: pods "runner-hzfiusrx-project-37057717-concurrent-21gs8vm" not found. Retrying... job=2986450184 project=37057717 runner=hzFiusRx
WARNING: Error while executing file based variables removal script error=couldn't get pod details: pods "runner-hzfiusrx-project-37057717-concurrent-21gs8vm" not found job=2986450184 project=37057717 runner=hzFiusRx
ERROR: Job failed (system failure): pods "runner-hzfiusrx-project-37057717-concurrent-21gs8vm" not found duration_s=2067.525269137 job=2986450184 project=37057717 runner=hzFiusRx
WARNING: Failed to process runner builds=32 error=pods "runner-hzfiusrx-project-37057717-concurrent-21gs8vm" not found executor=kubernetes runner=hzFiusRx
Im trying to understand the issue here.
My kubernetes runner config is
[runners.kubernetes]
host = ""
bearer_token_overwrite_allowed = true
image = "ubuntu:20.04"
namespace = "exchange"
namespace_overwrite_allowed = ""
privileged = true
cpu_request = "5"
memory_request = "25Gi"
The nodes on which the job pods get scheduled have the following capacity
Capacity:
attachable-volumes-aws-ebs: 25
cpu: 8
ephemeral-storage: 20959212Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 32523380Ki
pods: 58
So what exactly might be the issue here ? The cpu and memory dimensioning for the nodes seem correct.
looking at the utilization, everything seems good too
So what might be the issue here ? Is it that kubernetes/gitlab is not gracefully killing the job pod ? Or does it need more memory ?

Related

Minikube deployment through pulumi giving etcd timeouts

Kinda lost here, I have a single node minikube instance running Kubernetes v1.23.8 with docker as the driver. I also use pulumi to deploy code-based infrastructure.
Started to see erratic errors when deploying on different resources like etcdserver: request timed out and timed out waiting to be Ready
Tried to debug the api-server in minikube to see if I get more info and got errors like
│ {"level":"warn","ts":"2022-08-09T14:39:44.235Z","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"2.000487575s","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/registry/health\" ","response":"","error":"context deadline exceeded"} │
│ WARNING: 2022/08/09 14:39:44 [core] grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing" │
│ {"level":"warn","ts":"2022-08-09T14:39:46.244Z","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"2.000336072s","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/registry/health\" ","response":"","error":"context deadline exceeded"}
Completely lost on how to move forward troubleshooting this

Airflow Logs in Kubernetes deployment

We currently have airflow deployed in Kubernetes using the Airflow 2.0 & we've also tried with the 2.1 helm chart. When we try to get logs for running DAGS, we get the following:
*** Trying to get logs (last 100 lines) from worker pod dagsentstartpipeline.sdfsdfsfsdfsfsfdsee343443f34f3 ***
Waiting for host: airflow-postgresql.airflow.svc.cluster.local. 5432
[2021-06-30 12:35:52,524] {plugins_manager.py:286} INFO - Loading 1 plugin(s) took 3.23 seconds
[2021-06-30 12:35:53,460] {dagbag.py:440} INFO - Filling up the DagBag from /usr/local/airflow/dags/test_dag.py
Running <TaskInstance: test_dag.start_pipeline 2021-06-30T12:35:40.924151+00:00 [queued]> on host test_dagstartpipeline.70e4739231494c24a3e368
Running <TaskInstance: test_dag.test.start_pipeline 2021-06-30T12:35:40.924151+00:00 [queued]> on host test_dagstartpipeline.70e4739231494c24a3e368
Our DAG fails and we get no indiciation of why it fails because we cannot see the logs. Is there any suggestions of how we can view these logs?
We have tried setting the PODS to not delete after running, and we get the same logs
Thanks,

How to use the Akka sample cluster kubernetes with Scala and minikube?

I am trying to run the akka-sample-cluster-kubernetes-scala as it is recommended to deploy an Akka cluster on to minikube using akka-management-cluster-bootstrap. After run every step recommended on the README file I can see the pods running on my kubectl output:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
appka-8b88f7bdd-485nx 1/1 Running 0 48m
appka-8b88f7bdd-4blrv 1/1 Running 0 48m
appka-8b88f7bdd-7qlc9 1/1 Running 0 48m
When I execute the ./scripts/test.sh it seems to fail on the last step:
"No 3 MemberUp log events found"
And I cannot connect to the given address said on the README file. The error:
$ curl http://127.0.0.1:8558/cluster/members/
curl: (7) Failed to connect to 127.0.0.1 port 8558: Connection refused
From now on I describe how I try to find the reason that I cannot use the sample akka + kubernetes project. I am trying to find the cause of the error above mentioned. I suppose I have to execute sbt run, even it is not mentioned on the sample project. And them I get the following error with respect to the ${REQUIRED_CONTACT_POINT_NR} variable at application.conf:
[error] Exception in thread "main"
com.typesafe.config.ConfigException$UnresolvedSubstitution:
application.conf #
jar:file:/home/felipe/workspace-idea/akka-sample-cluster-kubernetes-scala/target/bg-jobs/sbt_12a05599/job-1/target/d9ddd12d/64fe375d/akka-sample-cluster-kubernetes_2.13-0.0.0-70-49d6a855-20210104-1057.jar!/application.conf:
19: Could not resolve substitution to a value:
${REQUIRED_CONTACT_POINT_NR}
#management-config
akka.management {
cluster.bootstrap {
contact-point-discovery {
# pick the discovery method you'd like to use:
discovery-method = kubernetes-api
required-contact-point-nr = ${REQUIRED_CONTACT_POINT_NR}
}
}
}
#management-config
So, I suppose that it is not getting the configuration from the kubernetes/akka-cluster.yml file: name: REQUIRED_CONTACT_POINT_NR. Changing it to required-contact-point-nr = 3 or 4 I get the error:
[error] SLF4J: A number (4) of logging calls during the initialization phase have been intercepted and are
[error] SLF4J: now being replayed. These are subject to the filtering rules of the underlying logging system.
[error] SLF4J: See also http://www.slf4j.org/codes.html#replay
...
[info] [2021-01-04 11:00:57,373] [INFO] [akka.remote.RemoteActorRefProvider$RemotingTerminator] [] [appka-akka.actor.default-dispatcher-3] - Shutting down remote daemon. MDC: {akkaAddress=akka://appka#127.0.0.1:25520, sourceThread=appka-akka.remote.default-remote-dispatcher-9, akkaSource=akka://appka#127.0.0.1:25520/system/remoting-terminator, sourceActorSystem=appka, akkaTimestamp=10:00:57.373UTC}
[info] [2021-01-04 11:00:57,376] [INFO] [akka.remote.RemoteActorRefProvider$RemotingTerminator] [] [appka-akka.actor.default-dispatcher-3] - Remote daemon shut down; proceeding with flushing remote transports. MDC: {akkaAddress=akka://appka#127.0.0.1:25520, sourceThread=appka-akka.remote.default-remote-dispatcher-9, akkaSource=akka://appka#127.0.0.1:25520/system/remoting-terminator, sourceActorSystem=appka, akkaTimestamp=10:00:57.375UTC}
[info] [2021-01-04 11:00:57,414] [INFO] [akka.remote.RemoteActorRefProvider$RemotingTerminator] [] [appka-akka.actor.default-dispatcher-3] - Remoting shut down. MDC: {akkaAddress=akka://appka#127.0.0.1:25520, sourceThread=appka-akka.remote.default-remote-dispatcher-9, akkaSource=akka://appka#127.0.0.1:25520/system/remoting-terminator, sourceActorSystem=appka, akkaTimestamp=10:00:57.414UTC}
[error] Nonzero exit code returned from runner: 255
[error] (Compile / run) Nonzero exit code returned from runner: 255
[error] Total time: 6 s, completed Jan 4, 2021 11:00:57 AM
You are getting your contact point error because you are trying to use sbt run. sbt run will run a single instance outside of minikube, which isn't what you want. And since it's running outside of Minikube it won't pick up the environment variables being set in the container spec. The scripts do the build/deploy and you should not need to run sbt manually.
Also, the main error is not the connection to 8558, I don't believe that the configuration exposes that admin port outside of minikube.
The fact that all three containers report a status Running indicates that you may actually have a running cluster and the test script may just be missing the messages in the logs. As others have said in comments, the logs from one of the containers would be helpful in determining whether you have a working cluster, and diagnosing any problems in cluster formation.
The answer and the comments on my questions are right. I don't need to run sbt run to visualize the output at the web browser. What I was missing is to port-forward the current port of the cluster nodes to the output port. This is not specified on the akka-sample-cluster-kubernetes-scala, but I believe that is because they run directly on the Google Kubernetes Engine platform and not inside minikube first.
$ kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
appka-1 appka-8b88f7bdd-97zfl 1/1 Running 0 9m10s
appka-1 appka-8b88f7bdd-bhv44 1/1 Running 0 9m10s
appka-1 appka-8b88f7bdd-ff76s 1/1 Running 0 9m10s
$ #### THIS COMMAND IS NOT IN THE README FILE OF THE DEMO ####
$ kubectl port-forward appka-8b88f7bdd-ff76s 8080
Forwarding from 127.0.0.1:8080 -> 8080
Forwarding from [::1]:8080 -> 8080
now I can see the output:
$ http GET http://127.0.0.1:8080/
HTTP/1.1 200 OK
Content-Length: 11
Content-Type: text/plain; charset=UTF-8
Date: Wed, 06 Jan 2021 10:59:40 GMT
Server: akka-http/10.2.1
Hello world

Cassandra pod fails after kubernetes node restart

I have successfully installed dse in my kubernetes environment using the Kubernetes Operator instructions:
With nodetool I checked that all pod successfully joined the ring
The problem is that when I reboot one of the kubernetes node the cassandra pod that was running on that node never recover:
[root#node1 ~]# kubectl exec -it -n cassandra cluster1-dc1-r2-sts-0 -c cassandra nodetool status
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving/Stopped
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.244.166.132 153.82 KiB 1 77.9% 053cc18e-397c-4abe-bb1b-d48a3fef3c93 r3
DS 10.244.104.1 136.09 KiB 1 26.9% 8ae31e1c-856e-44a8-b081-c5c040b535b9 r1
UN 10.244.135.2 202.8 KiB 1 95.2% 06200794-298c-4122-b8ff-4239bc7a8ded r2
[root#node1 ~]# kubectl get pods -n cassandra
NAME READY STATUS RESTARTS AGE
cass-operator-56f5f8c7c-w6l2c 1/1 Running 0 17h
cluster1-dc1-r1-sts-0 1/2 Running 2 17h
cluster1-dc1-r2-sts-0 2/2 Running 0 17h
cluster1-dc1-r3-sts-0 2/2 Running 0 17h
I have looked into the logs but I can't figure out what is the problem.
The "kubectl logs"" command return the logs below:
INFO [nioEventLoopGroup-2-1] 2020-03-25 12:13:13,536 Cli.java:555 - address=/192.168.0.11:38590 url=/api/v0/probes/liveness status=200 OK
INFO [epollEventLoopGroup-6506-1] 2020-03-25 12:13:14,110 Clock.java:35 - Could not access native clock (see debug logs for details), falling back to Java system clock
WARN [epollEventLoopGroup-6506-2] 2020-03-25 12:13:14,111 Slf4JLogger.java:146 - Unknown channel option 'TCP_NODELAY' for channel '[id: 0x8a898bf3]'
WARN [epollEventLoopGroup-6506-2] 2020-03-25 12:13:14,116 Loggers.java:28 - [s6501] Error connecting to /tmp/dse.sock, trying next node
java.io.FileNotFoundException: null
at io.netty.channel.unix.Errors.throwConnectException(Errors.java:110)
at io.netty.channel.unix.Socket.connect(Socket.java:257)
at io.netty.channel.epoll.AbstractEpollChannel.doConnect0(AbstractEpollChannel.java:732)
at io.netty.channel.epoll.AbstractEpollChannel.doConnect(AbstractEpollChannel.java:717)
at io.netty.channel.epoll.EpollDomainSocketChannel.doConnect(EpollDomainSocketChannel.java:87)
at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.connect(AbstractEpollChannel.java:559)
at io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1366)
at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:545)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:530)
at io.netty.channel.ChannelOutboundHandlerAdapter.connect(ChannelOutboundHandlerAdapter.java:47)
at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:545)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:530)
at io.netty.channel.ChannelDuplexHandler.connect(ChannelDuplexHandler.java:50)
at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:545)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:530)
at io.netty.channel.ChannelDuplexHandler.connect(ChannelDuplexHandler.java:50)
at com.datastax.oss.driver.internal.core.channel.ConnectInitHandler.connect(ConnectInitHandler.java:57)
at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:545)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:530)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:512)
at io.netty.channel.DefaultChannelPipeline.connect(DefaultChannelPipeline.java:1024)
at io.netty.channel.AbstractChannel.connect(AbstractChannel.java:276)
at io.netty.bootstrap.Bootstrap$3.run(Bootstrap.java:252)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:375)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
INFO [nioEventLoopGroup-2-2] 2020-03-25 12:13:14,118 Cli.java:555 - address=/192.168.0.11:38592 url=/api/v0/probes/readiness status=500 Internal Server Error
The error java.io.FileNotFoundException: null appears also when cassandra starts successfully.
So what remains is the error:
address=/192.168.0.11:38592 url=/api/v0/probes/readiness status=500 Internal Server Error
Which doesn't say much to me.
The "kubectl describe" shows the following
Warning Unhealthy 4m41s (x6535 over 18h) kubelet, node2 Readiness probe failed: HTTP probe failed with statuscode: 500
In the cassandra container only this process is running:
java -Xms128m -Xmx128m -jar /opt/dse/resources/management-api/management-api-6.8.0.20200316-LABS-all.jar --dse-socket /tmp/dse.sock --host tcp://0.0.0.0```
And in the /var/log/cassandra/system.log I can't point out any error
Andrea, the error "java.io.FileNotFoundException: null" is a harmless message about a transient error during the Cassandra pod starting up and healthcheck.
I was able to reproduce the issue you ran into. If you run kubectl get pods you should see the affected pod showing 1/2 under "READY" column, this means the Cassandra container was not brought up in the auto-restarted pod. Only the management API container is running. I suspect this is a bug in the operator and I'll work with the developers to sort it out.
As a workaround you can run kubectl delete pod/<pod_name> to recover your Cassandra cluster back to a normal state (in your case kubectl delete pod/cluster1-dc1-r1-sts-0). This will redeploy the pod and remount the data volume automatically, without losing anything.
I got this error when CoreDNS pods were not running on the node, on which I had started Cassandra. The DNS resolutions were not working properly. So, debugging network connectivity may help.

kubelet saying node "master01" not found

I try to stack up my kubeadm cluster with three masters. I receive this problem from my init command...
[kubelet-check] Initial timeout of 40s passed.
Unfortunately, an error has occurred:
timed out waiting for the condition
This error is likely caused by:
- The kubelet is not running
- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)
If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
- 'systemctl status kubelet'
- 'journalctl -xeu kubelet'
Additionally, a control plane component may have crashed or exited when started by the container runtime.
To troubleshoot, list all containers using your preferred container runtimes CLI, e.g. docker.
Here is one example how you may list all Kubernetes containers running in docker:
- 'docker ps -a | grep kube | grep -v pause'
Once you have found the failing container, you can inspect its logs with:
- 'docker logs CONTAINERID'
error execution phase wait-control-plane: couldn't initialize a Kubernetes cluster
But I do not use no cgroupfs but systemd
And my kubelet complain for not knowing his nodename.
Jan 23 14:54:12 master01 kubelet[5620]: E0123 14:54:12.251885 5620 kubelet.go:2266] node "master01" not found
Jan 23 14:54:12 master01 kubelet[5620]: E0123 14:54:12.352932 5620 kubelet.go:2266] node "master01" not found
Jan 23 14:54:12 master01 kubelet[5620]: E0123 14:54:12.453895 5620 kubelet.go:2266] node "master01" not found
Please let me know where is the issue.
The issue can be because of docker version, as docker version < 18.6 is supported in latest kubernetes version i.e. v1.13.xx.
Actually I also got the same issue but it get resolved after downgrading the docker version from 18.9 to 18.6.
If the problem is not related to Docker it might be because the Kubelet service failed to establish connection to API server.
I would first of all check the status of Kubelet: systemctl status kubelet and consider restarting with systemctl restart kubelet.
If this doesn't help try re-installing kubeadm or running kubeadm init with other version (use the --kubernetes-version=X.Y.Z flag).
In my case,my k8s version is 1.21.1 and my docker version is 19.03. I solved this bug by upgrading docker to version 20.7.