Airflow Logs in Kubernetes deployment

We currently have Airflow deployed in Kubernetes using the Airflow 2.0 Helm chart (we've also tried the 2.1 chart). When we try to get logs for running DAGs, we get the following:
*** Trying to get logs (last 100 lines) from worker pod dagsentstartpipeline.sdfsdfsfsdfsfsfdsee343443f34f3 ***
Waiting for host: airflow-postgresql.airflow.svc.cluster.local. 5432
[2021-06-30 12:35:52,524] {plugins_manager.py:286} INFO - Loading 1 plugin(s) took 3.23 seconds
[2021-06-30 12:35:53,460] {dagbag.py:440} INFO - Filling up the DagBag from /usr/local/airflow/dags/test_dag.py
Running <TaskInstance: test_dag.start_pipeline 2021-06-30T12:35:40.924151+00:00 [queued]> on host test_dagstartpipeline.70e4739231494c24a3e368
Running <TaskInstance: test_dag.test.start_pipeline 2021-06-30T12:35:40.924151+00:00 [queued]> on host test_dagstartpipeline.70e4739231494c24a3e368
Our DAG fails, and we get no indication of why it fails because we cannot see the logs. Are there any suggestions for how we can view these logs?
We have tried setting the worker pods not to be deleted after running, and we still get the same logs.
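For reference, this is roughly how we disabled worker pod deletion (a sketch; it assumes the chart passes extra environment variables from values.yaml through to the Airflow config):
# values.yaml (sketch, exact key names depend on the chart version)
env:
  - name: AIRFLOW__KUBERNETES__DELETE_WORKER_PODS
    value: "False"
# equivalent to airflow.cfg:
#   [kubernetes]
#   delete_worker_pods = False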
Thanks,

Related

gitlab job pod exits unexpectedly

I currently have GitLab Runner deployed in my Kubernetes cluster with 2 replicas.
When I run a job in GitLab, the runners successfully spawn pods that run the pipeline. But in some cases, after the pipeline runs the job, I suddenly get this error:
Running after_script
00:00
Uploading artifacts for failed job
00:00
Cleaning up project directory and file based variables
00:00
ERROR: Job failed (system failure): pods "runner-hzfiusrx-project-37057717-concurrent-21gs8vm" not found
When I look at the runner logs, all I see is:
gitlab-runners-exchange-587cdbf898-pkgt2 | grep "runner-hzfiusrx-project-37057717-concurrent-21gs8vm"
WARNING: Error streaming logs exchange/runner-hzfiusrx-project-37057717-concurrent-21gs8vm/helper:/logs-37057717-2986450184/output.log: command terminated with exit code 137. Retrying... job=2986450184 project=37057717 runner=hzFiusRx
WARNING: Error streaming logs exchange/runner-hzfiusrx-project-37057717-concurrent-21gs8vm/helper:/logs-37057717-2986450184/output.log: pods "runner-hzfiusrx-project-37057717-concurrent-21gs8vm" not found. Retrying... job=2986450184 project=37057717 runner=hzFiusRx
WARNING: Error while executing file based variables removal script error=couldn't get pod details: pods "runner-hzfiusrx-project-37057717-concurrent-21gs8vm" not found job=2986450184 project=37057717 runner=hzFiusRx
ERROR: Job failed (system failure): pods "runner-hzfiusrx-project-37057717-concurrent-21gs8vm" not found duration_s=2067.525269137 job=2986450184 project=37057717 runner=hzFiusRx
WARNING: Failed to process runner builds=32 error=pods "runner-hzfiusrx-project-37057717-concurrent-21gs8vm" not found executor=kubernetes runner=hzFiusRx
I'm trying to understand the issue here.
My Kubernetes runner config is:
[runners.kubernetes]
host = ""
bearer_token_overwrite_allowed = true
image = "ubuntu:20.04"
namespace = "exchange"
namespace_overwrite_allowed = ""
privileged = true
cpu_request = "5"
memory_request = "25Gi"
The nodes on which the job pods get scheduled have the following capacity
Capacity:
attachable-volumes-aws-ebs: 25
cpu: 8
ephemeral-storage: 20959212Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 32523380Ki
pods: 58
So what exactly might be the issue here? The CPU and memory dimensioning for the nodes seems correct.
Looking at the utilization, everything seems good too.
So what might be the issue here? Is it that Kubernetes/GitLab is not gracefully killing the job pod? Or does it need more memory?
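For what it's worth, exit code 137 usually means the container received SIGKILL, typically from the OOM killer or a node eviction. Since the job pod itself is already gone ("not found"), what is left to inspect are the namespace events and the node state (a sketch; the node name is a placeholder):
# scan recent events for OOM kills, evictions or forced deletions
kubectl -n exchange get events --sort-by=.lastTimestamp | grep -iE 'oom|evict|kill'
# check pressure conditions and allocated resources on the node the job ran on
kubectl describe node <node-name> | grep -i pressure
kubectl describe node <node-name> | grep -i -A5 'allocated resources'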

How to use the Akka sample cluster kubernetes with Scala and minikube?

I am trying to run akka-sample-cluster-kubernetes-scala, as it is recommended for deploying an Akka cluster onto minikube using akka-management-cluster-bootstrap. After running every step recommended in the README file, I can see the pods running in my kubectl output:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
appka-8b88f7bdd-485nx 1/1 Running 0 48m
appka-8b88f7bdd-4blrv 1/1 Running 0 48m
appka-8b88f7bdd-7qlc9 1/1 Running 0 48m
When I execute ./scripts/test.sh, it seems to fail on the last step:
"No 3 MemberUp log events found"
And I cannot connect to the address given in the README file. The error:
$ curl http://127.0.0.1:8558/cluster/members/
curl: (7) Failed to connect to 127.0.0.1 port 8558: Connection refused
From here on I describe how I tried to find the reason I cannot use the sample Akka + Kubernetes project. Trying to find the cause of the error mentioned above, I supposed I had to execute sbt run, even though it is not mentioned in the sample project. I then get the following error with respect to the ${REQUIRED_CONTACT_POINT_NR} variable in application.conf:
[error] Exception in thread "main"
com.typesafe.config.ConfigException$UnresolvedSubstitution:
application.conf #
jar:file:/home/felipe/workspace-idea/akka-sample-cluster-kubernetes-scala/target/bg-jobs/sbt_12a05599/job-1/target/d9ddd12d/64fe375d/akka-sample-cluster-kubernetes_2.13-0.0.0-70-49d6a855-20210104-1057.jar!/application.conf:
19: Could not resolve substitution to a value:
${REQUIRED_CONTACT_POINT_NR}
#management-config
akka.management {
cluster.bootstrap {
contact-point-discovery {
# pick the discovery method you'd like to use:
discovery-method = kubernetes-api
required-contact-point-nr = ${REQUIRED_CONTACT_POINT_NR}
}
}
}
#management-config
So I suppose it is not getting the configuration from the kubernetes/akka-cluster.yml file (name: REQUIRED_CONTACT_POINT_NR). Changing it to required-contact-point-nr = 3 or 4, I get the error:
[error] SLF4J: A number (4) of logging calls during the initialization phase have been intercepted and are
[error] SLF4J: now being replayed. These are subject to the filtering rules of the underlying logging system.
[error] SLF4J: See also http://www.slf4j.org/codes.html#replay
...
[info] [2021-01-04 11:00:57,373] [INFO] [akka.remote.RemoteActorRefProvider$RemotingTerminator] [] [appka-akka.actor.default-dispatcher-3] - Shutting down remote daemon. MDC: {akkaAddress=akka://appka@127.0.0.1:25520, sourceThread=appka-akka.remote.default-remote-dispatcher-9, akkaSource=akka://appka@127.0.0.1:25520/system/remoting-terminator, sourceActorSystem=appka, akkaTimestamp=10:00:57.373UTC}
[info] [2021-01-04 11:00:57,376] [INFO] [akka.remote.RemoteActorRefProvider$RemotingTerminator] [] [appka-akka.actor.default-dispatcher-3] - Remote daemon shut down; proceeding with flushing remote transports. MDC: {akkaAddress=akka://appka@127.0.0.1:25520, sourceThread=appka-akka.remote.default-remote-dispatcher-9, akkaSource=akka://appka@127.0.0.1:25520/system/remoting-terminator, sourceActorSystem=appka, akkaTimestamp=10:00:57.375UTC}
[info] [2021-01-04 11:00:57,414] [INFO] [akka.remote.RemoteActorRefProvider$RemotingTerminator] [] [appka-akka.actor.default-dispatcher-3] - Remoting shut down. MDC: {akkaAddress=akka://appka@127.0.0.1:25520, sourceThread=appka-akka.remote.default-remote-dispatcher-9, akkaSource=akka://appka@127.0.0.1:25520/system/remoting-terminator, sourceActorSystem=appka, akkaTimestamp=10:00:57.414UTC}
[error] Nonzero exit code returned from runner: 255
[error] (Compile / run) Nonzero exit code returned from runner: 255
[error] Total time: 6 s, completed Jan 4, 2021 11:00:57 AM
You are getting your contact point error because you are trying to use sbt run. sbt run will run a single instance outside of minikube, which isn't what you want. And since it's running outside of Minikube it won't pick up the environment variables being set in the container spec. The scripts do the build/deploy and you should not need to run sbt manually.
Also, the main error is not the connection to 8558; I don't believe the configuration exposes that admin port outside of minikube.
The fact that all three containers report a status Running indicates that you may actually have a running cluster and the test script may just be missing the messages in the logs. As others have said in comments, the logs from one of the containers would be helpful in determining whether you have a working cluster, and diagnosing any problems in cluster formation.
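For illustration, the environment variable in question is set in the container spec, roughly like this (a sketch based on the kubernetes/akka-cluster.yml referenced above; the exact structure and value may differ):
# kubernetes/akka-cluster.yml (excerpt, sketch)
containers:
  - name: appka
    env:
      - name: REQUIRED_CONTACT_POINT_NR
        value: "3"   # resolved only inside the pod, never by a local sbt run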
The answer and the comments on my question are right. I don't need to run sbt run to visualize the output in the web browser. What I was missing was port-forwarding the port of one of the cluster pods to a local port. This is not specified in akka-sample-cluster-kubernetes-scala, but I believe that is because they run it directly on the Google Kubernetes Engine platform and not inside minikube first.
$ kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
appka-1 appka-8b88f7bdd-97zfl 1/1 Running 0 9m10s
appka-1 appka-8b88f7bdd-bhv44 1/1 Running 0 9m10s
appka-1 appka-8b88f7bdd-ff76s 1/1 Running 0 9m10s
$ #### THIS COMMAND IS NOT IN THE README FILE OF THE DEMO ####
$ kubectl port-forward appka-8b88f7bdd-ff76s 8080
Forwarding from 127.0.0.1:8080 -> 8080
Forwarding from [::1]:8080 -> 8080
Now I can see the output:
$ http GET http://127.0.0.1:8080/
HTTP/1.1 200 OK
Content-Length: 11
Content-Type: text/plain; charset=UTF-8
Date: Wed, 06 Jan 2021 10:59:40 GMT
Server: akka-http/10.2.1
Hello world
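Similarly, the cluster membership endpoint from the README can be reached by port-forwarding the management port as well (a sketch; it assumes the management port 8558 is exposed on the pod, and you may need -n appka-1 if that namespace is not your current context's default):
$ kubectl port-forward appka-8b88f7bdd-ff76s 8558
$ curl http://127.0.0.1:8558/cluster/members/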

Cassandra pod fails after kubernetes node restart

I have successfully installed DSE in my Kubernetes environment using the Kubernetes Operator instructions.
With nodetool I checked that all pods successfully joined the ring.
The problem is that when I reboot one of the Kubernetes nodes, the Cassandra pod that was running on that node never recovers:
[root@node1 ~]# kubectl exec -it -n cassandra cluster1-dc1-r2-sts-0 -c cassandra nodetool status
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving/Stopped
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.244.166.132 153.82 KiB 1 77.9% 053cc18e-397c-4abe-bb1b-d48a3fef3c93 r3
DS 10.244.104.1 136.09 KiB 1 26.9% 8ae31e1c-856e-44a8-b081-c5c040b535b9 r1
UN 10.244.135.2 202.8 KiB 1 95.2% 06200794-298c-4122-b8ff-4239bc7a8ded r2
[root@node1 ~]# kubectl get pods -n cassandra
NAME READY STATUS RESTARTS AGE
cass-operator-56f5f8c7c-w6l2c 1/1 Running 0 17h
cluster1-dc1-r1-sts-0 1/2 Running 2 17h
cluster1-dc1-r2-sts-0 2/2 Running 0 17h
cluster1-dc1-r3-sts-0 2/2 Running 0 17h
I have looked into the logs, but I can't figure out what the problem is.
The "kubectl logs" command returns the logs below:
INFO [nioEventLoopGroup-2-1] 2020-03-25 12:13:13,536 Cli.java:555 - address=/192.168.0.11:38590 url=/api/v0/probes/liveness status=200 OK
INFO [epollEventLoopGroup-6506-1] 2020-03-25 12:13:14,110 Clock.java:35 - Could not access native clock (see debug logs for details), falling back to Java system clock
WARN [epollEventLoopGroup-6506-2] 2020-03-25 12:13:14,111 Slf4JLogger.java:146 - Unknown channel option 'TCP_NODELAY' for channel '[id: 0x8a898bf3]'
WARN [epollEventLoopGroup-6506-2] 2020-03-25 12:13:14,116 Loggers.java:28 - [s6501] Error connecting to /tmp/dse.sock, trying next node
java.io.FileNotFoundException: null
at io.netty.channel.unix.Errors.throwConnectException(Errors.java:110)
at io.netty.channel.unix.Socket.connect(Socket.java:257)
at io.netty.channel.epoll.AbstractEpollChannel.doConnect0(AbstractEpollChannel.java:732)
at io.netty.channel.epoll.AbstractEpollChannel.doConnect(AbstractEpollChannel.java:717)
at io.netty.channel.epoll.EpollDomainSocketChannel.doConnect(EpollDomainSocketChannel.java:87)
at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.connect(AbstractEpollChannel.java:559)
at io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1366)
at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:545)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:530)
at io.netty.channel.ChannelOutboundHandlerAdapter.connect(ChannelOutboundHandlerAdapter.java:47)
at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:545)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:530)
at io.netty.channel.ChannelDuplexHandler.connect(ChannelDuplexHandler.java:50)
at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:545)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:530)
at io.netty.channel.ChannelDuplexHandler.connect(ChannelDuplexHandler.java:50)
at com.datastax.oss.driver.internal.core.channel.ConnectInitHandler.connect(ConnectInitHandler.java:57)
at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:545)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:530)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:512)
at io.netty.channel.DefaultChannelPipeline.connect(DefaultChannelPipeline.java:1024)
at io.netty.channel.AbstractChannel.connect(AbstractChannel.java:276)
at io.netty.bootstrap.Bootstrap$3.run(Bootstrap.java:252)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:375)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
INFO [nioEventLoopGroup-2-2] 2020-03-25 12:13:14,118 Cli.java:555 - address=/192.168.0.11:38592 url=/api/v0/probes/readiness status=500 Internal Server Error
The error java.io.FileNotFoundException: null also appears when Cassandra starts successfully.
So what remains is the error:
address=/192.168.0.11:38592 url=/api/v0/probes/readiness status=500 Internal Server Error
which doesn't say much to me.
The "kubectl describe" shows the following
Warning Unhealthy 4m41s (x6535 over 18h) kubelet, node2 Readiness probe failed: HTTP probe failed with statuscode: 500
In the Cassandra container, only this process is running:
java -Xms128m -Xmx128m -jar /opt/dse/resources/management-api/management-api-6.8.0.20200316-LABS-all.jar --dse-socket /tmp/dse.sock --host tcp://0.0.0.0
And in /var/log/cassandra/system.log I can't find any error.
Andrea, the error "java.io.FileNotFoundException: null" is a harmless message about a transient error during Cassandra pod startup and health checking.
I was able to reproduce the issue you ran into. If you run kubectl get pods you should see the affected pod showing 1/2 under "READY" column, this means the Cassandra container was not brought up in the auto-restarted pod. Only the management API container is running. I suspect this is a bug in the operator and I'll work with the developers to sort it out.
As a workaround you can run kubectl delete pod/<pod_name> to recover your Cassandra cluster back to a normal state (in your case kubectl delete pod/cluster1-dc1-r1-sts-0). This will redeploy the pod and remount the data volume automatically, without losing anything.
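In this case that would look roughly like the following (a sketch, using the namespace shown earlier in the question):
kubectl -n cassandra delete pod cluster1-dc1-r1-sts-0
# watch the pod get recreated by its StatefulSet and come back 2/2 Ready
kubectl -n cassandra get pods -w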
I got this error when the CoreDNS pods were not running on the node on which I had started Cassandra. DNS resolution was not working properly, so debugging network connectivity may help.
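For example, a quick way to check cluster DNS health (a sketch; the test pod name and image are illustrative):
# check that CoreDNS pods are running and where they are scheduled
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide
# test in-cluster DNS resolution from a throwaway pod (busybox:1.28 ships nslookup)
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.28 -- nslookup kubernetes.default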

kubelet saying node "master01" not found

I am trying to stand up my kubeadm cluster with three masters. I receive this problem from my init command:
[kubelet-check] Initial timeout of 40s passed.
Unfortunately, an error has occurred:
timed out waiting for the condition
This error is likely caused by:
- The kubelet is not running
- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)
If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
- 'systemctl status kubelet'
- 'journalctl -xeu kubelet'
Additionally, a control plane component may have crashed or exited when started by the container runtime.
To troubleshoot, list all containers using your preferred container runtimes CLI, e.g. docker.
Here is one example how you may list all Kubernetes containers running in docker:
- 'docker ps -a | grep kube | grep -v pause'
Once you have found the failing container, you can inspect its logs with:
- 'docker logs CONTAINERID'
error execution phase wait-control-plane: couldn't initialize a Kubernetes cluster
But I do not use cgroupfs; I use systemd.
And my kubelet complains about not knowing its node name.
Jan 23 14:54:12 master01 kubelet[5620]: E0123 14:54:12.251885 5620 kubelet.go:2266] node "master01" not found
Jan 23 14:54:12 master01 kubelet[5620]: E0123 14:54:12.352932 5620 kubelet.go:2266] node "master01" not found
Jan 23 14:54:12 master01 kubelet[5620]: E0123 14:54:12.453895 5620 kubelet.go:2266] node "master01" not found
Please let me know where the issue is.
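For context, this is how I checked that Docker and the kubelet agree on the systemd cgroup driver (a sketch, assuming Docker as the container runtime):
# Docker's cgroup driver; /etc/docker/daemon.json should contain "exec-opts": ["native.cgroupdriver=systemd"]
docker info --format '{{ .CgroupDriver }}'
# kubelet's cgroup driver: look for --cgroup-driver or cgroupDriver in its config
systemctl cat kubelet
grep -i cgroup /var/lib/kubelet/config.yaml /var/lib/kubelet/kubeadm-flags.env 2>/dev/null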
The issue can be caused by the Docker version, as only Docker versions up to 18.06 are supported by the latest Kubernetes version, i.e. v1.13.x.
I actually got the same issue, but it was resolved after downgrading Docker from 18.9 to 18.6.
If the problem is not related to Docker it might be because the Kubelet service failed to establish connection to API server.
I would first of all check the status of Kubelet: systemctl status kubelet and consider restarting with systemctl restart kubelet.
If this doesn't help, try re-installing kubeadm or running kubeadm init with another version (use the --kubernetes-version=X.Y.Z flag).
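For example (a sketch; the version number is illustrative):
kubeadm reset -f
kubeadm init --kubernetes-version=1.13.12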
In my case, my k8s version was 1.21.1 and my Docker version was 19.03. I solved this bug by upgrading Docker to version 20.7.

Flink: HA mode killing leading jobmanager terminating standby jobmanagers

I am trying to get Flink to run in HA mode using Zookeeper, but when I try to test it by killing the leader JobManager all my standby jobmanagers get killed too.
So instead of a standby jobmanager taking over as the new Leader, they all get killed which isn't supposed to happen.
My setup:
4 servers, 3 of those servers have Zookeeper running, but only 1 server will host all the JobManagers.
ad011.local: Zookeeper + Jobmanagers
ad012.local: Zookeeper + Taskmanager
ad013.local: Zookeeper
ad014.local: nothing interesting
My masters file looks like this:
ad011.local:8081
ad011.local:8082
ad011.local:8083
My flink-conf.yaml:
jobmanager.rpc.address: ad011.local
blob.server.port: 6130,6131,6132
jobmanager.heap.mb: 512
taskmanager.heap.mb: 128
taskmanager.numberOfTaskSlots: 4
parallelism.default: 2
taskmanager.tmp.dirs: /var/flink/data
metrics.reporters: jmx
metrics.reporter.jmx.class: org.apache.flink.metrics.jmx.JMXReporter
metrics.reporter.jmx.port: 8789,8790,8791
high-availability: zookeeper
high-availability.zookeeper.quorum: ad011.local:2181,ad012.local:2181,ad013.local:2181
high-availability.zookeeper.path.root: /flink
high-availability.zookeeper.path.cluster-id: /cluster-one
high-availability.storageDir: /var/flink/recovery
high-availability.jobmanager.port: 50000,50001,50002
When I run Flink using the start-cluster.sh script, I see my 3 JobManagers running, and going to the WebUI they all point to ad011.local:8081, which is the leader. Which is okay, I guess?
I then try to test the failover by killing the leader using kill, and then all my other standby JobManagers stop too.
This is what I see in my standby JobManager logs:
2017-09-29 08:08:41,590 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager at akka.tcp://flink@ad011.local:50002/user/jobmanager.
2017-09-29 08:08:41,590 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Starting ZooKeeperLeaderElectionService org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService@72d546c8.
2017-09-29 08:08:41,598 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Starting with JobManager akka.tcp://flink@ad011.local:50002/user/jobmanager on port 8083
2017-09-29 08:08:41,598 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService.
2017-09-29 08:08:41,645 INFO org.apache.flink.runtime.webmonitor.JobManagerRetriever - New leader reachable under akka.tcp://flink@ad011.local:50000/user/jobmanager:f7dc2c48-dfa5-45a4-a63e-ff27be21363a.
2017-09-29 08:08:41,651 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService.
2017-09-29 08:08:41,722 INFO org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager - Received leader address but not running in leader ActorSystem. Cancelling registration.
2017-09-29 09:26:13,472 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@ad011.local:50000] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
2017-09-29 09:26:14,274 INFO org.apache.flink.runtime.jobmanager.JobManager - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
2017-09-29 09:26:14,284 INFO org.apache.flink.runtime.blob.BlobServer - Stopped BLOB server at 0.0.0.0:6132
Any help would be appreciated.
Solved it by running my cluster using ./bin/start-cluster.sh instead of using service files (which call the same script); apparently the service files kill the other JobManagers.
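If per-JobManager service units are still desired, a sketch of the idea is to have each unit start and stop only its own instance via bin/jobmanager.sh rather than the cluster-wide scripts:
# start a single JobManager instance on this host (sketch)
./bin/jobmanager.sh start
# stop only that instance, instead of calling stop-cluster.sh, which acts on every master listed in conf/masters
./bin/jobmanager.sh stop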