Airflow UI log error after deploying on Kubernetes with Celery workers

After deploying the Bitnami Helm chart for Airflow on a Kubernetes cluster, everything works except that task logs are unreachable from the UI.
It turns out that the Helm chart used for the deployment sets up a headless service for communication between the Celery workers, and it is not able to show me the logs.
I have set the hostname_callable setting correctly, and yet the logs always pick up the name of the headless service as their hostname rather than the DNS name.
*** Log file does not exist: /opt/bitnami/airflow/logs/secondone/s3files/2020-06-19T10:35:00+00:00/1.log
*** Fetching from: http://mypr-afw-worker-1.mypr-afw-headless.mynamespace.svc.cluster.local:8793/log/secondone/s3files/2020-06-19T10:35:00+00:00/1.log
*** Failed to fetch log file from worker. HTTPConnectionPool(host='mypr-afw-worker-1.mypr-afw-headless.mynamespace.svc.cluster.local', port=8793): Max retries exceeded with url: /log/secondone/s3files/2020-06-19T10:35:00+00:00/1.log (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f12917f5630>: Failed to establish a new connection: [Errno 111] Connection refused',))
Any help in this regard would be appreciated! thanks!

How are you setting the hostname? It seems you need to pass the hosts as an array:
## The list of hostnames to be covered with this ingress record.
## Most likely this will be just one host, but in the event more hosts are needed, this is an array
##
hosts:
  - name: airflow.local
    path: /
Or pass --set ingress.hosts[0].name=airflow.local --set ingress.hosts[0].path=/ to the helm install command.
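Not a fix, but a way to narrow the problem down: "[Errno 111] Connection refused" in the question means the connection to port 8793 was actively rejected, which is a different failure from a name that does not resolve or a connection that times out. A minimal Python sketch, run from the webserver pod, can tell these cases apart. The probe helper is my own name for illustration and is not part of Airflow.

```python
import socket


def probe(host, port, timeout=3.0):
    """Classify a single TCP connection attempt to host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except socket.gaierror:
        # The name could not be resolved at all.
        return "dns-failure"
    except ConnectionRefusedError:
        # The connection was rejected, typically because nothing is listening.
        return "refused"
    except socket.timeout:
        # No answer within the timeout (dropped packets, no route, firewall).
        return "timeout"
    except OSError as exc:
        return "error: {}".format(exc)


# Example call, using the host and port from the log in the question:
# probe("mypr-afw-worker-1.mypr-afw-headless.mynamespace.svc.cluster.local", 8793)
```

If this reports "refused", name resolution is probably not the problem and it is worth checking whether anything on the worker is serving logs on that port.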

Kubernetes unable to create impersonator account timeout

Hosting: Azure CentOS VMs running RKE1
Rancher Version: v2.6.2
Kubernetes Version: 1.18.6
I am looking for help diagnosing this issue. I get two error messages.
From Rancher:
Cluster health check failed: Failed to communicate with API server during namespace check: Get "https://<NODE_IP>:6443/api/v1/namespace/kube-system?timeout=45s": write tcp 172.16.0.2:443 -> <NODE_IP>:7832:i/o timeout
From Kubectl:
unable to create impersonator account: error setting up impersonation for user user-sd7q9: Put "https://<NODE_IP>:6443/apis/rbac.authorization.k8s.io/v1/clusterroles/cattle-impersonation-user-sd7q9": write tcp 172.17.0.2:443-> <NODE_IP>:7832: i/o timeout
Nothing appears to be broken and my applications are still available via ingress.
https://github.com/rancher/rancher/issues/34671

Terraform dial tcp 192.xx.xx.xx:443: i/o timeout error

I am trying to implement CI/CD using GitLab and Terraform against a K8s cluster. The K8s control plane (master node) was set up on CentOS.
However, the pipeline job fails with the following error:
Error: Failed to get existing workspaces: Get "https://192.xx.xx.xx/api/v1/namespaces/default/secrets?labelSelector=tfstate%3Dtrue": dial tcp 192.xx.xx.xx:443: i/o timeout
From the error mentioned above (default/secrets?labelSelector=tfstate%3Dtrue), I assume the error is related to a missing 'terraform secret' in the default namespace.
Example (Terraform secret taken from my Windows machine):
PS C:\> kubectl get secret
NAME                    TYPE                                  DATA   AGE
default-token-7mzv6     kubernetes.io/service-account-token   3      27d
tfstate-default-state   Opaque                                1      15h
However, I am not sure which process would create the 'tfsecret', or whether we should create it manually.
Kindly let me know if my understanding is wrong or if I have missed anything else.
EDIT
The issue mentioned above occurred because the existing GitLab runner was on a different subnet (e.g. 172.xx.xx.xx instead of 192.xx.xx.xx).
I was asked to use a different GitLab runner that runs on the same subnet, and now it throws the following error:
Error: Failed to get existing workspaces: Get "https://192.xx.xx.xx:6443/api/v1/namespaces/default/secrets?labelSelector=tfstate%3Dtrue": x509: certificate signed by unknown authority
Now I am a bit confused about whether the certificate issue is between the GitLab runner and the GitLab server, between the GitLab server and the K8s cluster, or somewhere else.
You have configured Kubernetes as the remote state backend for your Terraform configuration. The error occurs when the backend queries existing secrets to determine which workspaces are configured. The "x509: certificate signed by unknown authority" message indicates that the KUBECONFIG the remote state backend uses does not match the CA of the API server you're connecting to.
If the runners are K8s pods themselves, make sure you provide a KUBECONFIG that matches your target cluster and that the remote state does not configure itself as in-cluster by reading the service account token every K8s pod has - which in most cases will only work for the cluster the pod is running on.
You don't provide enough information to be more specific. But in the big picture, you have to configure the state backend and any provider that connects to K8s. In theory, the state backend secrets and the K8s resources do not have to be on the same cluster, which means you may need different configurations for the state backend and the K8s providers.
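As for the first error in the question (dial tcp ... i/o timeout), the EDIT traces it to the runner sitting on a different subnet from the API server. A rough way to check this from two addresses is sketched below in Python. The addresses and the /24 prefix are made up for illustration, since the real ones are masked in the question, and the same_subnet helper is a hypothetical name. Being on different subnets only matters if there is no route, or a firewall, between them.

```python
import ipaddress


def same_subnet(ip_a, ip_b, prefix_len):
    """Return True if ip_b lies in the network of ip_a with the given prefix length."""
    # strict=False lets us pass a host address instead of a network address.
    network = ipaddress.ip_network("{}/{}".format(ip_a, prefix_len), strict=False)
    return ipaddress.ip_address(ip_b) in network


# Hypothetical values: a runner at 172.16.5.10 and an API server at 192.168.1.20.
runner_ip = "172.16.5.10"
api_server_ip = "192.168.1.20"
print(same_subnet(api_server_ip, runner_ip, 24))
```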

k8s, RabbitMQ, and Peer Discovery

We are trying to run an instance of the RabbitMQ chart with Helm from the helm/charts/stable/rabbit project. I had it running perfectly, but then I had to restart k8s for some maintenance. Now we are completely unable to launch the RabbitMQ chart in any way, shape or form. I am not even setting any variables; I am using just the default values.
Here is all I am doing:
helm install stable/rabbitmq
I have confirmed that the default install runs fine on my local k8s, which I'm running with Docker for Desktop. When we run the rabbit chart on our shared k8s in exactly the same way as on the desktop, and as we did before the restart, the following error is thrown:
Failed to get nodes from k8s - 503
I have also posted an issue on the Helm charts repo on GitHub.
We suspect DNS but have not been able to confirm anything yet. What is very frustrating is that after the restart every other chart we had installed came back up perfectly, except Rabbit, which now will not start at all.
Does anyone know what I could do to get RabbitMQ's peer discovery to work? Has anyone seen an issue like this after restarting k8s?
So I actually got rabbit to run. It turns out my issue was that the k8s peer discovery could not connect over the default port 443, and I had to use the external port 6443, because kubernetes.default.svc.cluster.local resolved to the public port and could not find the internal one. So yes, our config is messed up too.
It took me a while to realize that the variable below was not being overridden when I passed my own values file with helm install . -f server-values.yaml.
rabbitmq:
  configuration: |-
    ## Clustering
    cluster_formation.peer_discovery_backend = rabbit_peer_discovery_k8s
    cluster_formation.k8s.host = kubernetes.default.svc.cluster.local
    cluster_formation.k8s.port = 6443
    cluster_formation.node_cleanup.interval = 10
    cluster_formation.node_cleanup.only_log_warning = true
    cluster_partition_handling = autoheal
    # queue master locator
    queue_master_locator=min-masters
    # enable guest user
    loopback_users.guest = false
I had to add cluster_formation.k8s.port = 6443 to the main values.yaml file instead of my own. Once the port was changed specifically in the values.yaml, rabbit started right up.
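Since the override silently did not apply, it can help to check which port actually ended up in the rendered configuration text, for example copied from the output of helm template. Where exactly the chart stores the rendered text is an assumption on my part. A minimal Python sketch that parses the key = value lines shown above (parse_rabbitmq_conf is a hypothetical helper name):

```python
def parse_rabbitmq_conf(text):
    """Parse 'key = value' lines into a dict, skipping blanks and # comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines without an '=' sign
            settings[key.strip()] = value.strip()
    return settings


# Configuration text copied from the values shown above.
SAMPLE = """
## Clustering
cluster_formation.peer_discovery_backend = rabbit_peer_discovery_k8s
cluster_formation.k8s.host = kubernetes.default.svc.cluster.local
cluster_formation.k8s.port = 6443
# queue master locator
queue_master_locator=min-masters
"""

conf = parse_rabbitmq_conf(SAMPLE)
print(conf.get("cluster_formation.k8s.port"))
```

If the printed port is not the one from your own values file, the override did not reach the chart.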
I'm wondering what the reason is for using the rabbit_peer_discovery_k8s plugin if values.yaml defaults to 1 replica (your manifest file does not override this setting).
I tried to reproduce your issue with the override values you provided (dev-server.yaml), as per the details in your GitHub issue #10811, but I did not quite succeed. Here are my observations:
If I install the RabbitMQ chart with your custom values, my rabbitmq-dev-default-0 pod gets stuck in the CrashLoopBackOff state.
It's quite hard for me to troubleshoot this further, as the Bitnami rabbitmq image containers used by this rabbitmq Helm chart are shipped with a non-root account.
On the other hand, if the rabbitmq chart is installed on my Kubernetes cluster (v1.13.2) in its simplest form:
helm install stable/rabbitmq
I observe a similar issue. That is, the rabbitmq server survives a simulated VM restart of all cluster nodes (including the master), but I cannot connect to it from outside.
After the VM restart, I'm getting the following error from my Python mqclient:
socket.gaierror: [Errno -2] Name or service not known
A few remarks here:
Yes, I did forward the port(s) as per the instructions from the "helm status " command.
The readiness probe works fine:
curl -sS -f --user user:<my_pwd> 127.0.0.1:15672/api/healthchecks/node
{"status":"ok"}
rabbitmqctl to rabbitmq-server connectivity from inside the container works fine too:
kubectl exec rabbitmq-dev-default-0 -- rabbitmqctl list_queues
warning: the VM is running with native name encoding of latin1 which may cause Elixir to malfunction as it expects utf8. Please ensure your locale is set to UTF-8 (which can be verified by running "locale" in your shell)
Timeout: 60.0 seconds ...
Listing queues for vhost / ...
name messages
hello 11
Once I used kubectl port-forward to the pod instead of the service, connectivity to the rabbitmq server was restored:
kubectl port-forward --namespace default pod/rabbitmq-dev-default-0 5672:5672
$ python send.py
[x] Sent 'Hello World!'

Can't send HTTP/HTTPS traffic to a GKE node

I get the error below every time I try to open the HTTP/HTTPS ports on a GKE VM so that I can let traffic through to my nginx ingress. The node in Kubernetes just dies every time I try to make any changes to the firewall configuration for the node.
Editing VM instance gke-*-cluster-pool-1-75f2f99f-r3b0 failed. Error: Invalid value for field 'resource.natIP': '104...254'. The specified external IP address '104...254' was not found in region 'europe-west1'.
I have also tried to use the GCE load balancer ingress without success. It just gives me another error, which seems to say that it can't bootstrap itself:
Error: Server Error
The server encountered a temporary error and could not complete your request.
Please try again in 30 seconds.
I was able to get around the nodes crashing when using the web UI by adding the HTTP tags manually with the gcloud CLI:
gcloud compute instances add-tags [YOUR_INSTANCE_NAME] --tags http-server,https-server

Rancher v1.3.1 Kubernetes Dashboard not working

I am trying to install Rancher v1.3.1 and enable the Kubernetes environment. The install seems OK, but when I navigate to the Dashboard the result is a blank page. I checked, and two deployments, kubernetes-dashboard and tiller-deploy, restart every time with this log:
Error while initializing connection to Kubernetes apiserver. This most likely means that the cluster is misconfigured (e.g., it has invalid apiserver certificates or service accounts configuration) or the --apiserver-host param points to a server that does not exist. Reason: Get https://10.43.0.1:443/version: dial tcp 10.43.0.1:443: i/o timeout
I don't know why. Please help me.
I also don't know why the kubernetes service that exposes 10.43.0.1:443 belongs to a different namespace (default) from the others (kube-system).
Please try switching from using https://10.43.0.1:443 to http://10.43.0.1 by editing the following in the deployment.
args:
  - --auto-generate-certificates
  - --namespace=kubernetes-dashboard
  # Uncomment the following line to manually specify Kubernetes API server Host
  # If not specified, Dashboard will attempt to auto discover the API server and connect
  # to it. Uncomment only if the default does not work.
  - --apiserver-host=http://10.43.0.1