Gitlab Runner is not able to resolve DNS of Gitlab Server - kubernetes

I'm facing a pretty strange Problem.
First of all my setup:
I got a private Gitlab server which uses Gitlab CI Runners on Kubernetes to build Docker Images. For that purpose I use the Kaniko Image.
The Runners are provisioned by Gitlab itself with the built-in Kubernetes management. All that is running behind a PFSense server.
Now to my problem:
Sometimes the Kaniko Pods can't resolve the Hostname of the GitLab server.
This leads to failed git pull and so to a failed build.
I would rate the chance to fail by 60%, which is way too high for us.
After retrying the build a few times, it will run without any problem.
The Kubernetes Cluster running the Gitlab CI is setup on CentOS 7.
SELinux and FirewallD are disabled. All of the Hosts can resolve the GitLab Server. It is also not related to a specific Host Server, which is causing the problem. I have seen it fail on all of the 5 Servers including the Manager Server. Also I haven't seen this problem appear in other Pods. But the other Deployments in the cluster dont really do connections via DNS. I am sure that the Runner is able to access DNS at all, because it is pulling the Kaniko Image from gcr.io.
Has anyone ever seen this problem or knows a workaround?
I have already tried spawning Pods that only do DNS requests to the Domain. I didn't see a single fail.
Also I tried to Reboot the whole Cluster and Gitlab instance.
I tried to do a static overwrite of the DNS route in PFSense. Still same problem.
Here is my CI config:
build:
stage: build
image:
name: gcr.io/kaniko-project/executor:debug
entrypoint: [""]
script:
- echo $REGISTRY_AUTH > /kaniko/.docker/config.json
- /kaniko/executor --context $CI_PROJECT_DIR --dockerfile $CI_PROJECT_DIR/Dockerfile --destination $REGISTRY_URL/$REGISTRY_IMAGE:$CI_JOB_ID
only:
- master
The following error happens:
Initialized empty Git repository in /builds/MYPROJECT/.git/
Fetching changes...
Created fresh repository.
fatal: unable to access 'https://gitlab-ci-token:[MASKED]#git.mydomain.com/MYPROJECT.git/': Could not resolve host: git.mydomain.com

We had same issue for couple of days. We tried change CoreDNS config, move runners to different k8s cluster and so on. Finally today i checked my personal runner and found that i'm using different version. Runners in cluster had gitlab/gitlab-runner:alpine-v12.3.0, when mine had gitlab/gitlab-runner:alpine-v12.0.1. We added line
image: gitlab/gitlab-runner:alpine-v12.1.0
in values.yaml and this solved problem for us

There are a env for gitlab-runner that can solve this problem
- name: RUNNER_PRE_CLONE_SCRIPT
value: "exec command before git fetch ..."
for example:
edit /etc/hosts
echo '127.0.0.1 git.demo.xxxx' >> /etc/hosts
or edit /etc/resolv.conf
echo 'nameserver 8.8.8.8' > /etc/resolv.conf
hope it works for you

Related

Is it possible to use cloud code extension in vscode to deploy kubernetes pods on a non-GKE cluster?

This is my very first post here and looking for some advise please.
I am learning Kubernetes and trying to get cloud code extension to deploy Kubernetes manifests on non-GKE cluster. Guestbook app can be deployed using cloud code extension to local K8 cluster(such as MiniKube or Docker-for-Desktop).
I have two other K8 clusters as below and I cannot deploy manifests via cloud code. I am not entirely sure if this is supposed to work or not as I couldn't find any docs or posts on this. Once the GCP free trial is finished, I would want to deploy my test apps on our local onprem K8 clusters via cloud code.
3 node cluster running on CentOS VMs(built using kubeadm)
6 node cluster on GCP running on Ubuntu machines(free trial and built using Hightower way)
Skaffold is installed locally on MAC and my local $HOME/.kube/config has contexts and users set to access all 3 clusters.
➜
guestbook-1 kubectl config get-contexts
CURRENT NAME CLUSTER AUTHINFO NAMESPACE
docker-desktop docker-desktop docker-desktop
* kubernetes-admin#kubernetes kubernetes kubernetes-admin
kubernetes-the-hard-way kubernetes-the-hard-way admin
Error:
Running: skaffold dev -v info --port-forward --rpc-http-port 57337 --filename /Users/testuser/Desktop/Cloud-Code-Builds/guestbook-1/skaffold.yaml -p cloudbuild --default-repo gcr.io/gcptrial-project
starting gRPC server on port 50051
starting gRPC HTTP server on port 57337
Skaffold &{Version:v1.19.0 ConfigVersion:skaffold/v2beta11 GitVersion: GitCommit:63949e28f40deed44c8f3c793b332191f2ef94e4 GitTreeState:dirty BuildDate:2021-01-28T17:29:26Z GoVersion:go1.14.2 Compiler:gc Platform:darwin/amd64}
applying profile: cloudbuild
no values found in profile for field TagPolicy, using original config values
Using kubectl context: kubernetes-admin#kubernetes
Loaded Skaffold defaults from \"/Users/testuser/.skaffold/config\"
Listing files to watch...
- python-guestbook-backend
watching files for artifact "python-guestbook-backend": listing files: unable to evaluate build args: reading dockerfile: open /Users/adminuser/Desktop/Cloud-Code-Builds/src/backend/Dockerfile: no such file or directory
Exited with code 1.
skaffold config file skaffold.yaml not found - check your current working directory, or try running `skaffold init`
I have the docker and skaffold file in the path as shown in the image and have authenticated the google SDK in vscode. Any help please ?!
I was able to get this working in the end. What helped in this particular case was removing skaffold.yaml, then skaffold init, generated new skaffold.yaml. And, Cloud Code was then able deploy pods on both remote clusters. Thanks for all your help.

Error when installing Spinnaker on Kubernetes on prem cluster

I'm trying to install Spinnaker on a Kubernetes setup onprem.
Following instructions from https://www.spinnaker.io/setup/
Install and run Halyard as Docker on the Kubernetes master.
Run everything as root
mkdir ~/.hal on Kubemaster. Created the service account as instrcuted in the site.
Copied the kubeconfig file from ./kube/config into ~/.hal/kubeconfig as it didnt work with docker -v option, there was some permission issue, so made it work this way
docker run halyard command -- all up and running fine.
Ran Bash and Inside halyard.
Now when I do these two things inside halyard
Point kubectl to the kubeconfig by export KUBECONFIG command
Enable kubernetes provider "hal config provider kubernetes enable"
The command gets executed sometimes successfully or it fails with this warning after timeout error
Getting object contents of versions.yml
Unexpected error comparing versions: com.netflix.spinnaker.halyard.core.error.v1.HalException: Could not load "versions.yml" from config bucket: www.googleapis.com.*
Even if it somehow manages to run successfully. When I run these,
CONTEXT=$(kubectl config current-context)
hal config provider kubernetes account add my-k8s-account --context $CONTEXT
It fails with the same error as above.
Total weird stuff. Its intermittent. Does it have something to do with the kubeconfig file? Any pointers or help would be greatly appreciated.
Thanks.
As noted in comments these kind of errors could result when there lack of network connectivity from inside the container.
As Vikram mentioned in his comment:
Yes, that was the problem. Azure support recommended installing a CNI plugin and it resolved the issue. So, it seems like inside of Azure VM without a Public IP, the CNI plugin is needed for a VM To connect to internet.
To configure CNI plugin on Azure platform use this guide.
Hope it helps.

How to fix nodes hostnames too long when deploying helm chart in Gitlab CI

What I'm trying to achieve
Hi, I am currently trying to deploy a postgresql-ha (more specifically bitnami/postgresql-ha) helm chart through a gitlab CI job.
What I can achieve locally
Locally, when running helm install postgresql ./postgresql-ha, I am able to create the chart on my remote Kubernetes Cluster successfully without any error.
./postgresql-ha is an unmodified pull of bitnami/postgresql-ha.
What happens when I run it through CI
When I try to run the same command in Gitlab's CI, with a properly set helm and kubectl, I get this error:
Release "postgresql" does not exist. Installing it now.
Error: template: postgresql-ha/templates/NOTES.txt:60:4: executing "postgresql-ha/templates/NOTES.txt" at <include "postgresql-ha.validateValues" .>: error calling include: template: postgresql-ha/templates/_helpers.tpl:682:51: executing "postgresql-ha.validateValues" at <fail>: error calling fail:
VALUES VALIDATION:
postgresql-ha: Nodes hostnames
PostgreSQL nodes hostnames exceeds the characters limit for Pgpool: 128.
Consider using a shorter release name or namespace.
The first line validates the kubectl/helm configurations.
.gitlab-ci.yml File
##########
# Stages #
##########
stages:
- deploy charts
########
# Jobs #
########
deploy_postgresql-ha:
image: dtzar/helm-kubectl:3.0.0
stage: deploy charts
environment:
name: production
script:
- helm upgrade --install postgresql ./postgresql-ha
The Kubernetes Cluster was created via Gitlab's Kubernetes Integration tool and is therefore already set up, as per this Gitlab Documentation page.
Thanks in advance the help!
It seems there is a bug on pgpool:
ping probes fail with long hostnames due to small buffer
pgpool calls "ping -q -c3 " and parses the result to ascertain if a host is up or down. Unfortunately it uses a rather small buffer to read the output and the last line of the ping command can get truncated.
This means that pgpool assumes the host is down.
https://pgpool.net/mantisbt/print_all_bug_page_word.php?search=&sort=&dir=DESC&type_page=html&export=-1&show_flag=0
Bitnami upgraded the code so now long hostnames are not allowed anymore. Nodehostname is composed on the following components, so you can adjust:
$nodeHostname := printf "%s-00.%s.%s.svc.%s:1234" $postgresqlFullname $postgresqlHeadlessServiceName $releaseNamespace $clusterDomain }}
https://github.com/bitnami/charts/blob/master/bitnami/postgresql-ha/templates/_helpers.tpl

k8s, RabbitMQ, and Peer Discovery

We are trying to run an instance of the RabbitMQ chart with Helm from the helm/charts/stable/rabbit project. I had it running perfect but then I had to restart k8s for some maintenance. Now we are completely unable to launch the RabbitMQ chart in any way shape or form. I am not even trying to run the chart with any variables, i.e. just the default values.
Here is all I am doing:
helm install stable/rabbitmq
I have confirmed I can simply run the default right on my local k8s which I'm running with Docker for Desktop. When we run the rabbit chart on our shared k8s the exact same way as on desktop and what we did before the restart, the following error is thrown:
Failed to get nodes from k8s - 503
I have also posted an issue on the Helm charts repo as well. Click here to see the issue on Github.
We are suspecting the DNS but are unable to confirm anything yet. What is very frustrating is after the restart every single other chart we installed restarted perfectly except Rabbit which now will not start at all.
Anyone know what I could do to get Rabbits peer discovery to work? Anyone seen issue like this after restarting k8s?
So I actually got rabbit to run. Turns out my issue was the k8s peer discovery could not connect over the default port 443 and I had to use the external port 6443 because kubernetes.default.svc.cluster.local resolved to the public port and could not find the internal, so yeah our config is messed up too.
It took me a while to realize the variable below was not overriding when I overrode it with helm install . -f server-values.yaml.
rabbitmq:
configuration: |-
## Clustering
cluster_formation.peer_discovery_backend = rabbit_peer_discovery_k8s
cluster_formation.k8s.host = kubernetes.default.svc.cluster.local
cluster_formation.k8s.port = 6443
cluster_formation.node_cleanup.interval = 10
cluster_formation.node_cleanup.only_log_warning = true
cluster_partition_handling = autoheal
# queue master locator
queue_master_locator=min-masters
# enable guest user
loopback_users.guest = false
I had to add cluster_formation.k8s.port = 6443 to the main values.yaml file instead of my own. Once the port was changed specifically in the values.yaml, rabbit started right up.
I'm wondering what is the reason of using rabbit_peer_discovery_k8s plugin, if values.yaml defaults to 1 replicas (your manifest file does not override this setting) ?
I was trying to reproduce your issue with given by you override values (dev-server.yaml), as per the details in your github issue #10811, but I somewhat failed. Here are my observations:
If to install RabbitMQ chart with your custom values, my rabbitmq-dev-default-0 pod gets stuck in CrashLoopBackOff state.
It`s quite hard to troubleshoot it further for me as bitnami`s rabbitmq image containers, used by this rabbitmq Helm chart, are shipped with non-root account.
On the other hand if rabbitmq chart is installed on my Kubernetes cluster (v1.13.2) in simplest form:
helm install stable/rabbitmq
I observe similar issue then. I mean rabbitmq server survives a simulated VM restart of all cluster nodes (including master), but I cannot connect to it from outside:
Post VM restart, I`m getting following error from my python mqclient:
socket.gaierror: [Errno -2] Name or service not known
Few remarks here:
Yes, I did port(s)-forward as per instructions on "helm status " command:
The readiness probe works fine:
curl -sS -f --user user:<my_pwd> 127.0.0.1:15672/api/healthchecks/node
{"status":"ok"}
rabbitmqctl to rabbitmq-server connectivity from inside the container works fine too:
kubectl exec rabbitmq-dev-default-0 -- rabbitmqctl list_queues
warning: the VM is running with native name encoding of latin1 which may cause Elixir to malfunction as it expects utf8. Please ensure your locale is set to UTF-8 (which can be verified by running "locale" in your shell)
Timeout: 60.0 seconds ...
Listing queues for vhost / ...
name messages
hello 11
From the moment I used kubectl port-forward to pod instead service, connectivity to rabbitmq server is restored:
kubectl port-forward --namespace default pod/rabbitmq-dev-default-0 5672:5672
$ python send.py
[x] Sent 'Hello World!'

Unable to connect to the server: dial tcp [::1]:8080: connectex: No connection could be made because the target machine actively refused it

Am working on Azure Kubernates where we can store Docker Images in Azure. Here am trying to check my kubectl version, then am getting
Unable to connect to the server: dial tcp [::1]:8080: connectex: No
connection could be made because the target machine actively refused
it.
For this I followed MSDN:uilding Microservices with AKS and VSTS – Part 2 and MSDOCS:Kubernetes on windows
So, can you please suggest me “How to resolve for this issue?”
I am on windows 10, and for me I did not enable kubernetes on Docker Desktop.
As you can see here, there are no contexts available.
So go to settings of docker desktop and enable it as follows.
Now run a command as follows.
kubectl config get-contexts
Ensure you see something like this.
Also you can also try listing the nodes as follows.
kubectl get nodes
I think you might missed out to configure the cluster, for that you need to run the below command in your command prompt.
az aks get-credentials --resource-group myResourceGroup --name myAKSCluster
The above CLI command creates .config file with complete cluster and nodes details in your local machine.
After that you run kubectl get nodes command in your command prompt, then you can get the list of nodes inside the cluster like in the below image.
For reference follow this Deploy an Azure Kubernetes Service (AKS) cluster.
If you can see that your config file is correctly configured by going to $HOME/.kube/config - Linux or %UserProfile%/.kube/config - Windows but you are still receiving the error message - try running command line as an administrator.
More information on the config file can be found here: https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/
In my case, I was shuffling between az aks k8s cluster and local docker-desktop.
So every time I change the cluster context I need to restart the docker, else I get the same described error.
Unable to connect to the server: dial tcp 127.0.0.1:6443: connectex: No connection could be made because the target machine actively refused it.
PS: make sure your cluster is started as shown in this picture showing (Stop local cluster)
For me it appeared to be due to Windows not having a HOME environment variable set. According to the docs kubectl will use the config file $(HOME)/.kube/config. But since this variable isn't set on Window it can't locate the file.
I created a HOME variable with the same value as USERPROFILE and it started working.
I'm using Hyper-V on Local Windows and I met this error because I didn't configure minikube.
(I know the question is about Azure, not minikube. But this article is on the top for the error message. So, I've put the solution here.)
1. enable Hyper-V.
Type in systeminfo on your Terminal. If you can find the line below,
Hyper-V Requirements: A hypervisor has been detected. Features required for Hyper-V will not be displayed.
Hyper-V works correctly.
If you can't, enable it from settings.
2. Create Hyper-V Network Switch
Open Hyper-V manager. (Searching it is the fastest way.)
Next, click your PC name on the left.
Then, you can find Virtual Switch Manager menu on the right.
Click it and choose External Virtual Switch with name: "Minikube Switch"
Click apply to create it.
3. start minikube
Go back to terminal and type in:
minikube start --vm-driver hyperv --hyperv-virtual-switch "Minikube Switch"
For more information, check the steps in this article.
Check docker is running and you started minikube or whichever cloud kube you using.
my issue resolved after running "minikube start --driver=docker"
Essentially this problem occurs if your minikube or kind isn't configured. Just try to restart your minikube or kind. If that doesn't solve your problem then try to restart your hypervisor which minikube uses.
minikube start
This command solved my issue.
I was facing the same error while firing the command "kubectl get pods"
The issue has been resolved by having following steps below:
a) First find out current-context
kubectl config get-contexts
CURRENT NAME CLUSTER AUTHINFO NAMESPACE
b) if no context is set then set it manually by using
kubectl config set-context <Your context>
Hope this will help you.
If you're facing this error on windows, its possible that your docker instance is not running.
These are the steps I followed to replicate the above error;
Stopped docker and then tried to start-up an nginx-deployment. Doing this caused the mentioned error above to happen.
How did I solve it?
Check if minikube is running in my case this was not running
Start minikube
Retry applying your configuration above. In my case see the screenshot below
When you see that your deployment has been created, then all should be fine.
I had exactly the same problem even after having correct config (by running an azure cli command).
It seems that kubectl expects HOME env.variable set but it did not exist for me. There is however a solution:
If you add a KUBECONFIG environmental variable that will point to config it will start working.
Example:
setx KUBECONFIG %UserProfile%\.kube\config
When the variable is present kubectl has no troubles reading from file.
P.S. It is an alternative to setting a HOME variable as suggested in another answer.
Azure self-hosted agent doesn't have the permission to access Kubernates cluster:
Remove Azure self-hosted agent - .\config.cmd Remove
configure again ( .\config.cmd) with a user have permission to access Kubernates cluster
I encountered similar problem:
> kubectl cluster-info
"To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
Unable to connect to the server: dial tcp xxx.x.x.x:8080: connectex: No connection could be made because the target machine actively refused it."
> kubectl cluster-info dump
Unable to connect to the server: dial tcp xxx.0.0.x:8080: connectex: No connection could be made because the target machine actively refused it.
This setup was working fine until Docker for Desktop bought it's own copy of kubectl. There are 2 ways to overcome this situation:
1 - Quit / Stop Docker for Desktop while using the cluster
2 - Set KUBECONFIG file path
I tried both the options and they worked.
Found a good source for .kube/config, sending it over here for quick reference:
apiVersion: v1
clusters:
- cluster:
certificate-authority: fake-ca-file
server: https://1.2.3.4
name: development
- cluster:
insecure-skip-tls-verify: true
server: https://5.6.7.8
name: scratch
contexts:
- context:
cluster: development
namespace: frontend
user: developer
name: dev-frontend
- context:
cluster: development
namespace: storage
user: developer
name: dev-storage
- context:
cluster: scratch
namespace: default
user: experimenter
name: exp-scratch
current-context: ""
kind: Config
preferences: {}
users:
- name: developer
user:
client-certificate: fake-cert-file
client-key: fake-key-file
- name: experimenter
user:
password: some-password
username: exp
Reference: https://kubernetes.io/docs/tasks/access-application-cluster/configure-access-multiple-clusters/
Following #ilya-chernomordik,
I've added my config path to the System Variable by doing
setx KUBECONFIG "D:\Minikube\Minikube.minikube\config"
I have changed the default Location from C: Drive to D: Drive as i have less space in C.
Now the problem is fixed.
edit: after 5 mins, the api server again stopped. It's been more than 5-6 hours i'm trying to solve this issue. I'm not sure why this problem is happening, even after adding the coreect path.
On Rancher Desktop, make sure context is correctly choosen
In my situation, I'm in windows with docker desktop in a simple scenario just for studies, but the case is:
In the docker version in 20.10 or above, it come with kubernetes installed. Then it doesn't necessary installed a cluster adm like minikube. Then, when it just need to enable kubernetes in Docker Desktop configuration. Like:
Go to Docker Desktop: settings > kubernetes > check the box inside section Enable kubernetes and then click in Restart Kubernetes Cluster
When we do this, the docker provide all needed to works Kubernetes properly.
Referenced by: Blog