I have been trying to deploy the git-sync sidecar (v3.4.0) to my instance of Airflow (v2.4.1, Helm chart version 1.7.0) running on a Kubernetes cluster (v1.23.7+rke2r2).
I followed the deployment instructions from the Airflow documentation.
My override-values.yaml is the following:
dags:
  gitSync:
    enabled: true
    repo: git@github.com/MY_COMPANY_NAME/MY_COMPANY-dags.git
    branch: main
    subPath: ""
    sshKeySecret: airflow-ssh-secret

extraSecrets:
  airflow-ssh-secret:
    data: |
      gitSshKey: 'MY_PRIVATE_KEY_IN_B64'
Once Airflow is stable, I use the following helm command to update my Airflow deployment:
helm upgrade --install airflow apache-airflow/airflow --namespace airflow -f override-values.yaml
This succeeds, but the deployment never reaches a new stable state with the git-sync containers: the git-sync-init container repeatedly fails to complete. I have used this approach to deploy git-sync before, and it worked for months, but it suddenly stopped working. When I attempt to check the logs for the git-sync-init container, they are empty, and there doesn't seem to be a verbosity attribute I can enable.
After reading through GitHub issues on the git-sync repo, I also attempted to prepend the ssh:// scheme to the repo URL, but that did not fix the issue.
Is there an alternative way for me to deploy a git-sync sidecar container to my Airflow deployment so that I can access code from private repos?
EDIT:
It appears that the issue was actually with the Rancher GUI. Whenever I would use the GUI, the container logs and shell would not load or show anything. However, I was able to open up a kubectl shell, query for the Airflow pods with kubectl get pods -n airflow, and query for the specific init container logs with kubectl logs airflow-scheduler-65fcdbb58d-4pnzf git-sync -n airflow.
This yielded the following error.
"msg"="unexpected error syncing repo, will retry" "error"="Run(git submodule update --init --recursive --depth 2): exit status 128: { stdout: "", stderr: "fatal: No url found for submodule path 'COMPANY_NAME/PACKAGE_PATH/PACKAGE' in .gitmodules\n" }"
This pointed to a misconfigured .gitmodules that was not updated when the structure of our dag repo was changed.
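For anyone hitting the same error, here is a minimal sketch of the two usual fixes; the submodule path is the placeholder from the error message above, and the replacement URL is hypothetical:

# Option 1: the submodule moved, so point .gitmodules at its new location
git config -f .gitmodules "submodule.COMPANY_NAME/PACKAGE_PATH/PACKAGE.url" git@github.com:MY_COMPANY_NAME/PACKAGE.git
git add .gitmodules && git commit -m "Fix submodule URL"

# Option 2: the submodule no longer exists, so remove the stale entry entirely
git submodule deinit -f COMPANY_NAME/PACKAGE_PATH/PACKAGE
git rm -f COMPANY_NAME/PACKAGE_PATH/PACKAGE
git commit -m "Remove stale submodule"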
Yesterday I stopped a helm upgrade while it was running in a release pipeline in Azure DevOps, and the following deployments failed.
I tried to find the failed chart with the aim of deleting it, but the chart of the microservice ("auth") doesn't appear. I used the command helm list -n [namespace_of_AKS] and it isn't listed.
What can I do to solve this problem?
Error in Azure Release Pipeline
2022-03-24T08:01:39.2649230Z Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress
2022-03-24T08:01:39.2701686Z ##[error]Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress
Helm list (screenshot: no releases shown)
This error can happen for a few reasons, but it most commonly occurs when there is an interruption during the upgrade/install process, as you already mentioned.
To fix it, you may need to first roll back to another revision, then reinstall or run helm upgrade again.
Try the command below to list the releases:
helm ls --namespace <namespace>
Note that when running that command, it may not show any release information at all.
Next, check the history of the previous deployment:
helm history <release> --namespace <namespace>
This mostly shows that the original installation never completed successfully and the release is stuck in a pending state, something like STATUS: pending-upgrade.
To escape from this state, use the rollback command:
helm rollback <release> <revision> --namespace <namespace>
The revision is optional, but you should try to provide it.
You may then try to issue your original command again to upgrade or reinstall; a sketch of the full sequence follows.
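Putting those steps together, a minimal sketch with a hypothetical release named auth in namespace my-namespace:

helm history auth --namespace my-namespace      # shows e.g. revision 7 as pending-upgrade
helm rollback auth 6 --namespace my-namespace   # roll back to the last healthy revision
helm upgrade --install auth ./auth-chart --namespace my-namespace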
helm ls -a -n {namespace} will list all releases within a namespace, regardless of status.
You can also use helm ls -aA instead to list all releases in all namespaces -- in case you actually deployed the release to a different namespace (I've done that before)
Try deleting the latest helm release secret for the deployment and re-run your helm upgrade command.
kubectl get secret -A | grep <app-name>
kubectl delete secret <secret> -n <namespace>
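As a sketch of that approach (the release name, namespace, and revision below are hypothetical; Helm 3 stores release state in secrets named sh.helm.release.v1.<release>.v<revision>):

# Find the release secret stuck in a pending state
kubectl get secret -n my-namespace -l owner=helm,status=pending-upgrade
# Delete only the newest revision's secret, e.g. revision 7 of "auth"
kubectl delete secret sh.helm.release.v1.auth.v7 -n my-namespace
# Then re-run the original upgrade
helm upgrade --install auth ./auth-chart -n my-namespace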
This is my very first post here and I'm looking for some advice, please.
I am learning Kubernetes and trying to get the Cloud Code extension to deploy Kubernetes manifests on a non-GKE cluster. The Guestbook app can be deployed using the Cloud Code extension to a local K8s cluster (such as Minikube or Docker for Desktop).
I have two other K8s clusters, listed below, and I cannot deploy manifests to them via Cloud Code. I am not entirely sure if this is supposed to work or not, as I couldn't find any docs or posts on this. Once the GCP free trial is finished, I would want to deploy my test apps on our local on-prem K8s clusters via Cloud Code.
3-node cluster running on CentOS VMs (built using kubeadm)
6-node cluster on GCP running on Ubuntu machines (free trial, built following Kelsey Hightower's Kubernetes the Hard Way)
Skaffold is installed locally on my Mac, and my local $HOME/.kube/config has contexts and users set to access all three clusters.
➜ guestbook-1 kubectl config get-contexts
CURRENT   NAME                          CLUSTER                   AUTHINFO           NAMESPACE
          docker-desktop                docker-desktop            docker-desktop
*         kubernetes-admin@kubernetes   kubernetes                kubernetes-admin
          kubernetes-the-hard-way       kubernetes-the-hard-way   admin
Error:
Running: skaffold dev -v info --port-forward --rpc-http-port 57337 --filename /Users/testuser/Desktop/Cloud-Code-Builds/guestbook-1/skaffold.yaml -p cloudbuild --default-repo gcr.io/gcptrial-project
starting gRPC server on port 50051
starting gRPC HTTP server on port 57337
Skaffold &{Version:v1.19.0 ConfigVersion:skaffold/v2beta11 GitVersion: GitCommit:63949e28f40deed44c8f3c793b332191f2ef94e4 GitTreeState:dirty BuildDate:2021-01-28T17:29:26Z GoVersion:go1.14.2 Compiler:gc Platform:darwin/amd64}
applying profile: cloudbuild
no values found in profile for field TagPolicy, using original config values
Using kubectl context: kubernetes-admin@kubernetes
Loaded Skaffold defaults from \"/Users/testuser/.skaffold/config\"
Listing files to watch...
- python-guestbook-backend
watching files for artifact "python-guestbook-backend": listing files: unable to evaluate build args: reading dockerfile: open /Users/adminuser/Desktop/Cloud-Code-Builds/src/backend/Dockerfile: no such file or directory
Exited with code 1.
skaffold config file skaffold.yaml not found - check your current working directory, or try running `skaffold init`
I have the Dockerfile and skaffold.yaml in the path as shown in the image, and I have authenticated the Google SDK in VS Code. Any help please?!
I was able to get this working in the end. What helped in this particular case was removing skaffold.yaml and running skaffold init, which generated a new skaffold.yaml. Cloud Code was then able to deploy pods on both remote clusters. Thanks for all your help.
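A sketch of the steps that resolved it, using the project path from the question:

cd /Users/testuser/Desktop/Cloud-Code-Builds/guestbook-1
rm skaffold.yaml
skaffold init    # detects the Dockerfiles and regenerates skaffold.yaml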
I am trying to delete the entire Kubernetes cluster that I created for my CI/CD pipeline R&D. To delete the cluster and everything in it, I ran the following commands:
kubectl config delete-cluster <cluster-name>
kubectl config delete-context <Cluster-context>
To make sure the cluster was deleted, I built the Jenkins pipeline job again, and found that it was still deploying with the updated changes.
When I run kubectl config view, I get the following result:
docker@mildevdcr01:~$ kubectl config view
apiVersion: v1
clusters: []
contexts: []
current-context: kubernetes-admin@cluster.local
kind: Config
preferences: {}
users: []
docker@mildevdcr01:~$
My Spring Boot microservice is still being deployed to the cluster with the updated changes.
I created the Kubernetes cluster using the Kubespray tool, which I got from GitHub:
https://github.com/kubernetes-incubator/kubespray.git
What do I need to do to delete everything I created for this Kubernetes cluster? I need to remove everything, including the master node.
If you set up your cluster using Kubespray, you ran the whole installation using Ansible, so you have to use Ansible to delete the cluster too.
You can also reset the entire cluster for a fresh installation:
$ ansible-playbook -i inventory/mycluster/hosts.ini reset.yml
Remember to keep the hosts.ini updated properly.
You can also remove nodes from your cluster one by one, simply by adding the specific node to the [kube-node] section of your inventory/mycluster/hosts.ini file (your hosts file) and running the command below (see the inventory sketch after it):
$ ansible-playbook -i inventory/mycluster/hosts.ini remove-node.yml
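A minimal sketch of the relevant inventory fragment, assuming a hypothetical node named node3 is the one being removed:

# inventory/mycluster/hosts.ini (fragment)
[kube-node]
node3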
KubeSpray documentation: kubespray.
Useful articles: kubespray-steps, kubespray-ansible.
Okay, so for a Kubespray CI/CD pipeline it's a little more complicated than just deleting the cluster context. You have to actively delete other items on each node and run reset.yml to reset ETCD.
Sometimes just running reset.yml is enough for your pipeline, as it resets the cluster back to its initial state, but if that is not enough then you have to delete Docker, the kubelet, repositories, /etc/kubernetes, and many other directories on the nodes to get a clean deployment. In that case it's almost always easier to just provision new nodes in your pipeline using Terraform and the vSphere (vRA) API.
I'm facing a pretty strange problem.
First of all my setup:
I have a private GitLab server which uses GitLab CI runners on Kubernetes to build Docker images. For that purpose I use the Kaniko image.
The runners are provisioned by GitLab itself with the built-in Kubernetes management. All of this runs behind a pfSense server.
Now to my problem:
Sometimes the Kaniko pods can't resolve the hostname of the GitLab server.
This leads to a failed git pull and therefore a failed build.
I would put the failure rate at around 60%, which is way too high for us.
After retrying the build a few times, it will run without any problem.
The Kubernetes cluster running the GitLab CI is set up on CentOS 7.
SELinux and firewalld are disabled. All of the hosts can resolve the GitLab server, and the problem is not tied to a specific host server: I have seen it fail on all five servers, including the manager server. I also haven't seen this problem appear in other pods, although the other deployments in the cluster don't really make connections via DNS. I am sure that the runner is able to reach DNS at all, because it pulls the Kaniko image from gcr.io.
Has anyone ever seen this problem, or does anyone know a workaround?
I have already tried spawning pods that only do DNS requests to the domain (see the sketch after this list of attempts). I didn't see a single failure.
I also tried rebooting the whole cluster and the GitLab instance.
I tried a static overwrite of the DNS route in pfSense. Still the same problem.
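For reference, a sketch of the kind of throwaway DNS test pod I mean (the image and domain are placeholders):

kubectl run dns-test --rm -it --image=busybox --restart=Never -- nslookup git.mydomain.com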
Here is my CI config:
build:
  stage: build
  image:
    name: gcr.io/kaniko-project/executor:debug
    entrypoint: [""]
  script:
    - echo $REGISTRY_AUTH > /kaniko/.docker/config.json
    - /kaniko/executor --context $CI_PROJECT_DIR --dockerfile $CI_PROJECT_DIR/Dockerfile --destination $REGISTRY_URL/$REGISTRY_IMAGE:$CI_JOB_ID
  only:
    - master
The following error happens:
Initialized empty Git repository in /builds/MYPROJECT/.git/
Fetching changes...
Created fresh repository.
fatal: unable to access 'https://gitlab-ci-token:[MASKED]@git.mydomain.com/MYPROJECT.git/': Could not resolve host: git.mydomain.com
We had the same issue for a couple of days. We tried changing the CoreDNS config, moving the runners to a different k8s cluster, and so on. Finally, today I checked my personal runner and found that I was using a different version. The runners in the cluster had gitlab/gitlab-runner:alpine-v12.3.0, while mine had gitlab/gitlab-runner:alpine-v12.0.1. We added the line
image: gitlab/gitlab-runner:alpine-v12.1.0
in values.yaml and this solved the problem for us.
There is an env var for gitlab-runner that can solve this problem:
- name: RUNNER_PRE_CLONE_SCRIPT
  value: "exec command before git fetch ..."
For example, edit /etc/hosts:
echo '127.0.0.1 git.demo.xxxx' >> /etc/hosts
or edit /etc/resolv.conf:
echo 'nameserver 8.8.8.8' > /etc/resolv.conf
Hope it works for you.
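As a sketch, setting that variable through the gitlab-runner Helm chart's values.yaml might look like this (the envVars key is assumed from the chart; the IP and domain are placeholders):

envVars:
  - name: RUNNER_PRE_CLONE_SCRIPT
    value: "echo '10.0.0.5 git.mydomain.com' >> /etc/hosts"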
I am using the instructions from https://github.com/kubernetes/kubernetes/blob/master/docs/getting-started-guides/docker-multinode.md to set up a multi-node Kubernetes cluster on VMware vCloud infrastructure.
I was able to get the cluster working, but when I tried the NFS example I was not able to create the NFS container. So I recreated all the VMs and rebuilt Kubernetes from source using:
git clone https://github.com/kubernetes/kubernetes.git
cd kubernetes
sed -i 's/allow_privileged: .*/allow_privileged: true/g' cluster/saltbase/pillar/privilege.sls
./build/run.sh hack/build-cross.sh
cp _output/dockerized/bin/linux/$(dpkg --print-architecture)/kubectl /usr/local/bin
chmod +x /usr/local/bin/kubectl
and continued to set up the Kubernetes cluster, retried the NFS example, and got the following error:
kubectl create -f nfs-server-pod.yaml
The Pod "nfs-server" is invalid.
spec.containers[0].securityContext.privileged: forbidden '<*>(0xc20931650c)true'
I tried with both master and the 1.0.3 release and had the same result.
Can you please tell me how to resolve this issue? Thanks for your support.
We thought that turning privileged containers off by default would be good for security. It turns out to just be a pain point for a lot of people, so we're working to turn it on by default in Kubernetes v1.1.
The --allow-privileged flag has to be set on both the kubelet and the apiserver, so please check that; a sketch follows.
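A sketch of what that looks like when the components are launched directly (all other flags and paths elided):

# on the master
kube-apiserver --allow-privileged=true ...
# on every node
kubelet --allow-privileged=true ...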