PersistentVolumeClaim not being restored using velero - kubernetes

I have the following values set in my Velero configuration, which was installed using Helm:
schedules:
  my-schedule:
    schedule: "5 * * * *"
    template:
      includeClusterResources: true
      includedNamespaces:
        - jenkins
      includedResources:
        - 'pvcs'
      storageLocation: backups
      snapshotVolumes: true
      ttl: 24h0m0s
I had a PVC (with an underlying, dynamically provisioned PV), both of which I deleted manually.
I then performed a velero restore (pointing to a backup taken prior to the PV/PVC deletion, of course), as in:
velero restore create --from-backup velero-hourly-backup-20201119140005 --include-resources persistentvolumeclaims -n extra-services
extra-services is the namespace where Velero is deployed, by the way.
Although the logs indicate the restore was successful:
▶ velero restore logs velero-hourly-backup-20201119140005-20201119183805 -n extra-services
time="2020-11-19T16:38:06Z" level=info msg="starting restore" logSource="pkg/controller/restore_controller.go:467" restore=extra-services/velero-hourly-backup-20201119140005-20201119183805
time="2020-11-19T16:38:06Z" level=info msg="Starting restore of backup extra-services/velero-hourly-backup-20201119140005" logSource="pkg/restore/restore.go:363" restore=extra-services/velero-hourly-backup-20201119140005-20201119183805
time="2020-11-19T16:38:06Z" level=info msg="restore completed" logSource="pkg/controller/restore_controller.go:482" restore=extra-services/velero-hourly-backup-20201119140005-20201119183805
I see the following error in the restore description:
Name:         velero-hourly-backup-20201119140005-20201119183805
Namespace:    extra-services
Labels:       <none>
Annotations:  <none>
Phase:        PartiallyFailed (run 'velero restore logs velero-hourly-backup-20201119140005-20201119183805' for more information)
Started:      2020-11-19 18:38:05 +0200 EET
Completed:    2020-11-19 18:38:07 +0200 EET
Errors:
  Velero:     error parsing backup contents: directory "resources" does not exist
  Cluster:    <none>
  Namespaces: <none>
Backup:  velero-hourly-backup-20201119140005
Namespaces:
  Included:  all namespaces found in the backup
  Excluded:  <none>
Resources:
  Included:        persistentvolumeclaims
  Excluded:        nodes, events, events.events.k8s.io, backups.velero.io, restores.velero.io, resticrepositories.velero.io
  Cluster-scoped:  auto
Namespace mappings:  <none>
Label selector:  <none>
Restore PVs:  auto
Any ideas?
Does this have anything to do with me deleting the PV/PVC? (After all, I was trying to simulate a disaster scenario.)
I have both backupsEnabled and snapshotsEnabled set to true.

Related

How to remove Custom Resources from Kubernetes?

I have configured Crossplane on my machine, created a cluster along with a bunch of other resources, and now I am trying to clean everything up.
For instance, I am trying to delete a Cluster by using kubectl delete clusters <cluster-name>, but it simply does not get removed. In the minikube dashboard I see the following condition on that cluster:
Type: Terminating
Reason: InstanceDeletionCheck
Message: could not confirm zero CustomResources remaining: timed out waiting for the condition
My goal is to clean up the managed resources created by playing with this GitHub repo: https://github.com/upbound/platform-ref-multi-k8s, and I would really appreciate any help.
This is the output of the kubectl describe cluster multik8s-cluster-aws-wd95c-qlbrt command:
Name:         multik8s-cluster-aws-wd95c-qlbrt
Namespace:
Labels:       crossplane.io/claim-name=multik8s-cluster-aws
              crossplane.io/claim-namespace=default
              crossplane.io/composite=multik8s-cluster-aws-wd95c
Annotations:  crossplane.io/external-create-pending: 2022-05-03T08:36:12Z
              crossplane.io/external-create-succeeded: 2022-05-03T08:36:14Z
              crossplane.io/external-name: multik8s-cluster-aws
API Version:  eks.aws.crossplane.io/v1beta1
Kind:         Cluster
Metadata:
  Creation Timestamp:             2022-05-03T08:24:13Z
  Deletion Grace Period Seconds:  0
  Deletion Timestamp:             2022-05-03T10:15:06Z
  Finalizers:
    finalizer.managedresource.crossplane.io
  Generate Name:  multik8s-cluster-aws-wd95c-
  Generation:     6
  Managed Fields:
    ...
  Owner References:
    API Version:  multik8s.platformref.crossplane.io/v1alpha1
    Controller:   true
    Kind:         EKS
    Name:         multik8s-cluster-aws-wd95c-h2nbj
    UID:          76852ac3-58a9-42ec-8307-c02e490e8f32
  Resource Version:  507248
  UID:               f02fa30d-9878-4be9-bebc-838d7e58d565
Spec:
  Deletion Policy:  Delete
  For Provider:
    ...
Status:
  At Provider:
    Arn:         arn:aws:eks:us-west-2:305615705119:cluster/multik8s-cluster-aws
    Created At:  2022-05-03T08:36:14Z
    Endpoint:    https://519EADEC62BE27B27903C30E01A8E22D.gr7.us-west-2.eks.amazonaws.com
    Identity:
      Oidc:
        Issuer:  https://oidc.eks.us-west-2.amazonaws.com/id/519EADEC62BE27B27903C30E01A8E22D
    Platform Version:  eks.6
    Resources Vpc Config:
      Cluster Security Group Id:  sg-0b9baf2fff4385125
      Vpc Id:                     vpc-0fca5959a43bbdf71
    Status:  ACTIVE
  Conditions:
    Last Transition Time:  2022-05-03T08:48:26Z
    Message:               update failed: cannot update EKS cluster version: InvalidParameterException: Unsupported Kubernetes minor version update from 1.21 to 1.16
    Reason:                ReconcileError
    Status:                False
    Type:                  Synced
    Last Transition Time:  2022-05-03T08:48:25Z
    Reason:                Available
    Status:                True
    Type:                  Ready
Events:  <none>

Velero - Restore Partially fails for volumes provisioned using CSI driver

As part of a POC, I am trying to back up and restore volumes provisioned by the GKE CSI driver in the same GKE cluster. However, the restore fails and there are no logs to debug with.
Steps:
Create volume snapshot class: kubectl create -f vsc.yaml
# vsc.yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-gce-vsc
  labels:
    "velero.io/csi-volumesnapshot-class": "true"
driver: pd.csi.storage.gke.io
deletionPolicy: Delete
Create storage class: kubectl create -f sc.yaml
# sc.yaml
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: pd-example
provisioner: pd.csi.storage.gke.io
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
parameters:
  type: pd-standard
Create namespace: kubectl create namespace csi-app
Create a persistent volume claim: kubectl create -f pvc.yaml
# pvc.yaml
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: podpvc
  namespace: csi-app
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: pd-example
  resources:
    requests:
      storage: 6Gi
Create a pod to consume the pvc: kubectl create -f pod.yaml
# pod.yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: web-server
  namespace: csi-app
spec:
  containers:
    - name: web-server
      image: nginx
      volumeMounts:
        - mountPath: /var/lib/www/html
          name: mypvc
  volumes:
    - name: mypvc
      persistentVolumeClaim:
        claimName: podpvc
        readOnly: false
Once the PVC was bound, I created the Velero backup:
velero backup create test --include-resources=pvc,pv --include-namespaces=csi-app --wait
Output:
Backup request "test" submitted successfully.
Waiting for backup to complete. You may safely press ctrl-c to stop waiting - your backup will continue in the background.
...
Backup completed with status: Completed. You may check for more information using the commands `velero backup describe test` and `velero backup logs test`.
velero backup describe test
Name:         test
Namespace:    velero
Labels:       velero.io/storage-location=default
Annotations:  velero.io/source-cluster-k8s-gitversion=v1.21.5-gke.1302
              velero.io/source-cluster-k8s-major-version=1
              velero.io/source-cluster-k8s-minor-version=21
Phase:  Completed
Errors:    0
Warnings:  1
Namespaces:
  Included:  csi-app
  Excluded:  <none>
Resources:
  Included:        pvc, pv
  Excluded:        <none>
  Cluster-scoped:  auto
Label selector:  <none>
Storage Location:  default
Velero-Native Snapshot PVs:  auto
TTL:  720h0m0s
Hooks:  <none>
Backup Format Version:  1.1.0
Started:    2021-12-22 15:40:08 +0300 +03
Completed:  2021-12-22 15:40:10 +0300 +03
Expiration:  2022-01-21 15:40:08 +0300 +03
Total items to be backed up:  2
Items backed up:              2
Velero-Native Snapshots: <none included>
After the backup is created, I verified the backup was created and was available in my GCS bucket.
Then I deleted all the existing resources to test the restore:
kubectl delete -f pod.yaml
kubectl delete -f pvc.yaml
kubectl delete -f sc.yaml
kubectl delete namespace csi-app
Run restore command:
velero restore create --from-backup test --wait
Output:
Restore request "test-20211222154302" submitted successfully.
Waiting for restore to complete. You may safely press ctrl-c to stop waiting - your restore will continue in the background.
.
Restore completed with status: PartiallyFailed. You may check for more information using the commands `velero restore describe test-20211222154302` and `velero restore logs test-20211222154302`.
The velero describe and velero logs commands don't return any description or logs.
What did you expect to happen:
I was expecting the PV, PVC, and the namespace to get restored.
The following information will help us better understand what's going on:
The velero debug --backup test --restore test-20211222154302 command was stuck for more than 10 minutes, and I couldn't generate the support bundle.
Output:
2021/12/22 15:45:16 Collecting velero resources in namespace: velero
2021/12/22 15:45:24 Collecting velero deployment logs in namespace: velero
2021/12/22 15:45:28 Collecting log and information for backup: test
Environment:
Velero version (use velero version):
Client:
Version: v1.7.1
Git commit: -
Server:
Version: v1.7.1
Velero features (use velero client config get features):
features:
Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.1", GitCommit:"86ec240af8cbd1b60bcc4c03c20da9b98005b92e", GitTreeState:"clean", BuildDate:"2021-12-16T11:33:37Z", GoVersion:"go1.17.5", Compiler:"gc", Platform:"darwin/arm64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.5-gke.1302", GitCommit:"639f3a74abf258418493e9b75f2f98a08da29733", GitTreeState:"clean", BuildDate:"2021-10-21T21:35:48Z", GoVersion:"go1.16.7b7", Compiler:"gc", Platform:"linux/amd64"}
Kubernetes installer & version:
GKE 1.21.5-gke.1302
Cloud provider or hardware configuration:
GCP
OS (e.g. from /etc/os-release):
GCP Container-Optimized OS (COS)
You should be able to check the logs as mentioned in the output:
Restore completed with status: PartiallyFailed. You may check for more information using the commands `velero restore describe test-20211222154302` and `velero restore logs test-20211222154302`.
You say velero describe and velero logs don't return anything, but the restore log is available once the restore has completed; check it for errors and it should show what went wrong.
Since you were doing a PV/PVC backup with CSI, you should have Velero set up to support it:
https://kubernetes-csi.github.io/docs/snapshot-restore-feature.html
Depending on the plugin version you used, you might have hit a bug, like:
https://github.com/vmware-tanzu/velero-plugin-for-csi/pull/122
That one, for example, should be fixed in the latest 0.3.2 release:
https://github.com/vmware-tanzu/velero-plugin-for-csi/releases/tag/v0.3.2
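For reference, CSI snapshot support in Velero is gated behind the EnableCSI feature flag plus the CSI plugin. A minimal install sketch for a GCP setup (the bucket name, credentials file and plugin versions are placeholders, adjust them to your environment):
velero install \
  --provider gcp \
  --bucket <your-backup-bucket> \
  --secret-file ./credentials-velero \
  --plugins velero/velero-plugin-for-gcp:v1.3.0,velero/velero-plugin-for-csi:v0.3.2 \
  --features=EnableCSI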
So start with:
velero restore logs test-20211222154302
and go from there. Please update the question with your findings, and if you resolved it, let us know. Thank you.

How to enforce MustRunAsNonRoot policy in K8S cluster in AKS

I have a K8S cluster running in Azure AKS service.
I want to enforce the MustRunAsNonRoot policy. How do I do it?
The following policy is created:
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restrict-root
spec:
  privileged: false
  allowPrivilegeEscalation: false
  runAsUser:
    rule: MustRunAsNonRoot
  seLinux:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  volumes:
    - '*'
It is deployed in the cluster:
$ kubectl get psp
NAME            PRIV    CAPS   SELINUX    RUNASUSER          FSGROUP    SUPGROUP   READONLYROOTFS   VOLUMES
restrict-root   false          RunAsAny   MustRunAsNonRoot   RunAsAny   RunAsAny   false            *
Admission controller is running in the cluster:
$ kubectl get pods -n gatekeeper-system
NAME READY STATUS RESTARTS AGE
gatekeeper-audit-7b4bc6f977-lvvfl 1/1 Running 0 32d
gatekeeper-controller-5948ddcd54-5mgsm 1/1 Running 0 32d
gatekeeper-controller-5948ddcd54-b59wg 1/1 Running 0 32d
Even so, it is still possible to run a simple pod as root:
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  containers:
  - name: mypod
    image: busybox
    args: ["sleep", "10000"]
    securityContext:
      runAsUser: 0
Pod is running:
$ kubectl describe po mypod
Name: mypod
Namespace: default
Priority: 0
Node: aks-default-31327534-vmss000001/10.240.0.5
Start Time: Mon, 08 Feb 2021 23:10:46 +0100
Labels: <none>
Annotations: <none>
Status: Running
Why is MustRunAsNonRoot not applied? How can I enforce it?
EDIT: It looks like AKS does not support PodSecurityPolicy (see the list of supported policies). The question is still the same: how do I enforce the MustRunAsNonRoot rule on workloads?
You shouldn't use PodSecurityPolicy on an Azure AKS cluster, as it has been set for deprecation as of May 31st, 2021 in favor of Azure Policy for AKS. Check the official docs for further details:
Warning
The feature described in this document, pod security policy (preview), is set for deprecation and will no longer be available after May 31st, 2021 in favor of Azure Policy for AKS. The deprecation date has been extended from the previous date of October 15th, 2020.
So currently you should use Azure Policy for AKS instead. Among its built-in policies, grouped into initiatives (an initiative in Azure Policy is a collection of policy definitions tailored towards achieving a singular overarching goal), you can find a policy whose goal is to disallow running privileged containers on your AKS cluster.
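If the Azure Policy add-on is not enabled on the cluster yet, it can be turned on with the Azure CLI; for example (the cluster and resource group names are placeholders):
az aks enable-addons \
  --addons azure-policy \
  --name myAKSCluster \
  --resource-group myResourceGroup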
As for PodSecurityPolicy, for the time being it should still work. Please check here whether you forgot anything, e.g. make sure you set up the corresponding ClusterRole and ClusterRoleBinding that allow the policy to be used.
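For illustration, a minimal sketch of such a ClusterRole and ClusterRoleBinding granting use of your restrict-root PSP to all service accounts (the object names here are made up):
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: psp-restrict-root
rules:
- apiGroups: ["policy"]
  resources: ["podsecuritypolicies"]
  resourceNames: ["restrict-root"]
  verbs: ["use"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: psp-restrict-root
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: psp-restrict-root
subjects:
- kind: Group
  apiGroup: rbac.authorization.k8s.io
  name: system:serviceaccounts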

Kubernetes cluster cannot attach and mount automatically created google cloud platform disks to worker nodes

Basically I'm trying to deploy a cluster on GCE via kubeadm with StorageClass supported (without using Google Kubernetes Engine).
Say I deployed a cluster with a master node in Tokyo and 3 worker nodes in Hong Kong, Taiwan and Oregon.
NAME STATUS ROLES AGE VERSION
k8s-node-hk Ready <none> 3h35m v1.14.2
k8s-node-master Ready master 3h49m v1.14.2
k8s-node-oregon Ready <none> 3h33m v1.14.2
k8s-node-tw Ready <none> 3h34m v1.14.2
kube-controller-manager and kubelet are both started with --cloud-provider=gce, and I can now apply a StorageClass and a PersistentVolumeClaim, have a disk created automatically on GCP (say, a disk in Taiwan), and get the PV and PVC bound.
kubectl get pvc:
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
eth-pvc Bound pvc-bf35e3c9-81e2-11e9-8926-42010a920002 10Gi RWO slow 137m
kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pvc-bf35e3c9-81e2-11e9-8926-42010a920002 10Gi RWO Delete Bound default/eth-pvc slow 137m
However, kube-controller-manager cannot find the Taiwan node and attach the disk to the node in the same zone; it logs the following (we can see the zone asia-northeast1-a is not correct):
I0529 07:25:46.366500 1 reconciler.go:288] attacherDetacher.AttachVolume started for volume "pvc-bf35e3c9-81e2-11e9-8926-42010a920002" (UniqueName: "kubernetes.io/gce-pd/kubernetes-dynamic-pvc-bf35e3c9-81e2-11e9-8926-42010a920002") from node "k8s-node-tokyo"
E0529 07:25:47.296824 1 attacher.go:102] Error attaching PD "kubernetes-dynamic-pvc-bf35e3c9-81e2-11e9-8926-42010a920002" to node "k8s-node-tokyo": GCE persistent disk not found: diskName="kubernetes-dynamic-pvc-bf35e3c9-81e2-11e9-8926-42010a920002" zone="asia-northeast1-a"
The kubelet on each node is started with --cloud-provider=gce, but I couldn't find out how to configure the zone. When I checked the kubelet's log, I found that this flag is already deprecated in Kubernetes v1.14.2 (the latest as of May 2019).
May 29 04:36:03 k8s-node-tw kubelet[29971]: I0529 04:36:03.623704 29971 server.go:417] Version: v1.14.2
May 29 04:36:03 k8s-node-tw kubelet[29971]: W0529 04:36:03.624039 29971 plugins.go:118] WARNING: gce built-in cloud provider is now deprecated. The GCE provider is deprecated and will be removed in a future release
However, kubelet labeled the k8s-node-tw node with the correct zone and region:
May 29 04:36:05 k8s-node-tw kubelet[29971]: I0529 04:36:05.157665 29971 kubelet_node_status.go:331] Adding node label from cloud provider: beta.kubernetes.io/instance-type=n1-standard-1
May 29 04:36:05 k8s-node-tw kubelet[29971]: I0529 04:36:05.158093 29971 kubelet_node_status.go:342] Adding node label from cloud provider: failure-domain.beta.kubernetes.io/zone=asia-east1-a
May 29 04:36:05 k8s-node-tw kubelet[29971]: I0529 04:36:05.158201 29971 kubelet_node_status.go:346] Adding node label from cloud provider: failure-domain.beta.kubernetes.io/region=asia-east1
Thanks for reading this far. My question is:
If it's possible, how can I configure kubelet or kube-controller-manager correctly so that GCP storage classes are supported and the created disks get attached and mounted successfully?
==================K8s config files======================
Deployment (related part):
volumes:
  - name: data
    persistentVolumeClaim:
      claimName: eth-pvc
  - name: config
    configMap:
      name: {{ template "ethereum.fullname" . }}-geth-config
  - name: account
    secret:
      secretName: {{ template "ethereum.fullname" . }}-geth-account
PVC:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: eth-pvc
spec:
  storageClassName: slow
  resources:
    requests:
      storage: 10Gi
  accessModes:
    - ReadWriteOnce
SC:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: slow
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-standard
  replication-type: none
  zone: asia-east1-a
After several days' research, I found that the reason is:
Master node is in asia-northeast1-a (Tokyo)
Worker nodes are in asia-east1-a (Taiwan) and other zones
cloud-provider-gcp only searches zones in one region (normally the master node's zone, but you can specify it by setting local-zone in the cloud config file; see the config sketch below), which means by default it can only support one zone, or multiple zones within one region
Conclusion:
In order to support multiple zones across multiple regions, we need to modify the GCE provider code or configuration, e.g. add another field to configure which zones should be searched.
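For the single-region, multi-zone case mentioned above, the in-tree GCE provider can be pointed at the right zones through its cloud config file, passed to kube-controller-manager via --cloud-config. A rough sketch (the path, project ID and exact key names are assumptions based on the legacy in-tree provider, so double-check them against your version):
# /etc/kubernetes/gce.conf (hypothetical path; pass it with --cloud-config)
[global]
project-id = my-gcp-project
; zone whose region will be searched for disks and nodes
local-zone = asia-northeast1-a
; also consider the other zones of that region
multizone = true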
==========================UPDATE=========================
I modified the k8s code to add an extra-zones config field, as in this diff on GitHub, to make it work for my use case.

Snapshot of Hostpath volume in kubernetes example clarification

I have a K8s cluster inside Azure VMs, running Ubuntu 18.
The cluster was provisioned using conjure-up.
I am trying to test the Kubernetes snapshot feature, following the steps here:
https://github.com/kubernetes-incubator/external-storage/blob/master/snapshot/doc/examples/hostpath/README.md
While I can follow most instructions on the page, I am not sure what this specific command does:
"_output/bin/snapshot-controller -kubeconfig=${HOME}/.kube/config"
Directly executing this instruction doesn't work as such.
Can anyone explain what this does and how to run this part successfully?
Or, better yet, point to a complete walk-through if one exists.
Update
I tried out the steps from
https://github.com/kubernetes-incubator/external-storage/tree/master/snapshot/deploy/kubernetes/hostpath
and commented out the line below, since I am not using RBAC:
# serviceAccountName: snapshot-controller-runner
Then I deployed it using:
kubectl create -f deployment.yaml
kubectl create -f pv.yaml
kubectl create -f pvc.yaml
kubectl create -f snapshot.yaml
These YAMLs are taken from the examples as-is:
github.com/kubernetes-incubator/external-storage/blob/master/snapshot/doc/examples/hostpath/
kubectl describe volumesnapshot snapshot-demo
Name:         snapshot-demo
Namespace:    default
Labels:       SnapshotMetadata-PVName=hostpath-pv
              SnapshotMetadata-Timestamp=1555999582450832931
Annotations:  <none>
API Version:  volumesnapshot.external-storage.k8s.io/v1
Kind:         VolumeSnapshot
Metadata:
  Creation Timestamp:  2019-04-23T05:56:05Z
  Generation:          2
  Resource Version:    261433
  Self Link:           /apis/volumesnapshot.external-storage.k8s.io/v1/namespaces/default/volumesnapshots/snapshot-demo
  UID:                 7b89194a-658c-11e9-86b2-000d3a07ff79
Spec:
  Persistent Volume Claim Name:  hostpath-pvc
  Snapshot Data Name:
Status:
  Conditions:          <nil>
  Creation Timestamp:  <nil>
Events:  <none>
The snapshot resource is created; however, the VolumeSnapshotData is NOT created.
kubectl get volumesnapshotdata
No resources found.
kubectl get crd
NAME CREATED AT
volumesnapshotdatas.volumesnapshot.external-storage.k8s.io 2019-04-21T04:18:54Z
volumesnapshots.volumesnapshot.external-storage.k8s.io 2019-04-21T04:18:54Z
kubectl get pod
NAME READY STATUS RESTARTS AGE
azure 1/1 Running 2 2d21h
azure-2 1/1 Running 2 2d20h
snapshot-controller-5d798696ff-qsh6m 2/2 Running 2 14h
ls /tmp/test/
data
I enabled the feature gate for volume snapshots:
cat /var/snap/kube-apiserver/924/args
--advertise-address="192.168.0.4"
--min-request-timeout="300"
--etcd-cafile="/root/cdk/etcd/client-ca.pem"
--etcd-certfile="/root/cdk/etcd/client-cert.pem"
--etcd-keyfile="/root/cdk/etcd/client-key.pem"
--etcd-servers="https://192.168.0.4:2379"
--storage-backend="etcd3"
--tls-cert-file="/root/cdk/server.crt"
--tls-private-key-file="/root/cdk/server.key"
--insecure-bind-address="127.0.0.1"
--insecure-port="8080"
--audit-log-maxbackup="9"
--audit-log-maxsize="100"
--audit-log-path="/root/cdk/audit/audit.log"
--audit-policy-file="/root/cdk/audit/audit-policy.yaml"
--basic-auth-file="/root/cdk/basic_auth.csv"
--client-ca-file="/root/cdk/ca.crt"
--requestheader-allowed-names="system:kube-apiserver"
--requestheader-client-ca-file="/root/cdk/ca.crt"
--requestheader-extra-headers-prefix="X-Remote-Extra-"
--requestheader-group-headers="X-Remote-Group"
--requestheader-username-headers="X-Remote-User"
--service-account-key-file="/root/cdk/serviceaccount.key"
--token-auth-file="/root/cdk/known_tokens.csv"
--authorization-mode="AlwaysAllow"
--admission-control="NamespaceLifecycle,LimitRanger,ServiceAccount,PersistentVolumeLabel,DefaultStorageClass,DefaultTolerationSeconds,MutatingAdmissionWebhook,ValidatingAdmissionWebhook,ResourceQuota"
--allow-privileged=true
--enable-aggregator-routing
--kubelet-certificate-authority="/root/cdk/ca.crt"
--kubelet-client-certificate="/root/cdk/client.crt"
--kubelet-client-key="/root/cdk/client.key"
--kubelet-preferred-address-types="[InternalIP,Hostname,InternalDNS,ExternalDNS,ExternalIP]"
--proxy-client-cert-file="/root/cdk/client.crt"
--proxy-client-key-file="/root/cdk/client.key"
--service-cluster-ip-range="10.152.183.0/24"
--logtostderr
--v="4"
--feature-gates="VolumeSnapshotDataSource=true"
What am I missing here?
I think everything you need is already present here: https://github.com/kubernetes-incubator/external-storage/tree/master/snapshot/deploy/kubernetes/hostpath
There is one YAML for the deployment of the snapshot controller and one YAML for the snapshotter's RBAC rules.
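A rough sketch of applying them, assuming you clone the repo first (the exact file names inside that directory may differ, so list them before applying):
git clone https://github.com/kubernetes-incubator/external-storage.git
cd external-storage/snapshot/deploy/kubernetes/hostpath
ls                   # check which YAMLs are present (controller deployment + RBAC)
kubectl create -f .  # creates the snapshot-controller Deployment and the RBAC objects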