I would like to have Kubernetes use the local SSD in my Google Kubernetes Engine cluster without using alpha features. Is there a way to do this?
Thanks in advance for any suggestions or help.
https://cloud.google.com/kubernetes-engine/docs/concepts/local-ssd explains how to use local SSDs on your nodes in Google Kubernetes Engine. Based on the gcloud commands, the feature appears to be beta (not alpha) so I don't think you need to rely on any alpha features to take advantage of it.
You can use local SSDs with your Kubernetes nodes as explained in the documentation below. To create a cluster with local SSD disks:
Visit the Kubernetes Engine menu in GCP Console.
Click Create cluster.
Configure your cluster as desired. Then, from the Local SSD disks (per node) field, enter the desired number of SSDs as an absolute number.
Click Create.
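If you prefer the command line, the equivalent gcloud command should look roughly like this (the cluster name and disk count are placeholders):

gcloud container clusters create CLUSTER_NAME --local-ssd-count NUMBER_OF_DISKS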
To create a node pool with local SSD disks in an existing cluster:
Visit the Kubernetes Engine menu in GCP Console.
Select the desired cluster.
Click Edit.
From the Node pools menu, click Add node pool.
Configure the node pool as desired. Then, from the Local SSD disks (per node) field, enter the desired number of SSDs as an absolute number.
Click Save.
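Again, a roughly equivalent gcloud command for an existing cluster (pool and cluster names are placeholders):

gcloud container node-pools create POOL_NAME --cluster CLUSTER_NAME --local-ssd-count NUMBER_OF_DISKS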
Be aware of the disadvantages/limitations of local SSD storage in Kubernetes as explained in this documentation link:
Because local SSDs are physically attached to the node's host virtual machine instance, any data stored on them exists only on that node. Since the data stored on the disks is local, you should ensure that your application is resilient to this data being unavailable.
A Pod that writes to a local SSD might lose access to the data stored on the disk if the Pod is rescheduled away from that node. Additionally, upgrading a node causes the data to be erased.
You cannot add local SSDs to an existing node pool.
The above points are very important if you want high availability in your Kubernetes deployment.
Kubernetes local SSD storage is ephemeral and presents some problems for non-trivial applications when running in containers.
In Kubernetes, when a container crashes, kubelet will restart it, but the files in it will be lost because the container starts with a clean state.
Also, when running containers together in a Pod it is often necessary that those containers share files.
You can use the Kubernetes Volume abstraction to solve the above problems, as explained in the following documentation.
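As an illustration, here is a minimal sketch of a Pod that shares files between two containers through an emptyDir volume (the names and images are hypothetical, not from the documentation):

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: shared-files-demo
spec:
  containers:
  - name: writer
    image: busybox:1.36
    command: ["sh", "-c", "echo hello > /shared/hello.txt && sleep 3600"]
    volumeMounts:
    - name: shared
      mountPath: /shared
  - name: reader
    image: busybox:1.36
    command: ["sh", "-c", "sleep 5 && cat /shared/hello.txt && sleep 3600"]
    volumeMounts:
    - name: shared
      mountPath: /shared
  volumes:
  - name: shared
    emptyDir: {}   # lives as long as the Pod, so its data survives container restarts
EOF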
If you're looking to run the whole of Docker on SSDs in your Kubernetes cluster, this is how I did it on my node pool (Ubuntu nodes):
Go to Compute Engine > VM Instances
Edit your node to add a new SSD (explained in the first step "Create and attach a persistent disk in the Google Cloud Platform Console" here: https://cloud.google.com/compute/docs/disks/add-persistent-disk)
On your server:
# stop docker
sudo service docker stop
# format and mount disk
sudo mkfs.ext4 -m 0 -F -E lazy_itable_init=0,lazy_journal_init=0,discard /dev/sdb
sudo rm -rf /var/lib/docker
sudo mkdir -p /var/lib/docker
sudo mount -o discard,defaults /dev/sdb /var/lib/docker
sudo chmod 711 /var/lib/docker
# backup and edit fstab
sudo cp /etc/fstab /etc/fstab.backup
echo UUID=`sudo blkid -s UUID -o value /dev/sdb` /var/lib/docker ext4 discard,defaults,nofail 0 2 | sudo tee -a /etc/fstab
# start docker
sudo service docker start
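To confirm Docker ended up on the new disk, you can check afterwards:

df -h /var/lib/docker
sudo docker info | grep 'Docker Root Dir'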
As mentioned by others, you might want to look into the "Local SSDs" option provided by GKE first. The reason that option didn't cut it for me was that my nodes needed a single SSD of 4 TB, and as I understand it, local SSDs come in a fixed size.
One of our containers is using ephemeral storage but we don't know why. The app running in the container shouldn't be writing anything to the disk.
We set the storage limit to 20MB but it's still being evicted. We could increase the limit, but this seems like a band-aid fix.
We're not sure what or where this container is writing to, and I'm not sure how to check that. When a container is evicted, the only information I can see is that the container exceeded its storage limit.
Is there an efficient way to know what's being written, or is our only option to comb through the code?
Adding details to the topic.
Pods use ephemeral local storage for scratch space, caching, and logs.
Pods can be evicted due to other pods filling the local storage, after which new pods are not admitted until sufficient storage has been reclaimed.
The kubelet can provide scratch space to Pods using local ephemeral storage to mount emptyDir volumes into containers.
For container-level isolation, if a container's writable layer and log usage exceeds its storage limit, the kubelet marks the Pod for eviction.
For pod-level isolation the kubelet works out an overall Pod storage limit by summing the limits for the containers in that Pod. In this case, if the sum of the local ephemeral storage usage from all containers and also the Pod's emptyDir volumes exceeds the overall Pod storage limit, then the kubelet also marks the Pod for eviction.
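For reference, a per-container limit of the kind described above is set under resources in the Pod spec; a minimal sketch (the name, image, and sizes are placeholders):

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: ephemeral-limit-demo
spec:
  containers:
  - name: app
    image: nginx:1.25
    resources:
      requests:
        ephemeral-storage: "20Mi"   # considered by the scheduler for placement
      limits:
        ephemeral-storage: "20Mi"   # exceeding this marks the Pod for eviction
EOF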
To see what files have been written since the pod started, you can run:
find / -mount -newer /proc -print
This will output a list of files modified more recently than '/proc'.
/etc/nginx/conf.d
/etc/nginx/conf.d/default.conf
/run/secrets
/run/secrets/kubernetes.io
/run/secrets/kubernetes.io/serviceaccount
/run/nginx.pid
/var/cache/nginx
/var/cache/nginx/fastcgi_temp
/var/cache/nginx/client_temp
/var/cache/nginx/uwsgi_temp
/var/cache/nginx/proxy_temp
/var/cache/nginx/scgi_temp
/dev
Also, try without the '-mount' option.
To see if any new files are being modified, you can run some variations of the following command in a Pod:
while true; do rm -f a; touch a; sleep 30; echo "monitoring..."; find / -mount -newer a -print; done
and check the file size using the du -h someDir command.
Also, as @gohm'c pointed out in his answer, you can use sidecar/ephemeral debug containers.
Read more about Local ephemeral storage here.
We're not sure what or where this container is writing to, and I'm not sure how to check that.
Try looking into the container's volumeMounts section for mounts backed by emptyDir, then add a sidecar container (e.g. busybox) to start a shell session where you can check the path. If your cluster supports ephemeral debug containers, you don't need the sidecar container.
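For example, on a cluster that supports ephemeral debug containers, something like this should drop you into a shell alongside the running container (the Pod and container names are placeholders):

kubectl debug -it POD_NAME --image=busybox --target=CONTAINER_NAME -- sh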
I have a home Kubernetes cluster with multiple SSDs attached to one of the nodes.
I currently have one persistent volume per mounted disk. Is there an easy way to create a persistent volume that can access data from multiple disks? I thought about symlinks, but they don't seem to work.
You would have to combine them at a lower level. The simplest approach would be Linux LVM but there's a wide range of storage strategies. Kubernetes orchestrates mounting volumes but it's not a storage management solution itself, just the last-mile bits.
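To illustrate, combining two disks with LVM could look roughly like this (the device names /dev/sdb and /dev/sdc are hypothetical, and this wipes them):

# register the raw disks as LVM physical volumes
sudo pvcreate /dev/sdb /dev/sdc
# pool them into a single volume group
sudo vgcreate vg-local /dev/sdb /dev/sdc
# create one logical volume spanning all the free space
sudo lvcreate -l 100%FREE -n lv-local vg-local
# format and mount it as a single filesystem backing a single PV
sudo mkfs.ext4 /dev/vg-local/lv-local
sudo mkdir -p /mnt/combined
sudo mount /dev/vg-local/lv-local /mnt/combined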
As already mentioned by coderanger, Kubernetes does not manage your storage at a lower level. While with cloud solutions there may be provisioners that do some of the work for you, with bare metal there aren't.
The closest thing that helps you manage local storage is the local volume static provisioner.
The local volume static provisioner manages the PersistentVolume
lifecycle for pre-allocated disks by detecting and creating PVs for
each local disk on the host, and cleaning up the disks when released.
It does not support dynamic provisioning.
Have a look at this article for more examples of it.
I have a trick which works for me.
You can mount these disks at a directory like /disks/, then create a loop filesystem, mount it, and make symbolic links inside the loop filesystem that point at the disk mount points.
for example:
touch ~/disk-bunch1 && truncate -s 32M ~/disk-bunch1 && mke2fs -t ext4 -F ~/disk-bunch1
Mount it and create the symbolic links from the disks into the loop filesystem:
mkdir -p /local-pv/bunch1 && mount -o loop ~/disk-bunch1 /local-pv/bunch1
ln -s /disks/disk1 /local-pv/bunch1/disk1
ln -s /disks/disk2 /local-pv/bunch1/disk2
Finally, use sig-storage-local-static-provisioner, change "hostDir" to "/local-pv" in values.yaml, and deploy the provisioner. A pod can then use multiple disks.
But this method has a drawback: when you run "kubectl get pv", the CAPACITY is just the size of the loop filesystem instead of the sum of the individual disk capacities...
By the way, this method is not recommended; you'd be better off with something like RAID 0 or LVM instead.
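For comparison, a RAID 0 setup with mdadm might look roughly like this (the device names are hypothetical, and this destroys any data on them):

# stripe two disks into one block device
sudo mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc
# format and mount the array as a single filesystem
sudo mkfs.ext4 /dev/md0
sudo mkdir -p /local-pv/raid0
sudo mount /dev/md0 /local-pv/raid0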
I have a 3 node Kubernetes cluster used for development.
One of the nodes has had the status "Attempting to reclaim ephemeral-storage" for 11 days.
How do I reclaim the storage?
Since it is just a development instance, I cannot extend the storage. I don't care about the existing data in the storage. How do I clear it?
Thanks
Just run the docker system prune command to free up space on the node; see below:
$ docker system prune -a --volumes
WARNING! This will remove:
- all stopped containers
- all networks not used by at least one container
- all volumes not used by at least one container
- all images without at least one container associated to them
- all build cache
Are you sure you want to continue? [y/N] y
Since it's a development environment, you can just drain the node to clear all pods and their data, and then uncordon it so pods can be scheduled again:
kubectl drain --delete-local-data --ignore-daemonsets $NODE_NAME && kubectl uncordon $NODE_NAME
The --delete-local-data flag cleans up the pods' local (emptyDir) data.
I have a Kubernetes cluster with 3 masters and 3 workers, and I want to restart one of the masters to update the master machine's system.
Can I just reboot the machine directly on the console with reboot,
or do some steps need to be taken before the reboot to avoid the risk of a service outage and data loss?
If you need to reboot a node (such as for a kernel upgrade, libc upgrade, hardware repair, etc.), and the downtime is brief, then when the kubelet restarts, it will attempt to restart the pods scheduled to it. If the reboot takes longer (the default time is 5 minutes, controlled by --pod-eviction-timeout on the controller-manager), then the node controller will terminate the pods that are bound to the unavailable node. If there is a corresponding replica set (or replication controller), then a new copy of the pod will be started on a different node. So, in the case where all pods are replicated, upgrades can be done without special coordination, assuming that not all nodes will go down at the same time.
If you want more control over the upgrading process, you may use the following workflow:
Use kubectl drain to gracefully terminate all pods on the node while marking the node as unschedulable:
kubectl drain $NODENAME
This keeps new pods from landing on the node while you are trying to get them off.
For pods with a replica set, the pod will be replaced by a new pod which will be scheduled to a new node. Additionally, if the pod is part of a service, then clients will automatically be redirected to the new pod.
For pods with no replica set, you need to bring up a new copy of the pod, and assuming it is not part of a service, redirect clients to it.
Perform maintenance work on the node.
Make the node schedulable again:
kubectl uncordon $NODENAME
Additionally, if the node is hosting etcd, then you need to be extra careful in terms of rolling upgrades of etcd and backing up the data.
Take a backup of etcd if the node hosts it. You can use the built-in command to back up the data, like:
ETCDCTL_API=3 etcdctl --endpoints=https://[127.0.0.1]:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key \
snapshot save /tmp/snapshot-pre-boot.db
Now drain the node using
kubectl drain <master01>
Do the System update | patches and reboot.
Now uncordon the node back to the cluster
kubectl uncordon <master01>
Whenever you wish to reboot the OS on a particular node (master or worker), Kubernetes is not aware of that action; it keeps all the cluster-related events in etcd key-value storage, backing up the most recent data. When you want to carefully prepare a cluster node reboot, you should drain the node from scheduling and gracefully terminate all the existing Pods.
If you compose any relevant K8s resource with a defined set of replicas, then the ReplicationController guarantees that a specified number of pod replicas are running at any one time across the available nodes. It simply re-spawns Pods if they fail a health check or are deleted or terminated, matching the desired replica count. In the case of master nodes, which host etcd, you need to be extra careful in terms of rolling upgrades of etcd and backing up the data.
1. Backup a single master
As mentioned previously, we need to back up etcd. In addition to that, we need the certificates and, optionally, the kubeadm configuration file for easily restoring the master. If you set up your cluster using kubeadm (with no special configuration), you can do it similar to this:
Backup certificates:
$ sudo cp -r /etc/kubernetes/pki backup/
Make etcd snapshot:
$ sudo docker run --rm -v $(pwd)/backup:/backup \
--network host \
-v /etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd \
--env ETCDCTL_API=3 \
k8s.gcr.io/etcd-amd64:3.2.18 \
etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
snapshot save /backup/etcd-snapshot-latest.db
Backup kubeadm-config:
$ sudo cp /etc/kubeadm/kubeadm-config.yaml backup/
Note that the contents of the backup folder should then be stored somewhere safe, where it can survive if the master is completely destroyed. You perhaps want to use e.g. AWS S3 (or similar) for this.
There are three commands in the example and all of them should be run on the master node. The first one copies the folder containing all the certificates that kubeadm creates. These certificates are used for secure communications between the various components in a Kubernetes cluster. The final command is optional and only relevant if you use a configuration file for kubeadm. Storing this file makes it easy to initialize the master with the exact same configuration as before when restoring it.
If the master update goes wrong, you can then simply restore the old version of the master node.
You can also automate etcd backups.
Doing a single backup manually may be a good first step, but you really need to make regular backups for them to be useful. The easiest way to do this is probably to take the commands from the example above, create a small script and a cron job that runs the script every now and then. But since we are running Kubernetes anyway, use a Kubernetes CronJob. This would allow you to keep track of the backup jobs inside Kubernetes just like you monitor your workloads; a sketch follows below.
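A minimal sketch of such a CronJob, assuming a kubeadm layout with etcd certificates under /etc/kubernetes/pki/etcd and backups written to a hostPath on the control-plane node (the schedule, image tag, labels, and paths are assumptions you should adapt):

kubectl apply -f - <<'EOF'
apiVersion: batch/v1          # use batch/v1beta1 on clusters older than 1.21
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"     # every six hours
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true   # so 127.0.0.1:2379 reaches etcd on the host
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""   # label may be "master" on older clusters
          tolerations:
          - key: node-role.kubernetes.io/control-plane
            operator: Exists
            effect: NoSchedule
          restartPolicy: OnFailure
          containers:
          - name: backup
            image: k8s.gcr.io/etcd-amd64:3.2.18   # reusing the image from the manual example
            command: ["/bin/sh", "-c"]
            args:
            - >
              ETCDCTL_API=3 etcdctl
              --endpoints=https://127.0.0.1:2379
              --cacert=/etc/kubernetes/pki/etcd/ca.crt
              --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt
              --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
              snapshot save /backup/etcd-snapshot-$(date +%F).db
            volumeMounts:
            - name: etcd-certs
              mountPath: /etc/kubernetes/pki/etcd
              readOnly: true
            - name: backup
              mountPath: /backup
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd
          - name: backup
            hostPath:
              path: /backup
EOF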
You can find more information here: backups-kubernetes.
2. The next step is to mark the node unschedulable; run this command:
$ kubectl drain $NODENAME
The kubectl drain command should only be issued to a single node at a time. However, you can run multiple kubectl drain commands for different nodes in parallel, in different terminals or in the background. Multiple drain commands running concurrently will still respect the PodDisruptionBudget you specify.
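For example, a minimal sketch of a PodDisruptionBudget that keeps at least two replicas of a hypothetical app available during drains:

kubectl apply -f - <<'EOF'
apiVersion: policy/v1          # use policy/v1beta1 on clusters older than 1.21
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb
spec:
  minAvailable: 2              # drain will not evict pods below this count
  selector:
    matchLabels:
      app: myapp
EOF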
3. Execute the system update or patch and reboot.
4. Finally uncordon the node back to the cluster, execute command below:
$ kubectl uncordon $NODENAME
On GCP there is an option for auto-upgrading nodes, which improves the management of node updates.
You can read more about maintenance of Kubernetes nodes here: node-maintenance.
In a cloud, we have a cluster of GlusterFS nodes (participating in a gluster volume) and clients (that mount the gluster volumes). These nodes are created using HashiCorp's Terraform.
Once the cluster is up and running, if we want to change the gluster machine configuration, like increasing the compute size from 4 CPUs to 8 CPUs, Terraform has a provision to recreate the nodes with the new configuration. So the existing gluster nodes are destroyed and new instances are created, but with the same IPs. In the newly created instance, the volume creation command fails, saying the brick is already part of a volume.
sudo gluster volume create VolName replica 2 transport tcp ip1:/mnt/ppshare/brick0 ip2:/mnt/ppshare/brick0
volume create: VolName: failed: /mnt/ppshare/brick0 is already part
of a volume
But no volumes are present in this instance.
I understand that if I have to expand or shrink a volume, I can add or remove bricks from the existing volume. Here, I'm changing the compute of the node and hence it has to be recreated. I don't understand why it should say the brick is already part of a volume, as it is a new machine altogether.
It would be very helpful if someone could explain why it says the brick is already part of a volume and where it stores the volume/brick information, so that I can recreate the volume successfully.
I also tried the steps below from this link to clear the GlusterFS volume-related attributes from the mount, but with no luck:
https://linuxsysadm.wordpress.com/2013/05/16/glusterfs-remove-extended-attributes-to-completely-remove-bricks/.
apt-get install attr
cd /glusterfs
for i in $(attr -lq .); do setfattr -x trusted.$i .; done
attr -lq /glusterfs (for testing, the output should be empty)
Simply put "force" in the end of "gluster volume create ..." command.
Please check whether you have the directories /mnt/ppshare/brick0 already created.
You should have /mnt/ppshare without the brick0 folder; the create command creates those folders. The error indicates that the brick0 folders are already present.