Cannot schedule Kubernetes pods with a request for nvidia.com/gpu

I have been able to get Kubernetes to recognise the GPUs on my nodes:
    $ kubectl get node MY_NODE -o yaml
    ...
      allocatable:
        cpu: "48"
        ephemeral-storage: "15098429006"
        hugepages-1Gi: "0"
        hugepages-2Mi: "0"
        memory: 263756344Ki
        nvidia.com/gpu: "8"
        pods: "110"
      capacity:
        cpu: "48"
        ephemeral-storage: 16382844Ki
        hugepages-1Gi: "0"
        hugepages-2Mi: "0"
        memory: 263858744Ki
        nvidia.com/gpu: "8"
        pods: "110"
    ...
I then spin up a pod with:
    Limits:
      cpu: 2
      memory: 2147483648
      nvidia.com/gpu: 1
    Requests:
      cpu: 500m
      memory: 536870912
      nvidia.com/gpu: 1
However, the pod stays in Pending with:
    Insufficient nvidia.com/gpu.
Am I specifying the resources correctly?

Have you installed the NVIDIA device plugin in Kubernetes?
    kubectl create -f nvidia.io/device-plugin.yml
Some devices are too old and cannot be health-checked, so the health checks must be disabled:
    containers:
    - image: nvidia/k8s-device-plugin:1.9
      name: nvidia-device-plugin-ctr
      env:
      - name: DP_DISABLE_HEALTHCHECKS
        value: "xids"
Take a look at:
Device plugin: https://kubernetes.io/docs/concepts/cluster-administration/device-plugins/
NVIDIA GitHub: https://github.com/NVIDIA/k8s-device-plugin
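The scheduler reports "Insufficient nvidia.com/gpu" when no node has enough unreserved units of the resource left for the new pod. A minimal sketch of that arithmetic follows, assuming the node's allocatable count is accurate. The function name `fits` and the numbers for already-running pods are made up for illustration; they are not taken from the asker's cluster.

```python
def fits(allocatable: int, already_requested: list[int], new_request: int) -> bool:
    """Return True if new_request fits into what is still free on the node.

    allocatable: units the node advertises (e.g. nvidia.com/gpu: "8")
    already_requested: requests of pods already scheduled on the node (hypothetical)
    """
    free = allocatable - sum(already_requested)
    return new_request <= free


# Hypothetical node: 8 GPUs advertised, two pods already holding 4 each.
print(fits(8, [4, 4], 1))
# Same node with only 6 GPUs reserved.
print(fits(8, [4, 2], 1))
```

If the first case matches your cluster, the request is specified correctly and the GPUs are simply taken by other pods; `kubectl describe node MY_NODE` lists the resources already allocated on the node.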

Related

How to modify Apache NiFi node addresses while deploying in Kubernetes?

In Kubernetes I would like to deploy an Apache NiFi cluster as a StatefulSet with 3 nodes.
The problem is that I would like to set the node addresses for each node from an init container in my YAML file.
I have to modify these parameters for each node in Kubernetes:
    'nifi.remote.input.host'
    'nifi.cluster.node.address'
I need these FQDNs set in the NiFi properties, one per node:
    nifi-0.nifi.NAMESPACENAME.svc.cluster.local
    nifi-1.nifi.NAMESPACENAME.svc.cluster.local
    nifi-2.nifi.NAMESPACENAME.svc.cluster.local
I have to modify the properties before deploying, so I tried the following init container, but it doesn't work:
    initContainers:
    - name: modify-nifi-properties
      image: busybox:v01
      command:
      - sh
      - -c
      - |
        # Modify nifi.properties to use the correct hostname for each node
        for i in {1..3}; do
          sed -i "s/nifi-$((i-1))/nifi-$((i-1)).nifinamespace.nifi.svc.cluster.local/g" /opt/nifi/conf/nifi.properties
        done
      resources:
        requests:
          cpu: 100m
          memory: 100Mi
        limits:
          cpu: 100m
          memory: 100Mi
How can I do this?
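One way to think about this: each StatefulSet pod has its own hostname (nifi-0, nifi-1, nifi-2), so every pod only needs to write its own FQDN and no loop over all nodes is required. The sketch below shows that logic in Python purely for illustration; the same steps could be written in the init container's shell. The function names and the sample properties text are hypothetical, and the FQDN pattern is the one given in the question. Note also that brace expansion such as `{1..3}` is a bash feature and may not be available in the busybox shell.

```python
import re

# The two properties named in the question.
KEYS = ("nifi.remote.input.host", "nifi.cluster.node.address")


def pod_fqdn(hostname: str, service: str, namespace: str,
             domain: str = "cluster.local") -> str:
    """Build the per-pod DNS name, e.g. nifi-0.nifi.NAMESPACENAME.svc.cluster.local."""
    return f"{hostname}.{service}.{namespace}.svc.{domain}"


def set_node_addresses(properties_text: str, fqdn: str) -> str:
    """Replace the value of each key in KEYS with fqdn, leaving other lines alone."""
    for key in KEYS:
        pattern = re.compile(rf"^{re.escape(key)}=.*$", flags=re.MULTILINE)
        # A function replacement avoids any backslash interpretation in fqdn.
        properties_text = pattern.sub(lambda m, k=key: f"{k}={fqdn}", properties_text)
    return properties_text


# Hypothetical excerpt of nifi.properties.
SAMPLE_PROPERTIES = (
    "nifi.remote.input.host=\n"
    "nifi.cluster.node.address=localhost\n"
    "nifi.web.http.port=8080\n"
)

fqdn = pod_fqdn("nifi-1", "nifi", "NAMESPACENAME")
print(set_node_addresses(SAMPLE_PROPERTIES, fqdn))
```

Inside a pod, the hostname argument would come from the pod's own hostname (for example `socket.gethostname()` in Python or `$(hostname)` in a shell) rather than being hard-coded.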

Autopurge config for zookeeper in kubernetes not working

I am trying to set the autopurge config for ZooKeeper in the release.yaml file, but it doesn't seem to work.
Even after adding purgeInterval = 1 and snapRetainCount = 5, the values are always autopurge.snapRetainCount=3 and autopurge.purgeInterval=0.
Below is the .yaml I am using for ZooKeeper in Kubernetes:
    zookeeper:
      ## If true, install the Zookeeper chart alongside Pinot
      ## ref: https://github.com/kubernetes/charts/tree/master/incubator/zookeeper
      enabled: true
      urlOverride: "my-zookeeper:2181/pinot"
      port: 2181
      replicaCount: 3
      autopurge:
        purgeInterval: 1
        snapRetainCount: 5
      env:
        ## The JVM heap size to allocate to Zookeeper
        ZK_HEAP_SIZE: "256M"
        #ZOO_MY_ID: 1
      persistence:
        enabled: true
      image:
        PullPolicy: "IfNotPresent"
      resources:
        requests:
          cpu: 200m
          memory: 256Mi
        limits:
          cpu: 500m
          memory: 1Gi
Can anyone please help?

Airflow Helm Chart Worker Node Error - CrashLoopBackOff

I am using the official Helm chart for Airflow. Every pod works properly except the worker.
Even in the worker pod, 2 of the containers (git-sync and worker-log-groomer) work fine.
The error happens in the 3rd container (worker), which goes into CrashLoopBackOff with exit code 137 (OOMKilled).
In my OpenShift cluster, memory usage is showing at 70%.
Although this error can come from a memory leak, that doesn't seem to be the case here. Please help, I have been stuck on this for a week now.
Output of kubectl describe pod airflow-worker-0:
    worker:
      Container ID: <>
      Image: <>
      Image ID: <>
      Port: <>
      Host Port: <>
      Args:
        bash
        -c
        exec \
        airflow celery worker
      State: Running
        Started: <>
      Last State: Terminated
        Reason: OOMKilled
        Exit Code: 137
        Started: <>
        Finished: <>
      Ready: True
      Restart Count: 3
      Limits:
        ephemeral-storage: 30G
        memory: 1Gi
      Requests:
        cpu: 50m
        ephemeral-storage: 100M
        memory: 409Mi
      Environment:
        DUMB_INIT_SETSID: 0
        AIRFLOW__CORE__FERNET_KEY: <> Optional: false
      Mounts:
        <>
    git-sync:
      Container ID: <>
      Image: <>
      Image ID: <>
      Port: <none>
      Host Port: <none>
      State: Running
        Started: <>
      Ready: True
      Restart Count: 0
      Limits:
        ephemeral-storage: 30G
        memory: 1Gi
      Requests:
        cpu: 50m
        ephemeral-storage: 100M
        memory: 409Mi
      Environment:
        GIT_SYNC_REV: HEAD
      Mounts:
        <>
    worker-log-groomer:
      Container ID: <>
      Image: <>
      Image ID: <>
      Port: <none>
      Host Port: <none>
      Args:
        bash
        /clean-logs
      State: Running
        Started: <>
      Ready: True
      Restart Count: 0
      Limits:
        ephemeral-storage: 30G
        memory: 1Gi
      Requests:
        cpu: 50m
        ephemeral-storage: 100M
        memory: 409Mi
      Environment:
        AIRFLOW__LOG_RETENTION_DAYS: 5
      Mounts:
        <>
I am fairly sure you know the answer; I have read all your articles on Airflow. Thank you :)
https://stackoverflow.com/users/1376561/marc-lamberti
The issue is caused by the limits set under "resources" in the Helm chart's values.yaml for the pods.
By default it is:
    resources: {}
but this causes an issue, as the pods can then use as much memory as they need.
Changing it to:
    resources:
      limits:
        cpu: 200m
        memory: 2Gi
      requests:
        cpu: 100m
        memory: 512Mi
makes it clear to the pod how much it can request and use.
This solved my issue.
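For reference, exit code 137 follows the usual convention of 128 plus the signal number, and signal 9 is SIGKILL, which is what the kernel sends when a container exceeds its memory limit. A small sketch of that decoding is below; `describe_exit_code` is a hypothetical helper name, and the signal lookup assumes a POSIX system such as Linux.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Decode a container exit code using the 128 + signal-number convention."""
    if code > 128:
        sig = signal.Signals(code - 128)  # e.g. 137 - 128 = 9
        return f"killed by {sig.name}"
    return f"exited with status {code}"


print(describe_exit_code(137))
```

A SIGKILL on its own does not prove an out-of-memory kill, which is why the "Reason: OOMKilled" line in the describe output is the more specific indicator.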

Dask Helm Chart - How to create Dask-CUDA-Worker nodes

I installed the Dask Helm Chart with a revised values.yaml to have 10 workers; however, instead of Dask Workers I want to create Dask CUDA Workers to take advantage of the NVIDIA GPUs on my multi-node cluster.
I tried to modify the values.yaml as follows to get Dask CUDA workers instead of Dask Workers, but the worker pods are unable to start. I have already set up the NVIDIA GPUs on all my Kubernetes nodes per the official instructions, so I'm not sure what Dask needs to see in order to create the Dask CUDA workers.
    worker:
      name: worker
      image:
        repository: "daskdev/dask"
        tag: 2.19.0
        pullPolicy: IfNotPresent
        dask_worker: "dask-cuda-worker"
        #dask_worker: "dask-worker"
        pullSecrets:
        #  - name: regcred
      replicas: 15
      default_resources:  # overwritten by resource limits if they exist
        cpu: 1
        memory: "4GiB"
      env:
      #  - name: EXTRA_APT_PACKAGES
      #    value: build-essential openssl
      #  - name: EXTRA_CONDA_PACKAGES
      #    value: numba xarray -c conda-forge
      #  - name: EXTRA_PIP_PACKAGES
      #    value: s3fs dask-ml --upgrade
      resources: {}
      #  limits:
      #    cpu: 1
      #    memory: 3G
      #    nvidia.com/gpu: 1
      #  requests:
      #    cpu: 1
      #    memory: 3G
      #    nvidia.com/gpu: 1
As dask-cuda-worker is not yet officially in the Dask image, you will need to pull a different image: rapidsai/rapidsai:latest

Error in running DPDK L2FWD application on a container managed by Kubernetes

I am trying to run the DPDK L2FWD application in a container managed by Kubernetes.
To achieve this I have done the steps below:
I have created a single-node K8s setup where both master and client are running on the host machine. As the network plug-in, I have used Calico.
To create a customized DPDK Docker image, I have used the Dockerfile below:
    FROM ubuntu:16.04
    RUN apt-get update
    RUN apt-get install -y net-tools
    RUN apt-get install -y python
    RUN apt-get install -y kmod
    RUN apt-get install -y iproute2
    RUN apt-get install -y net-tools
    ADD ./dpdk/ /home/sdn/dpdk/
    WORKDIR /home/sdn/dpdk/
To run the DPDK application inside the pod, the host directories below are mounted into the pod with privileged access:
    /mnt/huge
    /usr
    /lib
    /etc
Below is the k8s YAML used to create the pod:
    apiVersion: v1
    kind: Pod
    metadata:
      name: dpdk-pod126
    spec:
      containers:
      - name: dpdk126
        image: dpdk-test126
        imagePullPolicy: IfNotPresent
        command: ["/bin/sh"]
        args: ["-c", "while true; do echo hello; sleep 10;done"]
        resources:
          requests:
            memory: "2Gi"
            cpu: "100m"
        volumeMounts:
        - name: hostvol1
          mountPath: /mnt/huge
        - name: hostvol2
          mountPath: /usr
        - name: hostvol3
          mountPath: /lib
        - name: hostvol4
          mountPath: /etc
        securityContext:
          privileged: true
      volumes:
      - name: hostvol1
        hostPath:
          path: /mnt/huge
      - name: hostvol2
        hostPath:
          path: /usr
      - name: hostvol3
        hostPath:
          path: /home/sdn/kubernetes-test/libtest
      - name: hostvol4
        hostPath:
          path: /etc
The configurations below are already done on the host:
    Huge page mounting.
    Interface binding in user space.
After successful creation of the pod, when trying to run the DPDK L2FWD application inside it, I get the error below:
    root@dpdk-pod126:/home/sdn/dpdk# ./examples/l2fwd/build/l2fwd -c 0x0f -- -p 0x03 -q 1
    EAL: Detected 16 lcore(s)
    EAL: Detected 1 NUMA nodes
    EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
    EAL: No free hugepages reported in hugepages-1048576kB
    EAL: 1007 hugepages of size 2097152 reserved, but no mounted hugetlbfs found for that size
    EAL: FATAL: Cannot get hugepage information.
    EAL: Cannot get hugepage information.
    EAL: Error - exiting with code: 1
    Cause: Invalid EAL arguments
According to this, you might be missing medium: HugePages in your hugepage volume.
Also, hugepages can be a bit finicky. Can you provide the output of:
    cat /proc/meminfo | grep -i huge
and check whether there are any files in /mnt/huge?
Also maybe this can be helpful. Can you somehow check whether the hugepages are being mounted as mount -t hugetlbfs nodev /mnt/huge?
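A minimal sketch of reading the hugepage counters that the grep above prints, in case you want to check them programmatically. `parse_hugepages` is a hypothetical helper, and `SAMPLE_MEMINFO` is invented for illustration; it is not output from the asker's host.

```python
def parse_hugepages(meminfo_text: str) -> dict[str, int]:
    """Collect the numeric value of every /proc/meminfo line whose key mentions 'huge'."""
    info = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and "huge" in key.lower():
            info[key.strip()] = int(fields[0])  # drop the trailing unit, if any
    return info


# Invented sample in the /proc/meminfo layout.
SAMPLE_MEMINFO = """\
MemTotal:       65863024 kB
HugePages_Total:    1024
HugePages_Free:     1007
Hugepagesize:       2048 kB
"""

print(parse_hugepages(SAMPLE_MEMINFO))
```

On a real host the text would come from reading the file, for example `open("/proc/meminfo").read()`. A HugePages_Total of 0 would mean no pages are reserved at all, which is a different problem from pages being reserved but not mounted.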
First of all, you have to verify that you have enough hugepages in your system. Check it with kubectl:
    kubectl describe nodes
where you should see something like this:
    Capacity:
      cpu: 12
      ephemeral-storage: 129719908Ki
      hugepages-1Gi: 0
      hugepages-2Mi: 8Gi
      memory: 65863024Ki
      pods: 110
If hugepages-2Mi is 0, then Kubernetes does not see the mounted hugepages.
After mounting hugepages on your host, you can prepare your pod to work with hugepages. You don't need to mount the hugepages folder as you have shown. You can simply add an emptyDir volume like this:
    volumes:
    - name: hugepage-2mi
      emptyDir:
        medium: HugePages-2Mi
HugePages-2Mi is the medium name that corresponds to hugepages of 2 MiB size. If you want to use 1 GiB hugepages, the corresponding medium is HugePages-1Gi and the resource name is hugepages-1Gi.
After defining the volume, you can use it in volumeMounts like this:
    volumeMounts:
    - mountPath: /hugepages-2Mi
      name: hugepage-2mi
There is one additional step. You have to define resource limits for hugepages usage:
    resources:
      limits:
        hugepages-2Mi: 128Mi
        memory: 128Mi
      requests:
        memory: 128Mi
After all these steps, you can run your container with hugepages available inside it.
As @AdamTL mentioned, you can find additional info here
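The node capacity (hugepages-2Mi: 8Gi) and the container limit (hugepages-2Mi: 128Mi) are byte quantities, while the kernel and DPDK count whole pages. A short sketch of the conversion follows; `to_bytes` and `pages` are hypothetical helpers that only handle the Ki, Mi, and Gi suffixes used in this thread.

```python
# Binary suffixes used by the quantities in this thread.
UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def to_bytes(quantity: str) -> int:
    """Convert a quantity such as '128Mi' to bytes; a bare number is taken as bytes."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def pages(quantity: str, page_size: str = "2Mi") -> int:
    """Number of whole hugepages of page_size that fit in quantity."""
    return to_bytes(quantity) // to_bytes(page_size)


print(pages("128Mi"))  # pages covered by the container limit
print(pages("8Gi"))    # pages covered by the node capacity shown above
```

This helps when sizing the limit: the application's own hugepage demand, counted in 2 MiB pages, has to fit inside the limit given to the container.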