Kubernetes - free claimed resources of a Pod in a failed state

I have got the following template for a job:
apiVersion: batch/v1
kind: Job
metadata:
  name: "gpujob"
spec:
  completions: 1
  backoffLimit: 0
  ttlSecondsAfterFinished: 600000
  template:
    metadata:
      name: batch
    spec:
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: "test"
      containers:
        - name: myhub
          image: smat-jupyterlab
          env:
            - name: JUPYTERHUB_COOKIE_SECRET
              value: "sdadasdasda"
          resources:
            requests:
              memory: 500Gi
            limits:
              nvidia.com/gpu: 1
          command: ["/bin/bash", "/usr/local/bin/jobscript.sh", smat-job]
          volumeMounts:
            - name: data
              mountPath: /data
      restartPolicy: Never
      nodeSelector:
        dso-node-role: "inference"
As you can see, I request a lot of memory for the job. My question is: does a failed Pod free its claimed resources as soon as it is in a failed state? Due to regulations, I have to keep Pods in the cluster for one week; otherwise I would just set a very low ttlSecondsAfterFinished. I have read a lot of contradictory claims in articles, but found nothing in the official docs.
TL;DR: Does a failed Pod free the cluster resources it claimed? If not, what is a good way to free them?

Yes, a failed or completed Job leaves its containers in the Terminated state, and therefore the resources requested for them are freed.
You can easily confirm this with the command:
kubectl top pod
You should not see any pod associated with the failed Job still consuming resources.
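If the one-week retention requirement only applies to the finished Job and Pod objects, you can still rely on ttlSecondsAfterFinished: 7 days is 604800 seconds. The sketch below reuses the names from the question (other fields omitted); it keeps the objects visible for the required window and then cleans them up, while the requested CPU, memory, and GPU are already released the moment the containers terminate:
apiVersion: batch/v1
kind: Job
metadata:
  name: "gpujob"
spec:
  completions: 1
  backoffLimit: 0
  # 7 days * 24 h * 3600 s = 604800 s: the finished Pod stays visible
  # for the retention period but no longer counts against node capacity
  # once its containers have terminated.
  ttlSecondsAfterFinished: 604800
  template:
    spec:
      containers:
        - name: myhub
          image: smat-jupyterlab
      restartPolicy: Never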

Related

Ephemeral volume limit making pod errored out

I'm working on a task where I want to limit ephemeral volume usage to a certain number of Gi.
This is my deployment configuration file.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vol
  namespace: namespace1
  labels:
    app: vol
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vol
  template:
    metadata:
      labels:
        app: vol
    spec:
      containers:
        - name: vol
          image: <my-image>
          ports:
            - containerPort: 5000
          resources:
            limits:
              ephemeral-storage: "1Gi"
          volumeMounts:
            - name: ephemeral
              mountPath: "/volume"
      volumes:
        - name: ephemeral
          emptyDir: {}
The expected behaviour is that when the volume limit is reached the pod gets evicted, which is happening as expected.
The only problem is that after the default termination grace period the pod ends up in an error state with an ExceededGracePeriod warning, so I now have one pod running and one in an error state in my deployment.
I have tried increasing terminationGracePeriodSeconds, using a preStop hook, and setting a size limit on emptyDir: {}, but nothing has worked for me.
You can increase the ephemeral-storage limit to 2Gi; this might resolve your error. Refer to the Kubernetes documentation on quotas and limit ranges for ephemeral storage, where you can also find more details on how ephemeral storage consumption management works.
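As a hedged sketch of how the two knobs fit together (the sizes here are illustrative, not taken from the question), you can combine a sizeLimit on the emptyDir volume with ephemeral-storage requests and limits on the container, so the scheduler accounts for the scratch space and eviction happens at a predictable threshold:
apiVersion: v1
kind: Pod
metadata:
  name: vol-example            # hypothetical name for illustration
spec:
  containers:
    - name: vol
      image: <my-image>
      resources:
        requests:
          ephemeral-storage: "1Gi"
        limits:
          ephemeral-storage: "2Gi"   # raised limit, as suggested above
      volumeMounts:
        - name: ephemeral
          mountPath: "/volume"
  volumes:
    - name: ephemeral
      emptyDir:
        sizeLimit: 2Gi         # caps the emptyDir volume itself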

Flink task manager failed for volume "hadoop-config-volume" with Flink Kubernetes Operator

I'm developing an application using Flink Kubernetes Operator version 1.1.0, but I am receiving the error message below in the spawned taskmanager pods:
MountVolume.SetUp failed for volume "hadoop-config-volume" : "hadoop-config-name" not found
Unable to attach or mount volumes: unmounted volumes=[hadoop-config-volume], unattached volumes=[hadoop-xml hadoop-config-volume flink-config-volume flink-token-kk558]: timed out waiting for the condition
My flink app.yaml:
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: "${DEPLOYMENT_NAME}"
  namespace: data
spec:
  flinkConfiguration:
    taskmanager.numberOfTaskSlots: "2"
  flinkVersion: v1_15
  image: "${IMAGE_TAG}"
  imagePullPolicy: Always
  job:
    jarURI: local:///opt/flink/opt/executable.jar
    parallelism: 2
    state: running
    upgradeMode: stateless
  jobManager:
    resource:
      cpu: 1
      memory: 1024m
  podTemplate:
    apiVersion: v1
    kind: Pod
    metadata:
      namespace: bigdata
    spec:
      containers:
        - env:
            - name: HADOOP_CONF_DIR
              value: /hadoop/conf
          envFrom:
            - configMapRef:
                name: data-connection
          name: flink-main-container
          volumeMounts:
            - mountPath: /hadoop/conf
              name: hadoop-xml
      imagePullSecrets:
        - name: registry
      serviceAccount: flink
      volumes:
        - configMap:
            name: hadoop-conf
          name: hadoop-xml
  serviceAccount: flink
  taskManager:
    resource:
      cpu: 2
      memory: 5000m
From the documentation, I believe hadoop-config-name is an internal ConfigMap created by Flink to ship HDFS configuration to the taskmanager. I have already mounted my ConfigMap (containing "core-site.xml" and "hdfs-site.xml") at the $HADOOP_CONF_DIR directory.
Is this a Flink bug, or did I do something wrong with my setup?
For anyone facing the same issue, I fixed it by changing HADOOP_CONF_DIR -> HADOOP_CLASSPATH!
Flink would detect the env HADOOP_CONF_DIR and create a hadoop conf configmap if it exists (see https://github.com/apache/flink/blob/2851fac9c4c052876c80440b6b0b637603de06ea/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/kubeclient/decorators/HadoopConfMountDecorator.java#L86).
I guess you ran into the error because the operator could not access the HADOOP_CONF_DIR.
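For clarity, the fix boils down to renaming the environment variable in the podTemplate. A minimal sketch based on the manifest above, showing only the relevant part and keeping the same value as before:
podTemplate:
  apiVersion: v1
  kind: Pod
  spec:
    containers:
      - name: flink-main-container
        env:
          # Use HADOOP_CLASSPATH instead of HADOOP_CONF_DIR so the
          # operator does not generate and try to mount the
          # hadoop-config-volume that it cannot resolve.
          - name: HADOOP_CLASSPATH
            value: /hadoop/conf
        volumeMounts:
          - mountPath: /hadoop/conf
            name: hadoop-xml
    volumes:
      - configMap:
          name: hadoop-conf
        name: hadoop-xml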

How to mount PVC to a (katib) Job specification?

I'd like to mount a PVC in a (Katib) Job specification, but I can't find anything about it in the documentation, nor any example.
I'm pretty sure this should be possible, as a Job orchestrates pods and pods can mount PVCs. Or am I missing something?
Please find below the respective (katib) job specification
apiVersion: batch/v1
kind: Job
spec:
  template:
    spec:
      containers:
        - env:
            - name: training-container
          image: docker.io/romeokienzler/claimed-train-mobilenet_v2:0.3
          command:
            - "ipython"
            - "/train-mobilenet_v2.ipynb"
            - "optimizer=${trialParameters.optimizer}"
      restartPolicy: Never
You can add the volume and volume mount to your Katib job template so that all the HPO jobs on Katib share the same volumes, e.g.:
apiVersion: batch/v1
kind: Job
spec:
  template:
    spec:
      containers:
        - name: training-container
          image: docker.io/romeokienzler/claimed-train-mobilenet_v2:0.4
          command:
            - "ipython"
            - "/train-mobilenet_v2.ipynb"
            - "optimizer=${trialParameters.optimizer}"
          volumeMounts:
            - mountPath: /data/
              name: data-volume
      restartPolicy: Never
      volumes:
        - name: data-volume
          persistentVolumeClaim:
            claimName: data-pvc
Also make sure your PVC uses the ReadWriteMany access mode, so that pods on different nodes can mount the same volume at the same time.
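For completeness, a ReadWriteMany claim matching the claimName above could look like the sketch below; the size and storage class are placeholders, and the chosen class must actually support RWX (for example an NFS-backed provisioner):
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc
spec:
  accessModes:
    - ReadWriteMany            # lets pods on different nodes mount it concurrently
  resources:
    requests:
      storage: 10Gi            # placeholder size
  storageClassName: nfs-client # placeholder; must support ReadWriteMany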

Running Pod takes a long time for internal service to be accessible

I have implemented a gRPC service, built it into a container, and deployed it with k8s, in particular AWS EKS, as a DaemonSet.
The Pod starts and reaches Running status very quickly, but it takes a long time, typically 300s, for the actual service to become accessible.
In fact, when I run kubectl logs to print the log of the Pod, it is empty for a long time.
I have logged something at the very start of the service. In fact, my code looks like this:
package main

import "log"

func init() {
	log.Println("init")
}

func main() {
	// ...
}
So I am pretty sure that when there are no logs, the service has not started yet.
I understand that there may be a time gap between when the Pod is Running and when the actual process inside it is running. However, 300s looks too long to me.
Furthermore, this happens randomly; sometimes the service is ready almost immediately. By the way, my runtime image is based on chromedp headless-shell, not sure if that is relevant.
Could anyone provide some advice for how to debug and locate the problem? Many thanks!
Update
I did not set any readiness probes.
Running kubectl get -o yaml of my DaemonSet gives
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "1"
  creationTimestamp: "2021-10-13T06:30:16Z"
  generation: 1
  labels:
    app: worker
    uuid: worker
  name: worker
  namespace: collection-14f45957-e268-4719-88c3-50b533b0ae66
  resourceVersion: "47265945"
  uid: 88e4671f-9e33-43ef-9c49-b491dcb578e4
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: worker
      uuid: worker
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: "2112"
        prometheus.io/scrape: "true"
      creationTimestamp: null
      labels:
        app: worker
        uuid: worker
    spec:
      containers:
        - env:
            - name: GRPC_PORT
              value: "22345"
            - name: DEBUG
              value: "false"
            - name: TARGET
              value: localhost:12345
            - name: TRACKER
              value: 10.100.255.31:12345
            - name: MONITOR
              value: 10.100.125.35:12345
            - name: COLLECTABLE_METHODS
              value: shopping.ShoppingService.GetShop
            - name: POD_IP
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: status.podIP
            - name: DISTRIBUTABLE_METHODS
              value: collection.CollectionService.EnumerateShops
            - name: PERFORM_TASK_INTERVAL
              value: 0.000000s
          image: xxx
          imagePullPolicy: Always
          name: worker
          ports:
            - containerPort: 22345
              protocol: TCP
          resources:
            requests:
              cpu: 1800m
              memory: 1Gi
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
        - env:
            - name: CAPTCHA_PARALLEL
              value: "32"
            - name: HTTP_PROXY
              value: http://10.100.215.25:8080
            - name: HTTPS_PROXY
              value: http://10.100.215.25:8080
            - name: API
              value: 10.100.111.11:12345
            - name: NO_PROXY
              value: 10.100.111.11:12345
            - name: POD_IP
          image: xxx
          imagePullPolicy: Always
          name: source
          ports:
            - containerPort: 12345
              protocol: TCP
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /etc/ssl/certs/api.crt
              name: ca
              readOnly: true
              subPath: tls.crt
      dnsPolicy: ClusterFirst
      nodeSelector:
        api/nodegroup-app: worker
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
        - name: ca
          secret:
            defaultMode: 420
            secretName: ca
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
status:
  currentNumberScheduled: 2
  desiredNumberScheduled: 2
  numberAvailable: 2
  numberMisscheduled: 0
  numberReady: 2
  observedGeneration: 1
  updatedNumberScheduled: 2
Furthermore, there are two containers in the Pod. Only one of them is exceptionally slow to start, and the other one is always fine.
When you use HTTP_PROXY in your solution, watch out for how it may route traffic differently from your underlying cluster network, which often results in unexpected timeouts.
I have posted a community wiki answer to summarize the topic:
As gohm'c has mentioned in the comment:
Do connections made by container "source" always have to go thru HTTP_PROXY, even if it is connecting services in the cluster - do you think possible long time been taken because of proxy? Can try kubectl exec -it <pod> -c <source> -- sh and curl/wget external services.
This is a good observation. Note that some connections can be made directly, and that sending extra traffic through the proxy may introduce delays; the proxy can, for example, become a bottleneck. You can read more about using an HTTP proxy to access the Kubernetes API in the documentation.
Additionally, you can create readiness probes to know when a container is ready to start accepting traffic.
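If the in-cluster calls made by the source container do not need the proxy, one common mitigation is to extend NO_PROXY so cluster-internal destinations bypass it. The sketch below shows only the relevant env entries; the CIDR and domain suffixes are examples rather than values from the question, and whether CIDRs are honoured depends on the HTTP library the application uses:
env:
  - name: HTTP_PROXY
    value: http://10.100.215.25:8080
  - name: HTTPS_PROXY
    value: http://10.100.215.25:8080
  - name: NO_PROXY
    # keep the original API endpoint and add the service CIDR and
    # cluster-local suffixes so in-cluster traffic skips the proxy;
    # the exact values depend on your cluster configuration
    value: "10.100.111.11:12345,10.100.0.0/16,localhost,127.0.0.1,.svc,.cluster.local"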
A Pod is considered ready when all of its containers are ready. One use of this signal is to control which Pods are used as backends for Services. When a Pod is not ready, it is removed from Service load balancers.
The kubelet uses startup probes to know when a container application has started. If such a probe is configured, it disables liveness and readiness checks until it succeeds, making sure those probes don't interfere with the application startup. This can be used to adopt liveness checks on slow starting containers, avoiding them getting killed by the kubelet before they are up and running.
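As a sketch of that idea applied to this DaemonSet, a startup probe on the container running the gRPC service (port 22345 is the GRPC_PORT from the manifest above; the thresholds are illustrative) would make the kubelet wait until the process actually listens before readiness and liveness checks take effect:
containers:
  - name: worker
    # image, env, and ports as in the DaemonSet above
    startupProbe:
      tcpSocket:
        port: 22345          # GRPC_PORT from the manifest above
      # allow up to 30 * 10s = 300s for startup before the regular
      # readiness/liveness probes take over; values are illustrative
      failureThreshold: 30
      periodSeconds: 10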

K8S cronjob scheduling on existing pod?

I have my application running in K8s pods. My application writes logs to a particular path, for which I already have a volume mounted on the pod. My requirement is to schedule a CronJob that triggers once a week, reads the logs from that pod's volume, generates a report based on my script (which basically filters the logs for certain keywords), and sends the report via mail.
Unfortunately, I am not sure how to proceed, as I couldn't find any doc or blog that talks about integrating a CronJob with an existing pod or volume.
apiVersion: v1
kind: Pod
metadata:
  name: webserver
spec:
  volumes:
    - name: shared-logs
      emptyDir: {}
  containers:
    - name: nginx
      image: nginx
      volumeMounts:
        - name: shared-logs
          mountPath: /var/log/nginx
    - name: sidecar-container
      image: busybox
      command: ["sh","-c","while true; do cat /var/log/nginx/access.log /var/log/nginx/error.log; sleep 30; done"]
      volumeMounts:
        - name: shared-logs
          mountPath: /var/log/nginx
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: "discovery-cronjob"
  labels:
    app.kubernetes.io/name: discovery
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: log-report
              image: busybox
              command: ['/bin/sh']
              args: ['-c', 'cat /var/log/nginx/access.log > nginx.log']
              volumeMounts:
                - mountPath: /log
                  name: shared-logs
          restartPolicy: Never
I see two things here that you need to know:
Unfortunately, it is not possible to schedule a CronJob on an existing pod. Pods are ephemeral, and a Job needs to run to completion in its own pod; otherwise it would be impossible to tell whether the job completed or not. This is by design.
Also, for one pod to see files written by another, you must use a PVC. The logs created by your app have to be persisted if your job wants to access them. Here you can find some examples of how to create ReadWriteMany PersistentVolumeClaims on your Kubernetes cluster:
Kubernetes allows us to provision our PersistentVolumes dynamically using PersistentVolumeClaims. Pods treat these claims as volumes. The access mode of the PVC determines how many nodes can establish a connection to it. We can refer to the resource provider's docs for their supported access modes.
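Putting both points together, a hedged sketch: replace the emptyDir with a ReadWriteMany claim that both the webserver pod and the CronJob's job pods mount. The claim name, size, and report command are placeholders, and the nginx pod's shared-logs volume would reference this same claim instead of emptyDir:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-logs-pvc        # placeholder name
spec:
  accessModes:
    - ReadWriteMany            # needed so the webserver pod and the job pod can both mount it
  resources:
    requests:
      storage: 5Gi             # placeholder size
---
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: "discovery-cronjob"
spec:
  schedule: "0 0 * * 0"        # weekly, per the requirement in the question
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: log-report
              image: busybox
              # placeholder for the actual filtering/report script
              command: ['/bin/sh', '-c', 'cat /log/access.log /log/error.log > /log/report.txt']
              volumeMounts:
                - mountPath: /log
                  name: shared-logs
          restartPolicy: Never
          volumes:
            - name: shared-logs
              persistentVolumeClaim:
                claimName: shared-logs-pvc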