Is there a DataDog metric that reports the space used or remaining in a GCP PersistentVolume? I have found disk usage metrics for the container itself, but not for a PersistentVolume.
I am working in Google Cloud Platform.
Found the answer:
The metric kubernetes.kubelet.volume.stats.used_bytes reports disk usage on PersistentVolumes. It can be broken down per volume using the persistentvolumeclaim tag on this metric.
You can view this metric and tag combination in your account here: https://app.datadoghq.com/metric/summary?filter=kubernetes.kubelet.volume.stats.used_bytes&metric=kubernetes.kubelet.volume.stats.used_bytes
Documentation on this metric can be found here: https://docs.datadoghq.com/agent/kubernetes/data_collected/#kubelet
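For example, a query that graphs usage per claim might look like the following (a sketch in Datadog's query syntax; scope it further with your own cluster or namespace tags as needed):
avg:kubernetes.kubelet.volume.stats.used_bytes{*} by {persistentvolumeclaim}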
Related
I want to check the maximum and average resource usage of a Kubernetes Pod, but I cannot find any relevant information. I also checked Lens (third-party software), but it only shows the current usage, plus usage and limits for the past 1 hour.
How can I find the maximum usage of a Pod?
If you don't specify resource limits in your Pod creation YAML file, it takes these values by default:
Create the Pod. The output shows that the Pod's container has a memory
request of 256 MiB and a memory limit of 512 MiB. These are the
default values specified by the LimitRange.
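Those defaults come from a LimitRange object in the namespace; a minimal sketch of one that produces the values above (based on the Kubernetes docs example, the name is illustrative):
apiVersion: v1
kind: LimitRange
metadata:
  name: mem-limit-range    # illustrative name
spec:
  limits:
  - default:               # default limit for containers that don't set one
      memory: 512Mi
    defaultRequest:        # default request for containers that don't set one
      memory: 256Mi
    type: Container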
You have more info here.
There is an article here if you want to specify your Pod limits manually.
PS: if you enable Prometheus in your Lens, you can visualize your different metrics (pod usage and limits for CPU, memory, network, and filesystem).
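For example, once Prometheus is scraping the cAdvisor metrics, a query along these lines returns the peak memory working set of a Pod over the last day (namespace and pod are placeholders; on older clusters the label is pod_name rather than pod):
max_over_time(container_memory_working_set_bytes{namespace="default", pod="my-pod"}[1d])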
kubectl describe quota
Or within a different namespace:
kubectl describe quota --namespace=<your-namespace>
In the Kubernetes documentation here: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#algorithm-details
there is this passage:
For object metrics and external metrics, a single metric is fetched, which describes the object in question. This metric is compared to the target value, to produce a ratio as above. In the autoscaling/v2beta2 API version, this value can optionally be divided by the number of pods before the comparison is made.
I need to do exactly this; divide my current metric by the current number of pods.
Where can I find the specification for this API? I have googled frantically to see what the autoscaling YAML specification is to do this, but I cannot find it. I.e., I need to write the autoscaler resource as part of our Helm chart.
The specification for the k8s API can be found here: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.18/
The above is for k8s version 1.18; you'll have to switch to the right version for you.
The spec for HPA v2beta2 would be here: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.18/#horizontalpodautoscaler-v2beta2-autoscaling
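Concretely, the "divided by the number of pods" behaviour corresponds to a target of type AverageValue on an Object (or External) metric. A hedged sketch of a v2beta2 resource, with all names and values as placeholders:
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Object
    object:
      describedObject:
        apiVersion: networking.k8s.io/v1beta1
        kind: Ingress
        name: main-route
      metric:
        name: requests-per-second
      target:
        type: AverageValue     # the fetched metric is divided by the current replica count
        averageValue: "2k"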
Google publishes a tutorial for using custom metrics to drive the HorizontalPodAutoscaler here, and this tutorial contains instructions for:
Using a Kubernetes manifest to deploy the custom metrics adapter into a custom-metrics namespace.
Deploying a dummy application to generate metrics.
Configuring the HPA to use custom metrics.
We are deploying into a default cluster without any special VPC rules, and we have roughly followed the tutorial's guidance, with a few exceptions:
We're using Helm v2, and rather than grant cluster admin role to Tiller, we have granted all of the necessary cluster roles and role bindings to allow the custom-metrics-adapter-deploying Kubernetes manifest to work. We see no issues there; at least the custom metrics adapter spins up and runs.
We have defined some custom metrics that are based upon data extracted from a jsonPayload in Stackdriver logs.
We have deployed a minute-by-minute CronJob that reads the above metrics and publishes a derived metric, which is the value we want to use to drive the autoscaler. The CronJob is working, and we can see the derived metric, on a per-Pod basis, in the metrics explorer.
We're configuring the HPA to scale based on the average of the derived metric across all of the pods belonging to a stateful set (The HPA has a metrics entry with type Pods). However, the HPA is unable to read our derived metric. We see this error message:
failed to get object metric value: unable to get metric xxx_scaling_metric: no metrics returned from custom metrics API
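For reference, the relevant part of our HPA spec looks roughly like this (the target value here is illustrative):
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: xxx
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: xxx
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Pods
    pods:
      metric:
        name: xxx_scaling_metric
      target:
        type: AverageValue     # averaged across the pods of the stateful set
        averageValue: "100"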
Update
We were seeing DNS errors, but these were apparently false alarms, perhaps logged while the cluster was spinning up.
We restarted the Stackdriver metrics adapter with the command line option --v=5 to get some more verbose debugging. We see log entries like these:
I0123 20:23:08.069406 1 wrap.go:47] GET /apis/custom.metrics.k8s.io/v1beta1/namespaces/defaults/pods/%2A/xxx_scaling_metric: (56.16652ms) 200 [kubectl/v1.13.11 (darwin/amd64) kubernetes/2e298c7 10.44.1.1:36286]
I0123 20:23:12.997569 1 translator.go:570] Metric 'xxx_scaling_metric' not found for pod 'xxx-0'
I0123 20:23:12.997775 1 wrap.go:47] GET /apis/custom.metrics.k8s.io/v1beta2/namespaces/default/pods/%2A/xxx_scaling_metric?labelSelector=app%3Dxxx: (98.101205ms) 200 [kube-controller-manager/v1.13.11 (linux/amd64) kubernetes/56d8986/system:serviceaccount:kube-system:horizontal-pod-autoscaler 10.44.1.1:36286]
So it looks to us as if the HPA is making the right query for pods-based custom metrics. If we ask the custom metrics API what data it has, and filter with jq to our metric of interest, we see:
{"kind":"MetricValueList",
"apiVersion":"custom.metrics.k8s.io/v1beta1",
"metadata: {"selfLink":"/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/%2A/xxx_scaling_metric"},
"items":[]}
That the items array is empty is troubling. Again, we can see data in the metrics explorer, so we're left to wonder if our CronJob app that publishes our scaling metric is supplying the right fields in order for the data to be saved in Stackdriver or exposed through the metrics adapter.
For what it's worth, the resource.labels map for the time series that we're publishing in our CronJob looks like this:
{'cluster_name': 'test-gke',
'zone': 'us-central1-f',
'project_id': 'my-project-1234',
'container_name': '',
'instance_id': '1234567890123456789',
'pod_id': 'xxx-0',
'namespace_id': 'default'}
We finally solved this. Our CronJob that's publishing the derived metric we want to use is getting its raw data from two other metrics that are extracted from Stackdriver logs, and calculating a new value that it publishes back to Stackdriver.
We were using the resource labels that we saw on those metrics when publishing our derived metric. The POD_ID resource label value in the "input" Stackdriver metrics we are reading is the name of the pod. However, the Stackdriver custom metrics adapter at gcr.io/google-containers/custom-metrics-stackdriver-adapter:v0.10.0 enumerates pods in a namespace and asks Stackdriver for data associated with pods' UIDs, not their names. (We had to read the adapter's source code to figure this out.)
So our CronJob now builds a map of pod names to pod UIDs (which requires RBAC permission to list and get pods), and publishes the derived metric we use for the HPA with POD_ID set to the pod's UID instead of its name.
The reason that published examples of custom metrics for HPA (like this) work is that they use the Downward API to get a pod's UID, and provide that value as "POD_ID". In retrospect, that should have been obvious, if we had looked at how the "dummy" metrics exporters got their pod id values, but there are certainly examples (as in Stackdriver logging metrics) where POD_ID ends up being a name and not a UID.
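For illustration, the Downward API wiring those examples rely on looks something like this (a sketch; the environment variable name is whatever your exporter reads):
env:
- name: POD_ID
  valueFrom:
    fieldRef:
      fieldPath: metadata.uid    # the pod's UID, not its name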
I have a kubernetes 1.13 cluster running on Azure and I'm using multiple persistent volumes for multiple applications.
I have setup monitoring with Prometheus, Alertmanager, Grafana.
But I'm unable to get any metrics related to the PVs.
It seems that the kubelet started to expose some of these metrics in Kubernetes 1.8, but stopped again in 1.12.
I have already spoken to Azure team about any workaround to collect the metrics directly from the actual FileSystem (Azure Disk in my case). But even that is not possible.
I have also heard some people using sidecars in the Pods to gather PV metrics. But I'm not getting any help on that either.
It would be great even if I get just basic details like consumed / available free space.
I was having the same issue and solved it by joining two metrics:
avg(label_replace(
1 - node_filesystem_free_bytes{mountpoint=~".*pvc.*"} / node_filesystem_size_bytes,
"volumename", "$1", "mountpoint", ".*(pvc-[^/]*).*")) by (volumename)
+ on(volumename) group_left(namespace, persistentvolumeclaim)
(0 * kube_persistentvolumeclaim_info)
As an explanation: I'm adding a volumename label to every node_filesystem_* time series, cut out of the existing mountpoint label, and then joining with the other metric, which carries the additional labels. Multiplying by 0 ensures the join is otherwise a no-op.
Also, a quick warning: you (or I) may be using relabeling configs that keep this from working immediately without adaptation.
I've set up prometheus to monitor kubernetes metrics by following the prometheus documentation.
A lot of useful metrics now show up in prometheus.
However, I can't see any metrics referencing the status of my pods or nodes.
Ideally - I'd like to be able to graph the pod status (Running, Pending, CrashLoopBackOff, Error) and nodes (NodeReady, Ready).
Is this metric anywhere? If not, can I add it somewhere? And how?
The regular Kubernetes setup does not expose these metrics; further discussion here.
However, another service can be used to collect these cluster level metrics: https://github.com/kubernetes/kube-state-metrics.
This currently provides node_status_ready and pod_container_restarts, which sound like what I want.
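Once kube-state-metrics is deployed, Prometheus just needs to scrape it; a minimal sketch, assuming the default service name and metrics port in kube-system:
scrape_configs:
  - job_name: kube-state-metrics
    static_configs:
      - targets: ['kube-state-metrics.kube-system.svc:8080']   # default metrics port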
I don't think such metrics exist.
You have to modify the source code to add them. Take a look at this file on how to register a metric: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/metrics/metrics.go,
and take a look at this line on how to record a metric: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/pleg/generic.go#L180
I've found that I can monitor these metrics using heapster & snap, which is a plausible workaround for my case. Let me know if that's something you're also using and I'll give you the proper metrics to get this data.