I have a kubernetes 1.13 cluster running on Azure and I'm using multiple persistent volumes for multiple applications.
I have setup monitoring with Prometheus, Alertmanager, Grafana.
But I'm unable to get any metrics related to the PVs.
It seems that the kubelet started to expose some of these metrics in Kubernetes 1.8, but stopped again in 1.12.
I have already spoken to the Azure team about a workaround to collect the metrics directly from the actual file system (Azure Disk in my case), but even that is not possible.
I have also heard of some people using sidecars in their Pods to gather PV metrics, but I haven't been able to find any help on that either.
It would be great even if I could get just basic details like consumed and available space.
I was having the same issue and solved it by joining two metrics:
avg(label_replace(
1 - node_filesystem_free_bytes{mountpoint=~".*pvc.*"} / node_filesystem_size_bytes,
"volumename", "$1", "mountpoint", ".*(pvc-[^/]*).*")) by (volumename)
+ on(volumename) group_left(namespace, persistentvolumeclaim)
(0 * kube_persistentvolumeclaim_info)
As an explanation: I'm adding a volumename label to every node_filesystem_* time series, cut out of the existing mountpoint label, and then joining with the other metric, which contributes the additional labels. Multiplying by 0 ensures the join is otherwise a no-op.
A quick warning: either of us may be using relabeling configs that keep this from working immediately without adaptation.
I am not really sure if this is a Prometheus issue, just Longhorn, or a combination of the two.
Setup:
Kubernetes K3s v1.21.9+k3s1
Rancher Longhorn Storage Provider 1.2.2
Prometheus Helm Chart 32.2.1 and image: quay.io/prometheus/prometheus:v2.33.1
Problem:
Infinitely growing PV in Longhorn, even over the defined max size. Currently using 75G on a 50G volume.
Description:
I have a really small 3-node cluster with not too many deployments running. Currently there is only one "real" application; the rest is just Kubernetes system stuff so far.
Apart from etcd, I am using all the default scraping rules.
The PV is filling up a bit more than 1 GB per day, which seems fine to me.
The problem is that, for whatever reason, the data used inside Longhorn keeps growing indefinitely. I have configured retention rules for the Helm chart with retention: 7d and retentionSize: 25GB, so the retentionSize should never be reached anyway.
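For reference, this is roughly how such retention settings are expressed in the chart values (a sketch assuming the kube-prometheus-stack values layout, not my exact file; the storage class is an assumption):

prometheus:
  prometheusSpec:
    retention: 7d          # keep samples for at most 7 days
    retentionSize: 25GB    # and never let the TSDB grow beyond 25GB
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: longhorn   # assumption: the Longhorn storage class
          resources:
            requests:
              storage: 50Gi            # the 50G volume mentioned above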
When I log into the container's shell and do a du -sh in /prometheus, it shows ~8.7GB being used, which looks good to me as well.
The problem is that when I look at the Longhorn UI, the used space is growing all the time. The PV has existed for ~20 days now and is currently using almost 75GB of a defined max of 50GB. When I take a look at the Kubernetes node itself and inspect the folder which Longhorn uses to store its PV data, I see the same space usage as in the Longhorn UI, while inside the Prometheus container everything looks fine to me.
I hope someone has an idea what the problem could be. I have not experienced this issue with any other deployment so far; all the others behave well and actually shrink when something inside the container gets deleted.
Google publishes a tutorial for using custom metrics to drive the HorizontalPodAutoscaler here, and this tutorial contains instructions for:
Using a Kubernetes manifest to deploy the custom metrics adapter into a custom-metrics namespace.
Deploying a dummy application to generate metrics.
Configuring the HPA to use custom metrics.
We are deploying into a default cluster without any special VPC rules, and we have roughly followed the tutorial's guidance, with a few exceptions:
We're using Helm v2, and rather than granting the cluster-admin role to Tiller, we have granted all of the necessary cluster roles and role bindings to allow the custom-metrics-adapter-deploying Kubernetes manifest to work. We see no issues there; at least the custom metrics adapter spins up and runs.
We have defined some custom metrics that are based upon data extracted from a jsonPayload in Stackdriver logs.
We have deployed a minute-by-minute CronJob that reads the above metrics and publishes a derived metric, which is the value we want to use to drive the autoscaler. The CronJob is working, and we can see the derived metric, on a per-Pod basis, in the metrics explorer.
We're configuring the HPA to scale based on the average of the derived metric across all of the pods belonging to a StatefulSet (the HPA has a metrics entry with type Pods). However, the HPA is unable to read our derived metric. We see this error message:
failed to get object metric value: unable to get metric xxx_scaling_metric: no metrics returned from custom metrics API
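For context, a Pods-type metrics entry of the kind described above looks roughly like this (a sketch only; the metric name, StatefulSet name, and target value are placeholders, not our exact manifest):

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: xxx
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: xxx                      # placeholder workload name
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods                     # average the metric across the pods
    pods:
      metric:
        name: xxx_scaling_metric   # the custom metric served by the adapter
      target:
        type: AverageValue
        averageValue: "100"        # placeholder target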
Update
We were seeing DNS errors, but these were apparently false alarms, perhaps in the log while the cluster was spinning up.
We restarted the Stackdriver metrics adapter with the command line option --v=5 to get some more verbose debugging. We see log entries like these:
I0123 20:23:08.069406 1 wrap.go:47] GET /apis/custom.metrics.k8s.io/v1beta1/namespaces/defaults/pods/%2A/xxx_scaling_metric: (56.16652ms) 200 [kubectl/v1.13.11 (darwin/amd64) kubernetes/2e298c7 10.44.1.1:36286]
I0123 20:23:12.997569 1 translator.go:570] Metric 'xxx_scaling_metric' not found for pod 'xxx-0'
I0123 20:23:12.997775 1 wrap.go:47] GET /apis/custom.metrics.k8s.io/v1beta2/namespaces/default/pods/%2A/xxx_scaling_metric?labelSelector=app%3Dxxx: (98.101205ms) 200 [kube-controller-manager/v1.13.11 (linux/amd64) kubernetes/56d8986/system:serviceaccount:kube-system:horizontal-pod-autoscaler 10.44.1.1:36286]
So it looks to us as if the HPA is making the right query for pods-based custom metrics. If we ask the custom metrics API what data it has, and filter with jq to our metric of interest, we see:
{"kind":"MetricValueList",
"apiVersion":"custom.metrics.k8s.io/v1beta1",
"metadata: {"selfLink":"/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/%2A/xxx_scaling_metric"},
"items":[]}
That the items array is empty is troubling. Again, we can see data in the metrics explorer, so we're left to wonder if our CronJob app that publishes our scaling metric is supplying the right fields in order for the data to be saved in Stackdriver or exposed through the metrics adapter.
For what it's worth the resource.labels map for the time series that we're publishing in our CronJob looks like:
{'cluster_name': 'test-gke',
'zone': 'us-central1-f',
'project_id': 'my-project-1234',
'container_name': '',
'instance_id': '1234567890123456789',
'pod_id': 'xxx-0',
'namespace_id': 'default'}
We finally solved this. Our CronJob that publishes the derived metric we want to use gets its raw data from two other metrics that are extracted from Stackdriver logs, and calculates a new value that it publishes back to Stackdriver.
We were using the resource labels that we saw on those metrics when publishing our derived metric. The POD_ID resource label value in the "input" Stackdriver metrics we read is the name of the pod. However, the Stackdriver custom metrics adapter at gcr.io/google-containers/custom-metrics-stackdriver-adapter:v0.10.0 enumerates pods in a namespace and asks Stackdriver for data associated with the pods' UIDs, not their names. (We had to read the adapter's source code to figure this out.)
So our CronJob now builds a map of pod names to pod UIDs (which requires RBAC permissions to get and list pods), and publishes the derived metric we use for the HPA with POD_ID set to the pod's UID instead of its name.
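The RBAC piece for the CronJob looks roughly like this (a sketch with illustrative names; it still needs a RoleBinding to the CronJob's service account):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader        # illustrative name
  namespace: default
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list"]  # enough to map pod names to pod UIDs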
The reason that published examples of custom metrics for HPA (like this one) work is that they use the Downward API to get a pod's UID and provide that value as POD_ID. In retrospect, that should have been obvious if we had looked at how the "dummy" metrics exporters got their pod ID values, but there are certainly examples (as in Stackdriver logging metrics) where POD_ID ends up being a name and not a UID.
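For reference, this is how the Downward API exposes the pod UID to a container (a minimal fragment of a container spec; the env var name is arbitrary):

env:
- name: POD_ID
  valueFrom:
    fieldRef:
      fieldPath: metadata.uid   # the pod's UID, not its name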
We have a Kubernetes cluster with ~100 nodes running Istio and want to enable the Locality Load Balancing feature. This will save us up to 70k USD/year because our inter-zone data traffic is too high.
I've followed the docs and set up the Istio configmap like this:
...
meshNetworks: {}
localityLbSetting:
  enabled: true
  distribute:
  - from: us-east-1/us-east-1a/*
    to:
      "us-east-1/us-east-1a/*": 100
  - from: us-east-1/us-east-1b/*
    to:
      "us-east-1/us-east-1b/*": 100
...
And then I deployed 2 apps: one of them just responds with the zone of the node where it is deployed (we are using a VirtualService), and the other one just makes the requests.
The requests that are coming from node in us-east-1a should only be replied by the nodes in the same zone, right?
But it's not happening.
We also tried to set this variable inside pilot pods:
PILOT_ENABLE_LOCALITY_LOAD_BALANCING
When I get logs from one pod that is deployed in zone "us-east-1a" it shows replies from both zones.
Istio Version: 1.2.8
Kubernetes Version: 1.14
Any help is appreciated! Thank you!
I'm afraid your configuration of locality weights between regions/zones is invalid in the context of the Locality Load Balancing feature in 'distribute' mode.
The logs of your istio-pilot should give you a clue about it, in the form of a warning similar to this one:
<timestamp> warn failed to read mesh configuration, using default: 1 error occurred:
* locality weight must not be in range [1, 100]
I don't think you can find it documented anywhere in the Istio documentation, but the logic behind the weights' validation can be found here.
From #panicked's comment:
The pods where the requests are generated (src pods) have to belong to a K8s service themselves too, even if the service is not directly involved in the request.
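In other words, even the client-only pods need a Service selecting them, something like this minimal sketch (names, labels, and ports are placeholders):

apiVersion: v1
kind: Service
metadata:
  name: client-app          # placeholder name
spec:
  selector:
    app: client-app         # must match the labels of the source pods
  ports:
  - port: 80
    targetPort: 8080        # placeholder port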
As a side note, the Istio docs recommend:
If the goal of the operator is not to distribute load across zones and regions but rather to restrict the regionality of failover to meet other operational requirements an operator can set a ‘failover’ policy instead of a ‘distribute’ policy.
but the distribute policy seems to work just fine.
For failover (and failoverPriority) you must also have outlierDetection defined.
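As an illustration (not part of the original answer), an outlierDetection block on a DestinationRule can look roughly like this in more recent Istio versions; the host and thresholds are placeholders:

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: myservice
spec:
  host: myservice.default.svc.cluster.local   # placeholder host
  trafficPolicy:
    outlierDetection:                # required for locality failover to take effect
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s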
The GitHub repo of the Prometheus Operator project, https://github.com/coreos/prometheus-operator/, says that:
The Prometheus Operator makes the Prometheus configuration Kubernetes native and manages and operates Prometheus and Alertmanager clusters. It is a piece of the puzzle regarding full end-to-end monitoring.
kube-prometheus combines the Prometheus Operator with a collection of manifests to help getting started with monitoring Kubernetes itself and applications running on top of it.
Can someone elaborate on this?
I've always had this exact same question and repeatedly bumped into both, but to be honest, reading the above answer didn't clarify it for me; I needed a short explanation. I found this GitHub issue that just made it crystal clear to me.
https://github.com/coreos/prometheus-operator/issues/2619
Quoting nicgirault on GitHub:
At last I realized that prometheus-operator chart was packaging
kube-prometheus stack but it took me around 10 hours playing around to
realize this.
Here's my summarized explanation:
"kube-prometheus" and "Prometheus Operator Helm Chart" both do the same thing.
It's basically the Ingress / Ingress Controller concept, applied to Metrics / Prometheus Operator.
Both are a means of easily configuring, installing, and managing a huge distributed application (the Kubernetes Prometheus Stack) on Kubernetes.
What is the entire Kube Prometheus Stack, you ask? Prometheus, Grafana, Alertmanager, CRDs (Custom Resource Definitions), the Prometheus Operator (a software bot app), IaC alert rules, IaC Grafana dashboards, and IaC ServiceMonitor CRDs (which auto-generate the Prometheus metric collection configuration and hot-import it into the Prometheus server).
(Also, when I say easily configuring, I mean 1,000-10,000++ lines of easy-for-humans-to-understand config that generates and auto-manages 10,000-100,000 lines of machine config, with sensible defaults, monitoring configuration self-service, and distributed configuration sharding, with an operator/controller to combine config and generate verbose boilerplate machine-readable config from nice human-readable config.)
If they achieve the same end goal, you might ask what's the difference between them?
https://github.com/coreos/kube-prometheus
https://github.com/helm/charts/tree/master/stable/prometheus-operator
Basically, CoreOS's kube-prometheus deploys the Prometheus stack using Ksonnet.
The Prometheus Operator Helm Chart wraps kube-prometheus and achieves the same end result, but with Helm.
So which one to use?
Doesn't matter + they achieve the same end result + shouldn't be crazy difficult to start with 1 and switch to the other.
Helm tends to be faster to learn/develop basic mastery of.
Ksonnet is harder to learn/develop basic mastery of, but:
it's more idempotent (better for CI/CD automation), though it's only a difference of 99% idempotent vs 99.99% idempotent.
it has built-in templating, which means that if you have multiple clusters to manage that you want to keep consistent with each other, you can leverage Ksonnet's templating to manage multiple instances of the Kube Prometheus Stack (for multiple envs) using a DRY code base with lots of code reuse. (If you only have a few envs and Prometheus doesn't need to change often, it's not completely unreasonable to keep 4 Helm values files in sync by hand. I've also seen Jinja2 templating used to template out Helm values files, but if you're going to bother with that you may as well just consider Ksonnet.)
Kubernetes operators are Kubernetes-specific applications (pods) that configure, manage, and optimize other Kubernetes deployments automatically. They are implemented as custom controllers.
According to the official CoreOS website:
Operators were introduced by CoreOS as a class of software that operates other software, putting operational knowledge collected by humans into software.
The Prometheus Operator provides an easy way to deploy, configure, and monitor your Prometheus instances on a Kubernetes cluster. To do so, the Prometheus Operator introduces three types of custom resource definitions (CRDs) in Kubernetes:
Prometheus
Alertmanager
ServiceMonitor
Now, with the help of the above CRDs, you can directly create a Prometheus instance by providing kind: Prometheus, and the Prometheus instance is ready to serve; likewise for Alertmanager. Without this, you would have to set up the Deployment for Prometheus with its image, configuration, and many more things.
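For example, a minimal Prometheus custom resource can look roughly like this (a sketch; the service account name and selector labels are assumptions and must match your own RBAC and ServiceMonitor setup):

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: example
  namespace: monitoring
spec:
  replicas: 1
  serviceAccountName: prometheus     # assumes a ServiceAccount with suitable RBAC exists
  serviceMonitorSelector:
    matchLabels:
      team: frontend                 # picks up ServiceMonitors carrying this label
  resources:
    requests:
      memory: 400Mi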
The Prometheus Operator serves to make running Prometheus on top of Kubernetes as easy as possible, while preserving Kubernetes-native configuration options.
Now, kube-prometheus uses the Prometheus Operator and provides you with minimal YAML files to create a basic setup of Prometheus, Alertmanager, and Grafana by running a single command:
git clone https://github.com/coreos/prometheus-operator.git
kubectl apply -f prometheus-operator/contrib/kube-prometheus/manifests/
By running the above command in the kube-prometheus directory, you will get a monitoring namespace containing an instance of Alertmanager, Prometheus, and Grafana for the UI. This is enough for most basic implementations, and if you need anything more specific for your application, you can add more YAMLs for the exporters you need.
Kube-prometheus is more of a contribution to the prometheus-operator project; it implements the Prometheus Operator functionality very well and provides you with a complete monitoring setup for your Kubernetes cluster. You can start with kube-prometheus and extend the functionality of your monitoring setup from there according to your application.
You can learn more about the prometheus-operator here.
As of today, 28-09-2020, this is the way to install Prometheus in a Kubernetes cluster:
https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack#kube-prometheus-stack
According to the official documentation, kube-prometheus-stack is a rename of prometheus-operator.
As I understand it, kube-prometheus-stack also comes with preinstalled Grafana dashboards and Prometheus rules.
Note: This chart was formerly named prometheus-operator chart, now
renamed to more clearly reflect that it installs the kube-prometheus
project stack, within which Prometheus Operator is only one component.
Taken from https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack
Architecturally, the container runtime is Docker.
Container logs are managed by Docker by default, and the default log driver is json-file:
"log-driver": "json-file"
https://docs.docker.com/config/containers/logging/configure/
If the default json-file driver is used to manage container logs, log rotation is not performed by default. As a result, the log files stored by the json-file log driver can consume a large amount of disk space for containers that generate a lot of output, which can cause the disk to run out of space.
In this case, ship the logs to Elasticsearch, store them there separately, and periodically delete old indices using Curator.
You can run a scheduled task in Kubernetes to delete the indices periodically.
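A hedged sketch of such a scheduled task, assuming a public Curator image and a ConfigMap named curator-config holding the Curator configuration (the image tag, schedule, and ConfigMap are assumptions, not from the original text):

apiVersion: batch/v1beta1            # use batch/v1 on Kubernetes >= 1.21
kind: CronJob
metadata:
  name: curator-cleanup              # illustrative name
spec:
  schedule: "0 1 * * *"              # once a day at 01:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: curator
            image: untergeek/curator:5.8.4          # assumed Curator image/tag
            args: ["--config", "/etc/curator/config.yml", "/etc/curator/actions.yml"]
            volumeMounts:
            - name: curator-config
              mountPath: /etc/curator
          volumes:
          - name: curator-config
            configMap:
              name: curator-config                  # assumed ConfigMap with Curator config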
Another solution for disk space is to periodically rotate and delete old logs from the json-file driver.
Typically we set the size and number of log files.
The following sets a maximum of 10 log files, each with a maximum size of 20 MB, so the container has at most 200 MB of logs:
"log-driver": "json-file",
"log-opts": {
  "max-size": "20m",
  "max-file": "10"
}
Note: In general, the default Docker logs are placed under
/var/lib/docker/containers/
But Kubernetes also keeps logs and creates a directory structure to help you find Pod-based logs, so you can find the container logs for each Pod running on a node:
/var/log/pods/<namespace>_<pod_name>_<pod_id>/<container_name>/
When a Pod is removed, both the Docker logs under /var/lib/docker/containers/ and the Pod logs Kubernetes created under /var/log/pods/ will be deleted.
For example, if a Pod is restarted in production, its logs will be deleted whether it stays on the original node or is rescheduled onto another node.
Therefore, these logs need to be saved in Elasticsearch for centralized management; in most cases, development teams will need to check the logs for troubleshooting.
I've set up Prometheus to monitor Kubernetes metrics by following the Prometheus documentation.
A lot of useful metrics now show up in prometheus.
However, I can't see any metrics referencing the status of my pods or nodes.
Ideally, I'd like to be able to graph pod status (Running, Pending, CrashLoopBackOff, Error) and node status (NodeReady, Ready).
Is this metric anywhere? If not, can I add it somewhere? And how?
The regular kubernetes setup does not expose these metrics - further discussion here.
However, another service can be used to collect these cluster level metrics: https://github.com/kubernetes/kube-state-metrics.
This currently provides node_status_ready and pod_container_restarts which sound like what I want.
I don't think such metrics exist.
You have to modify the source code to add them. Take a look at this file on how to register a metric: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/metrics/metrics.go,
and take a look at this line on how to record a metric: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/pleg/generic.go#L180
I've found that I can monitor these metrics using Heapster and Snap, which is a plausible workaround for my case. Let me know if that's something you're also using, and I'll give you the proper metrics to get this data.