Grafana is showing the wrong disk usage percentage in a graph. My GlusterFS disk usage is currently 8%, but the graph shows 7%.
Below are the metrics I am currently using.
{
  "hide": true,
  "target": "sumSeries(collectd.gls--01.df-gluster.df_complex-used)",
  "refId": "A"
},
{
  "hide": true,
  "target": "sumSeries(collectd.gls--01.df-gluster.df_complex-{free,used})",
  "refId": "B"
},
{
  "hide": false,
  "target": "asPercent(#A, #B)",
  "refId": "C"
}
I am also unable to see the percent_bytes-used metric in the collectd directory.
Depending on the ReportReserved setting in your collectd configuration, you might need to account for reserved disk space. If it is true (the default on collectd > 4), you will have to change your second metric to: 'df_complex-{free,used,reserved}'
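For example, the second target would then look like this (a sketch based on the series names above):

  "target": "sumSeries(collectd.gls--01.df-gluster.df_complex-{free,used,reserved})",

As for the missing percent_bytes-used metric: if I remember correctly, collectd's df plugin only writes the percent_bytes-* series when ValuesPercentage is enabled, e.g.:

  <Plugin df>
    ValuesAbsolute true
    ValuesPercentage true
  </Plugin>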
I am unable to save the JSON file. If I make changes to the file, after some time it reverts to the same values.
As part of a durable function app deployment, I am deploying Azure storage.
On deploying the fileServices/shares, I get the following error:
error": {
"code": "InvalidHeaderValue",
"message": "The value for one of the HTTP headers is not in the correct format.\nRequestId:6c0b3fb0-701a-0058-0509-a8af5d000000\nTime:2022-08-04T13:49:24.6378224Z"
}
I would appreciate any advice, as this is eating up a lot of time and I am no closer to resolving it.
The section of the ARM template for the share deployment is below:
{
  "type": "Microsoft.Storage/storageAccounts/fileServices/shares",
  "apiVersion": "2021-09-01",
  "name": "[concat(parameters('storageAccount1_name'), '/default/FuncAppName')]",
  "dependsOn": [
    "[resourceId('Microsoft.Storage/storageAccounts/fileServices', parameters('storageAccount1_name'), 'default')]",
    "[resourceId('Microsoft.Storage/storageAccounts', parameters('storageAccount1_name'))]"
  ],
  "properties": {
    "accessTier": "TransactionOptimized",
    "shareQuota": 5120,
    "enabledProtocols": "SMB"
  }
}
Answer: removing the property "accessTier": "TransactionOptimized" resolves the issue. The default value for this property is TransactionOptimized anyway.
Although the template exported from the Azure portal includes this property, deployment fails when it is present.
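For reference, the properties block that deploys successfully is simply the one above minus accessTier:

  "properties": {
    "shareQuota": 5120,
    "enabledProtocols": "SMB"
  }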
In my Grafana dashboards, the Pods drop-down shows None within a namespace, even though we have pods running in that namespace and Prometheus is pulling data.
Query:
"datasource": "Prometheus",
"definition": "",
"description": null,
"error": null,
"hide": 0,
"includeAll": false,
"label": "Pod",
"multi": false,
"name": "pod",
"options": [],
"query": {
"query": "query_result(sum(container_memory_working_set_bytes{namespace=\"$namespace\"}) by (pod_name))",
"refId": "Prometheus-pod-Variable-Query"
},
"refresh": 1,
"regex": "/pod_name=\\\"(.*?)(\\\")/",
"skipUrlSync": false,
"sort": 0,
"tagValuesQuery": "",
"tags": [],
"tagsQuery": "",
"type": "query",
"useTags": false
I imported this dashboard JSON:
https://grafana.com/grafana/dashboards/6879
Edit your dashboard's JSON:
Rename "pod_name" to "pod" in the two places (and save), as shown in the sketch below.
It looks like this Grafana dashboard was created for an older Kubernetes version, and the metric internals have since changed.
You will probably also need similar edits changing "container_name" to "container" in these older dashboards.
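After the rename, the two affected fields from the variable definition above would read (the rest of the definition stays unchanged):

  "query": "query_result(sum(container_memory_working_set_bytes{namespace=\"$namespace\"}) by (pod))",
  ...
  "regex": "/pod=\\\"(.*?)(\\\")/",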
This might not be a full answer, but I cannot comment yet.
The linked dashboard imports and works fine for me, so I suspect one of these:
Prometheus scraping is not running (correctly). You could go directly into the Prometheus UI and check whether the container_memory_working_set_bytes metric has any value at all, anywhere, as in the query sketch below.
The kube-system namespace might be restricted with respect to scraping. If another namespace works and only this one doesn't, then this is the case.
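For example, you could run a query like this in the Prometheus expression browser (adjust the namespace to the one your dashboard uses); if it returns no series at all, the problem is on the scraping side rather than in Grafana:

  container_memory_working_set_bytes{namespace="kube-system"}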
Is it possible to trigger alerts on the Prometheus dashboard by manually stopping the respective services on the Kubernetes cluster, in order to verify that I receive alerts for issues on the Prometheus dashboard?
I would recommend using a tool such as the Chaos Toolkit to do this declaratively and automatically instead of doing it manually. More generally, this practice is called chaos engineering. For example:
{
  "title": "Do we remain available in face of pod going down?",
  "description": "We expect Kubernetes to handle the situation gracefully when a pod goes down",
  "tags": ["kubernetes"],
  "steady-state-hypothesis": {
    "title": "Verifying service remains healthy",
    "probes": [
      {
        "name": "all-our-microservices-should-be-healthy",
        "type": "probe",
        "tolerance": true,
        "provider": {
          "type": "python",
          "module": "chaosk8s.probes",
          "func": "microservice_available_and_healthy",
          "arguments": {
            "name": "myapp"
          }
        }
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "terminate-db-pod",
      "provider": {
        "type": "python",
        "module": "chaosk8s.pod.actions",
        "func": "terminate_pods",
        "arguments": {
          "label_selector": "app=my-app",
          "name_pattern": "my-app-[0-9]$",
          "rand": true
        }
      },
      "pauses": {
        "after": 5
      }
    }
  ]
}
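Assuming you save the experiment above as experiment.json (the filename is up to you), you would install the toolkit together with its Kubernetes driver and run it like so:

  pip install chaostoolkit chaostoolkit-kubernetes
  chaos run experiment.json

The run fails if the steady-state hypothesis is violated, which is exactly the point at which your Prometheus alerts should have fired.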
You can use Gremlin to achieve this goal too. First, install the Gremlin agent on your Kubernetes cluster using the Helm chart: https://github.com/gremlin/helm/
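A minimal install sketch (the exact chart values may differ between chart versions, so check the linked repo; the team ID and secret placeholders are yours to fill in):

  helm repo add gremlin https://helm.gremlin.com
  helm install gremlin gremlin/gremlin \
    --namespace gremlin --create-namespace \
    --set gremlin.secret.managed=true \
    --set gremlin.secret.teamID=YOUR_TEAM_ID \
    --set gremlin.secret.teamSecret=YOUR_TEAM_SECRET \
    --set gremlin.secret.clusterID=my-cluster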
Next, shut down the specific services using the Kubernetes features within Gremlin. You can control the blast radius by selecting one pod/service or many pods/services. This is a tutorial I wrote on the topic: https://www.gremlin.com/community/tutorials/how-to-install-and-use-gremlin-with-kubernetes/.
Validating monitoring and alerting is a great use case for chaos engineering. As you said, triggering alerts on the Prometheus dashboard by manually stopping the respective services on the Kubernetes cluster will let you verify that alerts for issues appear on your Prometheus dashboard. This tutorial explains how to use Gremlin webhooks with Grafana and Prometheus: https://www.gremlin.com/community/tutorials/visualize-chaos-experiments-in-grafana-with-gremlin-webhooks/
Last night my Kubernetes cluster on GKE was upgraded to 1.16.8-gke.9. Since then, the logs have shown error: unable to find container named fluentd-gcp every minute. Logging from my applications still works, but I'd like to know what causes this error and how to get rid of it.
Expanding the error yields slightly more details:
{
  "textPayload": "error: unable to find container named fluentd-gcp\n",
  "insertId": "v1b2u2ldrnswujhz2",
  "resource": {
    "type": "k8s_container",
    "labels": {
      "project_id": "foo",
      "pod_name": "fluentd-gke-scaler-cd4d654d7-tgg27",
      "cluster_name": "foo-cluster",
      "container_name": "fluentd-gke-scaler",
      "namespace_name": "kube-system",
      "location": "us-east1-d"
    }
  },
  "timestamp": "2020-04-24T16:15:40.224944500Z",
  "severity": "ERROR",
  "labels": {
    "gke.googleapis.com/log_type": "system",
    "k8s-pod/k8s-app": "fluentd-gke-scaler",
    "k8s-pod/pod-template-hash": "cd4d654d7"
  },
  "logName": "projects/foo/logs/stderr",
  "receiveTimestamp": "2020-04-24T16:15:45.923960735Z"
}
kubectl get all --all-namespaces shows fluentd-gke pods with a fluentd-gke container, not fluentd-gcp.
Any advice would be appreciated, and I'm happy to post more details if you tell me where to look for them.
Edit: More details and related problems on the GKE issue tracker: https://issuetracker.google.com/issues/156965162
This will be fixed in GKE 1.16.9-gke.6 according to the issue tracker: https://issuetracker.google.com/issues/156965162
1.16.8-gke.9 is currently offered through the rapid channel. Keep in mind that this channel is offered on an early-access basis for people to test new releases, so the version offered may be subject to unresolved issues with no known workaround. That said, a possible fix could be to drain the affected node and migrate your workloads to another node. If the issue persists, create an issue here.
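Draining would look roughly like this (NODE_NAME is a placeholder; on 1.16-era clusters the flag for evicting pods with emptyDir volumes was still called --delete-local-data):

  kubectl cordon NODE_NAME
  kubectl drain NODE_NAME --ignore-daemonsets --delete-local-data

Once the node is empty, the workloads reschedule onto the remaining nodes and you can delete or recreate the drained one.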
I have a JSON file that can be imported as a Grafana dashboard. I'd like to know which Grafana version it was exported from. The "schemaVersion" field doesn't give the exact Grafana version. Is there any way to find the exact Grafana version used for exporting?
There is a __requires list where the Grafana version is available:
"__requires": [
...
{
"type": "grafana",
"id": "grafana",
"name": "Grafana",
"version": "5.2.4"
},
...
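If you have jq handy, you can pull the version straight out of the file (dashboard.json is a placeholder for your export):

  jq '.__requires[] | select(.type == "grafana") | .version' dashboard.json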