Dataproc VM memory and local disk usage metrics - google-cloud-dataproc

I'm trying to monitor local disk usage (percentage) on Dataproc 2.0 using Cloud Monitoring metrics. This would be useful for monitoring situations where Spark temporary files fill up the disk.
By default, Dataproc seems to send only local disk performance metrics, CPU metrics, and cluster-level HDFS metrics, but not local disk usage.
There seems to be a Stackdriver agent installed on the Dataproc image, but it is not running, so apparently Dataproc collects metrics in a different way. I checked that the df plugin is enabled in /etc/stackdriver/collectd.conf. However, starting the agent fails:
Jul 16 03:01:57 metrics-test-m systemd[1]: Starting LSB: start and stop Stackdriver Agent...
Jul 16 03:01:57 metrics-test-m stackdriver-agent[3829]: Starting Stackdriver metrics collection agent: stackdriver-agentThe instance has neither the application default credentials file nor the correct monitoring scopes; Exiting. ... failed!
Jul 16 03:01:57 metrics-test-m stackdriver-agent[3829]: not starting, configuration/credentials error. ... failed!
Jul 16 03:01:57 metrics-test-m stackdriver-agent[3829]: (warning).
Jul 16 03:01:57 metrics-test-m systemd[1]: Started LSB: start and stop Stackdriver Agent.
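For reference, this is roughly how I checked the plugin and tried to start the agent (the grep pattern is only an illustration of what I looked for in the config):
# Confirm the collectd df (disk usage) plugin is enabled on the image
grep -B 2 -A 6 'LoadPlugin df' /etc/stackdriver/collectd.conf
# Try to start the agent and inspect its status
sudo systemctl start stackdriver-agent
sudo systemctl status stackdriver-agent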
Is it possible to somehow monitor local disk usage in Dataproc and push the metrics to Google Cloud Metrics?

Google Cloud Monitoring Agent is installed on Dataproc cluster VMs, but disabled by default.
Adding --properties dataproc:dataproc.monitoring.stackdriver.enable=true when creating the cluster will enable it. The agent collects guest OS metrics including memory and disk usage, so you can view them in Cloud Metrics. See the property in this doc.
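For example, at cluster creation time (the cluster name, region, and image version below are placeholders, not taken from the question):
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --image-version=2.0 \
    --properties=dataproc:dataproc.monitoring.stackdriver.enable=true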
BTW, the reason CPU usage is collected by default and doesn't depend on the agent is that it is collected by GCE from the VM host. The VM host has no knowledge of memory and local disk usage, so those have to be collected from inside the guest OS, hence the dependency on the agent. When you enable the agent, there will be two CPU usage metrics with different types: one (compute) from the VM host's perspective, the other (agent) from the guest OS's perspective.
Pricing: these metrics are not free of charge; check Cloud Monitoring pricing for details.
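As a quick sanity check that the agent metrics are flowing, you could query the Monitoring API for the agent's disk usage metric. A rough sketch, assuming the metric type is agent.googleapis.com/disk/percent_used and using a placeholder project ID and time range:
PROJECT=my-project
curl -s -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://monitoring.googleapis.com/v3/projects/${PROJECT}/timeSeries?filter=metric.type%3D%22agent.googleapis.com%2Fdisk%2Fpercent_used%22&interval.startTime=2021-07-16T00:00:00Z&interval.endTime=2021-07-16T01:00:00Z"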

Related

How kube-apiserver memory cleanup mechanism works

I would like to ask about a strange memory behavior that we encountered in some of our clusters.
After a spike in the memory consumption of the API server, RAM usage remains at the level of the top of the spike, which suggests that the kube-apiserver does not free any memory.
Is this behavior normal? Can you guide us to a document that describes the kube-apiserver memory cleanup mechanism?
Cluster information:
Kubernetes version: openshift 4.6.35 / kubernetes version 1.19
Cloud being used: openstack 13
Installation method: openshift IPI installation
Host OS: coreos
UPDATE:
We upgraded the cluster to OpenShift version 4.8 and now the API server can free up memory.

How do I stream multiple logs to AWS CloudWatch from inside a Docker instance?

I am setting up Debian-based containers via AWS ECS on EC2 instances. The container has a number of logs I want in separate CloudWatch streams.
The "expected" setup is to simply stream stdout to CloudWatch, but that only permits one stream.
I tried to install the CloudWatch agent, but ran into a myriad of problems, starting with System has not been booted with systemd as init system (PID 1). Can't operate.
Is this possible?
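Not an authoritative answer, but one approach is to run the CloudWatch agent inside the container (started directly rather than via systemd) with a config whose collect_list maps each log file to its own stream. A minimal sketch; the log paths, group/stream names, and the agent's start-up binary location are assumptions based on the standard agent package layout:
# Write an agent config that tails two files into separate streams
cat > /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json <<'EOF'
{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          { "file_path": "/var/log/app/service-a.log",
            "log_group_name": "my-app",
            "log_stream_name": "service-a" },
          { "file_path": "/var/log/app/service-b.log",
            "log_group_name": "my-app",
            "log_stream_name": "service-b" }
        ]
      }
    }
  }
}
EOF
# Start the agent in the foreground (no systemd involved)
/opt/aws/amazon-cloudwatch-agent/bin/start-amazon-cloudwatch-agent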

Gather resource usage by process in a kubernetes cluster

I am searching for a tool similar to Prometheus + Grafana that gathers and records resource usage, especially memory usage, by process ID or process name.
We have two components that run different processes, and they have a memory leak; I want to find out which process is leaking.
Weave Scope, for example, shows all the processes of each pod and their resource usage, but only live; I want something similar that stores the data over time, like a Prometheus graph.
There is a Zabbix-based solution that lets you monitor this at the container level.
Dockbix Agent XXL is an agent for Zabbix capable of monitoring all Docker containers on your host.
You need to deploy it on all nodes; it will collect data about your containers and send it to your Zabbix server.
No classic rpm/deb package installation or Zabbix module compilation is needed.
Just start the dockbix-agent-xxl container and Docker container metrics will be collected from the Docker daemon API or cgroups.
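For illustration, starting it usually looks something like the following; the image name, mounts, and ZA_Server variable are recalled from the Dockbix README, so treat them as approximate:
docker run -d --name=dockbix-agent-xxl --privileged \
  -p 10050:10050 \
  -v /:/rootfs \
  -v /var/run:/var/run \
  -e "ZA_Server=<your Zabbix server IP>" \
  monitoringartist/dockbix-agent-xxl-limited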

Kubernetes Engine: Node keeps getting unhealthy and rebooted for no apparent reason

My Kubernetes Engine cluster keeps rebooting one of my nodes, even though all pods on the node are "well-behaved". I've tried to look at the cluster's Stackdriver logs, but was not able to find a reason. After a while, the continuous reboots usually stop, only to occur again a few hours or days later.
Usually only one single node is affected, while the other nodes are fine, but deleting that node and creating a new one in its place only helps temporarily.
I have already disabled node auto-repair to see if that makes a difference (it was turned on before), and if I recall correctly this started after upgrading my cluster to Kubernetes 1.13 (specifically version 1.13.5-gke). The issue has persisted after upgrading to 1.13.6-gke.0. Even creating a new node pool and migrating to it had no effect.
The cluster consists of four nodes with 1 CPU and 3 GB RAM each. I know that's small for a k8s cluster, but this has worked fine in the past.
I am using the new Stackdriver Kubernetes Monitoring as well as Istio on GKE.
Any pointers as to what could be the reason or where I look for possible causes would be appreciated.
Screenshots of the Node event list (happy to provide other logs; couldn't find anything meaningful in Stackdriver Logging yet):
Posting this answer as a community wiki to give some troubleshooting tips/steps as the underlying issue wasn't found.
Feel free to expand it.
After the steps below, the issue with the node rebooting was no longer present:
Updating the Kubernetes version (GKE)
Uninstalling Istio
Using e2-medium instances as nodes.
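For reference, the move to e2-medium nodes can be done by adding a new node pool and draining the old nodes onto it; a sketch with placeholder cluster, pool, zone, and node names:
gcloud container node-pools create pool-e2-medium \
  --cluster=my-cluster \
  --zone=europe-west1-b \
  --machine-type=e2-medium \
  --num-nodes=4
kubectl cordon <old-node>
kubectl drain <old-node> --ignore-daemonsets --delete-emptydir-data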
As pointed out by user #aurelius:
I would start from posting the kubectl describe node maybe there is something going on before your Node gets rebooted and unhealthy. Also do you use resources and limits? Can this restarts be a result of some burstable workload? Also have you tried checking system logs after the restart on the Node itself? Can you post the results? – aurelius Jun 7 '19 at 15:38
The above comment could be a good starting point for troubleshooting issues with the cluster.
Options to troubleshoot the cluster pointed out in the comment (a combined sketch of the CLI checks follows this list):
$ kubectl describe node focusing on output in:
Conditions - KubeletReady, KubeletHasSufficientMemory, KubeletHasNoDiskPressure, etc.
Allocated resources - Requests and Limits of scheduled workloads
Checking system logs after the restart on the node itself:
GCP Cloud Console (Web UI) -> Logging -> Legacy Logs Viewer/Logs Explorer -> VM Instance/GCE Instance
It could also be beneficial to check the CPU/RAM usage in:
GCP Cloud Console (Web UI) -> Monitoring -> Metrics Explorer
You can also check if there are any operations on the cluster:
gcloud container operations list
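A combined sketch of the CLI checks above (node and cluster names are placeholders):
# Node conditions and allocated resources
kubectl describe node <node-name>
# Node-related events (reboots, NotReady transitions, disk/memory pressure)
kubectl get events --all-namespaces --field-selector involvedObject.kind=Node
# Recent cluster operations (auto-repairs, upgrades)
gcloud container operations list --filter="targetLink~my-cluster"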
Adding to the above points:
Creating a cluster with Istio on GKE
We suggest creating at least a 4 node cluster with the 2 vCPU machine type when using this add-on. You can deploy Istio itself with the default GKE new cluster setup but this may not provide enough resources to explore sample applications.
-- Cloud.google.com: Istio: Docs: Istio on GKE: Installing
Also, the official Istio docs state:
CPU and memory
Since the sidecar proxy performs additional work on the data path, it consumes CPU and memory. As of Istio 1.7, a proxy consumes about 0.5 vCPU per 1000 requests per second.
-- Istio.io: Docs: Performance and scalability: CPU and memory
Additional resources:
Cloud.google.com: Kubernetes Engine: Docs: Troubleshooting
Kubernetes.io: Docs: Debug cluster

No cpu metrics from running pods on stackdriver

Hi, I'm trying to set up Stackdriver to monitor my containers, but the CPU metrics don't seem to work. I'm working with the following versions:
Master Version 1.2.5
Node Version 1.2.4
heapster-v1.0.2-594732231-sil32
This is a group I created for the databases (it also happens for the WildFly pod and modcluster). I have a couple of other questions:
Is it possible to monitor Postgres, or do I have to install the agent on the Docker image?
Can I monitor the images on Kubernetes, or the disks on Google Cloud?
Do your containers have CPU limits specified on them? The CPU Usage graph on that page is supposed to show utilization, which is defined as cores used / cores reserved. If a container hasn't specified a maximum number of cores, then it won't have a utilization either, as mentioned in the description of the CPU utilization metric.
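For example, a CPU limit (and request) can be added to an existing workload roughly like this, with a placeholder deployment name and values, after which the utilization graph has a reservation to divide by:
kubectl set resources deployment postgres \
  --requests=cpu=250m \
  --limits=cpu=500m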