Google Cloud VM instance CPU usage very high, and growing month by month

I have configured a VM instance in Google Cloud with Node-RED, MQTT, Grafana, InfluxDB, and a WireGuard tunnel.
It is used only by me, for my home automation system.
Only a few devices.
CPU usage is growing month by month.
I have already read that you can't compare the output of the top command with the CPU utilization in Google Cloud's monitoring service, because: "The CPU usage shown in Google Cloud Console is not that of the instance, but the CPU usage of the container managing the instance. This container is in charge of providing the virtualization services to the instance and collecting all the metrics used for load balancing, auto-scaling, cloud monitoring, etc. As such, high numbers of I/O or network operations will cause the CPU utilization shown in Cloud Console to spike."
But how can I find out what the real reason is?
With top I can't see anything, as the CPU is under 5%, but the VM instance shows over 80% (one month ago, it was about 50%).
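One way to narrow this down is to watch disk and network counters inside the guest while the console graph spikes, since high I/O or network load is the usual suspect; a minimal sketch using the sysstat tools (assuming they are installed; device and interface names will vary):

# per-device disk I/O every 5 seconds (extended stats, MB/s)
iostat -dxm 5
# per-interface network throughput every 5 seconds
sar -n DEV 5
# overall I/O wait, context switches, memory pressure
vmstat 5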
Thanks a lot.
NOTE: For example, here is the CPU usage chart for the last 30 days:
[CPU usage chart]

Related

Monitor NestJS backend

In the old days, when we wanted to monitor a daemon/service, we would ask the software vendor for the list of all the services running in the background on Windows.
If a daemon/service went down, it would be restarted.
On top of that, we would use software like Nagios or Centreon to monitor that particular daemon/service.
I have a team of software developers in charge of implementing a nice NestJS backend.
Here is what we are going to implement:
2 different VMs running on a high-availability VMware cluster with a SAN
the two VMs have vMotion / high-availability settings
an HAProxy is set up to provide load balancing and additional high availability
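For concreteness, the load-balancing and health-check part of such an HAProxy setup might look roughly like this (hostnames, ports and the /health path are assumptions, not taken from the post):

# haproxy.cfg (sketch): two NestJS backends behind one frontend
frontend nest_front
    bind *:80
    default_backend nest_back

backend nest_back
    balance roundrobin
    option httpchk GET /health           # assumed health endpoint
    server app1 192.168.1.11:3000 check  # VM 1 (example address)
    server app2 192.168.1.12:3000 check  # VM 2 (example address)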
Our questions are: how can we detect that
one of our backends is down?
one of our backends has moved from a 50 ms average response time to 800 ms?
one of our backends consumes more than 15 GB of RAM?
etc.
When we were using "old school" daemons, this was enough; when it comes to a JS backend, I am a bit clueless.
Cheers
Kynes
NB: the datacenter in charge of our infrastructure is not "Docker / Kubernetes / Ansible etc." compliant.
To be fair, all of these seem doable out of the box for Centreon/Nagios. I'd say check the documentation...
One of our backends is down?
VM down: the centreon-vmware plugin provides monitoring of VM status.
VM up but backend down: use the native HTTP/HTTPS URL checks provided by Centreon/Nagios to load the web page (see the example below).
Or use the native SNMP plugins to monitor the status of your Node process.
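A hedged example of such a URL check using the standard check_http plugin shipped with Nagios/Centreon (the plugin path varies by distribution; hostname, port and path are placeholders):

# alert CRITICAL if the backend does not answer with HTTP 200 within 10 seconds
/usr/lib/nagios/plugins/check_http -H backend1.example.com -p 3000 -u /health -e 200 -t 10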
One of our backends moving from 50 ms average response time to 800 ms?
Ping response time: use the native ping check.
Status of the network interfaces of the VM: the centreon-vmware plugin has network interface checks for VMs.
Page loading time: use the native HTTP/HTTPS URL checks provided by Centreon/Nagios; these support warning/critical thresholds on response time (example below).
You may go even further and use a browser automation tool like Selenium to run scenarios on your pages and monitor the time for each step.
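For the response-time question, the same check_http plugin accepts warning/critical thresholds in seconds, so a jump from ~50 ms to 800 ms can be caught like this (thresholds, host and path are illustrative):

# WARNING above 0.2 s, CRITICAL above 0.5 s page load time
/usr/lib/nagios/plugins/check_http -H backend1.example.com -p 3000 -u /api/ping -w 0.2 -c 0.5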
One of our backends consumes more than 15 GB of RAM?
Total RAM consumed on the server: use the native SNMP memory checks from Centreon/Nagios.
RAM consumed by a specific process: possible through the native SNMP memory plugin.
Like so:
/usr/lib/centreon/plugins/centreon_linux_snmp.pl --plugin os::linux::snmp::plugin --mode processcount --hostname=127.0.0.1 --process-name="centengine" --memory --cpu
OK: Number of current processes running: 1 - Total memory usage: 8.56 MB - Average memory usage: 8.56 MB - Total CPU usage: 0.00 % | 'nbproc'=1;;;0; 'mem_total'=8978432B;;;0; 'mem_avg'=8978432.00B;;;0; 'cpu_total'=0.00%;;;0;

Monitoring persistent volume performance

Use case
I am operating a Kafka cluster in Kubernetes which is heavily dependent on proper disk performance (IOPS, throughput, etc.). I am using Google Compute Engine disks + Google Kubernetes Engine. Thus I know that the disks I created have the following approximate limits:
IOPS (Read/Write): 375 / 750
Throughput in MB/s (Read/Write): 60 / 60
The problem
Even though I know the approximate IOPS and throughput limits, I have no idea what I am actually using at the moment. I'd like to monitor it with Prometheus + Grafana, but I couldn't find anything that would export disk I/O stats for persistent volumes. The best I found were disk space stats from the kubelet:
kubelet_volume_stats_capacity_bytes
kubelet_volume_stats_available_bytes
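These metrics typically carry namespace and persistentvolumeclaim labels, so disk space per PVC can already be graphed; they just say nothing about I/O. A sketch:

# fraction of disk space used per PVC (space only - no IOPS/throughput)
(kubelet_volume_stats_capacity_bytes - kubelet_volume_stats_available_bytes)
  / kubelet_volume_stats_capacity_bytes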
The question
What possibilities do I have to monitor (preferably via Prometheus) the disk I/O usage for my Kafka persistent volumes attached in Kubernetes?
Edit:
Another find I made is node-exporter's node_disk_io_time metric:
rate(node_disk_io_time_seconds_total[5m]) * 100
Unfortunately the result doesn't contain a node name, or even a persistent volume (claim) name. Instead it has a device (e.g. 'sdb') and an instance (e.g. '10.90.206.10') label, which are the only labels that would somehow allow me to monitor a specific persistent volume. The downside of these labels is that they are dynamic and can change with a pod restart or similar.
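With that limitation in mind, the other per-device counters from node-exporter can at least be compared against the provisioned limits listed above; a sketch, assuming a node-exporter recent enough to use the *_total metric names:

# read / write IOPS per device (compare against the 375 / 750 limits)
rate(node_disk_reads_completed_total[5m])
rate(node_disk_writes_completed_total[5m])
# read / write throughput in bytes per second (compare against ~60 MB/s)
rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])

Mapping the device/instance labels back to a particular PVC still has to be done by hand or with recording rules, which is exactly the gap described above.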
You should be able to get the metrics that you are looking for using Stackdriver. Check the new Stackdriver Kubernetes Monitoring.
You can use this Qwiklab to test the tools without installing anything in your environment.
You can use Stackdriver Monitoring to see the disk I/O of an instance. In the Cloud Console, go to the VM instance --> Monitoring page to find it.

How to view Google Cloud VM instance RAM utilisation

I'm using Google Cloud VM instances but I can't find the RAM Utilisation chart.
When accessing through https://console.cloud.google.com, I go to Hamburger Menu >> Compute Engine >> VM instances and click on one instance; there is a select box in the top-left area with the following options:
CPU utilization
Disk bytes
Disk operations
Network bytes
Network packets
But there is no RAM utilization option. Where can I find this chart?
You can use Stackdriver monitoring for that. You can find it here: Hamburger Menu --> Stackdriver --> Monitoring. You will be directed to the Stackdriver site and the first time it will create a Stackdriver project for you.
Stackdriver is an advanced monitoring and alerting tool but the basic (free) tier includes basic metrics like memory utilization.
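One caveat from memory: the basic CPU/disk/network metrics come out of the box, but memory utilization typically only appears after installing the Stackdriver Monitoring agent on the VM. At the time, the documented install looked roughly like this (verify the script URL against Google's current docs):

curl -sSO https://dl.google.com/cloudagents/install-monitoring-agent.sh
sudo bash install-monitoring-agent.sh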

AWS EB should create a new instance once my Docker container reaches its maximum memory limit

I have deployed my Dockerized microservices on an AWS server using Elastic Beanstalk; they are written using Akka HTTP (https://github.com/theiterators/akka-http-microservice) and Scala.
I have allocated 512 MB of memory for each Docker container and I am seeing performance problems. I have noticed that CPU usage increases when the server gets more requests (to 20%, 23%, 45%, ... depending on load), then automatically comes back down to the normal state (0.88%). But memory usage keeps increasing with every request; it fails to release unused memory even after CPU usage returns to normal, eventually reaches 100%, and the Docker container is killed and restarted.
I have also enabled the auto-scaling feature in EB to handle a huge number of requests, but it creates another instance only after the CPU usage of the running instance reaches its maximum.
How can I set up auto-scaling to create another instance once memory usage reaches its maximum limit (i.e. 500 MB out of 512 MB)?
Please provide a solution/way to resolve these problems as soon as possible, as this is a very critical problem for us.
CloudWatch doesn't natively report memory statistics. But there are some scripts that Amazon provides (usually just referred to as the "CloudWatch Monitoring Scripts for Linux") that will push the statistics into CloudWatch so you can use those metrics to build a scaling policy.
The Elastic Beanstalk documentation provides some information on installing the scripts on the Linux platform at http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/customize-containers-cw.html.
However, this will come with another caveat in that you cannot use the native Docker deployment JSON as it won't pick up the .ebextensions folder (see Where to put ebextensions config in AWS Elastic Beanstalk Docker deploy with dockerrun source bundle?). The solution here would be to create a zip of your application that includes the JSON file and .ebextensions folder and use that as the deployment artifact.
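For reference, once the scripts are installed they are usually run from cron; an entry like the following pushes memory metrics to CloudWatch every five minutes (path and flags as I recall them from the CloudWatch Monitoring Scripts docs; double-check against the guide linked above):

*/5 * * * * ~/aws-scripts-mon/mon-put-instance-data.pl --mem-util --mem-used --mem-avail --from-cron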
There is also one thing I am unclear on, and that is whether these metrics will be available to choose from under the Configuration -> Scaling section of the application. You may need to create another .ebextensions config file to set the custom metric, such as:
option_settings:
  aws:elasticbeanstalk:customoption:
    BreachDuration: 3
    LowerBreachScaleIncrement: -1
    MeasureName: MemoryUtilization
    Period: 60
    Statistic: Average
    Threshold: 90
    UpperBreachScaleIncrement: 2
Now, even if this works: if the application does not lower its memory usage after scaling out and the load goes down, then the scaling policy will just keep triggering and eventually reach the maximum number of instances.
I'd first see if you can get some garbage collection statistics for the JVM and maybe tune the JVM to do garbage collection more often to help bring memory down faster after application load goes down.
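As a starting point for that tuning, a hedged sketch of JVM options for a 512 MB container: cap the heap well below the container limit so metaspace and off-heap buffers still fit (the exact values are illustrative, not tuned for this service):

# passed to the JVM inside the container, e.g. via JAVA_OPTS or the Dockerfile CMD
JAVA_OPTS="-Xms256m -Xmx384m -XX:MaxMetaspaceSize=64m -XX:+UseG1GC"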

Constant CPU usage on Compute Engine Instance

Recently I deployed a Compute Engine instance created from the LAMP template. A few days after deployment I started to see constant CPU usage (~8%). I have not performed any API activity (I haven't created any applications) and I see zero CPU usage inside the VM (top/mpstat, etc.).
Any ideas what is happening?
As mentioned here: Idle CPU utilization on Google Compute Engine
"The CPU usage in Google Developer Console is not that of the instance, but the CPU usage of the container managing it. This container is in charge of providing the virtualization services to the instance and collecting all the metrics. So, the Google Developer Console CPU utilization shows the aggregate CPU usage for both the container and the instance."