Why is MRTG showing 100 percent CPU load?

I have configured MRTG to monitor network traffic, CPU load and memory. The network traffic statistics are OK, but the CPU load statistics show the CPU as 100% used, while it actually is not when I check with the top command. The following is the MRTG configuration for CPU (mrtg.cfg).
# 10.12.2.1 CPU configuration
Target[CPU]: .1.3.6.1.4.1.2021.10.1.5.1&.1.3.6.1.4.1.2021.10.1.5.2:public#10.12.2.1
MaxBytes[CPU]: 100
Unscaled[CPU]: dwmy
Options[CPU]: gauge, growright, nopercent
YLegend[CPU]: Load Average
ShortLegend[CPU]: (%)
LegendI[CPU]: Load Average 1 min
LegendO[CPU]: Load Average 5 min
Legend1[CPU]: Load Average 1 min
Legend2[CPU]: Load Average 5 min
Title[CPU]: CPU Load Average
PageTop[CPU]: <h1>10.12.2.1 CPU Load Average</h1>
Where is the problem in the configuration? Here is a snapshot of the CPU statistics output.

This is the problem with using SNMP to collect CPU and Load Avg statistics. Depending on your OS and SNMP implementation, and the number of CPUs you have, you may find that the SNMP query erroneously returns a high value because, at the point in time when you check, one CPU is in use by the SNMP daemon.
If you can, it is better to use an OID which returns the average usage over the last 5 minutes rather than the point-in-time usage, as this avoids the problem. Usually you can find a LoadAvg5min OID, although some SNMP implementations do not have one.
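For example, with Net-SNMP's UCD MIB (which the OIDs in the config above come from), you can first check from the shell what the agent actually returns; note that laLoadInt reports the load average multiplied by 100, so a value of 100 means a load of 1.00, not 100% CPU. A minimal sketch, assuming the standard OIDs and the public community string from the config above:
# 5-minute load average as a string (UCD laLoad.2)
snmpget -v 2c -c public 10.12.2.1 .1.3.6.1.4.1.2021.10.1.3.2
# the integer variant already used in mrtg.cfg (laLoadInt.2 = load x 100)
snmpget -v 2c -c public 10.12.2.1 .1.3.6.1.4.1.2021.10.1.5.2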
Another alternative is to use an external plugin. You can have MRTG use mrtg-nrpe to call the Nagios NRPE agent on the remote host, which then calls the Nagios check-cpu and check-load plugins to get the real CPU usage. This is a bit complex to set up, though, and in some cases can suffer from the same issue.

Related

Ballpark value for `--jobs` in `pg_restore` command

I'm using pg_restore to restore a database to its original state for load testing. I see it has a --jobs=number-of-jobs option.
How can I get a ballpark estimate of what this value should be on my machine? I know this is dependent on a bunch of factors, e.g. machine, dataset, but it would be great to get a conceptual starting point.
I'm on a MacBook Pro so maybe I can use the number of physical CPU cores:
sysctl -n hw.physicalcpu
# 10
If there is no concurrent activity at all, don't exceed the number of parallel I/O requests your disk can handle. This applies to the COPY part of the dump, which is likely I/O bound. But a restore also creates indexes, which uses CPU time (in addition to I/O). So you should also not exceed the number of CPU cores available.
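As a hedged starting point, you could set --jobs to the physical core count and adjust from there while watching disk and CPU saturation; the database name and dump path below are placeholders:
# one job per physical core to begin with; parallel restore needs a custom- or directory-format archive
JOBS=$(sysctl -n hw.physicalcpu)   # 10 on this machine
pg_restore --jobs="$JOBS" --dbname=loadtest_db /path/to/backup.dump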

PowerShell based Process specific resource monitoring

I am using PowerShell for some benchmarking of Autodesk Revit, and I would like to add some resource monitoring. I know I can get historical CPU utilization and RAM utilization in general, but I would like to be able to poll process-specific CPU and RAM utilization every 5 seconds or so.
In addition, I would love to be able to poll how many cores a process is currently using, the clock speed of those specific cores, and the frame rate of the screen that process is currently displayed on.
Are those things even accessible via PowerShell/.NET? Or is that low-level stuff I just can't get to with PS?

How to calculate the number of CPUs, memory and storage that my Google Cloud SQL instance needs

My DB is reaching 100% CPU utilization, and increasing the number of CPUs is no longer helping.
What kind of information should I consider when sizing my Google Cloud SQL instance? How do you set up the DB configuration?
Info I have:
For 10-50 minutes a day I get 120 requests/second and the CPU reaches 100% utilization
Memory usage peaks at 2.5 GB during this critical period
Storage usage is currently around 1.3 GB
Current configuration:
vCPUs: 10
Memory: 10 GB
SSD storage: 50 GB
Unfortunately, there is no magic formula for determining the correct database size. This is because queries have variable load - some are small and simple and take no time at all, others are complex or huge and take lots of resources to complete.
There are generally two strategies for dealing with high load: reduce your load (use connection pooling, optimize your queries, cache results), or increase the size of your database (add additional CPUs, storage, or read replicas).
Usually, when we see high CPU utilization, it is because the CPU is overloaded or because there are too many database tables in the same instance. Here are some common issues and fixes provided by Google's documentation:
If CPU utilization is over 98% for 6 hours, your instance is not properly sized for your workload, and it is not covered by the SLA.
If you have 10,000 or more database tables on a single instance, it could result in the instance becoming unresponsive or unable to perform maintenance operations, and the instance is not covered by the SLA.
When the CPU is overloaded, it is recommended to view the percentage of available CPU your instance is using on the Instance details page in the Google Cloud Console, as described in the documentation.
It is also recommended to monitor your CPU usage and receive alerts at a specified threshold by setting up a Stackdriver alert.
Increasing the number of CPUs for your instance should reduce the strain on your instance. Note that changing CPUs requires an instance restart. If your instance is already at the maximum number of CPUs, shard your database across multiple instances.
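If you do resize, a hedged sketch of the gcloud commands (the instance name and target sizes are placeholders, and the patch triggers a restart, so run it outside the busy window):
# check the current machine configuration of the instance
gcloud sql instances describe my-instance --format="value(settings.tier)"
# move to a larger custom machine type (causes a restart)
gcloud sql instances patch my-instance --cpu=16 --memory=16GiB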
Google has this very interesting documentation about investigating high utilization and determining whether a system or user task is causing high CPU utilization. You could use it to troubleshoot your instance and find what's causing the high CPU utilization.

Weird cgroups behaviour (cpu.shares)

I have a CPU-hungry process A which is taking too much CPU (100% load), which means process B does not get enough cycles. B is related to web response. When I benchmarked the web response with both processes running and without cgroups, the result was 5 seconds. Now, when I create two groups and give both processes an equal amount of cpu.shares, the time taken increases to 15 seconds.
I am getting good results with a high cpu.shares ratio in favour of the process which has to be given more priority, but I am really curious about this weird behaviour at the default values.
Why would the response time increase with the default share value of 1024 for both groups? Shouldn't it be the same as without cgroups?
When I put both processes in the same group, the response time goes back to 5 seconds.
Is it something related to the scheduler?
[ If you have cpuacct cgroup mounted with cpu, you can look at the usage numbers to check if both cgroups are getting equal shares. ]
My guess is that your setup has some processes running outside cgroups. When you move some of the processes under cgroups, the ones still outside will get higher cpu shares (in total) than the two cgroups. (Each top-level process gets the equivalent of 1024 shares.) To achieve what you want, all processes should be under some cgroup.
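To check that, a minimal cgroup v1 sketch (paths and PIDs are illustrative; it assumes the cpu and cpuacct controllers are mounted together under /sys/fs/cgroup):
# create two groups with equal weight and move one process into each
mkdir /sys/fs/cgroup/cpu,cpuacct/groupA /sys/fs/cgroup/cpu,cpuacct/groupB
echo 1024 > /sys/fs/cgroup/cpu,cpuacct/groupA/cpu.shares
echo 1024 > /sys/fs/cgroup/cpu,cpuacct/groupB/cpu.shares
echo "$PID_A" > /sys/fs/cgroup/cpu,cpuacct/groupA/tasks
echo "$PID_B" > /sys/fs/cgroup/cpu,cpuacct/groupB/tasks
# compare accumulated CPU time; with equal shares both numbers should grow at a similar rate
cat /sys/fs/cgroup/cpu,cpuacct/group{A,B}/cpuacct.usage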

httperf for benchmarking web servers

I am using httperf to benchmark web servers. My configuration: i5 processor and 4 GB RAM. How do I stress this configuration to get accurate results? I mean I have to put 100% load on this server (12.04 LTS server).
You can use httperf like this:
httperf --server <server> --port <port> --wsesslog=200,0,urls.log --rate 10
Here urls.log contains the different URIs/paths to be requested. Check the documentation for details.
Now try changing the rate or session value and see how many requests per second you can achieve and what the reply time is. In the meantime, also monitor the CPU and memory utilization using mpstat or the top command to see whether the server is reaching 100%.
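For instance, while the httperf run is in progress, something like this (from the sysstat package) shows per-core utilization every second:
# per-CPU utilization, refreshed every second, during the benchmark
mpstat -P ALL 1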
What's tricky about httperf is that it often saturates the client first, because of 1) the per-process open-files limit, 2) the TCP port number limit (excluding the reserved ports 0-1024, there are only 64512 ports available for TCP connections, so with each port tied up in TIME_WAIT for a minute you can sustain at most about 1075 new connections per second), and 3) the socket buffer size. You probably need to tune the above limits to avoid saturating the client, as in the sketch below.
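A hedged example of the client-side tuning this refers to, on Linux (the values are illustrative, not recommendations):
# raise the per-process open-files limit for the benchmarking shell
ulimit -n 65535
# widen the ephemeral port range used for outgoing connections
sudo sysctl -w net.ipv4.ip_local_port_range="1024 65535"
# allow sockets in TIME_WAIT to be reused for new outgoing connections
sudo sysctl -w net.ipv4.tcp_tw_reuse=1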
To saturate a server with 4 GB of memory, you would probably need multiple physical machines. I tried 6 clients, each issuing 300 req/s against a 4 GB VM, and that saturates it.
However, there are still other factors impacting the result, e.g. the pages deployed on your Apache server and the workload access patterns. But the general suggestions are:
1. Test the request workload that is closest to your target scenarios.
2. Add more physical clients and watch how the response rate, response time, and error count change, in order to make sure you are not saturating the clients.