Ubuntu crashing - How to diagnose - server

I have a dedicated server running Ubuntu 20.04, with cPanel 106.11, MySQL 8, PHP 8.1, Elasticsearch 7.17.8 and i run magento 2.4.5-p1. Config Server Security & Firewall is enabled.
Every couple of days i get an monitoring alert to say my server doesnt respond to ping and the host has to do a hard reboot, they are getting frustrated with this and say they will turn off monitoring unless i sort this as they have checked all hardware which is fine.
This happens at different times and usually overnight.
I have looked through syslog, mysql log, elasticsearch log, magento 2 logs, apache log, kern.log and i cant find the cause of the issue.
I have enabled "sar" and the RAM usage around the time is 64%, cpu usage is between 5-10%.
What else can i look at to try and diagnose this issue?
Additional info requested by Wilson:
select count - https://justpaste.it/6zc95
show global status - https://justpaste.it/6vqvg
show global variables - https://justpaste.it/cb52m
full process list - https://justpaste.it/d41lt
status - https://justpaste.it/9ht1i
show engine innodb status - https://justpaste.it/a9uem
top -b -n 1 - https://justpaste.it/4zdbx
top -b -n 1 -H - https://justpaste.it/bqt57
ulimit -a - https://justpaste.it/5sjr4
iostat -xm 5 3 - https://justpaste.it/c37to
df -h, df -i, free -h and cat /proc/meminfo - https://justpaste.it/csmwh
htop - https://freeimage.host/i/HAKG0va
Server is using nvme drives, 32GB RAM, 6 cores, MySQL is running on same server as litespeed.
Server has not gone down again since posting this but the datacentre usually reboot within 15 - 20 mins and 99% of the time happens overnight. The server is not accessible over ssh when it crashes.

Rate Per Second = RPS
Suggestions to consider for your instance (should be available in your cpanel as they are all dynamic variables)
connect_timeout=30 # from 10 seconds to reduce aborted_connects RPHr of 75
innodb_io_capacity=900 # from 200 to use more of NVME IOPS capacity
thread_cache_size=36 # from 9 to reduce threads_created RPHr of 75
read_rnd_buffer_size=32768 # from 256K to reduce handler_read_rnd_next RPS of 5,805
read_buffer_size=524288 # from 128K to reduce handler_read_next RPS of 5,063
Many more opportunities exist to improve performance of your instance.
View profile for contact info, please. We are pushing the one question/one answer planned for this platform.

Related

Monitor Nestjs backend

In the old days, when we wanted to monitor a "Daemon" / Service, we were asking the software editor the list of all the services running in the background in Windows.
If a "Daemon / service" would be down, it would be restarted.
On top of that, we would use a software like NAGIOS or Centreon to monitore this particular "Daemon / service".
I have a team of Software developper in charge of implementing a nice Nest JS.
Here is what we are going to implement:
2 differents VMs running on a high availability VMWARE cluster with a SAN
the two VMs has Vmotion / High availabity settings
an HA Proxy is setup in order to provide load balancing and additional high availability
Our questions are, how can we detect that :
one of our backend is down ?
one of our backend moving from 50ms average response time to 800ms ?
one of our backend consumes more that 15Gb of ram ?
etc
When we were using "old school" daemon, it was enough, when it comes to JS backend, I am a bit clue less.
Cheers
Kynes
nb : the datacenter in charge of our infrastructure is not "docker / kubernetes / ansible etc compliant)
To be fair, all of these seem doable out of the box for Centreon/Nagios. I'd say check the documentation...
one of our backend is down ?
VM DOWN: the centreon-vmware plugins provides monitoring of VM status.
VM UP but Backend DOWN : use the native http/https url checks provided by Centreon/Nagios to load the web page.
Or use the native SNMP plugins to monitor the status of your node process.
one of our backend moving from 50ms average response time to 800ms ?
Ping Response time: Use the native ping check
Status of the network interfaces of the VM: the centreon-vmware plugin has network interface checks for VMs.
Page loading time: use the native http/https url checks provided by Centreon/Nagios.
You may go even further and use a browser automation tool like selenium to run scenarios on your pages and monitor the time for each step.
one of our backend consumes more that 15Gb of ram ?
Total RAM consumed on server: use the native SNMP memory checks from centreon/nagios.
RAM consumed by a specific process: possible through the native SNMP memory plugin.
Like so:
/usr/lib/centreon/plugins/centreon_linux_snmp.pl --plugin os::linux::snmp::plugin --mode processcount --hostname=127.0.0.1 --process-name="centengine" --memory --cpu
OK: Number of current processes running: 1 - Total memory usage: 8.56 MB - Average memory usage: 8.56 MB - Total CPU usage: 0.00 % | 'nbproc'=1;;;0; 'mem_total'=8978432B;;;0; 'mem_avg'=8978432.00B;;;0; 'cpu_total'=0.00%;;;0;`

How to mass insert keys in redis running as pod in kubernetes

I have referred to https://redis.io/topics/mass-insert and tried the Luke protocol,
and did
cat data.txt | redis-cli -a <pass> -h <events-k8s-service> --pipe-timeout 100 > /dev/null
The redirection to /dev/null is to ignore the replies. The CLIENT REPLY of redis can't serve its purpose here from CLI and it may turn into a blocking command.
The data.txt has around 18 Million records/commands like
SELECT 1
SET key1 '"field1":"val1","field2":"val2","field3":"val3","field4":"val4","field5":val5,"field6":val6'
SET key2 '"field1":"val1","field2":"val2","field3":"val3","field4":"val4","field5":val5,"field6":val6'
.
.
.
This command is executed from a cronJob which execs into the redis pod, and executes the above command from within the pod, to understand the footprint, the redis pod had no resources limit and following are the observation:
Keys loaded: 18147292
Time taken: ~31 minutes
Peak CPU: 2063 m
Peak Memory: 4745 Mi
The resources consumed are way too high and the time taken is too long.
The questions:
How do we load mass load data of the order 50 Million keys using redis pipe, is there an alternate approach to this problem ?
Is there a golang/python library that does the same mass loading efficiently(less time , little footprint of mem and cpu) ?
Do we need to fine tune redis here ?
The help is appreciated, Thanks in advance.
If you are using the redis-cli inside the pod to move the millions of key into Redis POD won't be able to handle it.
Also, you have not specified any resources that you are giving to Redis however it's a memory store so it will be better to give proper memory to redis 2-3 GB depends on usegae.
You can try out the Redis-riot : https://github.com/redis-developer/riot
to add data into the Redis.
There is also good video across loading the Big foot data into the redis : https://www.youtube.com/watch?v=TqTg6RijfaU
Do we need to fine tune redis here.
Increase the memory for redis if it's getting OOMkilled.

What is the best way to monitor Heroku Postgres memory and cpu

We're on Heroku and trying to understand if it's time to upgrade our Postgres database or not. I have two questions:
Is there any tools you know of that track heroku postgres logs to track their memory and cpu
usage stats over time?
Are those (Memory and CPU usage) even the best metrics to look at to determine if we should upgrade to a larger instance or not?
The most useful tool I've found for monitoring heroku postgres instancs is the logs associated with the database's dyno, which you can monitor using heroku logs -t -d heroku-postgres. This spits out some useful stats every 5 minutes, so if you fill your logs up quickly, this might not output anything right away — use -t to wait for the next log line.
Output will look something like this:
2022-06-27T16:34:49.000000+00:00 app[heroku-postgres]: source=HEROKU_POSTGRESQL_SILVER addon=postgresql-fluffy-55941 sample#current_transaction=81770844 sample#db_size=44008084127bytes sample#tables=1988 sample#active-connections=27 sample#waiting-connections=0 sample#index-cache-hit-rate=0.99818 sample#table-cache-hit-rate=0.9647 sample#load-avg-1m=0.03 sample#load-avg-5m=0.205 sample#load-avg-15m=0.21 sample#read-iops=14.328 sample#write-iops=15.336 sample#tmp-disk-used=543633408 sample#tmp-disk-available=72435159040 sample#memory-total=16085852kB sample#memory-free=236104kB sample#memory-cached=15075900kB sample#memory-postgres=223120kB sample#wal-percentage-used=0.0692420374380985
The main stats I pay attention to are table-cache-hit-rate which is a good proxy for how much of your active dataset fits in memory, and load-avg-1m, which tells you how much load per CPU the server is experiencing.
You can read more about all these metrics here.

Centos - my Centos 5 (hosted asterisk) always have a large CPU usage process

i have a Centos 5 server which hosted asterisk 13.
server works fine last week but now top command always show me a process with large amount of CPU usage. when i kill the process a few second later another command with large CPU usage started. many times processes command is ".syslog" but have other command like "qjennjifes", "vnvebynufu" and another unknown commands like that.
1) Check you have firewall and fail2ban recomended settings
2) Check you have no DoS/DDoS by "sip show channels"
3) Check your system not hacked/no broken soft on your host.

What are some useful tips/tools for monitoring/tuning memcached health?

Yesterday, I found this cool script 'memcache-top' which nicely prints out stats of memcached live. It looks like,
memcache-top v0.6 (default port: 11211, color: on, refresh: 3 seconds)
INSTANCE USAGE HIT % CONN TIME EVICT/s READ/s WRITE/s
127.0.0.1:11211 88.8% 94.8% 20 0.8ms 9.0 311.3K 162.8K
AVERAGE: 88.8% 94.8% 20 0.8ms 9.0 311.3K 162.8K
TOTAL: 1.8GB/ 2.0GB 20 0.8ms 9.0 311.3K 162.8K
(ctrl-c to quit.)
it even makes certain text red when you should pay attention to something!
Q. Broadly, what are some useful tools/techniques you've used to check that memcached is set up well?
Good interface to accessing Memcached server instances is phpMemCacheAdmin.
I prefer access from the command line using telnet.
To make a connection to Memcached using Telnet, use the following telnet localhost 11211 command from the command line.
If at any time you wish to terminate the Telnet session, simply type quit and hit return.
You can get an overview of the important statistics of your Memcached server by running the stats command once connected.
Memory is allocated in chunks internally and constantly reused. Since memory is broken into different size slabs, you do waste memory if your items do not fit perfectly into the slab the server chooses to put it in.
So Memcached allocates your data into different "slabs" (think of these as partitions) of memory automatically, based on the size of your data, which in turn makes memory allocation more optimal.
To list the slabs in the instance you are connected to, use the stats slab command.
A more useful command is the stats items, which will give you a list of slabs which includes a count of the items store within each slab.
Now that you know how to list slabs, you can browse inside each slab to list the items contained within by using the stats cachedump [slab ID] [number of items, 0 for all items] command.
If you want to get the actual value of that item, you can use the get [key] command.
To delete an item from the cache you can use the delete [key] command.
For a production systems, you should really set up active monitoring (with downtime alerts, automated restarts etc.) of Memcache using something like Monit. Here is an example config: Monitoring Memcache with Monit
It is good to monitor overall memory usage of memcached for resource planning.
Track the eviction statistics counter to know how often cached items are getting evicted due to lack of memory.
Track cache hit/misses, reclaims(The number of expired items removed to allow space for new writes), current connections, flush cmd which is available in stats.
Memcached stats (can be read from telnet, libmemcached, language specific library)
stats
stats slabs
stats items
stats sizes
stats detail
stats settings
run the above commands using telnet
or simply run using netcat
echo "stats settings" | nc 127.0.0.1 11211
Other scripts/tools
https://github.com/memcached/memcached/tree/master/scripts
memcached top
memcached metrics per slab
This is what memcached metrics per slab looks like
desc for some fields can be found here.
Memcached Prometheus exporter - Exports metrics from memcached servers for consumption by Prometheus.