Host availability for a user with limits consideration - LSF

Say I have a host with 10 slots.
I have applied a user limit to that host as follows:
SLOT = 5
JOB = 3
Is there a command to find out whether this host can run more jobs from that user?
Or, is there a command to find out whether the host is closed for that user?
Thanks in advance!

Taking your description of your limit, I imagine its definition looks something like this:
Begin Limit
NAME = L1
USERS = user1
SLOTS = 5
JOBS = 3
HOSTS = hostA
End Limit
If I then submit 3 jobs as user1, I can run blimits -u user1 to see whether any limits are imposed on user1, or blimits -m hostA to see whether any limits are relevant to hostA. I can also combine these filters to see whether any relevant limits are imposed on user1 on hostA:
$ blimits -u user1 -m hostA
INTERNAL RESOURCE LIMITS:

NAME    USERS    QUEUES   HOSTS    PROJECTS   SLOTS   MEM   TMP   SWP   JOBS
L1      user1    -        hostA    -          3/5     -     -     -     3/3
The last column of this output (JOBS) shows me that this user has reached his limit for jobs on that host (running 3 jobs out of a possible 3 allowed).
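If you want to check this from a script, here is a minimal sketch (assuming the blimits output format shown above, where the final column is JOBS in used/limit form):
blimits -u user1 -m hostA | awk '
    $NF ~ /^[0-9]+\/[0-9]+$/ {
        split($NF, a, "/")             # a[1] = jobs used, a[2] = jobs allowed
        if (a[1] + 0 >= a[2] + 0) saturated = 1
    }
    END { exit saturated }'
An exit status of 1 then means the JOBS limit is saturated, i.e. the host is effectively closed to that user; as far as I know there is no single built-in "is this host closed for this user" command.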

Ubuntu crashing - How to diagnose

I have a dedicated server running Ubuntu 20.04, with cPanel 106.11, MySQL 8, PHP 8.1, and Elasticsearch 7.17.8, and I run Magento 2.4.5-p1. ConfigServer Security & Firewall is enabled.
Every couple of days I get a monitoring alert saying my server doesn't respond to ping, and the host has to do a hard reboot. They are getting frustrated with this and say they will turn off monitoring unless I sort it out, as they have checked all the hardware, which is fine.
This happens at different times, usually overnight.
I have looked through syslog, the MySQL log, the Elasticsearch log, the Magento 2 logs, the Apache log, and kern.log, and I can't find the cause of the issue.
I have enabled "sar"; RAM usage around the time is 64% and CPU usage is between 5-10%.
What else can I look at to try and diagnose this issue?
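Since sar is already collecting, the historical samples around a crash window can be replayed from the daily data files (paths assume Ubuntu's sysstat package; sa05 is a hypothetical file for the 5th of the month):
sar -r -f /var/log/sysstat/sa05    # memory utilisation over the day
sar -q -f /var/log/sysstat/sa05    # load average and run queue
sar -u -f /var/log/sysstat/sa05    # CPU usage, including %iowait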
Additional info requested by Wilson:
select count - https://justpaste.it/6zc95
show global status - https://justpaste.it/6vqvg
show global variables - https://justpaste.it/cb52m
full process list - https://justpaste.it/d41lt
status - https://justpaste.it/9ht1i
show engine innodb status - https://justpaste.it/a9uem
top -b -n 1 - https://justpaste.it/4zdbx
top -b -n 1 -H - https://justpaste.it/bqt57
ulimit -a - https://justpaste.it/5sjr4
iostat -xm 5 3 - https://justpaste.it/c37to
df -h, df -i, free -h and cat /proc/meminfo - https://justpaste.it/csmwh
htop - https://freeimage.host/i/HAKG0va
The server uses NVMe drives, 32 GB RAM, and 6 cores; MySQL runs on the same server as LiteSpeed.
The server has not gone down again since posting this, but the datacentre usually reboots it within 15-20 minutes, and 99% of the time it happens overnight. The server is not accessible over SSH when it crashes.
Rate Per Second = RPS; Rate Per Hour = RPHr
Suggestions to consider for your instance (these should all be settable from your cPanel, as they are all dynamic variables):
connect_timeout=30 # from 10 seconds to reduce aborted_connects RPHr of 75
innodb_io_capacity=900 # from 200 to use more of NVME IOPS capacity
thread_cache_size=36 # from 9 to reduce threads_created RPHr of 75
read_rnd_buffer_size=32768 # from 256K to reduce handler_read_rnd_next RPS of 5,805
read_buffer_size=524288 # from 128K to reduce handler_read_next RPS of 5,063
Many more opportunities exist to improve performance of your instance.
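All of the above are dynamic, so (assuming an account with the SUPER or SYSTEM_VARIABLES_ADMIN privilege) they can be applied at runtime and then persisted in the config file to survive the next restart; a sketch for two of them:
SET GLOBAL innodb_io_capacity = 900;
SET GLOBAL thread_cache_size = 36;
and under [mysqld] in /etc/my.cnf (the exact path may vary with cPanel):
innodb_io_capacity = 900
thread_cache_size = 36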

Load generated by distributed worker processes is not equivalent to load generated by a single process

I ran another test trying to figure out how users are allocated to worker nodes.
Here is my locustfile:
from locust import HttpUser, task, between

class QuickstartUser(HttpUser):
    wait_time = between(1, 2)

    @task
    def mytask(self):
        self.client.get("/")
It does nothing but access a Chinese search engine website that has never failed to load.
When I start 30 users on a single node, the RPS is 20.
locust -f try_baidu.py
and I got the running status and results below.
I switched to distributed mode using these commands in 3 terminals on my computer:
locust -f try_baidu.py --master #for master node
locust -f try_baidu.py --worker --master-host=127.0.0.1 #for worker node each
and I input the same number of users and the same hatch rate in the locust UI as above: 30 users and hatch rate 10.
I got the same RPS, around 20, and each worker node ran 15 users.
This shows that the number of users entered in the UI is the total amount to be simulated, spread across the worker nodes, like load-balancing the load generation.
But I don't know why the same number of users gives 2 different RPS figures when running on a single node (Scenario 1) versus distributed (Scenario 2). They should give the same or nearly the same RPS as the test above.
The only difference I can tell is that the above comparison runs on the same computer, while Scenario 2 has worker nodes on 2 remote Linux VMs. But is that the real reason?
The question may not have been asked very clearly, so I am adding some test results here to show what I get when running distributed and on a single node with the specified users.
Scenario 1: Single Node
Scenario 2: Trying to simulate 3 worker processes, each running 30 users, but getting an even lower RPS.
From the console I can see that each worker process starts 30 users as expected, but I have no idea why the RPS is only 1/3 of the single node's.
Scenario 3: Tripling the users to 90 for each worker process gives almost the same RPS as running on a single node.
It seems Scenario 3 is what I expected for triple the simulated amount. But why does the locust graphic panel say each worker process is running 90 users?
Scenario 4: To make sure locust truly distributes the specified users across the worker nodes, I put 30 users on a single worker node and got the same RPS as a single node (not distributed).
Do I have to add up the total users to be distributed across the worker nodes and input that total amount?
The problem was solved to some extent by moving the master node to another system. I guess something was interfering with the master collecting the requests reported by one of the workers, which resulted in half the expected RPS. This may suggest a possible cause when RPS is lower than it should be.
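For what it's worth, you can also take the web UI out of the equation and make the totals explicit on the command line (flag names as in recent Locust releases, so adjust for your version; <master-ip> is a placeholder):
locust -f try_baidu.py --master --headless -u 90 -r 10 --expect-workers 3
locust -f try_baidu.py --worker --master-host=<master-ip>
Here -u 90 is the total user count across all workers: the master waits until 3 workers have connected and then dispatches 30 users to each, which makes it easy to see whether the aggregated RPS scales with the total.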

How to set the max number of running jobs in LSF

I use IBM LSF. I want to configure the maximum number of running jobs.
For example, each user can submit 100000 jobs, but only 10 jobs can run; the other jobs stay pending.
I have tried setting UJOB_LIMIT in lsb.queues:
UJOB_LIMIT: Specifies the per-user job slot limit for the queue
and MAX_JOBS in lsb.users:
MAX_JOBS: Per-user or per-group job slot limit for the cluster. Total number of job slots that each user or user group can use in the cluster.
and MAX_PEND_JOBS in lsb.users:
MAX_PEND_JOBS: Per-user or per-group pending job limit.
But these didn't do what I want. I don't know how to set this; can anyone help?
To allow a user to submit 100000 jobs but only run 10 of them at a time, set this in lsb.users:
Begin User
USER_NAME    MAX_JOBS    MAX_PEND_JOBS
myuser       10          99990
End User
(99990 = 100000 - 10).
These limits apply to the whole cluster.
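After editing lsb.users, reconfigure and then verify; something like:
badmin reconfig
busers myuser
If I recall the busers output correctly, the MAX column should now show 10; once 10 jobs are running, additional submissions accumulate in the PEND column (up to MAX_PEND_JOBS).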

Is there a sane way to stagger a cron job across 4 hosts with Ansible?

I've been experimenting with writing playbooks for a few days and I'm writing a playbook to deploy an application right now. It's possible that I may be discovering it's not the right tool for the job.
The application is deployed HA across 4 systems on 2 sites and has a worst-case SLA of 1 hour. That's being accomplished with a staggered cron that runs every 15 minutes, i.e. s1 runs at 0, s2 runs at 30, s3 runs at 15, ...
I've looked through all kinds of looping, cron, and other modules that Ansible supports and can't really find a way for it to increment an integer by 15 as it moves across a list of hosts; then again, maybe that's a silly way of doing things.
The only communication that these 4 servers have with each other is a directory on a non-HA NFS share. So the reason I'm doing it as a 15 minute staggered cron is to survive network partitions and the death of the NFS connection.
My other thought is that I could just bite the bullet, make it a */15, and have an architecture that relies on praying NFS never dies, which would make the Ansible playbook trivial. I'm also considering deploying this with Fabric or a Bash script; it's just that the process for getting implementation plans approved, and for making changes by following them, is very heavy, and I want to simplify the steps someone has to take late at night.
Solution 1
You could use host_vars or group_vars, either in separate files or directly in the inventory.
I will try to produce a simple example that fits your description, using only the inventory file (and the playbook that applies the cron):
[site1]
host1 cron_restart_minute=0
host2 cron_restart_minute=30
host3 cron_restart_minute=15
host4 cron_restart_minute=45
[site2]
host5 cron_restart_minute=0
host6 cron_restart_minute=30
host7 cron_restart_minute=15
host8 cron_restart_minute=45
This uses host variables, you could also create other groups and use group variables, if the repetition became a problem.
In a playbook or role, you can simply refer to the variable.
On the same host:
- name: Configure the cron job
  cron:
    # your other options
    minute: "{{ cron_restart_minute }}"
On another host, you can access another host's variables like so:
hostvars['host2'].cron_restart_minute
Solution 2
If you want a more dynamic solution, for example because you keep adding and removing hosts, you could set a variable in a task using register or set_fact, and calculate it, for example, from the number of hosts in the single group that the current host belongs to.
Example:
- name: Set fact for cron_restart_minute
  set_fact:
    cron_restart_minute: "{{ (60 // (groups[group_names[0]] | length)) * groups[group_names[0]].index(inventory_hostname) }}"
I have not tested this expression end to end, but it is plain Jinja2 (Python under the hood). group_names is a 1-element array given the above inventory, since no host is in two groups at the same time. groups[group_names[0]] contains all the hosts in that group; the length filter gives its size, and index() finds the position of the current host by its inventory_hostname (0, 1, 2, 3), yielding minutes 0, 15, 30, 45.
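Putting Solution 2 together with the cron task from Solution 1, an untested end-to-end sketch (the job name and command are hypothetical placeholders):
- name: Stagger the application cron across the group
  hosts: all
  tasks:
    - name: Compute a per-host minute offset
      set_fact:
        cron_restart_minute: "{{ (60 // (groups[group_names[0]] | length)) * groups[group_names[0]].index(inventory_hostname) }}"

    - name: Configure the cron job
      cron:
        name: app-run                        # hypothetical job name
        minute: "{{ cron_restart_minute }}"
        job: /usr/local/bin/run-app          # hypothetical command
With 4 hosts in the group this yields minutes 0, 15, 30, and 45.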
Links to relevant docs:
Inventory
Variables, specifically this part.

Configuring replica set in a multi-data center

We have the following multi-data-center scenario:
Node1 --------- Node3
  |               |
  |               |
  |               |
Node2           Node4
Node1 and Node3 form a (sort of) replica set (for high availability).
Node2/Node4 are priority-0 members (they should never become primaries; they are solely for reads).
Caveat: what is the best way to design this, given that Node2 and Node4 are not accessible to one another because of the way we configured our VPN/firewalls,
essentially ruling out any heartbeat between Node2 and Node4?
Thanks Much
Here's what I have in mind:
Don't keep an even number of voting members in a set. So you need either another arbiter, or to make one of node 2/4 a non-voting member.
I'm using the C# driver, so I'm not sure whether you are using the same technology to build your application. Anyway, it turns out the C# driver obtains the complete list of available servers from the seeds (the servers you provide in the connection string) and tries to load-balance requests across all of them. In your situation, I guess you have application servers running in each data center. However, you probably don't want, for example, node 1 accepting connections from a different data center; that would significantly slow down the application. So you need some further settings:
Set node 2/4 to hidden nodes.
For applications running in the same data center as node 2/4, don't configure the replicaSet parameter in the connection string, but do configure readPreference=secondary. If you need to write, you'll have to configure another connection string to the primary node.
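For example (the hostnames and set name are placeholders), the read-only connection string for applications co-located with node 2 could omit the replicaSet parameter and pin the read preference, with a separate replica-set connection for writes:
mongodb://node2:27017/?readPreference=secondary
mongodb://node1:27017,node3:27017/?replicaSet=rs0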
If you make the votes of 2 and 4 also 0, then in a failover it should act as though only 1 and 3 are eligible. If you set them to hidden, you have to connect to them explicitly; MongoDB drivers will intentionally avoid them otherwise.
Other than that, nodes 2 and 4 have direct access to whatever becomes the primary, so I see no other problem.
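As a concrete sketch of the suggested configuration (the member indexes assume the nodes were added in the order node1..node4; check rs.conf() for your actual ordering):
cfg = rs.conf()
cfg.members[1].priority = 0   // node2: can never become primary
cfg.members[1].votes = 0
cfg.members[3].priority = 0   // node4: can never become primary
cfg.members[3].votes = 0
rs.reconfig(cfg)
Note that this leaves only node1 and node3 voting; with two voting members a majority needs both, so an arbiter at a third location is worth considering if you want automatic failover to survive the loss of a data center.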