Rundeck - any command execution fails when running on 5.8k nodes

I'm running a Rundeck server to delegate a simple script to 5.8k other Linux servers.
The very simple script is below:
#!/bin/bash
A=$(hostname)
echo "$A"
When I run the same job with a smaller number of targets (4089 nodes),
the commands work fine.
I tried looking at my service.log, but nothing new is being written to it.
Any ideas on how to run on all 5.8k nodes? And where should I look for errors?

Rundeck does not impose a limit on the number of nodes. What it can handle depends on how many executions you want to run and on how much RAM, how many processors, and how much disk space the server has.
Maybe you need to increase the Java heap size:
https://rundeck.org/docs/administration/maintenance/tuning-rundeck.html#java-heap-size
The same page also explains how to tune the built-in SSH plugins:
https://rundeck.org/docs/administration/maintenance/tuning-rundeck.html#built-in-ssh-plugins

Related

Does assigning more nodes to a job on a SLURM server increase available RAM?

I am working with a program that needs a lot of RAM. Currently I am running it on a SLURM cluster. Each node has 125GB RAM. When I submit the job to a single node, it eventually fails because it runs out of memory. My rather naive question, as I am new to working on servers, is:
Does assigning more nodes with the --nodes flag increase the available RAM for the submitted job?
For example:
When assigning 10 nodes instead of 1, with the directive below, the program fails at the same point as with one node.
#SBATCH --nodes=10
Is there some other way to combine RAM from multiple nodes for a single job?
Any and all advice is welcome!
That depends on your program, but most likely no.
To use multiple nodes on a Slurm cluster (or any cluster, for that matter), your program needs to be set up in a very specific way, i.e. it needs inter-node communication. This is usually done via MPI, and the whole program has to be designed around it.
So if your program uses MPI, it may be able to split the workload over several nodes. Even that does not guarantee lower memory use per node, as that is usually not the goal of such a parallelization.

Running tests in parallel on multiple machines using py.test

I know that UI tests can be run in parallel on multiple machines using selenium grid. How about API tests?
I looked at the pytest-xdist plugin, and it can run tests in parallel on the local machine using py.test -n NUM, which sends tests to multiple CPUs and runs them in parallel. This may not be as effective or fast if the number of tests we would like to run in parallel is much larger than the number of CPUs on the machine. For example: the machine has 4 CPUs and we would like to run 50 tests in parallel.
And it seems that to run the tests on a remote machine we need to do something like
py.test -d --tx socket=192.168.1.102:8888 --rsyncdir mypkg mypkg
I am wondering if there is a way to distribute the tests to multiple remote machines and run them in parallel. For example: if I have 1000 tests and 50 remote machines, I would like each remote machine to run one or more tests at the same time so that the tests complete faster. That means all 1000 tests would complete in the time it takes to run 20 tests or fewer.
Thanks.
It looks like you want the load distribution mode, followed by multiple invocations of the --tx argument:
py.test --dist=load --tx socket=192.168.1.110:8888 --tx socket=192.168.1.111:8888 --tx socket=192.168.1.112:8888 --rsyncdir mypkg mypkg
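With many remote machines, typing one --tx argument per host gets tedious. As a rough sketch only, the command line could be assembled from a list of hosts. The helper name build_xdist_command and its parameters are made up for illustration; the flags are the ones used in the command above, and the hosts are assumed to already be listening on port 8888.

```python
import shlex


def build_xdist_command(hosts, port=8888, package="mypkg"):
    """Build a py.test argument list with one --tx socket spec per host.

    Hypothetical helper: only the flags shown in the answer are used.
    """
    args = ["py.test", "--dist=load"]
    for host in hosts:
        # One --tx per remote machine that should receive tests.
        args += ["--tx", f"socket={host}:{port}"]
    # Sync the package to the workers, then use it as the test target.
    args += ["--rsyncdir", package, package]
    return args


hosts = ["192.168.1.110", "192.168.1.111", "192.168.1.112"]
command = build_xdist_command(hosts)
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run(command) as-is, which avoids shell quoting issues if the host list comes from a file or an inventory.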
I'm sure you've looked at the CPU usage of the Python processes when running the tests. If you are doing what I expect you are doing (running an integration test suite against a single instance of a network service with high response times), your test suite isn't CPU bound but is actually I/O bound. For this type of workload, CPU usage may appear high, but it actually includes the time the test runner spent waiting for a response from the system under test.
The biggest problem I've encountered when parallelizing that type of test suite is that the order in which tests complete sometimes matters. When run in parallel, tests finish in a different order than when they run in series, just due to variation in response times, which causes intermittent and difficult-to-troubleshoot test failures.
If that doesn't happen with multiple cores on a single machine, that's a good sign your plan will work. That having been said, because there is operational overhead involved with keeping any pool of hosts around - patching with updates, dealing with configuration, provisioning, and networking, not to mention other unexpected issues, I suggest you try something different.
I think you should consider refactoring your test code to use asynchronous IO instead of setting up the test grid. When you do this correctly, multiple tests will be able to run on one core at the same time. Your sysadmin (which may be you!) will thank you.
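To illustrate the idea, here is a minimal sketch using Python's asyncio. It is not a real test suite: fake_api_test is a hypothetical stand-in that simulates a slow network response with asyncio.sleep, and the 0.2 second response time is an assumed value. The point is only that 50 I/O-bound "tests" can overlap their waiting on a single core.

```python
import asyncio
import time


async def fake_api_test(name, response_time):
    """Hypothetical stand-in for one API test against a slow service."""
    # Simulates waiting for the service; the event loop runs other
    # tests while this one is waiting.
    await asyncio.sleep(response_time)
    return name, "passed"


async def run_all(tests):
    # Start every test at once and collect results in submission order.
    return await asyncio.gather(
        *(fake_api_test(name, delay) for name, delay in tests)
    )


tests = [(f"test_{i}", 0.2) for i in range(50)]
start = time.perf_counter()
results = asyncio.run(run_all(tests))
elapsed = time.perf_counter() - start
print(f"{len(results)} tests finished in {elapsed:.2f}s")
```

Run serially, these 50 simulated tests would take about 10 seconds in total. Because the waits overlap, the whole batch should finish in roughly the time of the slowest single test. Real tests would need an asynchronous HTTP client in place of the sleep.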

Spark fails with too many open files on HDInsight YARN cluster

I am running into the same issue as in this thread with my Scala Spark Streaming application: Why does Spark job fail with "too many open files"?
But I am using Azure HDInsight to deploy my YARN cluster, and I don't think I can log into the machines and update the ulimit on all of them.
Is there any other way to solve this problem? I cannot reduce the number of reducers by too much either, or my job will become much slower.
You can ssh into all nodes from the head node (the Ambari UI shows the FQDN of every node).
ssh sshuser@nameofthecluster.azurehdinsight.net
You can then write a custom action that alters the settings on the necessary nodes if you want to automate this.
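Before changing anything, it can help to confirm what limit a node currently has. The snippet below is only a sketch for inspecting the limit from Python on a Linux node, using the standard-library resource module (Unix only). It reads the open-file limit of the current process, which may differ from the limit applied to the Spark executors if they are started by a different user or service.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


soft, hard = open_file_limits()
# resource.RLIM_INFINITY means the limit is unlimited.
print(f"soft limit: {soft}, hard limit: {hard}")
```

The soft limit is the one a process actually hits; it can be raised up to the hard limit without extra privileges.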

Can I configure icecream (icecc) to do zero local jobs

I'm trying to build a project on a rather underpowered system (an Intel Compute Stick with 1GB of RAM). Some of the compilation steps run out of memory. I've configured icecc so that it can send some jobs to a more powerful machine, but it seems that icecc will always do at least one job on the local machine.
I've tried setting ICECC_MAX_JOBS="0" in /etc/icecc/icecc.conf (and restarting iceccd), but the comments in this file say:
# Note: a value of "0" is actually interpreted as "1", however it
# also sets ICECC_ALLOW_REMOTE="no".
I also tried disabling the icecc daemon on the compute stick by running /etc/init.d/icecc stop. However, it seems that icecc is still putting one job on the local machine (perhaps if the daemon is off it's putting all jobs on the local machine?).
The project is makefile based, and it appears that I'm stuck on a bottleneck step where calling make with -j > 1 still only issues one job, and this compilation is exhausting the system memory.
The only workaround I can think of is to compile on a different system and then ship the binaries back over, but I expect to enter a tweak/build/evaluate cycle on this platform, so I'd like to be able to work from the compute stick directly.
Both systems are running Ubuntu 14.04, if that helps.
I believe it is not supported, since icecc falls back to compiling on the host machine itself if there are network issues. The best solution would be to compile on the remote machine and copy back the resulting binary.
Have you tried setting ICECC_TEST_REMOTEBUILD in the client's terminal (where you run make)?
export ICECC_TEST_REMOTEBUILD=1
In my tests this always forces all sources to be compiled remotely.
Just remember that linking is always done on the local machine.

PBS - nodes are free, but they do not start a job

I am a new administrator of PBS. I downloaded and installed torque-4.2.6, and I used the default configuration provided by torque.setup.
The OS is CentOS with kernel 2.6.18.
I stopped all the firewalls. I confirmed that ssh/scp works bi-directionally between the server and the nodes.
After configuration, everything looked fine. A small number of jobs finished without problems.
When I submitted 10000 jobs, about 70% of them finished, but the remainder do not start. I found that the server_priv/jobs directory still contains those jobs.
I checked the log files, but I could not find any clue to the problem.
I checked disk space using df, and there is 10% (more than 100GB) of free space, which looks like enough to run PBS jobs.
Before I check other things, I am asking others on this site for help.