PBS - nodes are free, but they do not start a job - CentOS

I am a new administrator of PBS. I downloaded and installed Torque 4.2.6 and used the default configuration provided by torque.setup.
The OS is CentOS with kernel 2.6.18.
I stopped all firewalls and confirmed that ssh/scp works bi-directionally between the server and the nodes.
After configuration, everything looked fine; small batches of jobs finished well.
When I submitted 10,000 jobs, about 70% of them finished, but the remaining jobs never started. I found that the server_priv/jobs directory still contains them.
I checked the log files... but I could not find any clue to the problem.
I checked disk space using df; there is 10% free (more than 100 GB), which looks like enough to run PBS jobs.
Before I check other things, I wanted to ask for help here.
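In case it helps, here are the standard Torque diagnostics I plan to run next (job ID 12345 and node01 are placeholders):

qstat -f 12345          # full server-side state of one of the stuck jobs
tracejob 12345          # follow a job through the server/MOM logs
pbsnodes -a             # confirm the nodes are reported free, not down/offline
momctl -d 3 -h node01   # ask a MOM daemon directly what it thinks is running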

How to wait for full cloud-initialization before VM is marked as running

I am currently configuring a virtual machine to act as an agent within Azure (using an Ubuntu image), where the additional configuration runs through a cloud-init file.
Among other things, it contains the 'fix' below in bootcmd and multiple steps in runcmd.
However, the machine already shows the state 'running' in the Azure portal while the cloud configuration phase (cloud_config_modules) is still executing. As a result, pipelines see the machine as ready for use while not everything is installed/configured yet, and they break.
I tried a couple of things that did not have the desired effect, after which I stumbled on the following article/bug:
The proposed solution worked; however, I switched to a RHEL image and it stopped working.
I noticed this image is not using walinuxagent, as the solution states, but waagent, so I tried replacing that as in the example below, without any success.
bootcmd:
- mkdir -p /etc/systemd/system/waagent.service.d
- echo "[Unit]\nAfter=cloud-final.service" > /etc/systemd/system/waagent.service.d/override.conf
- sed "s/After=multi-user.target//g" /lib/systemd/system/cloud-final.service > /etc/systemd/system/cloud-final.service
- systemctl daemon-reload
After this, I also tried moving the runcmd steps into bootcmd. This resulted in a boot that took ages and eventually froze.
Since I am not that familiar with RHEL, or Linux overall, I wanted to ask whether anyone has suggestions I could additionally try.
(Perhaps some other configuration that makes waagent wait on cloud-final.service?)
However, the machine already shows the state 'running' while the cloud configuration phase (cloud_config_modules) is still executing.
Could you please be more specific? Where did you read the machine state?
The reason I ask is that cloud-init status will report status: running until cloud-init is done, at which point it will report status: done.
What is the purpose of waiting until cloud-init is done? I'm not sure exactly what you are expecting to happen, but here are a couple of things that might help.
If you want to execute a script "at the end" of cloud-init initialization, you could put the script directly in runcmd. If you want to wait for cloud-init from an external script, you can run cloud-init status --wait, which prints a visual indicator and returns once cloud-init is complete.
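For example, an external readiness check could gate on cloud-init like this (a minimal sketch; both commands are standard cloud-init CLI subcommands):

# Block until every cloud-init stage, including runcmd, has finished
cloud-init status --wait
# Then show the final state (status: done) and any errors
cloud-init status --long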
On all but the oldest Azure Linux VM images, cloud-init rather than WALinuxAgent acts as the VM provisioner. The VM is marked provisioned by the cloud-init Azure datasource module very early during cloud-init processing (source), before any of the cloud-init modules configurable with user data run. WALinuxAgent is only responsible for provisioning Azure VM extensions. It does not appear to be possible to delay sending the 'VM ready' signal to Azure without modifying the VM image and patching the source code of the cloud-init Azure datasource.

MDT step-by-step deployment capture not generating wim

New to MDT.
I am following the Microsoft step-by-step guides:
https://learn.microsoft.com/en-us/windows/deployment/windows-10-poc
https://learn.microsoft.com/en-us/windows/deployment/windows-10-poc-mdt
I am at step 28 (in the second guide):
Deploy Windows 10 in a test lab using Microsoft Deployment Toolkit
I launched the deployment wizard in a VM on the host system and watched the process run for an hour. It finally finishes, but it does not create the .wim on the server share as expected and as referenced in the Bootstrap.ini:
Bootstrap.ini
[Settings]
Priority=Default
[Default]
DeployRoot=\\SRV1\MDTBuildLab$
UserDomain=CONTOSO
UserID=MDT_BA
UserPassword=pass#word1
SkipBDDWelcome=YES
I have verified that the "DeployRoot" share exists, that it can be connected to using the provided credentials, and that it has the correct permissions to create/delete files.
Not sure what I'm missing, but my expectation was that a .wim would be created in \\SRV1\MDTBuildLab$\Captures, and there is nothing in that folder.
Just before stopping, the deployment wizard reboots several times in quick succession, which to me doesn't appear correct, but as I have never witnessed a successful capture I can't say for sure this isn't what's supposed to happen.
I'm not even sure where I can view any log files to figure out why it fails.
Any assistance appreciated!
Further info:
I activated monitoring. The deployment gets to step 86 of 93. The last thing I see is "Applying WinPE (BD)", or something similar, and then it restarts. Several quick reboots follow (the loading bar appears for a second or two, then it reboots again), which I think are failing, and finally it gives up. The process never completes!
When I attempt to mount the client REFW10X64-001.vhdx to check the logs, I am greeted with this message:
The disk image isn't initialized, contains partitions that aren't recognizable, or contains volumes that haven't been assigned drive letters. Please use the Disk Management snap-in to make sure that the disk, partitions, and volumes are in a usable state.
So it looks like the last step totally screwed the disk, which would explain the last several boots failing to load anything.
So no errors no warnings, no logs, no finish and no wim generated.
How do I troubleshoot this?
I know this post is old, but the normal behavior would be as follows:
Using the boot image, you boot into WinPE
The task sequence is started and the OS gets applied to the disk
Reboot
Boot into full Windows where the task sequence also continues
Under full Windows, one of the last steps is that WinPE gets applied again
Reboot
Computer boots automatically into WinPE
The wim file gets created (WinPE is running on the RAM disk and the regular C: drive (and any additional drives) is being mirrored into the wim file)
Computer performs the FINISHACTION.
We would need at least BDD.log and smsts.log to troubleshoot further. My guess is that WinPE was not applied correctly.
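For reference, the usual places to look for those logs (typical MDT defaults; exact paths can vary by phase and configuration):

X:\MININT\SMSOSD\OSDLOGS\        while the task sequence is running in WinPE
C:\MININT\SMSOSD\OSDLOGS\        while the task sequence is running in full Windows
%WINDIR%\TEMP\DeploymentLogs\    after the deployment has finished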

FabricDCA and MaxDiskQuotaInMB Configuration

There are two parts to this question. First, what falls under the purview of the Diagnostics MaxDiskQuotaInMB setting? Is it everything under SvcFab/Log? Just SvcFab/Log/AppInstanceData/? Having more info on this would be nice.
Second, what is the proper course of action if FabricDCA.exe is running but the SvcFab/Log and SvcFab/Log/AppInstanceData/ folders exceed the limits we've set on their size? My team set the quota to 10,000 MB, but SvcFab/Log regularly takes up 12-16 GB.
The cluster configuration on Azure recognizes the change to MaxDiskQuotaInMB, but there seems to be no impact on the node itself. I've also tried restarting FabricDCA.exe, and after several hours that has not helped either.
One node in our cluster had so much space taken up by logs (over our limit) that the remaining storage space was down to 1 MB.
Posting a more complete answer since it may be helpful to other people.
Most of the things under the SvcFab/Log folder fall under the quota set by MaxDiskQuotaInMB. A few things may not, but the majority of what usually takes disk space is included. Keep in mind also that the task that cleans the disk usually runs every 5 minutes, so you may see usage go over the quota within that window.
If FabricDCA.exe is not properly cleaning files from this folder, you may be hitting a bug in the .NET runtime where all System.Threading.Timer instances stop firing; FabricDCA relies on these timers to clean the disk.
This is the bug on the .NET core side tracking the issue: (https://github.com/dotnet/coreclr/issues/26771). It seems to happen when the machine is running out of memory intermittently.
An auto-mitigation was added to FabricDCA in Service Fabric 7.0.
The manual mitigation is usually to kill the FabricDCA.exe process.
The process should start again and, after a few minutes, resume cleaning.
You mentioned that you already tried killing FabricDCA.exe, so maybe the mitigation above does not work for you. In that case, look at the Service Fabric cluster manifest directly: it may be that your new configuration appears to be accepted by the ARM template deployment but never reaches the cluster manifest, which is the source of truth here.
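One quick way to check (a sketch; it assumes sfctl is installed and already connected to the cluster, and that the setting name appears verbatim in the manifest XML; Get-ServiceFabricClusterManifest from PowerShell works just as well):

# Dump the live cluster manifest and look for the value actually in effect
sfctl cluster manifest | grep -i -A 2 MaxDiskQuotaInMB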
Update:
There was a regression introduced as part of the auto-mitigation above which caused the AppInstanceFolder to fill up the disk. This is fixed in SF version 7.0.466.

Rundeck - any command execution fails when running on 5.8k nodes

I'm running a Rundeck server to delegate a simple script to 5.8k other Linux servers.
The very simple script is below:
#!/bin/bash
# print this machine's hostname
A=$(hostname)
echo "$A"
When I run the same job with a smaller number of targets (4,089 nodes), the commands work fine.
I tried looking at my service.log, and it is not adding anything new.
Any ideas on how to run on all 5.8k nodes? And where should I look for errors?
Rundeck does not have a hard limit on the number of nodes; capacity depends on how many executions you want to run and on how much RAM, CPU, and disk space you have.
Maybe you need to increase the Java heap size:
https://rundeck.org/docs/administration/maintenance/tuning-rundeck.html#java-heap-size
And how to adapt this to your SSH plugin:
https://rundeck.org/docs/administration/maintenance/tuning-rundeck.html#built-in-ssh-plugins
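As a sketch of what the heap change looks like (hedged: the exact file and variable name depend on the install method and Rundeck version, so follow the tuning guide above):

# In /etc/sysconfig/rundeckd (RPM) or /etc/default/rundeckd (DEB):
# give the JVM a larger fixed-size heap
RDECK_JVM_SETTINGS="-Xms4g -Xmx4g"

# then restart the service
sudo systemctl restart rundeckd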

Can I configure icecream (icecc) to do zero local jobs

I'm trying to build a project on a rather underpowered system (an Intel Compute Stick with 1 GB of RAM). Some of the compilation steps run out of memory. I've configured icecc so that it can send some jobs to a more powerful machine, but it seems that icecc will always do at least one job on the local machine.
I've tried setting ICECC_MAX_JOBS="0" in /etc/icecc/icecc.conf (and restarting iceccd), but the comments in this file say:
# Note: a value of "0" is actually interpreted as "1", however it
# also sets ICECC_ALLOW_REMOTE="no".
I also tried disabling the icecc daemon on the compute stick by running /etc/init.d/icecc stop. However, it seems that icecc still puts one job on the local machine (perhaps when the daemon is off it puts all jobs on the local machine?).
The project is makefile-based, and I appear to be stuck on a bottleneck step where calling make with -j > 1 still only issues one job, and this compilation exhausts the system memory.
The only workaround I can think of is to compile on a different system and ship the binaries back over, but I expect to enter a tweak/build/evaluate cycle on this platform, so I'd like to be able to work from the compute stick directly.
Both systems are running Ubuntu 14.04, if that helps.
I believe this is not supported, since icecc resorts to compiling on the host machine itself if there are network issues. The best solution would be to compile on the remote machine and copy the resulting binary back.
Have you tried setting ICECC_TEST_REMOTEBUILD in the client's terminal (where you run make)?
export ICECC_TEST_REMOTEBUILD=1
In my tests this always forces all sources to be compiled remotely.
Just remember that linking is always done on the local machine.
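A hypothetical session on the compute stick (it assumes the Debian/Ubuntu default of the icecc compiler wrappers living in /usr/lib/icecc/bin; adjust the path and job count for your setup):

export ICECC_TEST_REMOTEBUILD=1
export PATH=/usr/lib/icecc/bin:$PATH   # so cc/gcc/g++ resolve to the icecc wrappers
make -j8                               # compile jobs go remote; linking stays local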