How can I find the jobId that caused node failure in slurm? - hpc

I know that sacctmgr command can list the event history of nodes with the reason.
sacctmgr show event Start=09/01-00:00 format=nodename,timestart,timeend,state,reason,user
This command gives the following output
gnodeXX 2020-09-04T20:21:34 2020-09-05T01:21:38 DRAIN Kill task failed root(ZZ)
gnodeXX 2020-09-09T16:44:55 2020-09-09T17:50:21 DOWN* Not responding slurm(DDDD)
Is there any way I can get the jobId or username that caused the Node Failure or any info on the task for which kill failed? The user column gives either of two outputs for all results root(ZZ) and slurm(DDDD) and I am not sure what these imply.

Related

How to resolve "Invalid Sequence Token" when using cloudwatch agent?

I'm seeing the following warning in the /var/log/amazon/amazon-cloudwatch-agent/amazon-cloudwatch-agent.log:
2021-10-06T06:39:23Z W! [outputs.cloudwatchlogs] Invalid SequenceToken used, will use new token and retry: The given sequenceToken is invalid. The next expected sequenceToken is: 49619410836690261519535138406911035003981074860446093650
But there is no mention about which file is really the one that it's failing. Not even when I add "debug": true to the /opt/aws/amazon-cloudwatch-agent/bin/config.json.
cat /opt/aws/amazon-cloudwatch-agent/bin/config.json|jq .agent
{
"metrics_collection_interval": 60,
"debug": true,
"run_as_user": "root"
}
I have many (28) files in my .logs.logs_collected.files.collect_list section of the config.json file, so how can I find which file is exactly causing trouble?
As of 2021-11-29 a PR to improve the log messages has been merged to the cloudwatch-agent but a new version of the cloudwatch-agent has not been released yet, the next version after v1.247349.0 will likely include a fix for this.
The fix will change the log statements to say
INFO: First time sending logs to %v/%v since startup so sequenceToken is nil, learned new token: xxxx: yyyy: This is an INFO message, as this behaviour is expected at startup for example.
WARN: Invalid SequenceToken used (%v) while sending logs to %v/%v, will use new token and retry: xxxxxv: This on the other hand is not expected and may mean that someone else is writing to the loggroup/logstream concurrently.
If those warnings come right after a restart of the cloudwatch agent (cwagent) then you can safely ignore them, it's expected behaviour . The cloudwatch agent does not save the next sequence token in its persistent state so on restart it will "learn" the correct sequence number by issuing a PutLogEvent with no sequence token at all, that returns an InvalidSequenceTokenException with the next sequence token to use. So it's expected to see those at startup, anyway I proposed a PR to amazon-cloudwatch-agent to improve those log messages.
If the "Invalid SequenceToken used" is seen long after the restart then you may have other issues.
The "Invalid SequenceToken used" error usually means that two entities/sources are trying to write to the same log group/log stream as mentioned in 2 (which is really for the old awslogs agent but still useful):
Caught exception: An error occurred (InvalidSequenceTokenException)
when calling the PutLogEvents operation: The given sequenceToken is
invalid[…] -or- Multiple agents might be sending log events to log
stream[…] – You can't push logs from multiple log files to a single
log stream. Update your configuration to push each log to a log
stream-log group combination.
I could be that the amazon cloudwatch agent itself it's trying to upload the same file twice because you have duplicates in your config.json.
So first print all your log group / log stream pairs in your config.json with:
cat /opt/aws/amazon-cloudwatch-agent/bin/config.json|jq -r '.logs.logs_collected.files.collect_list[]|"\(.log_group_name) \(.log_stream_name)"'|sort
which should give an output similar to:
/tableauserver/apigateway apigateway_node5-0.log
/tableauserver/apigateway control_apigateway_node5-0.log
/tableauserver/appzookeeper appzookeeper-discovery_node5-1.log
...
/tableauserver/vizqlserver vizqlserver_node5-3.log
Then you can use uniq -d to find the duplicates in that list with:
cat /opt/aws/amazon-cloudwatch-agent/bin/config.json|jq -r '.logs.logs_collected.files.collect_list[]|"\(.log_group_name) \(.log_stream_name)"'|sort|uniq -d
# The list should be empty otherwise you have duplicates
If that command produces any output it means that you have duplicates in your config.json collect_list.
I personally think that cwagent itself should print the "offending" loggroup/logstream in the logs so I opened in issue in amazon-cloudwatch-agent GitHub page.

Spring Batch Job Stop Using jobOperator

I have Started my job using jobLauncher.run(processJob,jobParameters); and when i try stop job using another request jobOperator.stop(jobExecution.getId()); then get exeption :
org.springframework.batch.core.launch.JobExecutionNotRunningException:
JobExecution must be running so that it can be stopped
Set<JobExecution> jobExecutionsSet= jobExplorer.findRunningJobExecutions("processJob");
for (JobExecution jobExecution:jobExecutionsSet) {
System.err.println("job status : "+ jobExecution.getStatus());
if (jobExecution.getStatus()== BatchStatus.STARTED|| jobExecution.getStatus()== BatchStatus.STARTING || jobExecution.getStatus()== BatchStatus.STOPPING){
jobOperator.stop(jobExecution.getId());
System.out.println("###########Stopped#########");
}
}
when print job status always get job status : STOPPING but batch job is running
its web app, first upload some CSV file and start some operation using spring batch and during this execution if user need stop then stop request from another controller method come and try to stop running job
Please help me for stop running job
If you stop a job while it is running (typically in a STARTED state), you should not get this exception. If you have this exception, it means you have stopped your job while it is currently stopping (that is what the STOPPING status means).
jobExplorer.findRunningJobExecutions returns only running executions, so if in the next line right after this one you have a job in STOPPING status, this means the status changed right after calling jobExplorer.findRunningJobExecutions. You need to be aware that this is possible and your controller should handle this case.
When you tell spring batch to stop a job it goes into STOPPING mode. What this means is it will attempt to complete the unit of work chunk it is currently processing but then stop working. Likely what's happening is you are working on a long running task that is not finishing a unit of work (is it hung?) so it can't move from STOPPING to STOPPED.
Doing it twice rightly leads to an Exception because your job is already STOPPING by the time you did it the first time.

How to stop reporting FAILED of systemctl unit

I have an systemctl service unit which has some runtime dependencies which get resolved during boot. Many times it reports "FAILED" state during boot. This service unit has "Restart=always", so ultimately after boot this unit starts successfully. But, during boot around 3-4 times it reports FAILED which I want to avoid.
Is there a way to ignore the "FAILED" state of service unit being reported?
(As I know it will succeed once the dependency is resolved or will keep retrying)
I found that the return value (including error) reported by failure of service unit can be ignored prepending an hyphen while configuring the ExecStart.
From manual:
https://www.freedesktop.org/software/systemd/man/systemd.service.html#BusName=
VIZ:
"-" If the executable path is prefixed with "-", an exit code of the command normally considered a failure (i.e. non-zero exit status or abnormal exit due to signal) is recorded, but has no further effect and is considered equivalent to success.
ExecStart=-/sbin/getty

Starting Parpool in MATLAB

I tried starting parpool in MATLAB 2015b. Command as follows,
parpool('local',3);
This command should allocate 3 workers. Whereas I received an error stating failure to start parpool. The error message as follows,
Error using parpool (line 94)
Failed to start a parallel pool. (For information in addition to
the causing error, validate the profile 'local' in the Cluster Profile
Manager.)
A similar query was posted in (https://nl.mathworks.com/matlabcentral/answers/196549-failed-to-start-a-parallel-pool-in-matlab2015a). I followed the same procedure, to validate the local profile as per the suggestions.
Using distcomp.feature( 'LocalUseMpiexec', false); or distcomp.feature( 'LocalUseMpiexec', true) in startup.m didn't create any improvement. Also when attempting to validate local profile still gives error message as follows,
VALIDATION DETAILS
Profile: local
Scheduler Type: Local
Stage: Cluster connection test (parcluster)
Status: Passed
Description:Validation Passed
Command Line Output:(none)
Error Report:(none)
Debug Log:(none)
Stage: Job test (createJob)
Status: Failed
Description:The job errored or did not reach state finished.
Command Line Output:
Failed to determine if job 24 belongs to this cluster because: Unable to
read file 'C:\Users\varad001\AppData\Roaming\MathWorks\MATLAB
\local_cluster_jobs\R2015b\Job24.in.mat'. No such file or directory..
Error Report:(none)
Debug Log:(none)
Stage: SPMD job test (createCommunicatingJob)
Status: Failed
Description:The job errored or did not reach state finished.
Command Line Output:
Failed to determine if job 25 belongs to this cluster because: Unable to
read file 'C:\Users\varad001\AppData\Roaming\MathWorks\MATLAB
\local_cluster_jobs\R2015b\Job25.in.mat'. No such file or directory..
Error Report:(none)
Debug Log:(none)
Stage: Pool job test (createCommunicatingJob)
Status: Skipped
Description:Validation skipped due to previous failure.
Command Line Output:(none)
Error Report:(none)
Debug Log:(none)
Stage: Parallel pool test (parpool)
Status: Skipped
Description:Validation skipped due to previous failure.
Command Line Output:(none)
Error Report:(none)
Debug Log:(none)
I am receiving these error only in my cluster machine. But launching parpool in my standalone PC is working perfectly. Is there a way to rectify this issue?

Inspect and retry resque jobs via redis-cli

I am unable to run the resque-web on my server due to some issues I still have to work on but I still have to check and retry failed jobs in my resque queues.
Has anyone any experience on how to peek the failed jobs queue to see what the error was and then how to retry it using the redis-cli command line?
thanks,
Found a solution on the following link:
http://ariejan.net/2010/08/23/resque-how-to-requeue-failed-jobs
In the rails console we can use these commands to check and retry failed jobs:
1 - Get the number of failed jobs:
Resque::Failure.count
2 - Check the errors exception class and backtrace
Resque::Failure.all(0,20).each { |job|
puts "#{job["exception"]} #{job["backtrace"]}"
}
The job object is a hash with information about the failed job. You may inspect it to check more information. Also note that this only lists the first 20 failed jobs. Not sure how to list them all so you will have to vary the values (0, 20) to get the whole list.
3 - Retry all failed jobs:
(Resque::Failure.count-1).downto(0).each { |i| Resque::Failure.requeue(i) }
4 - Reset the failed jobs count:
Resque::Failure.clear
retrying all the jobs do not reset the counter. We must clear it so it goes to zero.