TypeError--using slurm queue to submit pyiron jobs - hpc

I'm facing some issues while running pyiron jobs on my HPC via the pysqa adapter. I had accidentally erased the main pyiron directory containing pyiron, projects and resources folders. I had copied all the three from another cluster. The only thing that I think will cause problem is sqlite.db file in the resources folder. Previously, I had no issues running interactive VASP jobs through the adapter. I'm guessing something happened after the deletion incident.
The pyiron version I'm using is: 0.2.17
Here is a minimal example using an Interactive vasp job that I have tried:
from pyiron import Project
pr = Project('Al-test')
structure = pr.create_structure('Al', 'fcc', 4.05)
pr.remove_jobs(recursive=True)
from pysqa import QueueAdapter
sqa = QueueAdapter(directory='~/pyiron/resources/queues/')
sqa.queue_view
pr.job_table()
job = pr.create_job(pr.job_type.Vasp, 'job_int')
job.structure = structure
job.server.run_mode.interactive = True
job.executable.executable_path = '~/pyiron/resources/vasp/bin/run_vasp_5.4.4_std_mpi.sh'
job.input.incar['NCORE']=4
job.server.queue = 'slurm'
job.server.cores=16
job.server.view_queues()
sqa.get_queue_status()
job.run(run_again=True)
end of the error log:
~/pyiron/pyiron/pyiron/base/server/generic.py in queue_id(self, qid)
208 qid (int): queue ID
209 """
--> 210 self._queue_id = int(qid)
211
212 #property
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
Some inputs/feedback on this would be greatly appreciated.
Thanks!

We updated the queuing system interface in pyiron 0.3.X you can read more about this here:
https://pyiron.org/news/releases/2020/09/06/pyiron-0-3-X-HPC-release.html
For pyiron 0.3.X we have a detailed installation guide available on readthedocs.org:
https://pyiron.readthedocs.io/en/latest/source/installation.html#remote-hpc-cluster
So I highly recommend updating to pyiron 0.3.13.
Apart from this the error message basically says that the submission was not successful. If you navigate to the jobs working directory job.working_directory you should find a run_queue.sh script in the working directory. This is the script pyiron is using to submit the job to the queuing system. You can try to submit it manually using sbatch run_queue.sh this should print the queue id if successful and otherwise the error message from your queuing system.

Related

While running recorded simulation in Gatling, I got file not found error

now I'm trying to conduct load test with gatling.
I have been trying create gatling's simulation script via gatling recorder and
it was going well.
but, when I executed that, I encountered file not found error below although there are files
(in this case /step1/0006_request.json is exist)
request_6: Failed to build request: Resource /step1/0006_request.json not found
there are many json files and that error occurred some of specific post requests.
every requests which is failed are composed following setting.
using request header 'headers_6' below
using RawFileBody method
(even if this is obvious thing because content-type is 'application/json')
I already have been using 12 hour over and I have to finish my task in a timely manner.
I'm so sorry about that I can't share my application which is target of this issue.
if anyone have any idea or kindly want to more detail please ask me.
val headers_6 = Map(
"Content-Type" -> "application/json",
"Origin" -> "applicationServerURL")
.exec(http("request_6")
.post("requestUrl")
.headers(headers_6)
.body(RawFileBody("/step1/0006_request.json"))
.resources(http("request_7")
additionally, I checked json files which were created by gatling recorder on 'resource' directory, them contains only one char 'X'.
this means Gatling recorder doesn't capture request json file's content?
env:
Gatling version gatling 3.5.1
Used Browser: FireFox
I just solved this issue.
this issue was caused by very simple fact.
firstly, I encountered error message below
request_6: Failed to build request: Resource /step1/0006_request.json not found
and I have to check the file paths although this script was auto created by gatling recorder(do not trust gatling tool).
The auto created file path was absolute path and start with ‘/’. And the script started to work once I remove the ‘/’.
I referred other post on stack overflow, and it was written that file path must be written with absolute path if you are using gatling version 3 or later version.
I hope my reply can be informative to someone who may be having same issue as me.
Using 'relative path'.
//As is
.exec(http("request_6")
.post("requestUrl")
.headers(headers_6)
.body(RawFileBody("/step1/0006_request.json"))
.resources(http("request_7")
//To be
.exec(http("request_6")
.post("requestUrl")
.headers(headers_6)
.body(RawFileBody("step1/0006_request.json"))
.resources(http("request_7")

Why is the generation of my TYPO3 documentation failing without a proper error?

I have a new laptop and I try to render the Changelogs of TYPO3 locally based on the steps on https://docs.typo3.org/m/typo3/docs-how-to-document/master/en-us/RenderingDocs/Quickstart.html#render-documenation-with-docker. It continues until the end but show some non-zero exit codes at the end.
project : 0.0.0 : Makedir
makedir /ALL/Makedir
2021-02-16 10:32:50 654198, took: 173.34 seconds, toolchain: RenderDocumentation
REBUILD_NEEDED because of change, age 448186.6 of 168.0 hours, 18674.4 of 7.0 days
OK:
------------------------------------------------
FINAL STATUS is: FAILURE (exitcode 255)
because HTML builder failed
------------------------------------------------
exitcode: 0 39 ms
When I run the command in another documentation project, it renders just fine.
I found the issue with this. It seemed the docker container did not have enough memory allocated. I changed the available memory from 2 Gb to 4 Gb in Docker Desktop and this issue is solved with that.
You already solved the problem. But in case of similar errors: To get more information on a failure, you can also use this trick:
Create a directory tmp-GENERATED-temp before rendering. Usually, this is automatically created and then removedd after rendering. If you create it before rendering, you will find logfiles with more details in this directory.
See the Troubleshooting page.
I had some errors where I found the output in the console insufficient and this helped me to narrow down the problem.
In case of other problems, I would file an issue in the GitHub repo: https://github.com/t3docs/docker-render-documentation
Note: This is specific to TYPO3 docs rendering and may change in the future.

How to wait until a job is done or a file is updated in airflow

I am trying to use apache-airflow, with google cloud-composer, to shedule batch processing that result in the training of a model with google ai platform. I failed to use airflow operators as I explain in this question unable to specify master_type in MLEngineTrainingOperator
Using the command line I managed to launch a job successfully.
So now my issue is to integrate this command in airflow.
Using BashOperator I can train the model but I need to wait for the job to be completed before creating a version and setting it as the default. This DAG create a version before the job is done
bash_command_train = "gcloud ai-platform jobs submit training training_job_name " \
"--packages=gs://path/to/the/package.tar.gz " \
"--python-version=3.5 --region=europe-west1 --runtime-version=1.14" \
" --module-name=trainer.train --scale-tier=CUSTOM --master-machine-type=n1-highmem-16"
bash_train_operator = BashOperator(task_id='train_with_bash_command',
bash_command=bash_command_train,
dag=dag,)
...
create_version_op = MLEngineVersionOperator(
task_id='create_version',
project_id=PROJECT,
model_name=MODEL_NAME,
version={
'name': version_name,
'deploymentUri': export_uri,
'runtimeVersion': RUNTIME_VERSION,
'pythonVersion': '3.5',
'framework': 'SCIKIT_LEARN',
},
operation='create')
set_version_default_op = MLEngineVersionOperator(
task_id='set_version_as_default',
project_id=PROJECT,
model_name=MODEL_NAME,
version={'name': version_name},
operation='set_default')
# Ordering the tasks
bash_train_operator >> create_version_op >> set_version_default_op
The training result in updating of a file in Gcloud storage. So I am looking for an operator or a sensor that will wait until this file is updated, I noticed GoogleCloudStorageObjectUpdatedSensor, but I dont know how to make it retry until this file is updated.
An other solution would be to check for the job to be completed, but I can't find how too.
Any help would be greatly appreciated.
The Google Cloud documentation for the --stream-logs flag:
"Block until job completion and stream the logs while the job runs."
Add this flag to bash_command_train and I think it should solve your problem. The command should only release once the job is finished, then Airflow will mark it as success. It will also let you monitor your training job's logs in Airflow.

Problems with Chronicle Map on Windows

I am trying to use ChronicleMap for my index structure, this seems to work fine on Linux but when I am running my JUnit test on Windows (which is my development environment), I keep getting an error: java.io.IOException: Unable to wait until the file is ready, likely the process which created the file crashed or hung for more than 1 minute.
Here's the code snippet that is problematic:
File file = new File(idxFullPath);
ChronicleMap<Integer, int[]> idx =
ChronicleMapBuilder.of(Integer.class, int[].class)
.averageValue(getSampleIdxList())
.entries(IDX_MAX_SIZE)
.createPersistedTo(file);
The following exception is thrown:
[2016-06-17 14:32:47.779] ERROR main com.mcm.op.persistence.Persistence ERR java.io.IOException: Unable to wait until the file is ready, likely the process which created the file crashed or hung for more than 1 minute
at net.openhft.chronicle.map.ChronicleMapBuilder.waitUntilReady(ChronicleMapBuilder.java:1520)
at net.openhft.chronicle.map.ChronicleMapBuilder.openWithExistingFile(ChronicleMapBuilder.java:1583)
at net.openhft.chronicle.map.ChronicleMapBuilder.createWithFile(ChronicleMapBuilder.java:1444)
at net.openhft.chronicle.map.ChronicleMapBuilder.createPersistedTo(ChronicleMapBuilder.java:1405)
at com.mcm.op.persistence.Persistence.initIdx(Persistence.java:131)
at com.mcm.op.persistence.Persistence.init(Persistence.java:177)
at com.mcm.op.persistence.PersistenceTest.initPersist(PersistenceTest.java:47)
at com.mcm.op.persistence.PersistenceTest.setUp(PersistenceTest.java:29)
Indeed, it is likely that the process which created the file has crashed, or stopped terminated debugging, or something like that.
If it's ok to have a fresh index from unit test-to-test runs, I recommend to try either delete the file at idxFullPath before creating a Chronicle Map, or randomize the mapping file via something like File.createTempFile(). In either case File.deleteOnExit() could appear to be helpful.
If you want to keep the index between unit test runs and always use the same file at idxFullPath for persistence, you could try to use builder.createOrRecoverPersistedTo() instead of plain createPersistedTo() map creation method. However this might slow down the map creation.

Buildbot slaves priority

Problem
I have set up a latent slave in buildbot to help avoid congestion.
I've set up my builds to run either in permanent slave or latent one. The idea is the latent slave is waken up only when needed but the result is that buildbot randomly selectes one slave or the other so sometimes I have to wait for the latent slave to wake even if the permanent one is idle.
Is there a way to prioritize buildbot slaves?
Attempted solutions
1. Custom nextSlave
Following #david-dean suggestion, I've created a nextSlave function as follows (updated to working version):
from twisted.python import log
import traceback
def slave_selector(builder, builders):
try:
host = None
support = None
for builder in builders:
if builder.slave.slavename == 'host-slave':
host = builder
elif builder.slave.slavename == 'support-slave':
support = builder
if host and support and len(support.slave.slave_status.runningBuilds) < len(host.slave.slave_status.runningBuilds):
log.msg('host-slave has many running builds, launching build in support-slave')
return support
if not support:
log.msg('no support slave found, launching build in host-slave')
elif not host:
log.msg('no host slave found, launching build in support-slave')
return support
else:
log.msg('launching build in host-slave')
return host
except Exception as e:
log.err(str(e))
log.err(traceback.format_exc())
log.msg('Selecting random slave')
return random.choice(buildslaves)
And then passed it to BuilderConfig.
The result is that I get this in twistd.log:
2014-04-28 11:01:45+0200 [-] added buildset 4329 to database
But the build never starts, in the web UI it always appear as Pending and none of the logs I've put appear in twistd.log
2. Trying to mimic default behavior
I've having a look to buildbot code, to see how it is done by default.
in file ./master/buildbot/process/buildrequestdistributor.py, class BasicBuildChooser you have:
self.nextSlave = self.bldr.config.nextSlave
if not self.nextSlave:
self.nextSlave = lambda _,slaves: random.choice(slaves) if slaves else None
So I've set exactly that lambda function in my BuilderConfig and I'm getting exactly the same build not starting result.
You can set up a nextSlave function to assign slaves to a builder in a custom manner see: http://docs.buildbot.net/current/manual/cfg-builders.html#builder-configuration