Torque PBS_Server - hpc

Does anyone know how to get the process id (PID) of a submitted job on a compute node?
I need the PID to run some scripts for analysing the resources used by a job.
I am using Torque version 4.1.2.
Thanks!

I don't think the Torque server exposes this type of information directly.
However, you can write small scripts (prologue and epilogue) that report useful information about the job on the compute node(s).
See the Torque admin guide section on prologue and epilogue scripts.
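For example, a minimal epilogue sketch that logs the job's session id (the id of the job's top-level process on that node) and its resource usage. Argument positions follow the "Prologue and Epilogue Scripts" appendix of the Torque admin guide; the script location and log path are assumptions to adjust for your installation:

    #!/usr/bin/env python
    # Hypothetical epilogue sketch, e.g. installed as
    # /var/spool/torque/mom_priv/epilogue (root-owned, mode 500).
    # Argument positions per the Torque admin guide; verify for your version.
    import sys

    job_id   = sys.argv[1]   # e.g. 1234.pbs-server
    user     = sys.argv[2]   # job execution user
    job_name = sys.argv[4]
    session  = sys.argv[5]   # session id of the job's top-level process on this node
    used     = sys.argv[7]   # resources_used string (cput, mem, vmem, walltime)

    # Append one line per finished job; analysis scripts can read this later.
    with open("/var/spool/torque/job_resources.log", "a") as log:
        log.write("job=%s user=%s name=%s session=%s used=%s\n"
                  % (job_id, user, job_name, session, used))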

Related

AnyLogic: how to simulate a machine setup that runs every time an entering agent A has a parameter that differs from the last agent B served?

I have agents of type Agent B / Agent A / Agent A in a queue waiting for a service (delay). When it is time to change from Agent A to Agent B, a machine setup needs to occur. How do you usually handle this situation?
I tried, without success, creating variables and checking conditions in the 'On enter' action of the delay. I'm new to AnyLogic, so any help would be appreciated.
The Downtime block allows you to model changeovers for Service blocks. Read the help and explore the tutorials and example models that use it to learn how it works.

How to speed up NiFi streaming logs to Kafka

I'm new to NiFi and am trying to read files and push them to Kafka. After some basic reading, I'm able to do that with the following flow.
With this flow I'm able to achieve 0.5 million records/sec, each about 100 KB in size. I would like to reach 2 million records/sec. Throughput from the ListFile and FetchFile processors through the SplitText processors is great, but it levels off at PublishKafka.
So clearly the bottleneck is PublishKafka. How do I improve this performance? Should I tune something on the Kafka side or on the NiFi PublishKafka side?
Can someone help me with this? Thanks.
You can try using record-oriented processors, i.e. the PublishKafkaRecord_1.0 processor.
Your flow will then be:
1.ListFile
2.FetchFile
3.PublishKafkaRecord_1.0 //Configure with more than one concurrent task
With this flow we don't use SplitText processors at all; instead, define Record Reader/Writer controller services on the PublishKafkaRecord processor.
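Roughly, the settings implied above on PublishKafkaRecord_1.0 look like this. The snippet below is just a plain Python summary for readability: these values are entered in the NiFi UI, the property labels are from memory, and the broker/topic/reader names are placeholders, so verify them against your NiFi version:

    # Illustrative only: these properties are configured in the NiFi UI, not via code.
    publish_kafka_record_properties = {
        "Kafka Brokers": "broker1:9092,broker2:9092",   # placeholder broker list
        "Topic Name": "logs",                           # placeholder topic
        "Record Reader": "CSVReader",                   # controller service matching your data format
        "Record Writer": "CSVRecordSetWriter",          # or a JSON/Avro writer as needed
        "Delivery Guarantee": "Guarantee Single Node Delivery",
        "Compression Type": "snappy",
    }
    # Scheduling tab: set Concurrent Tasks to more than one, as noted in the flow above.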
In addition, you can also distribute the load by using Remote Process Groups.
Flow:
1.ListFile
2.RemoteProcessGroup
3.FetchFile
4.PublishKafkaRecord_1.0 //In scheduling tab keep more than one concurrent task
Refer to this link for more details regarding the design/configuration of the above flow.
Starting with NiFi 1.8 we don't need to use a RemoteProcessGroup to distribute the load, as connections (relationships) can be configured to load-balance the data.
Refer to this link and NiFi-5516 for more details regarding these new additions in NiFi 1.8.

Checking, after the fact, the state of nodes during a past job execution

Is there a way to check in Azure Batch whether a node went to the unusable state while running a specific job on a specific pool? The context: while a job was running on a pool, some nodes went to the unusable state during the job execution, but we would have had no indication that this happened if we had not been watching the pool's heatmap at the time. So, how can I check whether nodes went to the unusable state during some job run?
Also, I see that metrics about the state of nodes are collected in the Azure portal, but I am not sure why these metrics are always zero for me, even though I am running jobs and tasks that fail.
I had a quick look for you (I hope this helps :)).
For node state monitoring you can do something like what is mentioned here:
https://learn.microsoft.com/en-us/azure/batch/batch-efficient-list-queries
PoolOperations: https://learn.microsoft.com/en-us/dotnet/api/microsoft.azure.batch.pooloperations?view=azurebatch-7.0.1
ListComputeNodes: enumerates the ComputeNodes of the specified pool.
I think at the detail level, if you filter with the correct clause, you will get the ComputeNode information; then you can loop through it and check each node's state.
https://learn.microsoft.com/en-us/dotnet/api/microsoft.azure.batch.common.computenodestate?view=azurebatch-7.0.1
Possible sample implementation (please note this specific code is probably for pool health): https://github.com/Azure/azure-batch-samples/blob/master/CSharp/Common/GettingStartedCommon.cs#L31
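For what it's worth, a rough Python equivalent of that list-and-filter idea using the azure-batch Python SDK (the account/pool names and key are placeholders, and parameter names may differ slightly between SDK versions, so treat this as a sketch rather than the exact .NET code the links above describe):

    # Sketch: list the nodes of a pool that are currently in the 'unusable' state.
    from azure.batch import BatchServiceClient
    from azure.batch.batch_auth import SharedKeyCredentials
    import azure.batch.models as batchmodels

    credentials = SharedKeyCredentials("mybatchaccount", "<account-key>")   # placeholders
    client = BatchServiceClient(credentials,
                                "https://mybatchaccount.westeurope.batch.azure.com")

    # OData filter/select keep the query efficient, as in the linked article.
    options = batchmodels.ComputeNodeListOptions(
        filter="state eq 'unusable'",
        select="id,state,stateTransitionTime")

    for node in client.compute_node.list("mypool", compute_node_list_options=options):
        print(node.id, node.state, node.state_transition_time)

Note that this only shows current node state; to catch nodes that went unusable during a past run you would probably need to poll like this while the job is running, or keep your own record of the results.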
With regard to the metrics, how are you retrieving them? I'm sure I'll be corrected if I've said anything doubtful or incorrect. Thanks!

How to simulate ramp-up with Locust?

As far as I know, the ramp-up function has been removed from Locust.
I'm just wondering whether the hatching process is the same as, or similar to, ramp-up. Or is there any way to simulate it?
I'm not sure which ramp-up function you are talking about. There are plenty of options for controlling ramp-up in Locust, including the new step-load feature:
--step-load           Enable Step Load mode to monitor how performance
                      metrics varies when user load increases. Requires
                      --step-clients and --step-time to be specified.
--step-clients STEP_CLIENTS
                      Client count to increase by step in Step Load mode.
                      Only used together with --step-load
--step-time STEP_TIME
                      Step duration in Step Load mode, e.g. (300s, 20m, 3h,
                      1h30m, etc.). Only used together with --step-load
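If you prefer to control the ramp-up in code instead of with the command-line flags above, newer Locust versions (1.x and later) also support a custom LoadTestShape class. A minimal sketch, with a placeholder host and arbitrarily chosen step values:

    from locust import HttpUser, LoadTestShape, constant, task


    class WebsiteUser(HttpUser):
        host = "http://localhost:8080"   # placeholder target
        wait_time = constant(1)

        @task
        def index(self):
            self.client.get("/")


    class StepRampShape(LoadTestShape):
        """Add 10 users every 30 seconds, capped at 100, and stop after 10 minutes."""
        step_users = 10
        step_time = 30
        max_users = 100
        time_limit = 600

        def tick(self):
            run_time = self.get_run_time()
            if run_time > self.time_limit:
                return None                                   # returning None ends the test
            users = min(self.max_users,
                        (int(run_time // self.step_time) + 1) * self.step_users)
            return (users, self.step_users)                   # (target user count, spawn rate)

When a LoadTestShape subclass is present in the locustfile, Locust should pick it up automatically when you run locust -f against that file.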
I just found a workaround using Taurus to schedule the load.

Is there any method to get mutual exclusion on a Chef node?

For example, if a process updates a node while a chef-client run is in progress, the chef-client will overwrite the node data:
chef-client gets the node data (state 1)
Process A gets the node data (state 1)
Process A updates the node data locally (state 2)
Process A saves the node data (state 2)
chef-client updates the node data locally (state 2*)
chef-client saves the node data, which does not contain the changes from process A (state 2); the chef-client overwrites the node data (state 2*)
The same problem occurs if two processes save node data at the same moment.
EDIT
We need external modification because we have a nice UI on top of the Chef server to manage a lot of computers remotely, shown as a tree (similar to LDAP). An administrator can update recipe values from there. The project is open source: https://github.com/gecos-team/
Although we had a semaphore system, we have detected that with two or more simultaneous requests we can have a concurrency problem:
The regular case is that the system works
But sometimes the system does not work
EDIT 2
I have added a document with a lot of information about our problem.
Throwing in what I would do in this case as an answer:
1. Have a distributed lock mechanism like this one (I'm not using it myself; it's just to give the idea).
2. Build a start/report/error handler which will:
at start, acquire a lock on the node name in the DLM from point 1
if it can't, abort the run or wait until the lock is free
at the end (report or error), release the lock
3. Modify the external system to do the same as the handler above: acquire a lock before modifying and release it when done.
Pay attention to the lock lifetime! It should be longer than your Chef run plus a margin, and the UI should ensure its lock is still held before writing, and abort if not.
A way to get rid of the handler (but you still need a lock for the UI) is to take advantage of the reporting API (a premium feature of Chef 12, free under 25 nodes, license needed above that).
This turns a bit convoluted and needs the node to do reporting (so the chef-server URL should end with organizations/ and the client version should be above 11.16, or use the backport).
Then you can query the runs for a node, check whether one is in started status for that node, and wait until it has ended.
Chef doesn't implement a transaction feature, and by default it does not automatically re-converge nodes on updates. It is open to race conditions, which you can try to reduce by updating node attributes from within a chef-client run (right before you do something critical), but you will never end up with a reliable, working setup.
The longer the converge runs, the larger the window and the risk of corruption.
Chef's node attributes are only useful for debugging or for modification by the chef-client running on the node itself, and they are pretty much useless in highly concurrent/dynamic environments.
I would use Consul.io to coordinate semaphores and key/value configuration data in real time. Access it from Chef recipes or LWRPs through one of the various interfaces Consul provides (HTTP, DNS, …).
You can implement a very easy push-job task to run chef-client (IMHO easier and more powerful than Chef's "push jobs" feature, though not integrated with Chef's ACL/user management) which is also guarded by a distributed semaphore or uses the "Leader Election" feature. Of course you'll have to add this logic to your node update script, too.
chef-client will then acquire a lock on start and block you from manipulating data while it converges, and vice versa.
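A minimal sketch of that semaphore idea against Consul's HTTP session/KV endpoints (the agent address, key name and TTL are assumptions; both the chef-client wrapper and the UI would call the same functions around their node updates):

    # Sketch: acquire/release a per-node lock in Consul before touching node data.
    import requests

    CONSUL = "http://127.0.0.1:8500"              # assumed local Consul agent
    LOCK_KEY = "locks/chef-node/workstation-42"   # hypothetical per-node key

    def acquire_node_lock(ttl="600s"):
        # The session TTL should outlive a full chef-client run plus a margin.
        session = requests.put(CONSUL + "/v1/session/create",
                               json={"Name": "node-edit", "TTL": ttl}).json()["ID"]
        # Consul returns true/false depending on whether the lock was acquired.
        acquired = requests.put(CONSUL + "/v1/kv/" + LOCK_KEY,
                                params={"acquire": session}).json()
        return session if acquired else None

    def release_node_lock(session):
        requests.put(CONSUL + "/v1/kv/" + LOCK_KEY, params={"release": session})
        requests.put(CONSUL + "/v1/session/destroy/" + session)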
I discovered this one in production and came to the conclusion that there is no safe way to edit node attributes directly. Leave it to the chef-client. :-)
The good news is that there are other, more reliable ways to set node attributes. Chef roles and environments can both be edited safely while a client is running and only take effect during the next Chef run. Additionally, node attribute precedence rules ensure that any settings you make there override those that might be made by a recipe.
I suggest avoiding Chef node data updates from your external app and moving that desired node configuration to a Chef data bag.
Nodes then read both the Chef node data and the configuration data bag but write only to the node data, while your external app reads both and writes only to the data bag.
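For instance, the write side of the external app could look roughly like this, assuming the third-party PyChef library (not part of Chef itself) and hypothetical server/bag/item names:

    # Sketch: the external app writes desired configuration into a data bag item
    # instead of saving node objects, so a concurrent chef-client run cannot overwrite it.
    from chef import ChefAPI, DataBagItem

    with ChefAPI("https://chef-server.example/organizations/default",   # assumed server URL
                 "admin.pem", "admin"):                                  # assumed client key/name
        item = DataBagItem("desired_config", "workstation-42")          # hypothetical bag and item
        item["package_list"] = ["firefox", "libreoffice"]
        item.save()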
If you want to avoid a dependency on another external service, perhaps you could use some kind of time slicing.
Roughly: nodes only start a chef-client run on odd minutes, and the API only updates Chef data on even minutes (distribute these even minutes if you have more than one queue).
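A toy sketch of that guard (minute parity only; a real setup would also need synchronized clocks and a safety margin around the minute boundary):

    # Guard used by the API before writing Chef data: only write on even minutes,
    # leaving odd minutes to chef-client runs.
    import time

    def api_may_write_now():
        return time.localtime().tm_min % 2 == 0

    def wait_for_write_slot(poll_seconds=5):
        while not api_may_write_now():
            time.sleep(poll_seconds)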