Mojolicious: long subroutine has no effect on my variable - perl

I am having trouble with a very long subroutine under hypnotoad.
I need to run a roughly one-minute subroutine (a hardware-related requirement).
First, I discovered this behavior with:
my $var = 0;
Mojo::IOLoop->recurring(60 => sub {
    $log->debug("starting loop: var: $var");
    if ($var == 0) {
        # ...
        # make some long (30 to 50 seconds) communication with external hardware
        sleep 50;    # fake reproduction of this long communication
        $var = 1;
    }
    $log->debug("ending loop: var: $var");
});
Log:
14:13:45 2018 [debug] starting loop: var: 0
14:14:26 2018 [debug] ending loop: var: 1 #duration: 41 seconds
14:14:26 2018 [debug] starting loop: var: 0
14:15:08 2018 [debug] ending loop: var: 1 #duration: 42 seconds
14:15:08 2018 [debug] starting loop: var: 0
14:15:49 2018 [debug] ending loop: var: 1 #duration: 42 seconds
14:15:50 2018 [debug] starting loop: var: 0
14:16:31 2018 [debug] ending loop: var: 1 #duration: 41 seconds
...
3 problems:
1) Where do these 42 seconds come from? (Yes, I know, 42 is the answer to the universe...)
2) Why does the recurring IOLoop timer lose its pace?
3) Why is my variable set to 1, and just one second later the if sees a variable equal to 0?
When the looped job needs 20 or 25 seconds, there is no problem.
When the looped job needs 60 seconds and runs under morbo, there is no problem.
When the looped job needs more than 40 seconds and runs under hypnotoad (1 worker), I get the behavior described here.
If I increase the idle time (e.g. a 120-second IOLoop interval for a 60-second job), the behavior is exactly the same.
It is not a problem with IOLoop itself; I can reproduce the same behavior outside the loop.
I suspect the worker is being killed over a missed heartbeat, but I have no log entries about that.
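If the heartbeat suspicion is right, the workaround I am considering (just a sketch, not yet tested against the real hardware) is to push the blocking exchange into Mojo::IOLoop->subprocess so the worker's event loop stays responsive; talk_to_hardware() is a placeholder for the 30-50 second communication, and anything changed in the child only reaches the parent through the returned values:
use Mojo::IOLoop;

my $busy = 0;
Mojo::IOLoop->recurring(60 => sub {
    return if $busy;    # previous exchange still running in the child
    $busy = 1;
    Mojo::IOLoop->subprocess(
        sub {
            # runs in a forked child and may block freely
            return talk_to_hardware();    # placeholder for the 30-50 s communication
        },
        sub {
            # runs back in the parent once the child is done
            my ($subprocess, $err, @results) = @_;
            $busy = 0;
            $log->debug($err ? "hardware exchange failed: $err"
                             : "hardware exchange done: @results");
        },
    );
});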

Related

WildFly Schedule overlapping

I use a scheduler in WildFly 9, with these EJB annotations:
import javax.ejb.Singleton;
import javax.ejb.Startup;
import javax.ejb.Schedule;
I get loads of these warnings:
2020-01-21 12:35:59,000 WARN [org.jboss.as.ejb3] (EJB default - 6) WFLYEJB0043: A previous execution of timer [id=3e4ec2d2-cea9-43c2-8e80-e4e66593dc31 timedObjectId=FiloJobScheduler.FiloJobScheduler.FiskaldatenScheduler auto-timer?:true persistent?:false timerService=org.jboss.as.ejb3.timerservice.TimerServiceImpl@71518cd4 initialExpiration=null intervalDuration(in milli sec)=0 nextExpiration=Tue Jan 21 12:35:59 GMT+02:00 2020 timerState=IN_TIMEOUT info=null] is still in progress, skipping this overlapping scheduled execution at: Tue Jan 21 12:35:59 GMT+02:00 2020.
But when I measure the elapsed times, they are always < 1 minute.
The Scheduling is:
@Schedule(second = "*", minute = "*/5", hour = "*", persistent = false)
Does anyone have an idea what is going on?
A little logging would help you. This runs every second because that's what you're telling it to do with the second = "*" attribute. If you want it to run only every 5 minutes of every hour, change the schedule to:
@Schedule(minute = "*/5", hour = "*", persistent = false)

TORQUE jobs hang on Ubuntu 16.04

I have TORQUE installed on Ubuntu 16.04, and I am having trouble because my jobs hang. I have a test script test.pbs:
#PBS -N test
#PBS -l nodes=1:ppn=1
#PBS -l walltime=0:01:00
cd $PBS_O_WORKDIR
touch done.txt
echo "done"
And I run it with
qsub test.pbs
The job writes done.txt and echoes "done" just fine, but the job hangs in the C state.
Job id Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
46.localhost test wlandau 00:00:00 C batch
Edit: some diagnostic info on another job from qstat -f 55
qstat -f 55
Job Id: 55.localhost
Job_Name = test
Job_Owner = wlandau@localhost
resources_used.cput = 00:00:00
resources_used.mem = 0kb
resources_used.vmem = 0kb
resources_used.walltime = 00:00:00
job_state = C
queue = batch
server = haggunenon
Checkpoint = u
ctime = Mon Oct 30 07:35:00 2017
Error_Path = localhost:/home/wlandau/Desktop/test.e55
exec_host = localhost/2
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Mon Oct 30 07:35:00 2017
Output_Path = localhost:/home/wlandau/Desktop/test.o55
Priority = 0
qtime = Mon Oct 30 07:35:00 2017
Rerunable = True
Resource_List.ncpus = 1
Resource_List.nodect = 1
Resource_List.nodes = 1:ppn=1
Resource_List.walltime = 00:01:00
session_id = 5115
Variable_List = PBS_O_QUEUE=batch,PBS_O_HOST=localhost,
PBS_O_HOME=/home/wlandau,PBS_O_LANG=en_US.UTF-8,PBS_O_LOGNAME=wlandau,
PBS_O_PATH=/home/wlandau/bin:/home/wlandau/.local/bin:/usr/local/sbin
:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/ga
mes:/snap/bin,PBS_O_SHELL=/bin/bash,PBS_SERVER=localhost,
PBS_O_WORKDIR=/home/wlandau/Desktop
comment = Job started on Mon Oct 30 at 07:35
etime = Mon Oct 30 07:35:00 2017
exit_status = 0
submit_args = test.pbs
start_time = Mon Oct 30 07:35:00 2017
Walltime.Remaining = 60
start_count = 1
fault_tolerant = False
comp_time = Mon Oct 30 07:35:00 2017
And a similar tracejob -n2 62:
/var/spool/torque/server_priv/accounting/20171029: No matching job records located
/var/spool/torque/server_logs/20171029: No matching job records located
/var/spool/torque/mom_logs/20171029: No matching job records located
/var/spool/torque/sched_logs/20171029: No matching job records located
Job: 62.localhost
10/30/2017 17:20:25 S enqueuing into batch, state 1 hop 1
10/30/2017 17:20:25 S Job Queued at request of wlandau@localhost, owner =
wlandau@localhost, job name = jobe945093c2e029c5de5619d6bf7922071,
queue = batch
10/30/2017 17:20:25 S Job Modified at request of Scheduler@Haggunenon
10/30/2017 17:20:25 S Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=0kb
resources_used.vmem=0kb resources_used.walltime=00:00:00
10/30/2017 17:20:25 L Job Run
10/30/2017 17:20:25 S Job Run at request of Scheduler@Haggunenon
10/30/2017 17:20:25 S Not sending email: User does not want mail of this type.
10/30/2017 17:20:25 S Not sending email: User does not want mail of this type.
10/30/2017 17:20:25 M job was terminated
10/30/2017 17:20:25 M obit sent to server
10/30/2017 17:20:25 A queue=batch
10/30/2017 17:20:25 M scan_for_terminated: job 62.localhost task 1 terminated, sid=17917
10/30/2017 17:20:25 A user=wlandau group=wlandau
jobname=jobe945093c2e029c5de5619d6bf7922071 queue=batch
ctime=1509398425 qtime=1509398425 etime=1509398425 start=1509398425
owner=wlandau@localhost exec_host=localhost/0 Resource_List.ncpus=1
Resource_List.neednodes=1 Resource_List.nodect=1
Resource_List.nodes=1 Resource_List.walltime=01:00:00
10/30/2017 17:20:25 A user=wlandau group=wlandau
jobname=jobe945093c2e029c5de5619d6bf7922071 queue=batch
ctime=1509398425 qtime=1509398425 etime=1509398425 start=1509398425
owner=wlandau@localhost exec_host=localhost/0 Resource_List.ncpus=1
Resource_List.neednodes=1 Resource_List.nodect=1
Resource_List.nodes=1 Resource_List.walltime=01:00:00 session=17917
end=1509398425 Exit_status=0 resources_used.cput=00:00:00
resources_used.mem=0kb resources_used.vmem=0kb
resources_used.walltime=00:00:00
EDIT: jobs now hanging in E
After some tinkering, I am now using these settings. I have moved on to this tiny pipeline workflow, where some TORQUE jobs wait for other TORQUE jobs to finish. Unfortunately, all the jobs hang in the E state, and any jobs beyond 4 just stay queued. To keep things from hanging indefinitely, I have to sudo qdel -p each one, which I think is causing real problems with the project's file system, as well as being an inconvenience.
Job id Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
113.localhost ...b73ec2cda6dca wlandau 00:00:00 E batch
114.localhost ...b6c8e6da05983 wlandau 00:00:00 E batch
115.localhost ...9123b8e20850b wlandau 00:00:00 E batch
116.localhost ...e6d49a3d7d822 wlandau 00:00:00 E batch
117.localhost ...8c3f6cb68927b wlandau 0 Q batch
118.localhost ...40b1d0cab6400 wlandau 0 Q batch
qmgr -c "list server" shows
Server haggunenon
server_state = Active
scheduling = True
max_running = 300
total_jobs = 5
state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:1 Exiting:3
acl_hosts = localhost
managers = root@localhost
operators = root@localhost
default_queue = batch
log_events = 511
mail_from = adm
query_other_jobs = True
resources_assigned.ncpus = 4
resources_assigned.nodect = 4
scheduler_iteration = 600
node_check_rate = 150
tcp_timeout = 6
mom_job_sync = True
pbs_version = 2.4.16
keep_completed = 0
submit_hosts = SERVER
allow_node_submit = True
next_job_number = 119
net_counter = 118 94 93
And qmgr -c "list queue batch"
Queue batch
queue_type = Execution
total_jobs = 5
state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:0 Exiting:4
max_running = 300
resources_max.ncpus = 4
resources_max.nodes = 2
resources_min.ncpus = 1
resources_default.ncpus = 1
resources_default.nodect = 1
resources_default.nodes = 1
resources_default.walltime = 01:00:00
mtime = Wed Nov 1 07:40:45 2017
resources_assigned.ncpus = 4
resources_assigned.nodect = 4
keep_completed = 0
enabled = True
started = True
The C state means the job has completed and its status is kept in the system. Usually the status is kept after job completion for a period of time specified by the keep_completed parameter. However, certain types of failure may result in the job being kept in this state to provide the information necessary to examine the cause of the failure.
Check the output of qstat -f 46 to see if there is anything indicating an error.
To tune the keep_completed parameter, you can execute the following command to check its current value on your system:
qmgr -c "print queue batch keep_completed"
If you have administrative privileges on the Torque server, you can also change this value with
qmgr -c "set queue batch keep_completed=120"
to keep jobs in the completed state for 2 minutes after completion.
In general, having keep_completed set is a useful feature. Advanced schedulers use the information on completed jobs to schedule around failures.

Marathon backoff - is it really exponential?

I'm trying to figure out Marathon's exponential backoff configuration. Here's the documentation:
The backoffSeconds and backoffFactor values are multiplied until they reach the maxLaunchDelaySeconds value. After they reach that value, Marathon waits maxLaunchDelaySeconds before repeating this cycle exponentially. For example, if backoffSeconds: 3, backoffFactor: 2, and maxLaunchDelaySeconds: 3600, there will be ten attempts to launch a failed task, each three seconds apart. After these ten attempts, Marathon will wait 3600 seconds before repeating this cycle.
The way I think of exponential backoff is that the wait periods should be:
3*2^0 = 3
3*2^1 = 6
3*2^2 = 12
3*2^3 = 24 and so on
so every time the app crashes, Marathon will wait a longer period of time before retrying. However, given the description above, Marathon's logic for waiting looks something like this:
int retryCount = 0;
while (backoffSeconds * (backoffFactor ^ retryCount) < maxLaunchDelaySeconds)
{
    wait(backoffSeconds);
    retryCount++;
}
wait(maxLaunchDelaySeconds);
This matches the explanation in the documentation, since 3*2^x < 3600 for values of x less than or equal to 10. However, I really don't see how it can be called an exponential backoff, since the wait time is constant.
Is there a way to make Marathon wait progressively longer with every restart of the app? Am I misunderstanding the doc? Any help would be appreciated!
As far as I understand the code in RateLimiter.scala, it is like the exponential sequence you described, but then capped at the maxLaunchDelay waiting period. Let's say maxLaunchDelay is one hour (3600 s):
3*2^0 = 3
3*2^1 = 6
3*2^2 = 12
3*2^3 = 24
3*2^4 = 48
3*2^5 = 96
3*2^6 = 192
3*2^7 = 384
3*2^8 = 768
3*2^9 = 1536
3*2^10 = 3072
3*2^11 = 3600 (6144)
3*2^12 = 3600 (12288)
3*2^13 = 3600 (24576)
Which gives us a typical 2^n curve.
You would get a steeper increase with other backoffFactor values, for example a backoff factor of 10 or 20.
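To double-check the numbers, here is a tiny sketch (in Perl, to match the rest of this page) that prints the same capped sequence; the plain min() cap is my reading of the behaviour, not a line-by-line port of RateLimiter.scala:
use strict;
use warnings;
use List::Util qw(min);

my ($backoff_seconds, $backoff_factor, $max_launch_delay) = (3, 2, 3600);
for my $retry (0 .. 13) {
    # exponential delay, capped at maxLaunchDelaySeconds
    my $delay = min($backoff_seconds * $backoff_factor ** $retry, $max_launch_delay);
    printf "retry %2d: wait %4d s\n", $retry, $delay;
}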
Additionally, I saw a rework of this topic; a code review is currently open here: https://phabricator.mesosphere.com/D1007
What do you think?
Thanks
Johannes

Does PowerShell's parallel foreach use at most 5 threads?

The throttlelimit parameter of foreach -parallel controls how many processes are used when executing the script, but I can't get more than 5 processes even if I set throttlelimit greater than 5.
The script is executed in multiple PowerShell processes, so I check the PID inside the script and then group the PIDs to find out how many processes are used to execute it.
function GetPID() {
    $PID
}
workflow TestWorkflow {
    param($throttlelimit)
    foreach -parallel -throttlelimit $throttlelimit ($i in 1..100) {
        GetPID
    }
}
foreach ($i in 1..8) {
    $pids = TestWorkflow -throttlelimit $i
    $measure = $pids | group | Measure-Object
    $measure.Count
}
The output is
1
2
3
4
5
5
5
5
For $i less than or equal to 5, I get $i processes, but for $i greater than 5, I only get 5 processes. Is there any way to increase the number of processes used when executing the script?
Edit: In response to @SomeShinyObject's answer, I added another test case. It is a modification of the example given by @SomeShinyObject. I added a function S, which does nothing but sleep for 10 seconds.
function S($n) {
    $s = Get-Date
    $s.ToString("HH:mm:ss.ffff") + " start sleep " + $n
    sleep 10
    $e = Get-Date
    $e.ToString("HH:mm:ss.ffff") + " end sleep " + $n + " diff " + (($e-$s).TotalMilliseconds)
}
Workflow Throttle-Me {
    [cmdletbinding()]
    param (
        [int]$ThrottleLimit = 10,
        [int]$CollectionLimit = 10
    )
    foreach -parallel -throttlelimit $ThrottleLimit ($n in 1..$CollectionLimit) {
        $s = Get-Date
        $s.ToString("HH:mm:ss.ffff") + " start " + $n
        S $n
        $e = Get-Date
        $e.ToString("HH:mm:ss.ffff") + " end " + $n + " diff " + (($e-$s).TotalMilliseconds)
    }
}
Throttle-Me -ThrottleLimit 10 -CollectionLimit 20
And here is the output. I grouped it by time (second) and reordered a little within each group to make it clearer. It is pretty obvious that the calls to function S run 5 at a time, although I set throttlelimit to 10 (first we get start sleep 6..10; 10 seconds later, start sleep 1..5; 10 seconds later, start sleep 11..15; and 10 seconds later, start sleep 16..20).
03:40:29.7147 start 10
03:40:29.7304 start 9
03:40:29.7304 start 8
03:40:29.7459 start 7
03:40:29.7459 start 6
03:40:29.7616 start 5
03:40:29.7772 start 4
03:40:29.7772 start 3
03:40:29.7928 start 2
03:40:29.7928 start 1
03:40:35.3067 start sleep 7
03:40:35.3067 start sleep 8
03:40:35.3067 start sleep 9
03:40:35.3692 start sleep 10
03:40:35.4629 start sleep 6
03:40:45.3292 end sleep 7 diff 10022.5353
03:40:45.3292 end sleep 8 diff 10022.5353
03:40:45.3292 end sleep 9 diff 10022.5353
03:40:45.3761 end sleep 10 diff 10006.8765
03:40:45.4855 end sleep 6 diff 10022.5243
03:40:45.3605 end 9 diff 15630.1005
03:40:45.3917 end 7 diff 15645.7465
03:40:45.3917 end 8 diff 15661.3313
03:40:45.4229 end 10 diff 15708.2274
03:40:45.5323 end 6 diff 15786.3969
03:40:45.4386 start sleep 5
03:40:45.4542 start sleep 4
03:40:45.4542 start sleep 3
03:40:45.4698 start sleep 2
03:40:45.5636 start sleep 1
03:40:45.4698 start 11
03:40:45.4855 start 12
03:40:45.5011 start 13
03:40:45.5167 start 14
03:40:45.6105 start 15
03:40:55.4596 end sleep 3 diff 10005.4374
03:40:55.4596 end sleep 4 diff 10005.4374
03:40:55.4596 end sleep 5 diff 10021.0426
03:40:55.4752 end sleep 2 diff 10005.3992
03:40:55.5690 end sleep 1 diff 10005.3784
03:40:55.4752 end 3 diff 25698.0221
03:40:55.4909 end 5 diff 25729.302
03:40:55.5065 end 4 diff 25729.2523
03:40:55.5221 end 2 diff 25729.2559
03:40:55.6159 end 1 diff 25823.032
03:40:55.5534 start sleep 11
03:40:55.5534 start sleep 12
03:40:55.5690 start sleep 13
03:40:55.5846 start sleep 14
03:40:55.6784 start sleep 15
03:40:55.6002 start 16
03:40:55.6002 start 17
03:40:55.6159 start 18
03:40:55.6326 start 19
03:40:55.7096 start 20
03:41:05.5692 end sleep 11 diff 10015.8226
03:41:05.5692 end sleep 12 diff 10015.8226
03:41:05.5848 end sleep 13 diff 10015.8108
03:41:05.6004 end sleep 14 diff 10015.8128
03:41:05.6942 end sleep 15 diff 10015.8205
03:41:05.5848 end 12 diff 20099.3218
03:41:05.6004 end 11 diff 20130.5719
03:41:05.6161 end 13 diff 20114.9729
03:41:05.6473 end 14 diff 20130.5962
03:41:05.6942 end 15 diff 20083.7506
03:41:05.6317 start sleep 17
03:41:05.6317 start sleep 16
03:41:05.6473 start sleep 18
03:41:05.6629 start sleep 19
03:41:05.7411 start sleep 20
03:41:15.6320 end sleep 16 diff 10000.3608
03:41:15.6320 end sleep 17 diff 10000.3608
03:41:15.6477 end sleep 18 diff 10000.3727
03:41:15.6633 end sleep 19 diff 10000.3709
03:41:15.7414 end sleep 20 diff 10000.3546
03:41:15.6477 end 16 diff 20047.4375
03:41:15.6477 end 17 diff 20047.4375
03:41:15.6633 end 18 diff 20047.4295
03:41:15.7101 end 19 diff 20077.5169
03:41:15.7414 end 20 diff 20031.7909
I believe your test is a little flawed. According to this TechNet article, $PID is not an available variable in a Script Workflow.
A better test, with an explanation, can be found on this site.
Basically, the ThrottleLimit is set to a theoretical [Int32]::MaxValue. The poster designed a better test to also check parallel processing (slightly modified here):
Workflow Throttle-Me {
    [cmdletbinding()]
    param (
        [int]$ThrottleLimit = 10,
        [int]$CollectionLimit = 10
    )
    foreach -parallel -throttlelimit $ThrottleLimit ($n in 1..$CollectionLimit) {
        "Working on $n"
        "{0:hh}:{0:mm}:{0:ss}" -f (Get-Date)
    }
}
I can't speak for that specific cmdlet, but as you found, PowerShell only uses one thread per PID. The common ways of getting around this are PSJobs and runspaces. Runspaces are more difficult than PSJobs, but they also have significantly better performance, which makes them the popular choice. The downside is that your entire code has to revolve around them, which means an entire rewrite in most cases.
Unfortunately, as much as Microsoft pushes PowerShell, it is not made for performance and speed, but it does as well as it can while tied to .NET.

Perl: Devel::Gladiator module and memory management

I have a Perl script that needs to run constantly in the background. It consists of several .pm module files and a main .pl file. The program periodically gathers some data, does some computation, and finally updates the result recorded in a file.
All the critical data structures are declared in the .pl file with our, and there's no package variable declared in any .pm file.
I used the function arena_table() in the Devel::Gladiator module to produce some information about the arena in the main loop, and found that the SVs of type SCALAR and GLOB are increasing slowly, resulting in a gradual increase in the memory usage.
The output of arena_table() (I have reformatted it, omitting the titles; after a long enough period, only the first two numbers keep increasing):
2013-05-17#11:24:34 36235 3924 3661 3642 3376 2401 720 201 27 23 18 13 13 10 2 2 1 1 1 1 1 1 1 1 1 1
After running for some time:
2013-05-17#12:05:10 50702 46169 36910 4151 3995 3924 2401 720 274 201 26 23 18 13 13 10 2 2 1 1 1 1 1 1 1 1 1
The main loop is something like:
our %hash1 = ();
our %hash2 = ();
# and some more package variables ...
# all are hashes

do {
    my $nowtime = time();
    collect_data($nowtime);
    if (calculate() == 1) {
        update();
    }
    sleep 1;
    get_mem_objects();    # calls arena_table()
} while (1);
Except for get_mem_objects, the other functions operate on the global hashes declared with our. In update, the program does some log rotation; the code looks like this:
sub rotate_history() {
    my $i = $main::HISTORY{'count'};
    if ($i == $main::CONFIG{'times'}{'total'}) {
        # history is full: shift entries up by one, overwriting the oldest
        for ($i--; $i >= 1; $i--) {
            $main::HISTORY{'data'}{$i} = dclone($main::HISTORY{'data'}{$i - 1});
        }
    } else {
        # history not full yet: shift all existing entries up by one
        for (; $i >= 1; $i--) {
            $main::HISTORY{'data'}{$i} = dclone($main::HISTORY{'data'}{$i - 1});
        }
    }
    # slot 0 always gets a deep copy of the freshly collected data
    $main::HISTORY{'data'}{'0'} = dclone(\%main::COLLECT);
    if ($main::HISTORY{'count'} < $main::CONFIG{'times'}{'total'}) {
        $main::HISTORY{'count'}++;
    }
}
If I comment out the calls to this function, then in the final report given by Devel::Gladiator only the SVs of type SCALAR keep increasing, and the number of GLOBs eventually reaches a stable state. I suspect the dclone calls may be causing the problem here.
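If dclone really is the culprit, one alternative I am considering (only a sketch, and it assumes the older history entries are never modified in place afterwards) is to rotate by moving references and to clone only the newest snapshot:
use List::Util qw(min);
use Storable qw(dclone);

sub rotate_history_by_ref {
    my $limit = $main::CONFIG{'times'}{'total'};

    # shift the existing entries up by one slot, dropping the oldest when full
    my $start = min($main::HISTORY{'count'}, $limit - 1);
    for (my $i = $start; $i >= 1; $i--) {
        $main::HISTORY{'data'}{$i} = $main::HISTORY{'data'}{ $i - 1 };
    }

    # only the freshest snapshot gets its own deep copy
    $main::HISTORY{'data'}{0} = dclone(\%main::COLLECT);

    $main::HISTORY{'count'}++
        if $main::HISTORY{'count'} < $limit;
}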
My questions are:
What exactly does the information given by that module mean? The statements in the perldoc are a little vague for a Perl newbie like me.
What are the common techniques to lower the memory usage of long-running Perl scripts?
I know that package variables are stored in the arena, but what about lexical variables? How is the memory they consume managed?
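In the meantime, to make the arena numbers easier to interpret, this is the kind of helper I am experimenting with (a sketch only): it diffs Devel::Gladiator::arena_ref_counts() between iterations so that only the SV types that actually grew are reported, and the suspicious ones can then be inspected with walk_arena().
use Devel::Gladiator qw(arena_ref_counts);

my %previous_counts;

sub report_arena_growth {
    my $counts = arena_ref_counts();    # hashref: SV type => current count
    for my $type (sort keys %$counts) {
        my $delta = $counts->{$type} - ($previous_counts{$type} // 0);
        printf "%-30s %8d (%+d)\n", $type, $counts->{$type}, $delta if $delta;
    }
    %previous_counts = %$counts;
}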