Getting END_MISS notification while my job is still in WAITING state (Oozie coordinator)

I have set the SLA nominal time to ${coord:nominalTime()}, so SLA calculations use the coordinator's nominal (trigger) time as their start time.
The problem is that many workflows run on my cluster, and a workflow goes into the WAITING state when the cluster has no free memory.
For example, the coordinator start time for a particular workflow is 2:00 pm, but because of the memory shortage it stays in WAITING and the workflow is only triggered at, say, 5:00 pm. With an SLA duration of 2 hours, the deadline has already passed, so I get an END_MISS notification while my job is still waiting.
Is there any way to use the actual workflow start time as the SLA nominal time instead of the coordinator nominal time, so that the SLA clock starts when my process actually begins running and not while it is waiting?

Related

Automatic heart-beat of Cadence workflow/activity

We have registered the activities with the auto-heartbeat configuration EnableAutoHeartBeat: true and also set the activity option HeartbeatTimeout: 15Min in the activity implementation.
1. Do we still need to send heartbeats explicitly using activity.heartbeat(), or does the go-client library take care of it automatically?
2. If it's automatic, what happens if the activity is waiting for an external API response that takes longer than 15 minutes?
3. What happens to the activity heartbeat if the worker executing the activity crashes or is killed?
4. Will Cadence retry the activities on heartbeat failures?
1. No, you don't need to send heartbeats explicitly; the SDK takes care of it with this config.
2. The auto heartbeat sends a heartbeat at every interval, where the interval is 80% of the heartbeat timeout (15 minutes in your case, so every 12 minutes). The activity therefore won't time out as long as the activity worker is still alive. That said, you should use a smaller heartbeat timeout; 10 to 20 seconds is ideal.
3. The activity will fail with a "heartbeat timeout" error.
4. Yes, if you have set a retry policy. See my other answer for retry policies: "How to set proper timeout values for Cadence activities (local and regular activities, with or without retry)?"
Example
Let's say your activity implementation waits on an AWS SDK API call for up to 2 hours (the maximum API timeout configured).
You should still use 10 to 20 seconds for the heartbeat timeout, and use 2 hours for the activity start-to-close timeout.
The heartbeat timeout exists to detect that the host is no longer alive, so that the activity can be restarted as early as possible.
Imagine this case:
Because the API call takes 2 hours, the activity worker gets restarted at some point during those 2 hours.
If the heartbeat timeout is 15 minutes, Cadence will retry the activity only after 15 minutes.
If the heartbeat timeout is 10 seconds, Cadence can retry it after 10 seconds, because the heartbeat timeout fires within 10 seconds.
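The arithmetic behind this advice can be sketched in a few lines of Python. This is only an illustration of the numbers in the answer, not Cadence client code: the function names are hypothetical, and the only inputs taken from the answer are the 80% rule and the two timeout values.

```python
from datetime import timedelta


def auto_heartbeat_interval(heartbeat_timeout: timedelta) -> timedelta:
    """Interval between automatic heartbeats: 80% of the heartbeat timeout."""
    return heartbeat_timeout * 80 / 100


def worst_case_detection_delay(heartbeat_timeout: timedelta) -> timedelta:
    """Upper bound on how long a dead worker can go unnoticed (hypothetical helper).

    A heartbeat timeout is only declared once a full timeout period
    has passed without a heartbeat.
    """
    return heartbeat_timeout


# Compare the 15-minute setting from the question with the suggested 10 seconds.
for timeout in (timedelta(minutes=15), timedelta(seconds=10)):
    print(f"timeout={timeout}  heartbeat every {auto_heartbeat_interval(timeout)}  "
          f"retry possible within {worst_case_detection_delay(timeout)}")
```

The shorter timeout costs more heartbeat traffic but shrinks the window during which a crashed worker holds up the 2-hour activity.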

AnyLogic: Measuring process time without considering waiting time during evening

I have created a discrete simulation model for our production processes in which the capacity, output, etc. should be simulated for the coming year. The model works, but I have a problem with measuring the process time. Our production only works from 7 a.m. to 3 p.m. Is there a way to set the TimeMeasureStart and TimeMeasureEnd block so that the time is only measured during the shift?
As a simplified example with a TimeMeasureStart, a service and a TimeMeasureEnd block:
The agent passes TimeMeasureStart at 2:30 p.m. and immediately enters the service block. The service time is 2 hours. The worker starts the service and goes home at 3:00 p.m. The agent waits in the service block from 3:00 p.m. to 7:00 a.m. At 7 a.m. the worker continues the service (until 8:30 a.m.). As soon as it is finished, the agent passes the TimeMeasureEnd block. The result is currently a process time of 18 hours. However, I only want to measure the time that is worked, so that I get 2 hours as the process time.
Is there a possibility to set / program the TimeMeasureStart / TimeMeasureEnd blocks accordingly so that the waiting time is not included?
My first suggestion would be to check whether you really need calendar time. Why not run the model in hours, where every hour is a working hour? Then you don't need a shift schedule at all.
But reporting, or having different shift patterns within one model, often requires calendar time as the basis.
Here is a simple solution: record the time a resource was seized in your own variables.
Add two double variables to your agent, one for the last service start and one for the cumulative service time:
previousServiceStart and cummServiceTime
Then update them in the resource pool using the On seize and On release code.
I cast the agent to my custom agent type using (MyAgent)agent so that I can access the variables.
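The bookkeeping described above can be sketched as follows. This is a plain Python model of the idea, not AnyLogic code: in AnyLogic the two callbacks would be Java snippets in the resource pool's On seize and On release fields. The `on_seize` and `on_release` functions and the hour-based clock are assumptions for illustration, and the sketch assumes the resource is released at the end of the shift and seized again the next morning.

```python
class MyAgent:
    """Stand-in for the custom agent type holding the two double variables."""

    def __init__(self) -> None:
        self.previousServiceStart = 0.0  # model time of the most recent seize
        self.cummServiceTime = 0.0       # total time actually worked so far


def on_seize(agent: MyAgent, now: float) -> None:
    # Remember when the worker started (or resumed) working on this agent.
    agent.previousServiceStart = now


def on_release(agent: MyAgent, now: float) -> None:
    # Add only the span during which the resource was actually held.
    agent.cummServiceTime += now - agent.previousServiceStart


# The example from the question, with model time in hours since midnight of day 1.
agent = MyAgent()
on_seize(agent, 14.5)    # 2:30 p.m., service starts
on_release(agent, 15.0)  # 3:00 p.m., worker goes home
on_seize(agent, 31.0)    # 7:00 a.m. next day, service resumes
on_release(agent, 32.5)  # 8:30 a.m., service finished
print(agent.cummServiceTime)
```

The overnight wait between the release at 15.0 and the seize at 31.0 never enters the sum, which is the point of tracking the time per seize/release pair instead of between the two TimeMeasure blocks.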

What does Kubernetes cronjobs `startingDeadlineSeconds` exactly mean?

In the Kubernetes CronJob documentation, the limitations section states that
Jobs may fail to run if the CronJob controller is not running or broken for a span of time from before the start time of the CronJob to start time plus startingDeadlineSeconds, or if the span covers multiple start times and concurrencyPolicy does not allow concurrency.
My understanding is that if startingDeadlineSeconds is set to 10 and the CronJob couldn't start at its scheduled time for some reason, it can still be attempted as long as those 10 seconds haven't passed; after the 10 seconds, however, it definitely won't be started. Is this correct?
Also, if I have concurrencyPolicy set to Forbid, does Kubernetes count it as a failure when a CronJob tries to be scheduled while one is already running?
After investigating the code base of the Kubernetes repo, this is how the CronJob controller works:
1. The CronJob controller checks the list of CronJobs in the given Kubernetes client every 10 seconds.
2. For every CronJob, it checks how many schedules it missed between the lastScheduleTime and now. If there are more than 100 missed schedules, it doesn't start the job and records the event:
"FailedNeedsStart", "Cannot determine if job needs to be started. Too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds or check clock skew."
It is important to note that if the field startingDeadlineSeconds is set (not nil), the controller counts the missed schedules only within the last startingDeadlineSeconds seconds. For example, if startingDeadlineSeconds = 200, it counts how many schedules were missed in the last 200 seconds. The exact counting logic lives in the CronJob controller source code.
3. If there are no more than 100 missed schedules from the previous step, the CronJob controller checks whether the current time is after scheduledTime + startingDeadlineSeconds, i.e. whether it is already too late to start the job. If it is not too late, the controller keeps attempting to start the job. If it is already too late, it doesn't start the job and records the event:
"Missed starting window for {cronjob name}. Missed scheduled time to start a job {scheduledTime}"
It is also important to note that if the field startingDeadlineSeconds is not set, there is no deadline at all. The CronJob controller will then attempt to start the job without checking whether it is too late.
Therefore to answer the questions above:
1. If the startingDeadlineSeconds is set to 10 and the cronjob couldn't start for some reason at its scheduled time, then it can still be attempted to start again as long as those 10 seconds haven't passed, however, after the 10 seconds, it for sure won't be started, is this correct?
The CronJob controller will attempt to start the job, and it will be scheduled successfully if the 10 seconds after its scheduled time haven't passed yet. However, if the deadline has passed, the job won't be started for this run, and it will be counted as a missed schedule in later executions.
2. If I have concurrencyPolicy set to Forbid, does K8s count it as a fail if a cronjob tries to be scheduled, when there is one already running?
Yes, it will be counted as a missed schedule, since missed schedules are calculated as I described above in point 2.
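The two checks described in this answer can be modelled roughly as below. This is a simplified Python sketch, not the controller's actual Go code: it assumes a fixed-interval schedule instead of a cron expression, and `count_missed`, `is_too_late` and `MAX_MISSED` are hypothetical names chosen for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional

MAX_MISSED = 100  # limit quoted in the FailedNeedsStart event message


def count_missed(last_schedule: datetime, now: datetime, interval: timedelta,
                 starting_deadline_seconds: Optional[int] = None) -> int:
    """Count schedule times in (window_start, now] for a fixed-interval schedule."""
    window_start = last_schedule
    if starting_deadline_seconds is not None:
        # With a deadline set, only the last N seconds are inspected.
        window_start = max(window_start,
                           now - timedelta(seconds=starting_deadline_seconds))
    missed = 0
    t = last_schedule + interval
    while t <= now:
        if t > window_start:
            missed += 1
        t += interval
    return missed


def is_too_late(scheduled: datetime, now: datetime,
                starting_deadline_seconds: Optional[int]) -> bool:
    """True once now is past scheduled + deadline; never true without a deadline."""
    if starting_deadline_seconds is None:
        return False
    return now > scheduled + timedelta(seconds=starting_deadline_seconds)


# A job scheduled every 30 seconds whose controller was down for an hour.
last = datetime(2024, 1, 1, 12, 0)
now = datetime(2024, 1, 1, 13, 0)
every_30s = timedelta(seconds=30)
print(count_missed(last, now, every_30s) > MAX_MISSED)       # no deadline set
print(count_missed(last, now, every_30s, 200) > MAX_MISSED)  # deadline of 200 s
```

The second call shows why the event message suggests setting or decreasing .spec.startingDeadlineSeconds: a deadline narrows the window in which missed schedules are counted, which keeps the count under the limit.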

How to make Akka actors run only for a specific time period?

I have a big task, which I break down into smaller tasks and analyse. I have a basic model:
master, worker and listener.
The master creates the tasks and gives them to worker actors. Once a worker actor completes a task, it asks the master for another one. Once all tasks are completed, they inform the listener. It usually takes less than 2 minutes to complete 1000 tasks.
Now, sometimes some tasks take more time than others. I want to set a timer for each task, and if a task takes too long, the master should abort the worker's task and resubmit it later as a new one. How can I implement this? I can calculate the time taken by a worker task, but how does the master actor keep tabs on the time taken by all worker actors in real time?
One way of handling this would be for each worker, on receipt of a task, to set a receive timeout before changing state to process the task, e.g.:
context.setReceiveTimeout(5 minutes) // for the '5 minutes' notation - import scala.concurrent.duration._
If the timeout fires, the worker can abort the task (or take whatever other action you deem appropriate, e.g. kill itself or pass a notification message back to the master). Don't forget to cancel the timeout (set the duration to Duration.Undefined) once the task is completed.

Zookeeper priority queue

My problem description is as follows:
I have n state-based, infinite database crawlers.
How it currently works:
We use a single machine for crawling.
We have three levels of priority queue: High, Medium and Low.
At the start, all database jobs are put into the lowest-level queue.
A worker reads a job from the queue and performs the operation.
After finishing the job, it reschedules it with a delay of 5 minutes.
Solution I found
For the priority queue I can use:
- http://zookeeper.apache.org/doc/r3.2.2/recipes.html#sc_recipes_priorityQueues
Problems I am still looking for a solution to:
1. How to reschedule a job in the queue with a future schedule time. Is there a way to do that in ZooKeeper?
2. Canceling an already started job. Suppose a user changes their database authentication details. I want to stop the job already running for that database and restart it with the new details.
What I thought is that when a worker starts a job, it will subscribe to changes on that job's znode, and if something changes, it will stop the job and reschedule it.
3. Infinite queue
What I thought is that after finishing a job, the worker will remove it from the queue and re-add it with a future schedule time. (The implementation depends on point 1.)
Is this the correct way of running such an infinite task?