Quartz.NET fires a job multiple times - reset

I am fairly new to Quartz.NET. I have inherited a Windows service that uses Quartz.NET to schedule jobs.
There is a job that schedules all the other jobs (call it ResetJobs : IJob). Nightly at 9pm it downloads the job list from a database, deletes all currently scheduled jobs, and then schedules/starts the jobs in the downloaded list. One of the jobs in the downloaded list is ResetJobs itself.
When the Windows service starts, it downloads the job list (including ResetJobs) and schedules the jobs. When ResetJobs fires at the cron trigger time (0 0 21 1/1 * ? *), it immediately runs 10 times; the service log shows 10 calls to ResetJobs. The service runs on a single physical machine, not in a clustered environment. The [DisallowConcurrentExecution] attribute is on the ResetJobs class but doesn't help: ResetJobs still runs 10 times when fired.
I don't know if this is the root cause, but when ResetJobs fires, it deletes itself from the scheduler and schedules itself again. If this is a bad design, I would like to know how to do it properly.
Thanks
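
One way to avoid the delete-and-re-add-yourself pattern is to rebuild everything except the ResetJobs registration itself, so the running job never touches its own trigger. A minimal sketch against the Quartz.NET 3.x async API (JobSpec and LoadJobListFromDatabase are hypothetical placeholders, not types from the question):

    using System;
    using System.Threading.Tasks;
    using Quartz;
    using Quartz.Impl.Matchers;

    public record JobSpec(string Name, string TypeName, string Cron); // hypothetical

    [DisallowConcurrentExecution]
    public class ResetJobs : IJob
    {
        public async Task Execute(IJobExecutionContext context)
        {
            IScheduler scheduler = context.Scheduler;
            JobKey self = context.JobDetail.Key;

            // Delete every job except this one, so the running ResetJobs
            // instance never tears down its own registration and trigger.
            foreach (JobKey key in await scheduler.GetJobKeys(GroupMatcher<JobKey>.AnyGroup()))
            {
                if (!key.Equals(self))
                    await scheduler.DeleteJob(key);
            }

            // Re-create the downloaded jobs, skipping ourselves; our existing
            // trigger stays in place.
            foreach (JobSpec spec in await LoadJobListFromDatabase())
            {
                if (spec.Name == self.Name)
                    continue;

                IJobDetail job = JobBuilder.Create(Type.GetType(spec.TypeName))
                                           .WithIdentity(spec.Name)
                                           .Build();
                ITrigger trigger = TriggerBuilder.Create()
                                                 .WithIdentity(spec.Name + "-trigger")
                                                 .WithCronSchedule(spec.Cron)
                                                 .Build();
                await scheduler.ScheduleJob(job, trigger);
            }
        }

        private static Task<JobSpec[]> LoadJobListFromDatabase()
            => Task.FromResult(Array.Empty<JobSpec>()); // replace with your DB call
    }

If the ResetJobs cron itself can change in the database, applying the change via scheduler.RescheduleJob(existingTriggerKey, newTrigger) is safer than delete-and-re-add. Separately, ten identical firings are worth cross-checking against the number of scheduler instances started and the number of triggers attached to the job after startup.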

Related

DataStage: How to keep a continuous mode job running after an unexpected termination

I have a job that uses the Kafka Connector stage to read a Kafka queue and load the data into a database. That job runs in Continuous Mode, so it never finishes: it keeps monitoring the Kafka queue in real time.
For unexpected reasons (say, server issues, job issues, etc.) the job may terminate with a failure; in general that happens after around 300 hours of running. So, to keep the job alive, I have to manually check the job status and then do a Reset and Run.
The problem is that several hours can pass between the job termination and my manual Reset and Run, which is critical. So I'm looking for a way to eliminate the manual interaction and reduce that gap by automating the job invocation.
I tried using Control-M to run the job daily, but with no success: the first day Control-M called the job, it ran fine. But the next day, when Control-M attempted to start the job again, it failed (since the job was already running). Besides, DataStage will never report back to Control-M that the job concluded successfully, since the job's continuous nature won't allow that.
That said, I would like to hear any ideas that can shed some light on this.
The first thing that came to mind is to create an intermediate Sequence and schedule it in Control-M. This new Sequence would then call the continuous job asynchronously using a Command Line stage.
For the case where just this one job terminates unexpectedly and you want it restarted as soon as possible, have you considered calling the job from a Sequence? The Sequence could be set up to run the job in a loop.
The Sequence starts the job and waits for it to finish. When the job finishes, the Sequence loops and starts the job again. You can add conditions on job exit (for example, if the job aborted, then based on that end status you could reset the job before re-running it).
This would not handle the condition where the DataStage engine itself was shut down (such as for maintenance, or possibly an error), in which case all jobs end, including your new Sequence. The same applies to a server reboot or other situations where someone may have inadvertently stopped your Sequence. For those cases (such as a DataStage engine stop), your team would need a process in place for jobs/Sequences that must be started up following a DataStage or system outage.
For the outage scenario, you could create a monitor script (regardless of whether you run the job solo or from a Sequence) that sleeps/loops on a 5-10 minute interval, checks the status of your job using the dsjob command, and if it is not running, starts the job/Sequence (also via dsjob). You can decide whether that script starts at DataStage startup, at machine startup, or runs from Control-M or another scheduler; a rough sketch follows.
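
To make that last suggestion concrete, here is a rough monitor sketch in C# (kept in the same language as the other examples on this page; a shell script would work equally well). It assumes the dsjob CLI is on the PATH and that the -jobinfo and -run -mode RESET options behave as documented; verify the exact flags and the "RUNNING" status text against your engine version before relying on them.

    using System;
    using System.Diagnostics;
    using System.Threading;

    class DsJobMonitor
    {
        // Hypothetical project/job names; substitute your own.
        const string Project = "MyProject";
        const string Job = "KafkaContinuousJob";

        // Run the dsjob CLI and capture its output.
        static string Dsjob(string args)
        {
            var psi = new ProcessStartInfo("dsjob", args)
            {
                RedirectStandardOutput = true,
                UseShellExecute = false
            };
            using (var p = Process.Start(psi))
            {
                string output = p.StandardOutput.ReadToEnd();
                p.WaitForExit();
                return output;
            }
        }

        static void Main()
        {
            while (true)
            {
                // "dsjob -jobinfo" prints a "Job Status" line; matching on
                // RUNNING is a simplification - check the exact wording your
                // engine emits.
                string info = Dsjob($"-jobinfo {Project} {Job}");
                if (!info.Contains("RUNNING"))
                {
                    // Mirror the manual "Reset and Run": reset first, then run.
                    Dsjob($"-run -mode RESET -wait {Project} {Job}");
                    Dsjob($"-run {Project} {Job}");
                }
                Thread.Sleep(TimeSpan.FromMinutes(5)); // the 5-10 minute interval
            }
        }
    }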

Should we unschedule Sling Jobs running within AEM after they are completed?

I am creating multiple Sling jobs on the fly using the org.apache.sling.commons.scheduler.Scheduler OSGi service in AEM.
i.e. scheduler.schedule(Runnable, ScheduleOptions);
I have a requirement that these Sling jobs run only once, so I am using ScheduleOptions.AT(Date date, int times, long period) (see the ScheduleOptions docs)
And passing times=1 as a parameter.
(Also, what is the period parameter?)
The Job successfully runs only once.
My question is: am I supposed to keep track of each job by name and unschedule it using Scheduler.unschedule(String jobName) after it has finished running?
Will completed Sling jobs that are not unscheduled consume memory on the AEM server?
Will these completed but never-unscheduled jobs cause my AEM server to slow down and later require some purge activity as maintenance?
According to https://sling.apache.org/documentation/bundles/apache-sling-eventing-and-job-handling.html#scheduled-jobs
Internally the scheduled Jobs use the Commons Scheduler Service. But in addition they are persisted (by default below /var/eventing/scheduled-jobs) and survive therefore even server restarts. When the scheduled time is reached, the job is automatically added as regular Sling Job through the JobManager.
I had a problem with scheduled jobs before (they were triggered on a daily basis). When the server was restarted, the scheduled jobs weren't un-persisted, and a new job doing the same action was scheduled (the job was scheduled in the @Activate method). As a result, I got several jobs doing the same action at the scheduled time, so I had to unschedule them in the @Deactivate method.
You can run an experiment and make sure there are no duplicated jobs under /var/eventing/scheduled-jobs.

Delay in Kubernetes Job status update when running many jobs in parallel

I have a bit of a unique use-case where I want to run a large number (thousands to tens of thousands) of Kubernetes Jobs at once. Each job consists of a single container, Parallelism 1 and Completions 1, with no side-car or agent. My cluster has plenty of capacity for the resources I'm requesting.
My problem is that the Job status is not transitioning to Complete for a significant period of time when I run many jobs concurrently.
My application submits Jobs and has a watcher on the namespace: as soon as a Job's status transitions to succeeded: 1, we delete the Job and send information back to the application. The application needs this to happen as soon as possible in order to define and submit subsequent Jobs.
I'm able to submit new Job requests as fast as I want, and Pod scheduling happens without delay, but beyond about one or two hundred concurrent Jobs I get significant delay between a Job's Pod completing and the Job's status updating to Complete. At only around 1,000 jobs in the cluster, it can easily take 5-10 minutes for a Job status to update.
This tells me there is some process in the Kubernetes Control Plane that needs more resources to process Pod completion events more rapidly, or a configuration option that enables it to process more tasks in parallel. However, my system monitoring tools have not yet been able to identify any Control Plane services that are maxing out their available resources while the cluster processes the backlog, and all other operations on the cluster appear to be normal.
My question is - where should I look for system resource or configuration bottlenecks? I don't know enough about Kubernetes to know exactly what components are responsible for updating a Job's status.
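For reference, the Job status is written by the job controller, which runs inside kube-controller-manager, so that component (and its API-client throughput settings such as --kube-api-qps/--kube-api-burst) is the first place to look for the backlog. The watcher described above boils down to something like the sketch below using the official C# client (the KubernetesClient NuGet package); this is an illustration, not the poster's code, and the exact method placement (e.g. client.BatchV1 vs. methods directly on the client) varies between client versions:

    using System;
    using System.Threading;
    using System.Threading.Tasks;
    using k8s;
    using k8s.Models;

    class JobWatcher
    {
        static async Task Main()
        {
            var config = KubernetesClientConfiguration.BuildDefaultConfig();
            var client = new Kubernetes(config);
            const string ns = "default"; // placeholder namespace

            // Open a watch on Jobs in the namespace.
            var watchTask = client.BatchV1.ListNamespacedJobWithHttpMessagesAsync(
                ns, watch: true);

            // On each event, delete the Job once status.succeeded reaches 1.
            // (async-void callback: fine for a sketch, not production code.)
            using var watcher = watchTask.Watch<V1Job, V1JobList>(
                async (eventType, job) =>
                {
                    if (job.Status?.Succeeded >= 1)
                    {
                        await client.BatchV1.DeleteNamespacedJobAsync(
                            job.Metadata.Name, ns);
                        // ...then notify the application to submit follow-up Jobs.
                    }
                });

            await Task.Delay(Timeout.Infinite); // keep the watch open
        }
    }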

Prevent Cron jobs from Single Point of Failure

I have many cron jobs running on a server, including a DB backup (daily) and sending notifications to users (hourly).
Currently I have 5 API servers, and the cron jobs are set up on one of them.
I want to prevent the cron jobs from being a single point of failure: what happens if the machine on which the cron jobs are set up crashes?
Any suggestions, please?

Azure - Mobile Services - run background task every 10 seconds

I have a .NET backend mobile service in which I would like to run a task every 10 seconds. I cannot use a scheduled job, as those are limited to 60-second intervals.
I have tried running the task in a loop that aborts prior to the 60-second re-invocation, but that is not working out: the service seems to hang from time to time on the mobile device side.
--
Does anyone understand how I can run a .NET background task without using scheduled jobs?
Additionally, does anyone understand what happens when a scheduled job calls a service endpoint that has not yet returned from or completed a previous call? Can this be detected?
The scheduler in Mobile Services does not support what you are asking. Your best bet is to convert your job to a WebJob that runs continuously and do the 10-second wait yourself.
See WebJobs: https://azure.microsoft.com/en-us/documentation/articles/websites-webjobs-resources/
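
For the 10-second cadence, a continuous WebJob can be a plain console executable that the WebJobs host keeps alive (and restarts if it exits). A minimal sketch, where DoWorkAsync is a placeholder for your task:

    using System;
    using System.Threading.Tasks;

    class Program
    {
        static async Task Main()
        {
            while (true)
            {
                try
                {
                    await DoWorkAsync(); // your every-10-seconds work (stub below)
                }
                catch (Exception ex)
                {
                    Console.Error.WriteLine(ex); // log and keep looping
                }
                await Task.Delay(TimeSpan.FromSeconds(10));
            }
        }

        static Task DoWorkAsync() => Task.CompletedTask; // replace with real work
    }

Because each iteration awaits the previous run before delaying, the overlapping-invocation situation asked about above cannot occur; a new cycle starts only after the prior one returns.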