(Laravel 5) Monitor and optionally cancel an ALREADY RUNNING job on queue - queue

I need to achieve the ability to monitor and be able to cancel an ALREADY RUNNING job on queue.
There's a lot of answers about deleting QUEUED jobs, but not on an already running one.
This is the situation: I have a "job", which consists of HUNDREDS OF THOUSANDS rows on a database, that need to be queried ONE BY ONE against a web service.
Every row needs to be picked up, queried against a web service, stored the response and its status updated.
I had that already working as a Command (launching from / outputting to console), but now I need to implement queues in order to allow piling up more jobs from more users.
So far I've seen Horizon (which doesn't runs on Windows due to missing process control libs). However, in some demos seen around it lacks (I believe) a couple things I need:
Dynamically configurable timeout (the whole job may take more than 12 hours, depending on the number of rows to process on the selected job)
Ability to CANCEL an ALREADY RUNNING job.
I also considered the option to generate EACH REQUEST as a new job instead of seeing a "job" as the whole collection of rows (this would overcome the timeout thing), but that would give me a Horizon "pending jobs" list of hundreds of thousands of records per job, and that would kill the browser (I know Redis can handle this without itching at all). Further, I guess is not possible to cancel "all jobs belonging to X tag".
I've been thinking about hitting an API route, fire the job and decouple it from the app, but I'm seeing that this requires forking processes.
For the ability to cancel, I would implement a database with job_id, and when the user hits an API to cancel a job, I'd mark it as "halted". On every loop I would check its status and if it finds "halted" then kill itself.
If I've missed any aspect just holler and I'll add it or clarify about it.
So I'm asking for an advice here since I'm new to Laravel: how could I achieve this?

So I finally came up with this (a bit clunky) solution:
In Controller:
public function cancelJob()
{
$jobs = DB::table('jobs')->get();
# I could use a specific ID and user owner filter, etc.
foreach ($jobs as $job) {
DB::table('jobs')->delete($job->id);
}
# This is a file that... well, it's self explaining
touch(base_path(config('files.halt_process_signal')));
return "Job cancelled - It will stop soon";
}
In job class (inside model::chunk() function)
# CHECK FOR HALT SIGNAL AND [OPTIONALLY] STOP THE PROCESS
if ($this->service->shouldHaltProcess()) {
# build stats, do some cleanup, log, etc...
$this->halted = true;
$this->service->stopProcess();
# This FALSE is what it makes the chunk() method to stop looping
return false;
}
In service class:
/**
* Checks the existence of the 'Halt Process Signal' file
*
* #return bool
*/
public function shouldHaltProcess() :bool
{
return file_exists($this->config['files.halt_process_signal']);
}
/**
* Stop the batch process
*
* #return void
*/
public function stopProcess() :void
{
logger()->info("=== HALT PROCESS SIGNAL FOUND - STOPPING THE PROCESS ===");
$this->deleteHaltProcessSignalFile();
return ;
}
It doesn't looks quite elegant, but it works.
I've surfed the whole web and many goes for Horizon or other tools that doesn't fit my case.
If anyone has a better way to achieve this, it's welcome to share.

Laravel queue have 3 important config:
1. retry_after
2. timeout
3. tries
See more: https://laravel.com/docs/5.8/queues
Dynamically configurable timeout (the whole job may take more than 12
hours, depending on the number of rows to process on the selected job)
I think you can config timeout + retry_after about 24h.
Ability to CANCEL an ALREADY RUNNING job.
Delete job in jobs table
Delete process by process id in your server
Hope it help you :)

Related

Anylogic: Queue TimeOut blocks flow

I have a pretty simple Anylogic DE model where POs are launched regularly, and a certain amount of material gets to the incoming Queue in one shot (See Sample Picture below). Then the Manufacturing process starts using that material at a regular rate, but I want to check if the material in the queue gets outdated, so I'm using the TimeOut option of that queue, in order to scrap the outdated material (older than 40wks).
The problem is that every time that some material gets scrapped through this Timeout exit, the downstream Manufacturing process "stops" pulling more material, instead of continuing, and it does not get restarted until a new batch of material gets received into the Queue.
What am I doing wrong here? Thanks a lot in advance!!
Kindest regards
Your situation is interesting because there doesn't seem to be anything wrong with what you're doing. So even though what you are doing seems to be correct, I will provide you with a workaround. Instead of the Queue block, use a Wait block. You can assign a timeout and link the timeout port just like you did for the queue (seem image at the end of the answer).
In the On Enter field of the wait block (which I will assume is named Fridge), write the following code:
if( MFG.size() < MFG.capacity ) {
self.free(agent);
}
In the On Enter of MFG block write the following:
if( self.size() < self.capacity && Fridge.size() > 0 ) {
Fridge.free(Fridge.get(0));
}
And finally, in the On Exit of your MFG block write the following:
if( Fridge.size() > 0 ) {
Fridge.free(Fridge.get(0));
}
What we are doing in the above, is we are manually pushing the agents. Each time an agent is processed, the model checks if there is capacity to send more, if yes, a new agent is sent.
I know this is an unpleasant workaround, but it provides you with a solution until AnyLogic support can figure it out.

MS-MPI MPI_Barrier: sometimes hangs indefinitely, sometimes doesn't

I'm using the MPI.NET library, and I've recently moved my application to a bigger cluster (more COMPUTE-NODES). I've started seeing various collective functions hang indefinitely, but only sometimes. About half the time a job will complete, the rest of the time it'll hang. I've seen it happen with Scatter, Broadcast, and Barrier.
I've put a MPI.Communicator.world.Barrier() call (MPI.NET) at the start of the application, and created trace logs (using the MPIEXEC.exe /trace switch).
C# code snippet:
static void Main(string[] args)
{
var hostName = System.Environment.MachineName;
Logger.Trace($"Program.Main entered on {hostName}");
string[] mpiArgs = null;
MPI.Environment myEnvironment = null;
try
{
Logger.Trace($"Trying to instantiated on MPI.Environment on {hostName}. Is currently initialized? {MPI.Environment.Initialized}");
myEnvironment = new MPI.Environment(ref mpiArgs);
Logger.Trace($"Is currently initialized?{MPI.Environment.Initialized}. {hostName} is waiting at Barrier... ");
Communicator.world.Barrier(); // CODE HANGS HERE!
Logger.Trace($"{hostName} is past Barrier");
}
catch (Exception envEx)
{
Logger.Error(envEx, "Could not instantiate MPI.Environment object");
}
// rest of implementation here...
}
I can see the msmpi.dll's MPI_Barrier function being called in the log, and I can see messages being sent and received thereafter for a passing and a failing example. For the passing example, messages are sent/received and then the MPI_Barrier function Leave is logged.
For the failing example it look like one (or more) of the send messages is lost - it is never received by the target. Am I correct in thinking that messages lost within the MPI_Barrier call will mean that the processes never synchronize, therefore all get stuck at the Communicator.world.Barrier() call?
What could be causing this to happen intermittently? Could poor network performance between the COMPUTE-NODES be a cause?
I'm running MS HPC Pack 2008 R2, so the version of MS-MPI is pretty old, v2.0.
EDIT - Additional information
If I keep a task running within the same node, then this issue does not happen. For example, if I run a task using 8 cores on one node then fine, but if i use 9 cores on two nodes I'll see this issue ~50% of the time.
Also, we have two clusters in use and this only happens on one of them. They are both virtualized environments, but appear to be set up identically.

Moving from file-based tracing session to real time session

I need to log trace events during boot so I configure an AutoLogger with all the required providers. But when my service/process starts I want to switch to real-time mode so that the file doesn't explode.
I'm using TraceEvent and I can't figure out how to do this move correctly and atomically.
The first thing I tried:
const int timeToWait = 5000;
using (var tes = new TraceEventSession("TEMPSESSIONNAME", #"c:\temp\TEMPSESSIONNAME.etl") { StopOnDispose = false })
{
tes.EnableProvider(ProviderExtensions.ProviderName<MicrosoftWindowsKernelProcess>());
Thread.Sleep(timeToWait);
}
using (var tes = new TraceEventSession("TEMPSESSIONNAME", TraceEventSessionOptions.Attach))
{
Thread.Sleep(timeToWait);
tes.SetFileName(null);
Thread.Sleep(timeToWait);
Console.WriteLine("Done");
}
Here I wanted to make that I can transfer the session to real-time mode. But instead, the file I got contained events from a 15s period instead of just 10s.
The same happens if I use new TraceEventSession("TEMPSESSIONNAME", #"c:\temp\TEMPSESSIONNAME.etl", TraceEventSessionOptions.Create) instead.
It seems that the following will cause the file to stop being written to:
using (var tes = new TraceEventSession("TEMPSESSIONNAME"))
{
tes.EnableProvider(ProviderExtensions.ProviderName<MicrosoftWindowsKernelProcess>());
Thread.Sleep(timeToWait);
}
But here I must reenable all the providers and according to the documentation "if the session already existed it is closed and reopened (thus orphans are cleaned up on next use)". I don't understand the last part about orphans. Obviously some events might occur in the time between closing, opening and subscribing on the events. Does this mean I will lose these events or will I get the later?
I also found the following in the documentation of the library:
In real time mode, events are buffered and there is at least a second or so delay (typically 3 sec) between the firing of the event and the reception by the session (to allow events to be delivered in efficient clumps of many events)
Does this make the above code alright (well, unless the improbable happens and for some reason my thread is delayed for more than a second between creating the real-time session and starting processing the events)?
I could close the session and create a new different one but then I think I'd miss some events. Or I could open a new session and then close the file-based one but then I might get duplicate events.
I couldn't find online any examples of moving from a file-based trace to a real-time trace.
I managed to contact the author of TraceEvent and this is the answer I got:
Re the exception of the 'auto-closing and restarting' feature, it is really questions about the OS (TraceEvent simply calls the underlying OS API). Just FYI, the deal about orphans is that it is EASY for your process to exit but leave a session going. This MAY be what you want, but often it is not, and so to make the common case 'just work' if you do Create (which is the default), it will close a session if it already existed (since you asked for a new one).
Experimentation of course is the touchstone of 'truth' but I would frankly expecting unusual combinations to just work is generally NOT true.
My recommendation is to keep it simple. You need to open a new session and close the original one. Yes, you will end up with duplicates, but you CAN filter them out (after all they are IDENTICAL timestamps).
The other possibility is use SetFileName in its intended way (from one file to another). This certainly solves your problem of file size growth, and often is a good way to deal with other scenarios (after all you can start up you processing and start deleting files even as new files are being generated).

Play 1.2.3 framework - Right way to commit transaction

We have a HTTP end-point that takes a long time to run and can also be called concurrently by users. As part of this request, we update the model inside a synchronized block so that other (possibly concurrent) requests pick up that change.
E.g.
MyModel m = null;
synchronized (lockObject) {
m = MyModel.findById(id);
if (m.status == PENDING) {
m.status = ACTIVE;
} else {
//render a response back to user that the operation is not allowed
}
m.save(); //Is not expected to be called unless we set m.status = ACTIVE
}
//Long running operation continues here. It can involve further changes to instance "m"
The reason for the synchronized block is to ensure that even concurrent requests get to pick up the latest status. However, the underlying JPA does not commit my changes (m.save()) until the request is complete. Since this is a long-running request, I do not want to wait until the request is complete and still want to ensure that other callers are notified of the change in status. I tried to call "m.em().flush(); JPA.em().getTransaction().commit();" after m.save(), but that makes the transaction unavailable for the subsequent action as part of the same request. Can I just given "JPA.em().getTransaction().begin();" and let Play handle the transaction from then on? If not, what is the best way to handle this use-case?
UPDATE:
Based on the response, I modified my code as follows:
MyModel m = null;
synchronized (lockObject) {
m = MyModel.findById(id);
if (m.status == PENDING) {
m.status = ACTIVE;
} else {
//render a response back to user that the operation is not allowed
}
m.save(); //Is not expected to be called unless we set m.status = ACTIVE
}
new MyModelUpdateJob(m.id).now();
And in my job, I have the following line:
doJob() {
MyModel m = MyModel.findById(id);
print m.status; //This still prints the old status as-if m.save() had no effect...
}
What am I missing?
Put your update code in a job an call
new MyModelUpdateJob(id).now().get();
thus the update will be done in another transaction that is commited at the end of the job
ouch, as soon as you add more play servers, you will be in trouble. You may want to play with optimistic locking in your example or and I advise against it pessimistic locking....ick.
HOWEVER, looking at your code, maybe read the article Building on Quicksand. I am not sure you need a synchronized block in that case at all...try to go after being idempotent.
In your case if
1. user 1 and user 2 both call that method and it is pending, then it goes to active(Idempotent)
If user 1 or user 2 wins, well that would be like you had the synchronization block anyways.
I am sure however you have a more complex scenario not shown here, BUT READ that article Building on Quicksand as it really changes the traditional way of thinking and is how google and amazon and very large scale systems operate.
Another option for distributed transactions across play servers is zookeeper which the big large nosql guys use BUT only as a last resort ;) ;)
later,
Dean

Quartz.Net - delay a simple trigger to start

I have a few jobs setup in Quartz to run at set intervals. The problem is though that when the service starts it tries to start all the jobs at once... is there a way to add a delay to each job using the .xml config?
Here are 2 job trigger examples:
<simple>
<name>ProductSaleInTrigger</name>
<group>Jobs</group>
<description>Triggers the ProductSaleIn job</description>
<misfire-instruction>SmartPolicy</misfire-instruction>
<volatile>false</volatile>
<job-name>ProductSaleIn</job-name>
<job-group>Jobs</job-group>
<repeat-count>RepeatIndefinitely</repeat-count>
<repeat-interval>86400000</repeat-interval>
</simple>
<simple>
<name>CustomersOutTrigger</name>
<group>Jobs</group>
<description>Triggers the CustomersOut job</description>
<misfire-instruction>SmartPolicy</misfire-instruction>
<volatile>false</volatile>
<job-name>CustomersOut</job-name>
<job-group>Jobs</job-group>
<repeat-count>RepeatIndefinitely</repeat-count>
<repeat-interval>43200000</repeat-interval>
</simple>
As you see there are 2 triggers, the first repeats every day, the next repeats twice a day.
My issue is that I want either the first or second job to start a few minutes after the other... (because they are both in the end, accessing the same API and I don't want to overload the request)
Is there a repeat-delay or priority property? I can't find any documentation saying so..
I know you are doing this via XML but in code you can set the StartTimeUtc to delay say 30 seconds like this...
trigger.StartTimeUtc = DateTime.UtcNow.AddSeconds(30);
This isn't exactly a perfect answer for your XML file - but via code you can use the StartAt extension method when building your trigger.
/* calculate the next time you want your job to run - in this case top of the next hour */
var hourFromNow = DateTime.UtcNow.AddHours(1);
var topOfNextHour = new DateTime(hourFromNow.Year, hourFromNow.Month, hourFromNow.Day, hourFromNow.Hour, 0, 0);
/* build your trigger and call 'StartAt' */
TriggerBuilder.Create().WithIdentity("Delayed Job").WithSimpleSchedule(x => x.WithIntervalInSeconds(60).RepeatForever()).StartAt(new DateTimeOffset(topOfNextHour))
You've probably already seen this by now, but it's possible to chain jobs, though it's not supported out of the box.
http://quartznet.sourceforge.net/faq.html#howtochainjobs