Hangfire Postgres - re-enqueue job when server dies - postgresql

In our project we are using .NET Core, Hangfire and Postgres. We have some medium-duration jobs (~10-15 min) that we schedule on Hangfire.
The issue is that from time to time the Hangfire server that is processing a job dies and a new one is started. Obviously the job that was being processed needs re-enqueuing, as it will never be completed otherwise. Hangfire seems to know that the server which was executing the job is dead; nonetheless, the other servers won't pick up the job automatically, and we have to re-enqueue it ourselves, which is not great.
Is there a way to get Hangfire to re-enqueue processing jobs when the server that was executing them is dead?
Thanks a lot!
Donato
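
For context, one setting that is often pointed at in this situation with Hangfire.PostgreSql is the storage-level InvisibilityTimeout: a job fetched by a server that subsequently dies is generally only handed to another server once that timeout has elapsed, and the default has historically been longer than a 10-15 minute job. Below is a minimal configuration sketch, assuming an ASP.NET Core host with the Hangfire.AspNetCore and Hangfire.PostgreSql packages; exact option names and defaults can differ between versions:

    // Sketch only: assumes Hangfire.AspNetCore + Hangfire.PostgreSql packages.
    using System;
    using Hangfire;
    using Hangfire.PostgreSql;
    using Microsoft.Extensions.DependencyInjection;

    public static class HangfireSetup
    {
        public static void AddHangfireWithPostgres(this IServiceCollection services, string connectionString)
        {
            services.AddHangfire(config => config
                .UsePostgreSqlStorage(connectionString, new PostgreSqlStorageOptions
                {
                    // A job being processed by a server that dies stays "invisible" to the
                    // other servers until this timeout expires; only then is it fetched and
                    // re-run. Keep it longer than the longest job (~10-15 min here) but short
                    // enough that abandoned jobs are picked up again reasonably quickly.
                    InvisibilityTimeout = TimeSpan.FromMinutes(20)
                }));

            // Any server sharing the same storage can pick the job up once it becomes visible again.
            services.AddHangfireServer();
        }
    }

If the timeout alone does not help, it is also worth confirming that the job methods are safe to run more than once, since a re-queued job is executed again from the start rather than resumed.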

Related

Talend Automation Job taking too much time

I developed a job in Talend, built it, and automated it to run the Windows batch file from the build below.
When the batch file is executed, it invokes the dimtableinsert job and, after that finishes, it invokes fact_dim_combine. The process takes just minutes to run in Talend Open Studio, but when I invoke the batch file via the Task Scheduler it takes hours to finish.
Time Taken
Manual -- 5 Minutes
Automation -- 4 hours (on invoking Windows batch file)
Can someone please tell me what is wrong with this automation process?
The delay in execution is likely a latency issue. Talend might be installed on the same server as the database instance, so whenever you execute the job from Talend it completes as expected; but if the scheduler is installed on another server, then when you call the job through the scheduler it takes longer to insert the data.
Make sure your scheduler and database instance are on the same server.
Execute the job directly in the Windows terminal and check whether you have the same issue.
The easiest way to know what is taking so much time is to add some logs to your job.
First, add some tWarn components at the start and end of each subjob (dimtableinsert and fact_dim_combine) to find out which one takes the longest.
Then add more logs before/after the components inside the jobs.
This way you should have a better idea of what is responsible for the slowdown (DB access, writing of some files, etc.).

Restarting a Spring batch job after app server failure or spring batch repository DB failure?

When a Spring Batch repository DB failure happens or the server is shut down, a Spring Batch job that was running at the time is left in an unknown STARTED state.
In Spring Batch Admin we do not see an option to restart the job, hence we are not able to resume it.
How to restart the job from the last successful commit?
Older discussions suggest that this has to be dealt with manually by updating tables. I was able to manually update the end time and status in the batch step execution and batch job execution tables. Is that really the best option? It may not be practical to do that manually in a production region.
As mentioned in the Aborting a Job section of the reference documentation, when a server failure happens, the job repository has no way to know that the process running the job died. Hence manual intervention is required.
How to restart the job from the last successful commit?
Change the status of the job to FAILED and restart the job instance; it should continue from where it left off.

Prevent Cron jobs from Single Point of Failure

I have many cron jobs running on a server, including a DB backup (daily) and sending notifications to users (hourly).
Currently I have 5 API servers, and the cron jobs are set up on one of them.
I want to protect the cron jobs from a single point of failure. What happens if the machine on which the cron jobs are set up crashes?
Any suggestions please?
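
One common way to remove the single point of failure is to install the same cron entry on every API server and let the instances race for a shared lock, so only one of them actually does the work on each run. Here is a minimal sketch of that idea in C# using a Postgres advisory lock via Npgsql; the connection string, lock key and job body are placeholders, and any shared lock store (Redis, a DB table, etc.) would work the same way:

    // Sketch only: run this from the cron entry on all 5 servers; whichever instance
    // grabs the advisory lock first does the work, the others exit immediately.
    using Npgsql;

    class CronGuard
    {
        const long LockKey = 42001; // arbitrary, hypothetical key identifying this particular job

        static void Main()
        {
            using var conn = new NpgsqlConnection("Host=db;Username=app;Password=...;Database=app"); // placeholder
            conn.Open();

            using var tryLock = new NpgsqlCommand("SELECT pg_try_advisory_lock(@key)", conn);
            tryLock.Parameters.AddWithValue("key", LockKey);
            var acquired = (bool)tryLock.ExecuteScalar()!;

            if (!acquired)
                return; // another server is already running this job

            try
            {
                RunJob(); // placeholder for the real work (DB backup, notifications, ...)
            }
            finally
            {
                using var unlock = new NpgsqlCommand("SELECT pg_advisory_unlock(@key)", conn);
                unlock.Parameters.AddWithValue("key", LockKey);
                unlock.ExecuteScalar();
            }
        }

        static void RunJob() { /* hypothetical job body */ }
    }

The advisory lock lives in a database the jobs can already reach, so no extra infrastructure is needed; and if the server holding the lock crashes mid-run, Postgres releases the lock automatically when the session ends, so the job is not blocked forever.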

PgAgent jobs not executing on remote server

I don't understand why this isn't working. I set up a pgAgent job to send a NOTIFY from the database every hour.
The steps and the schedule (screenshots not reproduced here)
It turns out that the problem was that Heroku doesn't support pgAgent, and the database was running on Heroku. I ended up working around it by scheduling the tasks with the Windows Task Scheduler - it's not the best solution, but it does the job I needed it to do...

Jobs in a queue are dropped unexpectedly in Gearman

I'm dealing with a very strange problem now.
Since I started queuing over 1,000 jobs at once, Gearman hasn't been working properly...
The problem is that, when I submit the jobs in background mode, I can see the jobs are correctly queued on the monitoring page (Gearman monitor),
but the queue is drained right after (within a few seconds) without the jobs being delivered to the worker.
In the end, the jobs are never executed by the worker; they just disappear from the queue (job server).
So I tried rebooting the server entirely and reinstalling Gearman as well as the PHP library. (I'm using 1 CentOS and 1 Ubuntu server with the PHP Gearman library, versions 0.34 and 1.0.2.)
But no luck yet... the job server keeps misbehaving as I explained above.
What should I do for now?
Can I check the workers' state, or see and monitor the whole process, from queuing the jobs to delivering them to the worker?
When I tried gearmand with an option like 'gearmand -vvvv', it never printed anything on the screen while I registered a worker with the server and ran a job with client code (PHP).
Any comment will be appreciated.
For your information, I'm not considering a persistent queue using MySQL or SQLite for now, because it sometimes causes performance issues with slow execution.