Spring Batch state when a step fails - spring-batch

I'm trying out Spring Batch. I have seen many examples of running jobs via ItemReader and ItemWriter. If a job runs without errors there is no problem.
But I haven't found out how to handle state when a job fails after processing a number of records.
My scenario is really simple: read records from an XML file (ItemReader) and call an external system to store them (ItemWriter). So what happens if the external system is not available in the middle of the process and after a while the job status is set to FAILED? If I restart the job manually the next day, when the external system is up and running again, I will get duplicates for the previously loaded records.
In some way I must have information for skipping the already loaded records.
I have tried to store a cursor via the ExecutionContext, but when I restart the job I get a new JOB_EXECUTION_ID and the cursor data is lost because I get a new row in BATCH_STEP_EXECUTION_CONTEXT.SHORT_CONTEXT. BATCH_STEP_EXECUTION.COMMIT_COUNT and BATCH_STEP_EXECUTION.READ_COUNT are also reset on restart.
I restart the job by using the JobOperator:
jobOperator.restart(jobExecutionId);
Is there a way of restarting a job without getting a new jobExecutionId, or an alternative way of getting the state of failed jobs? If someone has found (or can provide) an example covering state and error handling I would be happy.
One alternative solution is of course to create my own table that keeps track of processed records, but I really hope the framework has a mechanism for this. Otherwise I don't understand the point of Spring Batch.
Regards
Mats

One of the primary features Spring Batch provides is the persistence of the state of a job in the job repository. When a job fails, upon restart, the default behavior is for the job to restart at the step that failed (skipping the steps that have already been successfully completed). Within a chunk based step, most of our readers (the StaxEventItemReader included) store what records have been processed in the job repository (specifically within the ExecutionContext). By default, when a chunk based step fails, it's restarted at the chunk that failed last time, skipping the successfully processed chunks.
An example of all of this would be if you had a three step job:
<job id="job1">
    <step id="step1" next="step2">
        <tasklet>
            <chunk reader="reader1" writer="writer1" commit-interval="10"/>
        </tasklet>
    </step>
    <step id="step2" next="step3">
        <tasklet>
            <chunk reader="reader2" writer="writer2" commit-interval="10"/>
        </tasklet>
    </step>
    <step id="step3">
        <tasklet>
            <chunk reader="reader3" writer="writer3" commit-interval="10"/>
        </tasklet>
    </step>
</job>
And let's say this job completes step1, then step2 has 1000 records to process but fails at record 507. The chunk that consists of records 500-510 would roll back and the job would be marked as failed. The restart of that job would skip step1, skip records 1-499 in step2 and start back at record 500 of step2 (assuming you're using stateful item readers).
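The "stateful item readers" part is the key to the original question: the StaxEventItemReader does this bookkeeping for you, but a hand-written reader only gets it if it implements ItemStream so that its position is saved to the step's ExecutionContext at every chunk boundary. A minimal sketch of that idea (the class name and the list-backed source are illustrative only, not part of the framework):

import java.util.List;

import org.springframework.batch.item.ExecutionContext;
import org.springframework.batch.item.ItemStreamException;
import org.springframework.batch.item.ItemStreamReader;

public class CursorTrackingReader<T> implements ItemStreamReader<T> {

    private static final String INDEX_KEY = "cursorTrackingReader.current.index";

    private final List<T> records; // stands in for the parsed XML records
    private int currentIndex = 0;

    public CursorTrackingReader(List<T> records) {
        this.records = records;
    }

    @Override
    public void open(ExecutionContext executionContext) throws ItemStreamException {
        // On restart, Spring Batch hands back the ExecutionContext persisted for
        // the failed step execution, so the cursor can be restored from it.
        if (executionContext.containsKey(INDEX_KEY)) {
            currentIndex = executionContext.getInt(INDEX_KEY);
        }
    }

    @Override
    public void update(ExecutionContext executionContext) throws ItemStreamException {
        // Called at each chunk boundary; the framework persists this value to
        // BATCH_STEP_EXECUTION_CONTEXT together with the chunk's commit.
        executionContext.putInt(INDEX_KEY, currentIndex);
    }

    @Override
    public void close() throws ItemStreamException {
        // nothing to release in this sketch
    }

    @Override
    public T read() {
        if (currentIndex >= records.size()) {
            return null; // null signals end of input to the step
        }
        return records.get(currentIndex++);
    }
}

If this reader is the one referenced by the chunk element, the framework registers it as a stream and calls open, update and close around the chunks, so a restart resumes at the first record of the chunk that rolled back.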
With regards to the jobExecutionId on a restart, Spring Batch has the concept of a job instance (a logical run) and a job execution (a physical run). For a job that runs daily, the logical run would be the Monday run, the Tuesday run, etc. Each of these would consist of their own JobInstance. If the job is successful, the JobInstance would end up with only one JobExecution associated with it. If it failed and was re-run, a new JobExecution would be created for each of the times the job is restarted.
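So a restart will always give you a new jobExecutionId, and that is fine: the new execution belongs to the same JobInstance, and the saved state of the failed step is carried over to it. A sketch of looking up and restarting the last failed execution (the wrapper class and method name are my own; JobExplorer and JobOperator are the standard Spring Batch interfaces):

import java.util.List;

import org.springframework.batch.core.BatchStatus;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobInstance;
import org.springframework.batch.core.explore.JobExplorer;
import org.springframework.batch.core.launch.JobOperator;

public class FailedJobRestarter {

    private final JobExplorer jobExplorer;
    private final JobOperator jobOperator;

    public FailedJobRestarter(JobExplorer jobExplorer, JobOperator jobOperator) {
        this.jobExplorer = jobExplorer;
        this.jobOperator = jobOperator;
    }

    // Restarts the most recent FAILED execution of the given job. The restart
    // creates a new JobExecution (new id), but it belongs to the same
    // JobInstance, which is how the saved reader state is picked up again.
    public Long restartLastFailedExecution(String jobName) throws Exception {
        List<JobInstance> instances = jobExplorer.getJobInstances(jobName, 0, 1);
        if (instances.isEmpty()) {
            return null; // the job has never been run
        }
        for (JobExecution execution : jobExplorer.getJobExecutions(instances.get(0))) {
            if (execution.getStatus() == BatchStatus.FAILED) {
                return jobOperator.restart(execution.getId()); // id of the new execution
            }
        }
        return null; // nothing failed, nothing to restart
    }
}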
You can read more about error handling in general and specific scenarios in the Spring Batch documentation found here: http://docs.spring.io/spring-batch/trunk/reference/html/index.html

Related

Spring Batch Job Stop Using jobOperator

I have started my job using jobLauncher.run(processJob, jobParameters); and when I try to stop the job from another request using jobOperator.stop(jobExecution.getId()); I get this exception:
org.springframework.batch.core.launch.JobExecutionNotRunningException:
JobExecution must be running so that it can be stopped
Set<JobExecution> jobExecutionsSet = jobExplorer.findRunningJobExecutions("processJob");
for (JobExecution jobExecution : jobExecutionsSet) {
    System.err.println("job status : " + jobExecution.getStatus());
    if (jobExecution.getStatus() == BatchStatus.STARTED
            || jobExecution.getStatus() == BatchStatus.STARTING
            || jobExecution.getStatus() == BatchStatus.STOPPING) {
        jobOperator.stop(jobExecution.getId());
        System.out.println("###########Stopped#########");
    }
}
When I print the job status I always get job status : STOPPING, but the batch job keeps running.
It's a web app: first the user uploads a CSV file and an operation is started using Spring Batch; during this execution, if the user wants to stop it, a stop request comes in from another controller method and tries to stop the running job.
Please help me stop the running job.
If you stop a job while it is running (typically in a STARTED state), you should not get this exception. If you have this exception, it means you have stopped your job while it is currently stopping (that is what the STOPPING status means).
jobExplorer.findRunningJobExecutions returns only running executions, so if in the next line right after this one you have a job in STOPPING status, this means the status changed right after calling jobExplorer.findRunningJobExecutions. You need to be aware that this is possible and your controller should handle this case.
When you tell Spring Batch to stop a job it goes into STOPPING mode. What this means is it will attempt to complete the unit of work (chunk) it is currently processing and then stop working. Likely what's happening is you are working on a long running task that is not finishing a unit of work (is it hung?), so it can't move from STOPPING to STOPPED.
Doing it twice rightly leads to an exception, because your job is already STOPPING from the first call by the time you call it again.
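One way to make the controller tolerant of that race is to only issue stop for executions that are still STARTED and to treat JobExecutionNotRunningException as "already stopping or stopped". A rough sketch along those lines (the service class is mine; the JobExplorer/JobOperator calls are the same ones used in the question):

import java.util.Set;

import org.springframework.batch.core.BatchStatus;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.explore.JobExplorer;
import org.springframework.batch.core.launch.JobExecutionNotRunningException;
import org.springframework.batch.core.launch.JobOperator;
import org.springframework.batch.core.launch.NoSuchJobExecutionException;

public class JobStopService {

    private final JobExplorer jobExplorer;
    private final JobOperator jobOperator;

    public JobStopService(JobExplorer jobExplorer, JobOperator jobOperator) {
        this.jobExplorer = jobExplorer;
        this.jobOperator = jobOperator;
    }

    // Requests a stop for every running execution of the job, tolerating the
    // race where an execution finishes or flips to STOPPING in between.
    public void requestStop(String jobName) {
        Set<JobExecution> running = jobExplorer.findRunningJobExecutions(jobName);
        for (JobExecution execution : running) {
            if (execution.getStatus() == BatchStatus.STARTED) {
                try {
                    // flips the execution to STOPPING; it becomes STOPPED once
                    // the current chunk finishes
                    jobOperator.stop(execution.getId());
                } catch (JobExecutionNotRunningException | NoSuchJobExecutionException e) {
                    // The execution completed or was already stopping/removed
                    // between the query and the stop call; nothing more to do.
                }
            }
        }
    }
}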

How to log the job log to MongoDB using Talend

How do I log the job log, whether the job succeeded or failed, to MongoDB once the job has completed in Talend?
If you want to save the job log into a table, follow the steps below:
Main job --> On Subjob Ok --> tFixedFlowInput with variables jobname, success --> tDBxxOutput
Main job --> On Subjob Error --> tFixedFlowInput with variables jobname, fail --> tDBxxOutput

Windows Task Scheduler incorrectly spawns multiple instances per trigger (but need to be able to run multiple instances in parallel)

I have a task set to run an executable every 30 minutes via Windows Task Scheduler (OS: 64bit Windows Server Standard with SP2). This task needs to be able to run multiple instances of itself simultaneously, so this setting is selected: "If the task is already running, then the following rule applies: Run a new instance in parallel". (Reason: the task processes records in a queue table which may be empty or contain hundreds of thousands of records. Each task instance reserves a chunk of records to work on, so the instances won't collide)
The problem is, the task spawns multiple NEW instances at each trigger interval. It should only fire ONE NEW instance every 30 minutes. Often it spawns 2, 3, 4 or more new instances. At this point, the executable can handle the duplicate new instances without significant errors, but the server is doing more work than it needs to, and it just bugs me to no end that the task scheduler is misbehaving in this way. Here is what I have tried so far to fix:
Deleted and recreated the task (many times)
Rebooted the server
Installed this hotfix: http://support.microsoft.com/en-us/kb/2461249
Set to run every 30 minutes indefinitely
Set to run every 30 minutes daily, for duration of one day
Set "Synchronize across time zones" = true
Set "Run with highest privileges" = true
Set "Delay task for up to random delay of [X] seconds" = false (multiple new instances are spawned all within the same second)
Set "Delay task for up to random delay of [30] seconds" = true (instead of firing during the same second, multiple new instances fire within a 30 second span)
Set "If the task fails, restart every 1 minute" = true
Set "If the task fails, restart every 1 minute" = false
Set "Run task as soon as possible after a scheduled start is missed" = false (if set to true, the problem is worse)
Even more puzzling: some other tasks on this server have the same or similar settings and do not have this problem. They had the problem before the hotfix, but after the hotfix it has been rare. Except for this one task. What on earth could be the problem?
The exported task settings are below (with XXXX replacing sensitive info). I compared this point for point with another similar task that is not having the issue. The only differences: the working task has a different author, a different exe file, and runs every 5 minutes instead of every 30 minutes.
I'm about to chalk this up as a bug that Microsoft needs to fix some day, but thought I'd offer it up for review here before giving up.
<?xml version="1.0" encoding="UTF-16"?>
<Task version="1.2" xmlns="http://schemas.microsoft.com/windows/2004/02/mit/task">
  <RegistrationInfo>
    <Date>2014-08-27T10:09:33.7980839</Date>
    <Author>XXXX\XXXX</Author>
    <Description>Links newly downloaded images to products. Resizes and uploads different sizes to XXXX. Updates relevant tables. Logs errors.</Description>
  </RegistrationInfo>
  <Triggers>
    <CalendarTrigger>
      <Repetition>
        <Interval>PT30M</Interval>
        <Duration>P1D</Duration>
        <StopAtDurationEnd>false</StopAtDurationEnd>
      </Repetition>
      <StartBoundary>2015-02-11T19:06:00Z</StartBoundary>
      <Enabled>true</Enabled>
      <RandomDelay>PT30S</RandomDelay>
      <ScheduleByDay>
        <DaysInterval>1</DaysInterval>
      </ScheduleByDay>
    </CalendarTrigger>
  </Triggers>
  <Principals>
    <Principal id="Author">
      <UserId>XXXX\XXXX</UserId>
      <LogonType>Password</LogonType>
      <RunLevel>HighestAvailable</RunLevel>
    </Principal>
  </Principals>
  <Settings>
    <IdleSettings>
      <Duration>PT10M</Duration>
      <WaitTimeout>PT1H</WaitTimeout>
      <StopOnIdleEnd>true</StopOnIdleEnd>
      <RestartOnIdle>false</RestartOnIdle>
    </IdleSettings>
    <MultipleInstancesPolicy>Parallel</MultipleInstancesPolicy>
    <DisallowStartIfOnBatteries>false</DisallowStartIfOnBatteries>
    <StopIfGoingOnBatteries>true</StopIfGoingOnBatteries>
    <AllowHardTerminate>true</AllowHardTerminate>
    <StartWhenAvailable>false</StartWhenAvailable>
    <RunOnlyIfNetworkAvailable>false</RunOnlyIfNetworkAvailable>
    <AllowStartOnDemand>true</AllowStartOnDemand>
    <Enabled>true</Enabled>
    <Hidden>false</Hidden>
    <RunOnlyIfIdle>false</RunOnlyIfIdle>
    <WakeToRun>true</WakeToRun>
    <ExecutionTimeLimit>P1D</ExecutionTimeLimit>
    <Priority>7</Priority>
  </Settings>
  <Actions Context="Author">
    <Exec>
      <Command>C:\XXXX\XXXX\XXXX.exe</Command>
    </Exec>
  </Actions>
</Task>
I was in the same boat: I had one task firing off 3 times and another firing off 9 times, but also a bunch that only fired once as expected. The problem persisted after installing the hotfix as well.
After all my research turned up no good leads, my next step was going to be opening a support case with Microsoft. Before doing so I figured I would try deleting and recreating the task now that I've installed the patch. I started by only deleting and recreating the trigger (which was set to run once a day) and setting it for a different time. Bingo, my problem was fixed!
So I don't know what the key was, whether it was deleting and recreating the trigger after the patch was installed or if it was changing the time, but doing both worked.
Hope this helps!

Is it available multi task tag in <StartUp> or do I have to merge these cmd files to only one?

I am new to Azure development and writing PowerShell scripts.
I want to run two cmd files as Azure startup tasks. I added these files to the solution and set their properties to "Copy Always". Then I added a new node to ServiceDefinition.csdef. Here it is:
<Startup>
  <Task commandLine="Startup\startupcmd.cmd > c:\logs\startuptasks.log" executionContext="elevated" taskType="background">
    <Environment>
      <Variable name="EMULATED">
        <RoleInstanceValue xpath="/RoleEnvironment/Deployment/#emulated" />
      </Variable>
    </Environment>
  </Task>
  <Task commandLine="Startup\disableTimeout.cmd" executionContext="elevated" />
</Startup>
It's not deploying and I'm getting this error: Instance 0 of role Web is busy
Now to my question: can I use multiple Task tags in <Startup>, or do I have to merge these cmd files into just one?
As per definition:
The Startup element describes a collection of tasks that run when the
role is started.
So yes, the answer to your concrete question is: Yes, you can define multiple startup tasks.
The Busy state is almost fine, in the sense that it is a bit better than cycling! What I would suggest is to enable Remote Desktop and connect to see what is going on with the startup task. Busy is set until all simple tasks have completed and returned a 0 exit code. Your task may fail or hang for a while, and that's why you would see Busy.

NServiceBus pipeline with Distributors

I'm building a processing pipeline with NServiceBus but I'm having trouble with the configuration of the distributors in order to make each step in the process scalable. Here's some info:
The pipeline will have a master process that says "OK, time to start" for a WorkItem, which will then start a process like a flowchart.
Each step in the flowchart may be computationally expensive, so I want the ability to scale out each step. This tells me that each step needs a Distributor.
I want to be able to hook additional activities onto events later. This tells me I need to Publish() messages when it is done, not Send() them.
A process may need to branch based on a condition. This tells me that a process must be able to publish more than one type of message.
A process may need to join forks. I imagine I should use Sagas for this.
Hopefully these assumptions are good otherwise I'm in more trouble than I thought.
For the sake of simplicity, let's forget about forking or joining and consider a simple pipeline, with Step A followed by Step B, and ending with Step C. Each step gets its own distributor and can have many nodes processing messages.
NodeA workers contain an IHandleMessages processor and publish EventA
NodeB workers contain an IHandleMessages processor and publish EventB
NodeC workers contain an IHandleMessages processor, and then the pipeline is complete.
Here are the relevant parts of the config files, where # denotes the number of the worker, (i.e. there are input queues NodeA.1 and NodeA.2):
NodeA:
<MsmqTransportConfig InputQueue="NodeA.#" ErrorQueue="error" NumberOfWorkerThreads="1" MaxRetries="5" />
<UnicastBusConfig DistributorControlAddress="NodeA.Distrib.Control" DistributorDataAddress="NodeA.Distrib.Data">
  <MessageEndpointMappings>
  </MessageEndpointMappings>
</UnicastBusConfig>
NodeB:
<MsmqTransportConfig InputQueue="NodeB.#" ErrorQueue="error" NumberOfWorkerThreads="1" MaxRetries="5" />
<UnicastBusConfig DistributorControlAddress="NodeB.Distrib.Control" DistributorDataAddress="NodeB.Distrib.Data">
  <MessageEndpointMappings>
    <add Messages="Messages.EventA, Messages" Endpoint="NodeA.Distrib.Data" />
  </MessageEndpointMappings>
</UnicastBusConfig>
NodeC:
<MsmqTransportConfig InputQueue="NodeC.#" ErrorQueue="error" NumberOfWorkerThreads="1" MaxRetries="5" />
<UnicastBusConfig DistributorControlAddress="NodeC.Distrib.Control" DistributorDataAddress="NodeC.Distrib.Data">
  <MessageEndpointMappings>
    <add Messages="Messages.EventB, Messages" Endpoint="NodeB.Distrib.Data" />
  </MessageEndpointMappings>
</UnicastBusConfig>
And here are the relevant parts of the distributor configs:
Distributor A:
<add key="DataInputQueue" value="NodeA.Distrib.Data"/>
<add key="ControlInputQueue" value="NodeA.Distrib.Control"/>
<add key="StorageQueue" value="NodeA.Distrib.Storage"/>
Distributor B:
<add key="DataInputQueue" value="NodeB.Distrib.Data"/>
<add key="ControlInputQueue" value="NodeB.Distrib.Control"/>
<add key="StorageQueue" value="NodeB.Distrib.Storage"/>
Distributor C:
<add key="DataInputQueue" value="NodeC.Distrib.Data"/>
<add key="ControlInputQueue" value="NodeC.Distrib.Control"/>
<add key="StorageQueue" value="NodeC.Distrib.Storage"/>
I'm testing using 2 instances of each node, and the problem seems to come up in the middle at Node B. There are basically 2 things that might happen:
Both instances of Node B report that it is subscribing to EventA, and also that NodeC.Distrib.Data#MYCOMPUTER is subscribing to the EventB that Node B publishes. In this case, everything works great.
Both instances of Node B report that it is subscribing to EventA, however, one worker says NodeC.Distrib.Data#MYCOMPUTER is subscribing TWICE, while the other worker does not mention it.
In the second case, which seem to be controlled only by the way the distributor routes the subscription messages, if the "overachiever" node processes an EventA, all is well. If the "underachiever" processes EventA, then the publish of EventB has no subscribers and the workflow dies.
So, my questions:
Is this kind of setup possible?
Is the configuration correct? It's hard to find any examples of configuration with distributors beyond a simple one-level publisher/2-worker setup.
Would it make more sense to have one central broker process that does all the non-computationally-intensive traffic cop operations, and only sends messages to processes behind distributors when the task is long-running and must be load balanced?
Then the load-balanced nodes could simply reply back to the central broker, which seems easier.
On the other hand, that seems at odds with the decentralization that is NServiceBus's strength.
And if this is the answer, and the long running process's done event is a reply, how do you keep the Publish that enables later extensibility on published events?
The problem you have is that your nodes don't see each other's list of subscribers. The reason you're having that problem is that you're trying out a production scenario (scale-out) under the default NServiceBus profile (Lite), which doesn't support scale-out but makes single-machine development very productive.
To solve the problem, run the NServiceBus host using the production profile as described on this page:
http://docs.particular.net/nservicebus/hosting/nservicebus-host/profiles
That will let different nodes share the same list of subscribers.
Other than that, your configuration is right on.