I am writing a Spring Batch job for chunk processing.
I have a single Job and a single Step. Within that Step I have chunks which are dynamically sized:
SingleJob -> SingleStep -> chunk_1..chunk_2..chunk_3... and so on
The following is the case I am trying to implement:
Say today I ran the Job and only chunk_2 failed while the rest of the chunks ran successfully. Tomorrow I want to run/restart ONLY the failed chunks, i.e. in this case chunk_2. (I don't want to re-run the whole Job/Step or the other successfully completed chunks.)
I see that Spring Batch stores metadata and uses it to restart Jobs, but I could not work out whether it is possible to restart a specific chunk as discussed above.
Am I missing a concept? If it is possible, any pseudo code, theoretical explanation, or reference will help.
I appreciate your response.
That's how Spring Batch works in a restart scenario: it continues where it left off in the previous failed run.
So in your example, if chunk_1 was processed correctly in the first run and chunk_2 failed, the next job execution will restart at chunk_2.
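For reference, here is a minimal sketch of what such a restartable chunk-oriented step looks like (Spring Batch 4 style builders; the bean names and the InputRecord/OutputRecord types are placeholders). It assumes a persistent job repository and a stateful reader such as FlatFileItemReader, which saves its read position in the ExecutionContext at every chunk commit, so a restart with the same identifying JobParameters resumes at the failed chunk:

    import org.springframework.batch.core.Job;
    import org.springframework.batch.core.Step;
    import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
    import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
    import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
    import org.springframework.batch.item.ItemProcessor;
    import org.springframework.batch.item.ItemReader;
    import org.springframework.batch.item.ItemWriter;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;

    @Configuration
    @EnableBatchProcessing
    public class DailyJobConfig {

        // Chunk-oriented step: every 100 items the transaction commits and the
        // reader's position is stored in the step's ExecutionContext.
        @Bean
        public Step dailyStep(StepBuilderFactory steps,
                              ItemReader<InputRecord> reader,
                              ItemProcessor<InputRecord, OutputRecord> processor,
                              ItemWriter<OutputRecord> writer) {
            return steps.get("dailyStep")
                    .<InputRecord, OutputRecord>chunk(100) // commit interval, i.e. the chunk size
                    .reader(reader)
                    .processor(processor)
                    .writer(writer)
                    .build();
        }

        @Bean
        public Job dailyJob(JobBuilderFactory jobs, Step dailyStep) {
            return jobs.get("dailyJob")
                    .start(dailyStep)
                    .build();
        }
    }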
Can I restart a job and process only the skipped items after I have corrected the mistakes in the file? I'm reading the documentation and currently cannot find this possibility. You can restart a job if it has failed, but I'm thinking of restarting a job after it has completed with some skipped items. If this cannot be achieved with configuration, what would be a good way to implement it myself?
What I did in a case similar to yours was to log each skipped item to a file.
Then I created a second job that loads that file and processes all the logged items.
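As an illustration of that approach (not the exact code; the SkippedItemLogger name, file path, and item types are made up for the example), a SkipListener registered on the fault-tolerant step can write each skipped item to the file that the second job later reads:

    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;
    import org.springframework.batch.core.SkipListener;

    // Register on the step with .faultTolerant().skip(...).skipLimit(...).listener(new SkippedItemLogger()).
    public class SkippedItemLogger implements SkipListener<InputRecord, OutputRecord> {

        private static final Path SKIP_FILE = Paths.get("skipped-items.txt"); // placeholder location

        @Override
        public void onSkipInRead(Throwable t) {
            append("READ_ERROR: " + t.getMessage());
        }

        @Override
        public void onSkipInProcess(InputRecord item, Throwable t) {
            append("PROCESS: " + item);
        }

        @Override
        public void onSkipInWrite(OutputRecord item, Throwable t) {
            append("WRITE: " + item);
        }

        private void append(String line) {
            try {
                Files.writeString(SKIP_FILE, line + System.lineSeparator(),
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }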
I have a Spring Batch job which has two steps. The first step's writer writes to memory, that is, it stores the data in a Java data structure.
Is this correct? Does the writer have to write to persistent storage? If the second step fails, would the job be able to restart correctly if I wrote to memory in the first step? Is my assumption correct that a commit doesn't mean anything if I do things this way?
A writer does not have to write to persistent storage. However, if the job fails and the JVM is stopped, you will lose that data.
Using a persistent job repository ensures that the restart meta-data can survive a JVM crash, hence the ability to restart the job where it left off.
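To make the trade-off concrete, here is a rough sketch of such an in-memory writer (Spring Batch 4 signatures; InputRecord is a placeholder type). It is perfectly legal, but the buffer lives only in the JVM heap, so a crash loses it even though the persistent JobRepository still remembers the committed reader position for a restart:

    import java.util.List;
    import java.util.concurrent.CopyOnWriteArrayList;
    import org.springframework.batch.item.ItemWriter;

    public class InMemoryWriter implements ItemWriter<InputRecord> {

        private final List<InputRecord> buffer = new CopyOnWriteArrayList<>();

        @Override
        public void write(List<? extends InputRecord> items) {
            // The chunk "commit" only persists batch meta-data (reader position, counts);
            // the items themselves stay in memory.
            buffer.addAll(items);
        }

        public List<InputRecord> getBuffer() {
            return buffer;
        }
    }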
Spring Batch seems to be lacking metadata for the job definition in the database.
In order to create a job instance in the database, the only things it considers are the job name and the job parameters: "JobInstance createJobInstance(String jobName, JobParameters jobParameters);"
But the object model of a Job is rich enough to include steps and listeners. So, if I create a new version of an existing job by adding a few additional steps, Spring Batch does not distinguish it from the previous version. Hence, if I run the previous version today and then run the updated version, Spring Batch does not run the updated version, as it sees that the previous run was successful. At present, it seems like the version number of the job should be part of the name. Is this a correct understanding?
You are correct that the framework identifies each job instance by a unique combination of job name and (identifying) job parameters.
In general, if a job fails, you should be able to re-run with the same parameters to restart the failed instance. However, you cannot restart a completed instance. From the documentation:
A JobInstance can be restarted multiple times in case of execution failure, and its lifecycle ends with the first successful execution. Trying to execute an existing JobInstance that has already completed successfully will result in an error. An error will also be raised for an attempt to restart a failed JobInstance if the Job is not restartable.
So you're right that the same job name and identifying parameters cannot be run to successful completion multiple times. The framework design prevents this, regardless of what business steps the job performs. Again, ignoring what your job actually does, here's how it would work:
1) jobName=myJob, parm1=foo , parm2=bar -> runs and fails (assume some exception)
2) jobName=myJob, parm1=foo , parm2=bar -> restarts failed instance and completes
3) jobName=myJob, parm1=foo , parm2=bar -> fails on startup (as expected)
4) jobName=myJob, parm1=foobar, parm2=bar -> new params, runs and completes
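In code, that sequence corresponds to something like the following sketch (jobLauncher and myJob are assumed to be wired Spring beans; "myJob", "parm1", and "parm2" are taken from the example above):

    import org.springframework.batch.core.Job;
    import org.springframework.batch.core.JobParameters;
    import org.springframework.batch.core.JobParametersBuilder;
    import org.springframework.batch.core.launch.JobLauncher;

    public class RunSequenceExample {

        public void runSequence(JobLauncher jobLauncher, Job myJob) throws Exception {
            JobParameters params = new JobParametersBuilder()
                    .addString("parm1", "foo")
                    .addString("parm2", "bar")
                    .toJobParameters();

            jobLauncher.run(myJob, params);  // 1) runs and fails (assume some exception)
            jobLauncher.run(myJob, params);  // 2) same identifying params -> restarts the failed instance and completes
            jobLauncher.run(myJob, params);  // 3) instance already completed -> JobInstanceAlreadyCompleteException

            JobParameters newParams = new JobParametersBuilder()
                    .addString("parm1", "foobar")
                    .addString("parm2", "bar")
                    .toJobParameters();
            jobLauncher.run(myJob, newParams); // 4) new identifying params -> new JobInstance, runs and completes
        }
    }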
The "best practices" we use are the following:
Each job instance (usually defined by the run date or the filename we are processing) must define a unique set of parameters (otherwise it will fail, per the framework design)
Jobs that run multiple times a day but just scan a work table or something similar use an incrementer to pass an integer parameter, which we increase by 1 upon each successful completion (see the sketch after this list)
Any failed job instances must be either restarted or abandoned before pushing code changes that affect how the job functions
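A rough sketch of the incrementer idea, using the stock RunIdIncrementer (Spring Batch 4 builder; "workTableJob" and the step name are placeholders). RunIdIncrementer bumps a "run.id" parameter on every launch rather than only after a successful completion, so treat it as an approximation of the in-house incrementer described above:

    import org.springframework.batch.core.Job;
    import org.springframework.batch.core.Step;
    import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
    import org.springframework.batch.core.launch.support.RunIdIncrementer;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;

    @Configuration
    public class WorkTableJobConfig {

        @Bean
        public Job workTableJob(JobBuilderFactory jobs, Step scanWorkTableStep) {
            return jobs.get("workTableJob")
                    .incrementer(new RunIdIncrementer()) // adds/increments the "run.id" identifying parameter
                    .start(scanWorkTableStep)
                    .build();
        }
    }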
What is the recommended way to launch a Spark job on demand from within an enterprise application (in Java or Scala)? There is a processing step which currently takes several minutes to complete. I would like to use a Spark cluster to reduce the processing time to, let's say, less than 15 seconds:
Rewrite the time-consuming process in Spark and Scala.
The parameters would be passed to the JAR as command-line arguments. The Spark job would then acquire the source data from a database, do the processing, and save the output in a location readable by the enterprise application.
Question 1: How do I launch the Spark job on demand from within the enterprise application? The Spark cluster (standalone) is on the same LAN but separate from the servers on which the enterprise app is running.
Question 2: What is the recommended way to transmit the processing results back to the caller code?
Question 3: How do I notify the caller code about job completion (or failure, such as the Spark cluster being down, a job timeout, or an exception in the Spark code)?
You could try spark-jobserver. Upload your spark.jar to the server, and from your application call the job in your spark.jar using the REST interface. To know whether your job has completed, you can keep polling the REST interface. When your job completes, if the result is very small you can get it from the REST interface itself; but if the result is huge, it is better to save it to some database.
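A rough sketch of that submit-and-poll pattern using Java's built-in HttpClient. The host, app name, class path, status handling, and the exact endpoint/JSON shapes are assumptions; check the spark-jobserver documentation for the version you deploy:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.time.Duration;

    public class SparkJobClient {

        private static final String BASE = "http://spark-jobserver.example.com:8090"; // placeholder host/port
        private final HttpClient http = HttpClient.newHttpClient();

        // Submits the job and returns the raw JSON response (which contains the job id).
        public String submit(String appName, String classPath, String configBody) throws Exception {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(BASE + "/jobs?appName=" + appName + "&classPath=" + classPath))
                    .timeout(Duration.ofSeconds(10))
                    .POST(HttpRequest.BodyPublishers.ofString(configBody))
                    .build();
            return http.send(request, HttpResponse.BodyHandlers.ofString()).body();
        }

        // Polls the job status until it is no longer running, then returns the final JSON
        // (a finished job with a small result payload, or the error details for the caller to handle).
        public String awaitResult(String jobId, Duration pollInterval) throws Exception {
            while (true) {
                HttpRequest request = HttpRequest.newBuilder()
                        .uri(URI.create(BASE + "/jobs/" + jobId))
                        .GET()
                        .build();
                String body = http.send(request, HttpResponse.BodyHandlers.ofString()).body();
                if (!body.contains("RUNNING")) { // crude check; parse the JSON properly in real code
                    return body;
                }
                Thread.sleep(pollInterval.toMillis());
            }
        }
    }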
In a job I am using a MultiResourceItemReader to read files from a folder. In the processor of this job I need to invoke (initiate) a Job (to process the file) for each file available in the folder.
I am not sure whether I should model it as a Job or as a Step, so I need to programmatically start a Job or a Step (which in turn triggers multiple steps) from my MultiResourceItemReader job's processor.
Please let me know how I can do it. Any possible links/sample code will help.
Thank you very much.