Spring Batch complex conditional flow not working

I want to run a 3-step Spring Batch job just like this:
run step1
if step1 returns COMPLETED exit status then run step2
else fail the job
if step2 returns COMPLETED exit status then run step3
else fail the job
if step3 returns COMPLETED exit status then end the job
else fail the job
I implemented this as follows:
.start(step1())
    .on(ExitStatus.FAILED.getExitCode()).fail()
    .on(ExitStatus.COMPLETED.getExitCode()).to(step2())
.from(step2())
    .on(ExitStatus.FAILED.getExitCode()).fail()
    .on(ExitStatus.COMPLETED.getExitCode()).to(step3())
.from(step3())
    .on(ExitStatus.FAILED.getExitCode()).fail()
    .on(ExitStatus.COMPLETED.getExitCode()).end()
All three steps are standard chunk-oriented steps composed of a reader, a processor, and a writer. Steps 1 and 2 write to a database and step 3 writes to a file, and all the step listeners return the COMPLETED exit status (org.springframework.batch.core.ExitStatus#COMPLETED).
private Step step1() {
    return stepBuilderFactory.get("step1")
            .<Step1InClass, Step1OutClass>chunk(CHUNK_SIZE)
            .reader(step1Reader())
            .processor(step1Processor)
            .writer(step1Writer)
            .listener(step1Writer)
            .listener(step1StepListener)
            .taskExecutor(taskExecutor())
            .allowStartIfComplete(true)
            .build();
}

private Step step2() {
    return stepBuilderFactory.get("step2")
            .<Step2InClass, Step2OutClass>chunk(1)
            .reader(step2Reader())
            .processor(step2Processor())
            .faultTolerant()
            .skip(Exception.class)
            .skipLimit(Integer.MAX_VALUE)
            .writer(step2Writer)
            .listener(step2Writer)
            .listener(step2StepListener)
            .taskExecutor(taskExecutor())
            .allowStartIfComplete(true)
            .build();
}

private Step step3() {
    return stepBuilderFactory.get("step3")
            .<Step3InClass, Step3OutClass>chunk(CHUNK_SIZE)
            .reader(step3Reader())
            .processor(step3Processor)
            .writer(step3Writer()) // a FlatFileItemWriter
            .listener(step3Listener)
            .listener(step3StepListener)
            .taskExecutor(taskExecutor())
            .allowStartIfComplete(true)
            .build();
}
This runs only step1 and step2, but not step3!
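For reference, this kind of all-or-nothing flow is often spelled out with one from() block per transition and a wildcard for the failure case. Below is a minimal sketch, assuming the same jobBuilderFactory and step methods as above; it only illustrates the usual pattern and is not a verified fix for the behaviour described:

@Bean
public Job job() {
    return jobBuilderFactory.get("job")
            // happy path: step1 -> step2 -> step3 -> end
            .start(step1()).on(ExitStatus.COMPLETED.getExitCode()).to(step2())
            .from(step1()).on("*").fail()
            .from(step2()).on(ExitStatus.COMPLETED.getExitCode()).to(step3())
            .from(step2()).on("*").fail()
            .from(step3()).on(ExitStatus.COMPLETED.getExitCode()).end()
            // any non-COMPLETED exit status fails the job
            .from(step3()).on("*").fail()
            .end()
            .build();
}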

Related

Why is KafkaItemReader always including the last offset record of the previous job run in the new job execution?

I am using the Spring Batch KafkaItemReader in a job which is executed on a fixed delay of 10 seconds. Once the job with a chunk size of 1000 is completed, the Spring scheduler re-submits the same job again after a delay of 10 seconds. I am observing that the KafkaItemReader is always including the last offset record in the subsequent job executions. Suppose, in the first job execution, records are processed from offsets 1-1000; in the next job execution I expect the KafkaItemReader to pick up records from offset 1001. But in the next execution, the KafkaItemReader picks up from offset 1000 (which is already processed).
Adding code blocks
//Job is getting submitted with a scheduled task scheduler with the below parameters
<task:scheduled-tasks>
    <task:scheduled ref="runScheduler" method="run" fixed-delay="5000"/>
</task:scheduled-tasks>

//Job parameters for each submission
String dateParam = new Date().toString();
JobParameters param =
        new JobParametersBuilder().addString("date", dateParam).toJobParameters();
//Below is the KafkaItemReader configuration
@Bean
public KafkaItemReader<String, String> kafkaItemReader() {
    Properties props = new Properties();
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "");
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "");
    props.put(CommonClientConfigs.SECURITY_PROTOCOL_CONFIG, "SSL");
    props.put(SslConfigs.SSL_TRUSTSTORE_LOCATION_CONFIG, "");
    props.put(SslConfigs.SSL_TRUSTSTORE_PASSWORD_CONFIG, "");
    props.put(SslConfigs.SSL_KEYSTORE_LOCATION_CONFIG, "");
    props.put(SslConfigs.SSL_KEYSTORE_PASSWORD_CONFIG, "");
    props.put(SslConfigs.SSL_KEY_PASSWORD_CONFIG, "");
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, JsonDeserializer.class);

    Map<TopicPartition, Long> partitionOffset = new HashMap<>();

    return new KafkaItemReaderBuilder<String, String>()
            .partitions(0)
            .consumerProperties(props)
            .name("customers-reader")
            .saveState(true)
            .pollTimeout(Duration.ofSeconds(10))
            .topic("")
            .partitionOffsets(partitionOffset)
            .build();
}
@Bean
public Step kafkaStep(StepBuilderFactory stepBuilderFactory, ItemWriter testItemWriter, KafkaItemReader kafkaItemReader) throws Exception {
    return stepBuilderFactory.get("kafkaStep")
            .chunk(10)
            .reader(kafkaItemReader)
            .writer(testItemWriter)
            .build();
}

@Bean
public Job kafkaJob(Step kafkaStep, JobBuilderFactory jobBuilderFactory) throws Exception {
    return jobBuilderFactory.get("kafkaJob")
            .incrementer(new RunIdIncrementer())
            .start(kafkaStep)
            .build();
}
Am I missing some config which is causing this behaviour? I don't see this behaviour if I stop and re-run the application; it picks up the offset properly in that case.
You are running a new job instance on each schedule (by using a different date as an identifying job parameter), but your reader is a singleton bean. This means it will be reused for each run without being reinitialized with the correct offset. You can make it step-scoped to get a new instance of the reader for each run:
@Bean
@StepScope
public KafkaItemReader<String, String> kafkaItemReader() {
   ...
}
This will give you the same behaviour as if you restart the application, which you said fixes the issue.
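For completeness, a step-scoped version of the reader from the question could look like the following sketch (the bootstrap, group, and SSL values are the same placeholders as above); with a fresh reader per run and an empty partition-offsets map, each execution should pick up from the offsets committed for the consumer group:

@Bean
@StepScope
public KafkaItemReader<String, String> kafkaItemReader() {
    Properties props = new Properties();
    // same bootstrap, group, SSL and deserializer properties as in the question
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "");
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "");
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, JsonDeserializer.class);

    return new KafkaItemReaderBuilder<String, String>()
            .partitions(0)
            .consumerProperties(props)
            .name("customers-reader")
            .saveState(true)
            .pollTimeout(Duration.ofSeconds(10))
            .topic("")
            .partitionOffsets(new HashMap<>()) // empty map, as in the question
            .build();
}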

Launch JobLaunchRequest for each new file in AWS S3 with Spring Batch Integration

I'm following the docs: Spring Batch Integration, combined with Spring Integration AWS, for polling AWS S3.
But the batch execution per file is not working in some situations.
The AWS S3 polling is working correctly: when I put a new file, or when I start the application and there are already files in the bucket, the application syncs them to the local directory:
@Bean
public S3SessionFactory s3SessionFactory(AmazonS3 pAmazonS3) {
    return new S3SessionFactory(pAmazonS3);
}

@Bean
public S3InboundFileSynchronizer s3InboundFileSynchronizer(S3SessionFactory pS3SessionFactory) {
    S3InboundFileSynchronizer synchronizer = new S3InboundFileSynchronizer(pS3SessionFactory);
    synchronizer.setPreserveTimestamp(true);
    synchronizer.setDeleteRemoteFiles(false);
    synchronizer.setRemoteDirectory("remote-bucket");
    //synchronizer.setFilter(new S3PersistentAcceptOnceFileListFilter(new SimpleMetadataStore(), "simpleMetadataStore"));
    return synchronizer;
}

@Bean
@InboundChannelAdapter(value = IN_CHANNEL_NAME, poller = @Poller(fixedDelay = "30"))
public S3InboundFileSynchronizingMessageSource s3InboundFileSynchronizingMessageSource(
        S3InboundFileSynchronizer pS3InboundFileSynchronizer) {
    S3InboundFileSynchronizingMessageSource messageSource = new S3InboundFileSynchronizingMessageSource(pS3InboundFileSynchronizer);
    messageSource.setAutoCreateLocalDirectory(true);
    messageSource.setLocalDirectory(new FileSystemResource("files").getFile());
    //messageSource.setLocalFilter(new FileSystemPersistentAcceptOnceFileListFilter(new SimpleMetadataStore(), "fsSimpleMetadataStore"));
    return messageSource;
}

@Bean("s3filesChannel")
public PollableChannel s3FilesChannel() {
    return new QueueChannel();
}
I followed the tutorial and created the FileMessageToJobRequest; I won't put the code here because it's the same as in the docs.
Then I created the IntegrationFlow and FileMessageToJobRequest beans:
@Bean
public IntegrationFlow integrationFlow(
        S3InboundFileSynchronizingMessageSource pS3InboundFileSynchronizingMessageSource) {
    return IntegrationFlows.from(pS3InboundFileSynchronizingMessageSource,
                    c -> c.poller(Pollers.fixedRate(1000).maxMessagesPerPoll(1)))
            .transform(fileMessageToJobRequest())
            .handle(jobLaunchingGateway())
            .log(LoggingHandler.Level.WARN, "headers.id + ': ' + payload")
            .get();
}

@Bean
public FileMessageToJobRequest fileMessageToJobRequest() {
    FileMessageToJobRequest fileMessageToJobRequest = new FileMessageToJobRequest();
    fileMessageToJobRequest.setFileParameterName("input.file.name");
    fileMessageToJobRequest.setJob(delimitedFileJob);
    return fileMessageToJobRequest;
}
So I think the problem is in the JobLaunchingGateway:
If I create it like this:
@Bean
public JobLaunchingGateway jobLaunchingGateway() {
    SimpleJobLauncher simpleJobLauncher = new SimpleJobLauncher();
    simpleJobLauncher.setJobRepository(jobRepository);
    simpleJobLauncher.setTaskExecutor(new SyncTaskExecutor());
    JobLaunchingGateway jobLaunchingGateway = new JobLaunchingGateway(simpleJobLauncher);
    return jobLaunchingGateway;
}
Case 1 (Bucket is empty when the application starts):
I upload a new file to AWS S3;
The polling works and the file appears in the local directory;
But the transform/job isn't fired;
Case 2 (Bucket already has one file when application starts):
The job is launched:
2021-01-12 13:32:34.451 INFO 1955 --- [ask-scheduler-1] o.s.b.c.l.support.SimpleJobLauncher : Job: [SimpleJob: [name=arquivoDelimitadoJob]] launched with the following parameters: [{input.file.name=files/FILE1.csv}]
2021-01-12 13:32:34.524 INFO 1955 --- [ask-scheduler-1] o.s.batch.core.job.SimpleStepHandler : Executing step: [delimitedFileJob]
If I add a second file to S3, the job isn't launched, just like in case 1.
Case 3 (Bucket has more than one file):
The files are synchronized correctly to the local directory,
but the job is only executed once, for the last file.
So, following the docs, I changed my gateway to:
@Bean
@ServiceActivator(inputChannel = IN_CHANNEL_NAME, poller = @Poller(fixedRate = "1000"))
public JobLaunchingGateway jobLaunchingGateway() {
    SimpleJobLauncher simpleJobLauncher = new SimpleJobLauncher();
    simpleJobLauncher.setJobRepository(jobRepository);
    simpleJobLauncher.setTaskExecutor(new SyncTaskExecutor());
    //JobLaunchingGateway jobLaunchingGateway = new JobLaunchingGateway(jobLauncher());
    JobLaunchingGateway jobLaunchingGateway = new JobLaunchingGateway(simpleJobLauncher);
    //jobLaunchingGateway.setOutputChannel(replyChannel());
    jobLaunchingGateway.setOutputChannel(s3FilesChannel());
    return jobLaunchingGateway;
}
With this new gateway implementation, if I put a new file in S3 the application reacts, but it doesn't transform the message, giving the error:
Caused by: java.lang.IllegalArgumentException: The payload must be of type JobLaunchRequest. Object of class [java.io.File] must be an instance of class org.springframework.batch.integration.launch.JobLaunchRequest
And if there are two files in the bucket when the app starts, FILE1.csv and FILE2.csv, the job runs correctly for FILE1.csv but gives the error above for FILE2.csv.
What's the correct way to implement something like this?
Just to be clear: I want to receive thousands of CSV files in this bucket, read and process them with Spring Batch, but I also need to pick up every new file from S3 as soon as possible.
Thanks in advance.
The JobLaunchingGateway indeed expects only a JobLaunchRequest as its payload.
Since you have that @InboundChannelAdapter(value = IN_CHANNEL_NAME, poller = @Poller(fixedDelay = "30")) on the S3InboundFileSynchronizingMessageSource bean definition, it is really wrong to then have @ServiceActivator(inputChannel = IN_CHANNEL_NAME) for that JobLaunchingGateway without a FileMessageToJobRequest transformer in between.
Your integrationFlow looks OK to me, but then you really need to remove that @InboundChannelAdapter from the S3InboundFileSynchronizingMessageSource bean and fully rely on the c.poller() configuration.
Another way is to leave that @InboundChannelAdapter, but then start the IntegrationFlow from the IN_CHANNEL_NAME, not from a MessageSource.
Since you have several pollers against the same S3 source, and both of them are based on the same local directory, it is not a surprise to see so many unexpected situations.
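A minimal sketch of that second option, assuming the same IN_CHANNEL_NAME constant and the fileMessageToJobRequest() bean from the question (in this variant the @ServiceActivator annotation and the setOutputChannel(s3FilesChannel()) call would be dropped from the gateway bean):

@Bean
public IntegrationFlow integrationFlow(JobLaunchingGateway jobLaunchingGateway) {
    // Start from the channel fed by the @InboundChannelAdapter, so its poller drives
    // the flow; every synchronized file is turned into a JobLaunchRequest before it
    // reaches the gateway.
    return IntegrationFlows.from(IN_CHANNEL_NAME)
            .transform(fileMessageToJobRequest())
            .handle(jobLaunchingGateway)
            .log(LoggingHandler.Level.WARN, "headers.id + ': ' + payload")
            .get();
}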

How to fix a Spring Batch partition step which is skipping the slave step?

I have a Spring Batch job which processes data in the database where the processed column is set to flag N. There is a slave step and a master step. The master step is the partitioner, so there will be 10 partitioned slave steps running concurrently. Now the problem is that the partition step starts, but it skips the slave steps and reports success right away.
I already have another, similar partition step working correctly. All the setup is the same, just a different step name, a different repository in the item reader, and different logic in the item processor, etc.
I will provide pseudo code:
// item reader
itemreader(@Value("#{stepExecutionContext['to']}") long to,
           @Value("#{stepExecutionContext['from']}") long from,
           @Value("#{stepExecutionContext['id']}") long id) {
    logger("partition id: {} process from: {} to: {}", id, from, to);
    // logic: read the chunk from 'from' to 'to'
}
//item processor and writer, not much to say, just business logic.
// partitioner
public Map<String, ExecutionContext> partition(int gridSize) {
    Map<String, ExecutionContext> map = new HashMap<>();
    int from = 1;
    int range = 10;
    for (int i = 0; i < gridSize; i++) {
        ExecutionContext context = new ExecutionContext();
        context.put("from", from);
        context.put("to", from + range);
        context.put("id", i);
        from += range;
        map.put(partitionkey + i, context);
    }
    return map;
}
// partition step
Step partitionStep() {
    return this.stepBuilderFactory.get("step1.master")
            .partitioner("step1", partitioner)
            .step(step1())
            .gridSize(10)
            .taskExecutor(taskExecutor)
            .build();
}
// step1
Step step1() {
    return this.stepBuilderFactory.get("step1")
            .<Pojo, Pojo>chunk(1)
            .reader(itemreader(null, null, null))
            .processor(itemprocessor())
            .writer(itemwriter())
            .build();
}
// job
Job partitionJob() {
    return this.jobBuilderFactory.get("partitionJob")
            .start(partitionStep())
            .build();
}
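As a side note on the reader pseudo code above: binding #{stepExecutionContext[...]} values only works on a step-scoped bean, so a real reader definition usually looks roughly like this sketch (the body and the Pojo type are placeholders, not the poster's actual reader):

// Sketch only: @StepScope makes Spring create one reader per partition (per step
// execution), which is what allows the stepExecutionContext values to be injected.
@Bean
@StepScope
public ItemReader<Pojo> itemreader(
        @Value("#{stepExecutionContext['from']}") Long from,
        @Value("#{stepExecutionContext['to']}") Long to,
        @Value("#{stepExecutionContext['id']}") Long id) {
    // replace with the real reader that queries the [from, to] range
    return new ListItemReader<Pojo>(Collections.emptyList());
}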
I expected the logger in the item reader to print out the information and start processing, because this is how it works with the other partition step I have.
In my DB, the batch_step_execution table shows the 1 master step (partition step) and 10 slave steps (partitioned steps), which is what I expected. But for the slave steps the read count is 0, which should not be the case, because the batch_step_execution_context table shows the partitioner info correctly, like
"id":0,"from":1,"to":10
The itemreader should read from 1 to 10 and pass the items to the itemprocessor, and then the itemwriter should save them.
I wonder what happened: all the info is saved in the Spring Batch meta tables, so why are the slave steps still skipped? The map from the partitioner isn't empty at all.
Need help.

Quartz: only the first job in a row is executed

We have implemented recurring tasks using Quartz scheduler in our app. The user may schedule a recurring task starting at any time, even starting in the past. So for example, I can schedule a task to run monthly, starting on the 1st July, even though today is 17th July.
I would expect Quartz to run the first job immediately if it is in the past, and any subsequent jobs I throw at it. However, today I encountered a case where the task didn't get triggered instantly. The task was scheduled for 15th July; today is 17th July. Nothing happened. After I restarted the server and the code that schedules all the tasks in the DB ran, it did get triggered. Why would that happen?
The code for scheduling the task is below. Note that to make it recurring, we just reschedule it with the same code for another date we calculate (but that part of the code doesn't matter for the issue at hand).
Edit: Only the first job gets triggered; any subsequent jobs are not. If I try to use startNow() instead of startAt(Date), it still doesn't work; it makes no difference.
JobDetail job = JobBuilder.newJob(ScheduledAppJob.class)
        .withIdentity(stringId)
        .build();

Trigger trigger = TriggerBuilder.newTrigger()
        .withIdentity(stringId)
        .startAt(date)
        .build();

try
{
    scheduler.scheduleJob(job, trigger);
    dateFormat = new SimpleDateFormat("dd MMM yyyy, HH:mm:ss");
    String recurringTime = dateFormat.format(date);
    logger.info("Scheduling recurring job for " + recurringTime);
}
catch (SchedulerException se)
{
    se.printStackTrace();
}
The quartz.properties file is located under src/main (I also tried WEB-INF and WEB-INF/classes as suggested in the tutorial, but it made no difference); I even tried a threadCount of 20, still no difference:
org.quartz.scheduler.instanceName = AppScheduler
org.quartz.threadPool.threadCount = 3
org.quartz.jobStore.class = org.quartz.simpl.RAMJobStore
It seems to be working now; I haven't run into any more problems. It could have been a config issue, as I moved the config file to /src/main/resources.
Also try turning logging on to help with debugging:
log4j.logger.com.gargoylesoftware.htmlunit=DEBUG
We also added a JobTriggerListener to help with the logs:
private static class JobTriggerListener implements TriggerListener
{
    private String name;

    public JobTriggerListener(String name)
    {
        this.name = name;
    }

    public String getName()
    {
        return name;
    }

    public void triggerComplete(Trigger trigger, JobExecutionContext context,
            Trigger.CompletedExecutionInstruction triggerInstructionCode)
    {
    }

    public void triggerFired(Trigger trigger, JobExecutionContext context)
    {
    }

    public void triggerMisfired(Trigger trigger)
    {
        logger.warn("Trigger misfired for trigger: " + trigger.getKey());
        try
        {
            logger.info("Available threads: " + scheduler.getCurrentlyExecutingJobs());
        }
        catch (SchedulerException ex)
        {
            logger.error("Could not get currently executing jobs.", ex);
        }
    }

    public boolean vetoJobExecution(Trigger trigger, JobExecutionContext context)
    {
        return false;
    }
}
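Registering the listener with the scheduler is not shown in the post; with the standard Quartz API it is typically done like this (the listener name here is just a placeholder):

// Register the listener globally so it receives events for all triggers.
scheduler.getListenerManager().addTriggerListener(new JobTriggerListener("jobTriggerListener"));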
Check the updated config file:
#============================================================================
# Configure Main Scheduler Properties
#============================================================================
org.quartz.scheduler.skipUpdateCheck = true
org.quartz.scheduler.instanceName = MyAppScheduler
org.quartz.scheduler.instanceId = AUTO
#============================================================================
# Configure ThreadPool
#============================================================================
org.quartz.threadPool.class = org.quartz.simpl.SimpleThreadPool
org.quartz.threadPool.threadCount = 25
org.quartz.threadPool.threadPriority = 9
#============================================================================
# Configure JobStore
#============================================================================
org.quartz.jobStore.misfireThreshold = 60000
org.quartz.jobStore.class = org.quartz.simpl.RAMJobStore
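As a side note on the misfireThreshold above: a trigger whose start time is already in the past is subject to misfire handling, and the policy can also be set on the trigger itself. Here is a sketch of the question's trigger with an explicit instruction (standard Quartz API, not a confirmed fix for this case):

// If startAt(date) is already in the past, or a fire time is missed because no
// worker thread is free, this tells Quartz to fire the trigger immediately
// instead of skipping it.
Trigger trigger = TriggerBuilder.newTrigger()
        .withIdentity(stringId)
        .startAt(date)
        .withSchedule(SimpleScheduleBuilder.simpleSchedule()
                .withMisfireHandlingInstructionFireNow())
        .build();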

How to create a Quartz job that runs every N seconds even if the job takes more time

What I'm trying to achieve:
I have a trigger that fires every 5 seconds, and a stateful job that sometimes takes more than 5 seconds (7, for example). Here is what I have now:
start: 00:00:00
end:   00:00:07
start: 00:00:07 <- right after the previous run has finished

What I want:
start: 00:00:00
(it should run at 00:00:05, but it doesn't)
end:   00:00:07
start: 00:00:10 (5 seconds after the previous one, successful or not)
I have tried Quartz.NET versions 2 and 1.
Job:
[PersistJobDataAfterExecution]
[DisallowConcurrentExecution]
public class StatefulJob : IJob // or IStatefulJob in the 1.x version
{
    public void Execute(IJobExecutionContext context)
    {
        Console.WriteLine("StateFull START " + DateTime.Now.ToString());
        Thread.Sleep(7000);
        Console.WriteLine("StateFull END " + DateTime.Now.ToString());
    }
}
Trigger:
var trigger1 = TriggerBuilder
    .Create()
    .WithSimpleSchedule(x =>
        x.WithIntervalInSeconds(timeout)
         .RepeatForever())
    .Build();
EDIT
I have tried to use WithMisfireHandlingInstructionIgnoreMisfires(), but misfires happen when the scheduler is shut down or when there are no available threads; in my case the job does not execute because I use a stateful job. Maybe I'm wrong, but the behavior stays the same.
EDIT2
OK, the solution with the 'running' flag works perfectly in a single-threaded app, but if I run this job in a few threads (with different params) it does not work.
So, is it possible to achieve the behavior I want using Quartz?
What if you let your job run concurrently, but amend it to do nothing if the job is already running, e.g. something like:
public class StatefulJob : IJob
{
    private static bool Running;

    public void Execute(IJobExecutionContext context)
    {
        if (Running)
            return;

        Running = true;
        try
        {
            Console.WriteLine(" StateFull START " + DateTime.Now.ToString());
            Thread.Sleep(7000);
            Console.WriteLine(" StateFull END " + DateTime.Now.ToString());
        }
        finally
        {
            Running = false;
        }
    }
}