Launch JobLaunchRequest for each new file in AWS S3 with Spring Batch Integration

I'm following the docs: Spring Batch Integration combined with Spring Integration AWS for polling AWS S3.
But the batch execution for each file is not working in some situations.
The AWS S3 polling is working correctly: when I put a new file in the bucket, or when I start the application and there are already files in the bucket, the application syncs them to the local directory:
@Bean
public S3SessionFactory s3SessionFactory(AmazonS3 pAmazonS3) {
    return new S3SessionFactory(pAmazonS3);
}

@Bean
public S3InboundFileSynchronizer s3InboundFileSynchronizer(S3SessionFactory pS3SessionFactory) {
    S3InboundFileSynchronizer synchronizer = new S3InboundFileSynchronizer(pS3SessionFactory);
    synchronizer.setPreserveTimestamp(true);
    synchronizer.setDeleteRemoteFiles(false);
    synchronizer.setRemoteDirectory("remote-bucket");
    //synchronizer.setFilter(new S3PersistentAcceptOnceFileListFilter(new SimpleMetadataStore(), "simpleMetadataStore"));
    return synchronizer;
}

@Bean
@InboundChannelAdapter(value = IN_CHANNEL_NAME, poller = @Poller(fixedDelay = "30"))
public S3InboundFileSynchronizingMessageSource s3InboundFileSynchronizingMessageSource(
        S3InboundFileSynchronizer pS3InboundFileSynchronizer) {
    S3InboundFileSynchronizingMessageSource messageSource = new S3InboundFileSynchronizingMessageSource(pS3InboundFileSynchronizer);
    messageSource.setAutoCreateLocalDirectory(true);
    messageSource.setLocalDirectory(new FileSystemResource("files").getFile());
    //messageSource.setLocalFilter(new FileSystemPersistentAcceptOnceFileListFilter(new SimpleMetadataStore(), "fsSimpleMetadataStore"));
    return messageSource;
}

@Bean("s3filesChannel")
public PollableChannel s3FilesChannel() {
    return new QueueChannel();
}
I followed the tutorial and created the FileMessageToJobRequest; I won't put the code here because it's the same as in the docs.
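For reference, the transformer from the docs looks roughly like this (paraphrased from the Spring Batch Integration reference guide; imports are omitted like in the other snippets here):
public class FileMessageToJobRequest {

    private Job job;
    private String fileParameterName;

    public void setFileParameterName(String fileParameterName) {
        this.fileParameterName = fileParameterName;
    }

    public void setJob(Job job) {
        this.job = job;
    }

    @Transformer
    public JobLaunchRequest toRequest(Message<File> message) {
        // Build the identifying parameter from the synchronized local file and wrap it
        // together with the configured Job into the payload the gateway expects.
        JobParametersBuilder jobParametersBuilder = new JobParametersBuilder();
        jobParametersBuilder.addString(fileParameterName, message.getPayload().getAbsolutePath());
        return new JobLaunchRequest(job, jobParametersBuilder.toJobParameters());
    }
}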
Then I created the IntegrationFlow and FileMessageToJobRequest beans:
@Bean
public IntegrationFlow integrationFlow(
        S3InboundFileSynchronizingMessageSource pS3InboundFileSynchronizingMessageSource) {
    return IntegrationFlows.from(pS3InboundFileSynchronizingMessageSource,
                c -> c.poller(Pollers.fixedRate(1000).maxMessagesPerPoll(1)))
            .transform(fileMessageToJobRequest())
            .handle(jobLaunchingGateway())
            .log(LoggingHandler.Level.WARN, "headers.id + ': ' + payload")
            .get();
}

@Bean
public FileMessageToJobRequest fileMessageToJobRequest() {
    FileMessageToJobRequest fileMessageToJobRequest = new FileMessageToJobRequest();
    fileMessageToJobRequest.setFileParameterName("input.file.name");
    fileMessageToJobRequest.setJob(delimitedFileJob);
    return fileMessageToJobRequest;
}
I think the problem is in the JobLaunchingGateway.
If I create it like this:
@Bean
public JobLaunchingGateway jobLaunchingGateway() {
    SimpleJobLauncher simpleJobLauncher = new SimpleJobLauncher();
    simpleJobLauncher.setJobRepository(jobRepository);
    simpleJobLauncher.setTaskExecutor(new SyncTaskExecutor());
    JobLaunchingGateway jobLaunchingGateway = new JobLaunchingGateway(simpleJobLauncher);
    return jobLaunchingGateway;
}
Case 1 (bucket is empty when the application starts):
I upload a new file to AWS S3;
The polling works and the file appears in the local directory;
But the transform/job isn't fired.
Case 2 (bucket already has one file when the application starts):
The job is launched:
2021-01-12 13:32:34.451 INFO 1955 --- [ask-scheduler-1] o.s.b.c.l.support.SimpleJobLauncher : Job: [SimpleJob: [name=arquivoDelimitadoJob]] launched with the following parameters: [{input.file.name=files/FILE1.csv}]
2021-01-12 13:32:34.524 INFO 1955 --- [ask-scheduler-1] o.s.batch.core.job.SimpleStepHandler : Executing step: [delimitedFileJob]
If I add a second file to S3, the job isn't launched, just as in case 1.
Case 3 (bucket has more than one file):
The files are synchronized correctly to the local directory;
But the job is executed only once, for the last file.
So, following the docs, I changed my gateway to:
@Bean
@ServiceActivator(inputChannel = IN_CHANNEL_NAME, poller = @Poller(fixedRate = "1000"))
public JobLaunchingGateway jobLaunchingGateway() {
    SimpleJobLauncher simpleJobLauncher = new SimpleJobLauncher();
    simpleJobLauncher.setJobRepository(jobRepository);
    simpleJobLauncher.setTaskExecutor(new SyncTaskExecutor());
    //JobLaunchingGateway jobLaunchingGateway = new JobLaunchingGateway(jobLauncher());
    JobLaunchingGateway jobLaunchingGateway = new JobLaunchingGateway(simpleJobLauncher);
    //jobLaunchingGateway.setOutputChannel(replyChannel());
    jobLaunchingGateway.setOutputChannel(s3FilesChannel());
    return jobLaunchingGateway;
}
With this new gateway implementation, if I put a new file in S3 the application reacts but doesn't transform it, giving the error:
Caused by: java.lang.IllegalArgumentException: The payload must be of type JobLaunchRequest. Object of class [java.io.File] must be an instance of class org.springframework.batch.integration.launch.JobLaunchRequest
And if there are two files in the bucket when the app starts, FILE1.csv and FILE2.csv, the job runs correctly for FILE1.csv but gives the error above for FILE2.csv.
What's the correct way to implement something like this?
Just to be clear: I want to receive thousands of CSV files in this bucket, read and process them with Spring Batch, and I also need to pick up every new file from S3 as soon as possible.
Thanks in advance.

The JobLaunchingGateway indeed expects only a JobLaunchRequest as a payload.
Since you have that @InboundChannelAdapter(value = IN_CHANNEL_NAME, poller = @Poller(fixedDelay = "30")) on the S3InboundFileSynchronizingMessageSource bean definition, it is really wrong to then have @ServiceActivator(inputChannel = IN_CHANNEL_NAME) for that JobLaunchingGateway without a FileMessageToJobRequest transformer in between.
Your integrationFlow looks OK to me, but then you really need to remove that @InboundChannelAdapter from the S3InboundFileSynchronizingMessageSource bean and rely fully on the c.poller() configuration.
Another way is to keep that @InboundChannelAdapter, but then start the IntegrationFlow from IN_CHANNEL_NAME rather than from the MessageSource.
Since you have several pollers against the same S3 source, and both of them work on the same local directory, it is not a surprise to see so many unexpected situations.
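To make the first option concrete, here is a minimal sketch based on the bean and channel names from the question (the poller settings are only an example). The @InboundChannelAdapter and @ServiceActivator annotations are dropped, so the flow's own poller is the single trigger and the gateway only ever receives the JobLaunchRequest produced by the transformer:
@Bean
public IntegrationFlow s3ToJobFlow(
        S3InboundFileSynchronizingMessageSource pS3InboundFileSynchronizingMessageSource) {
    // The only poller: no @InboundChannelAdapter on the message source bean any more.
    return IntegrationFlows.from(pS3InboundFileSynchronizingMessageSource,
                c -> c.poller(Pollers.fixedRate(1000).maxMessagesPerPoll(1)))
            // File -> JobLaunchRequest, so the gateway gets the payload type it expects
            .transform(fileMessageToJobRequest())
            .handle(jobLaunchingGateway())
            .log(LoggingHandler.Level.WARN, "headers.id + ': ' + payload")
            .get();
}

@Bean
public JobLaunchingGateway jobLaunchingGateway() {
    SimpleJobLauncher jobLauncher = new SimpleJobLauncher();
    jobLauncher.setJobRepository(jobRepository);
    jobLauncher.setTaskExecutor(new SyncTaskExecutor());
    // No @ServiceActivator and no explicit output channel: the gateway is invoked only
    // from .handle(...) above, and its reply goes straight to the next .log(...) step.
    return new JobLaunchingGateway(jobLauncher);
}
The second option would instead keep the @InboundChannelAdapter and start the flow from the channel it feeds, e.g. IntegrationFlows.from(IN_CHANNEL_NAME), with the same transform/handle chain after it.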

Related

Spring Batch: InputStream is getting closed in the Spring Batch reader if the writer takes more than 5 minutes in processing

What needs to be achieved: read a CSV from an SFTP location, write it again to a different path, and also save it to a DB with Spring Batch in a Spring Boot app.
Issue: 1. The reader is executed only once, while the writer runs per chunk; a print statement in the reader is printed only one time, while one in the writer is printed for each chunk execution. This seems to be the default behaviour of FlatFileItemReader.
2. I am using an SFTP channel to read the file from the SFTP location, and it gets closed after the read if the writer's processing time is long.
So is there a way I can always pass a new SFTP connection for each chunk, or is there a way I can extend this reader's input stream timeout, as I don't see any timeout option? In the SFTP configuration, I already tried increasing the timeout and idle time, but to no avail.
I have tried creating a new SFTP connection in the reader and passing it to the stream, but as the reader is only initialized once, this does not help.
Reader snippet:
private Step step(FileInputDTO input, Map<String, Float> ratelist) throws SftpException {
    return stepBuilderFactory.get("Step")
            .<DTO, DTO>chunk(chunkSize)
            .reader(buildReader(input))
            .writer(new Writer(input, fileUtil, ratelist, mapper, service))
            .taskExecutor(taskExecutor)
            .listener(stepListener)
            .build();
}

private FlatFileItemReader<? extends DTO> buildReader(FileInputDTO input) throws SftpException {
    //Create reader instance
    FlatFileItemReader<DTO> reader = new FlatFileItemReader<>();
    log.info("reading file :starts");
    //Set input file location
    reader.setResource(new InputStreamResource(input.getChannel().get(input.getPath())));
    //Set number of lines to skip. Use it if the file has header rows.
    reader.setLinesToSkip(1);
    //Other code
    return reader;
}
SFTP configuration:
public SFTPUtil(Environment env, String sftpPassword) throws JSchException {
    JSch jsch = new JSch();
    log.debug("Creating SFTP channelSftp");
    Session session = jsch.getSession(env.getProperty("sftp.remoteUserName"),
            env.getProperty("sftp.remoteHost"), Integer.parseInt(env.getProperty("sftp.remotePort")));
    session.setConfig(Constants.STRICT_HOST_KEY_CHECKING, Constants.NO);
    session.setPassword(sftpPassword);
    session.connect(Integer.parseInt(env.getProperty("sftp.sessionTimeout")));
    Channel sftpChannel = session.openChannel(Constants.SFTP);
    sftpChannel.connect(Integer.parseInt(env.getProperty("sftp.channelTimeout")));
    this.channel = (ChannelSftp) sftpChannel;
    log.debug("SFTP channelSftp connected");
}

public ChannelSftp get() throws CustomException {
    if (channel == null) throw new CustomException("Channel creation failed");
    return channel;
}

Spring Batch - getLastJobExecution() with sub-set of JobParameters?

I have a Spring Batch application which runs a DB reading/writing job on a cron schedule. I want to cover the scenario where the application may be stopped for 4 hours, and when it next runs it should use the start timestamp of the last job to determine the data it processes.
I have this code, which registers the distinct job parameters, setting a new 'jobID' parameter each time.
@Override
@Scheduled(cron = "0 0 */1 * * ?")
public JobExecution startJob(String cdmrJob, String jobType) {
    // lookup the job via the spring context or factory
    Job job = applicationContext.getBean(cdmrJob.getJobBeanName(), Job.class);
    String jobId = String.valueOf(System.currentTimeMillis());
    JobParameters param = new JobParametersBuilder()
            .addString("jobName", job.getName(), true)
            .addDate("jobDate", new Date(), false)
            .addString("jobUuid", UUID.randomUUID().toString(), false)
            .addString("jobID", jobId, true)
            .addString("jobType", jobType.name(), true)
            .toJobParameters();
    // get the last job's start time if present..
    getLastJobExecution(cdmrJob, jobType);
    log.info("startJob({})", cdmrJob, StructuredArguments.entries(param.getParameters()));
    JobExecution jobExecution = null;
    try {
        jobExecution = jobLauncher.run(job, param);
    ....
and then I have this code which uses the 'jobRepository' to search for the last instance of the matching job
@Override
public JobExecution getLastJobExecution(String cdmrJob, String jobType) {
    // match the job name and type
    Map<String, JobParameter> parameters = new HashMap<>();
    parameters.put("jobName", new JobParameter(cdmrJob.getBatchJobName(), true));
    parameters.put("jobType", new JobParameter(jobType.name(), true));
    JobParameters jobParameters = new JobParameters(parameters);
    JobExecution jobExecution = jobRepository.getLastJobExecution(cdmrJob.getBatchJobName(), jobParameters);
    log.info("/getLastJobExecution{}->{}", jobParameters, jobExecution);
    return jobExecution;
}
The problem is that 'jobExecution' is always null, since it seems 'jobRepository.getLastJobExecution' can't match the subset of job parameters.
When I set the 'jobID' parameter to be non-identifying, the 'getLastJobExecution()' method matches and returns, but Spring Batch complains that a completed job cannot be restarted.
Without resorting to writing a JDBC SQL query, is there an approach via the Batch API to run this kind of job parameter subset search?
I have seen these questions
Querying Spring Batch JobExecution with Batch Param values
Spring Batch Job taking previous execution parameters

Why is KafkaItemReader always including the last offset record of the previous job run in the new job execution?

I am using the Spring Batch KafkaItemReader in a job which is executed on a fixed delay of 10 seconds. Once the job with a chunk size of 1000 is completed, the Spring scheduler re-submits the same job again after a delay of 10 seconds. I am observing that the KafkaItemReader is always including the last offset record in the subsequent job executions. Suppose that in the first job execution records are processed from offsets 1-1000; in the next job execution I expect the KafkaItemReader to pick up records from offset 1001. But in the next execution the KafkaItemReader picks up from offset 1000 (which is already processed).
Adding code blocks
//Job is getting submitted with scheduled task scheduler with below parameters
<task:scheduled-tasks>
<task:scheduled ref="runScheduler" method="run" fixed-delay="5000"/>
</task:scheduled-tasks>
//Job Parameters for each submission
String dateParam = new Date().toString();
JobParameters param =
        new JobParametersBuilder().addString("date", dateParam).toJobParameters();
//Below is the kafkaItemReader configuration
@Bean
public KafkaItemReader<String, String> kafkaItemReader() {
    Properties props = new Properties();
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "");
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "");
    props.put(CommonClientConfigs.SECURITY_PROTOCOL_CONFIG, "SSL");
    props.put(SslConfigs.SSL_TRUSTSTORE_LOCATION_CONFIG, "");
    props.put(SslConfigs.SSL_TRUSTSTORE_PASSWORD_CONFIG, "");
    props.put(SslConfigs.SSL_KEYSTORE_LOCATION_CONFIG, "");
    props.put(SslConfigs.SSL_KEYSTORE_PASSWORD_CONFIG, "");
    props.put(SslConfigs.SSL_KEY_PASSWORD_CONFIG, "");
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, JsonDeserializer.class);
    Map<TopicPartition, Long> partitionOffset = new HashMap<>();
    return new KafkaItemReaderBuilder<String, String>()
            .partitions(0)
            .consumerProperties(props)
            .name("customers-reader")
            .saveState(true)
            .pollTimeout(Duration.ofSeconds(10))
            .topic("")
            .partitionOffsets(partitionOffset)
            .build();
}

@Bean
public Step kafkaStep(StepBuilderFactory stepBuilderFactory, ItemWriter testItemWriter, KafkaItemReader kafkaItemReader) throws Exception {
    return stepBuilderFactory.get("kafkaStep")
            .chunk(10)
            .reader(kafkaItemReader)
            .writer(testItemWriter)
            .build();
}

@Bean
public Job kafkaJob(Step kafkaStep, JobBuilderFactory jobBuilderFactory) throws Exception {
    return jobBuilderFactory.get("kafkaJob").incrementer(new RunIdIncrementer())
            .start(kafkaStep)
            .build();
}
#Bean
public Step kafkaStep(StepBuilderFactory stepBuilderFactory,ItemWriter testItemWriter,KafkaItemReader kafkaItemReader) throws Exception {
return stepBuilderFactory.get("kafkaStep")
.chunk(10)
.reader(kafkaItemReader)
.writer(testItemWriter)
.build();
}
#Bean
public Job kafkaJob(Step kafkaStep,JobBuilderFactory jobBuilderFactory) throws Exception {
return jobBuilderFactory.get("kafkaJob").incrementer(new RunIdIncrementer())
.start(kafkaStep)
.build();
}
Am I missing some config which is causing this behaviour? I don't see this behaviour if I stop and re-run the application; it picks up the offset properly in that case.
You are running a new job instance on each schedule (by using a different date as an identifying job parameter), but your reader is a singleton bean. This means it will be reused for each run without being reinitialized with the correct offset. You can make it step-scoped to have a new instance of the reader for each run:
@Bean
@StepScope
public KafkaItemReader<String, String> kafkaItemReader() {
    ...
}
This will give you the same behaviour as if you restart the application, which you said fixes the issue.
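For completeness, a minimal sketch of how that looks wired into the step (names follow the question's code; the broker address, group id and topic are placeholders, and the SSL properties are left out for brevity). The step bean itself does not need to change, because Spring injects a step-scoped proxy for the reader:
@Bean
@StepScope
public KafkaItemReader<String, String> kafkaItemReader() {
    // Minimal consumer config for the sketch; the question's SSL settings would go here too.
    Properties props = new Properties();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "customers-group");         // placeholder group id
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
    // A fresh reader instance is created for every step execution, so offsets are
    // re-initialized on each scheduled run instead of being reused from a singleton.
    return new KafkaItemReaderBuilder<String, String>()
            .partitions(0)
            .consumerProperties(props)
            .name("customers-reader")
            .saveState(true)
            .pollTimeout(Duration.ofSeconds(10))
            .topic("customers") // placeholder topic
            .build();
}

@Bean
public Step kafkaStep(StepBuilderFactory stepBuilderFactory,
                      ItemWriter<String> testItemWriter,
                      KafkaItemReader<String, String> kafkaItemReader) {
    // kafkaItemReader is a step-scoped proxy here; the real reader is created when the step runs.
    return stepBuilderFactory.get("kafkaStep")
            .<String, String>chunk(10)
            .reader(kafkaItemReader)
            .writer(testItemWriter)
            .build();
}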

How to create Job in Kubernetes using Java API

I am able to create a job in the Kubernetes cluster using the CLI (https://kubernetesbyexample.com/jobs/).
Is there a way to create a job inside the cluster using a Java API?
You can use the Kubernetes Java client to create any object, such as a Job, referring to the example here:
/*
 * Creates a simple run to complete job that computes π to 2000 places and prints it out.
 */
public class JobExample {

    private static final Logger logger = LoggerFactory.getLogger(JobExample.class);

    public static void main(String[] args) {
        final ConfigBuilder configBuilder = new ConfigBuilder();
        if (args.length > 0) {
            configBuilder.withMasterUrl(args[0]);
        }
        try (KubernetesClient client = new DefaultKubernetesClient(configBuilder.build())) {
            final String namespace = "default";
            final Job job = new JobBuilder()
                    .withApiVersion("batch/v1")
                    .withNewMetadata()
                        .withName("pi")
                        .withLabels(Collections.singletonMap("label1", "maximum-length-of-63-characters"))
                        .withAnnotations(Collections.singletonMap("annotation1", "some-very-long-annotation"))
                    .endMetadata()
                    .withNewSpec()
                        .withNewTemplate()
                            .withNewSpec()
                                .addNewContainer()
                                    .withName("pi")
                                    .withImage("perl")
                                    .withArgs("perl", "-Mbignum=bpi", "-wle", "print bpi(2000)")
                                .endContainer()
                                .withRestartPolicy("Never")
                            .endSpec()
                        .endTemplate()
                    .endSpec()
                    .build();

            logger.info("Creating job pi.");
            client.batch().jobs().inNamespace(namespace).createOrReplace(job);

            // Get all pods created by the job
            PodList podList = client.pods().inNamespace(namespace).withLabel("job-name", job.getMetadata().getName()).list();
            // Wait for the pod to complete
            client.pods().inNamespace(namespace).withName(podList.getItems().get(0).getMetadata().getName())
                    .waitUntilCondition(pod -> pod.getStatus().getPhase().equals("Succeeded"), 1, TimeUnit.MINUTES);
            // Print the job's log
            String joblog = client.batch().jobs().inNamespace(namespace).withName("pi").getLog();
            logger.info(joblog);
        } catch (KubernetesClientException e) {
            logger.error("Unable to create job", e);
        } catch (InterruptedException interruptedException) {
            logger.warn("Thread interrupted!");
            Thread.currentThread().interrupt();
        }
    }
}
If you want to launch a job using a static manifest yaml from inside the cluster, it should be easy using the official library.
This code worked for me.
ApiClient client = ClientBuilder.cluster().build(); //create in-cluster client
Configuration.setDefaultApiClient(client);
BatchV1Api api = new BatchV1Api(client);
V1Job job = new V1Job();
job = (V1Job) Yaml.load(new File("/tmp/template.yaml")); //load static yaml file
ApiResponse<V1Job> response = api.createNamespacedJobWithHttpInfo("default", job, "true", null, null);
You can also modify any of the job's information before launching it, using a combination of getters and setters.
// set metadata-name
job.getMetadata().setName("newName");
// set spec-template-metadata-name
job.getSpec().getTemplate().getMetadata().setName("newName");

Quartz Cluster recovery mechanism

I run a simple controller with Spring to test Quartz capabilities.
@PostMapping(path = ["/api/v1/start/{jobKey}/{jobGroup}"])
fun start(@PathVariable jobKey: String, @PathVariable jobGroup: String): ResponseEntity<String> {
    val simpleJob = JobBuilder
        .newJob(SampleJob::class.java)
        .requestRecovery(true)
        .withIdentity(JobKey.jobKey(jobKey, jobGroup))
        .build()
    val sampleTrigger = TriggerBuilder
        .newTrigger()
        .withIdentity(jobKey, jobGroup)
        .withSchedule(
            SimpleScheduleBuilder
                .repeatSecondlyForever(5)
                .withMisfireHandlingInstructionIgnoreMisfires())
        .build()
    val scheduler = factory.scheduler
    if (scheduler.jobGroupNames.contains(jobGroup)) {
        return ResponseEntity.ok("Scheduler exists.")
    }
    scheduler.scheduleJob(simpleJob, sampleTrigger)
    scheduler.start()
    return ResponseEntity.ok("Scheduler started.")
}

@PostMapping(path = ["/api/v1/stop/{jobKey}/{jobGroup}"])
fun stop(@PathVariable jobKey: String, @PathVariable jobGroup: String): String {
    val scheduler = factory.scheduler
    scheduler.interrupt(JobKey.jobKey(jobKey, jobGroup))
    val jobGroupNames = scheduler.jobGroupNames
    logger.info("Existing jobGroup names: {}", jobGroupNames)
    return scheduler.deleteJob(JobKey.jobKey(jobKey, jobGroup)).toString()
}
Then I start two applications on different ports with the same code and start playing with it. Let's call them APP1 and APP2.
I use PostgreSQL as JobStore.
So I run several scenarios.
1) Create the job with group1 and key1 in APP1
2) Try to create a job with group1 and key1 in APP2 - it gives the error that the job has already started. This behavior is as I expected.
3) Stop APP1. I expect that the job will be executed in APP2, as it still exists in the JobStore, but it isn't. Do I need to provide some additional configuration?
4) Start APP1 again; nothing happens either. Furthermore, the record for group1 and key1 is still present in the DB and can't be started.
Do I need to modify the shutdown behavior to remove the job on application shutdown and start jobs in the other application, or do I just need to configure the trigger in some other, correct way?
My bad, that was a silly problem. I forgot to start the scheduler in my application:
@Bean
open fun schedulerFactory(): SchedulerFactory {
    val factory = StdSchedulerFactory()
    factory.initialize(ClassPathResource("quartz.properties").inputStream)
    factory.scheduler.start() // this line was missed
    return factory
}