Multiple Spring Batch Partitioned Jobs Executing Concurrently - spring-batch

We have partitioned a large number of our jobs to improve the overall performance of our application. We are now investigating running several of these partitioned jobs in parallel (kicked off by an external scheduler). The jobs are all configured to use the same fixed reply queue. As a test, I created a batch job with a parallel flow in which partitioned jobs are executed in parallel. On a single server (local testing) it works fine. When I try it on a multi-server cluster, I see the remote steps complete, but the parent step never finishes. I see the messages in the reply queue, but they are never read.
Is this an expected problem with our approach, or can you suggest how we can try to resolve it?

Related

Spark Streaming scheduling best practices

We have a Spark Streaming job that runs every 30 minutes and takes 15 seconds to complete. What are the suggested best practices in this scenario? I am thinking of scheduling AWS Data Pipeline to run every 30 minutes, so that the EMR cluster terminates after 15 seconds and is recreated for the next run. Is this the recommended approach?
For a job that takes 15 seconds, running it on EMR is a waste of time and resources; you will likely wait a few minutes just for the EMR cluster to bootstrap.
AWS Data Pipeline or AWS Batch makes sense only if you have a long-running job.
First, make sure that you really need Spark, since from what you describe it could be overkill.
Lambda with CloudWatch Events scheduling might be all you need for such a quick job, with no infrastructure to manage.
For streaming-related jobs, the key in your case is to avoid I/O, since the job takes only 15 seconds. Push your messages to a queue (AWS SQS). Have a CloudWatch Event, which implements a cron-like schedule (every 30 minutes in your case), trigger an AWS Step Function that reads the messages from SQS and processes them, ideally in a Lambda.
Option 1 (serverless):
Streaming messages -> AWS SQS -> (every 30 minutes CloudWatch triggers a Step Function) -> which triggers a Lambda to process all messages in the queue
https://aws.amazon.com/getting-started/tutorials/scheduling-a-serverless-workflow-step-functions-cloudwatch-events/
Option 2:
Streaming messages -> AWS SQS -> Process the messages using a Python application or Java Spring application with a scheduled task that wakes up every 30 minutes, reads the messages from the queue, and processes them in memory.
I have used Option 2 for solving analytical problems, although my analytical problem took 10 minutes and was data intensive. Option 2 additionally requires you to monitor the virtual machine (or container) where the process is running. Option 1, on the other hand, is serverless. In the end it all comes down to the software stack you already have in place and the software needed to process the streaming data.
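As a rough illustration of Option 2, here is a minimal plain-Java sketch of a scheduled task that wakes up periodically, drains whatever is in the queue, and processes it in memory. It makes several assumptions: an in-memory `BlockingQueue` stands in for SQS (no AWS SDK calls are shown), the class and method names are hypothetical, and "processing" is just upper-casing each message.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: the in-memory queue is a stand-in for AWS SQS.
final class ScheduledQueueDrainSketch {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final List<String> processed = new ArrayList<>();

    // Producers (the streaming side) push messages here.
    void publish(String message) {
        queue.add(message);
    }

    // One scheduled run: take everything currently queued and process it in memory.
    // Returns the number of messages handled in this run.
    synchronized int drainAndProcess() {
        List<String> batch = new ArrayList<>();
        queue.drainTo(batch);
        for (String message : batch) {
            processed.add(message.toUpperCase()); // placeholder for real business logic
        }
        return batch.size();
    }

    // Snapshot of everything processed so far.
    synchronized List<String> processed() {
        return new ArrayList<>(processed);
    }

    // Schedules drainAndProcess to run at a fixed rate; the caller shuts the scheduler down.
    ScheduledExecutorService start(long period, TimeUnit unit) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> drainAndProcess(), period, period, unit);
        return scheduler;
    }
}
```

Calling `start(30, TimeUnit.MINUTES)` would give the 30-minute cadence from the question. Note that `scheduleAtFixedRate` suppresses later runs if a run throws, so real processing code would probably want to catch and log exceptions inside the task.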

Spring batch slave step executing infinitely causing partition job waiting for updating status

I have a Spring Batch application, built with Spring Boot, that processes 60-70 million records. The application uses Spring Batch partitioning. I need to read customer ids from a file, read some reference data from Redis and an Oracle DB, apply some business logic, and write to a PG DB.
The application works as expected and all our system testing is complete. But when we moved to PT testing, we saw a few slave steps hang at random places (not consistently at the same file or line number). The version in the Step_execution table keeps incrementing, but no data is processed. I have tried between 50 and 1000 partitions with 5-25 million records. Only with 1 million records and 36 partitions did all the slaves and the partition step reach completed status. What might cause some slave steps to hang? When I re-run the job the issue is not consistent: it is not always the same file (slave) that hangs, nor the same number of slaves.

Parallelism in Spark Job server

We are working on Qubole with Spark version 2.0.2.
We have a multi-step process in which all the intermediate steps write their output to HDFS and later this output is used in the reporting layer.
As per our use case, we want to avoid writing to HDFS and keep all the intermediate output as temporary tables in spark and directly write the final reporting layer output.
For this implementation, we wanted to use the Job server provided by Qubole, but when we trigger multiple queries on the Job server, it runs our jobs sequentially.
I have observed the same behavior on a Databricks cluster as well.
The cluster we are using has 30 r4.2xlarge nodes.
Does anyone have experience running multiple jobs using the Job server?
The community's help will be greatly appreciated!

Spring batch partition or using java multi threading?

I need to design multi-threading with Spring Batch. Which is the better choice: Spring Batch partitioning or plain Java multi-threading? We have many processes, and each process holds jobs and sub-jobs. These sub-jobs need to be executed in parallel. How can I implement a retry mechanism with partitioning?
Go for partitioning with the master-slave concept. I have tried this and it boosts performance considerably.
Restart scenario:
Your partitioner starts and your items are divided among the slaves.
Let's say you have 3 slaves and each slave holds 1 file to process.
To test a restart, manually delete some items from the file assigned to Slave2 so that it fails (either in the reader or the writer of your slave step).
Then restart the job. It should now start reading from the file that was assigned to Slave2.
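The restart scenario above can be sketched in plain Java. This is only a toy model of the idea that a restart re-runs the partitions that did not complete; it does not use any Spring Batch classes, and the class name, the `Status` enum, and the `worker` predicate are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical toy model: tracks a status per partition, like a very small job repository.
final class PartitionRestartSketch {
    enum Status { PENDING, COMPLETED, FAILED }

    private final Map<String, Status> statusByPartition = new LinkedHashMap<>();

    PartitionRestartSketch(List<String> partitions) {
        for (String partition : partitions) {
            statusByPartition.put(partition, Status.PENDING);
        }
    }

    // Runs every partition that is not yet COMPLETED and records the outcome.
    // The worker returns true on success and false on failure.
    // Returns the names of the partitions attempted in this run.
    List<String> run(Predicate<String> worker) {
        List<String> attempted = new ArrayList<>();
        for (Map.Entry<String, Status> entry : statusByPartition.entrySet()) {
            if (entry.getValue() == Status.COMPLETED) {
                continue; // already done in an earlier run, skipped on restart
            }
            attempted.add(entry.getKey());
            entry.setValue(worker.test(entry.getKey()) ? Status.COMPLETED : Status.FAILED);
        }
        return attempted;
    }
}
```

With three partitions where the second one fails on the first run, a second call to `run` attempts only that second partition, which mirrors the Slave2 example.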

Spring batch- Parallel processing

I am running a Spring Batch job on three machines. For example, if the database has 30 records, the batch job on each machine has to pick up a unique set of 10 records and process them.
I have read about partitioning and parallel processing and am a bit confused about which one is suitable.
Appreciate your help.
What you are describing is partitioning. Partitioning is when the input is broken up into partitions and each partition is processed in parallel. Spring Batch offers two ways to execute partitioning. One is local, using threads (via the TaskExecutorPartitionHandler). The other distributes the partitions via messages so they can be executed either locally or remotely, via the MessageChannelPartitionHandler found in Spring Batch Admin's spring-batch-integration project. You can learn more about remote partitioning in my talk on multi-JVM batch processing here: http://www.youtube.com/watch?v=CYTj5YT7CZU
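To make the idea concrete for the 30-record example, here is a minimal plain-Java sketch that splits an id range into contiguous partitions and processes each one on its own thread. It is not the Spring Batch API: the class and method names are hypothetical, a thread pool stands in for the three machines, and "processing" a partition just counts its records.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of range partitioning; each int[] is an inclusive {minId, maxId} pair.
final class RangePartitionSketch {

    // Splits [minId, maxId] into at most gridSize contiguous, non-overlapping ranges.
    static List<int[]> split(int minId, int maxId, int gridSize) {
        int total = maxId - minId + 1;
        int base = total / gridSize;
        int remainder = total % gridSize;
        List<int[]> ranges = new ArrayList<>();
        int start = minId;
        for (int i = 0; i < gridSize && start <= maxId; i++) {
            int size = base + (i < remainder ? 1 : 0); // spread any leftover records
            ranges.add(new int[] {start, start + size - 1});
            start += size;
        }
        return ranges;
    }

    // Processes every range on its own thread and returns the total record count.
    static int processInParallel(List<int[]> ranges) {
        if (ranges.isEmpty()) {
            return 0;
        }
        ExecutorService pool = Executors.newFixedThreadPool(ranges.size());
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int[] range : ranges) {
                // Placeholder work: a real worker would read and process ids range[0]..range[1].
                results.add(pool.submit(() -> range[1] - range[0] + 1));
            }
            int processed = 0;
            for (Future<Integer> result : results) {
                processed += result.get();
            }
            return processed;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("partition processing failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

For ids 1 to 30 and a grid size of 3, `split` yields the ranges 1-10, 11-20, and 21-30, so each worker gets a unique set of 10 records. In Spring Batch the same division of work would be expressed through a partitioner and one of the partition handlers named above.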