Spring Batch slave step executing infinitely, causing partition job to wait for status update - spring-batch

I have a Spring Batch application with Spring Boot that processes 60-70 million records. The application was built using Spring Batch partitioning. I need to read customer ids from a file, read some reference data from Redis and an Oracle DB, apply some business logic, and write the results to a Postgres DB.
The application works as expected and all of our system testing is complete. But when we moved to performance testing, we saw a few slave steps hang at random places (not consistently at the same file or line number). The version column of the step_execution table keeps incrementing, but no data is processed. I have tried between 50 and 1000 partitions with 5-25 million records. Only with 1 million records and 36 partitions was I able to get a COMPLETED status for all slaves and the partition step. What might be the reason some slave steps hang? When I re-run the job, the issue is not consistent: neither the same file (slave) hangs, nor the same number of slaves.

Related

Apache Flink - duplicate message processing during job deployments, with ActiveMQ as source

Given,
I have a Flink job that reads from an ActiveMQ source and writes to a MySQL database, keyed on an identifier. I have enabled checkpoints for this job every one second, pointed the checkpoints to a MinIO instance, and verified that the checkpoints are written under the job id. I deploy this job on OpenShift (Kubernetes underneath) and can scale it up/down as required.
Problem
When the job is deployed (rolling), or when the job goes down due to a bug/error, any messages that were unconsumed in ActiveMQ or unacknowledged in Flink (but already written to the database) are processed again when the job recovers (or a new job is deployed), resulting in duplicate records inserted into the database.
Question
Shouldn't the checkpoints help the job recover from where it left off?
Should I take a checkpoint before I (rolling) deploy a new job?
What happens if the job quits with an error or a cluster failure?
As the job id keeps changing on every deployment, how does recovery happen?
Edit: As I cannot expect idempotency from the database, can I write a database-specific (upsert) query that updates a record if it is present and inserts it if not, to avoid duplicates being saved (Exactly-Once)?
JDBC currently only supports at-least-once, meaning you get duplicate messages upon recovery. There is currently a draft to add support for exactly-once, which would probably be released with 1.11.
Shouldn't the checkpoints help the job recover from where it left off?
Yes, but the time between the last successful checkpoint and recovery could produce the observed duplicates. I gave a more detailed answer on a somewhat related topic.
Should I take a checkpoint before I (rolling) deploy a new job?
Absolutely. You should actually use cancel with savepoint, which is the only reliable way to change the topology. Additionally, cancel with savepoint avoids duplicates in the data because it shuts the job down gracefully.
What happens if the job quits with an error or a cluster failure?
It should restart automatically (depending on your restart settings), using the latest checkpoint for recovery. That would most certainly result in duplicates.
As the job id keeps changing on every deployment, how does recovery happen?
You usually point explicitly to the same checkpoint directory (on S3?).
As I cannot expect idempotency from the database, is upsert the only way to achieve Exactly-Once processing?
Currently, I do not see a way around it. It should change with 1.11.
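
On the upsert idea from the edit: a minimal sketch using plain JDBC against MySQL, where the records table, its unique key record_id, and the connection details are assumptions for illustration. Running the same statement again for a replayed message overwrites the existing row instead of inserting a duplicate:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class UpsertExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/mydb", "user", "password")) {
            // MySQL upsert: insert the row, or overwrite it when the unique
            // key (record_id) already exists, making the write idempotent.
            String sql = "INSERT INTO records (record_id, payload) VALUES (?, ?) "
                       + "ON DUPLICATE KEY UPDATE payload = VALUES(payload)";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, "record-42");
                ps.setString(2, "some payload");
                ps.executeUpdate();
            }
        }
    }
}
```

The same statement can be issued from a custom Flink sink; it gives you effectively-once results in the database even though the pipeline itself is at-least-once.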

Spring Batch partitioning or Java multi-threading?

We need to design multi-threading with Spring Batch. Which is the better choice: Spring Batch partitioning or plain Java multi-threading? We have many processes; each process holds jobs and sub-jobs, and these sub-jobs need to be executed in parallel. Also, how can I implement a retry mechanism with partitioning?
Go for partitioning with the master-slave concept. I have tried this and it boosts performance by a good amount.
Restart scenario:
Once your partitioner starts, your items are divided among the slaves.
Let's say you have 3 slaves and each slave holds 1 file to process.
Manually delete some items in the file assigned to Slave2 so that the step fails (either in the reader or the writer of your slave step).
Then restart the job. It should now start reading from the file that was assigned to Slave2.
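
A minimal sketch of that master-slave setup, using a MultiResourcePartitioner so each slave gets one file; the files/*.csv location, bean names, and the separately defined slaveStep are assumptions:

```java
import java.io.IOException;

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.partition.support.MultiResourcePartitioner;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.support.PathMatchingResourcePatternResolver;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

@Configuration
public class PartitionConfig {

    // Master step: creates one partition per input file and hands each
    // partition to a copy of the slave step, run on its own thread.
    @Bean
    public Step masterStep(StepBuilderFactory steps, Step slaveStep) throws IOException {
        MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
        partitioner.setResources(new PathMatchingResourcePatternResolver()
                .getResources("file:files/*.csv"));  // hypothetical input location
        return steps.get("masterStep")
                .partitioner("slaveStep", partitioner)
                .step(slaveStep)
                .gridSize(3)                         // 3 slaves, 1 file each
                .taskExecutor(new SimpleAsyncTaskExecutor())
                .build();
    }
}
```

Because each partition's StepExecution is persisted in the job repository, restarting the failed job re-runs only the failed partition, which is exactly the Slave2 restart behavior described above.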

Spring Batch remote partitioning design advice

We have to process 4-5 million records and write them to a file, so we are developing a POC where our master step reads data from a master table and creates partitions. Each partition holds information about which rows need to be processed by a slave. Each slave just takes those rows, applies business rules, and writes them to the file. This works fine with remote slaves.
Now my question: can we make those remote slaves multi-threaded? If yes, is there any impact on data consistency?
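
For illustration, a minimal sketch of what a multi-threaded slave step could look like, assuming a hypothetical Row item type and reader/processor/writer beans; the reader and writer must be thread-safe for this to be safe (a stateful file reader, for example, would need synchronization):

```java
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

@Configuration
public class WorkerStepConfig {

    @Bean
    public Step slaveStep(StepBuilderFactory steps,
                          ItemReader<Row> reader,
                          ItemProcessor<Row, Row> processor,
                          ItemWriter<Row> writer) {
        return steps.get("slaveStep")
                .<Row, Row>chunk(1000)
                .reader(reader)
                .processor(processor)
                .writer(writer)
                // Process chunks of this partition on up to 4 concurrent
                // threads. Reader and writer must be thread-safe, and items
                // will reach the writer in no particular order.
                .taskExecutor(new SimpleAsyncTaskExecutor())
                .throttleLimit(4)
                .build();
    }
}

// Hypothetical item type; substitute your own domain class.
class Row { }
```

On data consistency: each partition still owns a disjoint set of rows, so threads within one slave never compete with other slaves; the main risks are non-thread-safe readers/writers and the loss of ordering within a partition's output.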

Multiple Spring Batch Partitioned Jobs Executing Concurrently

We have partitioned a large number of our jobs to improve the overall performance of our application. We are now investigating running several of these partitioned jobs in parallel (kicked off by an external scheduler). The jobs are all configured to use the same fixed reply queue. As a test, I generated a batch job with a parallel flow in which partitioned jobs are executed in parallel. On a single server (local testing) it works fine. When I try it on a multi-server cluster, I see the remote steps complete, but the parent step never finishes. I see the messages in the reply queue, but they are never read.
Is this an expected problem with our approach, or can you suggest how we can try to resolve it?
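
One option worth checking (a sketch under the assumption that your Spring Batch version supports repository polling in MessageChannelPartitionHandler) is to skip reply-queue aggregation entirely and have each master poll the job repository for its workers' StepExecutions, so replies on the shared queue can no longer be consumed by the wrong master:

```java
import org.springframework.batch.core.explore.JobExplorer;
import org.springframework.batch.core.partition.PartitionHandler;
import org.springframework.batch.integration.partition.MessageChannelPartitionHandler;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.core.MessagingTemplate;

@Configuration
public class PartitionHandlerConfig {

    @Bean
    public PartitionHandler partitionHandler(MessagingTemplate messagingTemplate,
                                             JobExplorer jobExplorer) {
        MessageChannelPartitionHandler handler = new MessageChannelPartitionHandler();
        handler.setStepName("workerStep");                 // hypothetical worker step name
        handler.setGridSize(10);
        handler.setMessagingOperations(messagingTemplate); // sends partition requests
        // No reply channel set: with a JobExplorer configured, the handler
        // polls the job repository for the workers' StepExecution results
        // instead of aggregating reply messages from the shared queue.
        handler.setJobExplorer(jobExplorer);
        handler.setPollInterval(5000L);
        return handler;
    }
}
```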

Spring batch- Parallel processing

I am running a Spring Batch job on three machines. For example, if the database has 30 records, the batch job on each machine has to pick up a unique set of 10 records and process them.
I read about partitioning and parallel processing and am a bit confused: which one is suitable?
Appreciate your help.
What you are describing is partitioning. Partitioning is when the input is broken up into partitions and each partition is processed in parallel. Spring Batch offers two ways to execute partitioning: one is local using threads (via the TaskExecutorPartitionHandler); the other distributes the partitions via messages so they can be executed either locally or remotely via the MessageChannelPartitionHandler found in Spring Batch Admin's spring-batch-integration project. You can learn more about remote partitioning in my talk on multi-JVM batch processing here: http://www.youtube.com/watch?v=CYTj5YT7CZU
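
A minimal sketch of the local flavor for the 30-records example: a custom Partitioner that splits the id range 1-30 into three slices of 10, assuming each slave step then reads its minId/maxId from the step execution context (e.g. WHERE id BETWEEN :minId AND :maxId):

```java
import java.util.HashMap;
import java.util.Map;

import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

// Splits the id range 1..30 into gridSize contiguous partitions.
// Each slave reads only its own slice, so no two slaves process
// the same records.
public class IdRangePartitioner implements Partitioner {

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        int total = 30;                       // hypothetical record count
        int rangeSize = total / gridSize;
        for (int i = 0; i < gridSize; i++) {
            ExecutionContext context = new ExecutionContext();
            context.putInt("minId", i * rangeSize + 1);
            context.putInt("maxId", (i + 1) * rangeSize);
            partitions.put("partition" + i, context);
        }
        return partitions;
    }
}
```

Wiring this partitioner into a partition step with a TaskExecutorPartitionHandler runs the three slices on parallel threads in one JVM; to spread them across the three machines instead, swap in the MessageChannelPartitionHandler mentioned above.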