We have to process 4-5 million records and write them to file. So we are developing POC where our master step reads data from master table and creates partitions. Each partition has info about which rows needs to be processed by slave. Our slave just take those rows, applies biz rules and write them to file. This is working fine as remote slaves.
Now my question is can we make those remote slave multithreaded? Is yer then it there any impact on data consistency?
Related
I have spring-batch with spring boot application to process 60-70 millions of data. Application was built for using spring batch partitioning. I need to read customer ids from a file and then read some reference data from redis and oarcle DB and apply some business logic and write to PG DB .
Application working as expected and all our system testing completed. But when we went to PT testing we see few slave steps hang at random place(not consistent with file or line number). Step_execution table version keep increment but no data process. I have tried between 50-1000 partition with 5-25 million data . Only for 1 million a with 36 partition I was able to get completed status for all slaves and partition step. What might be the reason to hang some slave steps. If I re-run the job issue is not consistent like always not the same file(slave) hangs neither same number of slaves hang.
I am not very clear about the architecture even after going through tutorials. How do we scale streamset in a distributed environment? Let's say, our input data velocity increases from origin then how to ensure that SDC doesn't give performance issues? How many daemons will be running? Will it be Master worker architecture or peer to peer architecture?
If there are multiple daemons running on multiple machines (e.g. one sdc along with one NodeManager in YARN) then how it will show centralized view of data i.e. total record count etc.?
Also please do let me know architecture of Dataflow performance manager. Which all daemons are there in this product?
StreamSets Data Collector (SDC) scales by partitioning the input data. In some cases, this can be done automatically, for example Cluster Batch mode runs SDC as a MapReduce job on the Hadoop / MapR cluster to read Hadoop FS / MapR FS data, while Cluster Streaming mode leverages Kafka partitions and executes SDC as a Spark Streaming application to run as many pipeline instances as there are Kafka partitions.
In other cases, StreamSets can scale by multithreading - for example, the HTTP Server and JDBC Multitable Consumer origins run multiple pipeline instances in separate threads.
In all cases, Dataflow Performance Manager (DPM) can give you a centralized view of the data, including total record count.
Need to design multi threading with Spring batch. Spring batch partition or using java multi threading, Which one is a better choice? We have many processes, each process holds jobs and sub jobs. these sub jobs needs to be executed in parallel.How can I do the retry mechanism in partition??
Go for the partition with master-slave concept. I have tried this and it boots the performance in good amount.
Restart Scenario :
Once your partitioner starts and your items are divided to the slaves.
Lets say you have 3 slaves and each slave holds 1 file to process.
Manually delete some items in the file which is assigned to the Slave2 so that it should get failed(Either in reader or writer of your slave step).
Then restart the job. Now it should start reading from the file which was assigned to the Slave2.
We have partitioned a large number of our jobs to improve the overall performance of our application. We are now investigating running several of these partitioned jobs in parallel (kicked off by an external scheduler). The jobs are all configured to use the same fixes reply queue. As a test, I generated a batch job that has a parallel flow where partitioned jobs are executed in parallel. On a single server (local testing) it works fine. When I try on a multiple server cluster I see the remote steps complete but the parent step does not ever finish. I see the messages in the reply queue but they are never read.
Is this an expected problem with out approach or can you suggest how we can try to resolve the problem?
I am running the spring batch job in three machines. For example the database has 30 records, the batch job in each machine has to pick up unique 10 records and process it.
I read partitioning and Parallel processing and bit confused, which one is suitable?
Appreciate your help.
What you are describing is partitioning. Partitioning is when the input is broken up into partitions and each partition is processed in parallel. Spring Batch offers two different ways to execute partitioning, one is local using threads (via the TaskExecutorPartitionHandler). The other one is distributing the partitions via messages so they can be executed either locally or remotely via the MessageChannelPartitionHandler found in Spring Batch Admin's spring-batch-integration project. You can learn more about remote partitioning via my talk on multi-jvm batch processing here: http://www.youtube.com/watch?v=CYTj5YT7CZU