Spring Batch: Combining remote partitioning with remote chunking

I'm trying to see whether I can design a job that needs both partitioning and remote chunking. Imagine Table A holds rows, and one of its columns is the partition key; for every row in Table A, Table B contains many child records referencing that foreign/partition key. We would need to run a query that filters the partition keys from Table A and, for every partition key, process all the corresponding child records in Table B. Since Table B holds several million records, we would need parallelism for the record processing, hence remote chunking.
What would be the right way to think through the Spring Batch job design for something like this?

so we would need parallelism for record processing and hence remote chunking
Not necessarily. Nothing prevents you from using remote chunking in the workers of a partitioned step, but IMO this would complicate things.
A simpler approach is to use multiple jobs. Each job would handle a different partition and process items in parallel using a multi-threaded step. In other words, the partition key is a job parameter here. This approach has the following advantages:
Easier to scale: you have parallelism at two levels:
run multiple jobs in parallel using multiple JVMs (either on the same machine or on different machines)
and within each JVM, use multiple threads to process items in parallel.
Easier to implement: remote partitioning and remote chunking are not the easiest setups to configure. Running multiple jobs, where each one reads the items returned by select * from TableA where partitionKey = ? and uses a multi-threaded step (a single line of code: adding a task executor with .taskExecutor(taskExecutor)), is much easier; see the sketch below.
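To make the multi-threaded step concrete, here is a minimal sketch (Spring Batch 4 Java config; TableARow and Result are hypothetical domain classes, and the reader/processor/writer beans stand in for your own components):

```java
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

@Configuration
public class PartitionJobConfig {

    // TableARow and Result are placeholders for your own domain types.
    @Bean
    public Step processPartitionStep(StepBuilderFactory steps,
                                     ItemReader<TableARow> reader,
                                     ItemProcessor<TableARow, Result> processor,
                                     ItemWriter<Result> writer) {
        return steps.get("processPartitionStep")
                .<TableARow, Result>chunk(1000)
                .reader(reader)       // e.g. a paging reader filtered by the partitionKey job parameter
                .processor(processor)
                .writer(writer)
                // the single line that turns this into a multi-threaded step
                .taskExecutor(new SimpleAsyncTaskExecutor("worker-"))
                .throttleLimit(8)     // illustrative cap on concurrent chunk-processing threads
                .build();
    }
}
```

Each job instance would receive its partition key as a job parameter, so its reader only ever sees the rows of Table B that belong to that key.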

Related

How to avoid duplicate record processing in multiple instance batch scripts

We are trying to create a batch script that reads millions of records and processes them. Since the processing takes a long time (e.g. more than 6 hours), we are planning to run multiple instances of the batch script.
How can we avoid multiple instances picking the same record for processing?
We tried the approaches below:
pre-assigning each instance a range of records using a manager (e.g. with 3 instances and 6 million records, each instance gets 2 million records)
stamping the instance_id on each record after reading; instances always pick records whose instance_id is null
Is there any other way to avoid duplicate record processing?
If you use Spring Batch, you have several options to scale your job. Here is a non-exhaustive list of options:
Create a job with a multi-threaded step: each thread will process a distinct chunk of data
Create a job with a partitioned step: each worker step is assigned a distinct partition (workers could be local threads or remote JVMs); see the partitioner sketch below
Create different job instances where each job instance is assigned a distinct partition
Please refer to the Scaling and Parallel Processing section from the documentation for more details.
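For the partitioned-step option, a partitioner along these lines would hand each worker a disjoint slice of the key space (a sketch only; the hard-coded id range and the minId/maxId context keys are illustrative assumptions):

```java
import java.util.HashMap;
import java.util.Map;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

// Splits a known id range into gridSize contiguous slices. In a real job the
// min/max ids would be looked up from the table instead of being hard-coded.
public class IdRangePartitioner implements Partitioner {

    private final long minId = 1;
    private final long maxId = 6_000_000;

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        long sliceSize = (maxId - minId + 1) / gridSize;
        long from = minId;
        for (int i = 0; i < gridSize; i++) {
            long to = (i == gridSize - 1) ? maxId : from + sliceSize - 1;
            ExecutionContext context = new ExecutionContext();
            context.putLong("minId", from); // each worker reads WHERE id BETWEEN minId AND maxId
            context.putLong("maxId", to);
            partitions.put("partition" + i, context);
            from = to + 1;
        }
        return partitions;
    }
}
```

Each worker step would then read its own range using the minId/maxId values from its step execution context, so no two workers ever see the same record.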
Alternatively, you can use database partitioning strategies to avoid duplicate record processing:
Fixed and Even Break-Up of Record Set
Break up by a Key Column
Breakup by Views
Addition of a Processing Indicator
Extract Table to a Flat File
Use of a Hashing Column
These strategies are described in the partitioning section of the Spring Batch reference documentation.

ksqlDB recommendations for deploying large set of queries

I am running a ksqlDB streaming application that consists of a large number of queries (>60 queries), including many joins and aggregations. My data comes from various sources, and requires plenty of manipulation to produce the desired processed data, hence the large number of queries. I've run this set of queries on a single machine, using interactive mode, and it produces the right results. But I observe an increasing consumer lag when I increase the amount of data fed into the application.
I read on ksqlDB's Capacity Planning page that I can scale by adding more servers, which is what I plan to do.
Under Important Sizing Factors, it's also stated that "You should avoid running a large number of queries on one ksqlDB cluster. Instead, use interactive mode to play with your data and develop sets of queries that function together. Then, run these in their own headless cluster." However, I am unsure how to do this, since my queries are all dependent on each other.
Does anyone have any general recommendations on how to deploy a large number of interdependent ksqlDB queries? As an added requirement, the data is refreshed each day and is independent for each new day, so I need to do some sort of refresh of the queries each day.
I think that's only a recommendation: if you can, group queries that depend on each other, and then split those groups onto headless-mode servers.
Another way, if you use interactive mode, is to partition your topics and add more ksqlDB servers to your cluster. This allows ksqlDB to split the workload across the cluster, with each server consuming and processing a subset of the partitions. Say you have 4 partitions per topic and 2 servers: one server processes 2 partitions and the other server the remaining 2. This should decrease the workload on each server.
Another improvement is to reduce the number of streams threads. Each query you create runs with 4 Kafka Streams threads by default. The more threads, the more parallel work is done on the server. With a large number of queries, performance decreases and lag increases. Try 1 thread and see whether that helps: set ksql.streams.num.stream.threads=1 in ksql-server.properties to configure it.

Spring batch partitioning master can read database and pass data to workers?

I am new to Spring Batch and trying to design a new application that has to read 20 million records from a database and process them.
I don't think we can do this with a single job and step running sequentially with one thread.
I was thinking we could do this with partitioning, where the step is divided into a master and multiple workers (each worker is a thread that does its own processing and can run in parallel).
We have to read an existing table with 20 million records and process them, but this table has no auto-generated sequence number; its primary key is an employer number with 10 digits.
I checked a few sample codes for partitioning where a range is passed to each worker and each worker processes its given range, e.g. worker1 handles 1 to 100 and worker2 handles 101 to 200, but that won't work in my case because we don't have a sequence number to pass as a range to each worker.
With partitioning, can the master read the data from the database (say 1000 records) and pass it to each worker instead of sending a range?
Or, for the above scenario, would you suggest a better approach?
In principle, any query that returns result rows in a deterministic order is amenable to partitioning, as in the examples you mentioned, by means of OFFSET and LIMIT clauses. The ORDER BY may considerably increase the query execution time, although if you order by the table's primary key this effect should be less noticeable, since the table's index is already ordered. So I would give this approach a try first, as it is the most elegant IMHO.
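As a rough sketch of what the worker-side reader could look like with this approach (Spring Batch 4 builder style; the table name, column name and the LIMIT/OFFSET syntax are assumptions that depend on your schema and database):

```java
import java.util.Map;
import javax.sql.DataSource;
import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.batch.item.database.builder.JdbcCursorItemReaderBuilder;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.core.ColumnMapRowMapper;

@Configuration
public class WorkerReaderConfig {

    // Each partition gets its own offset/limit, put into the step execution
    // context by the partitioner, and reads a primary-key-ordered slice.
    @Bean
    @StepScope
    public JdbcCursorItemReader<Map<String, Object>> workerReader(
            DataSource dataSource,
            @Value("#{stepExecutionContext['offset']}") long offset,
            @Value("#{stepExecutionContext['limit']}") long limit) {

        return new JdbcCursorItemReaderBuilder<Map<String, Object>>()
                .name("workerReader")
                .dataSource(dataSource)
                // ORDER BY the primary key so the row order is deterministic across partitions
                .sql("SELECT * FROM employee ORDER BY employer_number"
                        + " LIMIT " + limit + " OFFSET " + offset)
                .rowMapper(new ColumnMapRowMapper())
                .build();
    }
}
```

The partitioner on the manager side would compute one (offset, limit) pair per partition and store both values in that partition's execution context.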
Note however that you might run into other problems processing a huge result set straight from a JdbcCursorItemReader, because some RDBMSs (like MySQL) won't be happy with the rate at which you'd be fetching rows interlocked with processing. So depending on the complexity of your processing I would recommend validating the design in that regard early on.
Unfortunately it is not possible to retrieve a partition's entire set of table rows and pass it as a parameter to the worker step as you suggested, because the parameter must not serialize to more than a kilobyte (or something in that order of magnitude).
An alternative would be to retrieve each partition's data and store it somewhere (in a map entry in memory if size allows, or in a file) and pass a reference to that resource as a parameter to the worker step, which then reads and processes it.
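A minimal sketch of that alternative, assuming each partition's data is staged in a temporary file (the loadRowsForPartition helper and the CSV format are hypothetical):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

// The master writes each partition's rows to a temporary file and puts only
// the file path into the (size-limited) execution context; the worker step's
// reader then opens that path and processes the staged rows.
public class FileHandoffPartitioner implements Partitioner {

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        for (int i = 0; i < gridSize; i++) {
            try {
                Path partitionFile = Files.createTempFile("partition-" + i + "-", ".csv");
                Files.write(partitionFile, loadRowsForPartition(i, gridSize)); // hypothetical helper
                ExecutionContext context = new ExecutionContext();
                context.putString("partitionFile", partitionFile.toString());
                partitions.put("partition" + i, context);
            } catch (IOException e) {
                throw new IllegalStateException("Could not stage data for partition " + i, e);
            }
        }
        return partitions;
    }

    // Placeholder: query the source table and return this partition's rows as CSV lines.
    private List<String> loadRowsForPartition(int partitionNumber, int gridSize) {
        throw new UnsupportedOperationException("query the source table here");
    }
}
```

A step-scoped FlatFileItemReader in the worker step would then resolve the partitionFile entry from its execution context and read the staged rows.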

Does PostgreSQL query partitions in parallel?

Postgres now has parallel queries. Are parallel queries used when the table is partitioned, the query is on the master table, and more than one partition (child table) is involved?
For example, I partition by the hour of the day. Then I want to count a type of event over more than one hour. The aggregation can be done on each partition, with the results added up at the end.
The alternative is to use a union between the partitions (child tables). In this case Postgres does parallel execution.
No, partitions are not queried in parallel. At this time (9.6) only table scans use parallel execution. The table is divided among the available workers, and each worker scans part of the table. At the end the primary worker combines the partial results.
A side effect of this is that the optimizer is more likely to choose a full table scan when parallel query execution is enabled.
As far as I can tell, there is no plan to parallelize execution based on partitions (or UNION ALL); a suggestion to add this has been submitted.
Edit: My original answer was wrong. This answer has been completely revised.

Remote chunking with Spring Batch job distribution

I have a technical issue running my Spring Batch jobs.
The job simply reads records from the DB (MongoDB), makes some calculations on the records (aggregations), and writes the result to another table.
Reading A, processing A, writing to record B.
B is an aggregation of many A records.
I want to use remote chunking to vertically scale my system, so that the processing part scales and is quick.
The problem I face is that I need to synchronize the A records so that processing them will not conflict when writing the result to B.
If I distribute 10 A records to 4 slaves, they will conflict when writing the aggregate result to B.
Any idea how to add a synchronization policy when sending messages from the master to the slaves?
Thanks in advance ...
If you need to synchronize data like you're describing, I'd recommend not going with remote chunking and using partitioning instead. This would allow you to partition by A and eliminate the synchronization issues you're facing. It would also provide additional throughput as you'd be running one processor per slave (same as in remote chunking).
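To make that concrete, here is a rough sketch of the manager/worker wiring for such a partitioned step (Spring Batch 4 Java config; byAKeyPartitioner, workerStep and the grid size are placeholders, and the partitioner is assumed to assign disjoint sets of A keys so that no two workers write the same B aggregate):

```java
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

@Configuration
public class AggregationJobConfig {

    @Bean
    public Step managerStep(StepBuilderFactory steps,
                            Partitioner byAKeyPartitioner, // assigns each worker a disjoint set of A keys
                            Step workerStep) {             // reads and aggregates the A records for its keys, writes B
        return steps.get("managerStep")
                .partitioner(workerStep.getName(), byAKeyPartitioner)
                .step(workerStep)
                .gridSize(4)                                 // illustrative number of partitions
                .taskExecutor(new SimpleAsyncTaskExecutor()) // local worker threads; swap in remote workers if needed
                .build();
    }
}
```

Because each worker owns all the A records that feed a given B aggregate, the writes to B no longer need any cross-worker synchronization.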