Spring Batch partitioning: can the master read the database and pass data to workers?

I am new to Spring Batch and I'm designing a new application that has to read 20 million records from a database and process them.
I don't think we can do this with a single job and step (running sequentially on one thread).
I was thinking we could do this with partitioning, where the step is divided into a master and multiple workers (each worker is a thread that does its own processing and can run in parallel).
We have to read an existing table with 20 million records and process them, but this table has no auto-generated sequence number; its primary key is a 10-digit employer number.
I checked a few code samples for partitioning where a range is passed to each worker and the worker processes that range, e.g. worker 1 gets rows 1 to 100 and worker 2 gets rows 101 to 200. In my case that is not going to work, because we don't have a sequence number to pass as a range to each worker.
With partitioning, can the master read the data from the database (say, 1000 records at a time) and pass it to each worker in place of sending a range?
Or, for the above scenario, would you suggest a better approach?

In principle, any query that returns result rows in a deterministic order is amenable to partitioning as in the examples you mentioned, by means of OFFSET and LIMIT clauses. The ORDER BY may considerably increase the query execution time, although if you order by the table's primary key this effect should be less noticeable, since the primary key index is already ordered. So I would give this approach a try first, as it is the most elegant IMHO (see the sketch below).
Note, however, that you might run into other problems when processing a huge result set straight from a JdbcCursorItemReader, because some RDBMSs (like MySQL) won't be happy with the rate at which you'd be fetching rows interleaved with processing. So, depending on the complexity of your processing, I would recommend validating the design in that regard early on.
Unfortunately it is not possible to retrieve a partition's entire set of table rows and pass it as a parameter to the worker step as you suggested, because the parameter must not serialize to more than a kilobyte (or something in that order of magnitude).
An alternative would be to retrieve each partition's data and store it somewhere (in a map entry in memory if size allows, or in a file) and pass a reference to that resource as a parameter to the worker step, which then reads and processes it.
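
To make the OFFSET/LIMIT idea concrete, here is a minimal sketch of a custom Partitioner that hands each worker a window over the ordered result set. The class name, the employer_number ordering and the pre-counted row total are illustrative assumptions, not something prescribed by Spring Batch:

import java.util.HashMap;
import java.util.Map;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

public class OffsetLimitPartitioner implements Partitioner {

    private final long totalRows; // e.g. from a SELECT COUNT(*) on the table

    public OffsetLimitPartitioner(long totalRows) {
        this.totalRows = totalRows;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        long rowsPerPartition = (totalRows + gridSize - 1) / gridSize;
        for (int i = 0; i < gridSize; i++) {
            ExecutionContext context = new ExecutionContext();
            // Each worker's reader appends the equivalent of
            // "ORDER BY employer_number OFFSET :offset LIMIT :limit"
            // (or your RDBMS's paging clause) to its query.
            context.putLong("offset", i * rowsPerPartition);
            context.putLong("limit", rowsPerPartition);
            partitions.put("partition" + i, context);
        }
        return partitions;
    }
}

A step-scoped reader in each worker step can then pull offset and limit from its step execution context and page through the result set ordered by the primary key.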

Related

How to avoid duplicate record processing in multiple instance batch scripts

We are trying to create a batch script that reads millions of records and processes them. Since the processing takes a long time (e.g. more than 6 hours), we are planning to run multiple instances of the batch script.
How do we avoid multiple instances picking the same record for processing?
We tried the approaches below:
pre-assigning each instance a range of records using a manager (i.e., with, say, 3 instances and 6 million records, each instance gets 2 million records)
stamping the instance_id on each record after reading it; instances always pick records that have a null value in instance_id
Is there any other way to avoid duplicate record processing?
If you use Spring Batch, you have several options to scale your job. Here is a non-exhaustive list:
Create a job with a multi-threaded step: each thread will process a distinct chunk of data (see the sketch after this list)
Create a job with a partitioned step: each worker step is assigned a distinct partition (workers can be local threads or remote JVMs)
Create different job instances where each job instance is assigned a distinct partition
Please refer to the Scaling and Parallel Processing section of the documentation for more details.
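
As a minimal sketch of the first option (assuming Spring Batch 5's StepBuilder API; the bean names and the MyRecord item type are placeholders):

import org.springframework.batch.core.Step;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.batch.core.step.builder.StepBuilder;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.SimpleAsyncTaskExecutor;
import org.springframework.transaction.PlatformTransactionManager;

@Configuration
public class ScalingConfig {

    // Each chunk of 1000 items is read, processed and written on a
    // thread from the task executor, so the reader must be thread-safe.
    @Bean
    public Step multiThreadedStep(JobRepository jobRepository,
                                  PlatformTransactionManager txManager,
                                  ItemReader<MyRecord> reader,
                                  ItemWriter<MyRecord> writer) {
        return new StepBuilder("multiThreadedStep", jobRepository)
                .<MyRecord, MyRecord>chunk(1000, txManager)
                .reader(reader)
                .writer(writer)
                .taskExecutor(new SimpleAsyncTaskExecutor("worker-"))
                .build();
    }
}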
Alternatively, you can use database partitioning strategies to avoid duplicate record processing (a sketch of the processing-indicator strategy follows this list):
Fixed and Even Break-Up of Record Set
Break up by a Key Column
Break up by Views
Addition of a Processing Indicator
Extract Table to a Flat File
Use of a Hashing Column
These strategies are described in the Spring Batch reference documentation.
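
As a sketch of the processing-indicator strategy (the fourth item above), an instance can atomically claim a batch of unclaimed rows before reading them. The instance_id column is from the question; the records table name is hypothetical, and the SKIP LOCKED clause assumes PostgreSQL, Oracle, or MySQL 8+:

import javax.sql.DataSource;
import org.springframework.jdbc.core.JdbcTemplate;

public class RecordClaimer {

    private final JdbcTemplate jdbc;

    public RecordClaimer(DataSource dataSource) {
        this.jdbc = new JdbcTemplate(dataSource);
    }

    // Claims up to batchSize unprocessed rows for this instance. The
    // SKIP LOCKED subquery lets concurrent instances claim disjoint
    // sets of rows without blocking one another.
    public int claim(String instanceId, int batchSize) {
        return jdbc.update(
            "UPDATE records SET instance_id = ? WHERE id IN (" +
            "  SELECT id FROM records WHERE instance_id IS NULL" +
            "  LIMIT ? FOR UPDATE SKIP LOCKED)",
            instanceId, batchSize);
    }
}

Each instance then processes only the rows stamped with its own instance_id, which removes the race between reading a record and stamping it.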

Postgres Partitioning Query Performance when Partitioned for Delete

We are on PostgreSQL 12 and looking to partition a group of tables that are all related by data source name. A source can have tens of millions of records, and the whole dataset takes up about 900 GB across the 2000 data sources. We don't have a good way to update these records, so we are looking at a full dump and reload any time we need to update data for a source. This is why we are looking at partitioning: we can load the new data into a new partition, detach (and later drop) the partition that currently houses the data, and then attach the new partition with the latest data. Queries will be performed via a single ID field. My concern is that since we are partitioning by source name and querying by an ID that isn't used in the partition definition, we won't be able to make use of partition pruning, and our queries will suffer for it.
How concerned should we be about query performance for this use case? There will be an index defined on the ID that is being queried, but based on the PostgreSQL documentation, queries that look at many partitions can add a lot of planning time and use a lot of memory.
Performance will suffer, but how much depends on the number of partitions. The more partitions you have, the slower both planning and execution get, so keep the number low.
You can save on query planning time by defining a prepared statement and reusing it.
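
As a sketch of that suggestion from plain JDBC (table and column names are illustrative): preparing once and executing many times lets PostgreSQL cache the plan, since the JDBC driver switches to a server-side prepared statement after a few executions (prepareThreshold, 5 by default), so the partition-heavy planning work is not repeated per query.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PreparedLookup {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/mydb", "user", "secret");
             PreparedStatement ps = conn.prepareStatement(
                 "SELECT * FROM measurements WHERE id = ?")) {
            for (long id : new long[] {42L, 43L, 44L}) {
                ps.setLong(1, id); // re-bind and re-execute; the plan is reused
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1));
                    }
                }
            }
        }
    }
}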

Spring Batch: combining remote partitioning with remote chunking

Trying to see if I can design a job that needs both partitioning and remote chunking. Suppose Table A holds rows where one of the columns is the partition key, and for every row in Table A, Table B contains many child records for that foreign/partition key. We would need to run a query that filters the partition keys from Table A and, for every partition key, process all the child records in Table B. (Here again we would have several million records in Table B, so we would need parallelism for record processing and hence remote chunking.)
What would be the right way to think through the Spring Batch job design for something like that?
so we would need parallelism for record processing and hence remote chunking
Not necessarily. Nothing prevents you from using remote chunking in the workers of a partitioned step, but IMO this would complicate things.
A simpler approach is to use multiple jobs. Each job would handle a different partition and process items in parallel using a multi-threaded step. In other words, the partition key is a job parameter here. This approach has the following advantages:
Easier to scale, since you have parallelism at two levels:
run multiple jobs in parallel using multiple JVMs (either on the same machine or on different machines)
and within each JVM, use multiple threads to process items in parallel.
Easier to implement: remote partitioning and remote chunking are not the easiest setups to configure. Running multiple jobs where each one reads select * from TableA where partitionKey = ? items and uses a multi-threaded step (which requires a single line of code: adding a task executor with .taskExecutor(taskExecutor)) is much easier.
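
Here is a sketch of that setup (assuming Spring Batch's JdbcCursorItemReaderBuilder; the TableA query is from the answer above, everything else is illustrative):

import java.util.Map;
import javax.sql.DataSource;
import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.batch.item.database.builder.JdbcCursorItemReaderBuilder;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.core.ColumnMapRowMapper;

@Configuration
public class PartitionKeyJobConfig {

    // The partition key arrives as a job parameter, so each job
    // instance reads a disjoint slice of TableA.
    @Bean
    @StepScope
    public JdbcCursorItemReader<Map<String, Object>> reader(
            @Value("#{jobParameters['partitionKey']}") String partitionKey,
            DataSource dataSource) {
        return new JdbcCursorItemReaderBuilder<Map<String, Object>>()
                .name("tableAReader")
                .dataSource(dataSource)
                .sql("select * from TableA where partitionKey = ?")
                .preparedStatementSetter(ps -> ps.setString(1, partitionKey))
                .rowMapper(new ColumnMapRowMapper())
                .build();
    }
}

Each job instance is then launched with its own key, e.g. jobLauncher.run(job, new JobParametersBuilder().addString("partitionKey", key).toJobParameters()). Note that a cursor-based reader is not thread-safe; for the multi-threaded step, wrap it in a SynchronizedItemStreamReader or use a paging reader.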

Spark dataframe saveAsTable is using a single task

We have a pipeline for which the initial stages are properly scalable - using several dozen workers apiece.
One of the last stages is
dataFrame.write.format(outFormat).mode(saveMode)
  .partitionBy(partColVals.map(_._1): _*)
  .saveAsTable(tname)
For this stage we end up with a single worker. This clearly does not work for us - in fact the worker runs out of disk space - on top of being very slow.
Why would that command end up running on a single worker/single task only?
Update: The output format was parquet. The number of partition columns did not affect the result (tried one column as well as several columns).
Another update: None of the following conditions (as posited by an answer below) held:
coalesce or partitionBy statements
window / analytic functions
Dataset.limit
sql.shuffle.partitions
The problem is unlikely to be related in any way to saveAsTable.
A single task in a stage indicates that the input data (Dataset or RDD) has only one partition. This is in contrast to cases where there are multiple tasks but one or more of them have significantly higher execution time, which normally corresponds to partitions containing positively skewed keys. Also, you should not confuse a single-task scenario with low CPU utilization. The former is usually a result of insufficient IO throughput (high CPU wait times are the most obvious indication of that), but in rare cases it can be traced to the use of shared objects with low-level synchronization primitives.
Since standard data sources don't shuffle data on write (including cases where partitionBy and bucketBy options are used) it is safe to assume that data has been repartitioned somewhere in the upstream code. Usually it means that one of the following happened:
Data has been explicitly moved to a single partition using coalesce(1) or repartition(1).
Data has been implicitly moved to a single partition for example with:
Dataset.limit
Window function applications with window definition lacking PARTITION BY clause.
df.withColumn(
"row_number",
row_number().over(Window.orderBy("some_column"))
)
sql.shuffle.partitions option is set to 1 and upstream code includes non-local operation on a Dataset.
The Dataset is a result of applying a global aggregate function (without a GROUP BY clause). This is usually not an issue, unless the function is non-reducing (collect_list or comparable).
While there is no evidence that this is the problem here, in the general case you should also consider the possibility that the data contains only a single partition all the way from the source. This is usually the case when input is fetched using the JDBC source, but third-party formats can exhibit the same behavior.
To identify the source of the problem, you should either check the execution plan for the input Dataset (explain(true)) or check the SQL tab of the Spark web UI.
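
A short sketch of both checks (using Spark's Java API; the table and column names are illustrative), plus an explicit repartition for the case where the input really does arrive in one partition:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class WriteDiagnostics {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("diag").getOrCreate();
        Dataset<Row> df = spark.table("staging_table"); // illustrative input

        // How many partitions (and hence tasks) will the write see?
        System.out.println(df.rdd().getNumPartitions());

        // Inspect the plan to find where a single partition is introduced.
        df.explain(true);

        // If the input genuinely has one partition, redistribute it first.
        df.repartition(200)
          .write()
          .format("parquet")
          .mode(SaveMode.Overwrite)
          .partitionBy("partCol")
          .saveAsTable("tname");
    }
}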

Is it possible to use a Cassandra table as a basic queue

Is it possible to use a table in Cassandra as a queue? I don't think the strategy I use in MySQL works, i.e. given this table:
create table message_queue(id integer, message varchar(4000), retries int, sending boolean);
We have a transaction that marks the row as "sending", tries to send, and then either deletes the row, or increments the retries count. The transaction ensures that only one server will be attempting to process an item from the message_queue at any one time.
There is an article on DataStax that describes the pitfalls and how to get around them; however, I'm not sure what the impact of having lots of tombstones lying around is. How long do they stay around for?
Don't do this. Cassandra is a terrible choice as a queue backend unless you are very, very careful. You can read more about the reasons in Jonathan Ellis's blog post "Cassandra anti-patterns: Queues and queue-like datasets" (which might be the post you're alluding to). MySQL is also not a great choice for backing a queue; use a real queue product like RabbitMQ, it's great and very easy to use.
The problem with using Cassandra as the storage for a queue is this: every time you delete a message you write a tombstone for that message. Every time you query for the next message, Cassandra will have to trawl through those tombstones and deleted messages and try to determine the few that have not been deleted. With any kind of throughput, the ratio of read values to actual live messages will be hundreds of thousands to one.
Tuning GC grace and other parameters will not help, because that only applies to how long tombstones hang around after a compaction, and even if you dedicated the CPUs to only run compactions you would still have dead-to-live ratios of tens of thousands or more. And even with a GC grace of zero, tombstones will hang around after compactions in some cases.
There are ways to mitigate these effects, and they are outlined in Jonathan's post, but here's a summary (and I don't write this to encourage you to use Cassandra as a queue backend, but because it explains a bit more about how Cassandra works, and should help you understand why it's a bad fit for the problem):
To avoid the tombstone problem you cannot keep using the same queue, because it will fill up with tombstones quicker than compactions can get rid of them and your performance will run straight into a brick wall. If you add a column to the primary key that is deterministic and depends on time, you can avoid some of the performance problems, since fewer tombstones have time to build up and Cassandra will be able to completely remove old rows and all their tombstones.
Using a single row per queue also creates a hotspot. A single node will have to handle that queue, and the rest of the nodes will be idle. You might have lots of queues, but chances are that one of them will see much more traffic than the others, and that means you get a hotspot. Shard the queues over multiple nodes by adding a second column to the primary key. It can be a hash of the message (for example crc32(message) % 60 would create 60 shards; don't use too small a number). When you want to find the next message, you read from all of the shards and pick one of the results, ignoring the others. Ideally you find a way to combine this with something that depends on time, so that you fix that problem too while you're at it.
If you sort your messages by time of arrival (for example with a TIMEUUID clustering key) and can somehow keep track of the newest message that has been delivered, you can do a query to find all messages after that message. That would mean less trawling through tombstones for Cassandra, but it is no panacea.
Then there's the issue of acknowledgements. I'm not sure if they matter to you, but it looks like you have some kind of locking mechanism in your schema (I'm thinking of the retries and sending columns). This will not work. Until Cassandra 2.0 and its compare-and-swap features, there is no way to make that work correctly. To implement a lock you need to read the value of the column, check that it's not locked, and then write that it should now be locked. Even with consistency level ALL, another application node can do the same operations at the same time, and both end up thinking that they locked the message. With CAS in Cassandra 2.0 it will be possible to do this atomically, but at the cost of performance.
There are a couple more answers here on Stack Overflow about Cassandra and queues; read them (start with this one: "Table with heavy writes and some reads in Cassandra. Primary key searches taking 30 seconds").
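
To make the shard and time-bucket ideas above concrete, here is a sketch. The CRC32 sharding follows the crc32(message) % 60 example; the bucket size and the table layout in the comment are assumptions (and, to be clear, none of this makes Cassandra a good queue backend):

import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class QueueSharding {

    static final int SHARDS = 60;

    // Shard derived from the message, as in crc32(message) % 60 above.
    static int shardFor(String message) {
        CRC32 crc = new CRC32();
        crc.update(message.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % SHARDS);
    }

    // Time bucket so that old rows, and their tombstones, age out of the
    // working set; one bucket per minute is an arbitrary example.
    static long bucketFor(long epochMillis) {
        return epochMillis / 60_000L;
    }

    public static void main(String[] args) {
        // A matching (hypothetical) table would use ((bucket, shard)) as
        // the partition key with a TIMEUUID clustering column:
        //   CREATE TABLE message_queue (
        //       bucket bigint, shard int, id timeuuid, message text,
        //       PRIMARY KEY ((bucket, shard), id));
        long bucket = bucketFor(System.currentTimeMillis());
        int shard = shardFor("hello");
        System.out.printf("bucket=%d shard=%d%n", bucket, shard);
    }
}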
The grace period can be configured. By default it is 10 days:
gc_grace_seconds
(Default: 864000 [10 days]) Specifies the time to wait before garbage collecting tombstones (deletion markers). The default value allows a great deal of time for consistency to be achieved prior to deletion. In many deployments this interval can be reduced, and in a single-node cluster it can be safely set to zero. When using the CLI, use gc_grace instead of gc_grace_seconds.
Taken from the documentation.
On a different note, I do not think that implementing a queue pattern in Cassandra is very useful. To prevent your workers from processing one entry twice, you need to enforce "ALL" read consistency, which defeats the purpose of a distributed database system.
I highly recommend looking at specialized messaging systems that support the queue pattern natively. Take a look at RabbitMQ, for instance. You will be up and running in no time.
Theo's answer about not using Cassandra for queues is spot on.
Just wanted to add that we have been using Redis sorted sets for our queues and it has been working pretty well. Some of our queues have tens of millions of elements and are accessed hundreds of times per second.
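
For what it's worth, here is a minimal sketch of that pattern (assuming the Jedis 4.x client; key and host names are illustrative): the score is the enqueue timestamp, so ZPOPMIN atomically pops the oldest message.

import redis.clients.jedis.Jedis;
import redis.clients.jedis.resps.Tuple;

public class RedisQueue {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Enqueue: score = arrival time, member = message payload.
            jedis.zadd("message_queue", System.currentTimeMillis(), "hello");

            // Dequeue: atomically pop the lowest-scored (oldest) entry.
            Tuple next = jedis.zpopmin("message_queue");
            if (next != null) { // null when the queue is empty
                System.out.println(next.getElement());
            }
        }
    }
}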