Is there any alternative to the column range partitioner in Spring Batch remote partitioning? - db2

Take a typical case: I read data from DB2, apply some business logic to it, and write the result to MongoDB. I do this with Spring Batch remote partitioning, using a column range partitioner. The problem is that my DB2 table has no sequential column, so each partition ends up with a different number of rows and the load differs from slave to slave. My requirement is to distribute the load equally across the slaves.

You'll need to write your own implementation of Partitioner. In a partitioned job, the Partitioner is responsible for knowing how to divide the data into partitions. Spring Batch really only provides one out of the box, the MultiResourcePartitioner. The column range one found in the framework is actually just a sample. You can read more about this interface and its role in the documentation here: https://docs.spring.io/spring-batch/trunk/apidocs/org/springframework/batch/core/partition/support/Partitioner.html and here: https://docs.spring.io/spring-batch/trunk/reference/html/scalability.html
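As a rough illustration of what such a custom implementation could compute, here is a minimal plain-Java sketch that derives key ranges with near-equal row counts from a sorted list of keys. The class and method names (KeyRangePartitions, split) are hypothetical, and the sketch assumes the keys are unique and that the sorted keys (or a sample of them) fit in memory. In a real job this logic would sit inside your Partitioner's partition(int gridSize) method, with each range stored in the ExecutionContext of one partition.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: not part of Spring Batch.
final class KeyRangePartitions {

    /**
     * Splits a sorted list of unique keys into at most gridSize inclusive
     * {minKey, maxKey} ranges whose row counts differ by at most one.
     */
    static List<long[]> split(List<Long> sortedKeys, int gridSize) {
        if (gridSize <= 0) {
            throw new IllegalArgumentException("gridSize must be positive");
        }
        List<long[]> ranges = new ArrayList<>();
        int n = sortedKeys.size();
        int base = n / gridSize;       // minimum number of keys per range
        int remainder = n % gridSize;  // the first 'remainder' ranges get one extra key
        int from = 0;
        for (int i = 0; i < gridSize && from < n; i++) {
            int size = base + (i < remainder ? 1 : 0);
            int to = from + size - 1;
            ranges.add(new long[] {sortedKeys.get(from), sortedKeys.get(to)});
            from = to + 1;
        }
        return ranges;
    }
}
```

Because the boundaries come from the actual key values rather than from (max - min) / gridSize, gaps in the key sequence no longer skew the partition sizes.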

Related

Kafka Connect: Single connector or connector per table approach

I have a database, say test, with multiple Kafka Debezium connectors on it.
Each connector is associated with one table.
My question is, in terms of memory usage, which is the better approach:
One connector per database OR
One connector per table
I think it really depends on your use case. I don't think there is a general approach that fits all use cases.
For example, at my current job, we decided to have 4 connectors that stream changes from the same database, each of them streaming from a subset of the tables. The main reason is that we don't want a single point of failure where one bad record in the DB can break all our use cases that rely on CDC, so we divided the tables and assigned each group to a connector. Note that it's also not good to have a lot of replication slots on the database. So it really depends on your use case.
Considering all performance factors, the recommended approach is to have a single source connector (with multiple instances to share the load) and a replicator or configuration file per database instance (test1, test2, test3, etc.), each covering multiple tables, so that data ingress is 1 table -> 1 topic.
For a comparable implementation pattern, see the Oracle GoldenGate example here:
https://rmoff.net/2018/12/12/streaming-data-from-oracle-into-kafka/

Spring Batch - Combining remote partitioning with remote chunking

I'm trying to see if I can design a job that needs both partitioning and remote chunking. Suppose Table A holds rows (one of its columns is the partition key), and for every row in Table A, Table B contains many child records for that foreign/partition key. We would need to run a query that selects the partition keys from Table A and, for every partition key, process all the child records in Table B (here again we would have several million records in Table B, so we would need parallelism for record processing and hence remote chunking).
What would be the right way to think through the spring batch job design for something like that?
"so we would need parallelism for record processing and hence remote chunking"
Not necessarily. Nothing prevents you from using remote chunking in the workers of a partitioned step, but IMO this would complicate things.
A simpler approach is to use multiple jobs. Each job would handle a different partition and process items in parallel using a multi-threaded step. In other words, the partition key is a job parameter here. This approach has the following advantages:
Easier to scale: you have parallelism at two levels:
run multiple jobs in parallel using multiple JVMs (either on the same machine or on different machines),
and within each JVM, use multiple threads to process items in parallel.
Easier to implement: Remote partitioning and chunking are not the easiest setups to configure. Running multiple jobs where each one reads select * from TableA where partitionKey = ? items and uses a multi-threaded step (it requires a single line of code, adding a task executor .taskExecutor(taskExecutor)) is much easier.
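To make the second level of parallelism concrete, here is a minimal plain-Java sketch of the idea behind a multi-threaded step: the items of one partition are cut into chunks and each chunk is processed on a thread pool. This is only a stand-in for illustration, not Spring Batch code; the class name ParallelChunks and its parameters are hypothetical. Unlike this sketch, a real multi-threaded step gives no guarantee about the order in which chunks complete.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical stand-in for what a multi-threaded step does inside one JVM.
final class ParallelChunks {

    /** Processes items in chunks of chunkSize on a fixed pool of 'threads' threads. */
    static <I, O> List<O> process(List<I> items, int chunkSize, int threads, Function<I, O> processor) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<O>>> futures = new ArrayList<>();
            for (int from = 0; from < items.size(); from += chunkSize) {
                List<I> chunk = items.subList(from, Math.min(from + chunkSize, items.size()));
                // Each chunk is handled by whichever pool thread is free.
                futures.add(pool.submit(() -> {
                    List<O> out = new ArrayList<>();
                    for (I item : chunk) {
                        out.add(processor.apply(item));
                    }
                    return out;
                }));
            }
            // Collect results in submission order.
            List<O> results = new ArrayList<>();
            for (Future<List<O>> future : futures) {
                results.addAll(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

With one JVM started per partition key, each JVM would run something of this shape over the Table B rows belonging to its key.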

Spring Batch partitioning: can the master read the database and pass data to the workers?

I am new to Spring Batch and am trying to design a new application that has to read 20 million records from a database and process them.
I don't think we can do this with one single job and step (running sequentially on one thread).
I was thinking we could do this with partitioning, where the step is divided into a master and multiple workers (each worker is a thread that does its own processing and can run in parallel).
We have to read an existing table that has 20 million records and process them, but this table does not have an auto-generated sequence number; its primary key is something like a 10-digit employer number.
I checked a few code samples for partitioning where a range is passed to each worker and the worker processes the given range, for example worker1 from 1 to 100 and worker2 from 101 to 200. In my case that is not going to work, because we don't have a sequence number to pass as a range to each worker.
With partitioning, can the master read the data from the database (say 1000 records) and pass it to each worker instead of sending a range?
Or, for the above scenario, would you suggest a better approach?
In principle any query that returns result rows in a deterministic order is amenable to partitioning as in the examples you mentioned by means of OFFSET and LIMIT options. The ORDER BY may considerably increase the query execution time, although if you order by the table's primary key then this effect should be less noticeable as the table's index will already be ordered. So I would give this approach a try first, as it is the most elegant IMHO.
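As a minimal sketch of that idea, the following plain Java builds one ORDER BY ... LIMIT ... OFFSET query per partition from a total row count. The class name OffsetLimitQueries and the table and column names passed in are hypothetical, and the LIMIT/OFFSET syntax shown is an assumption: the exact paging syntax differs between RDBMSs. In a partitioned step, the offset and page size would normally be handed to each worker as values in its execution context rather than as finished SQL strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds paging queries over a deterministic ordering.
final class OffsetLimitQueries {

    /**
     * Returns at most 'partitions' queries that together cover totalRows rows.
     * Identifiers are concatenated as-is, so pass only trusted table/column names.
     */
    static List<String> build(String table, String keyColumn, long totalRows, int partitions) {
        if (partitions <= 0) {
            throw new IllegalArgumentException("partitions must be positive");
        }
        long pageSize = (totalRows + partitions - 1) / partitions; // ceiling division
        List<String> queries = new ArrayList<>();
        for (long offset = 0; offset < totalRows; offset += pageSize) {
            queries.add("SELECT * FROM " + table
                    + " ORDER BY " + keyColumn
                    + " LIMIT " + pageSize
                    + " OFFSET " + offset);
        }
        return queries;
    }
}
```

The ORDER BY on the primary key is what makes the pages deterministic; without it, two workers could see overlapping or missing rows.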
Note however that you might run into other problems processing a huge result set straight from a JdbcCursorItemReader, because some RDBMSs (like MySQL) won't be happy with the rate at which you'd be fetching rows interlocked with processing. So depending on the complexity of your processing I would recommend validating the design in that regard early on.
Unfortunately it is not possible to retrieve a partition's entire set of table rows and pass it as a parameter to the worker step as you suggested, because the parameter must not serialize to more than a kilobyte (or something in that order of magnitude).
An alternative would be to retrieve each partition's data and store it somewhere (in a map entry in memory if size allows, or in a file) and pass the reference to that resource in a parameter to the worker step which then reads and processes it.
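A minimal sketch of the file-based variant, in plain Java: each partition's rows are written to their own temporary file and only the file path, a short string, is passed on. The class name PartitionFiles and the one-line-per-row text format are assumptions made for illustration. With remote workers the files would have to live on storage that every worker can reach, which a local temp directory is not.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: stages partition data in files and hands out references.
final class PartitionFiles {

    /** Writes each partition's rows (one per line) to a temp file; returns the file paths. */
    static List<String> stage(List<List<String>> partitions) {
        List<String> refs = new ArrayList<>();
        try {
            for (int i = 0; i < partitions.size(); i++) {
                Path file = Files.createTempFile("partition-" + i + "-", ".txt");
                Files.write(file, partitions.get(i));
                refs.add(file.toString()); // small enough to pass as a step parameter
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return refs;
    }

    /** What a worker would do with the reference it receives: read its rows back. */
    static List<String> load(String ref) {
        try {
            return Files.readAllLines(Paths.get(ref));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```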

How to create multiple Spark tasks to query Cassandra partitions

I have an application that is using Spark (with Spark Job Server) that uses a Cassandra store. My current setup is that of a client mode running with master=local[*]. So there is a single Spark executor which is also the driver process that is using all 8 cores of the machine. I have a Cassandra instance running on the same machine.
The Cassandra tables have a primary key of the form ((datasource_id, date), clustering_col_1...clustering_col_n) where date is a single day of the form "2019-02-07" and is part of a composite partition key.
In my Spark application, I am running a query like so:
df.filter(col("date").isin(days: _*))
In the Spark physical plan, I notice that these filters, along with the filter for the "datasource_id" partition key, are pushed down to the Cassandra CQL query.
For our biggest datasources, I know that the partitions are around 30MB in size. So I have the following setting in the Spark Job Server configuration:
spark.cassandra.input.split.size_in_mb = 1
However I notice that there is no parallelization in the Cassandra loading step. Though there are multiple Cassandra partitions that are >1MB, there are no additional spark partitions created. There is only a single task that does all the querying on a single core, thus taking ~20 secs to load data for a 1 month date range that corresponds to ~1 million rows.
I have tried the alternative approach below:
// Build one filtered DataFrame per day, then union them into a single DataFrame (assumes days is non-empty)
days.map((day: String) => df.filter(col("date").equalTo(day))).reduce(_ union _)
This does indeed create a spark partition (or task) for every "day" partition in cassandra. However, for smaller datasources where the cassandra partitions are much smaller in size, this method proves to be quite expensive in terms of excessive tasks created and the overhead due to their coordination. For these datasources, it would be totally fine to lump many cassandra partitions into one spark partition. Hence why I thought using the spark.cassandra.input.split.size_in_mb configuration would prove useful in dealing with both small and large datasources.
Is my understanding wrong? Is there something else that I'm missing in order for this configuration to take effect?
P.S. I have also read the answers about using joinWithCassandraTable. However, our code relies on using DataFrame. Also, converting from a CassandraRDD to a DataFrame is not very viable for us since our schema is dynamic and cannot be specified using case classes.

Spark and sharded JDBC datasources

I have a production sharded cluster of PostgreSQL machines where sharding is handled at the application layer. (Created records are assigned a system generated unique identifier - not a UUID - which includes a 0-255 value indicating the shard # that record lives on.) This cluster is replicated in RDS so large read queries can be executed against it.
I'm trying to figure out the best option for accessing this data within Spark.
I was thinking of creating a small dataset (a text file) that contains only the shard names, i.e., integration-shard-0, integration-shard-1, etc. Then I'd partition this dataset across the Spark cluster so ideally each worker would only have a single shard name (but I'd have to handle cases where a worker has more than one shard). Then when I create a JdbcRDD I'd actually create 1..n such RDDs, one for each shard name residing on that worker, and merge the resulting RDDs together.
This seems like it would work but before I go down this path I wanted to see how other people have solved similar problems.
(I also have a separate Cassandra cluster available as second datacenter for analytic processing which I will be accessing with Spark.)
I ended up writing my own ShardedJdbcRDD for which the preliminary version can be found at the following gist:
https://gist.github.com/cfeduke/3bca88ed793ddf20ea6d
At the time I wrote it, this version doesn't support use from Java, only Scala. (I may update it.) It also doesn't have the same sub-partitioning scheme that JdbcRDD has, for which I will eventually create an overload constructor. Basically ShardedJdbcRDD will query your RDBMS shards across the cluster; if you have at least as many Spark slaves as shards, each slave will get one shard for its partition.
A future overloaded constructor will support the same range query that JdbcRDD has so if there are more Spark slaves in the cluster than shards the data can be broken up into smaller sets through range queries.
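For the case raised in the question, where there are fewer workers than shards, the grouping of shards can be pictured with a small plain-Java sketch. This is not taken from the gist: the class name ShardAssignment is hypothetical, and the round-robin rule is just one simple way to spread shards, since in Spark the scheduler decides where each partition actually runs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of spreading shard names over a fixed number of workers.
final class ShardAssignment {

    /** Assigns shard names to worker indexes 0..workers-1 in round-robin order. */
    static Map<Integer, List<String>> assign(List<String> shardNames, int workers) {
        if (workers <= 0) {
            throw new IllegalArgumentException("workers must be positive");
        }
        Map<Integer, List<String>> byWorker = new TreeMap<>();
        for (int i = 0; i < shardNames.size(); i++) {
            byWorker.computeIfAbsent(i % workers, k -> new ArrayList<>()).add(shardNames.get(i));
        }
        return byWorker;
    }
}
```

With at least as many workers as shards, every list holds a single shard name, which matches the one-shard-per-partition behaviour described above.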